
From Review Thread to Team Standard: How We Built AwesomeReviewers

AwesomeReviewers demonstrates how code review discussions can turn into guidelines that capture the team’s collective wisdom. In this post, we’ll share how it works under the hood, and what we learned building it.

Blog
7.8.2025
Guy Eisenkot, CEO & Co-Founder
3 min

Background

The best software teams treat code reviews not just as a quality gate, but as opportunities to share knowledge and improve their architecture. However, much of this hard-earned insight stays buried in pull requests, inaccessible to the rest of the team.

The AwesomeReviewers project aims to change that. It demonstrates how code review discussions can turn into guidelines that capture the team’s collective wisdom. In this post, we’ll share how it works under the hood, and what we learned building it.

Why Review Feedback Gets Lost: The Problem

Developers using AI coding agents now handle tens of PRs, with hundreds of comments, monthly. Many comments suggest improvements that could benefit the entire codebase if abstracted into general rules. Yet, whether they are adopted or not, these valuable inputs often remain siloed, lost once a PR is merged. New team members may repeat mistakes because prior reviews aren’t readily accessible as guidance. The challenge is how to capture and reuse code review knowledge systematically. We needed an automated solution that combs through review threads and extracts reusable best practices – essentially building a living coding standard from real discussions and decisions made on the fly.

The core idea behind AwesomeReviewers is to treat code reviews as a data source for continual learning. If multiple reviews highlight, say, SQL injection risks or error handling issues, the system should detect that pattern and formulate a guideline (enforced by an AI “reviewer”) around it. To ensure these reviewers are truly useful, the system evaluates each discussion on criteria like generalizability and actionability. It intentionally filters out purely context-specific nitpicks or trivial style comments. The goal is to identify review insights that extend beyond the specific code instance and capture patterns or rules of broader relevance. In other words, AwesomeReviewers looks for discussions that should become part of the team’s documented standards.

By solving this, we aim to close the loop between code review and knowledge sharing. Developers get a curated set of reviewer guidelines distilled from past PRs, reinforcing best practices and speeding up onboarding. Now let’s dive into how the system is designed to achieve this.

Architecture and System Design

We built AwesomeReviewers with a simple principle: Code reviews are a dataset. Use them.

If the same kinds of feedback appear across multiple PRs, whether about performance regressions, test coverage, or misuse of types, we should extract and formalize those patterns into reviewer rules.

That’s what AwesomeReviewers does:

  • It scans your repo’s review history
  • It filters for technically meaningful, generalizable feedback
  • It groups similar discussions
  • It distills them into AI-ready guidelines that can be used by tools like Baz, Cursor, or any custom agent

Awesome Architecture

AwesomeReviewers is a distributed pipeline with three main stages: Ingestion, Processing, and Publishing. Let’s walk through each step.

Ingestion

At a high level, AwesomeReviewers comprises a serverless ingestion layer and an AI processing backend, connected by a queue for reliability and scaling. Anyone can have it analyze a repository by submitting a simple request. This call invokes an AWS Lambda function, which enqueues the job and immediately returns. The job is placed onto an Amazon SQS request queue with the repository name as payload.
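
As a rough illustration, the Lambda handler can be little more than a parse-and-enqueue function. The sketch below assumes boto3 and a hypothetical QUEUE_URL environment variable; it is not the exact production code:

```python
import json
import os

import boto3

sqs = boto3.client("sqs")


def handler(event, context):
    # Pull the repository name out of the incoming request body.
    body = json.loads(event.get("body") or "{}")
    repository = body.get("repository")
    if not repository:
        return {"statusCode": 400, "body": "missing 'repository'"}

    # Enqueue the job and return immediately; the backend does the heavy lifting.
    sqs.send_message(
        QueueUrl=os.environ["QUEUE_URL"],
        MessageBody=json.dumps({"repository": repository}),
    )
    return {"statusCode": 202, "body": json.dumps({"queued": repository})}
```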

Running continuously in the background is the AwesomeReviewers Service, a FastAPI application deployed on Kubernetes. On startup, the service spins up a background task that listens to the request queue. Using long-polling, it waits for new repository jobs and fetches messages one by one. Each message triggers the core processing pipeline, and thanks to the queue, we can scale horizontally or throttle consumption as needed. This design decouples the user request from the heavy lifting – the user isn’t left waiting on a webpage for minutes while analysis runs.
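
A minimal sketch of that consumer loop, assuming boto3, a hypothetical QUEUE_URL variable, and a placeholder process_repository coroutine standing in for the pipeline described below:

```python
import asyncio
import json
import os
from contextlib import asynccontextmanager

import boto3
from fastapi import FastAPI

sqs = boto3.client("sqs")
QUEUE_URL = os.environ.get("QUEUE_URL", "")


async def process_repository(repository: str) -> None:
    """Placeholder for the processing pipeline described in the next section."""


async def consume_jobs() -> None:
    while True:
        # Long-poll: wait up to 20 seconds for a message instead of busy-looping.
        response = await asyncio.to_thread(
            sqs.receive_message,
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,
        )
        for message in response.get("Messages", []):
            job = json.loads(message["Body"])
            await process_repository(job["repository"])
            # Only delete the message once the repository has been processed.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])


@asynccontextmanager
async def lifespan(app: FastAPI):
    consumer = asyncio.create_task(consume_jobs())
    yield
    consumer.cancel()


app = FastAPI(lifespan=lifespan)
```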

Processing

The processing service integrates with several key technologies. It uses the GitHub API to fetch data (through a GitHub App installation token for appropriate permissions), AWS S3 for caching intermediate results, and large language models for analysis and generation. We leverage OpenAI/Anthropic models via a LangChain-based workflow to perform the code review analysis. 

The use of models is central: we treat them as sophisticated interpreters of review text, capable of classification and summarization that would be hard to achieve with manual rules. The service’s code is organized into an orchestrated workflow that chains multiple model prompts (for preprocessing, labeling, and guideline generation). This approach made it easier to debug and adjust each step independently. 

To make the system robust, we implemented caching and retries. All PR comments for a repo are cached in S3 so that if we re-run the process (or it crashes and restarts), we don’t hit the GitHub API again unnecessarily. Similarly, once we generate reviewers for a repo, those results can be cached short-term – ensuring that minor tweaks or view requests don’t recompute everything from scratch.
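
The caching pattern itself is simple. Here is a sketch of a "load from S3 or fetch once" helper, with an illustrative bucket name and key layout rather than the real ones:

```python
import json

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "awesome-reviewers-cache"  # illustrative bucket name


def load_or_fetch_comments(repository: str, fetch_from_github) -> list[dict]:
    key = f"comments/{repository}.json"
    try:
        # Cache hit: reuse the raw PR comments from a previous (or crashed) run.
        cached = s3.get_object(Bucket=BUCKET, Key=key)
        return json.loads(cached["Body"].read())
    except ClientError as err:
        if err.response["Error"]["Code"] != "NoSuchKey":
            raise
    # Cache miss: hit the GitHub API once, then persist the result for next time.
    comments = fetch_from_github(repository)
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(comments).encode("utf-8"))
    return comments
```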

Publishing

Finally, the output of the pipeline is published to a static site. We maintain a separate Git repository (e.g. baz-scm/awesome-reviewers) that serves as the content for awesomereviewers.com. Each guideline (“reviewer”) becomes a Markdown file (with YAML front-matter metadata) plus a JSON file with structured data for the website. Once committed, our CloudFront-backed site picks up the changes (the repository is the source for a GitHub Pages static site). This way, the end-to-end flow is automated: from a single request, to a background analysis, to published web content.

The stack at a glance

  • Frontend: Static site powered by GitHub Pages and CloudFront
  • Content Format: Markdown + JSON (with YAML front-matter)
  • Queueing: Amazon SQS for job scheduling and decoupling
  • Ingestion: AWS Lambda for lightweight job submission
  • Backend Service: FastAPI app running in Kubernetes (EKS)
  • Data Fetching: GitHub API (via GitHub App installation token)
  • Caching: AWS S3 for raw PR comments and intermediate results
  • LLM Orchestration: LangChain for structured multi-step prompts
  • LLMs: Anthropic Claude (classification + guideline generation)
  • Validation: Pydantic schemas to enforce structured outputs
  • GitHub Integration: PyGitHub for file commits and branch updates
  • Deployment: Auto-publish via commit to GitHub Pages-backed repo

Reviewer Selection and Extraction

Let’s walk through what happens when the AwesomeReviewers service processes a repository, step by step:

1. Fetching Code Review Data: Given a repository name, the service first pulls all relevant code review discussions from GitHub. We define a discussion as a thread of comments on a specific PR, typically attached to a code diff (GitHub’s “review threads”). Once comments are in hand, we group them by discussion thread (so all comments in one code review conversation are aggregated). 
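
With PyGitHub, fetching and grouping looks roughly like the sketch below; authentication, pagination, and caching are simplified, and keying threads by in_reply_to_id is our shorthand for GitHub's review threads:

```python
from collections import defaultdict

from github import Github

gh = Github("<installation-token>")  # token minted from the GitHub App installation


def fetch_discussions(repo_name: str) -> dict[int, list]:
    """Group PR review comments into threads keyed by the root comment id."""
    repo = gh.get_repo(repo_name)
    threads: dict[int, list] = defaultdict(list)
    for pull in repo.get_pulls(state="closed"):
        for comment in pull.get_review_comments():
            # Replies carry in_reply_to_id; root comments anchor their own thread.
            root_id = comment.in_reply_to_id or comment.id
            threads[root_id].append(comment)
    return dict(threads)
```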

2. Preprocessing and Filtering: Not every review comment is worth turning into a guideline. Many might be too specific (“rename this variable”) or simply praise (“LGTM”). The system’s first AI-driven step is a discussion evaluation filter. For each discussion, we prompt a model (Anthropic’s Claude in a fast, low-cost mode) to judge it on four criteria: generalizability, technical substance, clarity, and actionability. The prompt provides the code snippet and the full comment thread, then asks: does this discussion capture a general coding principle? Is it about a real technical concept or just superficial? Is the feedback clearly stated? And does it suggest a concrete action or improvement? The model returns a structured verdict (True/False for each criterion) along with a short rationale. An example: a comment pointing out a missing cache invalidation would score True on all counts (it’s a broadly applicable performance bug, clearly described with a recommended action).
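
A sketch of that filter using a Pydantic schema and LangChain's structured output support; the prompt wording, field names, and model choice here are illustrative rather than the production ones:

```python
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field


class DiscussionVerdict(BaseModel):
    generalizable: bool = Field(description="Captures a general coding principle")
    technical: bool = Field(description="About a real technical concept, not superficial")
    clear: bool = Field(description="The feedback is clearly stated")
    actionable: bool = Field(description="Suggests a concrete action or improvement")
    rationale: str = Field(description="Short justification for the verdict")


# A fast, low-cost model is enough for this yes/no style judgment.
judge = ChatAnthropic(model="claude-3-5-haiku-latest", temperature=0).with_structured_output(
    DiscussionVerdict
)


def evaluate_discussion(code_snippet: str, thread: str) -> DiscussionVerdict:
    prompt = (
        "You are evaluating a code review discussion.\n\n"
        f"Code under review:\n{code_snippet}\n\n"
        f"Discussion thread:\n{thread}\n\n"
        "Judge whether it is generalizable, technically substantive, clearly stated, "
        "and actionable, and explain your reasoning briefly."
    )
    return judge.invoke(prompt)


def keep(verdict: DiscussionVerdict) -> bool:
    # Only discussions passing all four criteria move on to labeling.
    return all([verdict.generalizable, verdict.technical, verdict.clear, verdict.actionable])
```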

3. Discussion Labeling: Next, the remaining discussions are categorized by topic. We built a custom set of discussion labels covering common domains in code reviews: Security, Performance, Observability, Error Handling, Documentation, Coding Style, Null Handling, etc. Using another model prompt, we ask: “Given the discussion and the code, which category best fits the core issue being discussed?” The model must choose exactly one label per discussion, ignoring language-specific details and focusing on the core technical topic. The result is that each discussion now has a tag like SECURITY, PERFORMANCE, etc. We then bucket the discussions by label and by programming language. Grouping by language is important because the next step will generate language-specific guidance – we wouldn’t want to mix Python-specific feedback with, say, C++ discussions under the same rule. At this point, we might have, for example, 5 “Security” discussions in a JavaScript codebase and 3 “Performance” discussions in that same repo’s C++ component, each group to be processed separately.
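
The labeling step follows the same structured-output pattern, constrained to a single label. The label set below is the abridged list from this post, and the model name is again illustrative:

```python
from typing import Literal

from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel

# Abridged label set; the real taxonomy has more categories.
Label = Literal[
    "SECURITY",
    "PERFORMANCE",
    "OBSERVABILITY",
    "ERROR_HANDLING",
    "DOCUMENTATION",
    "CODING_STYLE",
    "NULL_HANDLING",
]


class DiscussionLabel(BaseModel):
    label: Label


labeler = ChatAnthropic(model="claude-3-5-haiku-latest", temperature=0).with_structured_output(
    DiscussionLabel
)


def label_discussion(code_snippet: str, thread: str) -> str:
    prompt = (
        "Given the discussion and the code, which category best fits the core issue "
        "being discussed? Choose exactly one label and ignore language-specific details.\n\n"
        f"Code:\n{code_snippet}\n\nDiscussion:\n{thread}"
    )
    return labeler.invoke(prompt).label
```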

4. Deriving “Reviewer” Guidelines: This is the heart of the system – using a model to synthesize each group of discussions into a single guideline (which we call a “reviewer”). For each category group (e.g. all Security discussions in Python files), we construct a prompt feeding in the list of discussions and the category context. The prompt instructs the AI to derive one comprehensive code reviewer for that category. It explicitly describes a step-by-step reasoning process: analyze each discussion (and even quote key points), identify common themes, draft candidate guidelines, and then converge on one final guideline that addresses the most important aspects across all these inputs. Essentially, the model is performing a form of clustering and summarization: merging multiple related review comments into a single generalized best practice.

We use Anthropic’s Claude model for this generation, in a high-complexity mode that allows longer output and “thinking” steps (we enabled chain-of-thought in the prompt). We chose Claude for its ability to handle long prompts (some discussions can be lengthy) and structured output. The workflow code creates a structured output schema and parses the model’s answer into a Pydantic object, validating that all required fields are present. If the model’s output fails validation (which can happen if the AI’s answer doesn’t match the expected JSON schema), we log an error and skip that would-be reviewer. We found this schema-guided approach essential – it acts as a guardrail against hallucinations and format errors. We also apply post-processing heuristics: if the model forgot to include any reference discussions, we discard that result (we want every guideline to be grounded in at least one real code review).
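
Conceptually, the generation step with its schema guardrail looks like the sketch below; the Reviewer fields, prompt, and model name are assumptions for illustration, not the exact production schema:

```python
import logging

from langchain_anthropic import ChatAnthropic
from langchain_core.exceptions import OutputParserException
from pydantic import BaseModel, Field, ValidationError

logger = logging.getLogger(__name__)


class Reviewer(BaseModel):
    title: str
    description: str = Field(description="The distilled guideline, in Markdown")
    referenced_discussions: list[int] = Field(description="Indices of the source discussions")


# Long-context model with a generous output budget for the synthesis step.
generator = ChatAnthropic(model="claude-3-7-sonnet-latest", max_tokens=8000).with_structured_output(
    Reviewer
)


def derive_reviewer(category: str, language: str, discussions: list[str]) -> Reviewer | None:
    numbered = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(discussions))
    prompt = (
        f"Derive one comprehensive code reviewer for the {category} category in {language}. "
        "Analyze each discussion and quote its key points, identify common themes, draft "
        "candidate guidelines, then converge on one final guideline that addresses the most "
        f"important aspects across all inputs.\n\nDiscussions:\n{numbered}"
    )
    try:
        reviewer = generator.invoke(prompt)
    except (OutputParserException, ValidationError) as err:
        # Output didn't match the schema: log and skip this would-be reviewer.
        logger.error("Reviewer generation failed schema validation: %s", err)
        return None
    if not reviewer.referenced_discussions:
        # Every guideline must be grounded in at least one real code review.
        return None
    return reviewer
```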

5. Enriching and Formatting Results: The raw output from the model includes references to discussion indices and perhaps implicitly the files involved. We enrich this data with a bit of static analysis. One nifty addition: for each guideline, we compute a set of file path patterns (glob patterns) indicating where this rule applies in the codebase. We extract the file paths (modules) from the referenced discussions and find common directories or file types. For example, if many referenced comments came from files under src/db/, the system might suggest a pattern like src/db/**/*.py for that reviewer. 
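
A simple heuristic for that pattern derivation might look like this sketch, which is not the exact production logic:

```python
import os


def suggest_glob_pattern(file_paths: list[str]) -> str:
    """Derive a single glob pattern covering the files referenced by a guideline."""
    if not file_paths:
        return "**/*"
    # If every referenced file shares an extension, scope the pattern to it.
    extensions = {os.path.splitext(path)[1] for path in file_paths}
    suffix = f"*{extensions.pop()}" if len(extensions) == 1 else "*"
    # Use the deepest directory common to all referenced files.
    common = os.path.commonpath(file_paths) if len(file_paths) > 1 else os.path.dirname(file_paths[0])
    return f"{common}/**/{suffix}" if common else f"**/{suffix}"


# Example: ["src/db/session.py", "src/db/models/user.py"] -> "src/db/**/*.py"
```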

Now we format each reviewer guideline for publishing. We create two files per reviewer: a Markdown file and a JSON file. The Markdown contains a YAML front-matter header (with fields like title, description, original repository name, label/category, language, number of comments aggregated, etc.) and then the full description in Markdown format. This front-matter allows the static site to render a nice page for the guideline, showing the title and metadata in a consistent way. The JSON file contains the structured data, including the list of underlying discussions with cleaned-up code snippets. We include the JSON mostly for transparency and potential future use (for example, we could allow readers to expand and see the actual review comments that led to a guideline).
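
Rendering one reviewer into its Markdown and JSON pair can be sketched as follows; the front-matter keys mirror the fields listed above, but the exact names are assumptions:

```python
import json

import yaml  # PyYAML


def render_reviewer(reviewer: dict, repository: str, language: str) -> tuple[str, str]:
    """Return the (Markdown, JSON) pair for one reviewer guideline."""
    front_matter = {
        "title": reviewer["title"],
        "description": reviewer["summary"],
        "repository": repository,
        "label": reviewer["label"],
        "language": language,
        "comments_count": len(reviewer["discussions"]),
    }
    # Markdown: YAML front-matter header followed by the full guideline body.
    markdown = f"---\n{yaml.safe_dump(front_matter, sort_keys=False)}---\n\n{reviewer['body']}\n"
    # JSON: the same metadata plus the underlying discussions for transparency.
    structured = json.dumps({**front_matter, "discussions": reviewer["discussions"]}, indent=2)
    return markdown, structured
```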

6. Publishing to GitHub and Cleanup: Finally, the service uses a GitHub App client to commit these files to the AwesomeReviewers content repo. All new reviewer files for the repository are added in one commit, to the _reviewers directory of that repo. We compose a commit message like “Add 3 awesome reviewers for owner/repo” to summarize the change. Using the GitHub API, the service creates new file blobs and a tree, then a commit referencing the latest branch head, and updates the branch ref to this new commit. This process is done via the PyGitHub library and is wrapped in error handling to retry if GitHub returns an error. By committing to the main branch of the static site repo, our changes go live almost immediately (the site is configured to auto-deploy from that repo). The user can then navigate to awesomereviewers.com and find a page listing the new guidelines distilled for their repository.
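
With PyGitHub, that blob/tree/commit/ref sequence looks roughly like the sketch below; retry logic is omitted, and the path and branch names simply follow the description above:

```python
from github import Github, InputGitTreeElement


def publish_reviewers(token: str, content_repo: str, files: dict[str, str], source_repo: str) -> None:
    """Commit all generated reviewer files to the content repo in a single commit."""
    repo = Github(token).get_repo(content_repo)
    head = repo.get_git_ref("heads/main")
    base_commit = repo.get_git_commit(head.object.sha)

    # One blob per generated file, all placed under the _reviewers directory.
    elements = [
        InputGitTreeElement(
            path=f"_reviewers/{name}",
            mode="100644",
            type="blob",
            sha=repo.create_git_blob(content, "utf-8").sha,
        )
        for name, content in files.items()
    ]
    tree = repo.create_git_tree(elements, base_tree=base_commit.tree)

    reviewer_count = sum(1 for name in files if name.endswith(".md"))
    commit = repo.create_git_commit(
        f"Add {reviewer_count} awesome reviewers for {source_repo}",
        tree,
        [base_commit],
    )
    # Fast-forward the branch ref; the static site auto-deploys from main.
    head.edit(commit.sha)
```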

AwesomeReviewers demonstrates how we can turn raw human feedback into polished, shareable knowledge. It’s a pipeline combining data ingestion, caching, multiple model inference steps, and developer tools integration. We have quite a few more ideas on how to make it even more awesome. Stay tuned.

Visit the repo at https://github.com/baz-scm/awesome-reviewers 
