
Scaling AI feedback onto Codebase Context: A Primer on RAG Code Reviews

Discover how Baz Reviewer leverages Retrieval-Augmented Generation (RAG) to transform code review. Learn how dynamic retrieval, semantic analysis, and AI-powered collaboration help developers detect redundancy, enforce best practices, and streamline reviews.

Blog
3.10.2025
Shachar Azriel, VP Product
3 min

Hitting the Limits of Large Context Windows

In theory, models with extended context windows should be able to process an entire codebase at once, allowing for deep, context-aware coding recommendations. But in practice, what works well for generating new code falls apart when applied to code review. Over the past 18 months, we’ve hit clear limits that resonate with frustrations shared widely among developers and AI practitioners.

First, there’s irrelevant code overload. Even when we fed models thousands of tokens, they struggled to rank and prioritize which files mattered for a given change. Instead of evaluating a specific relevant dependency, we saw them focus on unrelated top-level imports—distracting noise that diluted their insights. Many in the community have noted that loading an entire codebase into a model often overwhelms it, suggesting a need for smarter, more selective retrieval over brute-force context stuffing.

Second, context fragmentation proved a persistent hurdle. A change in one part of a file might affect files spread across different directories or even different repositories. Static codebases fed into a given context window lacked the ability to dynamically retrieve the missing elements, leading to incomplete evaluation of impact. This limitation aligns with broader critiques that static large language models (LLMs) fail to adaptively connect disparate pieces of a system, a challenge often raised in tech discussions.

Finally, blind spots undermined the feedback we needed. Models analyzing a change in isolation couldn’t recognize that similar logic had already been written elsewhere in the codebase, missing opportunities to prevent redundancies. Developers elsewhere have pointed out that without domain-specific augmentation, LLMs struggle to leverage existing patterns effectively—exactly the gap we encountered.

Over these past 18 months, it became clear that large context windows alone are not enough. We needed a smarter way to retrieve and assemble relevant software flows, a realization shared by those pushing for more dynamic AI approaches in the field. This is why we launched the RAG-powered Baz Reviewer - more details on the product release here.

The Immediate Upsides of a RAG Pipeline

This realization led us to experiment with early Retrieval-Augmented Generation (RAG) pipelines, a shift gaining traction among those exploring AI’s potential in software development. We began working on techniques to retrieve relevant information dynamically rather than relying on a static context window. Instead of feeding the model an arbitrary chunk of code, we broke the codebase into smaller, structured segments—such as functions, classes, and API definitions—which were then converted into vector embeddings using Voyage AI’s code embedding models to capture their semantic meaning. These embeddings are stored in a vector database, allowing for efficient semantic searches during review.
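To make that indexing step concrete, here is a minimal sketch of the idea: split Python files into function- and class-level chunks, embed each chunk, and keep the vectors for later semantic search. The embed_chunk placeholder and the in-memory list are illustrative stand-ins for a real code embedding model and vector database, not our production pipeline.

```python
# Sketch of the indexing step: chunk source files, embed each chunk, store the vectors.
import ast
from dataclasses import dataclass, field


@dataclass
class CodeChunk:
    path: str
    name: str
    source: str
    vector: list[float] = field(default_factory=list)


def extract_chunks(path: str, source: str) -> list[CodeChunk]:
    """Break a Python file into function- and class-level segments."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(CodeChunk(path=path, name=node.name,
                                    source=ast.get_source_segment(source, node) or ""))
    return chunks


def embed_chunk(text: str) -> list[float]:
    """Placeholder embedding; a real pipeline would call a code embedding model here."""
    # Trivial bag-of-characters vector, NOT a semantic embedding: purely illustrative.
    vec = [0.0] * 128
    for ch in text:
        vec[ord(ch) % 128] += 1.0
    return vec


def index_file(path: str, source: str, store: list[CodeChunk]) -> None:
    """Embed every chunk in a file and append it to the in-memory store."""
    for chunk in extract_chunks(path, source):
        chunk.vector = embed_chunk(chunk.source)
        store.append(chunk)
```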

This selective approach, praised by many as a way to fetch only the most relevant snippets, prevented us from overwhelming the model. Today, when a developer submits a change request, Baz Reviewer searches the vector database for semantically related code snippets, retrieving relevant dependencies, prior implementations, and coding patterns. It then compares the new changes against existing functions to detect redundant implementations and recommend reuse instead of duplication. Finally, it enforces consistency by dynamically checking whether the new code aligns with team-specific best practices.
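The review-time flow can be pictured along these lines, reusing the CodeChunk and embed_chunk helpers from the sketch above. The cosine scoring and the 0.9 duplicate threshold are illustrative assumptions rather than Baz Reviewer’s actual retrieval and ranking logic.

```python
# Sketch of the review-time step: embed the changed code, retrieve similar indexed
# chunks, and flag likely duplication above a similarity threshold.
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_similar(changed_code: str, store: list[CodeChunk], top_k: int = 5) -> list[CodeChunk]:
    """Return the top_k indexed chunks most similar to the changed code."""
    query = embed_chunk(changed_code)
    ranked = sorted(store, key=lambda c: cosine(query, c.vector), reverse=True)
    return ranked[:top_k]


def review_change(changed_code: str, store: list[CodeChunk],
                  duplicate_threshold: float = 0.9) -> list[str]:
    """Emit redundancy comments for chunks that closely match the new code."""
    query = embed_chunk(changed_code)
    comments = []
    for chunk in retrieve_similar(changed_code, store):
        score = cosine(query, chunk.vector)
        if score >= duplicate_threshold:
            comments.append(
                f"Similar logic already exists in {chunk.path}::{chunk.name} "
                f"(similarity {score:.2f}); consider reusing it instead of duplicating."
            )
    return comments
```

A comment produced this way points the author at the existing implementation before the duplication lands, which is the reuse-over-duplication behavior described above.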

With this system in place, we finally saw precision in suggestions—not just generic advice, but contextualized, relevant, and actionable insights. This reflects enthusiasm in the community, where the ability to index and retrieve targeted code is seen as enhancing LLM performance, delivering focused representations of software rather than bloated, unfocused outputs. Our initial implementation was straightforward, but its impact was immediate, validating the growing belief that RAG is a game-changer for contextual AI assistance.

Refining Retrieval for Maximum Impact

While our RAG-powered approach dramatically improved precision in code review, it also introduced new challenges—ones that echo ongoing discussions in the AI space. Developers needed greater transparency into why specific snippets were retrieved, and the model occasionally surfaced tangentially related code instead of the most critical dependencies. This aligns with critiques that RAG systems need refinement to ensure relevance and explainability, a challenge we tackled head-on.

Addressing these issues required refining our ranking mechanisms, improving retrieval heuristics, and incorporating interactive feedback loops to make AI-assisted reviews even more intuitive. We leaned into advanced techniques—like multi-stage retrieval and knowledge graph integration, concepts gaining attention elsewhere—to sharpen our system’s focus on critical dependencies. The result was a more transparent and precise review process, meeting the community’s call for systems that don’t just retrieve, but justify their choices.
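As a rough illustration of what a second retrieval stage can add, the sketch below reranks vector-search candidates by combining the semantic score with a cheap structural signal (shared identifiers) and records a human-readable justification for each snippet it keeps. The weights and signals are illustrative assumptions, not our production heuristics; it builds on the helpers from the earlier sketches.

```python
# Sketch of a two-stage pass: broad vector search first, then a rerank that
# mixes in a structural signal and keeps an explanation for each choice.
import re


def shared_identifiers(a: str, b: str) -> int:
    """Count identifiers that appear in both code snippets."""
    ids_a = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", a))
    ids_b = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", b))
    return len(ids_a & ids_b)


def rerank(changed_code: str, candidates: list[CodeChunk],
           top_k: int = 3) -> list[tuple[float, str, CodeChunk]]:
    """Rerank vector-search candidates and attach a human-readable justification."""
    query = embed_chunk(changed_code)
    scored = []
    for chunk in candidates:
        semantic = cosine(query, chunk.vector)
        structural = shared_identifiers(changed_code, chunk.source)
        score = 0.7 * semantic + 0.3 * min(structural / 10.0, 1.0)  # illustrative weights
        reason = (f"{chunk.path}::{chunk.name}: semantic={semantic:.2f}, "
                  f"shared identifiers={structural}")
        scored.append((score, reason, chunk))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]
```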

Looking ahead, our focus is on further optimizing retrieval accuracy, enhancing explainability, and expanding the scope of AI-assisted review workflows. We’re working on multi-stage retrieval pipelines to improve dependency tracking across repositories, dynamic ranking adjustments to surface the most relevant context, and AI-guided refactoring suggestions that go beyond review to actively improve maintainability. These goals reflect a shared optimism that RAG can evolve into a foundation for smarter, workflow-integrated tools—capable of not just reviewing code, but enhancing it proactively.

RAG Reviewer in Action

The past 18 months have shown us that retrieval, not just generation, is the key to AI-powered software intelligence—a sentiment increasingly clear among developers and AI researchers. Large context windows falter under the complexity of real-world code review, as we and others have found, while RAG pipelines offer a dynamic, precise alternative. The ability to retrieve, analyze, and contextualize software flows is what separates generic AI-powered code review from true, workflow-integrated intelligence. With RAG as our foundation, we’re only scratching the surface of what’s possible in AI-driven software development—and the wider conversation suggests we’re part of a broader movement toward that future.

Duplicated TestClient Initialization

  • Code Context: A TestClient instance is initialized separately in multiple test files to interact with a FastAPI application.
  • Finding: The reviewer suggests moving TestClient initialization into a shared pytest fixture (tests/conftest.py), as sketched after this list. This reduces duplication and ensures consistent configuration across tests.
  • Why It’s Awesome: Instead of manually spotting duplicated test setup, Baz suggests a modular approach, leading to better maintainability and DRY (Don’t Repeat Yourself) principles.
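A minimal sketch of the suggested fixture, assuming the FastAPI application is importable from a hypothetical app.main module; the import path is illustrative.

```python
# tests/conftest.py
import pytest
from fastapi.testclient import TestClient

from app.main import app  # hypothetical location of the FastAPI application


@pytest.fixture(scope="session")
def client() -> TestClient:
    """Shared TestClient so individual test files no longer construct their own."""
    return TestClient(app)
```

Individual tests can then accept client as a fixture argument instead of constructing their own TestClient.
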
Duplicate Dataclass Definition
  • Code Context: The PromptPlaceholder dataclass appears in multiple locations, duplicating functionality.
  • Finding: The reviewer detects that this class exists elsewhere and suggests refactoring (see the sketch after this list) by either:
    1. Moving it to a shared location and reusing it.
    2. Making the examples field optional to allow reuse in both cases.
  • Why It’s Awesome: This is an example of how Baz enforces code reuse and structure across the codebase. The tool doesn’t just spot the duplication—it offers actionable suggestions, showing a deep understanding of maintainability and scalability.
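A hedged sketch of the consolidated definition might look like this; only the examples field is named in the finding, so the name field and module path are illustrative assumptions.

```python
# shared/prompts.py (hypothetical shared location)
from dataclasses import dataclass, field


@dataclass
class PromptPlaceholder:
    name: str  # assumed field, for illustration only
    examples: list[str] = field(default_factory=list)  # optional, so call sites without examples can reuse the class
```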

We are shaping the future of code review.

Discover the power of Baz.