Accuracy in an agentic code reviewer emerges when four engineering systems work together reliably: dependable inputs, a decomposed and auditable architecture, constrained and verifiable reasoning, and a measurement loop that privileges developer value. At Baz we rewired the reviewer from a demo into a production system by treating retrieval, decomposition, extraction, and evaluation as first-class engineering problems. Below I describe what we changed, why those changes mattered, and where we placed Opus 4.6 and Codex 5.3 so the models amplified engineering discipline.
Inputs first: the reviewer needs correct, traceable code context
Our earliest production failures were operational. Reviewer runs disappeared because the file service entered an out-of-memory state when callers requested whole git archives that the service buffered. To users this looked like the reviewer had simply stopped delivering comments. Treating repository access as a runtime SLA rather than a convenience changed the trajectory. We rewrote heavy endpoints to stream with backpressure semantics, introduced concurrency controls so a single archive request could not destabilize a reviewer pool, and converted invisible failures into traceable incidents. With those runtime guarantees in place the reviewer’s coverage became measurable, and the team could stop guessing whether a missing comment reflected a model mistake or an infrastructure gap.
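The streaming-with-backpressure idea can be sketched in a few lines. This is an illustrative example, not Baz's implementation: the chunk size, semaphore limit, and function names are assumptions.

```python
import asyncio

# A per-pool semaphore caps concurrent heavy fetches so one archive
# request cannot exhaust reviewer memory (limit is illustrative).
MAX_CONCURRENT_FETCHES = 4
_fetch_slots = asyncio.Semaphore(MAX_CONCURRENT_FETCHES)

async def stream_file(path: str, chunk_size: int = 64 * 1024):
    """Yield file contents in bounded chunks instead of buffering the whole blob."""
    async with _fetch_slots:              # backpressure: wait for a free slot
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk               # caller consumes at its own pace
```

The key property is that memory use is bounded by `chunk_size * MAX_CONCURRENT_FETCHES` rather than by the size of the largest archive a caller happens to request.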
Alongside operational fixes we created canonical, minimal inputs for agents. Instead of feeding a model raw repository blobs we exposed deterministic endpoints that return file-by-path contents, hunks near a changed line, precomputed diffs, and element-level lookups. Reviewer jobs became first-class objects carrying the URL, commit SHAs, repository metadata and trace information so a missing comment can be followed through scheduling, fetch, subagent runs and model invocations. The practical benefit was immediate: when the retrieval layer delivered validated hunks the downstream reasoning became stable and measurement became meaningful.
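A minimal sketch of the ReviewerJob contract described above, assuming illustrative field names rather than Baz's actual schema:

```python
from dataclasses import dataclass

# Hypothetical ReviewerJob: a first-class object carrying everything
# needed to trace a missing comment through the pipeline.
@dataclass(frozen=True)
class ReviewerJob:
    job_id: str
    repo: str          # repository metadata (name or URL)
    base_sha: str      # commit range under review
    head_sha: str
    trace_id: str      # follows the job through scheduling, fetch,
                       # subagent runs, and model invocations

    def trace_key(self) -> str:
        """Stable key for correlating logs across pipeline stages."""
        return f"{self.repo}@{self.head_sha}:{self.job_id}"
```

Because the job is frozen, every stage logs against the same immutable identity, which is what makes a missing comment followable end to end.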
Placement of models in this flow is intentional. Codex 5.3 converts repository blobs into compact, validated artifacts: AST snippets, normalized diffs, and precise file offsets. Codex’s code-native behavior and syntactic fidelity make it ideal for producing the small, verified inputs that the orchestration tier consumes. Opus 4.6 runs orchestration and policy decisions, deciding which hunks to inspect, when to retry a failing fetch, and how to route work under degraded conditions. Codex handles the code-local parsing and validation; Opus handles intent and reliability.
Decompose the reviewer: meta agent, repo skills, and narrow subagents
A single agent attempting to do everything produced brittle behavior and opaque failure modes. Replacing that monolith with a meta agent, repo-local skills and narrow subagents changed how the reviewer behaved and how we operated it.
The meta agent serves as the orchestration and policy tier. It aggregates multi-file context, applies repo-level policy and constructs the plan for a review job. Opus 4.6 performs meta-agent work because its long-form reasoning and context management create stable structured outputs that downstream graders can validate. Repo skills are versioned documents stored with the code that encode architecture facts, conventions and what this team wants the reviewer to flag. Shipping skills for our core codebases produced an immediate behavioral change: the reviewer followed explicit constraints in the repo when deciding what to flag, and the grader and the reviewer read from the same source of truth.
Subagents are narrow and single-purpose. Mapping agents find affected call paths and entry points. API-contract agents validate signature changes and usage. Comment-resolution agents decide whether a discussion was addressed by the new commit. An outdated-comment workflow handles stale commentary. Narrow responsibilities made subagents easy to unit test, instrument, and roll back. Our refactor of addressed-comment logic into a clearer outdated-comment workflow reduced stale or misattributed findings in PRs and produced far stronger attribution: when a subagent reported a problem it was clear which component produced it and why.
That architectural split produced practical advantages. Opus 4.6 enforces policy and fuses typed outputs; Codex 5.3 performs AST-level checks, precise localization and patch generation inside subagents. The result was lower variance in outputs, fewer false positives from vague reasoning, and a pipeline that is auditable and rollable at subagent granularity.
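The narrow-subagent pattern can be sketched as a small typed interface. The agent, check, and field names here are hypothetical stand-ins for the real components:

```python
from dataclasses import dataclass
from typing import Protocol

# Each finding carries the subagent that produced it, so attribution
# is explicit rather than inferred after the fact.
@dataclass
class Finding:
    source: str        # subagent name, for attribution and rollback
    hunk_index: int
    message: str
    confidence: float

class Subagent(Protocol):
    name: str
    def run(self, hunks: list[str]) -> list[Finding]: ...

class ApiContractAgent:
    """Toy API-contract check standing in for real signature validation."""
    name = "api-contract"

    def run(self, hunks: list[str]) -> list[Finding]:
        findings = []
        for i, hunk in enumerate(hunks):
            if "def " in hunk and "->" not in hunk:
                findings.append(Finding(self.name, i,
                                        "function lacks a return annotation",
                                        confidence=0.6))
        return findings
```

Because each subagent satisfies the same small interface, it can be unit tested, instrumented, and rolled back independently of the rest of the pipeline.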
Constrain reasoning: analysis then extraction, and preserve local signal
Early noise in our pipeline came from two sources. Large, unstructured inputs caused models to guess, producing hallucinations. Unconstrained outputs made automated judges and UIs disagree about whether comments were useful. The response combined a two-stage reasoning pattern with deterministic context tooling to preserve the true signal.
In the two-stage pattern the agent first creates a free-form analysis. This verbose internal artifact helps debugging and traceability. A second, schema-constrained extraction step then produces a typed result with fields such as analysis text, an addressed boolean, a structured explanation, a confidence score, and a precise location. The extraction enforces types and returns confidence metrics; low-confidence extractions are routed to human adjudication rather than becoming ambiguous production comments. The system’s addressed-comment implementation demonstrates how schema extraction reduces ambiguity and makes judgeable outputs possible.
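A hedged sketch of such a typed extraction result; the exact fields and validation rules are assumptions modeled on the description above:

```python
from dataclasses import dataclass

# Illustrative schema for the second-stage extraction. Enforcing types
# and ranges here is what turns a free-form analysis into a judgeable claim.
@dataclass
class Extraction:
    analysis: str                 # free-form first-stage artifact
    addressed: bool               # was the discussion addressed?
    explanation: str              # structured reasoning for the verdict
    confidence: float             # 0.0 - 1.0, reported by the extractor
    location: tuple[str, int]     # (file path, line)

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
```

An extraction that fails validation never reaches production; it is exactly the kind of ambiguous output the two-stage pattern is designed to filter out.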
We preserved local signal through deterministic tools and proximity-based truncation. The system selects hunks adjacent to the changed lines, includes a window of surrounding code, and trims thread conversation after code context because empirical evidence shows bug signals are usually local to modifications. Agents call validated tools that return hunks, AST nodes, or element descriptors so models never need to guess where the code lives. This approach reduces token waste and prevents spurious associations across unrelated files.
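Proximity-based truncation can be sketched as follows; the radius and the elision marker are illustrative defaults, not production values:

```python
def proximity_window(lines: list[str], changed: set[int], radius: int = 3) -> list[str]:
    """Keep only lines within `radius` of a changed line; elide the rest.

    Bug signal is usually local to the modification, so distant code
    is dropped to save tokens and avoid spurious associations.
    """
    keep = {i for c in changed for i in range(c - radius, c + radius + 1)}
    out, eliding = [], False
    for i, line in enumerate(lines):
        if i in keep:
            out.append(line)
            eliding = False
        elif not eliding:
            out.append("...")      # one marker per elided run
            eliding = True
    return out
```

The deterministic tools mentioned above then resolve anything the model still needs (hunks, AST nodes, element descriptors), so the model never has to guess across an elision.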
Model roles in this constrained flow are purposeful. Opus 4.6 runs analysis and the extraction stage because its long-context reasoning and calibrated outputs produce reliable structured claims. Codex 5.3 performs verification: confirming that a localization is syntactically correct, generating minimal patches, and running quick static or semantic checks that validate a candidate. Together they convert a fuzzy analysis into an auditable, testable claim. Operationally we log model identity, token usage, and extraction confidence for every comment; conservative behavior routes low-confidence results to humans or records analysis without posting a comment, while model fallbacks provide resilience when primary extraction is uncertain.
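The conservative routing rule reduces to a small decision function. The threshold and destination names here are assumptions for illustration:

```python
# Hypothetical threshold; the real value would be tuned against labels.
CONFIDENCE_THRESHOLD = 0.8

def route(extraction_confidence: float, fallback_available: bool) -> str:
    """Route an extraction: post, retry with a fallback model, or escalate."""
    if extraction_confidence >= CONFIDENCE_THRESHOLD:
        return "post_comment"
    if fallback_available:
        return "retry_with_fallback_model"
    return "human_adjudication"   # record the analysis, do not post
```

The important property is the default: when neither confidence nor a fallback is available, the system records the analysis rather than posting an ambiguous comment.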
Measure and iterate: labels, memory, and a living benchmark discipline
Improving accuracy at scale depends on a tightly coupled measurement loop that combines cheap labels, reviewer memory, and a benchmark practice that connects offline experiments to developer behavior.
We started by collecting low-friction labels: engineers replied to Baz comments with a simple good/bad marker. That tiny labeling cost delivered immediate supervision that prioritized work and allowed the team to focus on the errors that mattered. Labels are stored alongside metadata so each judgment can be correlated with the model, the subagent and the repo skill that produced the comment. Those correlations direct remediation toward the correct component rather than firing broad, unfocused changes.
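A minimal sketch of a label record and the attribution query it enables; field names are hypothetical:

```python
from collections import Counter
from dataclasses import dataclass

# Each good/bad judgment is stored with enough metadata to attribute
# it to the model, subagent, and repo skill that produced the comment.
@dataclass
class Label:
    comment_id: str
    verdict: str            # "good" or "bad"
    model: str
    subagent: str
    repo_skill_version: str

def remediation_target(labels: list[Label]) -> str:
    """Point remediation at the subagent drawing the most 'bad' labels."""
    bad = Counter(l.subagent for l in labels if l.verdict == "bad")
    return bad.most_common(1)[0][0]
```

This is what "direct remediation toward the correct component" means in practice: the labels answer which subagent, model, or skill version to fix first, instead of prompting broad, unfocused changes.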
Reviewer memory and reflection make those improvements durable. The reviewer keeps summaries of patterns developers accept or reject; memory suppresses repeated low-value comments and amplifies high-value ones. Periodic reflection summarizes recent acceptances, decision heuristics and updates memory so the system evolves as developer preferences shift.
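The suppress/amplify behavior can be sketched with a simple counter; the suppression threshold and reset rule are illustrative assumptions:

```python
from collections import defaultdict

class ReviewerMemory:
    """Suppress comment patterns developers repeatedly reject."""

    def __init__(self, suppress_after: int = 3):
        self.rejections = defaultdict(int)
        self.suppress_after = suppress_after

    def record(self, pattern: str, accepted: bool):
        if accepted:
            self.rejections[pattern] = 0   # acceptance resets suppression
        else:
            self.rejections[pattern] += 1

    def should_post(self, pattern: str) -> bool:
        return self.rejections[pattern] < self.suppress_after
```

Periodic reflection would then summarize these counts into updated heuristics, so the memory tracks developer preferences as they shift rather than freezing an early snapshot.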
A living benchmark discipline ensures we improve against signals that map to real developer value. Offline tests feed identical inputs to models and tools for controlled comparisons, isolating model capability from harness engineering. Online evaluation measures developer actions in the wild; fixing a flagged issue is strong evidence the flag had value. By running both views we close the gap that appears when systems optimize for brittle test sets that do not reflect practical utility. To raise recall beyond human-only gold sets we over-generate candidate bugs with models and filter them through judges and human review. We trace production bugs back to their origin commits to build held-out validation sets of issues that were discovered in production but missed in review, and we perform adversarial validation when multiple strong tools agree on a finding absent from the gold set. This process expands the benchmark as model capability grows rather than letting it cap evaluation at human recall.
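The adversarial-validation trigger above can be sketched as a set operation; the agreement threshold is an assumption:

```python
from collections import Counter

def adversarial_candidates(tool_findings: dict[str, set[str]],
                           gold: set[str],
                           min_agree: int = 2) -> set[str]:
    """Findings that several strong tools agree on but the gold set
    misses become candidates for human review and benchmark expansion."""
    counts = Counter(f for findings in tool_findings.values() for f in findings)
    return {f for f, n in counts.items() if n >= min_agree and f not in gold}
```

Candidates surfaced this way are exactly how the benchmark grows past human recall: each confirmed one becomes a new gold entry rather than an uncounted miss.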
Conditioning review and grading on repo-local specs is the final alignment. Agents.md and similar repo specs state what a team cares about; the reviewer and the grader use the same spec so precision and recall answer the question “does the tool do what this team asked?” This conditioning prevents mismatches between tool output and the developer’s intent.
The production picture and typical failure modes
In production a ReviewerJob begins with the commit SHAs, repo name, and job ID. Codex 5.3 converts repository content into validated artifacts—hunks, AST snippets and precise localizations. Opus 4.6 orchestrates subagents, supplies the repo skill, and runs the analysis stage. Subagents create free-form analysis; Opus performs schema extraction and assigns a confidence score. Codex validates any suggested localization or patch and runs quick syntactic checks. Low-confidence extractions route to human review, while all artifacts and signals are logged with job metadata so labels and downstream behavior feed reviewer memory and the benchmark.
Common failure modes teach operational rules. Missing inputs cause failures, so ReviewerJob tracing is essential to separate infra from model issues. Truncation that removes signal requires proximity-based heuristics and deterministic tools to ensure the important code is preserved. Low-confidence extractions require conservative behavior: hold posting, route to humans, or record analysis for later adjudication. Gold set incompleteness demands continuous expansion through model-assisted generation, production-trace validations and adversarial validation campaigns rather than freezing the set.
Models multiply engineering discipline
Models magnify whatever the harness supplies. Opus 4.6 and Codex 5.3 delivered the best returns because we placed them where they match the work. Opus performs orchestration, long-form analysis, judge calibration and schema extraction; Codex performs AST-level checks, precise localization and patch verification. Those roles amplify engineering discipline rather than masking gaps in retrieval, decomposition, or evaluation. The engineering rules we enforced—streaming and deterministic retrieval, repo-local skills, two-stage constrained reasoning, robust labeling and memory, and a living benchmark—are what convert model capability into reproducible, production-grade accuracy.
These practices condense into a one-page operational runbook: the ReviewerJob contract, the extraction schema, token-aware truncation defaults, routing and fallback rules between Opus and Codex, confidence thresholds, a minimal label and memory schema, and the adversarial validation triggers run on each model release. That runbook is the artifact new engineers read to turn model capability into repeatable reviewer accuracy.