
Baz #1 in Precision in independent Code Review Bench

Feb 26, 2026
Guy Eisenkot
Today, Martian is launching Code Review Bench, an independent benchmark that measures how effective automated code review tools are in practice. Baz ranks #1 in precision in the current results, with strong overall performance as the benchmark evolves.

[Chart: Online F0.5 Score (Feb-26-2026)]

Precision is the metric we optimize for. If a reviewer is noisy, teams ignore it. If it is high-signal, it becomes part of the workflow.
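The F0.5 score shown in the chart above is the standard weighted harmonic mean of precision and recall with beta = 0.5, which weights precision twice as heavily as recall. A minimal sketch of the formula (the specific numbers below are illustrative, not benchmark results):

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 favors precision; at beta = 0.5, precision counts
    twice as much as recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A high-precision, lower-recall reviewer outscores its mirror image:
print(round(f_beta(0.9, 0.5), 3))  # 0.776
print(round(f_beta(0.5, 0.9), 3))  # 0.549
```

This is why F0.5, rather than F1, is a natural fit for code review: swapping precision and recall changes the score, and the precision-heavy tool wins.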

What Code Review Bench is measuring

The benchmark exists because code review has an evaluation problem. It's increasingly important but harder and harder to measure, because models and coding form factors are evolving so quickly.

Many previous “benchmarks” in this category are too small to generalize, too static to stay clean, and too tied to a single vendor’s scoring choices. Martian’s approach is to treat benchmarking as a living system, not a one-off report.

At a high level, the benchmark combines two ideas:

  • Controlled measurement (offline): run tools on the same pull requests and score them against a curated set of issues.
  • Behavioral grounding (online): measure what developers actually act on in real repositories, so the benchmark stays anchored to real use.

The goal is simple: keep the measurement aligned with what developers value, and keep it refreshable so it does not get stale. This is the direction the field needs.

Why precision is the bottleneck

Legacy rule-based code review failed in a predictable way: it starts out useful, with plenty of signal, but over time becomes noise. Reviewers start to scroll past it and find ways to comment it out. Soon after, teams stop trusting it, and whatever recall the tool had stops mattering because the signal no longer impacts quality.

We have always believed that flagging fewer items, each consistently worth attention, changes developer behavior: it reduces review load, accelerates merges, and makes teams more willing to let AI propose larger changes because verification is reliable.

What we built Baz to do

Baz was designed around a practical constraint: review bandwidth.

The fastest way to lose review bandwidth is to ask for attention too often. The fastest way to earn it is to be right when it matters.

So we optimize for:

Fewer comments, higher impact

Baz tries to emit comments only when they cross a usefulness threshold. That means we are willing to miss marginal suggestions if the alternative is training developers to ignore the tool.
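One way to picture a usefulness threshold is as a gate between candidate comments and what actually lands on the pull request. The `Comment` shape, the score, and the cutoff value below are all hypothetical, a sketch of the trade-off rather than Baz's implementation:

```python
from dataclasses import dataclass

@dataclass
class Comment:
    path: str
    line: int
    text: str
    usefulness: float  # assumed model-assigned score in [0, 1]

def emit(candidates: list[Comment], threshold: float = 0.8) -> list[Comment]:
    # Dropping marginal suggestions deliberately trades recall
    # for precision: only comments above the bar reach the reader.
    return [c for c in candidates if c.usefulness >= threshold]

drafts = [
    Comment("auth.py", 12, "Token is never invalidated on logout.", 0.95),
    Comment("auth.py", 40, "Consider renaming this variable.", 0.30),
]
print(len(emit(drafts)))  # 1
```

Raising the threshold is how a reviewer stays quiet enough to keep being read; the cost is exactly the marginal suggestions the paragraph above says we are willing to miss.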

Clear, local reasoning

When Baz flags an issue, it should be obvious what code is implicated and why the issue matters. Review is a communication problem as much as a detection problem.

Consistent standards

Teams disagree about what “good” means. A benchmark that treats all teams as identical ends up scoring tools on a definition nobody asked for. We think the future of review is configurable and spec-driven, and benchmarking needs to move in that direction.

Precision is the first-order signal that these choices are working.

Interpreting the result responsibly

Two things can be true at the same time:

  • Ranking well on an independent benchmark is a meaningful milestone.
  • Benchmarks have error bars, and early results are not the end of the story.

Code Review Bench is explicit about iteration and refresh, which is the correct stance. We will treat this result the same way: as a strong signal, not a final verdict.

The outcome we care about is not a static placement on a leaderboard. It is sustained performance as the benchmark updates, datasets refresh, and evaluation improves.

[Chart: F0.5 Leaderboard (Feb-26-2026)]

What’s next

We are using this moment to do two concrete things.

1) Double down on verifiable signal

Precision is the adoption gate, but it is not the only metric that matters. As the benchmark adds stronger recall measurement, better grounding, and richer definitions of what counts as a bug, we want Baz to remain high-signal while expanding coverage in a way teams still trust.

2) Participate in standardization

Benchmarks become standards when tool builders take them seriously, challenge them, and help shape the interface. A shared evaluation harness, consistent output formats, and transparent methodology are how this category matures.

We want that maturity. Review is the verifier for code generation, and verification is a lever for everything that comes next.

Try it, pressure test us

If you are evaluating automated code review tools, you can look at the benchmark results and then test the claims in your own environment.

If you find failure modes, edge cases, or categories where you want Baz to be stricter or quieter, tell us. The goal is not to “win” a single benchmark run. The goal is to build a reviewer that teams keep enabled.
