
Launching the New Baz Evaluations: From Metrics to Action

Aug 26, 2025
Shachar Azriel

Every conversation we have with a VP of Engineering or Staff Engineer eventually comes back to the same pain point:

“I can see the AI comments but I have no idea if they’re actually effective.”

Traditional accuracy metrics do not map well to real developer workflows. Comment counts tell you nothing about quality. Most tools give you raw data without telling you what to do with it.

We built Baz Evaluations last year to start answering this question, and recently shared the key signals required to understand whether AI code review is effective. Now we have taken it a step further with a redesigned, end-to-end experience that turns feedback into clear, actionable recommendations.

It is a new way to measure, manage, and improve every reviewer in your organization based on how developers actually respond in the PR. Read on to learn about the key signals we use and the features that will help your team optimize AI code review.

What’s New: A Complete Upgrade to Agent Evaluations

Evaluations now come with richer signals, better visibility, and a new Impact Dashboard that connects the dots between data and action. Here’s what’s inside:

Impact Dashboard with Recommendations

A new central view shows:

  • Org-wide impact: PRs reviewed, suggestions accepted, rejection rates
  • Category breakdown: correctness, security, naming, and more
  • Reviewer performance: compare acceptance and sentiment rates across agents
  • Actionable recommendations: identify declining reviewers, surface top performers, and discover new customized agents based on your team’s best practices

Smarter Outcomes

Every AI comment is now classified as Accepted, Rejected, or Unaddressed, so you can see a true adoption rate across your team, not just how often the AI speaks up.
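
As a rough sketch of the idea, here is how a true adoption rate could fall out of those three outcomes. The labels, data shapes, and function below are illustrative, not Baz’s actual model:

```python
from collections import Counter

# Hypothetical outcome labels; Baz's internal model may differ.
ACCEPTED, REJECTED, UNADDRESSED = "accepted", "rejected", "unaddressed"

def adoption_rate(outcomes: list[str]) -> float:
    """Share of AI comments developers actually acted on.

    Counting every comment in the denominator, including the ones
    nobody replied to, is what separates a true adoption rate from
    a raw comment count.
    """
    if not outcomes:
        return 0.0
    return Counter(outcomes)[ACCEPTED] / len(outcomes)

# Example: 10 comments, 6 accepted, 3 rejected, 1 ignored -> 0.6
print(adoption_rate([ACCEPTED] * 6 + [REJECTED] * 3 + [UNADDRESSED]))
```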

Real Sentiment (Even Emojis Count)

We analyze the conversation around each comment to gauge sentiment - positive, neutral, or negative - and weigh it by who is giving the feedback. A principal engineer’s 👍 carries more weight than a casual +1.
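
A minimal sketch of how role-weighted sentiment could work, assuming hypothetical role weights and sentiment scores (Baz’s actual weighting is not public):

```python
# Hypothetical weights: feedback from senior reviewers counts for more.
ROLE_WEIGHTS = {"principal": 3.0, "staff": 2.0, "senior": 1.5, "engineer": 1.0}
SENTIMENT_SCORES = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}

def weighted_sentiment(reactions: list[tuple[str, str]]) -> float:
    """Weighted-average sentiment for one AI comment.

    Each reaction is a (sentiment, role) pair, so a thumbs-up from a
    principal engineer moves the score more than a casual +1.
    """
    if not reactions:
        return 0.0
    total = sum(ROLE_WEIGHTS.get(role, 1.0) for _, role in reactions)
    return sum(
        SENTIMENT_SCORES[s] * ROLE_WEIGHTS.get(role, 1.0)
        for s, role in reactions
    ) / total

# One principal's approval outweighs one engineer's pushback: 0.5
print(weighted_sentiment([("positive", "principal"), ("negative", "engineer")]))
```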

Reviewer Memory That Learns Your Standards

When developers push back, Baz captures it as a “memory” and updates the reviewer’s prompt automatically. Reviewers stop making the same mistakes, and all prompt changes are versioned so you can roll back anytime. This approach has been shown to improve reviewers’ acceptance rates across the board.
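
Mechanically, this could look something like the sketch below: each piece of pushback becomes a new prompt version, and rolling back is just pointing at an earlier one. Everything here, names included, is a hypothetical illustration:

```python
from dataclasses import dataclass, field

@dataclass
class VersionedReviewerPrompt:
    """A reviewer prompt that learns from developer pushback."""
    versions: list[str] = field(
        default_factory=lambda: ["Review this PR for correctness."]
    )
    current: int = 0

    @property
    def prompt(self) -> str:
        return self.versions[self.current]

    def add_memory(self, pushback: str) -> None:
        # Fold the developer's objection into the prompt as a standing rule.
        self.versions.append(f"{self.prompt}\nTeam standard: {pushback}")
        self.current = len(self.versions) - 1

    def rollback(self, version: int) -> None:
        # Every change is kept, so any earlier version is one step away.
        self.current = version

reviewer = VersionedReviewerPrompt()
reviewer.add_memory("We prefer guard clauses over nested conditionals.")
print(reviewer.prompt)  # now includes the team standard
reviewer.rollback(0)    # back to the original prompt anytime
```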

Prompt Versioning for Continuous Tuning

Treat prompts like living code. Test changes, see the impact on acceptance and sentiment, and iterate safely.
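
In practice, that iteration loop can be as simple as diffing acceptance and sentiment between two prompt versions. The figures below are placeholders, not measured results:

```python
# Placeholder per-version metrics, as an evaluations dashboard might report them.
metrics = {
    "v1": {"acceptance": 0.48, "sentiment": 0.10},
    "v2": {"acceptance": 0.63, "sentiment": 0.35},
}

def compare_versions(old: str, new: str) -> None:
    """Print how each metric moved between two prompt versions."""
    for key in ("acceptance", "sentiment"):
        delta = metrics[new][key] - metrics[old][key]
        print(f"{key}: {metrics[old][key]:.2f} -> {metrics[new][key]:.2f} ({delta:+.2f})")

compare_versions("v1", "v2")
# acceptance: 0.48 -> 0.63 (+0.15)
# sentiment: 0.10 -> 0.35 (+0.25)
```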

This is not just more data. It is a guided path to higher impact, telling you exactly how to make your automated code reviews more effective for your team.

Get started with Baz

From our very first customer conversations, the message was clear. Teams want AI code review that delivers real, measurable impact. They want to know which reviewers are pulling their weight, which are creating noise, and what they should do next.

The new Baz Evaluations experience was built to answer exactly that. It gives you:

  • Clear acceptance and sentiment metrics for every comment
  • Reviewers that learn from your team’s feedback
  • Transparent performance trends you can act on
  • Concrete recommendations to improve results without guesswork

This is the visibility and control engineering leaders have been asking for. Get started with Baz for free at baz.co.
