Skilled code reviewers are invaluable in any development team. With their expertise, full visibility of the codebase, and the help of standard diffing tools, they can intuitively identify breaking changes – without wasting time on inconsequential edits. But even the most experienced rockstar reviewers sometimes miss breaking changes – and some breaking changes are almost impossible to catch.
AI holds the promise of an always-on, never-tiring, all-seeing reviewer – but existing AI-powered tools fall well short of that ideal. Feeding an entire repository to an AI model is rarely effective: Large Language Models (LLMs) require well-structured, deterministic inputs. As we’ve discussed in detail, Baz has experimented with various ways of ordering files for AI-assisted development, to optimize AI performance and improve the accuracy and efficiency of code reviews. Some of the fruits of that research will soon be available as the new Baz Reviewer.
But to understand why developers need Baz Reviewer, let’s take a step back and ask: What are we asking human reviewers to do when we task them with finding breaking changes?
Finding breaking changes in the modern codebase isn’t easy
A breaking change itself might be as simple as a subtle tweak to an API or a minor adjustment in a database schema, but its impact may be felt in multiple systems. And it’s not a simple task to assess whether that small change is a breaking one. To identify breaking changes, and distinguish them from innocuous changes, reviewers must traverse many domains: usage patterns, debugging workflows, performance tuning, security hardening, observability, and more.
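To make that concrete, here is a deliberately tiny, hypothetical example – the handler and field names are invented for illustration – in which a two-token edit to a response payload breaks every consumer of that payload:

```python
# v1 – consumers across several services rely on this response shape.
def get_order_v1(order_id: str) -> dict:
    return {"id": order_id, "status": "shipped", "total_cents": 1299}

# v2 – a "minor cleanup" renames the field and changes its unit.
def get_order_v2(order_id: str) -> dict:
    return {"id": order_id, "status": "shipped", "total": 12.99}

# Callers written against v1 now fail – or, worse, silently misread
# a float in dollars where they expected integer cents.
order = get_order_v2("ord_42")
try:
    cents = order["total_cents"]
except KeyError:
    print("breaking change: 'total_cents' is gone from the response")
```

A textual diff shows a one-line modification; nothing about it looks dangerous in isolation.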
In highly interdependent software systems – such as those built with microservices or multi-repo architectures – this task becomes exponentially more complex. Reviewers must account for changes across languages, file types, and even entirely different services. A breaking change can have a massive blast radius, encompassing API behavior, parameter structures, response formats, and underlying implementations.
Even when best practices like comprehensive testing, compilation checks, and continuous integration are followed, it’s unrealistic to expect reviewers to catch every potential issue.
Missing breaking changes can lead to costly troubleshooting later, but catching them in the review process yields more than just incident prevention. A well-conducted review that gives a complete view of critical changes can serve as a roadmap for cross-team collaboration. This ensures that necessary modifications are communicated and effectively coordinated across functions.
AI has the potential to help teams spot breaking changes more consistently, but only when paired with the right context and (as we’ve learned) robust static analysis.
How to get the most comprehensive review of open pull requests
Our approach to better code reviews relies on intentional tooling and human-guided intervention to prepare code inputs, before LLMs are deployed.
We use two methods to prepare code for data analysis:
1. Break code down into elements and building blocks with Abstract Syntax Tree (AST) diffing
Where file-based, Git-style diffing only compares literal changes between two files (additions, deletions, and modifications), AST diffing takes code structure into consideration. This makes it far more powerful for surfacing critical changes.
AST diffing can detect when a function signature is modified or when a line of code is moved within a specific method or class. Trivial changes like reformatting are filtered out, while structural changes that affect code behavior are brought to the fore.
AST diffing can also recognize refactoring, categorizing it as a code-structure evolution rather than a simple edit.
By implementing AST diffing, advanced tooling can enforce coding standards or actively identify potential bugs – something that’s beyond the abilities of traditional diffing. Baz Reviewer’s AST diffing capabilities are powered by difftastic and tree-sitter, two tools that analyze files based on their syntax, with support for over 30 languages and 10 structured file types.
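Baz Reviewer’s internals aren’t reproduced here, but a minimal sketch of the underlying idea – comparing parsed function signatures rather than raw lines – could look like the following. It assumes the py-tree-sitter bindings and the tree-sitter-python grammar wheel; the helper and sample snippets are illustrative, not Baz’s code.

```python
# pip install tree-sitter tree-sitter-python  (py-tree-sitter >= 0.23)
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

parser = Parser(Language(tspython.language()))

def signatures(source: str) -> dict[str, str]:
    """Map each top-level function name to its parameter-list text."""
    tree = parser.parse(source.encode("utf8"))
    sigs = {}
    for node in tree.root_node.children:
        if node.type == "function_definition":
            name = node.child_by_field_name("name").text.decode()
            params = node.child_by_field_name("parameters").text.decode()
            sigs[name] = params
    return sigs

# Hypothetical before/after: a new required parameter is a structural
# change, whereas pure reformatting would leave the signature map intact.
old = "def get_user(user_id):\n    return db.find(user_id)\n"
new = "def get_user(user_id, tenant):\n    return db.find(user_id)\n"

old_sigs, new_sigs = signatures(old), signatures(new)
for name in old_sigs.keys() & new_sigs.keys():
    if old_sigs[name] != new_sigs[name]:
        print(f"signature changed: {name}{old_sigs[name]} -> {name}{new_sigs[name]}")
```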
While both Git-based and AST diffing have important roles to play in the review process, AST diffing delivers deeper insights that can take reviews to the next level.
2. Optimize AI outputs by structuring inputs
Our research into file ordering clearly demonstrated the importance of structured inputs for LLMs too. With full files as inputs, the models’ sorting ability was erratic. When we provided input-graph data, the AI sorted the files far more effectively, ultimately producing better outputs with fewer hallucinations. Incidentally, in the battle of the LLMs, we consistently found that Claude beat GPT-4o!
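The graph format from that research isn’t reproduced in this post, but the core idea – ordering files so dependencies precede their dependents – can be sketched with a plain topological sort. The file names below are hypothetical:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical import graph: each file maps to the files it depends on.
import_graph = {
    "api/routes.py": {"services/users.py", "models/user.py"},
    "services/users.py": {"models/user.py", "db/client.py"},
    "models/user.py": set(),
    "db/client.py": set(),
}

# A deterministic topological order, so the model sees definitions
# before their usages.
order = list(TopologicalSorter(import_graph).static_order())
print(order)
# e.g. ['models/user.py', 'db/client.py', 'services/users.py', 'api/routes.py']
```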
Using structured inputs to prime AI models provides a clear advantage in efficiency, cost, and accuracy. Rather than feeding an entire repository into an LLM, Baz Reviewer takes a targeted approach. Because the LLM processes only the most relevant chunks, reviews become faster, less costly, and more reliable.
This is achieved by using LangChain to break code into manageable chunks and tree-sitter to perform syntax-aware analysis of those chunks. Baz Reviewer structures these inputs so that the LLM receives context-rich, deterministic data to analyze. This step isn’t just about improving AI performance – it’s about setting developers up for success.
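As a rough sketch of what syntax-aware chunking with LangChain’s language-aware splitter looks like – the chunk size and file name are arbitrary choices for illustration, not Baz’s settings:

```python
# pip install langchain-text-splitters
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Split along syntactic boundaries (classes, functions) rather than
# arbitrary character offsets, so each chunk stays self-contained.
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=600,   # arbitrary; tune to the model's context budget
    chunk_overlap=0,
)

with open("services/users.py") as f:  # hypothetical input file
    chunks = splitter.create_documents([f.read()])

for doc in chunks:
    print(len(doc.page_content), repr(doc.page_content.splitlines()[0]))
```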
By pairing the AI with human-curated and semantically enriched inputs, the results align closely with real-world expectations, reducing noise and unnecessary back-and-forth in the review process.
As large leaps in LLM capabilities become less frequent, optimizing how we interact with current-generation models is critical. Our research indicates that structured inputs are an effective long-term strategy to deliver context consistently and deterministically.
Catching breaking changes with Baz Reviewer
Baz puts in the work to get the best output – on its own, AI is limited in myriad ways when the goal is contextual recommendations. Baz Reviewer enhances your GitHub workflow by identifying critical API modifications, such as removed or renamed endpoints, HTTP methods, parameters, response fields, and enum values. It links API endpoints to code entry points, tracing parameter and return types across complex scenarios, including destructured types and generics.
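To illustrate just one item from that list – a removed enum value, with names invented for the example – a single deleted member can break a peer service that still sends the old wire value:

```python
from enum import Enum

# v2 of a shared status enum, after a PR removed one "unused" member.
class OrderStatus(Enum):
    PENDING = "pending"
    SHIPPED = "shipped"
    # RETURNED = "returned"   <- deleted in this PR

# A peer service still emitting the old wire value now fails at runtime:
try:
    OrderStatus("returned")
except ValueError as exc:
    print(f"without review-time detection, this surfaces in production: {exc}")
```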
For more technical details, read the docs.
Ready to try Baz? Click here to Get Started.