Does feeding code in order improve model outputs?
We’ve previously discussed the roadblocks to LLMs being able to realize their full potential: limited context windows, difficulties handling multi-repository projects, and insufficient real-time app telemetry. These challenges have been our primary focus over the past year, leading to extensive research into the strengths and weaknesses of frontier LLMs.
In this deep dive, we explore a crucial factor for improving AI integration into developer workflows: the ability of LLMs to interpret code grammar and deduce downstream impacts. Specifically, we analyze whether optimizing file ordering can help LLMs better understand code diffs.
What is Code Grammar, and Why Does It Matter?
Code grammar serves as the GPS of software development. It includes universal language rules (e.g., syntax) and project-specific conventions, such as naming standards, formatting, dependency structures, and file organization.
A useful analogy for code grammar is traffic signals. Just as traffic signals guide drivers to ensure smooth and safe traffic flow, code grammar helps developers understand and navigate a codebase effectively. For instance, consider this simple example (a short code sketch after the list makes it concrete):
- Stop Signal: Similar to input validation structs. Just as stop signals ensure vehicles proceed only when safe, input validation ensures that only valid, correct data is processed.
- Yield Signal: Analogous to dependency injection. Yield signals let drivers give way, just as dependency injection lets classes rely on external providers for resources rather than managing them internally.
- One-Way Signal: Comparable to singleton structs. Both regulate flow in a controlled, unidirectional manner.
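To make the analogy concrete, here is a minimal Python sketch of the three patterns. The names are hypothetical, purely for illustration:

```python
from dataclasses import dataclass

# "Stop signal": input validation -- only valid data may proceed.
@dataclass
class SignupRequest:
    email: str

    def __post_init__(self):
        if "@" not in self.email:
            raise ValueError(f"invalid email: {self.email}")

# "Yield signal": dependency injection -- the class yields to an external provider
# instead of constructing its own resources.
class Mailer:
    def send(self, to: str, body: str) -> None:
        print(f"sending to {to}: {body}")

class SignupService:
    def __init__(self, mailer: Mailer):   # dependency handed in, not created here
        self.mailer = mailer

    def register(self, req: SignupRequest) -> None:
        self.mailer.send(req.email, "welcome!")

# "One-way signal": a singleton -- all traffic flows through a single instance.
class Config:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

SignupService(Mailer()).register(SignupRequest("dev@example.com"))
assert Config() is Config()
```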
When an AI model generates a code snippet, it adheres to universal rules but may disregard your project’s specific grammar. This is akin to running a red light. While future models with infinite context windows might overcome this limitation, today’s models often fall short.
What Happens When Models Understand Code Grammar?
An LLM’s ability to meaningfully contribute to developer workflows depends on its grasp of code grammar.
Let’s explore how this manifests in practice. First-generation AI coding tools, such as Copilot and Cursor, assist developers in writing code, while tools like Copilot Workspace and Devin take on more of the workflow themselves, with broader control and access to the codebase. We mapped LLMs' ability to understand code grammar across two dimensions:
- X-axis: From code completion (aiding developers with existing code) to code generation (starting from scratch).
- Y-axis: From developer-controlled workflows (e.g., tab-to-accept suggestions) to autonomous, agent-driven workflows.
In all scenarios, grammar is essential. For example, in human-in-the-loop workflows, developer feedback helps the LLM refine its results and stay within the project's conventions. As we move toward autonomous code generation, maintaining consistent grammar becomes even more critical. Proper code grammar ensures AI-generated outputs align with human workflows and project requirements.
Improving Code Handling Through Data Ordering
Expanding context windows and using chain-of-thought (CoT) prompting are common methods to improve LLM performance. However, this article focuses on a third approach: data ordering—the sequence in which data is fed to models.
Chain-of-Thought (CoT) Prompting
CoT prompting guides an AI model to reason step-by-step:
- Breaking tasks into smaller units.
- Analyzing code incrementally.
- Prompting the model to think through relationships between functions, variables, and structures.
While CoT focuses on reasoning post-input, data ordering optimizes the input sequence itself.
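As an illustration (not a prescribed template), a CoT-style review prompt might look like this:

```python
# Illustrative chain-of-thought review prompt. The diff content is a placeholder.
diff = "...unified diff of the change goes here..."

prompt = f"""You are reviewing a code change. Reason step by step:
1. List the functions and structures this diff modifies.
2. For each one, identify the variables and functions it depends on.
3. Trace which callers or modules could be affected downstream.
4. Only then summarize the risk of the change.

Diff:
{diff}"""

print(prompt)
```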
Data Ordering
Data ordering analyzes code grammar and structure before inputting it into the model. This preprocessing ensures the model receives data in a sequence reflecting logical dependencies and relationships. For example:
- A modified function is followed by its dependent elements, such as libraries and related functions.
This structured approach aligns with how developers navigate codebases, enhancing AI’s ability to process and reason about code.
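A minimal sketch of what that ordering can look like, using hypothetical snippets rather than our actual pipeline: the changed function comes first, then what it uses, then what uses it, so the model sees the edit before its blast radius.

```python
# Assemble the model input in dependency order (hypothetical code snippets).
changed_fn = "def parse_config(path): ..."                     # the modified function
dependencies = ["import tomllib", "DEFAULTS = {...}"]          # what it relies on
dependents = ["def load_app():\n    cfg = parse_config('app.toml')"]  # what relies on it

model_input = "\n\n".join([changed_fn, *dependencies, *dependents])
print(model_input)
```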
The Problem: Pull Requests
Now, let’s look at the problem we are aiming to solve at Baz: the pull request. We believe there is significant potential to make this key procedure more efficient, less painful, and more effective – and grammar is the key.
For better or worse, pull requests (mostly via GitHub) are the main convention by which code edits are introduced into a codebase. The problem is that they are blind to grammar, which ultimately means that the burden of predicting the downstream impacts of code changes falls on the shoulders of human developers. For various reasons, humans are often ill-equipped to support that burden.
When we ask our peers to evaluate code changes in a huge project, we necessarily give them only a subset of files. The unspoken assumption is that they can interpolate the impacts within the blast radius of a change, just as a compiler or interpreter could. The reality is that this is often beyond the knowledge and know-how of the human reviewer, who may be new to the project or even the coding language. Instead, it is left to downstream processes – like our CI/CD, integration, or build systems – to catch the downstream impacts.
Anatomy of a PR
Let’s take a close look at the PR review process to try and understand how it can be improved.
There are four areas in the GitHub PR tab where we go to understand the flow of a change:
- The description tab: A timeline of all the changes that have occurred: the main description, checks on the bottom, and merge blockers. You don't see the code, but you see all your team members’ interactions.
- The commits tab: Where you can see a list of all incremental changes. This is really only used after you've already reviewed the PR.
- The checks tab: This is where you go to see the results of your CI checks, such as whether automated tests passed or if there were build failures. However, these results are often incomplete, forcing you to go to the Actions tab for more detailed logs or errors, disrupting the review flow.
- The file changes tab: This is where most reviewers spend most of their time: an A-to-Z ordered subset of file changes, based on a Git diff.
The PR process in GitHub is simply difficult: time-consuming, confusing, and disjointed.
A better way to do PRs
The most effective way to review a PR is to check out the branch locally, run it, debug, and manually examine definitions and references. While slow, this approach is thorough. At Baz, our goal is to automate this process—teaching AI to read code like humans. That brings us back to grammar and, more specifically, code indexing.
Here is where our empirical research really started – comparing the relative merits of existing code indexing methods and LLMs. There are several different tools currently in use that perform code indexing – here’s a quick rundown:
- LSP: The Language Server Protocol, maintained by Microsoft. LSP is a stateful process that usually runs alongside the IDE to enable development features such as “find usages”, “find definitions”, etc.
- LSIF: The Language Server Index Format, a stateless evolution of the LSP approach. Scans an entire project and turns it into a graph representation that can be persisted.
- Stack graphs: A GitHub project. A language/mechanism developed to infer reference data by simply parsing codebases (no compilation or execution necessary).
- Tree-Sitter: Another GitHub project. Creates Abstract Syntax Trees (ASTs) from source code. Tree-Sitter breaks code down into nodes, with each child node representing a more specific part of the codebase (see the sketch after this list).
- SCIP: From Sourcegraph. Similar to LSIF but intended to solve some of its problems by including meaningful symbol names and more efficient data structures.
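Tree-Sitter itself is a C library with bindings for many languages, but the core idea of an AST is easy to see with Python's built-in `ast` module, used here purely for illustration:

```python
# Source code becomes a tree of nodes, each child more specific than its parent.
import ast

source = """
def total(prices):
    return sum(prices)
"""

tree = ast.parse(source)
print(ast.dump(tree, indent=2))
# Roughly: Module -> FunctionDef 'total' -> Return -> Call(sum) -> Name 'prices'
```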
What these tools all have in common is that they are static code analyzers that rely on a set of predefined rules. And then we have LLMs.
LLMs can take on the role of code indexing – but in a manner quite different from the existing static tools. You can open your ChatGPT or Claude console, paste in an entire codebase or repo, and ask it specific questions about how code is defined and referenced in specific languages – and receive intelligible, intelligent answers in return. However, unlike the static analyzers we looked at above, LLMs do not build an Abstract Syntax Tree (AST). Instead, they work over natural language, which means that if something is visible, there's a good likelihood the model will capture it – and, by extension, if it is not visible, the LLM may fail to capture it. That gives us some idea of both the promise and the potential drawbacks of LLMs in this space.
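A minimal sketch of that workflow, assuming the OpenAI Python client and a made-up code snippet:

```python
# Asking a model about definitions and references directly -- no AST involved.
from openai import OpenAI

snippet = """
def charge_customer(customer_id, amount):
    ...

def checkout(cart):
    return charge_customer(cart.owner_id, cart.total)
"""

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Where is `charge_customer` defined below, and which "
                   f"functions reference it?\n\n```python\n{snippet}\n```",
    }],
)
print(resp.choices[0].message.content)
```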
Benchmarking the tools
We compared several code indexing tools and LLMs:
Test #1: Sorting Java with SCIP
Our first test was using SCIP to analyze Java repositories. Although we found that SCIP was effective in a variety of use-cases, we also found many areas where it fell short.
We found SCIP to be severely limited when it came to local variable handling. Because it names local variables generically (e.g. “local 2”), it is difficult to work out what each variable does. The messy workaround is to cross-reference with other tables or “documentation” fields. Not ideal. We also found that parsing relationships between package structure elements was inconsistent, leading to fragmented data. For some Java elements, it even reports line numbers of zero.
Overall, despite finding workarounds (and we are contributing possible fixes for some of these issues), we concluded that SCIP could not serve as a code indexer for all the languages we need to support.
Test #2: Sorting Rust with Tree-Sitter
We sought to benchmark Tree-Sitter – a highly effective incremental code scanning tool that’s used by many static analyzers in security applications – against a typical Rust project. For this test, we created a project that combined core Rust capabilities with the usual JavaScript and Python for specific use cases.
As with SCIP, for some use cases we found it effective. It parsed basic structs and their field types adequately, reliably establishing the expected parent-child relationships. But again, we quickly discovered flaws. Input functions couldn't be reliably mapped to their parent structs – trying to do so would break the entire chain. We also found that multi-line use statements caused identifier splitting issues in some cases.
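To illustrate what "identifier splitting" means in practice – this is a toy line-based pass, not Tree-Sitter's internals – consider a multi-line use statement:

```python
# One logical Rust import, spread across several lines.
rust_source = """\
use crate::payments::{
    charge::ChargeRequest,
    refund::RefundRequest,
};
"""

# A naive line-oriented pass sees three unrelated fragments instead of one import;
# the full paths (crate::payments::charge::ChargeRequest, ...) are never rebuilt
# unless the parser treats the brace-delimited group as a single node.
for line in rust_source.splitlines():
    stripped = line.strip()
    if stripped.startswith("use") or stripped.endswith(","):
        print("fragment:", stripped)
```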
Test #3: Sorting Python with GPT-4o
Ultimately we had not found the silver bullet we were looking for. Static analyzers had given us a benchmark standard of how current tools traverse the AST in various languages. We now knew where the gaps were.
It was time to see whether the LLM could bridge them. Unlike static tools, LLMs do not traverse a dependency graph or analyze relationships across the codebase. Instead, they process code snippets as input and apply probabilistic modeling to generate explanations. This approach falls short when it relies solely on direct string matching.
In analyzing a sample Python project, GPT-4o was able to offer up the full spectrum of definitions. Multi-language support was also excellent. In fact, we believe that GPT-4o can essentially support all languages – even those that are very new. However, we quickly found areas where the LLM was not so effective. Importantly, these areas were quite different from where the static tools struggled.
Reference matching emerged as the key problem for GPT-4o. We found that GPT-4o struggled with the most common patterns: aliases, pluralized names, contextual method calls, and other references.
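A small, hypothetical example of why direct string matching breaks down on aliases and contextual calls:

```python
# Hypothetical source under review.
source = """
import inventory as inv          # alias
items = inv.load_items()         # pluralized name, contextual call
for item in items:
    item.restock()
"""

# A literal search for the module name finds the import line only:
print([n for n, line in enumerate(source.splitlines(), 1) if "inventory" in line])
# -> only line 2; the usage through the alias `inv` on line 3 is missed entirely.
```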
Proposed Implementation: Deterministic traversal inputs with AI-generated explanations
The key idea behind this implementation is to provide the AI with a fixed, repeatable sequence of code inputs based on logical dependencies, ensuring that it processes code in the correct order. We observed that AI models struggle to capture the complex relationships within large codebases on their own. By curating the sequence in which code is presented, we help the AI understand the dependencies between different files, functions, and data structures.
For instance, when a function is modified, all dependent elements like imported libraries or other functions relying on it are prioritized in the correct order. This deterministic traversal ensures the AI can follow the flow of logic, similar to how a developer would navigate a codebase.
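At the file level, a minimal sketch of such a deterministic traversal might look like the following. The reverse import graph and file names are hypothetical, not our production data model:

```python
# Breadth-first traversal from the changed file: nearest dependents first.
from collections import deque

imported_by = {                         # who imports whom, reversed
    "db/schema.py":    ["db/queries.py"],
    "db/queries.py":   ["api/handlers.py", "jobs/sync.py"],
    "api/handlers.py": [],
    "jobs/sync.py":    [],
}

def traversal_order(changed_file: str) -> list[str]:
    order, queue, seen = [], deque([changed_file]), {changed_file}
    while queue:
        current = queue.popleft()
        order.append(current)
        for dependent in imported_by.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return order

print(traversal_order("db/schema.py"))
# -> ['db/schema.py', 'db/queries.py', 'api/handlers.py', 'jobs/sync.py']
```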
AI-generated explanations complement this by offering human-readable justifications for AI predictions, detailing not just what has changed but why and how those changes may impact the broader system. These explanations empower developers to make informed decisions and reduce the need for further manual investigation.
Initial Results
In our preliminary tests with Rust and TypeScript, deterministic inputs allowed the AI to better scope impacted dependencies. The AI-generated explanations provided clear, actionable insights, helping developers understand how code changes could affect the system.
However, we also encountered scalability issues when applying this approach to larger codebases. While it worked well for smaller projects, handling large-scale repositories remains an ongoing challenge, and optimizing performance in this area is a priority.
As we explored the optimal method for ordering topics in a pull request (PR), we faced a key decision: should we provide the LLM with the full files and let it deduce the appropriate order, or should we supply it with import-graph data – indicating which files import from which others – and let the model leverage this for sorting? Ultimately, we chose the latter approach in an effort to keep the process simpler and minimize the risk of hallucinations in the AI’s reasoning. When we supplied import-graph data, Claude consistently outperformed GPT-4o, generating more accurate orderings. For example, in one PR, Claude correctly prioritized files based on their import relationships, while GPT-4o ordered files with fewer imports first.
When we had to fall back to sorting files alphabetically due to the absence of adjacency data, Claude handled the task logically, while GPT-4o displayed erratic behavior: it sometimes ordered files in reverse or grouped them based on a supposed similarity that wasn’t present in the data. These results suggest that, while LLMs can be powerful tools for sorting and organizing code, they still require carefully curated inputs to avoid misinterpretation or errors. The gap in reliability between models highlights the importance of carefully considering the data provided to AI, especially in complex tasks like file ordering where context is critical.
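For illustration, the two kinds of input we compared look roughly like this; the PR file names are hypothetical:

```python
pr_files = ["api/routes.ts", "core/auth.ts", "core/session.ts"]

# Option A: hand the model adjacency data instead of full files.
import_graph = {
    "core/session.ts": [],                 # imports nothing else in this PR
    "core/auth.ts":    ["core/session.ts"],
    "api/routes.ts":   ["core/auth.ts"],
}
prompt = (
    "Order these files so that imported files come before their importers:\n"
    f"{import_graph}"
)

# Option B: no adjacency data available -- deterministic alphabetical fallback.
fallback_order = sorted(pr_files)
```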
Moving forward, we will focus on optimizing our traversal algorithms to improve efficiency in larger projects, refine the clarity of AI-generated explanations, and expand our language support. Additionally, user feedback will play a crucial role in ensuring that our solution aligns with real-world developer workflows.
At Baz, we’re committed to bridging this gap by exploring innovative techniques to enhance AI’s effectiveness in real-world coding environments. Follow along with our research in our resources and sign up to get first access to Baz here → https://baz.co/