
The AI Coding Loop Hits Runtime

Developers found flow in the IDE; now agents are chasing truth in execution.

Oct 24, 2025
Guy Eisenkot

Until recently, “AI coding” meant autocomplete, snippets, and assisted development inside the IDE. But models are evolving fast. They’re not just generating code anymore; they’re learning to run, test, and debug what they build.

That shift moves AI development from assistive to autonomous, extending the coding loop all the way to runtime.

The last few months in AI coding have been intense. New models, new operating modes, and new form factors keep pushing more developers from assisted-development workflows toward (more) autonomous programming.

At the individual level, developers are discovering a new flow state: one not interrupted by web searches and Stack Overflow, but sustained by persistent implementation and iteration in the IDE and CLI. Runtime was always there locally; developers have long had local environments semi- or fully wired to their pipelines, databases, and workloads. Debugging locally with AI is a natural extension of that setup.

The Team-Level Challenge

The team level is an entirely different story. Building reliable staging and pre-prod environments, let alone ephemeral environments with reliable connectivity, requires non-trivial platform engineering. In past cycles, we were obsessed with test coverage and with crafting optimal scenarios to emulate real-life user flows. Agents can not only continuously increase coverage across unit, integration, and end-to-end tests; they can also go beyond input/output.
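To make “beyond input/output” concrete, here is a minimal sketch of the kind of end-to-end check an agent might generate against a per-PR preview environment. It assumes a TypeScript project with Playwright installed; the PREVIEW_URL variable, the /checkout route, and the labels are hypothetical.

```typescript
// A hypothetical agent-generated end-to-end check against a per-PR preview environment.
// PREVIEW_URL is assumed to be injected by the CI pipeline; the route and labels are made up.
import { test, expect } from '@playwright/test';

const previewUrl = process.env.PREVIEW_URL ?? 'http://localhost:3000';

test('checkout flow still works after this PR', async ({ page }) => {
  // Drive the real UI the way a user would, not just a function's inputs and outputs.
  await page.goto(`${previewUrl}/checkout`);

  await page.getByLabel('Email').fill('reviewer@example.com');
  await page.getByRole('button', { name: 'Place order' }).click();

  // Assert on observable runtime behavior, not a return value or status code alone.
  await expect(page.getByText('Order confirmed')).toBeVisible();
});
```

The point is the shape of the assertion: it targets what the running application actually renders rather than what a single unit returns.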

For years, our “safe-shipping” rituals were built around our fear of downtime. We put guardrails and chokepoints in place in the form of CI steps, PR checks, and feature flags. Continuous delivery tools thrived because they promised safety through smaller, slower releases. Don’t get me wrong: these are extremely effective means of safeguarding prod, but the tide has shifted. Coding agents mean more code, and more code means more time spent satisfying past controls.

That’s the paradox, though: in trying to prevent bad code from shipping, we stopped getting closer to understanding it.

The Return of the CLI

One more exciting plot twist in this story is the re-emergence of the CLI form factor. After being crammed into a tab in most common IDEs, the CLI has been revitalized by Cursor, Claude Code, and others into an effective operating system for the modern coding agent. While many credit its fast-paced, transactional character, I believe the main reason we went back to loving the CLI is that it’s the most effective way to proxy runtime into developer workflows.

What’s next?

The holy grail of code review is one where a reviewer drops whatever they are doing and steps into the pull request author’s shoes to validate and verify their work. There are many ways to go about it, but of the hundreds of developers we surveyed over the past few months, only a handful work this way. Most (like me) were satisfied with reading the code, seeing that tests passed, and, of course, seeing that the Baz AI Code reviewer didn’t find any bugs.

But we all know that’s NOT effective code review. Effective code review inspects intent, specifications, and requirements against outcomes. The loop has to be extended.

Agents can analyze diffs in the context of environment variables, API tokens, and pre-prod data. They can reason through “what happens next,” not just “what changed.” The result is a review that feels more like validation than inspection. The agent becomes a shared team-level resource that enforces standards while freeing engineers to build the thing.
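As a rough illustration (not a description of any particular product’s implementation), a runtime-aware review step might look like the sketch below: read the diff, resolve the PR’s preview environment, exercise an endpoint the diff touches, and bundle the observed behavior with the change for a reviewing model to judge. PREVIEW_URL, PREVIEW_TOKEN, PR_DIFF_PATH, and the /api/orders route are all assumptions.

```typescript
// A hypothetical runtime-aware review step: pair "what changed" with "what actually happens".
// PREVIEW_URL, PREVIEW_TOKEN, and PR_DIFF_PATH are assumed to be injected by the CI pipeline.
import { readFile } from 'node:fs/promises';

interface RuntimeObservation {
  status: number;
  serverTiming: string | null;
  bodySnippet: string;
}

// Exercise an endpoint touched by the diff and capture what the preview environment actually does.
async function observeEndpoint(previewUrl: string, path: string): Promise<RuntimeObservation> {
  const response = await fetch(`${previewUrl}${path}`, {
    headers: { Authorization: `Bearer ${process.env.PREVIEW_TOKEN ?? ''}` },
  });
  const body = await response.text();
  return {
    status: response.status,
    serverTiming: response.headers.get('server-timing'),
    bodySnippet: body.slice(0, 500),
  };
}

async function main(): Promise<void> {
  const diff = await readFile(process.env.PR_DIFF_PATH ?? 'pr.diff', 'utf8');
  const previewUrl = process.env.PREVIEW_URL ?? 'http://localhost:3000';

  const observed = await observeEndpoint(previewUrl, '/api/orders');

  // In a real system this payload would be handed to a reviewing model; here it is just printed.
  console.log(JSON.stringify({ diffLength: diff.length, observed }, null, 2));
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

The design choice that matters is that the diff and the runtime observation travel together, so the reviewer, human or model, judges the change against its observed behavior rather than against the code alone.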

The Open Problems

There are quite a few big unsolved problems to get this right:

  • Ephemeral access – Agents struggle to reach short-lived, per-PR preview environments with inconsistent URLs and auth.
  • Credential injection – Different login mechanisms (SSO, magic links, OTP) require distinct, brittle automation flows.
  • Network boundaries – VPNs, IAP, and private ingress block agents from accessing preview environments entirely.
  • Token expiry – Short-lived JWTs or session cookies expire mid-review, breaking analysis continuity.
  • Dynamic frontends – JS-heavy apps render unpredictably in headless browsers, obscuring the true UI state.
  • Auth scope mismatch – Agents lack the right user role or feature context, reviewing the wrong version of the app.
  • CI lifecycle drift – Preview URLs become available asynchronously, causing race conditions in automated reviews (see the sketch after this list).
  • Security constraints – Preview data often includes sensitive content, complicating secure agent access and auditing.
  • Feature flag variance – Different flag states produce inconsistent runtime behavior, invalidating review results.
  • Non-determinism – Randomized or live API-driven pages prevent stable diffs or reproducible validations.
  • Observability gaps – Missing logs, traces, and metrics block agents from detecting performance or runtime issues.
  • Context fragmentation – Runtime behavior in previews isn’t linked to code intent or PR context, weakening review quality.
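To make just one of these concrete, here is a minimal sketch of how a review agent might handle CI lifecycle drift: poll until the per-PR preview URL actually responds before starting any analysis. The PREVIEW_URL variable and the timing values are assumptions, not a prescribed solution.

```typescript
// Hypothetical guard against CI lifecycle drift: wait until the preview environment is actually
// serving before the review agent starts its analysis. PREVIEW_URL is an assumed CI variable.
const previewUrl = process.env.PREVIEW_URL ?? 'http://localhost:3000';

async function waitForPreview(url: string, timeoutMs = 5 * 60_000, intervalMs = 10_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      // Any non-5xx response means the environment exists and is routable.
      const res = await fetch(url, { redirect: 'manual' });
      if (res.status < 500) return;
    } catch {
      // Connection or DNS errors just mean the environment is not up yet; keep polling.
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Preview environment at ${url} never became reachable`);
}

waitForPreview(previewUrl)
  .then(() => console.log('Preview is up; safe to start the runtime review.'))
  .catch((err) => {
    console.error(err);
    process.exit(1);
  });
```

Several of the other items, such as token expiry and credential injection, would need similar retry-and-refresh handling layered onto the same loop, which is part of what makes the list hard.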

The Road Ahead

Solving these problems means redefining code review around runtime truth — not static checks. The next generation of reviewers won’t just read code; they’ll execute it safely, observe the outcomes, and decide what’s truly safe to merge.

More on this coming soon.
