Model vs. SAST

Internal Baz research on how newer cyber-capable models are changing AppSec detection for pull-request review.

May 18, 2026
Guy Eisenkot
Over the past several months, we have been comparing two different layers of Baz security review. The first is the Basic Security reviewer, which runs on previous-generation models and behaves like a high-recall AI SAST layer. It is broad, consistent, and effective at common AppSec failure modes that can be represented as known patterns, framework anti-patterns, or relatively direct source-to-sink flows. It is also generally architected like a scanner (inspired by our past work on Checkov).

The second is the Advanced Security agent, which runs on the latest class of models, including GPT-5.5-class and Opus 4.7-class systems. It is architected as an agent harness, which allows it to selectively go deeper on specific issue types after exploring multiple exploit options.

Today, our Basic reviewer still carries much of the scaled, high-severity detection load, while the Advanced agent is more additive in categories where the bug depends on semantic reasoning across trust boundaries, object ownership, provenance, canonicalization, and deployment context.

We are trying to evaluate how stronger models impact developers on a day-to-day basis. Unlike a full-codebase scan (stay tuned...), a PR scan needs to detect the kind of vulnerability that should be prevented under the current threat model. Progress in the advanced models should, over time, erode our need for a basic model (and for the legacy toolchain: SAST, DAST, IAST, and so on), because the advanced model will account for the real engineering risks that the app's current posture mandates.

Sneak peek: expansive models are not always better

Authorization and tenant isolation

The Basic Security reviewer, running previous-generation models, is strong at broad authorization coverage: routes without guards, missing auth middleware, unprotected endpoints, tenant-scoped handlers that lose tenant context, credentials APIs exposed without an obvious access check, and mutations that perform sensitive actions without an expected policy gate. These are classic high-value AppSec detections, and the historical data shows Basic remains the stronger scaled detector for many of them.

The Advanced Security agent, running latest-class models, shows up when the code appears to contain a control but the control does not enforce the intended relationship between caller and object. In those cases, the agent has to infer the subject, object, action, and context from controller structure, route naming, ORM predicates, resolver shape, neighboring checks, and product-specific concepts like tenant, account, project, study, or role. The most powerful Advanced finding is “this request is authenticated, but the resource being fetched is not proven to belong to the requesting account,” or “this state transition is guarded, but not by the policy that constrains this object.”
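To make the "authenticated but not proven to belong to the requesting account" pattern concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the Caller and Project types, the in-memory PROJECTS store standing in for an ORM table, and both lookup functions. It is not taken from any reviewed codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    user_id: str
    account_id: str


@dataclass(frozen=True)
class Project:
    project_id: str
    account_id: str


# Hypothetical in-memory store standing in for an ORM table.
PROJECTS = {
    "p1": Project("p1", account_id="acct-a"),
    "p2": Project("p2", account_id="acct-b"),
}


def get_project_unscoped(caller: Caller, project_id: str) -> Project:
    """Caller is authenticated, but the lookup never ties the object to the caller."""
    return PROJECTS[project_id]


def get_project_scoped(caller: Caller, project_id: str) -> Project:
    """Same lookup, with the caller-to-object relationship enforced."""
    project = PROJECTS.get(project_id)
    # Treat "missing" and "not yours" identically so existence is not leaked.
    if project is None or project.account_id != caller.account_id:
        raise PermissionError("project not found for this account")
    return project
```

Both functions would sit behind the same authentication guard, so a guard-presence check sees no difference between them. Only the second constrains the object to the caller's account.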

CI/CD injection and supply-chain execution

CI/CD files are effectively production code with deployment authority, but they are often edited with less security scrutiny than application code.

Our Basic reviewer detects CI/CD issues in the way a strong previous-generation AI SAST system should: it catches untrusted interpolation in workflow run: steps, risky use of event fields, shell injection patterns in automation, unsafe secret handling, and common build-pipeline anti-patterns.

The Advanced Security agent becomes more interesting when the finding requires provenance reasoning rather than just interpolation detection. In the internal data, the more advanced class of finding is about whether a build job is executing something whose origin, mutability, and integrity are actually trusted.

In one case a script fetched from object storage is not inherently a vulnerability; it becomes one when the object can change outside the reviewed code path, is executed without integrity verification, and runs inside a job with cloud credentials or deployment privileges.
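One common mitigation for this case is to pin a digest of the script in the reviewed repository and verify it before execution. The sketch below assumes a SHA-256 pin; the function name verify_script and its parameters are hypothetical, and fetching the script from storage is deliberately left out.

```python
import hashlib
import hmac


def verify_script(content: bytes, pinned_sha256: str) -> bytes:
    """Return the script bytes only if they match the digest pinned in reviewed code."""
    digest = hashlib.sha256(content).hexdigest()
    # Constant-time comparison; the pinned value is expected to be a hex string.
    if not hmac.compare_digest(digest, pinned_sha256.lower()):
        raise ValueError("script digest does not match the pinned value")
    return content
```

With a pin like this, a change to the stored object has to show up as a change to the pinned digest, which brings it back into the reviewed code path.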

In another case a workflow input is not inherently dangerous; it becomes dangerous when it crosses into shell execution, Kubernetes operations, secret creation, or deployment selection without allowlisting, quoting, or isolation.

Currently, Advanced is better positioned to reason about the build plane as a taint system: workflow inputs, branch names, artifact names, storage keys, generated scripts, and deployment metadata become sources, while shells, cloud CLIs, package managers, kubectl, and release automation become sinks.
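As a rough illustration of the source side of that taint model, the following Python sketch flags GitHub Actions expressions from attacker-influenced contexts inside a run script. The context list is an assumption for illustration and is not exhaustive, and the function name tainted_expressions is hypothetical. This is the shallow interpolation check, not the provenance reasoning described above.

```python
import re

# Assumed, non-exhaustive list of contexts an external contributor can influence.
UNTRUSTED_CONTEXTS = (
    "github.event.pull_request.title",
    "github.event.pull_request.body",
    "github.event.issue.title",
    "github.head_ref",
    "inputs.",
)

# Matches ${{ ... }} expressions and captures the inner expression text.
EXPRESSION = re.compile(r"\$\{\{\s*(.+?)\s*\}\}")


def tainted_expressions(run_script: str) -> list[str]:
    """Return expressions in a run script that reference an untrusted context."""
    found = []
    for match in EXPRESSION.finditer(run_script):
        expr = match.group(1)
        if any(ctx in expr for ctx in UNTRUSTED_CONTEXTS):
            found.append(expr)
    return found
```

A check like this treats every sink the same. Deciding whether the interpolated value actually reaches a shell, kubectl, or a deployment selector without allowlisting or quoting is the part that requires reasoning about the whole job.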

Path traversal and filesystem boundary failure

Path traversal is one of the cleaner examples of model improvement translating into detection depth.

The Basic reviewer is useful for the conventional SAST cases: user-controlled filename material flows into a filesystem operation, path normalization is missing, a string join creates a path under an assumed root, or a request parameter influences a read or write location.

The Advanced Security agent is more useful when the vulnerable path is indirect and the local code contains something that looks like validation. The long-form version of the bug is often that the code validates one representation of a path, but the filesystem later operates on another. Another variant is a manifest entry, generated metadata file, archive member, object-storage key, or symlink that influences the final path after the application believes it has constrained the operation to a safe root.

In those cases, a shallow detector may see canonicalization or a helper function and treat the flow as safe. A latest-class model can sometimes reason through the order of operations: where the path originates, whether the manifest or metadata is trusted, when normalization occurs, whether symlinks are resolved before or after validation, and whether the final file operation uses the same resolved path that was checked.
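The order-of-operations point can be sketched in Python. The two functions below are hypothetical helpers written for illustration: the first validates only the lexical path, while the second resolves symlinks first and returns the same resolved path it checked. The sketch assumes a POSIX filesystem.

```python
import os


def naive_is_inside(root: str, user_path: str) -> bool:
    """Lexical check only: '..' is collapsed, but symlinks are not resolved."""
    candidate = os.path.abspath(os.path.join(root, user_path))
    return candidate.startswith(os.path.abspath(root) + os.sep)


def resolve_inside(root: str, user_path: str) -> str:
    """Resolve symlinks, validate the resolved path, and return that same path."""
    real_root = os.path.realpath(root)
    real_target = os.path.realpath(os.path.join(real_root, user_path))
    if os.path.commonpath([real_root, real_target]) != real_root:
        raise ValueError("path escapes the allowed root")
    return real_target
```

If the root contains a symlink that points outside it, naive_is_inside accepts a path through that link while resolve_inside rejects it. Even the second version leaves a window between the check and the later file operation, so it is a sketch of the ordering rather than a complete defense.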

IaC, cloud policy, and production configuration

Cloud security shows the same split between broad pattern detection and policy-failure reasoning.

Our Basic reviewer is strong at the familiar misconfiguration layer: broad IAM wildcards, permissive WAF defaults, hardcoded cloud identifiers, exposed internal tools, insecure deployment settings, and configuration that obviously weakens a control. That kind of coverage remains valuable because many severe production risks are still ordinary configuration mistakes.
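As one example of that familiar layer, a wildcard check over a policy document can be sketched in a few lines of Python. The sketch assumes an AWS-style IAM policy already parsed into a dict; the function name wildcard_statements is hypothetical and the rule is deliberately simplistic.

```python
def wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements that pair a wildcard action with Resource '*'."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # a single statement may appear as a bare object
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Both fields may be a single string or a list of strings.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if any(a == "*" or a.endswith(":*") for a in actions) and "*" in resources:
            flagged.append(stmt)
    return flagged
```

A rule like this only sees a single bad attribute. It has no notion of what the role is for, which is where the contextual reasoning in this category starts.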

The Advanced Security agent adds value when the issue is not visible as a single bad attribute but as a mismatch between the declared configuration and the security role of the component.

Our findings indicate the agent infers intended policy from context: resource names, environment names, module structure, deployment target, consumer service, and the surrounding infrastructure.

The caveat is that remediation in this category is often structurally harder, because fixes can require platform ownership, infra review, rollout planning, or shared-module changes, which affects acceptance independently of detection quality.

Secrets, credentials, and sensitive data exposure

Secrets and sensitive-data exposure remain a high-volume category where the Basic reviewer continues to do useful work. Previous-generation models are sufficient for many direct detections: database URLs in manifests, hardcoded credentials, sensitive values in config, tokens or user objects in logs, JWTs stored in browser-readable locations, and credential material appearing in deployment files or scripts.

The Advanced Security agent is more differentiated when the issue is not the literal presence of a secret but the movement of sensitive data across a boundary where it should not appear. The reviewer has to decide whether a field is sensitive in context, whether the recipient is inside or outside the trust boundary, and whether the exposure changes attacker capability.

In other words, this is closer to sensitive-data flow analysis than to secret detection. Basic still finds the obvious leaked sources and unsafe storage patterns, while Advanced can sometimes identify that an apparently ordinary field, response shape, log object, or generated value is security-relevant because of where it flows.

SSRF, open redirect, and unsafe network access

SSRF and open redirect are categories where the current evidence still favors broad, mature detection. Our Basic reviewer is well aligned to the recurring patterns: unvalidated proxy URLs, webhook destinations built from weakly trusted values, user-controlled redirect targets, server-side fetches influenced by input, environment-derived internal URLs, and network clients that cross from external control into internal access. These are classic AppSec flows, and tuned high-recall review still matters.

The Advanced Security agent’s expected advantage would be in the harder cases, where a destination is validated in one representation but that is not the representation ultimately fetched.

In the current data, that advantage is not yet clearly established. This is an important negative finding. Not every AppSec category improves immediately with stronger models. Some categories still benefit more from mature rules, broad framework coverage, and known sink modeling than from deeper semantic reasoning.

For now, SSRF remains primarily covered by the high-recall layer, with the Advanced agent used to inspect complex flows where destination trust, URL parsing, or internal network reachability is ambiguous.
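For reference, a destination-trust check of the kind these flows depend on might look like the Python sketch below. The function names and the injectable resolver parameter are assumptions made for illustration and testability. The sketch validates the addresses a hostname resolves to rather than the URL string.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolve_host(host: str) -> list[str]:
    """Default resolver: every address the host currently resolves to."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_allowed_destination(url: str, resolver=resolve_host) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to global addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = resolver(parts.hostname)
        # Every resolved address must be globally routable, not just the first one.
        return bool(addresses) and all(
            ipaddress.ip_address(a).is_global for a in addresses
        )
    except (OSError, ValueError):
        return False
```

A check like this is only meaningful if the later fetch connects to the same address that was validated. If the HTTP client resolves the hostname again, the validated and fetched destinations can differ.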

Injection across runtime and build-time contexts

Our Basic reviewer is effective at recognizable SQL, shell, command, and workflow-injection patterns, especially when a value visibly reaches an interpreter boundary. By contrast, the Advanced Security agent is more useful when the interpreter boundary is implicit or spread across configuration, scripts, and generated commands.

The detection task is source-to-sink taint analysis with interpreter awareness: identify the attacker-controlled or weakly trusted source, identify the interpreter or privileged execution sink, and determine whether the value is constrained by allowlisting, parameterization, structured APIs, quoting, or isolation appropriate for that interpreter.
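A small runtime example of "constrained appropriately for that interpreter" is SQL parameterization. The Python sketch below uses the standard-library sqlite3 module; the users table and both function names are hypothetical and exist only for illustration.

```python
import sqlite3


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list[tuple]:
    """Source reaches the SQL interpreter as query text: injectable."""
    return conn.execute(f"SELECT id FROM users WHERE name = '{name}'").fetchall()


def find_user_parameterized(conn: sqlite3.Connection, name: str) -> list[tuple]:
    """Same source and sink, but the value is bound as data, not parsed as SQL."""
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()
```

The source and the sink are identical in both functions. What differs is whether the value crosses the interpreter boundary as code or as data, which is the distinction the analysis has to explain.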

Net-net, the latest-class models appear most useful when they can explain the boundary being crossed rather than merely flagging that a string is used in a command.

Operational conclusion

Our data does not support treating the latest-class Advanced Security agent as a universal replacement for the Basic Security reviewer or, more narrowly, for SAST alternatives.

Our Basic Security reviewer remains the high-recall AppSec layer, especially for mature high-severity categories where previous-generation models, tuned heuristics, and broad coverage still produce strong results.

Final results

We will continue to evaluate the Advanced agent as an incremental semantic layer. Its value is in unique validated findings that the Basic reviewer misses, especially in categories involving object ownership, provenance, canonicalization, build-plane taint flow, policy intent, and deployment trust.

We are specifically interested in validating the unique, differentiated, severity-weighted findings added by the latest-class agent per reviewed pull request. On that basis, the current evidence shows a real capability shift, but still a narrow one.
