Anthropic and OpenAI just exposed SAST's structural blind spot with free tools

OpenAI launched Codex Security on March 6, entering the application security market that Anthropic had disrupted with Claude Code Security 14 days earlier. Both scanners use LLM reasoning instead of pattern matching. Both proved that traditional static application security testing (SAST) tools are structurally blind to entire vulnerability classes. The enterprise security stack is stuck in the middle.

Anthropic and OpenAI independently released logic-based vulnerability scanners, and both found bug classes that pattern-matching SAST was never designed to detect. Competitive pressure between two laboratories with a combined private-market valuation of more than $1.1 trillion means these tools will improve faster than any single security vendor could manage alone.

Neither Claude Code Security nor Codex Security replaces your existing stack, but both tools permanently change the purchasing math. Right now, both are free for enterprise customers. Before the board asks which scanner you’re running and why, you need a head-to-head comparison and the seven actions below.

How Anthropic and OpenAI reach the same conclusion from different architectures

Anthropic published its zero-day research on February 5 with the release of Claude Opus 4.6. Anthropic said Claude Opus 4.6 found more than 500 previously unknown high-severity vulnerabilities in production open-source codebases that had survived decades of expert review and millions of hours of fuzzing.

In the cgif library, Claude discovered a heap buffer overflow by reasoning about the LZW compression algorithm, a flaw that coverage-directed fuzzing could not catch even with 100% code coverage. Anthropic shipped Claude Code Security as a limited research preview on February 20, available to Enterprise and Team customers, with free instant access for open-source maintainers. Anthropic created Claude Code Security to make defensive capabilities more widely available, Gabby Curtis, Anthropic’s head of communications, told VentureBeat in an exclusive interview.

OpenAI’s numbers come from a different architecture and a broader scanning surface. Codex Security evolved from Aardvark, an internal tool powered by GPT-5 that entered private beta in 2025. During the Codex Security beta period, OpenAI’s agent scanned more than 1.2 million commits in external repositories, producing what OpenAI said were 792 critical findings and 10,561 high-severity findings. OpenAI reported vulnerabilities in OpenSSH, GnuTLS, GOGS, Thorium, libssh, PHP, and Chromium, resulting in 14 CVEs being assigned. According to OpenAI, Codex Security’s false positive rates dropped by more than 50% across all repositories during the beta, and over-reported severities fell by more than 90%.

Checkmarx Zero researchers demonstrated that moderately complex vulnerabilities sometimes escape detection by Claude Code Security, and that developers can trick the agent into ignoring vulnerable code. In a scan of a full production-grade codebase, Checkmarx Zero found that Claude identified eight vulnerabilities, but only two were true positives. If modest complexity or ambiguity causes the scanner to fail, the effective detection ceiling is lower than the headline numbers suggest. Neither Anthropic nor OpenAI has submitted its detection claims to independent third-party audit. Security leaders should treat the reported numbers as indicative, not audited.

Merritt Baer, CSO at Enkrypt AI and former deputy CISO at AWS, told VentureBeat that the competitive scanner race compresses the window for everyone. Baer advised security teams to prioritize patches based on exploitability in their runtime context rather than CVSS scores alone, shorten the window between discovery, triage, and patching, and maintain software bill of materials (SBOM) visibility so they immediately know where a vulnerable component runs.
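Baer's prioritization advice can be sketched in a few lines. This is a hypothetical illustration, not any vendor's actual scoring model: the field names, weights, and discount factors are assumptions chosen to show how runtime exploitability can outrank raw CVSS.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    cve: str
    cvss: float           # CVSS base score, 0-10
    reachable: bool       # component sits on a live runtime code path
    exploit_public: bool  # public exploit or proof-of-concept exists

def priority(f: Finding) -> float:
    """Weight CVSS by runtime exploitability (illustrative weights):
    a reachable medium with a public exploit outranks an unreachable critical."""
    score = f.cvss
    score *= 1.5 if f.exploit_public else 1.0
    score *= 1.0 if f.reachable else 0.2  # heavy discount for unreachable code
    return score

findings = [
    Finding("CVE-A", cvss=9.8, reachable=False, exploit_public=False),
    Finding("CVE-B", cvss=6.5, reachable=True,  exploit_public=True),
]
queue = sorted(findings, key=priority, reverse=True)
# CVE-B (reachable, exploited, CVSS 6.5) lands ahead of CVE-A (CVSS 9.8)
```

A CVSS-only queue would invert that ordering, which is exactly the failure mode Baer warns about.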

Different methods, with almost no overlap in the codebases they scanned, yet the conclusion is the same: pattern-matching SAST has a ceiling, and LLM reasoning detects beyond it. When two competing laboratories deliver that capability at the same time, the dual-use math becomes uncomfortable. Any financial institution or fintech running a commercial codebase should assume that if Claude Code Security and Codex Security can find these bugs, adversaries with API access can find them too.

Baer put it bluntly: open-source vulnerabilities revealed by reasoning models should be treated closer to zero-day-class discoveries, not backlogged items. The window between discovery and exploitation has just narrowed, and most vulnerability management programs still prioritize on CVSS alone.

What the vendor responses reveal

Snyk, a developer security platform used by engineering teams to find and fix vulnerabilities in code and open-source dependencies, acknowledged the technical breakthrough but argued that finding vulnerabilities has never been the hard part. Fixing them at scale, across hundreds of repositories, without breaking anything: that is the hurdle. Snyk also pointed to research, including Veracode’s 2025 GenAI Code Security Report, finding that AI-generated code is 2.74 times more likely to introduce security vulnerabilities than human-written code. The same models that find hundreds of zero-days also introduce new vulnerability classes when writing code.

Cycode CTO Ronen Slavin wrote that Claude Code Security represents a real technological advance in static analysis, but AI models are probabilistic by nature. Slavin argued that security teams need consistent, reproducible, audit-grade results, and that scanning capabilities embedded in IDEs are useful but do not constitute infrastructure. Slavin’s position: SAST is a broad discipline, and free scanning does not displace platforms that handle governance, pipeline integrity, and runtime behavior at enterprise scale.

“If code reasoning scanners from major AI labs are effectively free for enterprise customers, then static code scanning becomes commoditized overnight,” Baer told VentureBeat. Over the next 12 months, Baer expects budgets to focus on three areas:

  1. Runtime and exploitability layers, which include runtime security and attack path analysis.

  2. AI governance and model security, including guardrails, prompt injection protection, and agent inspection.

  3. Remediation automation. “The net effect is that AppSec spend probably doesn’t go down, but the center of gravity shifts away from traditional SAST licenses and toward tooling that shortens remediation cycles,” Baer said.

Seven things to do before your next board meeting

  1. Run both scanners against a representative codebase subset. Compare Claude Code Security and Codex Security findings against your existing SAST outputs. Start with a single representative repository, not your entire codebase. Both tools are in research preview with access constraints that make full-asset scanning premature. The delta is your blind-spot inventory.

  2. Build governance structures before the pilot, not after. Baer told VentureBeat to treat either tool like a new data processor for the crown jewels, which is your source code. Baer’s governance model includes a formal data-processing agreement with clear statements on training exclusions, data retention, and subprocessor usage; a segmented submission pipeline so that only the repos you intend to scan are exposed; and an internal classification policy that separates code that can leave your boundary from code that cannot. In interviews with more than 40 CISOs, VentureBeat found that formal governance frameworks for logic-based scanning tools barely exist yet. Baer identified derivative IP as a blind spot that most teams have not addressed: can model providers retain embeddings or reasoning traces, and are those artifacts considered your intellectual property? The second blind spot is data residency for code, which historically was not regulated like customer data but increasingly falls under export controls and national security review.

  3. Map what neither tool covers. Software composition analysis. Container scanning. Infrastructure-as-code. DAST. Runtime detection and response. Claude Code Security and Codex Security work at the code-reasoning layer; your existing stack handles everything else. The pricing power of that stack has changed.

  4. Quantify dual-use exposure. Every zero-day Anthropic and OpenAI disclosed resides in an open-source project on which enterprise applications depend. Both labs are disclosing and patching responsibly, but the window between their discovery and your adoption of those patches is exactly where attackers operate. AI security startup AISLE independently discovered all 12 zero-day vulnerabilities in OpenSSL’s January 2026 security patch, including a stack buffer overflow (CVE-2025-15467) that is potentially remotely exploitable without valid key material. Fuzzers ran against OpenSSL for years and missed every single one. Assume adversaries are running the same models against the same codebases.

  5. Prepare a board comparison before they ask. Claude Code Security evaluates code contextually, traces data flows, and uses multi-step self-verification. Codex Security builds a project-specific threat model before scanning and validates findings in a sandboxed environment. Both tools are in research preview and require human approval before any patches are applied. The board needs a comparative analysis, not a one-vendor pitch. When the conversation turns to what your existing suite didn’t find, Baer offers framing that works at the board level. Baer told VentureBeat that pattern-matching SAST solved a generation of problems: it was designed to detect known anti-patterns, and that capability still matters and still reduces risk. But reasoning models can evaluate multi-file logic, state transitions, and developer intent, which is where many modern bugs reside. Baer’s board-ready summary: “We bought the right tools for the threats of the last decade; the technology has just advanced.”

  6. Track the competitive cycle. Both companies are headed toward IPOs, and enterprise security wins further the growth story. When one scanner misses a blind spot, it lands on the other lab’s feature roadmap within weeks. Both labs ship model updates on monthly cycles, a rhythm that will outpace any traditional vendor’s release calendar. Baer said running both is the right move: “Different models provide different logic, and the delta between them can reveal bugs that neither tool would catch alone. In the short term, using both is not redundant. It’s a defense through the diversity of logic systems.”

  7. Set a 30-day pilot window. This test did not exist before February 20. Run Claude Code Security and Codex Security against the same codebase and let the delta drive the procurement conversation with empirical data instead of vendor marketing. Thirty days gives you that data.
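The delta comparison in steps 1 and 7 reduces to set arithmetic once each tool's findings are normalized to a common key. This is a minimal sketch with invented file paths and CWE labels; real reports would need parsing into these `(file, weakness)` pairs first.

```python
# Hypothetical normalized findings: (file, CWE) pairs per scanner.
sast_findings = {("src/auth.c", "CWE-89"), ("src/api.c", "CWE-79")}
claude_findings = {("src/auth.c", "CWE-89"), ("src/lzw.c", "CWE-122")}
codex_findings = {("src/lzw.c", "CWE-122"), ("src/session.c", "CWE-287")}

llm_findings = claude_findings | codex_findings  # union of reasoning scanners
blind_spots = llm_findings - sast_findings       # missed by legacy SAST
model_delta = claude_findings ^ codex_findings   # caught by only one model

print(f"SAST blind spots: {sorted(blind_spots)}")
print(f"Model delta:      {sorted(model_delta)}")
```

The `blind_spots` set is the inventory to bring to the board; the `model_delta` set is the evidence for Baer's point that running both scanners is diversity, not redundancy.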

Anthropic and OpenAI launched their scanners fourteen days apart. The gap before the next releases will be shorter. Attackers are watching the same calendar.
