
As artificial intelligence reshapes software development, one small startup is betting that the industry’s next big hurdle won’t be writing code — it will be trusting it.
Theorem, a San Francisco-based company that emerged from Y Combinator’s Spring 2025 batch, announced on Tuesday that it has raised $6 million in seed funding to build automated tools that verify the correctness of AI-generated software. Khosla Ventures led the round, with participation from Y Combinator, E14, SAIF, Halcyon, and angel investors including Recursion Pharmaceuticals co-founder Blake Borgeson and Arthur Breitman, co-founder of blockchain platform Tezos.
The investment comes at a crucial moment. AI coding assistants at companies like GitHub, Amazon, and Google now produce billions of lines of code annually. Enterprise adoption is accelerating. But the ability to verify that AI-written software actually works as intended has not kept pace — a widening gap, as Theorem’s founders describe it, that threatens critical infrastructure, from financial systems to the power grid.
"We are already there," Jason Gross, co-founder of Theorem, said when asked whether AI-generated code was surpassing humans’ ability to review it. "If you asked me to review 60,000 lines of code, I wouldn’t know how to do it."
Why AI is writing code faster than humans can check it
Theorem’s core technology combines formal verification — a mathematical technique that proves software behaves exactly as specified — with AI models trained to generate and check proofs automatically. This approach turns a process that historically required years of PhD-level engineering into one the company claims can be completed in days or weeks.
Formal verification has existed for decades but is limited to the most mission-critical applications: avionics systems, nuclear reactor control, and cryptographic protocols. The prohibitive cost of the technique—often requiring eight lines of mathematical proof for each line of code—made it impractical for mainstream software development.
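Formal verification in miniature: the toy Lean 4 theorem below (an illustrative sketch, not Theorem’s tooling) proves that reversing a list twice always returns the original list — for every possible input, not just the cases a test suite happens to cover.

```lean
-- Toy example: a machine-checked proof, not a test.
-- `List.reverse_reverse` is a lemma from Lean's standard library stating
-- that reversing a list twice is the identity; applying it proves the
-- property for ALL lists of naturals, with no inputs left unexamined.
theorem reverse_twice (l : List Nat) : l.reverse.reverse = l :=
  List.reverse_reverse l
```

This is the contrast with conventional testing: a passing test checks a handful of inputs, while a checked proof rules out counterexamples entirely.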
Gross knows this firsthand. Before founding Theorem, he earned his PhD at MIT, working on the verified cryptography code that now powers the HTTPS security protocol protecting trillions of Internet connections every day. According to his estimate, the project cost fifteen person-years of labor.
"Nobody likes to have wrong code," Gross said. "Software verification has always been uneconomical. Proofs were written by PhD-level engineers. Now, AI writes it all."
How formal verification catches bugs that traditional testing misses
Theorem’s system works on a principle Gross calls "partial proof decomposition." Rather than exhaustively verifying every possible behavior — computationally infeasible for complex software — the technique allocates verification resources in proportion to the importance of each code component.
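The decomposition idea can be sketched in Lean 4. In this toy example (invented for illustration, not Theorem’s method), a property of a composed function is not proved in one monolithic step; instead a small lemma is proved about the component once, then reused to discharge the larger claim.

```lean
-- A component function and a small lemma about it.
def double (n : Nat) : Nat := 2 * n

theorem double_pos (n : Nat) (h : 0 < n) : 0 < double n := by
  unfold double
  omega

-- The composite claim follows by chaining the component lemma,
-- rather than reasoning about the whole pipeline at once.
theorem double_double_pos (n : Nat) (h : 0 < n) : 0 < double (double n) :=
  double_pos _ (double_pos _ h)
```

Breaking a proof into lemmas like this is what lets effort be budgeted per component: the critical pieces get careful lemmas, and the rest are composed from them.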
The approach recently identified a bug that testing had missed at Anthropic, the AI safety company behind the Claude chatbot. Gross said the technology helps developers "catch their bugs now without spending too much computation."
In a recent technology demonstration called SFBench, Theorem used AI to translate 1,276 problems from Rocq (a formal proof assistant, formerly known as Coq) to Lean (another verification language), then automatically proved each translation equivalent to the original. The company estimates that a human team would need about 2.7 person-years to complete the same task.
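To make the translation task concrete, here is a toy example of the shape of the problem (invented for illustration, not drawn from SFBench): a statement written for one proof assistant is re-expressed in Lean 4, and the Lean version is then proved.

```lean
-- A Rocq/Coq statement such as
--   Theorem app_nil : forall (l : list nat), l ++ nil = l.
-- rendered in Lean 4 syntax, with a proof from the standard library.
theorem app_nil (l : List Nat) : l ++ [] = l :=
  List.append_nil l
```

The hard part at SFBench’s scale is not any single statement but doing this faithfully 1,276 times and machine-checking that each translation means the same thing as the original.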
"Everyone can run agents in parallel, but we are also able to run them sequentially," Gross explained, noting that Theorem’s architecture handles interdependent code — where solutions build on each other across dozens of files — a scenario that defeats traditional AI coding agents limited by their context windows.
How one company turned a 1,500-page specification into 16,000 lines of reliable code
The startup is already working with customers in AI research labs, electronic design automation, and GPU-accelerated computing. A case study shows the practical value of the technology.
A customer came to Theorem with a 1,500-page PDF specification and an old software implementation plagued with memory leaks, crashes, and other elusive bugs. Their most pressing problem: improving performance from 10 megabits per second to 1 gigabit per second – a 100-fold increase – without introducing additional errors.
Theorem’s system generated 16,000 lines of production code, which the customer deployed without manually reviewing it. Confidence came from a compact executable specification — a few hundred lines distilling the sprawling PDF document — paired with an equivalence-checking harness that verifies the new implementation matches the intended behavior.
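The idea of checking an optimized implementation against a compact executable specification can be sketched in a few lines of Python. This is a simplified, property-test-style stand-in (all names invented; the article describes a far stronger harness, and Theorem’s actual system is proof-based rather than sampling-based):

```python
import random

def spec_parse(data: bytes) -> list[int]:
    """Executable specification: short, slow, and obviously correct."""
    return [b for b in data]

def fast_parse(data: bytes) -> list[int]:
    """Optimized implementation whose behavior must match the spec.
    (Here trivially correct; in practice this is the 16,000-line parser.)"""
    return list(data)

def equivalence_check(trials: int = 1000, seed: int = 0) -> bool:
    """Compare the two implementations on randomized inputs.

    Any divergence means the optimized code does not match the spec."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        if fast_parse(data) != spec_parse(data):
            return False  # counterexample found
    return True
```

A sampling harness like this only ever shows agreement on the inputs it tried; the article’s point is that a formal equivalence proof closes that gap for all inputs.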
"They now have a production-grade parser operating at 1 Gbps that they can deploy with confidence that no information is lost during parsing," Gross said.
Security risks hidden in AI-generated software for critical infrastructure
The funding announcement comes as policymakers and technologists increasingly scrutinize the reliability of AI systems embedded in critical infrastructure. Software already controls financial markets, medical devices, transportation networks, and electrical grids. AI is accelerating how fast software is developed – and how easily subtle bugs can spread.
Gross frames the broader challenge in security terms. As AI makes it cheaper to find and exploit vulnerabilities, defenders need what he calls "asymmetric defense" — security that scales without a proportional increase in resources.
"Software security is a delicate offense-defense balance," he said. "With AI hacking, the cost of hacking a system is falling rapidly. The only viable solution is asymmetric defense. If we want a software security solution that can last more than a few generations of model improvements, it will be through verification."
Asked whether regulators should mandate formal verification for AI-generated code in critical systems, Gross gave a clear response: "Now that formal verification is cheap enough, not using it for guarantees about critical systems could be considered gross negligence."
What differentiates Theorem from other AI code verification startups
Theorem enters a market where many startups and research labs are exploring the intersection of AI and formal verification. Gross argues that the company is differentiated by its focus on scaling software verification, rather than applying formal methods to mathematics or other domains.
"Our tools are useful for systems engineering teams working close to the metal, who need to guarantee correctness before merging changes," he said.
The founding team reflects that technical orientation. Gross brings deep expertise in programming language theory and a track record of deploying verified code into production at scale. Co-founder Rajashree Agrawal, a machine learning research engineer, focuses on training the AI models that power the verification pipeline.
"We’re working on formal program logic so that everyone can not only oversee the work of an average software-engineer-level AI, but actually use the capabilities of a Linus Torvalds-level AI," Agrawal said, referencing the famed creator of Linux.
The race to verify AI code before it takes control of everything
Theorem plans to use the funding to expand its team, increase compute resources for training verification models, and pursue new industries including robotics, renewable energy, cryptocurrency, and drug synthesis. The company currently employs four people.
The emergence of startups like Theorem signals a shift in how enterprise technology leaders may need to evaluate AI coding tools. The first wave of AI-assisted development promised productivity gains — more code, faster. Theorem is betting that the next wave will demand something different: mathematical proof that speed does not come at the expense of correctness.
Gross sets out the stakes in clear terms. AI systems are improving rapidly. If that trajectory holds, he believes superhuman software engineering is inevitable — AI capable of designing systems more complex than anything humans have ever created.
"And without fundamentally different economics of oversight," he said, "we will end up deploying systems we do not control."
Machines are writing code. Now someone will have to check their work.