
Deploying AI agents for repository-scale tasks like bug detection, patch verification, and code review requires overcoming significant technical hurdles. Chief among them is the need to set up a dynamic execution sandbox for each repository, which is expensive and computationally heavy.
Using large language model (LLM) reasoning instead of executing code to bypass this overhead is growing in popularity, yet it often leads to unsupported inferences and hallucinations.
To improve execution-free reasoning, Meta researchers introduce "semi-formal logic," a structured prompting technique. This method requires the AI agent to fill out a logical certificate by explicitly stating premises, tracing concrete execution paths, and drawing formal conclusions before giving an answer.
The structured format forces the agent to systematically gather evidence and follow function calls before drawing conclusions. This increases LLM accuracy on coding tasks and significantly reduces errors in fault localization and codebase question answering.
For developers using LLMs in code review tasks, semi-formal reasoning enables highly reliable, execution-free semantic code analysis while significantly reducing the infrastructure cost of AI coding systems.
Agentic code reasoning
Agentic code reasoning is the ability of an AI agent to navigate files, detect dependencies, and gather context to perform deep semantic analysis on the codebase without running the code. In enterprise AI applications, this capability is essential for automated bug detection, comprehensive code review, and patch validation in complex repositories where relevant context spans multiple files.
The industry currently tackles execution-free code verification through two primary approaches. The first involves unstructured LLM evaluators that attempt to verify code, either directly or by training specialized LLMs as reward models to approximate test results. The main drawback is their reliance on unstructured reasoning, which allows models to make credible claims about code behavior without explicit justification. Without structured constraints, it is difficult to ensure that agents reason thoroughly rather than making inferences based on superficial patterns such as function names.
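The risk of name-based inference can be made concrete with a small, deliberately misleading example (hypothetical code, not from the paper): an unstructured evaluator that trusts a function's name will predict the wrong behavior, while actually reading the body gives the right answer.

```python
# Why name-based inference is unsafe: this function's name promises more
# than its body delivers (a deliberately misleading, hypothetical example).

def validate_email(s: str) -> bool:
    # Despite the name, this only checks that the string is non-empty.
    return len(s) > 0

# An agent guessing from the name alone would predict False here;
# actually reading the body shows the call returns True.
print(validate_email("not-an-email"))  # True
```

This is exactly the failure mode that structured evidence-gathering is meant to prevent: the agent must cite the function's definition, not its name.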
The second approach involves formal verification, which translates code or logic into formal mathematical languages such as Lean, Coq, or Datalog to enable automated proof checking. While rigorous, formal methods require defining the semantics of the programming language, which is impractical for arbitrary enterprise codebases that span multiple frameworks and languages.
Existing approaches are also highly fragmented and task-specific, often requiring completely different architectures or specialized training for each new problem domain. They lack the flexibility needed for broad, multi-purpose enterprise applications.
How does semi-formal logic work?
To bridge the gap between unstructured conjecture and fully rigorous mathematical proofs, Meta researchers propose a structured reasoning method, which they call "semi-formal logic". This approach equips LLM agents with task-specific, structured logic templates.
These templates act as mandatory logical certificates. To complete a task, the agent must clearly state premises, trace execution paths for specific tests, and draw formal conclusions based only on verifiable evidence.
The template forces the agent to gather evidence from the codebase before making a decision. The agent must actually follow function calls and data flows step-by-step rather than predicting their behavior based on surface-level naming conventions. This systematic evidence gathering helps the agent handle complex cases such as confusing function names, and avoid making unsupported claims.
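The certificate idea can be sketched in a few lines of Python. The section names below are illustrative assumptions, not the paper's exact schema; the point is that an answer is only accepted once every required part of the certificate has been filled in.

```python
# Sketch of a "logical certificate" enforced as a structured template.
# Section names are illustrative, not the researchers' published schema.
CERTIFICATE_SECTIONS = ["premises", "execution_trace", "conclusion"]

def validate_certificate(cert: dict) -> list:
    """Return the required sections that are missing or empty."""
    return [s for s in CERTIFICATE_SECTIONS if not cert.get(s)]

# An answer whose certificate lacks an execution trace is rejected:
incomplete = {
    "premises": ["utils.py:12 defines a module-level format()"],
    "conclusion": "patches equivalent",
}
print(validate_certificate(incomplete))  # ['execution_trace']
```

A harness like this turns "explain your thinking" into a checkable contract: the agent cannot reach a conclusion without first producing premises and a trace.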
Semi-formal logic in action
The researchers evaluated semi-formal reasoning on three software engineering tasks: patch equivalence verification to determine whether two patches produce identical test results without running them, fault localization to pinpoint the exact lines of code causing bugs, and code question answering to test subtle semantic understanding of complex codebases. The experiments used Claude Opus 4.5 and Claude Sonnet 4.5 models acting as autonomous verifier agents.
The team compared their structured semi-formal approach to several baselines, including standard reasoning, where an agentic model is given minimal prompting and allowed to freely explain its thinking in unstructured natural language. They also compared it to traditional text-similarity algorithms like Python's difflib.
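The weakness of a text-similarity baseline is easy to demonstrate with the standard-library `difflib` module (a generic illustration, not the researchers' benchmark code): two patches can be nearly identical as text while behaving differently at runtime.

```python
import difflib

# Two hypothetical patches that differ by a single character:
# integer division vs. float division.
patch_a = "def half(x):\n    return x // 2\n"
patch_b = "def half(x):\n    return x / 2\n"

similarity = difflib.SequenceMatcher(None, patch_a, patch_b).ratio()

# Textual similarity is very high, yet the patches are not equivalent:
# half(3) is 1 under patch_a but 1.5 under patch_b.
print(similarity > 0.9)  # True
```

A similarity-based verifier would call these patches equivalent; a reasoning agent that actually traces the division operator would not.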
In patch equivalence verification, semi-formal reasoning increased accuracy on challenging, curated examples from 78% with standard reasoning to 88%. When evaluating real-world, agent-generated patches with available test specifications, the Claude Opus 4.5 model using semi-formal reasoning achieved 93% validation accuracy, outperforming both the unstructured single-shot baseline at 86% and the difflib baseline at 73%. The other tasks showed similar gains across the board.
The paper highlights the value of semi-formal logic through real-world examples. In one case, the agent evaluates two patches in the Python Django repository that attempt to fix a bug with two-digit year formatting for years before 1000 CE. One of the patches relies on a custom format() function defined within the library, which shadows Python's standard built-in function of the same name.
With standard reasoning, models look at these patches, assume that format() refers to Python's built-in function, calculate that both approaches will produce the same string output, and incorrectly declare the patches equivalent.
With semi-formal logic, the agent traces the execution path and checks method definitions. Following the structured template, the agent discovers that within one of the library's files, the name format() is actually shadowed by a custom, module-level function. The agent formally shows that, given the characteristics of the input, one patch will crash while the other will succeed.
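A minimal, self-contained reproduction of the shadowing pitfall (hypothetical code, not Django's actual implementation) shows why tracing to the definition matters:

```python
# Minimal reproduction of the shadowing pitfall (not Django's actual code):
# a module-level helper named format() hides Python's builtin of the same name.

def format(value, spec):  # shadows the builtin within this module
    if value < 1000:
        raise ValueError("years before 1000 CE are not supported")
    return str(value)[-2:]

def two_digit_year(year):
    # Reading only this call site, format(year, "d") looks like the builtin
    # and appears safe for any year; tracing to the definition above shows
    # it raises for years before 1000 CE.
    return format(year, "d")

print(two_digit_year(2024))  # 24
```

An agent that predicts behavior from the call site alone would judge `two_digit_year(999)` safe; an agent forced to trace the name to its definition discovers the crash.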
Based on their experiments, the researchers suggest that "LLMs can perform meaningful semantic code analysis without execution, potentially reducing validation costs in RL training pipelines by avoiding expensive sandbox execution."
Caveats and considerations
While semi-formal logic provides substantial reliability improvements, enterprise developers should consider several practical caveats before adopting it. There is an obvious compute and latency tradeoff: semi-formal logic requires more API calls and tokens. In patch equivalence verification, it takes approximately 2.8 times more agent steps than standard unstructured reasoning.
The technique also does not universally improve performance, especially if a model is already highly proficient at a specific task. When researchers evaluated the Claude Sonnet 4.5 model on the code question-answering benchmark, standard unstructured reasoning already achieved a high accuracy of about 85%, and applying the semi-formal template yielded no additional benefit.
Furthermore, structured reasoning can lead to overly confident wrong answers. Because the agent is forced to build detailed, formal evidence chains, it may become overconfident when its investigation is deep but incomplete. In one Python evaluation, the agent carefully explored five different functions to uncover a valid edge case, but completely missed that a downstream piece of code already safely handled that exact scenario. Because it had built a strong evidence chain, it reached a false conclusion with high confidence.
The system’s reliance on hard evidence also breaks down when it hits the limitations of the codebase. When analyzing third-party libraries where the underlying source code is unavailable, the agent will still resort to inferring behavior based on function names.
And in some cases, despite explicit instructions, models still fail to fully trace concrete execution paths.
Ultimately, while semi-formal logic significantly reduces unstructured guesses and hallucinations, it does not eliminate them completely.
What it means for developers
This technique can be used out of the box, requiring no model training or special tooling. It is execution-free, meaning you do not need to add sandboxes or other tools to your LLM environment. The tradeoff is that higher accuracy in code review tasks comes at the cost of more computation at inference time.
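Because the technique is purely prompt-based, adopting it can be as simple as prepending a template to an existing LLM call. The sketch below is a hypothetical illustration: `build_review_prompt()` and the certificate wording are assumptions, not the researchers' published templates.

```python
# Hypothetical sketch of attaching a semi-formal template to an existing
# LLM call; build_review_prompt() and the section wording are assumptions,
# not the researchers' published templates.
SEMI_FORMAL_TEMPLATE = """Before answering, fill out this certificate:
1. Premises: facts gathered from the repository, cited as file:line.
2. Execution trace: the step-by-step path of the relevant calls.
3. Conclusion: a verdict justified only by the premises and trace.
"""

def build_review_prompt(task: str, code: str) -> str:
    return f"{SEMI_FORMAL_TEMPLATE}\nTask: {task}\n\nCode under review:\n{code}"

prompt = build_review_prompt("Are these two patches equivalent?", "<patch text>")
print("Premises" in prompt and "Task:" in prompt)  # True
```

The resulting string would be passed to whatever LLM client you already use; the structure of the certificate, not the client, is what changes the model's behavior.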
The researchers suggest that structured agentic logic "can provide a flexible alternative to classical static analysis tools: instead of encoding analysis logic in specialized algorithms, we can prompt LLM agents with task-specific logic templates that generalize across languages and frameworks."
The researchers have made their prompt templates available, making them easy to integrate into your applications. While there is a lot of talk about the demise of prompt engineering, this technique shows how much performance you can still extract from a well-structured prompt.