New AI optimization framework beats Claude Code and Codex by 2.5x on the same compute budget

Imagine your engineering team has deployed an AI agent to search internal company documents and answer employee questions. It works perfectly in development, but in production it constantly hallucinates or misses key constraints. Fixing this is hardly a simple patch. It requires a difficult, trial-and-error process of adjusting chunking strategies, retrieval methods, and system prompts simultaneously. Because these adjustments interact, it becomes almost impossible to tell which specific change actually solved the problem.

To address this challenge, researchers at Renmin University of China and Microsoft Research introduced Arbor, a framework that elevates AI-powered research and optimization from a sequence of trial-and-error guesses to a cumulative learning process. Arbor organizes hypotheses, experiments, and insights into a tree that helps the system learn from past failures and make better, verified improvements over time.

In practical tests, Arbor delivered more than 2.5 times the verified performance gains of standard AI coding agents on real-world engineering tasks while operating under the same resource budget.

For enterprise AI teams, this translates directly into automating the continuous improvement of complex, real-world engineering systems.

Understanding the bottleneck in autonomous optimization

As large language models and AI systems become more capable, they are expected to take on more complex work, such as the autonomous optimization (AO) of software systems like agent harnesses or model training algorithms.

AO captures the fundamental loop of autonomous research. An AI agent starts with an initial mutable artifact, such as a machine learning codebase or data pipeline, and a specific objective. The agent’s goal is to iteratively improve this artifact through experimental feedback, without human supervision.

The main challenge of AO is often misunderstood. Many engineering teams believe that giving a coding agent more time or compute to optimize a codebase will automatically yield better results. "Automation can keep AI working for a very long time – but a loop does not equate to progress," Jiaji Jin, co-author of the paper, told VentureBeat. "If the goal is vague, or the metrics are easy to hack, long-term automation often produces rapid ‘improvements’ that no one really wants."

Jin points out that complex tasks take multiple attempts to get right, and standard agent architectures lack the data structures needed to maintain state across those attempts. "How do you ensure that the insights and experience from each attempt are actually accumulated rather than lost in the scrollback buffer?" he said. Without this structure, agents simply repeat the same mistakes.

Current agent systems can run experiments for many hours against well-specified targets: editing code, calling tools, and running tests autonomously. But they approach each attempt in isolation, missing the structural mechanisms that would allow them to aggregate and act on what they have learned.

They lack the ability to maintain and compare multiple competing research directions simultaneously. Without it, they cannot interpret both successes and failures to reshape their future exploration, which is the main mechanism that makes human research cumulative.

General-purpose coding agents typically rely on conversation transcripts for their memory. Since AO tasks run for hundreds of turns and easily exceed the limits of the context window, these agents struggle to preserve and reuse empirical evidence over long histories. As a result, they miss the broader structure of the research process and tend to give up after early failures or chase noisy swings in evaluation scores. What is needed is a structured, durable memory that records which directions have been attempted, what empirical evidence has been produced, and how each outcome changes the space of future hypotheses.

Current frameworks are also prone to reward hacking and overfitting to development metrics. This gives the illusion of progress without improvements that transfer to real-world performance.

Finally, general-purpose coding agents typically chain their tool calls on a single shared working tree. This architectural limitation prevents them from testing hypotheses in parallel in separate environments without corrupting the main codebase or obscuring which hypothesis led to a specific result.

The Arbor framework

Arbor solves the challenges of AO with a framework that automates the long-horizon loop of exploration, experimentation, and abstraction that characterizes human research. Arbor separates the strategic direction of research from ground-level coding tasks with two key components:

Coordinator: A long-lived AI agent that acts like a principal investigator. It never edits the target codebase directly. Instead, it owns the global state of the research, observes the accumulated evidence, comes up with new hypotheses and directions for exploration, and decides what to do with the results of the experiments.

Executors: Short-lived, highly focused AI agents. When the coordinator wants to test an idea, it spawns an executor and places it in an isolated environment, essentially a fresh Git worktree. Each executor is assigned a single hypothesis. It implements the specified idea, runs the evaluation, debugs errors, and reports back to the coordinator with the results and the resulting artifacts.
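The isolation step can be illustrated with a minimal Python sketch. It is not Arbor's actual code: the helper names, the `hyp/<id>` branch naming, and the temporary directory layout are assumptions made for illustration. The only real interface it relies on is the standard `git worktree add` and `git worktree remove` commands.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its stripped stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def create_executor_worktree(repo: Path, hypothesis_id: str, base: str = "HEAD") -> Path:
    """Check out `base` into a fresh directory on its own branch.

    The branch name `hyp/<id>` is a hypothetical convention, not Arbor's.
    """
    # git creates the final "wt" directory itself, so it must not exist yet.
    worktree = Path(tempfile.mkdtemp(prefix=f"exec-{hypothesis_id}-")) / "wt"
    git(repo, "worktree", "add", "-b", f"hyp/{hypothesis_id}", str(worktree), base)
    return worktree


def remove_executor_worktree(repo: Path, worktree: Path) -> None:
    """Discard the isolated checkout once the executor has reported back."""
    git(repo, "worktree", "remove", "--force", str(worktree))
```

Edits made inside the returned directory stay on the hypothesis branch, so the files in the main checkout do not change while the experiment runs.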

These two components cooperate through a mechanism the researchers call “hypothesis tree refinement” (HTR). HTR represents the entire research process as a persistent, branching tree where each node ties together four things: a hypothesis, an executable artifact, the empirical evidence it produced, and a distilled insight. This means that the coordinator can explore multiple competing directions at the same time without losing its place.

The coordinator builds the tree by placing broad ideas near the root, while concrete refinements branch out toward the leaves. This allows Arbor to safely explore multiple competing hypotheses simultaneously. If an executor’s experiment fails, the tree records the reason for the failure as a negative constraint, ensuring that the system does not repeat the same mistake.
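The tree described above can be sketched as a small Python data structure. This is an illustrative sketch, not the paper's implementation: the class name, field names, and the `negative_constraints` helper are all hypothetical, and the example hypotheses and scores are invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class HypothesisNode:
    """One tree node tying together hypothesis, artifact, evidence, and insight."""
    hypothesis: str
    artifact_ref: Optional[str] = None             # assumed: a Git branch or commit id
    evidence: dict = field(default_factory=dict)   # assumed: metric name -> value
    insight: Optional[str] = None
    failed: bool = False
    parent: Optional["HypothesisNode"] = field(default=None, repr=False)
    children: list["HypothesisNode"] = field(default_factory=list, repr=False)

    def refine(self, hypothesis: str) -> "HypothesisNode":
        """Add a more concrete hypothesis as a child of this node."""
        child = HypothesisNode(hypothesis=hypothesis, parent=self)
        self.children.append(child)
        return child

    def record(self, artifact_ref: str, evidence: dict, insight: str,
               failed: bool = False) -> None:
        """Store what an executor reported for this hypothesis."""
        self.artifact_ref = artifact_ref
        self.evidence = evidence
        self.insight = insight
        self.failed = failed


def negative_constraints(node: HypothesisNode) -> list[str]:
    """Collect the insights of every failed node in the subtree."""
    found = [node.insight] if node.failed and node.insight else []
    for child in node.children:
        found.extend(negative_constraints(child))
    return found


# Hypothetical example: broad ideas near the root, refinements toward the leaves.
root = HypothesisNode("Improve answer accuracy")
retrieval = root.refine("Change the retrieval method")
bfs = retrieval.refine("Use breadth-first search over linked documents")
bfs.record("hyp/bfs", {"dev_accuracy": 0.41},
           "Breadth-first search lowered accuracy", failed=True)
print(negative_constraints(root))
```

A coordinator could pass the collected list into its next planning step so that already-refuted ideas are not proposed again.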

To understand why Arbor’s isolation matters, consider a common enterprise scenario: optimizing a retrieval-augmented generation (RAG) pipeline for an internal AI assistant. "When you ask a single agent like Claude Code or Codex to ‘improve accuracy’, it will usually change a lot of things at once – chunking, prompt, retrieval method," Jin said. This confounds the changes, making it impossible to tell what actually helped. It also modifies the repository directly, without isolation.

Arbor solves this by treating each lever as a separate hypothesis. Chunking becomes one branch, retrieval another, and prompting another, each implemented and evaluated in its own separate Git worktree. "So you get clean attribution: ‘Constraint decomposition gave +X on the retrieval side; breadth-first search actually hurt,’" Jin said.

When an executor returns a report, the coordinator writes the evidence to the tree and propagates the insights back to the parent nodes. This means that a local observation becomes a generalized constraint that shapes the coordinator’s future idea generation.
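A rough sketch of this upward propagation follows, under a simplifying assumption: the insight text is copied verbatim to every ancestor. In the real system the coordinator presumably abstracts or rewrites the insight as it generalizes it; that step is not modeled here. The `Node` class and function names are hypothetical.

```python
from typing import Optional


class Node:
    """Minimal tree node holding the insights visible at this level."""

    def __init__(self, hypothesis: str, parent: Optional["Node"] = None):
        self.hypothesis = hypothesis
        self.parent = parent
        self.insights: list[str] = []


def propagate_insight(node: Node, insight: str) -> None:
    """Attach an insight to `node` and to each ancestor up to the root."""
    current: Optional[Node] = node
    while current is not None:
        if insight not in current.insights:   # avoid duplicates on repeated reports
            current.insights.append(insight)
        current = current.parent


def planning_context(node: Node) -> list[str]:
    """Insights a coordinator could consult before proposing children of `node`."""
    return list(node.insights)


# Hypothetical example values.
root = Node("Improve answer accuracy")
retrieval = Node("Change the retrieval method", parent=root)
bfs = Node("Use breadth-first search", parent=retrieval)
propagate_insight(bfs, "Breadth-first search lowered accuracy")
print(planning_context(root))
```

Because the root now carries the observation, hypotheses generated in sibling branches can take it into account, not just those under the node where it was first made.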

To prevent reward hacking or overfitting to the development data, HTR implements a strict “merge gate”. Even if an executor reports a stellar development score, the coordinator prepares a separate worktree to test the candidate against a held-out test evaluator. The artifact is only merged into the current best trunk if it clearly improves the test score, verifying that the progress is genuine.
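The gate's decision rule can be sketched in a few lines of Python. The names are hypothetical, and the `min_gain` margin is an assumption standing in for "clearly improves", since the article does not say how the threshold is defined. All scores in the example are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    branch: str
    dev_score: float    # reported by the executor on development data
    test_score: float   # re-measured by the coordinator on held-out data


def merge_gate(candidate: Candidate, trunk_test_score: float,
               min_gain: float = 0.0) -> bool:
    """Accept only if the held-out score beats the trunk by more than `min_gain`.

    The development score is deliberately ignored: a high dev score alone
    never gets a candidate merged.
    """
    return candidate.test_score > trunk_test_score + min_gain


# Hypothetical example: the candidate with the better dev score is rejected.
trunk = 0.70
overfit = Candidate("hyp/prompt-tweak", dev_score=0.90, test_score=0.69)
genuine = Candidate("hyp/chunking", dev_score=0.78, test_score=0.74)
print(merge_gate(overfit, trunk), merge_gate(genuine, trunk))
```

Keeping the development score out of the decision is the point of the design: the executor optimizes against one evaluator, and a different one decides what reaches the trunk.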

Arbor broadly fits within the concept of "loop engineering," popularized by industry figures such as Peter Steinberger, creator of OpenClaw, and Boris Cherny, head of Claude Code. The idea is to move from single prompts to designing iterative cycles (observe, reason, act, verify) that drive autonomous agents. However, as Jin points out, "A loop can be filled with dirty, unrepeatable attempts, and you have nothing to show and no way to recreate what changed."

Arbor in action

The researchers evaluated Arbor on an autonomous optimization task suite built from real-world research settings and on the MLE-Bench Lite machine learning engineering benchmark. The AO suite includes tasks from various areas of AI development, including model training, harness engineering, and data synthesis.

The researchers used various backbone models for the coordinator and executor agents, including Claude Opus 4.6, GPT-5.5, and Gemini-3-Flash. They tested Arbor against two strong coding agents, Codex and Claude Code. Arbor and the baselines were given equal resources. For the MLE-Bench Lite tasks, Arbor was also compared to top-tier agentic research systems such as AI-Scientist, ML-Master, and AIDE.

Arbor consistently outperformed the baselines. It achieved the best held-out test results on all tasks, with an average relative gain more than 2.5 times that of Codex and Claude Code. On the BrowseComp task, which involves optimizing a search agent, Arbor improved the system’s held-out accuracy to 67.67% from a baseline of 45.33%. Meanwhile, Codex and Claude Code stalled at 50% and 53.33% respectively. On MLE-Bench Lite, when equipped with GPT-5.5, Arbor achieved the strongest results among all compared systems.
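To make the BrowseComp numbers concrete, the short Python sketch below computes each system's gain over the 45.33% baseline. The function names are illustrative. This is per-task arithmetic on the figures quoted above; the 2.5x headline figure is an average across tasks whose exact definition is not given here, so this calculation does not reproduce it.

```python
# Held-out BrowseComp accuracy in percent, as quoted in the article.
BASELINE = 45.33
RESULTS = {"Arbor": 67.67, "Codex": 50.00, "Claude Code": 53.33}


def absolute_gain(score: float, baseline: float) -> float:
    """Improvement in percentage points over the starting system."""
    return score - baseline


def relative_gain(score: float, baseline: float) -> float:
    """Improvement as a fraction of the starting system's score."""
    return (score - baseline) / baseline


for name, score in RESULTS.items():
    points = absolute_gain(score, BASELINE)
    fraction = relative_gain(score, BASELINE)
    print(f"{name}: +{points:.2f} points, {fraction:.1%} relative")
```

On this one task, Arbor's gain is about 22.3 points (roughly 49% relative), against about 4.7 points for Codex and 8.0 points for Claude Code.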

Arbor also proved resilient against overfitting. For example, in the Terminal-Bench 2.0 experiments, Claude Code achieved a high development score of 75, but its score dropped to 71 on held-out data. Arbor had a lower development score of 72.22 but achieved the highest held-out score of 77.36, indicating that its gains transfer to real-world use.

Arbor also showed generalization in a cross-task transfer experiment. After Arbor optimized the search harness for the BrowseComp task, the researchers took the optimized codebase and tested it on two other search-agent tasks it had not been tuned on, HLE and DeepSearchQA. Arbor’s optimized codebase significantly improved performance even on those unseen tasks.

Deploying Arbor: sweet spots and hidden costs

For engineering leads who want to drop Arbor into their existing tech stack, the framework is designed to sit on top of existing Git workflows rather than replace them. "Its output is a simple Git branch that your existing code review, CI, and human review can inspect directly," Jin said. Only verified commits are merged into the per-run trunk, leaving the main repository untouched unless a developer chooses to manually promote the code.

However, deploying Arbor comes with specific tradeoffs. Jin points out that the biggest issue is token cost, as maintaining a long-lived coordinator that constantly manages the tree and dispatches executors is a major expense. Running multiple separate worktrees simultaneously also requires real compute and disk resources to run real experiments.

So where is Arbor’s sweet spot? According to Jin, it excels at tasks with a clear, reliable metric, tolerance for long time horizons, and a real search space with several plausible directions, such as pipeline optimization, data-synthesis quality, and model-training recipe tuning.

Conversely, teams should avoid using Arbor for tasks with strict real-time latency requirements, for obvious one-line fixes, or when the underlying evaluation metric is flawed. The quality of the entire run is strictly tied to the quality of the evaluator. "If the metric is not reliable, Arbor will optimize rapidly toward unreliable results," Jin said.

Jin sees the next evolution moving beyond single scalar metrics. "A natural evolution is to have each node’s artifacts be a vector – accuracy, latency, cost – rather than a single score," Jin said. "Going from single scalar to multi-objective Pareto search is a very natural extension of the framework."
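As a rough illustration of what a multi-objective comparison could look like, the Python sketch below computes a Pareto front over per-node metric vectors. It is not part of Arbor as described in the article: the `Metrics` fields, their units, and the example values are all hypothetical.

```python
from typing import NamedTuple


class Metrics(NamedTuple):
    accuracy: float     # higher is better
    latency_ms: float   # lower is better
    cost_usd: float     # lower is better


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    no_worse = (a.accuracy >= b.accuracy
                and a.latency_ms <= b.latency_ms
                and a.cost_usd <= b.cost_usd)
    strictly_better = (a.accuracy > b.accuracy
                       or a.latency_ms < b.latency_ms
                       or a.cost_usd < b.cost_usd)
    return no_worse and strictly_better


def pareto_front(candidates: dict[str, Metrics]) -> list[str]:
    """Names of candidates that no other candidate dominates."""
    return [
        name for name, metrics in candidates.items()
        if not any(dominates(other, metrics)
                   for other_name, other in candidates.items()
                   if other_name != name)
    ]


# Hypothetical example values.
nodes = {
    "hyp/chunking": Metrics(accuracy=0.70, latency_ms=900, cost_usd=0.05),
    "hyp/cache": Metrics(accuracy=0.68, latency_ms=400, cost_usd=0.02),
    "hyp/rerank": Metrics(accuracy=0.66, latency_ms=950, cost_usd=0.06),
}
print(pareto_front(nodes))
```

With a single scalar score there is always one best node; with a vector, several nodes can remain on the front, and choosing among them becomes a tradeoff decision.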


