Xiaomi's HarnessX rewrites its own AI scaffolding mid-task — and smaller models gain the most

As enterprise AI agents perform increasingly complex, long-horizon tasks, their performance is often restricted by their harness, the software scaffolding that connects the backbone AI to its environment.

Currently, harnesses are largely static and hand-crafted. Improving them is a manual process, and they do not improve automatically based on performance data collected from their environment.

To overcome this engineering hurdle, Xiaomi researchers introduced HarnessX, a framework that treats an AI harness as a composable object and autonomously applies improvements to its code.

In real-world enterprise applications, this automatic adaptation lets AI systems adjust dynamically to application-specific needs. In the researchers' tests, HarnessX delivered substantial performance gains in domains such as software engineering and web interaction.

The results show that scaling up foundation models is not the only path to more capable AI, and for smaller models it may not even be the best one. HarnessX’s harness evolution achieved an average +14.5% performance gain across 15 model-benchmark combinations. For the open-weight Qwen3.5-9B, gains on embodied planning tasks reached +44%.

Challenges of harness engineering

In AI applications, the capability of a foundation model depends heavily on the scaffolding around it. The harness serves as the operational layer that converts raw model output into structured, executable agent behavior. This includes prompting, external tool integration, memory management, and the control flow that determines how an AI system observes its environment, reasons about a problem, and takes action.

As enterprise agents adopt more complex, longer-horizon workflows, harness engineering has become a fundamental part of AI development. Despite its importance, harness development is far from a mature engineering discipline and presents three major challenges.

First, harnesses are static and built by hand. Any change to the underlying foundation model, introduction of new tools, or pivot to a different operational domain requires bespoke, manual code rewriting. Traditional harnesses lack mechanisms to autonomously learn and improve from past performance.

Second, most existing harnesses suffer from architectural complexity. They tightly couple prompt templates, tool wrappers, retry policies, and memory management within the same code paths. This entanglement means that changes to one component can silently break others. Attempts to reuse harnesses across different business domains often result in copy-pasted code rather than clean, modular reuse.

Third, the harness and foundation models are optimized in isolation. When engineers run tests to improve the harness, the generated execution traces are typically discarded rather than used as training data to improve the model. As a result, model upgrades do not inherently improve the harness, creating a bottleneck where teams fail to capture the full value of their agent’s operational data.

HarnessX: An Autonomous Foundry for AI Agents

The HarnessX framework addresses the engineering hurdles of harness development with what the researchers call an “integrated harness foundry.”

The main innovation of HarnessX is to treat the harness as a "first-class object". In software engineering terms, this means that the harness is an independently serializable, modular, and replaceable unit. By separating the model configuration (i.e., which AI model is being driven) from the harness configuration, engineers can swap, adapt, and evolve the scaffold without touching the underlying model.

HarnessX divides the agent’s behavior into distinct components, such as context assembly, memory management, tool ecosystem, control flow, and observability. Each specific behavior is implemented as a "processor" that plugs into a specific lifecycle hook of the harness. This modular structure allows the system to swap, add, or remove these processors without breaking the surrounding pipeline.

To automate the optimization of this modular architecture, HarnessX introduces AEGIS, a trace-driven evolution engine. AEGIS formulates harness optimization as a reinforcement learning (RL) problem over the symbolic components of the harness.
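
As a rough illustration of the processor idea, the sketch below models a harness as a registry of small functions attached to named lifecycle hooks. The hook names, the `Harness` class, and the `trim_context` processor are all hypothetical; the article does not describe HarnessX's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical hook names; HarnessX's real lifecycle hooks are not listed in the article.
HOOKS = ("before_model_call", "after_model_call", "after_tool_call")

# A processor takes the agent state and returns a (possibly modified) state.
Processor = Callable[[dict], dict]


@dataclass
class Harness:
    """Toy harness: an ordered list of processors per lifecycle hook."""

    processors: Dict[str, List[Processor]] = field(
        default_factory=lambda: {hook: [] for hook in HOOKS}
    )

    def add(self, hook: str, processor: Processor) -> None:
        self.processors[hook].append(processor)

    def remove(self, hook: str, processor: Processor) -> None:
        self.processors[hook].remove(processor)

    def run_hook(self, hook: str, state: dict) -> dict:
        # Each processor sees the output of the previous one.
        for processor in self.processors[hook]:
            state = processor(state)
        return state


def trim_context(state: dict) -> dict:
    """Illustrative context-assembly step: keep only the last three messages."""
    return {**state, "messages": state["messages"][-3:]}


harness = Harness()
harness.add("before_model_call", trim_context)
trimmed = harness.run_hook("before_model_call", {"messages": list("abcde")})

# Removing the processor leaves the rest of the pipeline untouched.
harness.remove("before_model_call", trim_context)
untrimmed = harness.run_hook("before_model_call", {"messages": list("abcde")})
```

Because each processor only reads and returns the state, adding or removing one does not require edits to any other component, which is the property the modular design is after.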

Formulating harness optimization as a reinforcement learning problem introduces three failure modes that the researchers had to explicitly engineer against:

  • Reward hacking: The system may find shortcuts that score well instead of actually solving the task.

  • Catastrophic forgetting: An edit that fixes a failure pattern in one domain may silently break a previously resolved workflow in another.

  • Under-exploration: The system may settle for minor quick fixes rather than searching for new, structurally superior tool configurations.

To prevent these problems, AEGIS relies on full trace observations and a four-step pipeline:

  1. Digester: Compresses execution traces into structured summaries to identify where the agent failed.

  2. Planner: Analyzes these summaries to propose structural changes rather than just local, one-off fixes.

  3. Developer: Generates code-level harness edits and tests them to ensure they run correctly before deployment.

  4. Critic and Gate: A critic evaluates edits to detect reward hacking, while a deterministic gate rejects any update that breaks a previously solved task, preventing catastrophic forgetting.
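
The deterministic gate in the last step can be pictured as a simple regression check. The following is a minimal sketch under the assumption that each evaluation run yields a pass/fail result per task; the function names and task IDs are made up for illustration and are not taken from HarnessX.

```python
from typing import Dict, Set


def solved_tasks(results: Dict[str, bool]) -> Set[str]:
    """Return the IDs of tasks that passed in an evaluation run."""
    return {task for task, passed in results.items() if passed}


def gate_accepts(before: Dict[str, bool], after: Dict[str, bool]) -> bool:
    """Accept a harness edit only if no previously solved task regresses.

    A task that is missing from `after` counts as a regression.
    """
    regressions = solved_tasks(before) - solved_tasks(after)
    return not regressions


# Hypothetical results for the current harness and two candidate edits.
baseline = {"task_a": True, "task_b": False, "task_c": True}
candidate_ok = {"task_a": True, "task_b": True, "task_c": True}
candidate_bad = {"task_a": False, "task_b": True, "task_c": True}

accepted = gate_accepts(baseline, candidate_ok)    # fixes task_b, breaks nothing
rejected = gate_accepts(baseline, candidate_bad)   # fixes task_b but breaks task_a
```

The check involves no model judgment at all, so an edit that trades one solved task for another is rejected regardless of how the critic scores it.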

HarnessX enters the growing field of self-correcting harness research, but what sets it apart is harness-model co-evolution.

The researchers highlight that optimizing either component in isolation eventually hits a wall. If the underlying model lacks the reasoning capability to use new tools, harness evolution alone hits a scaffolding ceiling. Conversely, training the model alone hits a training-signal ceiling if the harness never prompts the model to use its advanced capabilities.

HarnessX combines harness evolution with model training. As the harness adapts to tasks, the generated execution traces are converted into reinforcement learning signals for the foundation model. Each time the harness improves its strategy, the model learns to better exploit that new strategy, pushing past the capability limits of traditional AI agent development.

HarnessX makes this co-evolution possible through cross-harness GRPO (Group Relative Policy Optimization). GRPO is a popular RL algorithm used to train reasoning models such as DeepSeek-R1.

When fine-tuning the model, cross-harness GRPO pools an agent's execution trajectories for the same task across different versions of the harness. This allows the underlying model to internalize high-level strategy changes, such as using new API endpoints or managing execution budgets, rather than just learning minor prompt-phrase variations.
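
To make the pooling idea concrete, here is a small sketch of the group-relative advantage that GRPO computes, where each rollout's reward is normalized against the mean and standard deviation of its group. The rollout data, the harness version labels, and the way the group is formed are assumptions for illustration, not details from the HarnessX paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rollouts of ONE task under two harness versions: (version, reward).
rollouts = [("v1", 0.0), ("v1", 0.0), ("v2", 1.0), ("v2", 1.0)]

# Grouping per harness version: every rollout in the v1 group has the same
# reward, so all advantages are zero and the group carries no learning signal.
v1_only = group_relative_advantages([r for v, r in rollouts if v == "v1"])

# Pooling across harness versions: rollouts that succeeded under the newer
# harness get a positive advantage relative to those that failed under the old one.
pooled = group_relative_advantages([r for _, r in rollouts])
```

In this toy case the per-version group yields advantages of zero, while the pooled group yields roughly -1 for the failed rollouts and +1 for the successful ones, which is one way to see how comparing across harness versions can expose strategy-level differences.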

HarnessX in action on industry benchmarks

To validate the practical utility of HarnessX, the researchers tested it on five benchmarks spanning software engineering, multi-turn customer service interactions, web navigation, open-ended multi-step reasoning, and embodied planning.

They split the AI into two roles. A “meta-agent” powered by Claude Opus 4.6 analyzed the logs and wrote code to evolve the harness. “Task agents” ran the actual workflows. To show that the framework is model-agnostic, they tested it on three different worker models: Claude Sonnet 4.6, GPT-5.4, and the open-weight Qwen3.5-9B.

HarnessX was compared to two primary baselines. The first was a static harness, which reflects how most enterprises deploy AI today: hand-crafted, frozen setups with benchmark-specific prompts and tools. The second was the Claude Code SDK, a baseline representing a single-agent evolver, used to test whether the more complex four-step AEGIS pipeline performed better than simply asking a single language model to iterate on the code.

Dynamically evolving the harness delivered significant gains over the static harness on the same base models. HarnessX improved performance in 14 out of 15 model-benchmark combinations. Across all trials, evolving the harness resulted in an average absolute performance gain of +14.5%.

The weakest models benefited the most from harness evolution. The open-weight Qwen3.5-9B saw a +44.0% performance jump on the ALFWorld embodied planning benchmark and a +18.2% jump on SWE-bench Verified for software engineering.

Co-evolution also proved highly effective. When the researchers trained the foundation model using data generated while evolving the harness, they saw an additional +4.7% average performance increase. The highest ceilings are reached by improving the harness and the model together. Co-evolution applies only to open-weight models, since it requires fine-tuning the model's weights.

Concrete examples from the experiments show how HarnessX solves thorny problems that arise when building agent harnesses for real-world tasks. For example, in the GAIA multi-step reasoning benchmark, the task agent consistently failed because the headless browser tool used to scrape Wikipedia timed out on the site’s JavaScript-heavy frontend. HarnessX analyzed the execution traces, diagnosed the error, and wrote a new tool that bypassed the browser entirely and queried the MediaWiki API directly for plain text. It then integrated this tool into the harness, which immediately unlocked the previously failing tasks.
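
A tool of this kind might look roughly like the sketch below, which uses the MediaWiki Action API with `prop=extracts` and `explaintext` to request plain text instead of rendering the page in a browser. The article does not say which endpoint or parameters the generated tool used, so the function names, the User-Agent string, and the parameter choices here are assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title: str) -> str:
    """Build a MediaWiki Action API URL that asks for a page's plain-text extract."""
    params = {
        "action": "query",
        "prop": "extracts",   # provided by the TextExtracts extension
        "explaintext": 1,     # plain text instead of HTML
        "redirects": 1,       # follow page redirects
        "format": "json",
        "titles": title,
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_extract(payload: dict) -> str:
    """Pull the extract text out of the API's JSON response."""
    pages = payload.get("query", {}).get("pages", {})
    return "\n".join(page.get("extract", "") for page in pages.values())


def fetch_plain_text(title: str, timeout: float = 10.0) -> str:
    """Fetch a page as plain text; no browser or JavaScript rendering involved."""
    request = Request(
        build_extract_url(title),
        headers={"User-Agent": "harness-sketch/0.1 (illustrative example)"},
    )
    with urlopen(request, timeout=timeout) as response:
        return parse_extract(json.load(response))
```

Calling `fetch_plain_text("Alan Turing")` would issue a single HTTP request, which avoids the page-rendering timeouts described above.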

During WebShop e-commerce tests, the AI agent often got stuck in a pagination loop: it kept clicking "next page" and refining searches without committing to a purchase. Instead of simply changing the prompt, HarnessX built an advisory processor that detected when the agent was repeating navigation actions. The processor injected a warning into the context to force a decision, which fixed the looping behavior and improved performance.
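
A loop detector of this kind can be quite small. The sketch below tracks a sliding window of recent actions and returns a warning once navigation dominates the window. The action names, window size, threshold, and warning text are all hypothetical, since the article does not describe the processor's internals.

```python
from collections import deque
from typing import Deque, Optional

# Hypothetical action names that count as "navigation" rather than commitment.
NAVIGATION_ACTIONS = {"next_page", "prev_page", "search"}

WARNING = (
    "You have been navigating without committing. "
    "Choose the best product seen so far and complete the purchase."
)


class LoopAdvisor:
    """Advisory processor: warns when recent actions are mostly navigation."""

    def __init__(self, window: int = 5, threshold: int = 4) -> None:
        self.recent: Deque[str] = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, action: str) -> Optional[str]:
        """Record an action; return a warning to inject into context, or None."""
        self.recent.append(action)
        navigation_count = sum(a in NAVIGATION_ACTIONS for a in self.recent)
        return WARNING if navigation_count >= self.threshold else None


advisor = LoopAdvisor()
actions = ["search", "next_page", "next_page", "next_page"]
warnings = [advisor.observe(action) for action in actions]
```

The advisor only adds text to the context and never blocks an action, which matches the "advisory" role described above: the model still makes the final decision.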

Limitations of Automated Harness Engineering

An important caveat is that the system currently relies on powerful models to act as the meta-agents that rewrite harness code. In their experiments, the researchers relied on closed frontier models such as Claude Opus. Open-weight models are rapidly improving, but their ability to serve as meta-agents has not been tested.

Another limitation to consider is the intrinsic capabilities of the models used. If the underlying task model is fundamentally too weak to execute the complex workflows the new harnesses propose, then HarnessX will not be able to improve the overall capabilities of the agent (researchers observed this with the Qwen3.5-9B model on SWE-bench coding tests).

Despite these limitations, HarnessX makes a solid case that harness engineering, not just model scaling, is a lever that practitioners can now pull. For teams running small open-weight models on complex workflows, the gains are large enough that harness evolution is worth evaluating as a first step before moving up to more expensive frontier models. The researchers plan to release the code in a future update.


