
Not every company can or should create its own frontier AI language model. However, the harness controlling the model is something most enterprises can, and need to, adapt to their specific purposes.
Of course, this is easier said than done. The agent harness is still largely tuned through manual, ad-hoc debugging, a process that relies heavily on intuition rather than systematic feedback loops, making it difficult to keep pace with rapidly evolving LLMs.
To solve this challenge, researchers from Shanghai Artificial Intelligence Laboratory have introduced “self-harness”, a new paradigm in which an LLM-based agent systematically improves its own operating rules. By examining its own execution traces to apply edits, the system trades manual guesswork for empirical evidence.
Self-improving harnesses can enable development teams to deploy robust custom agents that continuously adapt their own execution protocols to overcome model-specific vulnerabilities.
The harness engineering challenge
The performance of an LLM-based agent is determined not only by its underlying base model, but also by its harness: the surrounding system that provides context and enables the model to interact with the environment. The harness includes components such as system prompts, tools, memory, validation rules, runtime policies, orchestration logic, and failure-recovery processes.
This layer is important because many common agent failures arise from the harness rather than the model. For example, an agent may report success without verifying the model’s output (for example, by running the code to see whether it passes its tests), or it may retry a failed action repeatedly. The harness is also responsible for preventing context decay or overload when the agent’s interaction history becomes too large. Examples of popular harnesses include SWE-agent, Claude Code, Codex, and OpenHands.
Harness engineering remains a significant challenge, but the obstacle is not that humans are too slow or incompetent.
In fact, Hangfan Zhang, lead author of the Self-Harness paper, told VentureBeat, "In many cases, an experienced engineer with deep domain knowledge can offer a better transformation proposal than an LLM even today."
Instead, the real bottleneck of manual engineering is that it relies heavily on ad-hoc debugging rather than verifiable, empirical feedback loops. "The deeper issue is that the current harness-engineering paradigm often lacks systematic feedback loops," Zhang explained. "Many edits are made based on intuition, some observed failures, or ad-hoc debugging."
With new models being released rapidly, relying on human intuition to manually tune model-specific harnesses has become increasingly expensive and untenable. While some approaches use stronger models to improve the harness of weaker target agents, this reliance on external guidance has its own challenges: such models may be expensive, may not exist when the target is itself a frontier model, or may not share the failure modes of the target model.
How Self-Harness works
The self-harness paradigm enables an LLM-based agent to improve its own harness without relying on human engineers or strong external models.
This continuous self-development is driven by a three-step iterative loop that turns behavioral evidence into harness updates:
- Weakness mining: Starting from the initial harness, the agent runs a set of tasks, producing execution traces with verifiable results. The agent classifies failure traces and attempts to detect model-specific failure patterns.
- Harness proposal: Based on these failure patterns, the agent takes on a “proposer” role to generate a set of diverse but minimal harness modifications, each tied to a specific failure mechanism to avoid overly generic fixes.
- Proposal validation: The system evaluates candidate modifications through regression tests. An edit is promoted only if it improves performance without causing a measurable degradation on held-out tasks. If multiple candidate edits pass regression testing, they are merged into the next version of the harness, which serves as the starting point for the next iteration.
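The three steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the researchers' implementation: the harness is modeled as a plain dictionary of rules, and `run_tasks`, `run_held_out`, and `propose` are hypothetical stand-ins for a real benchmark runner and an LLM proposer.

```python
from typing import Callable, Dict, List

Harness = Dict[str, str]   # hypothetical: rule name -> rule text
Results = Dict[str, bool]  # task id -> did the verifier pass?


def self_harness_round(
    harness: Harness,
    run_tasks: Callable[[Harness], Results],
    run_held_out: Callable[[Harness], Results],
    propose: Callable[[Harness, List[str]], List[Harness]],
) -> Harness:
    """Run one mine -> propose -> validate iteration and return the next harness."""
    # 1. Weakness mining: run the tasks and collect the failing ones.
    base = run_tasks(harness)
    failures = [task for task, ok in base.items() if not ok]
    if not failures:
        return harness

    # 2. Harness proposal: each candidate is a small set of added or changed rules.
    candidates = propose(harness, failures)

    # 3. Proposal validation: keep edits that help and break no held-out task.
    held_base = run_held_out(harness)
    next_harness = dict(harness)
    for edit in candidates:
        trial = {**harness, **edit}
        improves = sum(run_tasks(trial).values()) > sum(base.values())
        held_trial = run_held_out(trial)
        regresses = any(held_base[t] and not held_trial[t] for t in held_base)
        if improves and not regresses:
            next_harness.update(edit)  # accepted edits are merged together
    return next_harness


# Toy stand-ins for a benchmark and a proposer (all hypothetical).
def run_tasks(h: Harness) -> Results:
    return {"build": True, "search": "loop_breaker" in h or "skip_checks" in h}


def run_held_out(h: Harness) -> Results:
    return {"verify": "skip_checks" not in h}


def propose(h: Harness, failures: List[str]) -> List[Harness]:
    return [
        {"loop_breaker": "Redirect after too many tool calls."},
        {"skip_checks": "Declare success without validation."},
    ]


next_harness = self_harness_round({}, run_tasks, run_held_out, propose)
print(next_harness)
```

In this toy run both candidate edits fix the failing `search` task, but `skip_checks` breaks the held-out `verify` task, so only `loop_breaker` is merged into the next harness.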
To see why an enterprise would need this, imagine an automated issue-resolution agent that reads internal documentation, writes patches, and opens pull requests. If the company updates its documentation style, the agent may suddenly fail, retrieve the wrong reference, or write a bad patch.
On the surface, the agent simply looks broken. But Self-Harness turns this opaque failure into a solvable problem. "Failure traces highlight where the agent is misusing the new document format; the proposer can generate a targeted harness edit… and the evaluator can decide whether that edit improves the failing cases without leaving other cases behind," Zhang said.
Self-harness in action
The researchers evaluated Self-Harness on Terminal-Bench-2.0, a benchmark that tests common tool-based execution, including artifact management, command usage, validation behavior, and recovery from execution errors. They implemented self-harness with MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5.
To isolate the impact of self-developed harnesses, they started with a minimal harness built on the DeepAgent SDK, which included only the benchmark-facing system prompt and default file system and shell tools. The model backend, tool set, benchmark environment, and evaluator were kept unchanged, while only the harness was allowed to change.
Quantitative results show that the agents improved their performance through automated harness editing. On held-out tasks, performance increased across the board, with relative improvements ranging from 33 to 60 percent depending on the model.
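A relative improvement measures the gain against the baseline score rather than in absolute percentage points. The short sketch below shows the arithmetic; the pass rates in it are hypothetical values chosen for illustration, not figures from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain as a fraction of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical pass rates, NOT results reported by the researchers.
gain = relative_improvement(0.30, 0.45)
print(f"{gain:.0%}")
```

With these made-up numbers, a 15-point absolute gain on a 30 percent baseline is a 50 percent relative improvement, which is why relative figures look larger when the baseline is low.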
Importantly, an explicit acceptance rule promotes only those edits that improve performance without introducing unacceptable regressions. What makes Self-Harness powerful for enterprise applications is that it doesn’t simply lengthen the prompt or add generic instructions. Instead, it introduces targeted changes that reflect recurring problems each model encounters during execution.
For example, under the baseline harness, MiniMax M2.5 would get stuck searching for dataset configurations until the execution environment timed out, failing to generate any deliverables. Through Self-Harness, the system identified this specific fault and wrote a "loop breaker" into its runtime policy, forcing the agent to stop and redirect its approach after 50 tool calls. It also added a rule to create early versions of the required artifacts as quickly as possible.
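A loop breaker of this kind could look roughly like the sketch below. It is an assumption-laden illustration: the class name, method name, and redirect message are hypothetical, and only the 50-call threshold comes from the description above.

```python
from typing import Optional


class LoopBreaker:
    """Hypothetical runtime policy: force a change of approach after N tool calls."""

    def __init__(self, max_tool_calls: int = 50) -> None:
        self.max_tool_calls = max_tool_calls
        self.calls_since_redirect = 0

    def on_tool_call(self) -> Optional[str]:
        """Count one tool call; return a redirect instruction when the budget is hit."""
        self.calls_since_redirect += 1
        if self.calls_since_redirect < self.max_tool_calls:
            return None
        self.calls_since_redirect = 0  # start a fresh budget after redirecting
        return (
            "Tool-call budget reached. Stop the current search, write a first "
            "version of the required artifacts, then try a different approach."
        )
```

A harness would call `on_tool_call()` before each tool invocation and inject any returned message into the agent's context. Resetting the counter after a redirect is a design choice made here for simplicity; a stricter policy could end the episode instead.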
Qwen3.5, on the other hand, had a habit of hitting an error while overwriting a file and then blindly repeating the same command over and over, deleting necessary files in the confusion before eventually stopping. Self-Harness fixed this by introducing a strict command-retry discipline (prohibiting exact duplicate commands) and a mechanism that forced the agent to immediately recreate any missing artifacts if a file error occurred.
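The retry discipline and the missing-artifact check might be approximated as follows. This is a sketch under assumptions: `RetryGuard` and `missing_artifacts` are hypothetical names, and blocking only commands that have already failed is one interpretation of "prohibiting exact duplicate commands".

```python
from pathlib import Path
from typing import Iterable, List, Set


class RetryGuard:
    """Hypothetical policy: never re-run a command that already failed verbatim."""

    def __init__(self) -> None:
        self._failed: Set[str] = set()

    def record(self, command: str, exit_code: int) -> None:
        """Remember commands that exited with a non-zero status."""
        if exit_code != 0:
            self._failed.add(command.strip())

    def allows(self, command: str) -> bool:
        """Reject an exact repeat of a failed command; allow anything else."""
        return command.strip() not in self._failed


def missing_artifacts(required: Iterable[str]) -> List[str]:
    """Return the required paths that do not currently exist on disk."""
    return [path for path in required if not Path(path).exists()]
```

After a file error, a harness could call `missing_artifacts` on the list of expected deliverables and instruct the agent to recreate whatever is returned before continuing.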
GLM-5 struggled to preserve environment changes across various commands, and often wasted time performing mass downloads or finalizing tasks even if sanity checks failed. Its self-built harness introduced rules instructing the agent to persist the PATH variable in shell sessions, limit external calculations, and repair any failed sanity checks before terminating its operations.
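The rule about repairing failed sanity checks before finishing can be pictured as a simple gate. The sketch below is illustrative only; the function names and the idea of representing checks as zero-argument callables are assumptions, not details from the paper.

```python
from typing import Callable, Dict, List


def failing_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every sanity check and return the names of those that fail."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return failed


def may_finish(checks: Dict[str, Callable[[], bool]]) -> bool:
    """Allow the agent to terminate only when no sanity check fails."""
    return not failing_checks(checks)
```

A harness would call `may_finish` when the agent tries to end the task and, if it returns `False`, feed the list from `failing_checks` back to the agent as repair work.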
Hidden costs of automatic harnesses
While self-harness automates the difficult work of tracking specific model failures, decision makers must be realistic about the trade-offs. Replacing human engineering with automated trial-and-error requires significant computational overhead.
"Self-harness replaces part of the human engineering burden with repeated proposal generation, parallel candidate evaluation, and regression testing," Zhang said. "This could mean more API tokens, more latency during optimization, and more infrastructure to run evaluation tasks."
Also, the system depends on the accuracy of its evaluation pipeline. During their experiments on Terminal-Bench-2.0, the researchers relied on strict, deterministic validators to ensure that the agent’s edits were actually helpful. Without this hard ground truth, an automated system runs the risk of promoting bad updates. "[The] evaluation system is not an optional component; this is what lets us trade human intuition for empirical evidence," Zhang said.
This reliance on strict verifiers also dictates where the self-harness should be deployed. "The best deployment targets today are environments where failures can be measured and where trial-and-error is relatively safe," Zhang said, pointing to coding, internal workflow automation and DevOps data pipelines as ideal use cases.
Conversely, enterprises should avoid completely automating harnesses in high-stakes or subjective areas. "The most obvious red flags are domains where assessment is subjective, delayed, non-deterministic, or costly to get wrong, such as medical decision making, safety-critical infrastructure, or legal decisions."
From prompt tweaker to feedback architect
The introduction of self-improving agents doesn’t mean that coding or enterprise workflows will suddenly become human-free. The quality of the collaboration between the human engineer and the AI is still paramount and difficult to capture with automated benchmarks.
Instead, the engineering profession is moving up a layer of abstraction. "The role of enterprise engineers will shift from manually patching individual prompts or tool calls to designing feedback systems that enable agent improvements," Zhang predicted. Moving forward, "the engineer becomes less of a prompt tweaker and more of a feedback architect."
As foundation models become more capable, they will naturally absorb many of the capabilities that currently require manual harness engineering. "But once that happens, the harness won’t disappear; its scope will extend outward to connect the model to the rich external environment," Zhang said. "As long as that limit does not exceed human assessment, humans will remain important providers of feedback."