Fine-tuning forgets. RAG leaks context. Hypernetworks build the model your agent needs on demand.

Enterprise teams keep seeing the same thing happen. An AI agent demos beautifully, goes to production, and stalls: it runs for a short time, then needs a human to refresh its context and check its output, and the promised efficiency disappears under supervision. The agent worked; you watched. This is one reason so many agent pilots never become production systems.

The pitch on the other side of that wall is the one every team wants to believe: an agent that runs a long job on its own, overnight if necessary, and leaves a person to validate only the last 10%. Whether that is attainable is a question the orchestration conversation mostly stops short of. When AI firm Chroma tested 18 leading models, every one lost accuracy as the input grew, a property of how attention works and not a gap that a stronger model closes. An agent is hit harder because its context is not static: it grows as the agent works, so it degrades more.

This is the layer beneath the orchestration race. Routing, durable execution, and observability all assume that each agent being coordinated is already capable. The deeper question is how long an agent can run before a human steps in, and that depends on where your company’s knowledge resides relative to the model. Both standard fixes keep a human in the loop.

Why teaching a model your business keeps you in the loop

Frontier models are becoming more capable, and the gap is not closing, because it is not a problem of capability. It’s about where your knowledge sits relative to the model, and enterprises have two ways to put it there.

The first is fine-tuning, which turns knowledge into weights. It runs into catastrophic forgetting, a problem identified in the 1980s and still unresolved in 2026: teaching a model something new can destroy what it already knew. Teams work around it by splitting each task into its own fine-tuned model or adapter, producing a sprawling model zoo that drives up cost and administrative overhead. And a fine-tuned model is a snapshot: it goes stale the day a policy changes, and the costly, slow retraining cycle begins again.

The second is in-context learning, which skips retraining by placing the relevant policies in the prompt at run time. This is where context rot bites. Retrieval limits what goes into the prompt, but a retrieval miss looks like a confident answer, and both cost and latency increase with each token added.

The two failures rhyme. With fine-tuning, the model can operate confidently from last quarter’s policy. With in-context learning, it can work confidently past details lost in the middle of a long prompt. Either way the output looks equally convincing, so you can’t tell which parts are wrong without checking them all. That is why the human can never step away. Many teams run both at once, fine-tuning the static knowledge and retrieving the rest. This softens each failure but eliminates neither: for any given output you still cannot be sure that the model is current and working from the right context, so you still check it.

Third way: create expert models on demand

The third approach is moving from research into early products. Instead of retraining a model or filling its prompt, a generator creates a small, task-specific model for your policies on demand, at inference time. The generator is a hypernetwork: a network whose outputs are the weights of another network.

The idea was named in 2016; applying it to generate expert language models from text or documents is recent and active. Sakana AI’s Text-to-LoRA, presented at ICML 2025, produces a model adapter from a plain-language description in a single pass, and a 2026 system called Shine calls hypernetwork optimization a promising new frontier, precisely because it bypasses both the retraining cost of fine-tuning and the context limits of prompting.

The purpose of generating adapters rather than training and storing them is to collapse a huge library of per-task LoRAs into a network that can produce them on demand, including tasks it has not seen.
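To make "a network whose outputs are the weights of another network" concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not how Text-to-LoRA or any vendor's generator is built: `ToyHypernetwork`, `adapted_forward`, the dimensions, and the task embeddings are all hypothetical, and the generator here is a single untrained random linear layer, whereas a real one would be trained.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


class ToyHypernetwork:
    """Hypothetical generator: maps a task embedding to a rank-r LoRA adapter (A, B)."""

    def __init__(self, embed_dim, d_out, d_in, rank, seed=0):
        rng = random.Random(seed)
        self.d_out, self.d_in, self.rank = d_out, d_in, rank
        n_params = rank * d_in + d_out * rank
        # One linear layer with random (untrained) weights:
        # one row per adapter parameter it has to emit.
        self.proj = [[rng.gauss(0.0, 0.1) for _ in range(embed_dim)]
                     for _ in range(n_params)]

    def generate(self, task_embedding):
        """Return (A, B), where A is rank x d_in and B is d_out x rank."""
        flat = matvec(self.proj, task_embedding)
        split = self.rank * self.d_in
        a_flat, b_flat = flat[:split], flat[split:]
        a = [a_flat[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        b = [b_flat[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return a, b


def adapted_forward(base_w, adapter, x):
    """Apply the frozen base weights plus the generated low-rank update B @ A."""
    a, b = adapter
    delta = matmul(b, a)  # d_out x d_in
    w = [[bw + dw for bw, dw in zip(b_row, d_row)]
         for b_row, d_row in zip(base_w, delta)]
    return matvec(w, x)


# One frozen base layer, two adapters generated on demand from two task embeddings.
base_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
hyper = ToyHypernetwork(embed_dim=4, d_out=2, d_in=3, rank=1)
audit_adapter = hyper.generate([1.0, 0.0, 0.0, 0.0])
compliance_adapter = hyper.generate([0.0, 1.0, 0.0, 0.0])
x = [1.0, 2.0, 3.0]
print(adapted_forward(base_w, audit_adapter, x))
print(adapted_forward(base_w, compliance_adapter, x))
```

The base weights never change. Nothing is stored per task except the embedding that describes it, which is the sense in which a library of adapters collapses into one generator.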

What’s interesting is how generating adapters closes the loop on the problem above: the per-task adapter that teams build by hand to avoid catastrophic forgetting is the same object a hypernetwork generates automatically. The model zoo stops being a governance headache and becomes a generated output.

The case for going small underneath all this was made directly by Nvidia researchers in a 2025 paper: for the narrow, repetitive tasks that fill agent workflows, smaller models are capable enough and are 10 to 30 times cheaper to run than frontier generalists. Nace.AI, a Palo Alto company that raised a $21.5 million seed round in May, is the clearest commercial example. Its core technology, a generator it calls a metamodel, generates parameter updates for a model at inference time from company policies, and is focused on regulated work: audit, compliance, risk assessment. The company says its agents handle most of the workflow while human experts validate the results, a 90/10 split.

How the three approaches compare

| | Fine-tuning | In-context / RAG | Hypernetwork-generated models |
|---|---|---|---|
| Where business knowledge resides | In model weights | In the prompt, resupplied each run | In weights generated on demand |
| Cost of updating on policy change | High: retrain | Low: edit the source | Low: regenerate |
| Staleness | High: a snapshot | Low | Low: regenerated from current policy |
| Per-call cost and latency | Low | High, grows with context | Low at run time |
| Major failure modes | Forgetting; model-zoo sprawl | Context rot; silent retrieval misses | Generator quality; calibration |
| Who owns the compounding asset? | Whoever trains the model | Whoever stores the data | Depends on where the generator and feedback live |

Why a hypernetwork-built model raises the autonomy ceiling

A model that is narrow, current, and small has less surface on which to go wrong. Fewer errors, confined to a known domain, mean the agent has to send fewer outputs to a person, which is the real basis of any high-autonomy claim. This is also where numbers like 90/10 come from: not a preset dial, but the result of how little the system needs to hand back. Reported autonomy shares are best read as measurements of the architecture, not settings.

Two design choices decide whether autonomy is reliable or just fast. The first is grounding: linking each output to its source so that a reviewer can verify rather than redo. Research models built for this, such as HalluGuard, label each claim as supported or not and cite the passage they relied on. Nace ships its agents with grounding models and reasoning traces for this very reason. The 10% review only has meaning if a human can confirm provenance in a few seconds.
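What "verify rather than redo" asks of the output format can be sketched in a few lines. This is an illustration only: `GroundedClaim`, `check_claim`, and `review_queue` are hypothetical names, and the verbatim-quote check is a crude stand-in for a trained grounding model, not a description of how HalluGuard or any vendor system works.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GroundedClaim:
    """One claim in an agent's output, with the evidence it says it relied on."""
    text: str
    source_id: Optional[str] = None  # which policy passage the claim cites
    quote: Optional[str] = None      # the exact span it claims to rely on


def check_claim(claim: GroundedClaim, sources: Dict[str, str]) -> bool:
    """Supported only if the cited quote appears verbatim in the cited passage."""
    if claim.source_id is None or claim.quote is None:
        return False
    passage = sources.get(claim.source_id)
    return passage is not None and claim.quote in passage


def review_queue(claims: List[GroundedClaim], sources: Dict[str, str]) -> List[GroundedClaim]:
    """Return only the claims a human still has to look at."""
    return [c for c in claims if not check_claim(c, sources)]


sources = {"policy-7": "Expenses above 500 EUR require two approvals."}
claims = [
    GroundedClaim("Two approvals are needed above 500 EUR.", "policy-7",
                  "require two approvals"),
    GroundedClaim("Approvals expire after 30 days.", "policy-7",
                  "expire after 30 days"),
    GroundedClaim("Travel is exempt."),
]
for claim in review_queue(claims, sources):
    print("needs review:", claim.text)
```

Here the second and third claims land in the queue: one cites a quote the passage does not contain, the other cites nothing. The reviewer checks two items against a named passage instead of rereading the whole output.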

The second is the feedback loop, and it forces a question every buyer should ask: when your experts validate the output, whose model gets better, and where does it live? The answer decides whether the compounding asset belongs to the vendor or to you. Arrangements vary. Nace, for example, uses an external network of certified experts for some tasks and, for direct enterprise deployments, the customer’s own employees, with the resulting model housed inside the customer’s cloud. Each option sends the learning, and the ownership, somewhere different.

Where the third path breaks

The approach is still early, and a few open questions will determine how far it goes. Calibration matters most: the value depends on the model knowing when it is uncertain. That is genuinely unsettled. Recent work on calibrating these adapters finds that they do not automatically improve calibration over ordinary fine-tuning, with benefits appearing only under specific constraints.
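One common way to put a number on "knowing when it is uncertain" is expected calibration error (ECE): bucket the model's outputs by stated confidence and compare each bucket's average confidence with its actual accuracy. The sketch below is a plain-Python version of that standard metric. The sample numbers are made up for illustration, and the source does not say which metric the cited work used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 belongs in the last bin, not in a bin of its own.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece


# Illustrative data: a model that says "90% sure" on ten outputs.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(round(calibrated, 3), round(overconfident, 3))
```

A model that is right nine times out of ten when it says 90% scores close to zero. One that is right only half the time at the same stated confidence scores about 0.4, and its confidence cannot be used to decide which outputs skip review.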

The quality of the generated model depends heavily on the policy data it is built from, which puts a premium on data curation. And scale is the open research limit: the hypernetworks shown in published work so far have been small. This is where Nace’s own work gets interesting. In our interview, the company said it has scaled its generator far beyond those published sizes, derived a scaling law for how performance improves, begun sharing the results publicly, and is now putting the work through peer review. If it holds up, it will help answer one of the central open questions in the field, and it’s a paper worth watching.

Whichever approach wins, the work still ends with a human, and that handoff is its own design problem. When Deloitte Australia delivered a government report worth almost A$440,000, it shipped with fabricated citations and an invented court citation after passing senior review, because the reviewers checked the findings, which were correct, and not the provenance, which was not. Controlled research shows the pattern is general: experts corrected a similarly flawed recommendation less often when it was labeled AI-generated.

Article 14 of the EU AI Act now names this risk: automation bias. The lesson isn’t about any one vendor: high autonomy concentrates the human’s attention on a thin, late slice of the work, so the value of that review depends entirely on whether the human can quickly check provenance, which comes back to grounding.

What to make of it, and what to ask before buying

The honest conclusion: what holds your agents back is usually not the orchestration or the size of the model, but whether the model knows your business well enough to be left alone, and the right approach depends on the task at hand. To automate a long, repetitive, high-volume process end to end, say running most of your internal audits overnight and having your own experts check the final portion, a hypernetwork-generated model is the approach most likely to do it cheaply and durably. For a small task that finishes in a few steps and never needs to run unattended, the difference between this and a well-prompted frontier model shrinks to almost nothing, and the integration is not worth the cost.

When a vendor pitches autonomous or expert agents, ask four questions.

  1. Where does the business knowledge reside: in weights, in the prompt, or generated on demand?

  2. What does each output come with, so that a reviewer can verify it instead of redoing it?

  3. What decides which work goes to a human?

  4. And whose model gets better from that feedback, and where does it live?

The answers, not the headline ratios, tell you what you’re buying.

The hypernetwork approach is so far the most credible attempt to teach a specific business to a small model without the model forgetting it and having to be taught again every time. It is also the least proven, and the parts that matter most, calibration and scale, are still under peer review. For the right workload, pilot it now. For the wrong one, the integration cost buys you very little that a well-prompted frontier model wouldn’t.


