
When building LLM applications, enterprises often have to write very long system prompts to tailor model behavior to their use case. These prompts include company knowledge, preferences, and application-specific instructions. At enterprise scale, these long-context inferences can push latency beyond acceptable limits and significantly increase per-query costs.
On-Policy Context Distillation (OPCD), a new training framework proposed by researchers at Microsoft, helps bake an application's knowledge and preferences directly into a model. OPCD uses the model's own responses during training, which avoids some of the pitfalls of other training techniques. This improves a model's capabilities for specific applications while preserving its general capabilities.
Why do long system prompts become a liability?
In-context learning allows developers to update a model's behavior at inference time without modifying its underlying parameters, which is generally a slow and expensive process. However, in-context knowledge is ephemeral. It isn't retained across individual interactions with the model, meaning you have to feed the model exactly the same huge set of instructions or documents every time. For an enterprise application, this might mean repeatedly pasting company policies, customer tickets, or dense technical manuals into the prompt. This slows down the model, increases costs, and can confuse the system.
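The cost pressure is easy to see with rough arithmetic. The sketch below uses entirely hypothetical token counts and prices to compare re-sending a long system prompt on every query against a model that has already internalized it:

```python
# Back-of-the-envelope input-token cost of re-sending a long system prompt
# with every query. All numbers here are hypothetical, for illustration only.
PROMPT_TOKENS = 8_000        # long system prompt re-sent on each query
QUESTION_TOKENS = 200        # the user's actual question
PRICE_PER_1K_INPUT = 0.003   # hypothetical $ per 1K input tokens
QUERIES_PER_DAY = 100_000

with_context = (PROMPT_TOKENS + QUESTION_TOKENS) / 1000 * PRICE_PER_1K_INPUT * QUERIES_PER_DAY
internalized = QUESTION_TOKENS / 1000 * PRICE_PER_1K_INPUT * QUERIES_PER_DAY

print(f"re-sent prompt: ${with_context:,.0f}/day")   # $2,460/day
print(f"internalized:   ${internalized:,.0f}/day")   # $60/day
```

At this made-up price point the static prompt accounts for roughly 40x the input cost of the questions themselves, which is the overhead distillation aims to eliminate.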
“Enterprises often use long system prompts to enforce safety constraints (for example, detecting hate speech) or to provide domain-specific expertise (for example, medical knowledge),” paper co-author and Microsoft Research Asia researcher Tianzhu Ye said in comments to VentureBeat. “However, longer prompts significantly increase computational overhead and latency at inference time.”
The main idea behind context distillation is to train a model to internalize information that you repeatedly put into its context. Like other distillation techniques, it follows a teacher-student setup. The teacher is an AI model that receives the full, detailed prompt. Because it has all the instructions and reference documents, it produces highly customized responses. The student is the model being trained, which sees only the core question and does not have access to the full context. The student's goal is simply to observe the teacher's responses and learn to imitate its behavior.
Through this training process, the student model effectively compresses the complex instructions in the teacher's prompt directly into its own parameters. For an enterprise, the primary value comes at inference time. Because the student model has internalized the context, you can deploy it in your application without re-pasting long instructions. This makes the model much faster and dramatically reduces computational overhead.
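The asymmetry between teacher and student can be sketched in a few lines. Everything below is a hypothetical illustration (invented context and helper names), not the paper's actual prompt format:

```python
# Hypothetical prompt construction for context distillation: the teacher sees
# the full enterprise context, while the student sees only the question.
SYSTEM_CONTEXT = (
    "You are a support agent for Contoso.\n"
    "Refund policy: items may be returned within 30 days...\n"
    "Tone guidelines: always remain professional and concise...\n"
)  # stands in for thousands of tokens of real instructions

def teacher_prompt(question: str) -> str:
    # Teacher input: full context + question -> highly customized responses.
    return SYSTEM_CONTEXT + "\nUser: " + question

def student_prompt(question: str) -> str:
    # Student input: question only. Training pushes the context's effect
    # into the student's weights instead of its prompt.
    return "User: " + question

q = "Can I return an opened item?"
print(len(teacher_prompt(q)) - len(student_prompt(q)))  # the per-query overhead distillation removes
```

After distillation, only `student_prompt` is ever sent in production; the `SYSTEM_CONTEXT` lives in the weights.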
However, classic context distillation relies on a flawed training method called “off-policy training,” where the model is trained on fixed datasets collected before training begins. This is problematic in several ways. During training, the student is exposed only to ground-truth data and teacher-generated answers, which creates “exposure bias.” In production, the model has to come up with its own token sequences to reach those answers. Because it never practiced making its own decisions or recovering from its own mistakes during training, it can easily derail when operating independently. It’s like showing a student videos of a professional driver and expecting them to learn to drive without trial and error.
Another problem is that classic context distillation uses “forward” Kullback–Leibler (KL) divergence minimization to train the model. Under this method, the model is graded on how well it matches the teacher's entire output distribution, which encourages “mode-covering” behavior. The student model is often smaller or lacks the teacher's rich context, meaning it cannot fully replicate the teacher's complex reasoning. Since the student is forced to try to cover all of the teacher's possibilities anyway, its output distribution becomes overly broad and unfocused.
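The difference between the two objectives shows up even on a toy vocabulary. In the hypothetical example below, the teacher spreads its belief over two plausible answers, and a "broad" student that leaks probability onto tokens the teacher considers near-impossible is compared against a "focused" student that commits to one of the teacher's modes:

```python
import math

def kl(p, q):
    # KL(p || q) over a discrete vocabulary; assumes q > 0 wherever p > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical 4-token vocabulary. The teacher believes in two answers.
teacher = [0.48, 0.48, 0.02, 0.02]
# A student that commits to one of the teacher's modes...
focused = [0.90, 0.04, 0.03, 0.03]
# ...versus one that spreads mass onto tokens the teacher finds implausible.
broad   = [0.30, 0.30, 0.20, 0.20]

# Forward KL (teacher || student) punishes the focused student for missing
# the second mode, so it favors the broad, "mode-covering" student.
print(kl(teacher, focused) > kl(teacher, broad))   # True

# Reverse KL (student || teacher) punishes mass on teacher-implausible
# tokens, so it favors the focused, "mode-seeking" student.
print(kl(focused, teacher) < kl(broad, teacher))   # True
```

The numbers are arbitrary, but the asymmetry is general: forward KL pushes the student to spread out, reverse KL pushes it to commit.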
In real-world applications, this can result in hallucinations, where the AI becomes confused and confidently makes things up because it is trying to mimic a depth of knowledge it does not actually possess. It also means the model cannot generalize well to new tasks.
How OPCD fixes the teacher-student problem
To fix these issues with the legacy teacher-student dynamic, Microsoft researchers introduced On-Policy Context Distillation (OPCD). The most important change in OPCD is that the student model learns from its own generation trajectories rather than from a static dataset (which is why it is called “on-policy”). Instead of passively studying a dataset of the teacher's outputs, the student is given a task, largely without seeing the instructional prompt, and must generate an answer entirely on its own.
The teacher acts as a live instructor while the student formulates its answer. The teacher has access to the full, customized prompt and evaluates the student's output. At each step of the student's generation, the system compares the student's token distribution to what the context-aware teacher would do.
OPCD uses “reverse KL divergence” to grade the student. “By minimizing reverse KL divergence, it promotes ‘mode-seeking’ behavior. It focuses on high-probability regions of the student's distribution,” Ye said. “This suppresses tokens that the student considers unlikely, even if the teacher's distribution gave them high probability. This alignment helps the student correct its mistakes and avoid the wide, hallucinatory distributions of standard distillation.”
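In code terms, a single OPCD-style scoring step might look like the following sketch: the student's own rollout is scored token by token against the context-aware teacher's distribution using reverse KL. This is a minimal illustration with made-up logits, not the paper's implementation:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_p, teacher_p):
    # KL(student || teacher): large where the student puts probability mass
    # that the context-aware teacher considers implausible.
    return sum(q * math.log(q / p) for q, p in zip(student_p, teacher_p) if q > 0)

def opcd_rollout_loss(student_logits, teacher_logits):
    # On-policy: the student generated this rollout itself. At each generated
    # token we compare its distribution to the teacher's and average the
    # per-token reverse KL. The optimizer update is omitted.
    per_token = [reverse_kl(softmax(s), softmax(t))
                 for s, t in zip(student_logits, teacher_logits)]
    return sum(per_token) / len(per_token)

# Toy rollout: 2 generated tokens over a 3-token vocabulary (made-up logits).
student = [[2.0, 0.5, 0.1], [0.3, 1.8, 0.2]]
teacher = [[2.2, 0.4, 0.0], [0.1, 2.0, 0.4]]
loss = opcd_rollout_loss(student, teacher)
print(loss >= 0.0)  # True: reverse KL between distributions is non-negative
```

The key on-policy detail is that the token positions being scored come from the student's own sampling, so gradients flow through states the student will actually visit in production.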
Because the student model actively practices its decision-making and learns to correct its mistakes during training, it behaves more reliably when deployed in a live application. It effectively stores complex business rules, safety constraints, or specialized knowledge in its own parameters.
What OPCD offers: Benchmark results
The researchers tested OPCD in two key areas: experiential knowledge distillation and system prompt distillation. For experiential knowledge distillation, they wanted to see whether LLMs could learn from their past successes and internalize those lessons permanently. They tested it on models of different sizes using math reasoning problems.
First, the model solved problems and was asked to write down general rules learned from its successes. Then, using OPCD, the researchers baked those written lessons directly into the model's parameters. The results showed that the models improved dramatically and no longer needed their learned experience pasted into their prompts. On complex math problems, an 8-billion-parameter model improved from a baseline of 75.0% to 80.9%. On the Frozen Lake navigation game, a small 1.7-billion-parameter model initially had a success rate of 6.3%. After OPCD internalized the learned experience, its accuracy increased to 38.3%.
The second set of experiments focused on long system prompts. Enterprises often use extensive system prompts to enforce strict behavioral guidelines, such as maintaining a professional tone, ensuring medical accuracy, or filtering toxic language. The researchers tested whether OPCD could permanently bake these dense behavioral rules into the model so they would not have to be sent with every single user query. Their experiments show that OPCD successfully assimilates these complex rules and significantly increases performance. When testing a 3-billion-parameter Llama model on safety and toxicity classification, the base model scored 30.7%. After using OPCD to internalize the safety prompt, its accuracy increased to 83.1%. On medical question answering, the same model improved from 59.4% to 76.3%.
One of the major challenges of fine-tuning is “catastrophic forgetting,” where the model becomes too focused on the fine-tuned task and gets worse at general tasks. The researchers tracked out-of-distribution performance to test for this tunnel vision. After subjecting a model to strict safety rules, they tested its ability to answer unrelated medical questions. OPCD successfully retained the model's general medical knowledge, outperforming older off-policy methods by about 4 percentage points. It achieved specialization without losing broad intelligence.
Where OPCD fits in – and where it doesn’t
While OPCD is a powerful tool for internalizing static knowledge and complex rules, it does not replace retrieval-based methods in every case. “RAG is better when the required information is highly dynamic or involves a huge, frequently updated external database that cannot be compressed into model weights,” Ye said.
For enterprise teams evaluating their pipelines, adopting OPCD does not require overhauling existing systems or investing in specialized hardware. “OPCD can be integrated into existing workflows with very little friction,” Ye said. “Any team already running standard RLVR [Reinforcement Learning with Verifiable Rewards] pipelines can adopt OPCD without major architectural changes.”
In practice, the student model serves as the policy model that drives the rollouts, while the frozen teacher model, which sees the full context, provides the reference logits. The hardware requirements are accessible: according to Ye, enterprise teams can reproduce the researchers’ experiments using about eight A100 GPUs.
The data requirements are equally light. For experiential knowledge distillation, developers only need about 30 seed examples to generate solution traces. Since this technique is applied to previously unoptimized environments, even a small amount of data brings most of the performance improvement. For system prompt distillation, existing customized prompts and standard task datasets are sufficient.
The researchers built their implementation on verl, an open-source RLVR codebase, showing that the technique fits cleanly within a conventional reinforcement learning framework. They plan to release their implementation as open source after internal reviews.
Self-improving models: What comes next
Looking ahead, OPCD paves the way for self-improving models that continuously adapt to specific enterprise environments. Once deployed, a model can draw lessons from real-world interactions and use OPCD to progressively internalize them, without manual supervision or data annotation from model trainers.
“This represents a fundamental shift in model improvement: the main improvements in the model will move from training time to test time,” Ye said. “Using the model – and allowing it to gather experience – will become the primary driver of its advancement.”