How to build custom reasoning agents with a fraction of the compute

Training AI reasoning models demands resources that most enterprise teams do not have. Engineering teams are often forced to choose between distilling knowledge from large, expensive models and relying on reinforcement learning techniques that provide only sparse feedback.

Researchers at JD.com and several academic institutions recently introduced a training paradigm that addresses this dilemma. The technique, called reinforcement learning with self-distillation (RLSD), combines the reliable, verifiable reward signal of reinforcement learning with the granular, token-level feedback of self-distillation.

Experiments indicate that models trained with RLSD outperform models trained with classic distillation and reinforcement learning algorithms. For enterprise teams, the approach lowers the technical and financial barriers to building custom reasoning models tailored to specific business needs.

The problem with training reasoning models

The standard method for training reasoning models is reinforcement learning with verifiable rewards (RLVR). In this paradigm, the model learns through trial and error, guided only by the end result: an automated verifier checks whether the model’s final answer is correct or incorrect and provides a binary reward, such as 0 or 1.
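
In code, such a verifier can be as simple as an exact-match check on the final answer. The sketch below is illustrative; `extract_final_answer` is a hypothetical helper, not part of any particular framework:

```python
def extract_final_answer(response: str) -> str:
    """Take whatever follows the last 'Answer:' marker in the response."""
    return response.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    The verifier sees only the outcome: every token in the reasoning
    trace receives this same scalar, regardless of its contribution.
    """
    return 1.0 if extract_final_answer(response) == ground_truth else 0.0
```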

RLVR suffers from sparse, uniform feedback. “Standard GRPO has a signal density problem,” Chenxu Yang, co-author of the paper, told VentureBeat. “A multi-thousand-token logic trace gets a single binary reward, and every token inside that trace gets equal credit, whether it’s an important logical step or a throwaway phrase.” As a result, the model never learns which intermediate steps drove its success or failure.
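
The credit-assignment problem Yang describes is easy to see in a simplified sketch of GRPO-style advantages, where a single group-normalized scalar is broadcast to every token (an illustration of the credit assignment only, not GRPO’s full objective):

```python
import torch

def grpo_token_advantages(rewards: torch.Tensor, seq_len: int) -> torch.Tensor:
    """rewards: (group_size,) binary outcomes for rollouts on one prompt.

    The group-normalized advantage is broadcast to every token, so a key
    deduction and a throwaway phrase receive identical credit.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return adv.unsqueeze(-1).expand(-1, seq_len)  # (group_size, seq_len)
```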

On-policy distillation (OPD) takes a different approach. Rather than waiting for a final verdict, developers pair a smaller student model with a larger, more capable teacher model. For each training example, the student’s response is scored against the teacher’s predictions token by token, giving the student detailed feedback across the entire reasoning chain rather than only on the final answer.
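
A minimal sketch of the dense signal OPD provides, assuming PyTorch and a shared vocabulary between the two models:

```python
import torch
import torch.nn.functional as F

def opd_token_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL between student and teacher, computed on the
    student's own rollout -- the dense feedback that OPD provides.

    Both tensors: (seq_len, vocab_size), which is why teacher and student
    must share the same tokenizer and vocabulary.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(student || teacher) summed over the vocab: one value per token.
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
```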

But deploying and running a separate, much larger teacher model alongside the student for the entire training run incurs massive computational overhead. “You have to keep a larger teacher model resident throughout the entire training, which will almost double your GPU footprint,” Yang said. Furthermore, the teacher and student models must share the exact same vocabulary and tokenizer, which, according to Yang, “tacitly rules out most cross-architecture, cross-modality, or multilingual setups actually run by enterprises.”

The promise and failure of self-distillation

On-policy self-distillation (OPSD) emerged as a solution designed to overcome the shortcomings of both approaches. In OPSD, a single model plays the role of both student and teacher.

During training, the student view of the model receives the standard prompt, while the teacher view receives privileged information, such as a verified, step-by-step answer key. This better-informed teacher version of the model then evaluates the student version, providing token-by-token feedback as the student attempts to solve the problem from the standard prompt alone.
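
Conceptually, the two roles are just two forward passes of the same network with different prefixes. The sketch below assumes a Hugging Face-style causal LM and tokenizer; the prompt format is illustrative, not taken from the paper:

```python
import torch

def opsd_views(model, tokenizer, question, solution_key, response):
    """One model, two contexts: the 'teacher' is the same network
    conditioned on privileged information (a verified answer key)."""
    resp_ids = tokenizer(response, add_special_tokens=False,
                         return_tensors="pt").input_ids
    n = resp_ids.shape[1]

    def response_logits(prefix: str) -> torch.Tensor:
        prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
        ids = torch.cat([prefix_ids, resp_ids], dim=1)
        # Keep only the positions that predict the response tokens.
        return model(ids).logits[:, -n - 1 : -1]

    student = response_logits(question)  # sees the standard prompt
    teacher = response_logits(           # sees the verified answer key
        f"Reference solution: {solution_key}\n\nQuestion: {question}"
    )
    return student, teacher  # the teacher pass would be detached in training
```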

OPSD seems like the perfect compromise for enterprise budgets. It provides the detailed, step-by-step guidance of OPD, and because it eliminates the need for an external teacher model, it approaches the computational efficiency and low cost of RLVR, requiring only one additional forward pass for the teacher.

However, the researchers found that OPSD suffers from a phenomenon called “privileged information leakage.”

“The objective is structurally wrong,” Yang said. “There is an irreversible mutual-information gap that the student can never close… When self-distillation is set up as distribution matching, the student is asked to simulate the teacher’s full output distribution under a privileged context.”

Because the teacher evaluates the student against a hidden answer key, the training objective forces the student model to learn the teacher’s exact phrases and steps rather than the underlying reasoning logic. As a result, the student model begins to hallucinate references to an invisible solution it would not have access to in a real-world deployment.

In practice, OPSD models show a rapid increase in performance early in training, but their reasoning abilities soon plateau and then progressively decline.

Separating direction from magnitude with RLSD

The researchers behind RLSD realized that the signals governing how a model updates its parameters have fundamentally asymmetric requirements. The signal that determines the direction of an update (i.e., whether to reinforce or punish a behavior) can be sparse, but it must be completely reliable, because pointing the model in the wrong direction corrupts its reasoning policy.

The signal that determines the magnitude of the update (i.e., how much relative credit or blame a specific step deserves), on the other hand, benefits from being dense, enabling subtle, step-by-step improvements.

RLSD builds on this principle by separating the update direction from the update magnitude. Verifiable environmental feedback, the RLVR signal, rigorously determines the direction of learning: the model receives positive reinforcement only if the final answer is objectively correct.

The self-distillation teacher is stripped of the power to dictate what the model should produce. Instead, the teacher’s token-by-token assessment is repurposed to set the magnitude of the update, distributing the total credit or blame across the different steps of the model’s reasoning path.
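
One way to read this split in code: the group-level advantage from the verifier supplies the sign, while the privileged teacher’s per-token log-probabilities are normalized into credit shares that only rescale it. This is a sketch of the intuition, not the paper’s published objective:

```python
import torch
import torch.nn.functional as F

def rlsd_token_advantages(group_advantage: float,
                          teacher_logits: torch.Tensor,
                          token_ids: torch.Tensor) -> torch.Tensor:
    """group_advantage: verifier-derived scalar (fixes the SIGN).
    teacher_logits: (seq, vocab) from the privileged-context forward pass.
    token_ids: (seq,) tokens the student actually generated.
    """
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # How strongly the privileged teacher endorses each chosen token.
    endorsement = t_logp.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    # Non-negative credit shares that sum to 1 across the sequence.
    shares = F.softmax(endorsement, dim=-1)
    # Direction comes only from the verifier; the teacher only reshapes
    # its per-token magnitude. (For failed rollouts, one could instead
    # weight by teacher disagreement, concentrating blame where the
    # privileged teacher deviates most.)
    return group_advantage * shares * token_ids.numel()
```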

This changes how the model learns compared to the classic OPSD paradigm. In standard OPSD, the training objective functions like behavioral cloning: the model is forced to directly copy the teacher’s exact words and phrases, which causes the student to hallucinate and leak references to data it does not have.

Rather than forcing the model to copy a hidden solution, RLSD provides a natural and virtually cost-free source of per-token credit information.

“Intuition: We’re not teaching the model to reason like a teacher,” Yang said. “We are telling the model which tokens of the path it has chosen are actually working. The model’s exploration distribution remains its own. Only the credit allocation is sped up.”

If a specific deduction strongly supports the correct result, it receives a higher score; a useless filler phrase receives only a baseline score. RLSD eliminates the need to train complex auxiliary reward networks, manually annotate data step by step, or maintain large external teacher models.

Testing RLSD

To test RLSD, the researchers trained the open-weight Qwen3-VL-8B vision-language model and evaluated it on several visual reasoning benchmarks, including MMMU, MathVista, MathVision, VMath, and ZeroBench, a stress-test benchmark of college-level, multidisciplinary questions designed to be nearly impossible for current frontier models.

They compared the RLSD model with a base model without post-training, standard RLVR through the GRPO algorithm, standard OPSD, and a hybrid combination of the two.

RLSD outperformed all other methods, achieving the highest average accuracy of 56.18% across all five benchmarks. It beat the base model by 4.69% and outperformed the standard RLVR by 2.32%. The benefits were most pronounced in complex mathematical reasoning tasks, where RLSD outperformed the standard RLVR by 3.91% on the MathVision benchmark.

Beyond accuracy, the framework offers significant efficiency gains. “Empirically, RLSD in 200 training steps already outperforms GRPO trained for 400 steps, so roughly 2x the convergence speed,” Yang said. “Cost-wise, the only overhead beyond the normal GRPO pipeline is one additional forward pass per response to get the teacher logits. Compared to rollout generation… it’s basically free.”

Unlike OPSD, which saw performance rise and then collapse due to information leakage, RLSD maintained long-term training stability and converged to a higher performance ceiling than the standard methods.

Qualitative findings shed light on how the model’s learning behavior changes. For example, in a complex visual counting task, standard RLVR looks only at the final correct answer and gives the same reward to the entire paragraph of reasoning. RLSD surgically rewarded the specific mathematical subtraction steps that solved the problem, while actively down-weighting generic filler text such as “Looking at the image, I think…”

In another example, the model produced an incorrect mathematical derivation from a bar chart. Instead of labeling the entire response a failure, RLSD concentrated the largest penalty on the exact point where the model misread the relationship in the chart, while remaining neutral on the rest of the logical setup, which was valid.

This is especially important for unstructured, real-world enterprise use cases. If a model makes a mistake analyzing a 50-page quarterly earnings report, developers don’t want it to forget its entire analytical framework; they just want it to fix the specific perception step that went wrong. RLSD lets the model learn, token by token, which logical jumps are valuable and which are flawed. And because it does this by reusing the model itself, it delivers this granular reasoning feedback while keeping training costs reasonable.

How enterprises can get started

For data engineers and AI orchestration teams, integrating RLSD is straightforward, but it requires the right setup. The most important requirement is a verifiable reward signal, such as code compilers, math checkers, SQL execution, or schema validators. “Tasks without verifiable reward (open-ended dialogue, brand-voice writing) belong in preference-based pipelines,” Yang said.
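
For a text-to-SQL workload, for example, a verifier can simply execute the generated query and compare result sets. The sketch below uses SQLite and is illustrative rather than production-ready (real checkers also need timeouts and sandboxing):

```python
import sqlite3

def sql_execution_reward(generated_sql: str, reference_sql: str,
                         db_path: str) -> float:
    """Verifiable reward for text-to-SQL: run both queries against a
    read-only copy of the database and compare result sets."""
    conn = sqlite3.connect(db_path)
    try:
        got = set(conn.execute(generated_sql).fetchall())
        want = set(conn.execute(reference_sql).fetchall())
        return 1.0 if got == want else 0.0
    except sqlite3.Error:
        return 0.0  # invalid SQL earns no reward
    finally:
        conn.close()
```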

RLSD is highly flexible, however, about the privileged information it requires. While OPSD requires structurally complete intermediate reasoning traces, forcing enterprises to either pay annotators or distill them from a frontier model, RLSD does not.

“If you have full verified logic traces, great, RLSD will use them,” Yang said. “If you have a ground truth final answer, that works too… OPSD doesn’t have that flexibility.”

The technique is also lightweight to integrate into existing open-source multimodal RL frameworks such as verl or EasyR1. According to Yang, it requires no framework rewrite and slots directly into the standard stack: the code swap involves changing just tens of lines to adjust the GRPO objective and sync the teacher with the student.
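
To make that concrete: the clipped surrogate loss at the heart of a GRPO pipeline would not change at all; only the advantages fed into it would. This is a generic sketch, not verl’s or EasyR1’s actual internals:

```python
import torch

def grpo_policy_loss(logprobs: torch.Tensor, old_logprobs: torch.Tensor,
                     advantages: torch.Tensor, clip_eps: float = 0.2):
    """Standard clipped surrogate loss. RLSD keeps this unchanged and only
    swaps the uniform per-token `advantages` for teacher-shaped ones
    (e.g., from `rlsd_token_advantages` above)."""
    ratio = (logprobs - old_logprobs).exp()
    clipped = ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()
```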

Looking ahead, RLSD offers enterprises a powerful way to maximize their existing internal assets.

Yang concluded: “Proprietary data enterprises keep inside their perimeter (compliance manuals, internal documents, historical stamps, verified code snippets) is essentially free privileged information. RLSD lets enterprises feed such data directly into the privileged context, accelerating signal learning on small models without the need for an external teacher and without sending anything outside the network.”


