
When enterprises fine-tune LLMs for new tasks, they risk breaking everything the models already know. This forces companies to maintain separate models for every skill.
Researchers at MIT, the Improbable AI Lab, and ETH Zurich have developed a new technique that enables large language models to learn new skills and knowledge without forgetting their previous abilities.
The technique, called self-distillation fine-tuning (SDFT), leverages the built-in in-context learning capabilities of modern LLMs to let models learn directly from demonstrations and from their own attempts. Experiments show that SDFT consistently outperforms traditional supervised fine-tuning (SFT) while sidestepping the limitations of reinforcement learning algorithms.
For enterprise applications, the method enables a single model to accumulate multiple skills over time without suffering performance regression on earlier tasks. This offers a potential route to building AI agents that can adapt to dynamic business environments, gathering new proprietary knowledge and skills as needed without requiring expensive retraining cycles or losing their general reasoning abilities.
The challenge of continual learning
Once an LLM is trained and deployed, it remains static. It does not update its parameters to acquire new skills, assimilate new knowledge, or improve with experience. To create truly adaptive AI, the industry needs to solve "continual learning": allowing a system to accumulate knowledge and abilities over time, much as humans do throughout their careers.
The most effective way for models to learn is "on-policy learning." In this approach, the model learns from data it generates itself, allowing it to correct its own errors and reasoning processes, in contrast to simply replicating a static dataset. Without on-policy learning, models are prone to "catastrophic forgetting," a phenomenon in which learning a new task erases the model's previous knowledge and its ability to perform earlier tasks.
However, on-policy learning usually requires reinforcement learning (RL), which relies on an explicit reward function to score the model's outputs. This works well for problems with clear outcomes, such as math and coding. But in many real-world enterprise scenarios (for example, writing a legal brief or summarizing a meeting), it is difficult or impossible to define a mathematical reward function.
RL methods also often fail when trying to teach a model completely new information, such as a specific company protocol or a new product line. As Idan Shenfeld, a doctoral student at MIT and co-author of the paper, told VentureBeat, "No matter how many times the base model tries, it cannot generate the correct answer for a subject about which it has zero knowledge." This means there is never a positive signal to learn from.
The standard alternative is supervised fine-tuning (SFT), where the model is trained on a fixed dataset of expert demonstrations. While SFT provides clear ground truth, it is inherently "off-policy": because the model is merely copying the data rather than learning from its own attempts, it often fails to generalize beyond the training distribution and suffers from catastrophic forgetting.
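For contrast with the approach described next, here is a minimal sketch of what plain SFT looks like in code. It is illustrative only: the Hugging Face transformers API, the Qwen 2.5 checkpoint name, and the omission of prompt-token masking are assumptions, not details from the paper. The point is that the loss is cross-entropy against a fixed expert answer, so the model's own outputs never enter training.

```python
# Minimal SFT sketch (assumed setup, not the paper's code): the target is a
# static expert answer, so the model's own generations play no role in training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(query: str, expert_answer: str) -> float:
    # Concatenate prompt and expert answer; next-token cross-entropy over the
    # whole sequence (prompt-token masking omitted for brevity).
    batch = tok(query + expert_answer, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])
    optimizer.zero_grad()
    out.loss.backward()
    optimizer.step()
    return out.loss.item()
```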
SDFT seeks to bridge this gap: delivering the benefits of on-policy learning using only pre-recorded demonstrations, without the need for a reward function.
How does SDFT work?
SDFT solves this problem through "distillation," a process in which a student model learns to imitate a teacher. The researchers' insight was to use the model's own in-context learning (ICL) ability to create that teacher-student feedback loop within a single model.
In-context learning is the phenomenon in which an LLM is given a difficult task along with one or more demonstrations of how similar problems are solved. Most advanced LLMs can use such examples to solve new problems without any parameter updates.
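As a concrete illustration, the snippet below builds a hypothetical one-shot prompt (the medical content is invented): the demonstration shows the desired reasoning pattern, and the model imitates it for the new query without any weight updates.

```python
# Hypothetical one-shot ICL prompt; the medical content is invented purely for
# illustration. The model completes `prompt` by imitating the demonstrated
# pattern, with no parameter updates.
demonstration = (
    "Q: A patient presents with fever and a stiff neck. What is the first test to order?\n"
    "A: Fever with neck stiffness raises concern for meningitis, so a lumbar "
    "puncture should be ordered first.\n\n"
)
query = (
    "Q: A patient presents with chest pain radiating to the left arm. "
    "What is the first test to order?\n"
    "A:"
)
prompt = demonstration + query  # fed to the model as-is
```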
During each training step, SDFT uses the same model in two roles:
Teacher: A frozen copy of the model is fed the query along with the expert demonstration. Using ICL, the teacher infers the correct answer and the reasoning required to reach it.
Student: The trainable copy of the model sees only the query, simulating a real-world deployment scenario where no answer key is available.
When the student produces an answer, the teacher, which has access to the expert demonstration, provides feedback. The student then updates its parameters to move its output distribution closer to the teacher's.
This setup effectively creates an on-policy learning loop that combines elements of SFT and RL. The training signal comes not from a static dataset but from the model's own outputs, allowing it to correct its own reasoning trajectories without an external reward signal. It also works for injecting new knowledge that RL would miss.
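The sketch below shows what that loop could look like in code. It is an approximation built on assumptions (the Hugging Face transformers API, a Qwen 2.5 checkpoint, a single sampled rollout per step, and a token-level KL distillation loss), not the authors' released implementation: the student generates an answer from the query alone, while the frozen teacher scores those same tokens with the expert demonstration in its context, and the student is pulled toward the teacher's distribution.

```python
# Illustrative SDFT-style training step (assumptions: transformers API, Qwen 2.5
# checkpoint, one sampled rollout per step, token-level KL distillation loss).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # assumed; any model with strong ICL
tok = AutoTokenizer.from_pretrained(model_name)
student = AutoModelForCausalLM.from_pretrained(model_name)         # trainable
teacher = AutoModelForCausalLM.from_pretrained(model_name).eval()  # frozen copy
for p in teacher.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-6)

def sdft_step(query: str, demonstration: str) -> float:
    # 1) Student sees only the query and samples its own answer (on-policy data).
    student_in = tok(query, return_tensors="pt")
    with torch.no_grad():
        rollout = student.generate(**student_in, max_new_tokens=256, do_sample=True)
    gen = rollout[:, student_in["input_ids"].shape[1]:]  # generated tokens only

    # 2) Teacher sees the expert demonstration in context plus the query, so its
    #    next-token distribution reflects the demonstrated skill.
    teacher_in = tok(demonstration + "\n\n" + query, return_tensors="pt")

    # 3) Score the student's own tokens under both models and pull the student's
    #    distribution toward the teacher's (the distillation step).
    s_logits = student(torch.cat([student_in["input_ids"], gen], dim=1)).logits
    with torch.no_grad():
        t_logits = teacher(torch.cat([teacher_in["input_ids"], gen], dim=1)).logits

    n = gen.shape[1]  # positions whose logits predict the generated tokens
    s_logp = F.log_softmax(s_logits[:, -n - 1:-1], dim=-1)
    t_logp = F.log_softmax(t_logits[:, -n - 1:-1], dim=-1)
    loss = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Unlike the SFT step sketched earlier, the tokens being scored here come from the student's own rollout rather than a fixed dataset, which is what makes the update on-policy.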
SDFT in action
To validate the approach, the researchers tested SDFT with the open-weights Qwen 2.5 model on three complex, enterprise-grade skills: science question answering, software tool use, and medical reasoning.
The results showed that SDFT learned new tasks more effectively than standard methods. On the science question-answering benchmark, the SDFT model achieved 70.2% accuracy compared with 66.2% for the standard SFT approach.
More important for enterprise adoption is the impact on catastrophic forgetting. When the standard SFT model learned the science task, its ability to answer general questions (in areas like logic or the humanities) collapsed. In contrast, the SDFT model improved on the science task while holding its prior-task score steady at 64.5%. This consistency suggests that companies can specialize models for specific departments (for example, human resources or legal) without diminishing the model's basic common sense or reasoning capabilities.
The team also simulated a knowledge-injection scenario, creating a hypothetical dataset of 2025 natural disasters to teach the model new facts. They then tested the model on indirect reasoning questions such as, "Given the floods in 2025, which countries need humanitarian assistance?"
Standard SFT resulted in a model that remembered the facts but struggled to use them in reasoning scenarios. The SDFT model, having absorbed the underlying logic during training, scored 98% on the same questions.
Finally, the researchers ran a sequential-learning experiment in which the model was trained on the science, tool-use, and medical tasks one after another. While the standard model's performance fluctuated, losing earlier skills as it learned new ones, the SDFT model accumulated all three skills without degradation.
This capability addresses a major headache for enterprises that currently manage a "model zoo" of different adapters for different functions.
"We provide the ability to maintain only one model for all the needs of the company," Shenfeld said. this consolidation "Estimated cost can be reduced significantly" Because organizations do not need to host multiple models simultaneously.
SDFT limits and availability
The code for SDFT is available on GitHub and is ready to be integrated into existing model training workflows.
"The SDFT pipeline is similar to the RL pipeline in that it requires online response generation during training," Shenfeld said. They are working with SDFT to integrate it into Hugging Face Learning Transformer Reinforcement (TRL) Library, he said, noting that a pull request is already open for developers who want to test the integration.
For teams considering SDFT, the practical tradeoffs come down to model size and computation. The technique requires models whose in-context learning is strong enough for them to act as their own teacher, which currently means roughly 4 billion parameters with newer architectures like Qwen 3, though Shenfeld expects 1-billion-parameter models to qualify soon. It demands about 2.5 times the computation of standard fine-tuning, and it is best suited to organizations that need a single model to accumulate multiple skills over time, especially in domains where defining a reward function for reinforcement learning is difficult or impossible.
While effective, the method comes with computational tradeoffs. SDFT is approximately four times slower and requires about 2.5 times more compute (FLOPs) than standard fine-tuning, because the model must actively generate its own answers to compare against the teacher during training. However, the researchers say that because the model retains knowledge better, organizations can avoid the costly multi-stage retraining often needed to repair models that have suffered catastrophic forgetting.
The technique also relies on the underlying model being strong enough to benefit from in-context learning. The paper notes that smaller models (for example, 3 billion parameters) initially struggled because they lacked the "intelligence" to act as their own teacher.
However, Shenfeld said the rapid improvement of smaller models is changing this dynamic. "The Qwen 2.5 3B models were very weak, but in some experiments we are currently running we found that the Qwen 3 4B model is quite strong," he said. "I see a future where even 1B models will have good enough ICL capabilities to support SDFT."
Ultimately, the goal is to move beyond static snapshots to systems that improve through use.
"”Lifelong learning, combined with the ability to extract learning signals from unstructured user interactions… will lead to models that simply persist and improve over time,” Shenfeld said.
“Think about the fact that already most of the computation around the world goes into inference rather than training. We need to find ways to use this computation to improve our models."