
Researchers at Meta, the University of Chicago, and UC Berkeley have developed a new framework that addresses the high cost, infrastructure complexity, and unreliable feedback associated with using reinforcement learning (RL) to train large language model (LLM) agents. The framework, called DreamGym, simulates an RL environment to train agents for complex applications. It dynamically adjusts task difficulty over the course of training, ensuring that the agent gradually learns to solve more challenging problems.
The research team’s experiments show that DreamGym significantly improves RL training, both in fully synthetic settings and in scenarios where the model must transfer its simulated learning to the real world. In settings where RL is possible but expensive, DreamGym matches the performance of popular algorithms using only synthetic interactions, significantly cutting the costs of data gathering and environment interactions.
This approach could be valuable for enterprises, allowing them to train agents for customized applications while avoiding the complexities of setting up and running a live RL environment.
Challenge of training LLM agents
Reinforcement learning is an important technique for training LLMs to handle complex tasks in agentic settings such as web navigation, tool use, and robotics. It allows models to learn from direct interaction and experience, moving beyond the static datasets used in pre-training.
However, RL remains difficult to apply to agent training. Real-world applications often involve long action sequences with sparse rewards, meaning the agent receives a positive signal only after a long and correct sequence of actions.
Gathering enough diverse and valid data is also expensive, often requiring human experts to verify actions and interpret results. And the infrastructure required to create live environments for large-scale RL training can be extremely complex and costly. On top of that, interacting with a live system carries risk: a wrong action (such as deleting a file) can cause irreparable damage.
“These limitations make building general-purpose and scalable systems for training agents with RL an open and pressing challenge,” the researchers write.
DreamGym challenges that status quo by delivering comparable performance in full simulation, removing the infrastructure burden that has kept most enterprises from adopting RL and giving teams a practical path to train agents without touching expensive or risky live environments.
How does DreamGym work?
The researchers describe DreamGym as an “integrated and scalable RL framework that synthesizes diverse experience data in an online manner to enable efficient and effective training of LLM agents.” It is built around three main components that work together to create a controlled and effective training loop.
The first component is a “reasoning-based experience model” that translates the dynamics of the target environment into textual space. This model acts as a simulator of the application environment: instead of interacting with an expensive real environment, the agent interacts with the model, which generates consistent state transitions and feedback in response to the agent’s actions.
The researchers argue that agent training does not require a perfectly realistic environment, but rather data that is “sufficiently diverse, informative and causal.” For example, in a web-shopping task, the model synthesizes clean listings of on-page elements rather than processing raw HTML. This abstraction makes training the experience model highly efficient, requiring only small amounts of public data.
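To make the abstraction concrete, here is a hypothetical sketch of the kind of state reduction described above, turning raw page elements into a clean textual listing an agent can reason over. The element structure and field names are illustrative assumptions, not DreamGym’s actual format.

```python
# Hypothetical sketch: abstracting a raw web page into the kind of clean,
# textual state listing an experience model could reason over.
# Field names ("interactive", "id", "type", "text") are illustrative.

def abstract_state(raw_elements):
    """Reduce raw page elements to a compact textual listing,
    dropping layout and markup noise and keeping only the fields
    an agent needs to choose its next action."""
    lines = []
    for el in raw_elements:
        if el.get("interactive"):  # keep buttons, links, inputs only
            lines.append(f"[{el['id']}] {el['type']}: {el['text']}")
    return "\n".join(lines)

# A raw page with markup noise vs. the abstracted listing:
raw = [
    {"id": "b1", "type": "button", "text": "Add to Cart", "interactive": True},
    {"id": "d7", "type": "div", "text": "<style>.x{}</style>", "interactive": False},
    {"id": "l2", "type": "link", "text": "Red Sneakers - $49", "interactive": True},
]
print(abstract_state(raw))
# [b1] button: Add to Cart
# [l2] link: Red Sneakers - $49
```

The agent sees only the two actionable elements; the styling `div` never reaches the policy, which is what keeps the textual state compact and learnable.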
The second component is an “experience replay buffer,” which acts as dynamic memory. At the start of training, the buffer is seeded with offline data to provide the necessary context, and it is continuously updated with new synthetic trajectories generated during training. The buffer helps ground the experience model’s predictions, keeping synthetic experiences diverse and factually grounded.
The third component, a “curriculum task generator,” works with the experience model to adaptively create progressively more challenging tasks. It identifies tasks where the agent’s performance is mixed (a sign that they are difficult but solvable) and generates variations to stretch the agent’s capabilities.
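The curriculum idea can be sketched in a few lines: pick tasks whose success rate is neither near zero nor near one, then spawn harder variants of them. The thresholds and the variation function below are illustrative assumptions; in DreamGym the variations would come from a model, not a string template.

```python
def select_frontier_tasks(task_stats, low=0.2, high=0.8):
    """Return tasks with mixed success rates: hard but solvable."""
    return [t for t, rate in task_stats.items() if low <= rate <= high]

def generate_variations(task, n=2):
    """Stand-in for an LLM call that produces harder task variants."""
    return [f"{task} (variant {i + 1}, harder)" for i in range(n)]

# Example success rates: one task is at the learning frontier,
# one is already solved, one is currently out of reach.
stats = {"buy red shoes": 0.5, "open homepage": 0.99, "book flight": 0.05}
frontier = select_frontier_tasks(stats)
new_tasks = [v for t in frontier for v in generate_variations(t)]
print(frontier)   # ['buy red shoes']
print(new_tasks)  # two harder variants of 'buy red shoes'
```

Tasks the agent always solves or never solves carry little training signal; concentrating new variants on the mixed-performance frontier is what keeps difficulty rising at the pace the agent can absorb.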
Together, these components form a closed-loop system for scalable agent training. According to the researchers, “By integrating interaction, memory, and adaptive online task generation, DreamGym addresses persistent challenges that limit RL for training LLM agents: prohibitive cost, lack of diverse tasks, unstable reward signals, and heavy infrastructure demands.”
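The closed loop described above can be sketched as follows. Every class here is a simplified stub standing in for one component; none of these names or interfaces come from DreamGym itself.

```python
class StubBuffer:
    """Replay buffer stub: stores trajectories, returns recent context."""
    def __init__(self): self.items = []
    def add(self, traj): self.items.append(traj)
    def sample(self): return self.items[-2:]

class StubExperienceModel:
    """Would call an LLM to simulate states and rewards; stubbed here."""
    def rollout(self, agent, task, context):
        return {"task": task, "reward": 1}

class StubTaskGenerator:
    def initial_tasks(self): return ["task-0"]
    def expand(self, task, traj):
        # Spawn a harder variant when the agent succeeded.
        return [task + "+"] if traj["reward"] else []

class StubAgent:
    def __init__(self): self.updates = 0
    def update(self, traj): self.updates += 1  # RL policy update, stubbed

def training_loop(agent, model, buffer, gen, steps=3):
    """Closed loop: simulate, remember, update policy, adapt curriculum."""
    tasks = gen.initial_tasks()
    for _ in range(steps):
        task = tasks.pop(0)
        traj = model.rollout(agent, task, context=buffer.sample())
        buffer.add(traj)                       # grow synthetic memory
        agent.update(traj)                     # policy improvement step
        tasks.extend(gen.expand(task, traj))   # harder tasks next round
    return agent

agent = training_loop(StubAgent(), StubExperienceModel(),
                      StubBuffer(), StubTaskGenerator())
print(agent.updates)  # 3
```

The point of the sketch is the data flow: each rollout feeds the buffer (memory), the policy (learning), and the task queue (curriculum) at once, which is what makes the loop self-sustaining without a live environment.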
DreamGym in action
The researchers evaluated DreamGym on several agent benchmarks, including WebShop (e-commerce), ALFWorld (embodied control), and WebArena (realistic web interactions). They used Llama 3 and Qwen 2.5 models as the agent backbones and compared DreamGym against several traditional training strategies. These include offline methods such as supervised fine-tuning (SFT) and direct preference optimization (DPO), as well as online RL algorithms such as proximal policy optimization (PPO) and group relative policy optimization (GRPO), which improve agents through live environment interactions.
DreamGym showed its most significant benefits in environments like WebArena, where setting up large-scale RL infrastructure is difficult. Agents trained entirely inside DreamGym achieved success rates more than 30% higher than baseline methods, which struggled in real environments with sparse rewards and limited exploration. The researchers said this shows that DreamGym makes RL training feasible “in domains that were previously difficult due to the underlying task and engineering constraints.”
In environments where RL is supported but expensive, agents trained with DreamGym performed on par with those trained using GRPO and PPO, but without any costly interactions with the external environment. The team also introduced a sim-to-real approach, DreamGym-S2R, in which an agent is first trained in the synthetic environment and then fine-tuned on a small amount of real-world data. This strategy improved performance by more than 40% compared to training from scratch in a real environment, while using less than 10% of the real-world data, offering a scalable warm start for training general-purpose agents.
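The sim-to-real recipe amounts to a two-phase budget split: spend most of the interaction budget on cheap synthetic rollouts, then a small fraction on real-world fine-tuning. The 90/10 split below mirrors the “less than 10% real data” figure; the functions and interface are illustrative stubs, not DreamGym’s API.

```python
def sim_to_real(agent_update, synthetic_env, real_env,
                total_budget=1000, real_frac=0.1):
    """Two-phase training: synthetic warm start, then real fine-tuning."""
    n_real = int(total_budget * real_frac)
    n_sim = total_budget - n_real
    for _ in range(n_sim):
        agent_update(synthetic_env())   # cheap synthetic rollouts first
    for _ in range(n_real):
        agent_update(real_env())        # small, costly real-world phase
    return n_sim, n_real

# Count how many rollouts each phase consumes.
counts = {"sim": 0, "real": 0}

def record(source):
    counts[source] += 1

n_sim, n_real = sim_to_real(
    agent_update=record,
    synthetic_env=lambda: "sim",
    real_env=lambda: "real",
)
print(counts)  # {'sim': 900, 'real': 100}
```

Under this split, only a tenth of the budget ever touches the expensive (or risky) real environment, which is where the reported cost savings come from.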
Finally, the framework demonstrated strong generalization. An agent trained on tasks in one domain, such as WebShop, could successfully transfer its learned skills to another, such as WebArena. The researchers suggest this is because DreamGym agents learn in an abstract meta-representation space, which “enables the agent to learn domain-agnostic behavioral preferences rather than memorizing task-specific patterns.”
Though still in its early stages, DreamGym shows that simulated environments can deliver real benefits in agent training. In practice, an enterprise could collect a small set of trajectories and task descriptions for the workflows it wants to automate, then use that seed data to bootstrap the DreamGym framework for scalable, sample-efficient agent training.