
Miromind AI and researchers from several Chinese universities have released OpenMMReasoner, a new training framework that improves the multimodal reasoning capabilities of language models.
The framework uses a two-step process. It first refines a base model with a curated dataset in a supervised fine-tuning (SFT) stage. Then, a reinforcement learning (RL) step guides the model to reason more effectively in tasks that involve both text and visual data.
Experiments show that models trained with OpenMMReasoner outperform other leading visual reasoning models, often despite being trained on smaller, higher-quality datasets. The framework and all of its assets, including the trained 7B models, are completely open source, providing a trusted foundation for building applications that require traceability and robustness.
According to Kaichen Zhang, co-author of a research paper outlining the new approach, OpenMMReasoner offers significant benefits for businesses looking beyond large, closed systems. "A small open-source reasoning model has practical advantages: enterprises can deploy it locally, reduce latency, reduce token costs associated with long chains of thought, maintain full control over their data and [it is] fine-tunable to optimize for their specific downstream tasks," he told VentureBeat.
The challenge of transparent multimodal reasoning
Recent advances in reinforcement learning with verifiable rewards (RLVR) have significantly improved the reasoning capabilities of large language models (LLMs). RLVR trains an LLM to generate chain-of-thought (CoT) tokens, which mimic the reasoning processes used by humans, before generating the final answer. This improves the model’s ability to solve complex reasoning tasks such as mathematics and coding.
Inspired by this success, researchers have applied similar RL-based methods to large multimodal models (LMMs), showing that the benefits can extend beyond text to improve visual comprehension and problem-solving across different modalities.
However, a lack of transparency in training pipelines has been a major hurdle. Many studies on multimodal reasoning do not provide detailed information about their data curation and training processes, making it difficult to reproduce their results or understand how these models work.
The researchers note, “This lack of openness restricts reproducibility and obscures a deeper understanding of how reasoning-enabled LMMs are actually built and how their training dynamics evolve.”
The OpenMMReasoner recipe
OpenMMReasoner addresses this gap with a fully transparent and scalable training recipe built on open-source LMMs. The researchers found that curating high-quality datasets by measuring data diversity was critical. While drawing on diverse data sources matters, increasing the variety of correct answers to the same question turned out to be a key driver of improvement.
The first stage of the recipe is a three-step supervised fine-tuning (SFT) pipeline. It starts with data sourcing, where the team collected approximately 103,000 raw question-answer pairs from public datasets covering common visual question answering and reasoning tasks. Next, they added a data distillation stage, using a powerful model (Qwen3-VL-235B-Instruct) to generate new, high-quality reasoning traces for the selected questions. (This data is then used to train a smaller model.)
To increase answer variety, the team created multiple verified reasoning traces for each question. This expanded the dataset to 583,000 samples. Finally, they applied a “domain blending” step, adding data from the math reasoning domain to further generalize the model’s capabilities, resulting in a final SFT dataset of 874,000 examples.
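The answer-diversity step described above can be sketched in a few lines: for each question, sample several candidate reasoning traces from a teacher model, keep only those whose final answer matches the ground truth, and collect the survivors as SFT samples. This is a minimal illustration, not the paper's actual pipeline; `sample_trace` stands in for a call to a teacher model such as Qwen3-VL-235B-Instruct and is stubbed here so the sketch runs on its own.

```python
import random

def sample_trace(question: str, rng: random.Random) -> dict:
    """Stand-in for a teacher-model call: returns a reasoning trace + final answer."""
    # The stub is "right" most of the time, mimicking an imperfect teacher.
    answer = rng.choice(["4", "4", "5"])
    return {"trace": f"Step-by-step reasoning for {question!r}...", "answer": answer}

def expand_with_verified_traces(qa_pairs, samples_per_question=8, seed=0):
    """Expand each (question, gold_answer) pair into multiple verified SFT samples."""
    rng = random.Random(seed)
    sft_dataset = []
    for question, gold_answer in qa_pairs:
        for _ in range(samples_per_question):
            cand = sample_trace(question, rng)
            # Verification filter: keep only traces that reach the correct answer.
            if cand["answer"].strip() == gold_answer.strip():
                sft_dataset.append({"question": question,
                                    "reasoning": cand["trace"],
                                    "answer": gold_answer})
    return sft_dataset

dataset = expand_with_verified_traces([("What is 2 + 2?", "4")])
print(f"{len(dataset)} verified traces for 1 question")
```

Because every retained trace is checked against the reference answer, the dataset can grow severalfold (here, up to 8x) without admitting incorrect reasoning.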
The second stage is an RL recipe that uses a smaller, 74,000-sample dataset curated from domains such as science, mathematics, and puzzles. The model is trained with a composite reward function that considers both the correctness of the final answer and the consistency of the output format. This process also includes a penalty for "overthinking," discouraging the model from generating excessively long responses (a problem with many reasoning models trained via RL, which inadvertently learn to produce overly long reasoning sequences, adding cost and slowing responses).
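A composite reward in the spirit described above might combine three signals: answer correctness, format consistency, and an overthinking penalty on responses that exceed a token budget. The sketch below is a hedged illustration under assumed conventions; the boxed-answer format, the weights, and the whitespace tokenizer are placeholders, not the paper's exact choices.

```python
import re

def composite_reward(response: str, gold_answer: str,
                     token_budget: int = 512,
                     overlength_weight: float = 0.5) -> float:
    """Illustrative composite reward: correctness + format - overthinking penalty."""
    # 1) Format reward: assume the model must wrap its final answer in \boxed{...}.
    match = re.search(r"\\boxed\{(.+?)\}", response)
    format_reward = 1.0 if match else 0.0

    # 2) Correctness reward: the extracted answer must match the reference exactly.
    predicted = match.group(1).strip() if match else ""
    correctness_reward = 1.0 if predicted == gold_answer.strip() else 0.0

    # 3) Overthinking penalty: charge linearly for tokens beyond the budget.
    n_tokens = len(response.split())  # crude whitespace "tokenizer" for the sketch
    overlength = max(0, n_tokens - token_budget) / token_budget

    return correctness_reward + 0.5 * format_reward - overlength_weight * overlength

print(composite_reward(r"The sum is \boxed{4}", "4"))  # prints 1.5
```

Because the final answer is programmatically verifiable, this kind of reward needs no learned reward model, which is the core idea behind RLVR-style training.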
This recipe can provide a blueprint for enterprises to train their own models. "For companies with limited domain-specific data, a viable strategy is to first increase answer diversity for their existing dataset, then use domain blending to integrate this domain data into a common reasoning recipe like ours," Zhang explained. "This allows the model to adapt to industry-specific tasks while also acquiring strong general-purpose reasoning skills, without requiring millions of samples."
A more efficient and capable reasoning model
According to Zhang, the step-by-step process fundamentally changes the reliability of the model’s output. "Traditional models often ‘jump’ straight to the answer, meaning they explore only a narrow part of the reasoning space," he said. "In contrast, a reasoning-first approach forces the model to explicitly examine several intermediate steps… [allowing it] to traverse much deeper paths and reach the answer with far greater internal stability."
The researchers used the OpenMMReasoner recipe to generate data to fine-tune the open-source vision-language model Qwen2.5-VL-7B-Instruct. The result is a highly efficient LMM that consistently outperforms state-of-the-art methods, such as Open Vision Reasoner (OVR), across a wide range of multimodal reasoning benchmarks. The SFT stage alone produces a robust baseline model that achieves better performance and data efficiency than other SFT approaches, despite using a significantly smaller training dataset.
The subsequent RL stage further sharpens and stabilizes these capabilities, making performance both stronger and more consistent. After RL, the final model achieves state-of-the-art results on multiple benchmarks, including WeMath, MathVerse, and MathVista.
One of the main findings was that, as the model's multimodal reasoning improved, it also showed "the gradual emergence of textual reasoning behavior, suggesting a transfer of reasoning ability from multimodal to purely linguistic domains," the researchers noted. This indicates that skills learned in one modality can strengthen performance in another.
"Our results show that strengthening multimodal reasoning can also improve text-only mathematical skills – providing evidence that basic reasoning abilities can transfer across modalities," Zhang said. "Looking ahead, we expect these methods to extend to video and audio."
The researchers also found that token efficiency is important. While allowing a model to generate longer reasoning sequences can improve performance, excessive tokens reduce efficiency. Their results show that setting a small "reasoning budget" can achieve comparable or even better accuracy, an important consideration for deploying cost-effective enterprise applications.
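The serving-cost side of that trade-off is easy to make concrete: output-token spend scales roughly linearly with chain-of-thought length, so a smaller reasoning budget at equal accuracy translates directly into lower per-request cost. The prices and token counts below are placeholder assumptions for a back-of-envelope sketch, not real vendor rates or the paper's figures.

```python
def cost_per_request(reasoning_tokens: int, answer_tokens: int = 50,
                     usd_per_1k_output_tokens: float = 0.002) -> float:
    """Estimate output-token cost for one request at an assumed per-token price."""
    return (reasoning_tokens + answer_tokens) * usd_per_1k_output_tokens / 1000

# An unconstrained long chain of thought vs. a capped "reasoning budget".
long_cot = cost_per_request(reasoning_tokens=4096)
budgeted = cost_per_request(reasoning_tokens=1024)
print(f"${long_cot:.6f} vs ${budgeted:.6f} per request")
```

At these illustrative numbers, quartering the reasoning budget cuts per-request output cost by nearly 4x, which compounds quickly at enterprise request volumes.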
By open-sourcing all components of their pipeline, the researchers provide a reproducible view of the entire process. For enterprise teams, this transparency is invaluable. "For business leaders concerned about vendor lock-in, hidden biases or opaque data sources, this level of transparency is essential," Zhang said. "It empowers teams to verify data, adapt the pipeline to new domains, and maintain long-term independence from any single provider."
