Mistral's Small 4 consolidates reasoning, vision and coding into one model — at a fraction of the inference cost

[Illustration: small robots building a tall building]
Enterprises juggling separate models for reasoning, multimodal tasks, and agentic coding may be able to simplify their stack: Mistral’s new Small 4 brings all three into a single open-source model, with adjustable reasoning levels under the hood.

The Small 4 enters a crowded field of smaller models – including Qwen and Claude Haiku – that compete on cost and benchmark performance. Mistral’s pitch: small outputs that translate into low latency and cheap tokens.

Mistral Small 4 updates Mistral Small 3.2, which came out in June 2025, and is available under the Apache 2.0 license. “With Small 4, users no longer need to choose between a fast instruction model, a powerful reasoning engine, or a multimodal assistant: one model now delivers all three with configurable reasoning effort and best-in-class efficiency,” Mistral said in a blog post.

The company said that despite its small size – Mistral Small 4 has 119 billion total parameters, with only 6 billion active per token – the model combines the capabilities of all of Mistral’s models. It has the reasoning capabilities of Magistral, the multimodal understanding of Pixtral, and the agentic coding performance of Devstral. It also has a 256K context window, which the company says works well for long-running conversations and analysis.

Rob May, co-founder and CEO of small language model marketplace Neurometric, told VentureBeat that Mistral Small 4 stands out for its architectural flexibility. However, as it joins a growing number of smaller models, he said, it risks adding more fragmentation to the market.

"From a technology perspective, yes, it can be competitive against other models,” May said. “The bigger issue is that it has to clear the market confusion. Mistral will first have to win mindshare to be part of that test set. Only then can they show off the technical capabilities of the model.

Reasoning on demand

Smaller models still offer good options for enterprise builders who want the same LLM experience at a lower cost.

The model is built on the same mixture-of-experts architecture as other Mistral models. It has 128 experts, with four active per token, which Mistral says enables efficient scaling and specialization.
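
As a rough illustration of how that routing works, here is a minimal sketch of a top-k mixture-of-experts layer. The dimensions and the untrained linear "experts" are hypothetical stand-ins, not Mistral's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes for illustration; only NUM_EXPERTS and TOP_K
# match the figures Mistral describes for Small 4.
NUM_EXPERTS, TOP_K, HIDDEN = 128, 4, 512

router = nn.Linear(HIDDEN, NUM_EXPERTS)  # scores each token against every expert
experts = nn.ModuleList([nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS)])

def moe_forward(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (num_tokens, HIDDEN). Each token runs through only TOP_K experts."""
    scores = router(tokens)                               # (num_tokens, 128)
    weights, chosen = torch.topk(scores, TOP_K, dim=-1)   # keep the best 4 of 128
    weights = F.softmax(weights, dim=-1)                  # normalize over the chosen 4
    out = torch.zeros_like(tokens)
    for t in range(tokens.size(0)):
        for k in range(TOP_K):
            # Only the selected experts execute, so per-token compute tracks
            # active parameters (the "6B of 119B" economy), not total parameters.
            out[t] += weights[t, k] * experts[int(chosen[t, k])](tokens[t])
    return out

print(moe_forward(torch.randn(2, HIDDEN)).shape)  # torch.Size([2, 512])
```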

That design allows Mistral Small 4 to respond quickly even on reasoning-intensive prompts. It can also process and reason over both text and images, letting users parse documents and charts.

Mistral said the model has a new parameter called reasoning_effort, which will allow users to “dynamically adjust the behavior of the model.” According to Mistral, enterprises will be able to configure Small 4 to deliver fast, lightweight responses in the style of Mistral Small 3.2, or make it more verbose, like Magistral, providing step-by-step reasoning for complex tasks.
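
Mistral’s post names the parameter but not a full calling convention, so the snippet below is a hedged sketch: it assumes an OpenAI-compatible endpoint (such as a self-hosted vLLM server) and passes reasoning_effort as an extra body field. The endpoint URL and model ID are placeholders:

```python
from openai import OpenAI

# Placeholder endpoint and model name; adjust to your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistral-small-4",  # hypothetical model ID
    messages=[{"role": "user", "content": "Summarize this contract clause..."}],
    # reasoning_effort is the knob Mistral describes: a low setting for fast,
    # Small-3.2-style replies, a high setting for Magistral-style step-by-step
    # reasoning. Passed via extra_body since it is not a standard OpenAI field.
    extra_body={"reasoning_effort": "low"},
)
print(response.choices[0].message.content)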

Mistral said the Small 4 runs on fewer chips than comparable models, with a recommended setup of four Nvidia H100 or H200 GPUs, or two Nvidia B200 GPUs.

Mistral said, “Delivering advanced open-source AI models requires extensive customization. Through close collaboration with Nvidia, inference has been optimized for both open-source vLLM and SGLang, ensuring efficient, high-throughput serving across deployment scenarios.”
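
For teams evaluating that deployment path, a minimal vLLM offline-inference sketch might look like the following. The Hugging Face model ID is an assumption (no official ID is given in the source), and tensor_parallel_size should match your GPU count:

```python
from vllm import LLM, SamplingParams

# Load the model across four GPUs (e.g., the recommended H100/H200 setup).
# The model ID below is a placeholder, not a published checkpoint name.
llm = LLM(model="mistralai/Mistral-Small-4", tensor_parallel_size=4)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Extract the totals from this invoice text: ..."], params)
print(outputs[0].outputs[0].text)
```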

Benchmark performance

According to Mistral’s benchmarks, the Small 4 performs close to the level of Mistral Medium 3.1 and Mistral Large 3, especially on MMLU-Pro.

Mistral said the instruction-following performance makes the Small 4 suitable for high-volume enterprise tasks such as document understanding.

Against small models from other companies, the Small 4 still performs below some popular open-source models, especially on reasoning-intensive tasks. Qwen 3.5 122B and Qwen3-Next 80B outperform the Small 4 on LiveCodeBench, as does Claude Haiku in instruct mode.

Mistral Small 4 did, however, beat OpenAI’s GPT-OSS 120B on LiveCodeBench.

Mistral argues that the Small 4 achieves these marks with “significantly lower output,” which translates into lower inference costs and latency compared to other models. Particularly in instruct mode, the Small 4 produces the smallest output of any model tested – 2.1K characters, versus 14.2K for Claude Haiku and 23.6K for GPT-OSS 120B. In reasoning mode, its outputs are much longer (18.7K), which is expected for that use case.
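
Since output-driven cost scales roughly linearly with length at a fixed per-unit price, a quick back-of-envelope using the figures above shows the gap. The price here is purely hypothetical; only the output lengths come from the source:

```python
# Back-of-envelope: at a fixed per-unit output price, cost scales with length.
PRICE_PER_K = 0.001  # hypothetical dollars per 1K output characters

lengths_k = {"Small 4 (instruct)": 2.1, "Claude Haiku": 14.2, "GPT-OSS 120B": 23.6}
base = lengths_k["Small 4 (instruct)"]
for model, k in lengths_k.items():
    print(f"{model}: ~${k * PRICE_PER_K:.4f} per response ({k / base:.1f}x Small 4)")
```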

May said that while the choice of model depends on an organization’s goals, latency is one of three pillars enterprises should prioritize. “It depends on your goals and what you are optimizing your architecture to achieve. Enterprises should prioritize these three pillars: reliability and structured output; latency-to-intelligence ratio; and fine-tunability and privacy,” May said.


