MiniMax teases upcoming M3 model with new sparse attention mechanism and 15.6X long-context response speed boost


Among the many Chinese AI companies and labs competing for market share and attention in the global market, MiniMax stands out for its commitment to providing leading-edge intelligence across multiple modalities, including text, coding, and video (through its Hailuo model series), often under permissive, enterprise-friendly, standard open source licenses.

Now, MiniMax has again raised the eyebrows of AI power users and developers around the world by releasing a new, in-depth technical report on the creation of its popular M2 series language models (M2, M2.5, and M2.7), which highlights its many engineering innovations and clever approaches. The company and its leaders have also teased an entirely new sparse attention approach for its upcoming MiniMax M3 series models, which they say will yield up to 15.6 times faster decoding (or LLM response) speed on long contexts (millions of tokens) by adopting a custom sub-quadratic framework. In doing so, MiniMax has designed M3 to make ultra-long-context AI agent deployment economically viable.

The M2 report is noteworthy for any enterprise working with AI models, and especially for those looking to train and improve their own in-house. After all, MiniMax’s M2 series models often achieved the world’s top benchmark scores for open source AI upon release.

While that title has since been claimed by several other Chinese labs, including DeepSeek and Xiaomi, the new report from MiniMax offers a blueprint that enterprises around the world can use to improve their own AI models and agent performance.

As Hugging Face’s Adina Yakup observed on X, "Beyond benchmarks, they have done some solid work on MoE efficiency and agent-oriented design. Excited to see where the M3 goes next!"

The attention dilemma

The core technical architecture of the M2 series relies on the sparse mixture-of-experts (MoE), decoder-only transformer layout used by many other state-of-the-art LLMs.

The base backbone has a total of 229.9 billion parameters, yet maintains a remarkably low operational footprint by activating only 9.8 billion parameters per token across 256 fine-grained experts.

To optimize routing and avoid standard load-balancing problems, MiniMax implemented sigmoid gating with learnable, expert-specific bias terms, significantly reducing reliance on restrictive auxiliary losses.
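For scale, 9.8 billion of 229.9 billion is roughly 4.3% of the parameters active per token. The routing idea can be illustrated with a minimal sketch. It assumes, as some other published MoE designs do, that the per-expert bias only influences which experts are picked, while the mixing weights come from the unbiased sigmoid scores. The function `route`, the four-expert setup, and all numbers are hypothetical and are not taken from the M2 report.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def route(logits, bias, top_k):
    """Pick top_k experts for one token and return {expert_index: weight}.

    logits: router outputs, one per expert
    bias:   per-expert bias (assumption: used for selection only)
    """
    scores = [sigmoid(z) for z in logits]           # independent per-expert gates
    biased = [s + b for s, b in zip(scores, bias)]  # bias nudges the ranking
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    # Mixing weights use the unbiased scores, normalized over the chosen experts.
    return {i: scores[i] / total for i in chosen}

# Illustrative values: expert 0 is "overloaded", so its bias has been lowered.
logits = [2.0, 1.5, 0.1, -0.3]
no_bias = route(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
with_bias = route(logits, [-0.5, 0.0, 0.3, 0.0], top_k=2)
print(sorted(no_bias), sorted(with_bias))
```

In this toy case the bias shifts traffic from expert 0 to expert 2 without adding any balancing term to the training loss, which is the general appeal of bias-based balancing.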

The most consequential engineering decision documented in the M2 paper was strict adherence to full multi-head attention with grouped query attention (GQA) in all 62 layers.
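As a rough illustration of what GQA changes and what it does not: groups of query heads share one key/value head, which shrinks the KV cache, but every token still attends to every earlier token. The head counts below are hypothetical, chosen only to keep the example small, and are not M2's actual configuration.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Return the index of the shared K/V head used by a given query head."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

# Hypothetical sizes for illustration only.
NUM_QUERY_HEADS = 8
NUM_KV_HEADS = 2

mapping = [kv_head_for(h, NUM_QUERY_HEADS, NUM_KV_HEADS) for h in range(NUM_QUERY_HEADS)]
kv_cache_ratio = NUM_KV_HEADS / NUM_QUERY_HEADS  # cache size relative to one K/V head per query head

print(mapping)         # which K/V head each query head reads
print(kv_cache_ratio)  # fraction of the non-grouped KV cache that remains
```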

In large language models, "quadratic scaling" refers to the computationally expensive reality of standard full attention mechanisms, where every token in the sequence must be mathematically connected to every other token. To use a real-world analogy, this is similar to attending a networking event and being forced to have in-depth conversations with every single person in the room, while simultaneously monitoring all other ongoing conversations.

Although this approach produces an incredibly rich understanding of context, the processing power and memory required grow with the square of the input length, creating a severe hardware bottleneck as models attempt to ingest hundreds of thousands of words.
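The growth is easy to check with a few lines of arithmetic. This sketch simply counts query-key pairs under causal attention, where each token attends to itself and all earlier tokens; the sequence lengths are arbitrary examples.

```python
def causal_pairs(n):
    """Number of (query, key) pairs when each of n tokens attends to itself and all earlier tokens."""
    return n * (n + 1) // 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {causal_pairs(n):>13,} attention pairs")

# Doubling the input length roughly quadruples the work.
ratio = causal_pairs(20_000) / causal_pairs(10_000)
print(round(ratio, 2))
```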

The problem with sub-quadratic scaling

"Sub-quadratic" scaling introduces architectural shortcuts designed to bypass this quadratic computational load. Instead of mapping every possible connection, sub-quadratic methods, such as sliding window attention or compressed linear attention, analyze only a localized window of nearby words or generate a compressed summary of the broader text.

These efficient methods significantly reduce hardware costs and allow models to process larger documents at higher speeds, but they have historically introduced serious trade-offs in accuracy, causing AI to often miss the "big picture" or lose track of distant context.
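The trade-off can be seen in a minimal sketch of sliding window attention. The window size and sequence length are arbitrary, and the helper names are made up for this example.

```python
def visible_positions(i, window):
    """Positions token i may attend to under a causal sliding window."""
    return range(max(0, i - window + 1), i + 1)

def windowed_pairs(n, window):
    """Total (query, key) pairs for n tokens with a causal sliding window."""
    return sum(min(i + 1, window) for i in range(n))

n, window = 1_000, 128
full = n * (n + 1) // 2
sparse = windowed_pairs(n, window)
print(full, sparse)  # far fewer pairs with the window

# The cost: the last token can no longer see the start of the document directly.
print(0 in visible_positions(n - 1, window))
```

The saving is large, but a clue placed at position 0 is invisible to the final token unless information about it has been relayed through intermediate layers, which is one way distant context gets lost.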

This mathematical dilemma defines MiniMax’s architectural evolution from the M2 to its upcoming M3 series. During the development of M2, researchers rigorously tested sub-quadratic shortcuts but found that they crippled the model’s "multi-hop reasoning" (its ability to connect separate clues in a long document), forcing the team to absorb the enormous computational cost of full quadratic attention in order to maintain frontier-level intelligence.

Indeed, the researchers aggressively benchmarked efficient attention options during pre-training but deliberately rejected them. They experimented extensively with hybrid setups, combining full attention with sub-quadratic architectures such as Lightning Attention or hybrid Sliding Window Attention (SWA) configurations.

The empirical results were definitive: linear and windowed attention variants largely demonstrated severe reasoning deficits.

In evaluations beyond a 32K context window, the SWA variant performed significantly worse than full attention: on the RULER 128K common words extraction (CWE) task, the score fell from the full-attention baseline of 90.0 to 72.0.

The sub-quadratic configurations also suffered from memory-bound bottlenecks during training, lacked native prefix caching support, and failed to align smoothly with the Multi-Token Prediction (MTP) module used for speculative decoding. Full attention was deemed necessary to preserve multi-hop reasoning capability.

However, recognizing that physical hardware limitations cannot sustain quadratic scaling indefinitely, MiniMax is now designing the M3 series around a novel sub-quadratic framework intended to provide both high-speed processing and intact multi-hop reasoning.

MiniMax Sparse Attention (MSA) and sub-quadratic scaling incoming

The upcoming MiniMax-M3 is designed to overcome the compute-heavy constraints of its predecessor. As revealed by MiniMax’s engineering team under the banner "Something big is coming," M3 introduces "MiniMax Sparse Attention" (MSA).

Unlike DeepSeek’s Multi-Head Latent Attention (MLA), which compresses keys and values into a low-dimensional latent space, MSA works on a standard GQA backbone but uses block-level selection on real, uncompressed key-values.

Elie Bakouch of AI training infrastructure and platform lab Prime Intellect explained the main change in a post on X: "Like CSA, selection takes place at block level but the attention is on actual KVs, not inside [compressed space]."
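MiniMax has not yet published how MSA scores or selects blocks, so the following is only a generic sketch of block-level selection followed by ordinary attention over the real keys and values of the chosen blocks. The scoring rule (query dotted with each block's mean key), the function name, and all values are assumptions made for illustration, not the MSA algorithm.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_sparse_attention(query, keys, values, block_size, top_blocks):
    """Attend only to the positions inside the top-scoring blocks of the KV cache."""
    # 1. Split cached positions into contiguous blocks.
    blocks = [range(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]

    # 2. Score each block. Assumption: query dotted with the block's mean key.
    #    The mean key is used for ranking only and is never attended to.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(query))]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    positions = [i for b in sorted(ranked[:top_blocks]) for i in blocks[b]]

    # 3. Standard softmax attention over the uncompressed K/V of selected positions.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in positions]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    norm = sum(exps)
    output = [sum(e / norm * values[i][d] for e, i in zip(exps, positions))
              for d in range(len(values[0]))]
    return output, positions

# Toy cache: 8 positions, 2-dimensional keys, 1-dimensional values.
keys = [[0, 1], [0, 1], [5, 0], [4, 0], [0, -1], [0, 0], [1, 0], [2, 0]]
values = [[float(i)] for i in range(8)]
output, positions = block_sparse_attention([1.0, 0.0], keys, values,
                                           block_size=2, top_blocks=2)
print(positions)
```

Because the attention step reads the original keys and values, the cache layout stays the same as in full attention, which is consistent with the prefix-caching point raised in the M2 report; only the number of positions read per step shrinks.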

This addresses the precision loss and prefix-caching constraints described in the M2 paper. By dynamically scoring and selecting blocks of the sequence, MSA provides an architectural leap: preliminary hardware profiling indicates a 9.7x speedup in prefill latency and a massive 15.6x speedup in decoding at sequence lengths of 1 million tokens, compared to the full-attention M2 architecture.

To understand why a speedup in the "decoding stage" is so important, it helps to understand how AI actually reads and writes information. When you interact with AI, processing occurs in two distinct stages: prefilling and decoding.

When you give an AI a prompt, whether it’s a short sentence or a large 1,000-page document, it processes the entire chunk of text simultaneously, in parallel. This step, called "prefill," is basically the model reading the input in one big gulp to build its initial understanding and establish context.

To generate a response, the AI must enter the "decoding stage." To predict the first word of its response, it looks at the prompt. To predict the second word, it has to look at the prompt plus the first word. To predict the hundredth word, it must take into account the prompt and the previous 99 words it just wrote. So as it progresses, generating the response becomes increasingly expensive, since every new word requires a review of everything that came before.

For a layman, imagine reading an in-depth legal brief (prefilling) and then being forced to write a summary report where, before writing every single new word, you have to rapidly re-read the entire brief and everything you’ve written so far to make sure your next word makes sense (decoding).

Because the AI has to constantly and repeatedly look backward to take each new step forward, the decoding step is the most serious computational bottleneck in generating text. This is why AI models often type out their answers word by word, and why they become significantly slower as conversations get longer.
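The look-back cost can be sketched by counting how many earlier positions each new token has to read. The sketch assumes a KV cache, so earlier tokens are stored rather than recomputed, and the optional `budget` parameter is a hypothetical stand-in for a sparse method that reads a bounded number of positions per step.

```python
def attention_reads(prompt_len, new_tokens, budget=None):
    """Cached positions read at each decode step.

    Step t (0-based) produces output token t + 1 and looks back at the
    prompt plus the t tokens already written. With a budget, the read
    count per step is capped (a stand-in for sparse selection).
    """
    reads = [prompt_len + t for t in range(new_tokens)]
    if budget is not None:
        reads = [min(r, budget) for r in reads]
    return reads

dense = attention_reads(prompt_len=1_000, new_tokens=100)
print(dense[0], dense[-1], sum(dense))  # first step, hundredth step, total

capped = attention_reads(prompt_len=1_000, new_tokens=100, budget=256)
print(sum(capped))
```

With full attention the per-step cost keeps climbing with context length, whereas a capped read count stays flat, which is why decoding is where sparse attention can pay off most on very long inputs.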

So, when MiniMax says that the new architecture achieves a massive 15.6x speedup during the decoding step at a sequence length of 1 million tokens, it means that the model has found a structural shortcut to generate its answer, token by token, about 16 times faster. This directly targets the exact bottleneck that typically causes AI chatbots to freeze or stutter when handling massive amounts of information.

Development of the MiniMax M series and creation of ‘Forge’

At the product level, MiniMax has continuously evolved its models from simple text generation interfaces into autonomous workers.

The M2 series pioneered an "interleaved thinking" protocol in which the model alternates between natural-language planning traces and explicit tool invocations inside the same trajectory. Instead of removing intermediate chain-of-thought blocks between execution turns, M2 appends the entire thinking history directly to the conversation context. This planning persistence prevents state drift, allowing the model to gracefully recover from runtime errors and modify its strategies based on environmental feedback.
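The difference between keeping and stripping the planning traces can be shown with a small sketch. The message schema, the `build_context` helper, and the tool names are hypothetical and do not represent MiniMax's API.

```python
def build_context(history, keep_thinking=True):
    """Assemble the context for the next model turn.

    With keep_thinking=True (the behavior described above), planning
    traces stay in the trajectory alongside tool calls and results.
    """
    return [msg for msg in history
            if keep_thinking or msg["type"] != "thinking"]

# Hypothetical agent trajectory.
history = [
    {"type": "user", "text": "Fix the failing test."},
    {"type": "thinking", "text": "Plan: run the tests, then inspect the traceback."},
    {"type": "tool_call", "text": "run_tests()"},
    {"type": "tool_result", "text": "1 failed: test_parse"},
    {"type": "thinking", "text": "The failure is in the parser; open parser.py next."},
    {"type": "tool_call", "text": "read_file('parser.py')"},
]

persistent = build_context(history)                     # plan survives between turns
stripped = build_context(history, keep_thinking=False)  # plan is lost between turns
print(len(persistent), len(stripped))
```

In the stripped variant the model sees that `read_file` was called but not why, which is the kind of state drift the persistent history is meant to prevent.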

To train these long-horizon workflows, MiniMax created "Forge," a scalable agent-native reinforcement learning system. Forge divides execution into three independent modules: the agent side, the middleware abstraction layer (gateway server and data pool), and the training/inference engine.

As MiniMax engineer Olive Song explained on the ThursdAI podcast, "We realized that such a small model has a lot of potential if we train reinforcement learning on it with a large amount of environments and agents… but this is not a very easy thing to do," adding that this environment training was where the team spent a significant portion of its development timeline. To absorb the trajectory-length variation common in multi-step agent environments, Forge implements two important engineering solutions:

  1. Windowed FIFO Scheduling: A training scheduler that maps a sliding window onto the generation queue. Within the window, completed tasks can be fetched greedily for high throughput, which prevents cluster idle time, while FIFO limits are still strictly enforced to maintain delivery consistency and avoid serial oscillations.

  2. Prefix tree merging: An optimization that reorganizes batch training into tree computation. Completions sharing the same conversation prefixes are calculated exactly once in the forward pass before branching. This eliminates redundant computations, yielding a 40x training speedup with zero approximation error.
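Both ideas can be sketched in a few lines. This is a toy illustration under stated assumptions, not Forge's implementation: the window is assumed to start at the oldest unconsumed rollout and to advance only when that rollout is consumed, and prefix merging is modeled by counting unique prefix-tree nodes rather than running a real forward pass. All names and data are hypothetical.

```python
def fetch_ready(done, consumed, window):
    """Windowed FIFO: indices of finished rollouts the trainer may take now.

    done:     done[i] is True once rollout i has finished generating
    consumed: set of rollout indices already used for training
    window:   how far past the oldest unconsumed rollout the trainer may look
    """
    n = len(done)
    head = next((i for i in range(n) if i not in consumed), n)
    return [i for i in range(head, min(head + window, n))
            if done[i] and i not in consumed]

def count_prefix_tree_tokens(sequences):
    """Tokens computed when shared prefixes are merged into a tree (one node per unique prefix)."""
    root, nodes = {}, 0
    for seq in sequences:
        node = root
        for tok in seq:
            if tok not in node:
                node[tok] = {}
                nodes += 1
            node = node[tok]
    return nodes

# Windowed FIFO: rollouts 4 and 5 are finished but lie outside the window.
done = [False, True, True, False, True, True]
first = fetch_ready(done, consumed=set(), window=3)
print(first)

# Prefix merging: three completions share the same three-token conversation prefix.
prefix = ["sys", "user", "tool_result"]
rollouts = [prefix + ["a", "b"], prefix + ["a", "c"], prefix + ["d"]]
naive_tokens = sum(len(r) for r in rollouts)
merged_tokens = count_prefix_tree_tokens(rollouts)
print(naive_tokens, merged_tokens)
```

The toy data only halves the token count; the saving grows with the length of the shared prefix and the number of completions branching from it, and because shared tokens are computed identically either way, no approximation is involved.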

This reinforcement infrastructure led directly to the M2.7 checkpoint, which introduced "self-evolution" to the series. Working inside an automated agent harness, M2.7 acts as an independent machine learning engineer. The model runs its own training jobs, diagnoses anomalies, reads logs, and automatically modifies its own codebase and configuration.

According to MiniMax, M2.7 successfully managed between 30% and 50% of its own development workflow.

On OpenAI’s rigorous MLE-Bench Lite suite, which tests autonomous ML research capability, M2.7 achieved a 66.6% medal rate in independent 24-hour tests, effectively on par with Google’s closed-weights Gemini 3.1 Pro.

The steady cadence from M2 to M2.5, which famously saw the model complete 30% of internal work and 80% of newly committed code at MiniMax headquarters, reflects a deliberate, comprehensive approach.

As the MiniMax team noted during that phase of deployment, "We believe that M2.5 offers virtually unlimited possibilities for the development and operation of agents in the economy."

With its technical report codifying the successes of the M2 generation and an MSA technical blog on the horizon, MiniMax is signaling that the next frontier of AI is about translating a minimal activation footprint into maximum real-world intelligence.


