Every frontier model in 2026 advertises a context window of at least a million tokens, but almost none of them is actually good at using all that information. On MRCR v2, the multi-round co-reference retrieval benchmark, the best model is GPT-5.5 with a score of 74.0%; others, like Claude Opus 4.7, are far behind at 32.2%.
At this point, one million tokens appears to be the ceiling on the context windows the major frontier labs offer. A key reason is the same one that has shaped every Transformer-based model since 2017: attention cost scales quadratically with context length, so doubling the input quadruples the work. RAG, agentic decomposition, hybrid architectures, and every other workaround the industry has produced are, in essence, ways of trading around that constraint.
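To see why, consider what dense attention actually computes. The minimal numpy sketch below (illustrative only; real implementations batch heads and fuse kernels) materializes the query-key score matrix, which has one entry per pair of tokens:

```python
import numpy as np

def attention_scores(n_tokens: int, d: int = 64) -> np.ndarray:
    """Dense attention's core product: an n x n matrix of pairwise scores."""
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((n_tokens, d))  # one query vector per token
    K = rng.standard_normal((n_tokens, d))  # one key vector per token
    return (Q @ K.T) / np.sqrt(d)           # shape: (n_tokens, n_tokens)

print(attention_scores(1_000).shape)  # (1000, 1000): 1,000,000 scores
print(attention_scores(2_000).shape)  # (2000, 2000): 4,000,000 scores. 2x tokens, 4x work
```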
Subquadratic, a Miami-based startup, launched its first model on Tuesday and claims to get around all of this, offering a model with a 12-million-token context window. The company says it plans to follow soon with a model sporting a 50-million-token context window.

The company, which has 11 Ph.D. researchers on staff, argues that its architecture, called subquadratic selective attention (SSA), scales linearly in both computation and memory with respect to context length. The company says SSA runs 52 times faster than dense attention at one million tokens, hits 92.1% on needle-in-a-haystack retrieval at 12 million tokens, a context length no frontier model currently comes close to, and scores 83 on MRCR v2, beating OpenAI by nine points.
These are big claims, and Subquadratic is not the first to tackle this problem. The benchmarks the company is releasing are impressive, including a score of 82.4% on SWE-Bench Verified, better than Anthropic’s previous model, Claude Opus 4.6, at 81.4%, and Google’s Gemini 3.1 Pro at 80.6%. And the company says it is doing all this at a fairly low cost.
Subquadratic is making the model available via an API that exposes the full 12-million-token context window, alongside a coding agent (SubQ Code) and a deep research tool (SubQ Search).
What came before
The quadratic cost of attention is not a new problem, and SSA is not the first attempt to solve it. The research line goes back almost to the original Transformer paper, and the pattern has stayed consistent: each approach trades away one essential property to gain another, and none has managed to replace dense attention at frontier scale.
Among the earliest approaches is fixed-pattern sparse attention. In models such as Longformer, each token attends only to a sliding window of its neighbors, which achieves linear scaling. It works when the relevant information is nearby and breaks down when it is not.
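A minimal sketch of the fixed-pattern idea (Longformer additionally designates a few global tokens, which this omits):

```python
import numpy as np

def sliding_window_mask(n_tokens: int, window: int) -> np.ndarray:
    """True where attention is allowed: each token sees only nearby tokens."""
    idx = np.arange(n_tokens)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

mask = sliding_window_mask(1_000, window=128)
print(int(mask.sum()))  # ~125,000 allowed pairs, versus 1,000,000 for dense
```

The per-token cost stays fixed at the window size as the sequence grows, which is exactly why the pattern cannot recover a fact that sits a million tokens away.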
State-space models such as Mamba, Mamba-2, RWKV, and RetNet replace the all-pairs comparison with a recurrent state that compresses everything observed so far. But compression is lossy. Nvidia’s study at 8B scale found that pure Mamba-2 lagged Transformers on MMLU and phonebook-style lookups, with the gap closing only when attention layers were added back.
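The trade-off is easiest to see in a toy linear recurrence. This is not Mamba’s actual selective scan (which has learned, input-dependent dynamics); it is just the shape of the idea: a fixed-size state absorbs the whole history, so memory stays constant, but anything the state cannot represent is gone.

```python
import numpy as np

def recurrent_summary(tokens: np.ndarray, decay: float = 0.99) -> np.ndarray:
    """Fold an arbitrarily long sequence into one fixed-size state vector."""
    state = np.zeros(tokens.shape[1])
    for x in tokens:               # one step per token: O(n) compute total
        state = decay * state + x  # lossy: a token's trace fades as decay**age
    return state

tokens = np.random.default_rng(0).standard_normal((10_000, 64))
print(recurrent_summary(tokens).shape)  # (64,) regardless of sequence length
```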
Hybrid architectures, as seen in Jamba, Kimi Linear, Qwen3-Next, and Nvidia’s Nemotron v3, are the practical answer. They make most layers efficient and retain a few dense attention layers for recall. But the economics are less favorable than they appear: a hybrid that is roughly three times cheaper at 32K tokens is still only roughly three times cheaper at 10M tokens, because the dense layers it retains still do O(n²) work.
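A back-of-envelope model makes the point; the cost constants here are made up, but the structure is not. The hybrid’s advantage saturates at the ratio of dense layers removed and never grows with context, because past a certain length the remaining O(n²) layers dominate everything else:

```python
def layer_cost(n: int, n_dense: int, n_linear: int) -> float:
    """Toy cost model: linear layers cost ~n each, dense layers ~n^2 / 10_000."""
    return n_linear * n + n_dense * n * n / 10_000

for n in (32_000, 10_000_000):
    dense_model = layer_cost(n, n_dense=48, n_linear=0)  # all 48 layers dense
    hybrid = layer_cost(n, n_dense=12, n_linear=36)      # keep 1 layer in 4 dense
    print(f"n={n:>10,}: hybrid is {dense_model / hybrid:.1f}x cheaper")
# prints ~2.1x at 32K and ~4.0x at 10M: the advantage is bounded by the
# layer ratio 48/12 = 4, no matter how large the context gets
```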
The most recent entries go in a different direction. Instead of fixing the sparsity pattern in advance or compressing history into a state, they learn which positions to attend to.
DeepSeek’s Native Sparse Attention, for example, won the ACL 2025 Best Paper Award. Its successor, DeepSeek Sparse Attention (DSA), ships in DeepSeek v3.2-exp. DSA’s lightning indexer selects a small subset of keys, and the attention over those keys is genuinely sparse. But the indexer that picks them must score every query against every key, which means the selection step itself is still quadratic.
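A simplified sketch of that trap (DeepSeek’s real indexer uses small learned projections and low precision, which shrinks the constant but not the asymptotic complexity):

```python
import numpy as np

def indexer_topk(q_proj: np.ndarray, k_proj: np.ndarray, k: int = 64) -> np.ndarray:
    """Pick the top-k keys per query, by scoring every (query, key) pair."""
    scores = q_proj @ k_proj.T                 # (n, n): the quadratic step
    return np.argsort(-scores, axis=1)[:, :k]  # full attention then runs on k keys

n, d_idx = 4_096, 32                           # indexer heads are small and cheap,
rng = np.random.default_rng(0)                 # but the score count is still n x n
topk = indexer_topk(rng.standard_normal((n, d_idx)), rng.standard_normal((n, d_idx)))
print(topk.shape)  # (4096, 64): sparse attention, quadratic selection
```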
Subquadratic CTO Alex Whedon explains the approach: “It basically means paying attention to less than what transformers do. Instead of, if you have 1,000 words, looking at every possible relationship between all 1,000 words, which is 1,000 squared combinations, you realize that only a fraction of them really matter and you only process the part that matters.”
What SSA does differently
SSA’s pitch is that it does what DSA tried to do without the indexer trap. Selection is content-dependent: for any given query, the model chooses which tokens matter based on what is actually in the query and the keys. Most importantly, the selection mechanism itself is not quadratic.
“For prompt A, words one and six will be important to each other,” says Whedon. “For prompt B, it’s maybe words two and three. It’s different for every single input.”
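Subquadratic has not published how SSA’s selector avoids the all-pairs step, so the following is purely a hypothetical illustration of how content-dependent selection can run in subquadratic time, in the spirit of routing or clustering attention: assign keys to a small set of learned centroids in O(n·c), then let each query attend only within the bucket it routes to.

```python
import numpy as np

# Hypothetical sketch only: SSA's actual selection mechanism is unpublished.
def route_queries(Q, K, centroids):
    """Content-dependent selection in O(n*c) instead of O(n*n)."""
    k_bucket = np.argmax(K @ centroids.T, axis=1)  # assign each key to a centroid
    q_bucket = np.argmax(Q @ centroids.T, axis=1)  # route each query the same way
    return [np.flatnonzero(k_bucket == b) for b in q_bucket]

rng = np.random.default_rng(0)
n, d, c = 8_192, 64, 32
Q, K = rng.standard_normal((2, n, d))
candidates = route_queries(Q, K, centroids=rng.standard_normal((c, d)))
avg = np.mean([len(s) for s in candidates])
print(f"{n} queries, ~{avg:.0f} candidate keys each (vs {n} under dense attention)")
```

The selection here depends on the content of queries and keys, as Whedon describes, yet no step ever touches all n² pairs; whether SSA does anything like this is an open question.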
According to Whedon, hybrids deliver “a scalar advantage,” while a purely subquadratic mechanism delivers a scaling-law advantage. In its benchmarks, SubQ reported a 7.2× speedup over dense attention at 128K tokens and a 52.2× speedup at 1M.
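Those two numbers are at least internally consistent with linear scaling: if dense attention costs O(n²) and SSA costs O(n), the speedup ratio should itself grow roughly linearly with n.

```python
# Reported speedups vs. context length, from SubQ's benchmarks:
n1, n2 = 128_000, 1_000_000
s1, s2 = 7.2, 52.2
print(n2 / n1)  # 7.8x more context...
print(s2 / s1)  # ...and a 7.25x larger speedup: close to linear growth,
                # as an O(n^2) / O(n) cost ratio would predict
```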
The benchmarks
On RULER at 128K, SubQ scores 97.1 against Claude Opus 4.6’s 94.8. On MRCR v2, its 83.0 tops the best frontier model, GPT-5.5, at 74.0.
On SWE-Bench Verified, SubQ reported 82.4%, beating Claude Opus 4.6’s 81.4% and Gemini 3.1 Pro’s 80.6%. And at 12 million tokens, a context length where no frontier model operates at all, SubQ holds 92.1% on needle-in-a-haystack retrieval.
There are caveats. According to the technical paper, each model was run only once, owing to evaluation costs. The SWE-Bench margin, the paper acknowledges, is small enough that it may not survive repeated runs. And the SubQ model is, by Whedon’s own description, “much smaller than the big labs’” models.
What Subquadratic is shipping now
The company is launching two products in beta: an API that exposes the full 12M-token window and SubQ Code, a CLI agent built on the same model. Both run on neoclouds rather than the major hyperscalers; “they’re too expensive,” says CEO Justin Dangle.
The company is not open-sourcing the weights, but plans to provide tooling so enterprises can do their own post-training work. A 50-million-token context window is targeted for Q4.
There is a cautionary tale here, though. Magic.dev announced a 100M-token context-window model, LTM-2-mini, in August 2024, with a claimed 1,000× efficiency gain, and raised more than $500 million on the strength of it. As of early 2026, there is no public evidence of LTM-2-mini in use outside Magic.
The funding
Subquadratic has raised $29 million so far at a $500 million valuation, from investors including former SoftBank Vision Fund partner Javier Villamizar and Tinder co-founder Justin Mateen. The company was previously called Aldea and worked on speech models before pivoting. The technical problem is real. The category’s track record is the rest of the story.