AI Hit The Memory Wall — Now It Needs A New Context Tier

Presented by Solidigm

As inference workloads evolve from discrete question-answer exchanges to continuous, multi-stage agentic systems, GPU availability is no longer the most significant AI bottleneck. Instead, the bottleneck has shifted from computation to context, says Jeff Harthorn, AI applied research lead at Solidigm.

"Why context management has become a primary bottleneck, more than GPU availability or compute efficiency, is the question of 2026." Harthorn says. "GPUs have become dramatically cheaper per FLOP. Model architectures and inference service engines have all become more efficient. But what has evolved faster than both of them is context. The persistent state between sessions has grown even faster than the context."

Three trends are driving this. Context windows are growing dramatically, making individual inputs much larger than before. Agentic AI systems chain together dozens or hundreds of model calls, each generating state that must be tracked. And enterprises require that inference state persist across sessions for audit, governance, and reuse. These trends compound one another, pushing the volume of context data beyond what any current memory tier is designed to handle.

"Those three things are happening at the same time, all of which is pushing reference data and reference memory into the stratosphere much faster than we’re used to seeing," says Ace Stryker, director of AI and ecosystem marketing at Solidigm.

The emerging answer is a dedicated context tier between GPU memory and bulk network storage: a layer of high-performance, high-density flash designed to hold and serve key-value (KV) caches, the inference data that lets models maintain and reuse context, and to serve retrieval data at inference speed. Nvidia has formalized this architecture under the term CMX. Storage companies including Solidigm are creating SSD products optimized for this workload.

"Storage isn’t the first thing people might think of when they’re planning on building their enterprise infrastructure," Striker says. "In many ways, it was a relatively low cost compared to calculations, and it was a commodity. You just bought for the lowest dollar per gigabyte and called it good. But now, if your storage is not enough, your ROI is affected, and that directly impacts your profits.”

Why does AI inference require a different storage architecture than training?

The storage architecture that AI systems rely on today is largely inherited from the training workflow. Training is sequential and write-intensive, with data being moved from bulk object storage in large blocks. The tiered architecture, with high-bandwidth memory on the GPU, fast NVMe in the servers, and bulk storage on the network, serves that use case reasonably well.

However, inference is a different animal. Its I/O signature is fine-grained, latency-sensitive, and highly stateful. KV cache data and retrieval data each have different access patterns, but both need to be served quickly and reused across interactions. Neither fits neatly within GPU high-bandwidth memory, which is expensive and physically constrained, nor within traditional bulk storage, which was never designed for active inference workloads.

"The architectural gap that’s interesting to me right now isn’t at the top or bottom of the stack, it’s right in the middle," Harthon says. "Whatever sits beneath the GPU HBM is being asked to do things it wasn’t really designed for, which is what makes most interesting systems working today."

The most visible symptom of this gap is recomputation. In inference, the prefill phase processes all the context for a given session before token generation begins. When the KV cache state is no longer available in a fast, accessible tier, the system recomputes it, burning GPU cycles that produce no new value.

"A meaningful portion of GPU cycles eventually goes into re-prefilling," Harthon explains. "During all that computed context, there is potentially computation that is being spent reproducing state rather than performing new tasks. When you start looking at the problem this way, GPU usage starts to look like it’s partly a storage problem."

This reframing is generating renewed interest in a metric borrowed from networking: goodput, or useful tokens per dollar, rather than raw tokens per dollar.
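One way to read that metric, as a rough sketch: count tokens spent re-prefilling context that had already been computed once as waste, and divide only the remainder by cost. The split into "useful" and "recomputed" tokens, the function names, and all the numbers below are assumptions made for illustration, not a definition from Solidigm or Nvidia.

```python
def tokens_per_dollar(total_tokens, cost_dollars):
    """Raw throughput per dollar: every processed token counts."""
    if cost_dollars <= 0:
        raise ValueError("cost_dollars must be positive")
    return total_tokens / cost_dollars


def goodput_per_dollar(total_tokens, recomputed_tokens, cost_dollars):
    """Useful tokens per dollar: recomputed (re-prefilled) tokens are excluded."""
    if not 0 <= recomputed_tokens <= total_tokens:
        raise ValueError("recomputed_tokens must be between 0 and total_tokens")
    return tokens_per_dollar(total_tokens - recomputed_tokens, cost_dollars)


# Hypothetical workload: 1M tokens processed for $2, of which 300k were
# re-prefill of context whose KV cache had been evicted.
raw = tokens_per_dollar(1_000_000, 2.0)
good = goodput_per_dollar(1_000_000, 300_000, 2.0)
print(f"raw: {raw:,.0f} tokens/$   goodput: {good:,.0f} tokens/$")
```

In this made-up case the raw figure is 500,000 tokens per dollar while goodput is 350,000. The gap between the two is the share of spend that went to reproducing state.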

The AI context memory tier and how it works

The industry’s response is taking a structural form. A new tier is emerging between GPU memory and traditional network storage, designed specifically to hold and serve inference context. It is separate from the drives inside the GPU server (G3) and the storage servers on the network (G4), and it is engineered to get context data back to the accelerator as quickly as possible.
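The lookup pattern this implies can be sketched in a few lines: check the fastest tier first, fall through to slower ones, and recompute (re-prefill) only when the context is found nowhere. This is a toy model under stated assumptions, not Nvidia's or Solidigm's implementation. The class, the tier names, and the recompute callback are all hypothetical.

```python
class TieredContextCache:
    """Toy model of a tiered KV cache lookup, fastest tier first."""

    def __init__(self, tier_names, recompute):
        # dicts preserve insertion order, so tiers are searched as listed.
        self.tiers = {name: {} for name in tier_names}
        self.recompute = recompute  # stands in for an expensive prefill pass
        self.recompute_count = 0

    def put(self, tier_name, session_id, kv_state):
        self.tiers[tier_name][session_id] = kv_state

    def get(self, session_id):
        """Return (kv_state, where_it_came_from)."""
        for name, store in self.tiers.items():
            if session_id in store:
                return store[session_id], name
        # Miss in every tier: pay the GPU cost to rebuild the state,
        # then keep it in the fastest tier for the next request.
        self.recompute_count += 1
        kv_state = self.recompute(session_id)
        fastest = next(iter(self.tiers))
        self.tiers[fastest][session_id] = kv_state
        return kv_state, "recomputed"


# Hypothetical tier names, ordered fastest to slowest.
cache = TieredContextCache(
    ["hbm", "local_ssd", "context_tier", "network_storage"],
    recompute=lambda sid: f"kv-for-{sid}",
)
cache.put("context_tier", "s1", "kv1")
print(cache.get("s1"))  # served from the context tier, no GPU recompute
print(cache.get("s2"))  # found nowhere, so it is recomputed
```

A real system would also handle eviction, promotion between tiers, and capacity limits. The point of the sketch is only that every hit in a lower tier is a prefill pass the GPU does not have to repeat.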

"If you’re building a data center in the second half of this year, or early next year, you can’t just think about storage in two places," Striker says. "Storage will need to reside in at least three locations to handle the reference memory level, and this is likely to be a permanent fixture as the infrastructure is built going forward."

This is similar to the emergence of object storage as a category, which did not exist until substantial workloads required it. And once that happened, it developed its own primitives, SLAs, cost models, and an ecosystem of vendors.

"The reference level looks like it might be on a uniform arc," Harthorn says. "That huge pressure is leading to the creation of a category rather than a roadmap from any one vendor."

For infrastructure leaders, this means actively planning for the new tier rather than treating it as optional. Deploying additional NAND at this layer reduces reliance on DRAM, which is more expensive per gigabyte and limited in both availability and thermal headroom.

"In terms of the effectiveness of your investment, if you rely on the SSD layer the way Nvidia is now recommending and scheduling for a lot of use cases, you’re putting less cash into doing it," Striker says.

What flash is needed to support AI inference

Participating meaningfully in the inference stack places new demands on SSD technology. Tail latency, a drive’s worst-case response time, must be predictable, not merely fast. An orchestration system that allocates GPU resources based on expected storage response time cannot tolerate unexpected multi-second delays. Here, consistent, observable performance matters more than peak throughput.
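A small sketch shows why the tail matters more than the typical case. It computes nearest-rank percentiles over a set of read latencies. The samples are invented for illustration (98 fast reads and two multi-second stalls) and are not measurements of any drive.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical read latencies in milliseconds.
latencies_ms = [0.1] * 98 + [2000.0] * 2

print("p50:", percentile(latencies_ms, 50), "ms")
print("p99:", percentile(latencies_ms, 99), "ms")
```

With these made-up numbers the median read takes 0.1 ms, but the 99th percentile is 2,000 ms. A scheduler that plans around the median would stall a GPU on roughly one request in fifty, which is why predictable tail behavior is the requirement rather than a fast average.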

Beyond latency, density becomes a serious concern, especially at hyperscale. In data centers where power, not cost, is the binding constraint, watts per petabyte becomes the operative metric. Floating gate NAND, the manufacturing approach at the core of Solidigm’s products, is well suited to that calculation. Given the tight latency budget of active inference pipelines, network integration through NVMe over Fabrics, RDMA, and eventually CXL support is also required.

"The drive must have reliable performance characteristics, beyond the throughput side and must be able to transfer as much data as quickly as possible, in the way that training requires," Harthon says. "Now it’s about being able to do it very consistently, in a way that is very noticeable to the people who operate and organize these systems."

How enterprise AI leaders should plan for the context tier

The standards, software primitives, and best practices being established now will define how AI inference infrastructure operates for years to come. Solidigm is engaged in that process through standards bodies, partner laboratory collaborations, and published research, which is important because the category is still being formed.

"The interesting question for the next few years is not whether AI infrastructure needs more compute," Harthorn says. "The point is whether he can use what he has more efficiently. Much of that answer passes through the layers that are being built today."

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they are always clearly marked. For more information, contact sales@venturebeat.com.
