
Presented by Solidigm
Liquid cooling is rewriting the rules of AI infrastructure, but most deployments haven’t fully crossed the line. GPUs and CPUs have moved to liquid cooling, while storage remains dependent on airflow, creating an operationally inefficient hybrid architecture.
What looks like a pragmatic transition strategy is, in practice, a structural liability.
“The hybrid cooling approach is an operationally inefficient situation,” explains Hardeep Singh, thermal-mechanical hardware team manager at Solidigm. “You’re paying for and maintaining two completely separate, expensive cooling infrastructures, and you could end up with the worst of both worlds.”
While liquid cooling requires pumps, fluid manifolds and coolant distribution units (CDUs), air-cooled components require CRAC units, cold aisles and evaporative cooling towers. Organizations that go hybrid by simply adding some liquid cooling absorb the cost premium of both infrastructures without achieving the full TCO benefit of either.
Thermal physics makes things worse. Heavy liquid cold plates, thick hoses and manifolds physically impede airflow inside a GPU server chassis. Because server fans cannot push enough air around the liquid plumbing, thermal stress concentrates on the remaining air-cooled components, including storage drives, memory and network cards. The components most dependent on fans end up in the worst thermal environments.
Water consumption is a largely overlooked but equally serious problem. Traditional air-cooled components rely on server fans to transfer heat to ambient air, which is then absorbed by a water loop and pumped to evaporative cooling towers. These systems can consume millions of gallons of water over time. As rack power density continues to climb to support modern AI workloads, the evaporative water penalty becomes, as Singh says, “environmentally and economically unsustainable.”
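To put that claim on a scale, a back-of-envelope sketch helps: evaporating water absorbs roughly 2.26 MJ per kilogram, so a steady heat load maps directly to gallons per year. The figures below are illustrative assumptions, not measurements from any specific facility.

```python
# Back-of-envelope estimate of evaporative cooling water use.
# Assumptions (illustrative): all rejected heat is removed by evaporation,
# latent heat of vaporization ~2.26 MJ/kg, 1 kg of water ~1 liter.

LATENT_HEAT_J_PER_KG = 2.26e6   # energy to evaporate 1 kg of water
LITERS_PER_GALLON = 3.785

def annual_water_gallons(it_load_watts: float) -> float:
    """Gallons of water evaporated per year to reject a steady heat load."""
    kg_per_second = it_load_watts / LATENT_HEAT_J_PER_KG
    liters_per_year = kg_per_second * 3600 * 24 * 365   # 1 kg water ~ 1 L
    return liters_per_year / LITERS_PER_GALLON

print(f"{annual_water_gallons(1e6):,.0f} gallons/year for a 1 MW load")
```

Under these assumptions, a single 1 MW data hall evaporates roughly 3.7 million gallons a year, which is why the penalty compounds so quickly as rack density climbs.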
As AI infrastructure evolves toward liquid-cooled and fanless GPU systems, the real barriers to scale are shifting from compute performance to system-level thermal design. Modern AI platforms are no longer built server by server; they are engineered as tightly integrated rack- and pod-level systems where power delivery, cooling delivery and component placement are inseparable.
In this environment, storage architecture designed for airflow-dependent data centers is becoming a limiting factor. As GPU platforms move to fully shared liquid-cooling domains anchored by rack-level CDUs, every component in the system must operate seamlessly within the same thermal and mechanical design. Storage can no longer rely on isolated cooling paths or bespoke thermal assumptions without introducing inefficiency, complexity, or density trade-offs at the system level.
Why is storage no longer a passive subsystem?
For infrastructure leaders, this marks a fundamental change. Storage is no longer a passive subsystem tethered to compute, but an active participant in system-level cooling, serviceability and GPU utilization. The ability to scale AI now depends on whether storage can be integrated cleanly into liquid-cooled GPU systems without fragmenting the cooling architecture or disrupting the rack-level design.
And the race to scale AI is no longer just about who has the most GPUs, but about who can keep them cool, says Scott Shadley, leadership narrative director and evangelist at Solidigm.
“Finding a way to enable liquid-cooled storage while still making it useful to the user is one of the biggest challenges in designing fanless system solutions,” says Shadley. “As AI workloads evolve, the pressure on storage will become greater.”
Techniques such as KV cache offload, which moves data between GPU memory and high-speed storage during inference, make storage latency and thermal performance directly relevant to model-serving efficiency. In these architectures, a storage subsystem that throttles under thermal load because of poor airflow slows both reads and model serving.
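To make the pattern concrete, here is a minimal sketch of KV cache offload, not any vendor’s implementation: NumPy memory-mapped files stand in for the NVMe tier, and the class name, file path and tensor shapes are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of KV cache offload: evict per-layer key/value tensors
# from a (simulated) in-memory cache to a file-backed store, and map them
# back on demand. On real hardware the file would live on the NVMe tier.

class KVCacheOffloader:
    def __init__(self, path: str, n_layers: int, shape: tuple):
        self.shape = (n_layers, 2, *shape)          # 2 = keys and values
        self.store = np.memmap(path, dtype=np.float16, mode="w+",
                               shape=self.shape)
        self.resident = {}                          # layer -> in-memory KV

    def evict(self, layer: int) -> None:
        """Flush a layer's KV pair from memory to storage."""
        self.store[layer] = self.resident.pop(layer)
        self.store.flush()      # hits the drive: latency and thermals matter

    def fetch(self, layer: int) -> np.ndarray:
        """Load a layer's KV pair back for the next decode step."""
        if layer not in self.resident:
            self.resident[layer] = np.array(self.store[layer])
        return self.resident[layer]

# Usage: cache KV for a toy 4-layer model, evict layer 0, fetch it back.
cache = KVCacheOffloader("kv_cache.bin", n_layers=4, shape=(8, 128, 64))
cache.resident[0] = np.ones((2, 8, 128, 64), dtype=np.float16)
cache.evict(0)
print(cache.fetch(0).shape)   # (2, 8, 128, 64)
```

On a real system, that fetch path becomes an NVMe read, which is why drive latency and thermal throttling show up as token-level stalls during inference.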
Moving toward integrated liquid cooling
Moving from traditional air-cooled GPU servers to integrated liquid-cooled racks improves power usage effectiveness (PUE) and reduces data center operating costs. It also replaces noisy computer room air handlers (CRAHs) with a modern, efficient liquid CDU, with the potential to eliminate the chiller entirely if the racks can be cooled with liquid at 45 °C.
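Because PUE is simply total facility power divided by IT power, the savings are easy to sketch. The numbers below are illustrative assumptions, not measured results from any deployment.

```python
# PUE = total facility power / IT equipment power.
# All figures below are illustrative assumptions, not measured results.

def cooling_overhead_kw(it_load_kw: float, pue: float) -> float:
    """Power spent on everything other than the IT load itself."""
    return it_load_kw * (pue - 1.0)

IT_LOAD_KW = 1_000          # a hypothetical 1 MW AI pod
PUE_AIR = 1.5               # assumed air-cooled facility
PUE_LIQUID = 1.1            # assumed CDU-based liquid-cooled facility

saved_kw = (cooling_overhead_kw(IT_LOAD_KW, PUE_AIR)
            - cooling_overhead_kw(IT_LOAD_KW, PUE_LIQUID))
print(f"Overhead reduced by {saved_kw:.0f} kW on a {IT_LOAD_KW} kW IT load")
# -> Overhead reduced by 400 kW on a 1000 kW IT load
```

Even a modest PUE improvement compounds quickly at AI-pod scale.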
When storage is liquid-cooled and fanless, the design must also support serviceability without any liquid leakage. The transition creates a new requirement that many infrastructure teams are just beginning to grapple with: every component in the rack must operate natively within the same cooling architecture.
Storage as an active participant in system design
Storage design is no longer an isolated engineering problem. It is a direct variable in GPU utilization, system reliability and operational efficiency. The solution is to redesign storage from the ground up for a liquid-cooled, fanless environment, and that is harder than it sounds. Traditional SSD design assumes airflow for thermal management and places components on both sides of a PCB that conducts heat poorly. Neither assumption holds in a CDU-anchored architecture.
“SSDs need to be specifically designed with best-in-class thermal solutions to efficiently conduct heat from internal components and transfer it to fluids,” says Singh. “The design must include a low-resistance path to transfer heat to a single cold plate attached on one side.”
The drive must also support serviceability: insertion and removal without liquid leakage and without deforming the thermal interface between the drive and the cold plate.
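Singh’s “low-resistance path” can be made concrete with a simple series-conduction model: component temperature equals coolant temperature plus dissipated power times the total thermal resistance from component to coolant. The sketch below uses assumed resistance, power and throttle figures, not Solidigm specifications.

```python
# Series thermal-resistance model for a cold-plate-cooled SSD:
# T_component = T_coolant + P * R_total(component -> coolant).
# All numbers are illustrative assumptions, not Solidigm specifications.

T_COOLANT_C = 45.0      # facility liquid temperature from the article
P_DRIVE_W = 20.0        # assumed dissipation of a high-performance SSD
THROTTLE_C = 70.0       # assumed NAND throttle threshold

def component_temp(r_total_c_per_w: float) -> float:
    """Steady-state component temperature for a given total resistance."""
    return T_COOLANT_C + P_DRIVE_W * r_total_c_per_w

# A double-sided layout with a poor path to one cold plate vs. a
# single-sided layout engineered for low-resistance conduction:
for label, r in [("legacy layout", 1.5), ("single-side, low-R design", 0.8)]:
    t = component_temp(r)
    status = "throttles" if t > THROTTLE_C else "within limits"
    print(f"{label}: {t:.0f} °C ({status})")
```

At a 45 °C coolant temperature there is little headroom, which is why the conduction path, not the coolant itself, becomes the design battleground.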
Solidigm has worked with NVIDIA to address SSD liquid-cooling challenges such as hot-swappability and single-side cooling, reducing the thermal footprint of storage within the shared liquid loop and ensuring the GPUs receive their proportionate share of the coolant.
“If the storage is not efficiently designed for the liquid-cooled environment, it will either lose performance or require more liquid volume,” says Singh, “which directly and indirectly leads to underutilization of GPU capacity.”
Alignment on standards and path to interoperability
Solidigm is not working on this alone. The broader industry is coalescing around standards to ensure that liquid-cooled AI systems are interoperable rather than a patchwork of custom solutions. SNIA and the Open Compute Project (OCP) are the primary bodies driving this work.
Solidigm led the industry standard for liquid cooling in SFF-TA-1006 for the E1.S form factor and is an active participant in OCP work streams covering rack design, thermal management and sustainability. Today’s one-off, custom cooling solutions for storage are giving way to standards-aligned, production-ready designs that integrate cleanly into liquid-cooled GPU platforms.
“There are many organizations involved in this work,” says Shadley, who is also a SNIA board member. “They started with component-level solutions, which is largely driven by SNIA and the SFF TA TWG. The next level is solution-level work, which is currently being driven largely by OCP.”
Solidigm’s roadmap is moving forward
The advent of liquid and immersion cooling has changed the design rules for system-level architectures, enabling layouts that were previously impossible and removing long-standing constraints. Shadley says that all-NVMe SSD platforms also remove the box-level design barriers imposed by platter-based HDD solutions.
“We have an active and leading role in roadmap decisions for our products because of our deep technical alignment with the Solidigm customer ecosystem,” he says. “We don’t just make and sell products, we integrate, co-design, co-develop, and innovate with our partners, customers, and their customers.”
Singh adds: “Solidigm’s core strength is innovation and customer-driven system level engineering. It will continue to aggressively pave the way for the adoption of liquid cooling for storage.”
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they are always clearly marked. For more information, contact sales@venturebeat.com.