Many data center providers claim to be ready to support AI. (Including us!) Of course, it’s easy to say and hard to do.
The first challenge is accommodating the ever-rising densities associated with AI deployments. The second is supporting both AI and traditional workloads in the same facility, and transitioning to liquid cooling without significant added expense, operational disruption, or multi-year supply chain delays. The third is supporting next-gen AI deployments even as tenants’ requirements are still evolving.
Challenge 1: With rising density, deployments are becoming too hot to handle with air
We are in the midst of what ASHRAE calls the Power Wars Era. Gone are the days when chip performance improvements could be achieved without increasing power demand. Now, more computation requires more watts. These power wars are driven by the increasingly widespread adoption of power-intensive workloads, which is driven by exponentially rising demand for more data, real-time data consumption, and real-time analytics – particularly as organizations embrace Artificial Intelligence (AI) and Machine Learning (ML) for an increasingly wide range of applications.
As a result, heat flux (a measure of heat flow rate intensity) is rising. More heat is flowing from similarly sized CPU/GPU socket areas. This, as ASHRAE explains it, “puts a strain on traditional data center cooling systems – pushing the limits of air-based cooling.” At some point in the not-distant future, heat fluxes for the most powerful processors will be too high to manage with direct air cooling. Simply put, air is not nearly as effective a heat transfer medium as liquid, and at some point, air is unable to remove all the heat generated by high-power chips, resulting in artificial performance limits or equipment damage.
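To put rough numbers on that comparison, here is a minimal Python sketch. It assumes textbook room-temperature fluid properties, a hypothetical 8 cm² socket area, a hypothetical 50 kW rack, and a 10 K coolant temperature rise; none of these figures come from ASHRAE or from a specific deployment. It computes heat flux as power per unit area, then uses the standard relation Q = ṁ · c_p · ΔT to estimate how much air or water must flow to carry the same heat away.

```python
# Approximate fluid properties near room temperature (textbook values, assumed here).
AIR = {"density_kg_m3": 1.2, "specific_heat_j_kg_k": 1005.0}
WATER = {"density_kg_m3": 997.0, "specific_heat_j_kg_k": 4180.0}


def heat_flux_w_per_cm2(chip_power_w: float, socket_area_cm2: float) -> float:
    """Heat flux: power dissipated per unit of surface area."""
    return chip_power_w / socket_area_cm2


def required_flow_m3_per_s(heat_load_w: float, fluid: dict, temp_rise_k: float) -> float:
    """Volumetric flow needed to remove heat_load_w at a given coolant temperature rise.

    Derived from Q = m_dot * c_p * dT, where m_dot = density * volumetric flow.
    """
    volumetric_heat_capacity = fluid["density_kg_m3"] * fluid["specific_heat_j_kg_k"]
    return heat_load_w / (volumetric_heat_capacity * temp_rise_k)


if __name__ == "__main__":
    # Same (hypothetical) socket area, rising chip power: heat flux rises in step.
    for power_w in (300.0, 700.0, 1000.0):
        flux = heat_flux_w_per_cm2(power_w, socket_area_cm2=8.0)
        print(f"{power_w:.0f} W over 8 cm^2 -> {flux:.1f} W/cm^2")

    rack_load_w = 50_000.0  # hypothetical 50 kW rack
    air_flow = required_flow_m3_per_s(rack_load_w, AIR, temp_rise_k=10.0)
    water_flow = required_flow_m3_per_s(rack_load_w, WATER, temp_rise_k=10.0)
    print(f"air:   {air_flow:.2f} m^3/s")
    print(f"water: {water_flow * 1000:.2f} L/s")
    print(f"air needs roughly {air_flow / water_flow:.0f}x the volume of water")
```

Under these assumptions, air has to move on the order of 3,500 times the volume that water does to carry the same heat. The sketch ignores real-world factors such as cold plate design, airflow paths, and pumping or fan power, so treat it as an order-of-magnitude illustration only.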
The move to liquid cooling has an added benefit: it’s much more sustainable, as we explain in our recent white paper Liquid Cooling is More Sustainable – And It’s Coming to a Data Center Near You. Liquid cooling substantially reduces the amount of energy and MEP infrastructure required, increasing opportunities for economization, CPU efficiency, and even waste heat use. In those ways, liquid cooling has the potential to reduce both Scope 2 emissions (indirect GHG emissions associated with the purchase of electricity) and Scope 3 emissions (indirect GHG emissions from activities within the value chain).
Challenge 2: As providers reconfigure facilities for liquid cooling en masse, MEP becomes the new TP
Most data centers today were configured for air cooling, and few will be able to support liquid cooling without substantial redesign – and the time, money, and operational disruption that come with it. Supporting AI workloads “requires a massive investment in capital and time to overhaul an existing data center or to create these new purpose-built AI data centers,” explained Unisys executive Manju Naglapur in the Wall Street Journal.
Redesigning an air-cooled facility for liquid cooling also sends the data center operator back to the supply chain for new MEP equipment, which is likely an even bigger challenge than the investment of capital and time or the disruption to operations. One can easily imagine a rush to procure MEP equipment and ensuing panic about an inability to procure enough. It could be just like the Great Toilet Paper Panic of 2020, when “perception became reality” and 70% of the nation’s stores (even Amazon) ran out of TP.
Indeed, “The increase in demand has exceeded the capacity of traditional infrastructure deployment methods to efficiently support it, and this will continue to challenge project timelines and milestones well into 2024 and beyond,” said Zech Connell, VP of Program Development with BluePrint Supply Chain, in a recent DCF Executive Roundtable.
At Stream, we don’t worry about battling our grandpa for the last roll of Charmin (so to speak), because we designed a proprietary cooling system that supports air cooling today, and liquid cooling as soon as a customer needs it. In part because CDUs are integrated into our base design, transitioning from air to liquid requires minimal equipment additions, minimal added cost, and minimal disruption to operations.
Stream’s proprietary STU (“Server Thermal Unit”) is our pre-engineered integrated solution for both air and liquid cooling. It's a configurable system that keeps the mechanical blocks out of the white space and doesn’t require off-the-shelf equipment – hence no panic over an impending run on the DLC supply chain. Customers can match any ratio of liquid cooling and air cooling, deploying liquid cooling today and ramping into it as needed.
It’s an approach in line with what hyperscalers are doing in their own designs. Meta, for example, “has reworked its data center design for a hybrid approach combining direct-to-chip water-cooling for GPUs and air cooling for cloud workloads. It also simplified its power distribution, eliminating switchgear that created capacity bottlenecks.”
Challenge 3: Few tenants yet have a precise spec for their next-gen deployments
AI workloads perform optimally at densities of at least 20 kW per rack. But they’re not going to stop there. Density will keep increasing, rapidly and by huge margins, as each new generation of chips is increasingly power-hungry and IT infrastructure leaders are designing ever-denser configurations. According to JLL’s Data Centers 2024 Global Outlook, “The rapid adoption of generative AI will continue to drive the upward trajectory of rack power density. Average rack density has been slowly climbing over the past few years and will see significant jumps in the coming years.”
Last year customers were telling us they needed 50 kW per rack to support AI workloads. Then it was 100 kW per rack. Now some are asking for densities as high as 400 kW per rack. These customers are among the most sophisticated technology companies in the world, but like everyone, they’re still figuring out how to best deploy AI to meet their needs, and their end users’ needs. Manoj Sukumaran, principal analyst for data center compute and networking at Omdia, says a server optimized for running LLMs (such as the DGX H100) consumes about 11 kW of power. Just one server.
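A quick back-of-the-envelope sketch shows what those densities mean in practice. It takes the roughly 11 kW per-server figure quoted above and counts how many such servers fit within a rack's power budget. It is a power-only estimate under stated assumptions: the function name is illustrative, and rack space, networking gear, and cooling overhead are all ignored.

```python
SERVER_POWER_KW = 11.0  # per-server figure quoted above (Omdia, DGX H100-class server)


def servers_per_rack(rack_budget_kw: float, server_power_kw: float = SERVER_POWER_KW) -> int:
    """Whole servers that fit within a rack's power budget.

    Power-only estimate: ignores physical space, networking, and overhead.
    """
    return int(rack_budget_kw // server_power_kw)


if __name__ == "__main__":
    # Rack densities mentioned in this article, in kW per rack.
    for rack_kw in (20, 50, 100, 400):
        count = servers_per_rack(rack_kw)
        drawn_kw = count * SERVER_POWER_KW
        print(f"{rack_kw:>3} kW rack -> {count} server(s), about {drawn_kw:.0f} kW drawn")
```

By this estimate, a 20 kW rack holds a single such server, while a 400 kW rack could power dozens of them. That is why the per-rack number customers ask for keeps climbing.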
For at least the next several years, we expect our customers’ requirements to remain fluid. We will too. During this transition period, our focus will be on delivering data center capacity that evolves to meet our customers’ rapidly changing performance specifications. (It’s an approach we lay out in the new white paper From Build-to-Suit to Build-to-Performance-Spec: Data Center Development in an AI World).
Our customers’ technology infrastructure is changing so fast it’s a hard ask for them to lock in a particular decision two years before a new data center is scheduled to be commissioned. Being able to defer decisions on cooling technology gives our customers the assurance that their data center deployments will support their current and future performance specifications.
Bottom line: A note from Chris Bair
At Stream, we joke that we’ve been fans of liquid cooling since Mike Licitra was an 8-year-old kid on Long Island running through the sprinklers on a hot summer day. Jokes aside, we have been thinking about liquid cooling for a long time. We put that thinking into action to develop STU, our proprietary design that overcomes the big three challenges associated with the transition from air to liquid cooling.
Our other Stu (Lawrence) likes to say we’ve been supporting our customers’ fluid requirements since we named the company Stream way back in ’99. While that’s not actually the origin of our name, it is true that we embrace designs that are fluid and that we think DLC is very ‘cool’. Most importantly, our longstanding relationships with some of the world’s largest (and smartest) users of AI infrastructure ensure that we will continue to evolve to suit ever-changing AI deployments, and whatever else comes next.