Meta Plans Shift to Liquid Cooling for its Data Center Infrastructure

The massive Meta data centers to support the metaverse will feature lots of liquid cooling. At the Open Compute Summit, Meta outlined a roadmap for a gradual shift to a water-cooled infrastructure, using cold plates to provide direct-to-chip cooling for AI workloads.

Rich Miller

Oct. 18, 2022

Rows of equipment in the Meta AI Research SuperCluster (RSC), a new supercomputer to enable new AI models. (Photo: Meta)

Meta’s vision of an immersive metaverse will require powerful hardware to process the artificial intelligence (AI) to create these digital worlds. The massive data centers that support the metaverse will feature lots of liquid cooling.

At today’s Open Compute Summit, Meta introduced a new AI computing platform, along with updates to its Open Rack and a roadmap for a gradual shift to a water-cooled AI infrastructure. The company plans to use cold plates to provide direct-to-chip cooling for AI workloads on its GPU servers, and is preparing several designs for managing the temperature of supply water as rack power densities increase.

“The power trend increases we are seeing, and the need for liquid cooling advances, are forcing us to think differently about all elements of our platform, rack and power, and data center design,” writes Alexis Bjorlin, Meta Vice President for Engineering, in a blog post accompanying her keynote today at the Summit in San Jose. “As we move into the next computing platform, the metaverse, the need for new open innovations to power AI becomes even clearer.”

In her keynote, Bjorlin unveiled several innovations that will advance Meta’s ambitions:

The Grand Teton platform, a next-generation GPU-based hardware platform designed to offer twice the compute power and enhanced memory-bandwidth – along with two times the power envelope of predecessor Meta AI systems.
Open Rack v3, with new features to offer flexibility in how users configure their power and cooling infrastructure, along with longer on-rack backup power.
An early look at the Air-Assisted Liquid Cooling design that will bring chip-level liquid cooling into Meta data centers.

A New Phase for Meta’s Infrastructure

Today’s announcements at the OCP Summit mark the latest evolution in data center design for Meta, which operates more than 40 million square feet of data centers and says it has 47 data centers under construction across its global network.

Due to the scale of its operations, a shift to liquid cooling by Meta is likely to boost demand for advanced cooling in the OCP ecosystem, and perhaps beyond. A large buyer like Meta could give a shot in the arm to liquid cooling, which has been focused on high-performance computing (HPC) and supercomputing. Google has already shifted its AI infrastructure to liquid cooling, while Microsoft is testing immersion cooling in its production data centers.

Earlier this year Meta revealed a new facility to house its Research SuperCluster (RSC), which will likely become the fastest AI system in the world when it is completed later this year. Much of the GPU-powered infrastructure in that system is air-cooled, but the facility’s InfiniBand network uses a liquid-to-liquid cooling distribution unit.

By embracing Air-Assisted Liquid Cooling (AALC), Meta will begin using cold plates to provide direct-to-chip liquid cooling within its existing data hall design, without the need to install a raised floor or piping to deliver water from outside cooling sources. AALC uses a closed-loop cooling system with a rear-door heat exchanger. The cool air from the existing room-level cooling passes through the rear door, cooling the hot water exiting the server. A Reservoir & Pumping Unit (RPU) housed in an adjacent rack keeps the water moving through the cold plates and heat exchanger.

An app from Meta provides a 3D experience of its new data center designs, including its Air-Assisted Liquid Cooling (AALC) implementation. (Image: Meta)

Meta and Microsoft have been working together on prototypes for AALC that could support up to 40kW of power density, which they demonstrated at last year’s OCP Summit. Last fall an AALC rack design was introduced by Delta ICT, which develops OCP designs for hyperscale users.
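To give a rough sense of what a 40kW rack means for a closed water loop, the basic heat balance Q = m × c_p × ΔT can be worked through in a few lines of Python. This is a back-of-the-envelope sketch, not Meta's design data: the temperature rises used here are hypothetical, the water properties are approximate, and it assumes the entire rack load is carried by the liquid loop, which overstates the case for a real rack where some heat still leaves by air.

```python
# Back-of-the-envelope heat balance for a liquid-cooled rack.
# Q = m_dot * c_p * delta_T, solved for the water flow rate.
# All inputs below are illustrative assumptions, not figures from Meta.

WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # approximate value for liquid water
WATER_DENSITY_KG_PER_L = 1.0             # approximate value for liquid water


def required_flow_l_per_min(heat_load_w: float, delta_t_k: float) -> float:
    """Return the water flow (litres/minute) needed to carry heat_load_w
    watts with a supply-to-return temperature rise of delta_t_k kelvin."""
    if heat_load_w < 0:
        raise ValueError("heat_load_w must be non-negative")
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_load_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


if __name__ == "__main__":
    rack_load_w = 40_000.0  # the 40kW density cited for the AALC prototypes
    for delta_t_k in (5.0, 10.0, 15.0):  # hypothetical temperature rises
        flow = required_flow_l_per_min(rack_load_w, delta_t_k)
        print(f"delta_T = {delta_t_k:4.1f} K -> about {flow:5.1f} L/min")
```

The relationship is inversely proportional: halving the allowed temperature rise doubles the flow the pumping unit has to sustain, which is one reason supply water temperature matters as rack densities climb.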

A roadmap released with the blog post indicates that Meta plans to begin a shift to AALC, and expects to see power usage increase as its AI gear adds more power for high-bandwidth memory, which will prompt a shift to a “facility water” strategy as thermal loads exceed the limits of the rear-door heat exchanger. That next phase will likely require the addition of piping to bring chilled water to the rack.

Meta’s strategy allows it to add higher-density workloads within its current data centers, while working out the details of a next-generation design to transition to facility water supplies and the additional infrastructure that will require. Meta did not indicate when it will implement the AALC design in production, how widely the design will be used in its infrastructure, or when it expects to shift to facility water.

A virtual demo of Meta’s data center hardware is available at MetaInfraHardware.com. The tour, which can be viewed through a web interface or Meta Quest VR goggles, provides a visual overview of the components of the AALC rack and how it works.

OCP Open Rack v3

A key component in this roadmap is the Open Rack v3, which was unveiled at today’s event after years of development. The Open Rack v3 (ORV3) design accommodates multiple configurations for both power and cooling, providing a flexible building block for hyperscale deployments.

“The ORV3 ecosystem has been designed to accommodate several different forms of liquid cooling strategies, including air-assisted liquid cooling and facility water cooling,” Bjorlin wrote in the blog post. “The ORV3 ecosystem also includes an optional blind mate liquid cooling interface design, providing dripless connections between the IT gear and the liquid manifold, which allows for easier servicing and installation of the IT gear.”

The Open Rack v3 is designed to bring 48V power to the equipment for higher efficiency, and a taller design that supports the addition of liquid cooling infrastructure.
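The efficiency argument for 48V distribution comes down to Ohm's law: for the same delivered power, a higher voltage means lower current, and resistive loss scales with the square of the current. The sketch below compares 48V against a 12V baseline, which is used here only as an illustrative comparison point. The 12 kW load and 1 milliohm path resistance are hypothetical values chosen to make the arithmetic easy to follow, not Open Rack specifications.

```python
# Illustrative comparison of resistive distribution loss at two bus voltages.
# I = P / V and P_loss = I^2 * R. The load and resistance are hypothetical.


def bus_current_a(power_w: float, voltage_v: float) -> float:
    """Current (amps) needed to deliver power_w watts at voltage_v volts."""
    if voltage_v <= 0:
        raise ValueError("voltage_v must be positive")
    return power_w / voltage_v


def resistive_loss_w(power_w: float, voltage_v: float, resistance_ohm: float) -> float:
    """Power (watts) dissipated in a distribution path of resistance_ohm."""
    current_a = bus_current_a(power_w, voltage_v)
    return current_a * current_a * resistance_ohm


if __name__ == "__main__":
    load_w = 12_000.0     # hypothetical rack load
    path_ohm = 0.001      # hypothetical distribution-path resistance
    for volts in (12.0, 48.0):
        amps = bus_current_a(load_w, volts)
        loss = resistive_loss_w(load_w, volts, path_ohm)
        print(f"{volts:4.0f} V: {amps:6.0f} A, {loss:7.1f} W lost in the path")
```

Whatever load and resistance are plugged in, moving from 12V to 48V cuts the current by a factor of four and the resistive loss by a factor of sixteen, as long as the path resistance is unchanged.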

A diagram of the features of the Open Rack v3. (Source: Meta)

Meta Grand Teton GPU-powered AI Hardware

Meta’s new Grand Teton AI hardware was showcased in today’s presentation by Bjorlin, who previously worked in the silicon operations at Broadcom and Intel.

“We’re excited to announce Grand Teton, our next-generation platform for AI at scale that we’ll contribute to the OCP community,” said Bjorlin. “As with other technologies, we’ve been diligently bringing AI platforms to the OCP community for many years and look forward to continued partnership.”

Grand Teton uses NVIDIA H100 Tensor Core GPUs to train and run AI models that are rapidly growing in their size and capabilities, requiring greater compute. The NVIDIA Hopper architecture, on which the H100 is based, includes a Transformer Engine to accelerate work on these neural networks, which are often called foundation models because they can address an expanding set of applications from natural language processing to healthcare, robotics and more.

“With Meta sharing the H100-powered Grand Teton platform, system builders around the world will soon have access to an open design for hyperscale data center compute infrastructure to supercharge AI across industries,” said Ian Buck, vice president of hyperscale and high performance computing at NVIDIA.

Grand Teton sports 2x the network bandwidth and 4x the bandwidth between host processors and GPU accelerators compared to Meta’s prior Zion system, Meta said.

More details of the Meta OCP announcements are available at the Meta Engineering Blog.

About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
