Meta is optimizing its massive digital infrastructure for artificial intelligence, fine-tuning everything from tiny chips to giant data centers. As part of that shift, Meta confirmed that it will use liquid cooling to support a “significant percentage” of its AI hardware, which will use new ASIC chips designed specifically for AI workloads.
The new chips and data center design were unveiled at Meta's AI Infra @Scale event, where the company outlined a pivot to AI as a driver of its global platform, which includes Facebook, Instagram, WhatsApp and Messenger. Meta also announced an expansion of its Research Supercomputing Cluster and new hardware to accelerate video production and delivery.
"We've been building advanced infrastructure for AI for years now, and this work reflects long-term efforts that will enable even more advances and better use of this technology across everything we do," said Meta CEO Mark Zuckerberg.
Custom chips and liquid cooling weren’t the only new wrinkles in the next-generation data center design. Meta also has streamlined its power distribution to eliminate equipment, and will focus on software-based resiliency that allows it to use fewer backup generators.
“We're reimagining everything we do about IT infrastructure for AI,” said Aparna Ramani, VP of Engineering, Infrastructure at Meta. “We're creating data centers that are specific for AI. We're creating new hardware, including our own silicon. Thousands of engineers are innovating on this large-scale infrastructure that's built specifically for AI.”
Meta says the redesign will help it build faster and cheaper data centers, with expected savings of 31 percent over its current design. The company also plans to use less carbon-intensive materials in construction – including concrete – to keep Meta on track to meet its goals of being water-positive and reaching net-zero emissions by 2030.
Building for Even Greater Scale
In December, Meta decided to overhaul its data center design to optimize its facilities for artificial intelligence, while also pausing construction on a number of its data center projects.
"Meta’s AI compute needs will grow dramatically over the next decade as we break new ground in AI research, ship more cutting-edge AI applications and experiences for our family of apps, and build our long-term vision of the metaverse," writes Santosh Janardhan, Head of Global Infrastructure at Meta.
"We need to plan for roughly 4X scale," said Meta Engineering Director Alan Duong. That's a startling number, given that Meta operates 21 data center campuses around the globe, representing an investment of more than $16 billion and spanning 40 million square feet of space.
Yet Meta says it may double the number of data center buildings between now and 2028 to 160 in total, and will pack more computing horsepower into each next-generation data center with the Meta Training and Inference Accelerator (MTIA), a new custom chip that Meta designed in-house. The MTIA is an ASIC (application-specific integrated circuit), a type of chip that is highly customized for a particular workload. Meta says the MTIA will be twice as efficient as the GPUs (graphics processing units) used in most AI infrastructure.
That extra power will generate more heat, and require new approaches to data center design, which will begin to be deployed in 2025.
“We see a future in which the AI chips are expected to consume more than 5x the power of our typical CPU servers,” said Rachel Peterson, Vice President for Data Center Strategy at Meta. “This has really caused us to rethink the cooling of the data center and provide liquid cooling to the chips in order to manage this level of power."
The Liquid Cooling Roadmap
Last year Meta laid out a roadmap for a gradual shift to a water-cooled AI infrastructure using cold plates to provide direct-to-chip cooling for AI workloads, along with several designs for managing the temperature of supply water as rack power densities increase. Meta described today's sessions at AI Infra @Scale as "an early view into the technical vision" for the next-generation design.
The addition of liquid cooling technologies will happen in two steps. The first phase will feature Air-Assisted Liquid Cooling (AALC), which uses cold plates to provide direct-to-chip liquid cooling within Meta's existing data hall design, without the need to install piping to deliver water from outside cooling sources. AALC uses a closed-loop cooling system with a rear-door heat exchanger. The cool air from the existing room-level cooling passes through the rear door, cooling the hot water exiting the server. An RPU (Reservoir & Pumping Unit) pumping system housed in an adjacent rack keeps the water moving through the cold plates and heat exchanger.
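The sizing of a closed-loop system like AALC ultimately comes down to a basic heat balance: the coolant flow must carry away the rack's heat for a given allowed water temperature rise. The sketch below illustrates that calculation with hypothetical example values (a 30 kW rack and a 10 K temperature rise are assumptions for illustration, not Meta's published figures):

```python
# Illustrative heat-balance sketch for a closed-loop direct-to-chip
# cooling system such as AALC. Rack power and temperature rise below
# are hypothetical example values, not Meta's actual design parameters.

def coolant_flow_rate(rack_power_w: float, delta_t_k: float,
                      specific_heat_j_per_kg_k: float = 4186.0) -> float:
    """Mass flow rate (kg/s) needed to absorb rack_power_w of heat
    with a coolant temperature rise of delta_t_k, from Q = m_dot * c_p * dT."""
    return rack_power_w / (specific_heat_j_per_kg_k * delta_t_k)

# Example: a 30 kW AI rack with a 10 K allowed water temperature rise.
m_dot = coolant_flow_rate(30_000, 10)   # ~0.72 kg/s
liters_per_minute = m_dot * 60          # for water, 1 kg is roughly 1 L
print(f"{m_dot:.2f} kg/s, about {liters_per_minute:.0f} L/min")
```

The same relation explains why denser racks push designs toward liquid: water's specific heat is roughly four times that of air by mass, and its far higher density means a modest pumped flow through cold plates can move heat that would require enormous airflow volumes.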
"Our next-gen data center will not be available until late 2025," said a Meta spokesperson. "In the meantime, we’re deploying AI servers across our fleet of data centers which, by early 2025, will leverage AALC for liquid-to-chip cooling. With a facility water plant in our next-gen data center, we will continue to leverage air cooling, AALC or direct-to-chip technical water distribution as hardware evolves and requires it."
When the next-generation design launches, it will continue to use a slab floor, and house plenty of Meta's conventional CPU-powered servers, along with AALC racks.
But the new design will add racks of custom liquid cooling to support the training of AI models. The videos and images shared by Meta included a new design with racks filled with square chassis with piping entering the front, along with a design in which a processor and cold plate are immersed in coolant fluid.
Duong said the next-generation design can support a "large percentage" of liquid cooling, but it will happen gradually.
"We're going to only deploy a small percentage of liquid-to-chip cooling on day one, and we'll scale it up as we need," said Duong. "This means more complex upfront rack placement and planning. But it allows us to save capital and deploy faster."
Streamlining the Power Chain
Meta is also streamlining elements of its power infrastructure.
"Delivering power infrastructure closer to the server rack will be simpler and more efficient with our new design," said Duong. "We're eliminating as much equipment as possible through our power distribution chain."
That includes reducing switchgear that was creating bottlenecks of capacity. "This allows the server rack to grow in density in the future with minor modifications to our infrastructure, and it continues to allow for greater power utilization," said Duong. "It means that we strand less power and eventually means we build less data centers."
Duong also said Meta will rely more on software-based resiliency rather than equipment redundancy. "This allows us to right-size our physical backup infrastructure, like using fewer diesel generators, saving time and deployment," he said.
There will be some tradeoffs with the new design, Duong said, including balancing the use of power and water, which both factor in sustainability goals.
"Liquid cooling doesn't come for free," said Duong. "We can't just open our windows and rely on free air cooling anymore. We can't keep leveraging evaporation to reject heat because that will continue to be a challenge for us as we go into regions that are water constrained. This means that we'll be using a little bit more power to cool our equipment, but on the flip side, we'll reduce our water consumption."
Meta said the next-generation design will be used in future data center builds, beginning this year.
"We don’t have any immediate plans to retrofit existing campuses to meet the requirements of our next-gen data center, however, we will continue to provide optionality to enable evolution in AI technology and hardware," said a Meta spokesperson.
More Change Ahead
The clear theme from Meta's presentation was that AI will be disruptive to many things, especially digital infrastructure.
"We're in the middle of a pivot to the next age of information," said Alexis Bjorlin, VP of Engineering Infrastructure at Meta. "AI workloads are growing at a pace of 1,000x every two years."
"As we look to the future, the generative AI workloads and models are much more complex," she added. "They require a much larger scale. Whereas traditional AI workloads may be run on tens or hundreds of GPUs at a time, the generative AI workloads are being run on thousands, if not more."
"This is a really rapidly evolving space," said Peterson. "We're going to continue to innovate on our design, and really continue to think about how we can support the business."