New Meta Supercomputer Boosts Power for AI Workloads, Future Metaverse

Jan. 24, 2022
Meta says its new AI supercomputer will boost its AI capabilities and its vision for a future digital metaverse.

Meta has built a new supercomputer that will likely become the fastest AI system in the world when it is completed later this year, the company said today. The new Research SuperCluster (RSC) is already being used to train large models for natural language processing and computer vision, technologies that have broad application today and will be important for Meta’s vision for a future digital metaverse.

“The experiences we’re building for the metaverse require enormous compute power (quintillions of operations / second!),” said Meta founder and CEO Mark Zuckerberg. “RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more.”

The RSC represents the next phase of digital infrastructure for Meta, which currently operates 18 data center campuses around the globe to support its Facebook, Instagram and Messenger services. These data centers represent an investment of more than $16 billion and span 40 million square feet of space.

The RSC currently features 760 NVIDIA DGX A100 systems for its compute nodes, with a total of 6,080 GPUs housed in more than 500 racks of equipment. The system is up and running now, but will continue to be expanded until it reaches 16,000 GPUs in 1,200 racks later this year, which will increase AI training performance by more than 2.5 times. When that goal is reached, Meta believes the RSC will be the fastest AI supercomputing system in the world.
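The GPU figures in this paragraph can be cross-checked with some quick arithmetic. The sketch below is illustrative only: the GPUs-per-system value is derived from the quoted totals rather than stated in the article, and the GPU ratio is a rough proxy for scale, not a measure of training performance, which also depends on the network and storage.

```python
# Figures quoted in the article.
DGX_SYSTEMS_NOW = 760
GPUS_NOW = 6_080
GPUS_TARGET = 16_000

# Derived (not stated in the article): GPUs per DGX A100 system.
gpus_per_system = GPUS_NOW // DGX_SYSTEMS_NOW

# Assumption: the build-out keeps using the same system type.
systems_target = GPUS_TARGET // gpus_per_system

# Ratio of planned to current GPU count; a proxy for scale only.
gpu_ratio = GPUS_TARGET / GPUS_NOW

print(f"GPUs per system: {gpus_per_system}")
print(f"Systems at full build-out: {systems_target}")
print(f"GPU count ratio: {gpu_ratio:.2f}x")
```

The ratio of GPU counts comes out a little above 2.6, which is in line with the "more than 2.5 times" performance figure Meta cites.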

Paving the Way for a New Computing Platform

“We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together,” Meta said in a blog post by Technical Program Manager Kevin Lee and Software Engineer Shubho Sengupta. “Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role.”

In artificial intelligence (AI), computers are assembled into neural networks that emulate the learning process of the human brain to solve new challenges. It’s a process that requires lots of computing horsepower, which is why the leading players in the field have moved beyond traditional CPU-driven servers. A CPU consists of a few cores optimized for sequential processing, while a GPU has a parallel architecture consisting of hundreds or even thousands of smaller cores designed for handling multiple tasks simultaneously.
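The contrast between sequential and parallel processing can be sketched in a few lines of Python. This is an analogy running on CPU threads, not GPU code, and the `scale`, `run_sequential` and `run_parallel` names are hypothetical. It shows only the shape of the work: when each element can be processed independently, the job can be split across many workers instead of being handled one item at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(x: float) -> float:
    """Independent per-element work: no element depends on another."""
    return 2.0 * x + 1.0


def run_sequential(values):
    # A single worker visits the elements one after another.
    return [scale(v) for v in values]


def run_parallel(values, workers=4):
    # Several workers share the elements; map() preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale, values))


values = [float(i) for i in range(16)]
# Both strategies compute the same result; only the scheduling differs.
assert run_sequential(values) == run_parallel(values)
```

A GPU applies the same idea at far larger scale, with thousands of cores working on the elements of a neural network's matrices at once.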

The NVIDIA DGX A100 is the latest version of the GPU-powered “supercomputer in a box.” Each DGX system takes up about 6 rack units (RU) of space, and Meta is deploying two DGX systems in each 40RU rack, keeping the racks at a manageable power density and leaving adequate space to cool the systems.
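The rack layout described above works out as follows. This is a rough sketch using only the figures quoted in the article. The 6 RU height is approximate, and the count of compute racks assumes every DGX system is racked two to a rack.

```python
# Figures quoted in the article.
RACK_UNITS_PER_DGX = 6   # "about 6 rack units" per system
DGX_PER_RACK = 2
RACK_HEIGHT_RU = 40
DGX_SYSTEMS_NOW = 760

# Rack units occupied by DGX systems, and what is left in each rack.
used_ru = RACK_UNITS_PER_DGX * DGX_PER_RACK
free_ru = RACK_HEIGHT_RU - used_ru
fill_fraction = used_ru / RACK_HEIGHT_RU

# Assumption: all current systems are deployed two per rack.
compute_racks_now = DGX_SYSTEMS_NOW // DGX_PER_RACK

print(f"DGX systems occupy {used_ru} of {RACK_HEIGHT_RU} RU ({fill_fraction:.0%})")
print(f"Rack units left per rack: {free_ru}")
print(f"Compute racks for {DGX_SYSTEMS_NOW} systems: {compute_racks_now}")
```

Under these assumptions the DGX systems fill less than a third of each rack, and the compute nodes account for 380 of the more than 500 racks Meta reports. The article does not break down how the remaining racks are used.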

The DGX compute nodes are connected by an NVIDIA Quantum 200 Gb/s InfiniBand networking fabric that will expand to support 16,000 ports in a two-layer topology with no oversubscription. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade. The InfiniBand network uses a liquid-to-liquid cooling distribution unit, while all the other equipment in the RSC is air cooled. Meta is not sharing the location of the RSC facility, but it will exchange data with an existing Meta data center campus.

Meta’s AI research team has been building these high-powered systems for many years. The first generation of this infrastructure, designed in 2017, has 22,000 NVIDIA V100 Tensor Core GPUs in a single cluster that performs 35,000 training jobs a day. Up until now, this infrastructure has set the bar for Meta’s researchers in terms of its performance, reliability, and productivity.

“In early 2020, we decided the best way to accelerate progress was to design a new computing infrastructure from a clean slate to take advantage of new GPU and network fabric technology,” Meta said.

“We expect such a step function change in compute capability to enable us not only to create more accurate AI models for our existing services, but also to enable completely new user experiences, especially in the metaverse,” the Meta team added. “Our long-term investments in self-supervised learning and in building next-generation AI infrastructure with RSC are helping us create the foundational technologies that will power the metaverse and advance the broader AI community as well.”

A room level view of the Meta AI Research SuperCluster. (Image: Meta)

Read the Meta blog post for additional details.

About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
