Meta has built a new supercomputer that will likely become the fastest AI system in the world when it is completed later this year, the company said today. The new Research SuperCluster (RSC) is already being used to train large models for natural language processing and computer vision, technologies that have broad application today and will be important for Meta’s vision for a future digital metaverse.
“The experiences we’re building for the metaverse require enormous compute power (quintillions of operations / second!),” said Meta founder and CEO Mark Zuckerberg. “RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more.”
The RSC represents the next phase of digital infrastructure for Meta, which currently operates 18 data center campuses around the globe to support its Facebook, Instagram and Messenger services. These data centers represent an investment of more than $16 billion and span 40 million square feet of space.
The RSC currently features 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs housed in more than 500 racks of equipment. The system is up and running now, but will continue to be expanded until it reaches 16,000 GPUs in 1,200 racks later this year, which will increase AI training performance by more than 2.5 times. When that goal is reached, Meta believes the RSC will be the fastest AI supercomputing system in the world.
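The GPU figures above can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes the standard DGX A100 configuration of eight A100 GPUs per system (a detail not stated in the article), and the function names are hypothetical. Note that the ratio of GPU counts is not the same thing as the training-performance gain Meta cites, although the two numbers are in the same range.

```python
# Figures quoted in the article.
DGX_SYSTEMS = 760
TARGET_GPUS = 16_000

# Assumption: each DGX A100 system holds eight A100 GPUs
# (the standard configuration; not stated in the article).
GPUS_PER_DGX = 8


def current_gpus(systems: int = DGX_SYSTEMS, per_system: int = GPUS_PER_DGX) -> int:
    """Total GPUs in the cluster as deployed today."""
    return systems * per_system


def gpu_scale_factor(target: int = TARGET_GPUS) -> float:
    """Ratio of the planned GPU count to the current GPU count."""
    return target / current_gpus()


if __name__ == "__main__":
    print(current_gpus())                  # 6080
    print(round(gpu_scale_factor(), 2))    # 2.63
```

Under that assumption, the planned build-out multiplies the GPU count by roughly 2.6, which is consistent with the "more than 2.5 times" performance figure.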
Paving the Way for a New Computing Platform
“We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together,” Meta said in a blog post by Technical Program Manager Kevin Lee and Software Engineer Shubho Sengupta. “Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role.”
This video provides an overview of the RSC and its operations.
In artificial intelligence (AI), computers are assembled into neural networks that emulate the learning process of the human brain to solve new challenges. It’s a process that requires lots of computing horsepower, which is why the leading players in the field have moved beyond traditional CPU-driven servers. A CPU consists of a few cores optimized for sequential processing, while a GPU has a parallel architecture consisting of hundreds or even thousands of smaller cores designed for handling multiple tasks simultaneously.
The NVIDIA DGX A100 is the latest version of the GPU-powered “supercomputer in a box.” Each DGX system takes up about 6 rack units (RU) of space, and Meta is deploying two DGX systems in each 40RU rack, keeping the racks at a manageable power density and leaving adequate space to cool the systems.
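The rack layout described above implies a fairly low fill rate per rack. The following sketch works through the numbers; it again assumes eight GPUs per DGX system, and the helper names are hypothetical. It counts compute racks only, so any difference from the rack totals quoted in the article would presumably be storage, networking and other equipment, which is an inference rather than something Meta has stated.

```python
import math

# Figures quoted in the article.
RACK_HEIGHT_RU = 40
DGX_HEIGHT_RU = 6      # "about 6 rack units" per DGX system
DGX_PER_RACK = 2

# Assumption: eight GPUs per DGX A100 system (not stated in the article).
GPUS_PER_DGX = 8


def occupied_fraction() -> float:
    """Share of each rack's height taken up by DGX systems."""
    return (DGX_PER_RACK * DGX_HEIGHT_RU) / RACK_HEIGHT_RU


def compute_racks(gpus: int) -> int:
    """Racks needed to house the DGX systems for a given GPU count."""
    systems = math.ceil(gpus / GPUS_PER_DGX)
    return math.ceil(systems / DGX_PER_RACK)


if __name__ == "__main__":
    print(occupied_fraction())      # 0.3
    print(compute_racks(6_080))     # 380
    print(compute_racks(16_000))    # 1000
```

In other words, the DGX systems fill about 30 percent of each rack's height, and the full 16,000-GPU build-out would need about 1,000 compute racks out of the 1,200 racks planned.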
The DGX compute nodes are connected by an NVIDIA Quantum 200 Gb/s InfiniBand networking fabric that will expand to support 16,000 ports in a two-layer topology with no oversubscription. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade. The InfiniBand network uses a liquid-to-liquid cooling distribution unit, while all the other equipment in the RSC is air cooled. Meta is not sharing the location of the RSC facility, but it will exchange data with an existing Meta data center campus, as seen in this diagram.
Meta’s AI research team has been building these high-powered systems for many years. The first generation of this infrastructure, designed in 2017, has 22,000 NVIDIA V100 Tensor Core GPUs in a single cluster that performs 35,000 training jobs a day. Up until now, this infrastructure has set the bar for Meta’s researchers in terms of its performance, reliability, and productivity.
“In early 2020, we decided the best way to accelerate progress was to design a new computing infrastructure from a clean slate to take advantage of new GPU and network fabric technology,” Meta said.
“We expect such a step function change in compute capability to enable us not only to create more accurate AI models for our existing services, but also to enable completely new user experiences, especially in the metaverse,” the Meta team added. “Our long-term investments in self-supervised learning and in building next-generation AI infrastructure with RSC are helping us create the foundational technologies that will power the metaverse and advance the broader AI community as well.”