With Big Basin, Facebook Beefs Up its AI Hardware

March 9, 2017
Facebook is beefing up its high performance computing horsepower with Big Basin, an AI server powered by eight NVIDIA GPU accelerators. Big Basin was introduced at today’s Open Compute Summit.

SANTA CLARA, Calif. – Facebook is beefing up its high performance computing horsepower, enhancing its use of artificial intelligence to personalize your news feed.

Facebook introduced brawny new hardware to power its AI workloads today at the 2017 Open Compute Summit at the Santa Clara Convention Center. Known as Big Basin, the unit brings more memory to its GPU-powered data crunching. It’s a beefier successor to Big Sur, the first-generation Facebook AI server unveiled last July.

“With Big Basin, we can train machine learning models that are 30 percent larger because of the availability of greater arithmetic throughput and a memory increase from 12 GB to 16 GB,” said Kevin Lee, a Technical Program Manager at Facebook. “This enables our researchers and engineers to move more quickly in developing increasingly complex AI models that aim to help Facebook further understand text, photos, and videos on our platforms and make better predictions based on this content.”
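The quoted figure is easy to sanity-check. A rough sketch, assuming model size scales roughly linearly with available GPU memory:

```python
# Back-of-envelope check of the quoted figure. Assumption: trainable model
# size scales roughly linearly with per-GPU memory (12 GB on Big Sur's
# Tesla M40s vs. 16 GB on Big Basin's Tesla P100s).
old_mem_gb = 12
new_mem_gb = 16
growth = new_mem_gb / old_mem_gb - 1
print(f"{growth:.0%}")  # → 33%, in line with the "30 percent larger" claim
```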

Making Your Newsfeed Smarter

Big Sur and Big Basin play important roles in Facebook’s bid to create a smarter newsfeed for its 1.9 billion users around the globe. With this hardware, Facebook can train its machine learning systems to recognize speech, understand the content of video and images, and translate content from one language to another.

As leading tech companies push the boundaries of machine learning, they often follow a do-it-yourself approach to their HPC hardware. Google, Apple and Amazon have also created research labs to pursue faster and better AI capabilities. They have taken different approaches to hardware, with Google opting for custom ASICs (application specific integrated circuits) for its machine learning operations.

Facebook has chosen to use NVIDIA graphics processing units (GPUs) for its machine learning hardware. Facebook has been designing its own hardware for many years, and in preparing to upgrade Big Sur, the Facebook engineering team gathered feedback from colleagues in Applied Machine Learning (AML), Facebook AI Research (FAIR), and infrastructure teams.

The Power of Disaggregation

For Big Basin, Facebook collaborated with QCT (Quanta Cloud Technology), one of the original design manufacturers (ODMs) that works closely with the Open Compute community. Big Basin features eight NVIDIA Tesla P100 GPU accelerators, connected using NVIDIA NVLink to form an eight-GPU hybrid cube mesh — similar to the architecture used by NVIDIA’s DGX-1 “supercomputer in a box.”

Big Basin features eight NVIDIA Tesla P100 GPU accelerators. It’s the successor to Big Sur, which used an earlier version of NVIDIA’s GPU technology. (Photo: Facebook)
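The eight-GPU hybrid cube mesh can be sketched as a small graph. This is an assumption based on NVIDIA's published DGX-1 topology (which Big Basin resembles), not Facebook's exact wiring: GPUs 0-3 and 4-7 each form a fully connected quad, and each GPU has one NVLink to its counterpart in the other quad.

```python
from itertools import combinations

def hybrid_cube_mesh():
    """Build the NVLink link set for a DGX-1-style hybrid cube mesh.

    Assumed topology: two fully connected quads (GPUs 0-3 and 4-7),
    plus one cross-quad link per GPU (0-4, 1-5, 2-6, 3-7).
    """
    links = set()
    for quad in ([0, 1, 2, 3], [4, 5, 6, 7]):
        links.update(combinations(quad, 2))      # 6 links inside each quad
    links.update((g, g + 4) for g in range(4))   # 4 cross-quad links
    return links

links = hybrid_cube_mesh()
# Each Tesla P100 exposes four NVLink ports, so every GPU should have degree 4.
degree = {g: sum(g in link for link in links) for g in range(8)}
print(len(links), degree)  # 16 links, each of the 8 GPUs with exactly 4
```

The geometry explains the "hybrid cube" name: the eight GPUs sit at the corners of a cube, with the face diagonals on two opposite faces added to make each quad fully connected.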

Big Basin offers an example of one of the key principles guiding Facebook’s hardware design – disaggregation. Key components are built using a modular design, separating the CPU compute from the GPUs, making it easier to integrate components as new technology emerges.

“For the Big Basin deployment, we are connecting our Facebook-designed, third-generation compute server as a separate building block from the Big Basin unit, enabling us to scale each component independently,” Lee writes in a blog post announcing Big Basin. “The GPU tray in the Big Basin system can be swapped out for future upgrades and changes to the accelerators and interconnects.”

Flexibility and Faster Upgrades

Big Basin is split into three main sections: the accelerator tray, the inner chassis, and the outer chassis. The disaggregated design allows the GPUs to be positioned directly in front of the cool air being drawn into the system, removing preheat from other components and improving the overall thermal efficiency of Big Basin.

There are multiple advantages to this disaggregated design, according to Eran Tal, an engineering manager at Facebook.

“The concept is breaking down and separating components, and creating the ability to select what solution you want at different levels of hardware,” said Tal. “It gives you a lot of flexibility in addressing design with a fast-changing workload. You can never know what you will need tomorrow.

“You’re maximizing efficiency and flexibility,” he added.

Two New Server Models

Facebook also introduced two new server designs, each representing the next generation of existing OCP designs.

  • Tioga Pass is the successor to Leopard, which is used for a variety of compute services at Facebook. Tioga Pass has a dual-socket motherboard, which uses the same 6.5” by 20” form factor and supports both single-sided and double-sided designs. The double-sided design, with DIMMs on both PCB sides, allows Facebook to maximize memory capacity. The flexible design allows Tioga Pass to serve as the head node for both the Big Basin JBOG (Just a Bunch of GPUs) and Lightning JBOF (Just a Bunch of Flash). This doubles the available PCIe bandwidth when accessing either GPUs or flash.
  • Yosemite v2 is a refresh of Facebook’s Yosemite multi-node compute platform. The new server includes four server cards. Unlike Yosemite, the new power design supports hot service — servers can continue to operate and don’t need to be powered down when the sled is pulled out of the chassis for components to be serviced. With the previous design, repairing a single server cut off access to the other three, since all four servers lose power together.
About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
