SAN JOSE, Calif. – Facebook is creating the next generation of open hardware, building new technologies into its data center platform. The social network is leveraging an alphabet soup of powerful technologies – including SSDs, GPUs, NVM and JBOFs – to build new servers and storage gear to accelerate its infrastructure.
These upgrades are part of Facebook’s vision to create a network of powerful data centers that will push the boundaries of delivering services over the Internet.
“Over the next decade, we’re going to build experiences that rely more on technology like artificial intelligence and virtual reality,” said Facebook CEO Mark Zuckerberg. “These will require a lot more computing power, and through efforts like the Open Compute Project, we’re developing a global infrastructure to enable everyone to enjoy them.”
Facebook discussed its progress Wednesday at the Open Compute Summit, which brought together the growing community of open source hardware hackers who are building on designs that started life in Facebook’s data centers. It showed off a number of updates to its infrastructure. These include:
- A retooled server form factor to pack more performance into the same power footprint.
- New servers for high-performance data crunching, powered by graphics processing units (GPUs) rather than CPUs.
- An evolved storage sled, in which the original JBOD (“just a bunch of disks”) has become a much faster JBOF (“Just a Bunch of Flash”).
- An experiment with advances in non-volatile memory (NVM) to provide more options for storage tiering.
The summit marked the fifth anniversary for the Open Compute Project, which prompted reflection on how far OCP has come since 2011, when it was founded to innovate upon designs released by Facebook.
“It’s remarkable to see where we are today,” said Jason Taylor, chairman of the Open Compute Project, and also a VP of Infrastructure at Facebook. “OCP is where engineers can get together to build amazing things.
“I feel a tremendous sense of momentum, as we’ve moved beyond hyperscale and into finance and telecom,” he said.
Servers: Next-Generation Design
Facebook has totally retooled its server design and infrastructure, shifting from its traditional two-processor server to a system-on-chip (SoC) based on a single Intel Xeon-D processor that uses less power and solves several architectural challenges.
The Mono Lake server boards are housed in a new enclosure called Yosemite, which holds four SoCs in each sled chassis. Facebook engineers Vijay Rao and Edwin Smith described the new design on the Facebook Engineering Blog.
“We worked closely with (Intel) on the design of a new processor, and in parallel redesigned our server infrastructure to create a system that would meet our needs and be widely adoptable by the rest of the industry,” they wrote. “The result was a one-processor server with lower-power CPUs, which worked better than the two-processor server for our web workload and is better suited overall to data center workloads … At the same time, we redesigned our server infrastructure to accommodate double the number of CPUs per rack within the same power infrastructure.”
The new design streamlines communication between processors, and between the processors and memory.
“We minimized the CPU to exactly what we required,” the Facebook engineers reported. “We took out the QPI (Quick Path Interconnect, an Intel point-to-point processor interconnect) links, which reduced costs for Intel and removed the NUMA (Non-Uniform Memory Access) problem for us, given that all servers would be one-socket-based. We designed for it to be a system-on-a-chip (SOC), which integrates the chipset, thus creating a simpler design. This single-socket CPU also has a lower thermal design power (TDP). At the same time, we redesigned our server infrastructure to accommodate double the number of CPUs per rack within the same power infrastructure.”
This allowed Facebook to create a server infrastructure that could pack far more performance into each rack, while remaining under the designed rack power density of 11 kW per cabinet.
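As a rough illustration of that power constraint, the sketch below divides the 11 kW rack budget evenly across the CPUs in a rack. The CPU counts are hypothetical, chosen only to show the effect of doubling. The result is each CPU's share of whole-rack power (including its slice of memory, fans and other components), not the TDP of the chip itself.

```python
RACK_POWER_BUDGET_W = 11_000  # designed rack power density cited above (11 kW)


def power_share_per_cpu(cpus_per_rack: int,
                        rack_budget_w: float = RACK_POWER_BUDGET_W) -> float:
    """Return the average rack power available per CPU, in watts.

    This is a whole-rack share (the CPU plus its slice of memory, fans, etc.),
    not the thermal design power of the chip.
    """
    if cpus_per_rack <= 0:
        raise ValueError("cpus_per_rack must be positive")
    return rack_budget_w / cpus_per_rack


# Hypothetical counts, for illustration only (not Facebook's actual figures).
baseline_cpus = 60
doubled_cpus = 2 * baseline_cpus

baseline_share = power_share_per_cpu(baseline_cpus)
doubled_share = power_share_per_cpu(doubled_cpus)
print(f"{baseline_cpus} CPUs: {baseline_share:.1f} W per CPU")
print(f"{doubled_cpus} CPUs: {doubled_share:.1f} W per CPU")
```

Whatever the real counts are, doubling the CPUs under a fixed rack budget halves the average power available to each one, which is why the lower-TDP single-socket part matters to the design.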
Beefier Servers for AI Data-Crunching
Facebook shared an update on its use of GPUs, which in recent years have played a major role in high performance computing. GPUs were initially used to accelerate the performance of desktop PCs to handle graphics, but are now helping accelerate workloads for some of the world’s most powerful supercomputers.
Facebook is using GPUs to bring more horsepower to bear on data-crunching for its artificial intelligence (AI) and machine learning platform. Facebook’s AI Lab trains neural networks (computers that emulate the learning process of the human brain) to solve new challenges. This requires lots of computing horsepower.
“We’ve been investing a lot in our artificial intelligence technology,” said Jay Parikh, Global Head of Engineering and Infrastructure for Facebook. “AI is now powering things like your Newsfeed. It is helping us serve better ads. It is also helping make the site safer for people that use Facebook on a daily basis.”
The Big Sur system leverages NVIDIA’s Tesla Accelerated Computing Platform, with eight high-performance GPUs of up to 300 watts each, with the flexibility to configure between multiple PCI-e connections. Facebook has optimized these new servers for thermal and power efficiency, allowing the company to operate them in its data centers alongside standard CPU-powered servers.
The gains in performance and latency provided by Big Sur help Facebook process more data, dramatically shortening the time needed to train its neural networks.
“It is a significant improvement in performance,” said Parikh. “We’ve deployed thousands of these machines in a matter of months. It gives us the ability to drive this technology into more product use cases within the company.”
Storage: Just a Bunch of Flash
Facebook has used Flash for many years to accelerate server boot drives and caching. As its infrastructure has continued to scale, it has created a new “building block” to integrate more Flash into its operations. Facebook has adapted its initial Open Compute storage sled, known as Knox, and substituted solid state drives (SSDs) for the hard disk drives (HDDs) – transforming the “Just a Bunch of Disks” storage unit to “Just a Bunch of Flash” (JBOF).
Facebook has worked with Intel to develop the new JBOF unit, called Lightning, reflecting the speed gained through the use of NVM Express (NVMe), a high-speed PCI Express interface that’s been optimized for SSDs. Here’s a look at the specs in a slide from Parikh’s presentation at the Open Compute Summit.
As a disaggregated storage appliance, Lightning can support a variety of different applications. “It brings a new building block in the form of high-performance storage for the applications we’re building,” said Parikh.
Parikh said there will be more storage innovation ahead, particularly in using non-volatile memory (NVM) in new ways.
“In the storage industry, disk drives are getting bigger, but they’re not getting more reliable, latency isn’t getting any better, and IOPS (input/output operations per second) isn’t improving,” said Parikh. “Flash is also getting slightly better, but endurance is not improving that dramatically. We’re really stuck with this paradigm where things are scaling out and getting bigger, but from a performance perspective, we’re not getting what we actually need.”
Facebook sees a potential answer in new NVM implementations, especially the 3D XPoint technology developed by Intel and Micron. Parikh called on the Open Compute community to focus on this technology as a worthwhile solution to current storage challenges.
“We can start to think about our storage problems, and spread that (storage) across many more tiers that give us more price and performance levers to scale out things for performance, or capacity, or optimizing on price,” said Parikh, who said NVM offered an attractive option between DRAM and NAND (Flash).
Facebook is test-driving its NVM configurations with an open source project called MyRocks, which is built atop MySQL and RocksDB database technologies.
The Road Ahead: Scaling for the Data Deluge to Come
Facebook’s relentless push to build a faster and more powerful infrastructure is driven by the growth of its audience, which now includes 1.6 billion users on Facebook, 1 billion on WhatsApp, 800 million on Facebook Messenger, and 400 million using Instagram. The company’s ambitions are also powered by Zuckerberg’s embrace of virtual reality, reflected in the $2 billion acquisition of VR pioneer Oculus.
Virtual reality can deliver immersive 3D experiences, and many analysts believe the technology is nearly ready for prime time. Zuckerberg believes Facebook can deliver its social network as a virtual reality experience.
“Pretty soon we’re going to live in a world where everyone has the power to share and experience whole scenes as if you’re just there, right there in person,” Zuckerberg said at the recent Mobile World Congress. “Imagine being able to sit in front of a campfire and hang out with friends anytime you want. Or being able to watch a movie in a private theater with your friends anytime you want. Imagine holding a group meeting or event anywhere in the world that you want. All these things are going to be possible. And that’s why Facebook is investing so much early on in virtual reality, so we can hope to deliver these types of social experiences.”
That will require a LOT of infrastructure. Full VR video files can be up to 20 times larger than the size of today’s HD video files.
“The file sizes are so large they can be an impediment to delivering 360 video or VR in a quality manner at scale,” write Facebook’s Evgeny Kuzakov and David Pio, who recently outlined Facebook’s progress on encoding and compression technologies for virtual reality files. Facebook is moving from equirectangular layouts to a cube format in 360 video, reducing file sizes by 25 percent.
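To put those two figures together, here is a back-of-the-envelope sketch. It assumes, as a simplification not stated in the source, that the 25 percent cube-format saving applies directly to a VR file at the 20x upper bound. The 1 GB HD file is a hypothetical input.

```python
VR_TO_HD_MAX_RATIO = 20    # "up to 20 times larger" than HD, per the article
CUBE_FORMAT_SAVING = 0.25  # 25 percent reduction from the cube layout


def vr_file_size_gb(hd_size_gb: float, use_cube_format: bool = False) -> float:
    """Estimate an upper-bound VR file size (GB) from an HD file size (GB)."""
    size = hd_size_gb * VR_TO_HD_MAX_RATIO
    if use_cube_format:
        # Assumption: the saving scales the whole file uniformly.
        size *= 1 - CUBE_FORMAT_SAVING
    return size


hd_size_gb = 1.0  # hypothetical HD video file
equirect = vr_file_size_gb(hd_size_gb)
cube = vr_file_size_gb(hd_size_gb, use_cube_format=True)
print(f"equirectangular: {equirect:.1f} GB, cube: {cube:.1f} GB")
```

Under these assumptions the cube layout still leaves a file roughly 15 times the size of its HD counterpart, so compression alone does not close the gap.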
But Facebook realizes that real-time delivery of virtual reality will require faster networks, and that it can’t get there alone. Following the Open Compute model, Facebook has created the Telecom Infra Project, teaming with Equinix, Intel, Nokia, SK Telecom and T-Mobile/Deutsche Telekom to develop new 5G technologies to accelerate global networks.
“Scaling traditional telecom infrastructure to meet the global data challenge (of video and virtual reality) is not moving as fast as people need it to,” said Parikh. “Driving a faster pace of innovation in telecom infrastructure is necessary to meet these new technology challenges and to unlock new opportunities.”