Uptime: Longer Data Center Outages Are Becoming More Common

June 8, 2022
The frequency of data center downtime hasn’t changed significantly, but outages are becoming longer and more expensive, according to new research from The Uptime Institute, which reported a “disturbing increase” in outages longer than 24 hours.

Uptime is always the prime directive for data centers. As the world recovers from the COVID-19 pandemic, reliable digital infrastructure is more important than ever in keeping the economy connected.

So how are things going? The frequency of data center downtime hasn’t changed significantly, but outages are becoming longer and more expensive, according to new research from The Uptime Institute. The key findings:

  • Prolonged downtime is becoming more common in publicly reported outages. The gap between the beginning of a major public outage and full recovery has stretched significantly over the last five years, with nearly 30% of these outages in 2021 lasting more than 24 hours, which Uptime characterized as “a disturbing increase” from just 8% in 2017.
  • Downtime is also becoming more expensive, with more than 60% of failures resulting in at least $100,000 in total losses, up substantially from 39% in 2019. The share of outages that cost upwards of $1 million increased from 11% to 15% over that same period.
  • In a trend we first highlighted last year, networking issues have become the single biggest cause of all IT service downtime incidents – regardless of severity – over the past three years. Uptime attributes this to “complexities from the increasing use of cloud technologies, software-defined architectures and hybrid, distributed architectures.”
  • The most significant outages are usually tied to electrical equipment, especially uninterruptible power supply (UPS) failures. “Power-related outages account for 43% of outages that are classified as significant (causing downtime and financial loss),” said Uptime.

The findings come from the latest annual survey from The Uptime Institute, whose data is notable because it highlights trends in data center outages that may not be publicly reported.

Online services are more important than ever in the wake of the COVID-19 pandemic, which has boosted reliance on remote work and learning – meaning that service outages are more broadly felt, and generate wider notice.

Lengthy Downtime Incidents Make Headlines

A new wrinkle is the growth of lengthier outages over the past two years. Some of these have been very public, such as a massive global outage at Meta last October that left Facebook, Instagram and WhatsApp offline for at least five hours. Facebook later said that a configuration error broke its connection to a key network backbone, disconnecting all of its data centers from the Internet and leaving its DNS servers unreachable.

Another example is the 73-hour outage last year at Roblox, which cost the metaverse company an estimated $25 million in lost bookings. In an incident report, Roblox said several software services contended for resources, making it harder to diagnose a bug in a database.

The Facebook and Roblox incidents illustrate how the growing complexity of online applications can sometimes make it harder to troubleshoot automated infrastructure, leading to lengthier outages.

The growing role for network issues was seen in a major outage at Amazon Web Services in December, with the ripples spreading across the Internet to interrupt service for many popular web services that run their infrastructure on the AWS cloud. The issue was traced to problems with several network devices in the AWS data center cluster in Northern Virginia.

Power and equipment issues were prominent in another lengthy outage in 2021, when a data center fire at OVH in Strasbourg, France left many customers offline for days. The SBG2 data center was destroyed by fire on March 9, which required the power to be turned off for the entire four-building campus. A second data center building, SBG1, was eventually shuttered after a smoke incident in a UPS room.

The growing financial impact of outages is not a surprise, given how digital services have become central to nearly every business. The “cost of downtime” has long been used to underscore the value of data center services and maintenance, but now serves as a reflection of increased reliance on data infrastructure.

More Investment, Yet More Complexity

All of this is happening in a period of enormous investment in digital infrastructure, including huge growth for cloud platforms, record-setting M&A action and the creation of new operating platforms for data centers.

That investment doesn’t neatly translate into improved reliability, especially in a complex environment in which new architectures are spreading IT workloads across cloud, colocation, edge and on-premises facilities.

“Digital infrastructure operators are still struggling to meet the high standards that customers expect and service level agreements demand – despite improving technologies and the industry’s strong investment in resiliency and downtime prevention,” said Andy Lawrence, founding member and executive director, Uptime Institute Intelligence. The survey results will be summarized in a presentation next week.

“The lack of improvement in overall outage rates is partly the result of the immensity of recent investment in digital infrastructure, and all the associated complexity that operators face as they transition to hybrid, distributed architectures,” said Lawrence. “In time, both the technology and operational practices will improve, but at present, outages remain a top concern for customers, investors, and regulators. Operators will be best able to meet the challenge with rigorous staff training and operational procedures to mitigate the human error behind many of these failures.”

About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
