Problems With AWS Network Devices Caused Widespread Cloud Outage

Problems with several network devices in Northern Virginia caused a major outage at Amazon Web Services, with the ripples spreading across the Internet to interrupt service for many popular web services that run their infrastructure on the AWS cloud.

The lengthy outage highlighted the essential role played by cloud platforms like AWS, which support the web operations of at least 1 million enterprise customers. The problems at AWS were blamed for performance issues at Netflix, Disney+, Ring, Ticketmaster, Venmo, Roku, Fidelity Investments, Hootsuite, and many others. The outage interrupted online finals for students using the Canvas Learning Management platform, and even deliveries at Amazon warehouses, as it affected apps required to scan packages and plan delivery routes.

The AWS outage was focused on US-East-1, a service region based in Northern Virginia which houses the largest concentration of Amazon data center infrastructure. The problems began at around 12:30 p.m. Eastern, when users began to experience problems accessing AWS services. Approximately 5 hours later, at 5:47 p.m., AWS reported that it had “mitigated the underlying issue” and services were beginning to be restored.

“The root cause of this issue is an impairment of several network devices in the US-EAST-1 Region,” AWS said on its status page. As of 7:30 p.m. Eastern, AWS said the network device issues had been resolved, and it was “now working towards recovery of any impaired services.”

Large-scale IT service outages can be expensive. A 2021 survey from The Uptime Institute found that data center outages cost companies an average of $100,000 per incident, with about a third of respondents citing costs of $1 million or more.

The stakes could be even higher for Amazon Web Services, which is the largest cloud computing platform. AWS had revenue of $16.4 billion in the third quarter of 2021, which works out to about $7.4 million per hour. Cloud workloads running outside the US-East-1 region apparently were unaffected, but an outage lasting more than six hours in the largest cloud region would add up quickly – though such “losses” at service providers are often accounted for through customer credits.
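The per-hour figure can be checked with a quick calculation. The sketch below assumes revenue accrues evenly across every hour of the 92-day quarter, which is a simplification, and the function and constant names are illustrative only.

```python
# Back-of-the-envelope check of the hourly revenue figure.
# Assumption: revenue is spread evenly across every hour of the quarter.

Q3_2021_REVENUE_USD = 16.4e9   # AWS third-quarter 2021 revenue, as cited above
DAYS_IN_Q3 = 31 + 31 + 30      # July, August, September


def revenue_per_hour(quarterly_revenue: float, days: int) -> float:
    """Average revenue per hour if revenue were spread evenly over the period."""
    return quarterly_revenue / (days * 24)


hourly = revenue_per_hour(Q3_2021_REVENUE_USD, DAYS_IN_Q3)
print(f"About ${hourly / 1e6:.1f} million per hour")
```

This is a platform-wide average. It says nothing about how much of that revenue is tied to a single region, so it should not be read as the cost of this particular outage.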

Why Networks Are So Important

The rise of cloud computing underscores the importance of networks and how they are configured. Networking and software issues are surpassing power outages as the most common causes of data center downtime, according to 2021 outage data from Uptime Institute. This trend reflects the growing role of cloud computing and SaaS (software as a service) applications, which often use architectures that can route around physical failures of electrical components like UPS systems, transfer switches and generators.

When Amazon Web Services experiences reliability problems, they often involve US-East-1, which is not surprising because it is the largest AWS region and also the oldest, as Amazon has had data centers in Virginia since 2004. AWS has spent $35 billion on its cloud computing infrastructure in Northern Virginia over the past 10 years, and operates about 50 data centers in the region. It’s the largest single concentration of corporate data centers on earth, positioned near a strategic Internet intersection in Ashburn, which serves as a global crossroads for data traffic.

Network problems are complicated by the highly automated nature of cloud platforms. These data traffic flows are designed to be large and fast and work without human intervention – which makes them hard to tame when humans do intervene. Some of the largest outages impacting cloud platforms and social networks have been tied to network problems. Some examples:

On October 4, a configuration error broke Facebook’s connection to a key network backbone, disconnecting all of its data centers from the Internet and leaving its DNS servers unreachable, the company said.
A lengthy 2019 Google outage was caused by unusual network congestion in its operations in the Eastern U.S. In an incident report, Google said that YouTube measured a 10 percent drop in global views during the incident, while Google Cloud Storage measured a 30 percent reduction in traffic.

Resiliency is Still A Challenge

At DCF we have often noted how cloud computing is changing how companies approach uptime, introducing architectures that create resiliency using software and network connectivity (see “Rethinking Redundancy”). This strategy, pioneered by cloud providers, is creating new ways of designing applications. Data center uptime has historically been achieved through layers of redundant electrical infrastructure, including uninterruptible power supply (UPS) systems and emergency backup generators.

Cloud providers like Google have been leaders in creating failover scenarios that shift workloads across data centers, spreading applications and backup systems across multiple data centers, and using sophisticated software to detect outages and redirect data traffic to route around hardware failures and utility power outages.

Amazon Web Services has been a pioneer in this effort by popularizing the use of availability zones (AZs), clusters of data centers within a region that allow customers to run instances of an application in several isolated locations to avoid a single point of failure. These architectures enable sophisticated approaches to failover and backup of applications. But even a distributed uptime plan can break down if the network fails, breaking the flow of data across cloud infrastructure.
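A minimal sketch can illustrate both the benefit and the limit of this approach. It is not how AWS implements failover: the zone names, the `reachable` map, and the `pick_zone` function are all hypothetical, and a real deployment would rely on health checks and load balancers rather than a dictionary lookup.

```python
from typing import Dict, List, Optional


def pick_zone(preferred: List[str], reachable: Dict[str, bool]) -> Optional[str]:
    """Return the first zone, in preference order, that is currently reachable."""
    for zone in preferred:
        if reachable.get(zone, False):
            return zone
    return None  # no zone can be reached, so failover has nowhere to go


zones = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names

# One zone fails: traffic shifts to the next reachable zone.
single_failure = pick_zone(zones, {"zone-a": False, "zone-b": True, "zone-c": True})

# A region-wide network problem leaves every zone unreachable.
network_failure = pick_zone(zones, {zone: False for zone in zones})

print(single_failure)
print(network_failure)
```

In the first case the application keeps running in another zone. In the second, spreading instances across zones does not help, because the failure is in the network that connects them.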

As often happens with AWS downtime, the incident prompted some to wonder whether cloud computing has reached a scale where the downtime equation is shifting.

“A multi-day full outage of us-east-1 will have an observable effect on the world economy,” tweeted Corey Quinn, Chief Cloud Economist at The Duckbill Group. “That is not an exaggeration.”

“I don’t think AWS has done anything wrong here,” Quinn tweeted. “This is the natural end result of their success at massive scale.”