The Cloudflare outage in the first week of November drew considerable attention, both because Cloudflare’s services are so widely used that their failure was noticed quickly and because of the rapid explanation of the problem posted on the Cloudflare blog shortly after the incident.
This explanation placed a significant portion of the blame squarely on data center provider Flexential and its response to problems at electricity provider PGE.
Cloudflare was able to restore most of its services in 8 hours at its disaster recovery facility. It runs its primary services at three data centers in the Hillsboro, Oregon area, geolocated in such a way that natural disasters are unlikely to impact more than a single data center.
And while almost all of the coverage of this incident starts off by focusing on the problems that might have been caused by Flexential, I find that I have to agree with the assessment of Cloudflare CEO Matthew Prince: “To start, this never should have happened.”
“We believed that we had high availability systems in place that should have stopped an outage like this, even when one of our core data center providers failed catastrophically. And, while many systems did remain online as designed, some critical systems had non-obvious dependencies that made them unavailable. I am sorry and embarrassed for this incident and the pain that it caused our customers and our team.”
A heartfelt apology, for sure.
And the Root Cause Was?
There were certainly issues with power delivery at the Flexential data center. These issues included:
- A ground fault probably caused by unplanned PGE maintenance.
- Both utility feeds, along with all of the site’s backup generators, going offline.
- A 10-minute battery backup supply, designed to provide sufficient time for the backup power to initiate, failing after just four minutes.
- The electric breakers covering Cloudflare’s equipment failing with no replacements on hand.
- And most tellingly, the lack of trained staff on-site to address the problem as it happened.
Flexential has not, at this time, confirmed the chain of events that caused the data center power problems or identified any specific issues with the site or PGE that could have contributed to the failure. Cloudflare acknowledges that its account of the events leading up to the outage is the result of “informed speculation.”
Mr. Prince’s blog post on the incident goes on to explain how their multi-data center design was supposed to work to prevent this sort of failure. After all, they placed their data centers far enough apart to prevent a single external event from impacting them all, while keeping them close enough together to allow for the operation of active-active redundant data clusters.
A Clear, Detailed Explanation of the Failure
Prince noted, “By design, if any of the facilities goes offline then the remaining ones are able to continue to operate.” From the description of the failure in the blog, it appears that Cloudflare’s redundancy goal was in reach.
And it also appears that they fell into a somewhat common trap when it comes to testing failover and redundancy. They tested the individual components of the solution without ever doing a full-scale test of the entire process. Had they shut down the connection to this specific data center, these problems would have clearly manifested.
“We had performed testing of our high availability cluster by taking each (and both) of the other two data center facilities entirely offline,” wrote Prince. “And we had also tested taking the high availability portion of PDX-04 offline. However, we had never tested fully taking the entire PDX-04 facility offline. As a result, we had missed the importance of some of these dependencies on our data plane.”
It appears that, for the sake of expediency and rapid time to market, Cloudflare allowed new products to go to general availability without fully integrating them with its high availability cluster. That, of course, raises the question: what’s the point of offering services as highly available without integrating them into that environment?
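The testing gap Prince describes can be shown with a toy model. The sketch below is purely illustrative: PDX-04 is the only name taken from Cloudflare's post, while the facility names DC-A and DC-B, the service names, and the placement data are invented assumptions that do not describe Cloudflare's real architecture. It marks a service as failed when every location it runs in is offline, then replays the tests that were run alongside the one that was skipped.

```python
# Toy model of the testing gap. All names except PDX-04 are hypothetical.
# "ha" = part of the high availability cluster, "standalone" = outside it.
TIERS = ("ha", "standalone")

# Where each (invented) service runs, as (facility, tier) pairs.
PLACEMENT = {
    "core-api": {("PDX-04", "ha"), ("DC-A", "ha"), ("DC-B", "ha")},
    # A product that went live without being integrated into the HA cluster.
    "new-product": {("PDX-04", "standalone")},
}


def whole_facility(*names):
    """Every tier of the named facilities, i.e. the whole building is dark."""
    return {(name, tier) for name in names for tier in TIERS}


def failed_services(offline):
    """Services with no remaining location once `offline` is removed."""
    return sorted(
        name for name, locations in PLACEMENT.items()
        if not (locations - offline)
    )


SCENARIOS = {
    "DC-A and DC-B fully offline": whole_facility("DC-A", "DC-B"),
    "HA portion of PDX-04 offline": {("PDX-04", "ha")},
    "PDX-04 fully offline": whole_facility("PDX-04"),
}

if __name__ == "__main__":
    for label, offline in SCENARIOS.items():
        print(f"{label}: failed = {failed_services(offline)}")
```

In this model the first two scenarios report no failures, and only the third, taking the whole facility down, exposes the service that was never placed in the cluster.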
To be fair, Cloudflare has acknowledged what happened, and how they failed to properly plan and test for this situational failure. To their credit, they have also listed specific actions that they plan to take in order to limit potential future problems. These include:
- Remove dependencies on core data centers for control plane configuration of all services and move them wherever possible to be powered first by Cloudflare's distributed network.
- Ensure that the control plane running on the network continues to function even if all core data centers are offline.
- Require that all products and features that are designated Generally Available must rely on the high availability cluster (if they rely on any of the company's core data centers), without having any software dependencies on specific facilities.
- Require that all products and features that are designated Generally Available have a reliable disaster recovery plan that is tested.
- Test the blast radius of system failures and minimize the number of services that are impacted by a failure.
- Implement more rigorous chaos testing of all data center functions, including the full removal of each of the company's core data center facilities.
- Thoroughly audit all core data centers and plan to re-audit them to ensure they comply with Cloudflare standards.
- Implement a logging and analytics disaster recovery plan that ensures no logs are dropped, even in the case of a failure of all core facilities.
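Several of these commitments, notably the blast-radius testing and the full removal of each core facility, amount to asking what breaks when one building disappears. A minimal sketch of that check follows, assuming a hand-maintained dependency graph. Every service name, the DC-A and DC-B facility names, and the graph itself are hypothetical; only PDX-04 comes from Cloudflare's post.

```python
# Hypothetical blast-radius check over a small dependency graph.
# Facilities in which each (invented) service is deployed.
HOSTED_IN = {
    "config-store": {"PDX-04", "DC-A", "DC-B"},
    "log-pipeline": {"PDX-04"},          # single-facility deployment
    "analytics": {"DC-A", "DC-B"},
    "dashboard": {"DC-A", "DC-B"},
}

# Hard dependencies: a service is down if anything it requires is down.
REQUIRES = {
    "dashboard": ["config-store", "analytics"],
    "analytics": ["log-pipeline"],
}


def blast_radius(offline_facility):
    """Set of services that fail when one facility is removed entirely."""
    # Services deployed nowhere except the offline facility fail directly.
    down = {
        service for service, facilities in HOSTED_IN.items()
        if facilities <= {offline_facility}
    }
    # Propagate failures through hard dependencies until nothing changes.
    changed = True
    while changed:
        changed = False
        for service, deps in REQUIRES.items():
            if service not in down and any(dep in down for dep in deps):
                down.add(service)
                changed = True
    return down


if __name__ == "__main__":
    for facility in ("PDX-04", "DC-A", "DC-B"):
        print(facility, "->", sorted(blast_radius(facility)))
```

In this invented graph, the dashboard and analytics services run in two facilities yet still fail with PDX-04, because they inherit a single-facility dependency through the log pipeline. That is the kind of non-obvious dependency a per-facility check is meant to surface before an outage does.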
Taking a lesson from how the hyperscalers handle crises in their data centers, Cloudflare said it also plans to establish a process in which a notification puts the company into a mode where the majority of engineering resources are directed to solving an identified crisis as quickly as possible.
While it is easy to point at their data center provider as causing this problem -- and Flexential is clearly not blameless -- this particular failure would have happened regardless of who was providing the hosting services.
To a large extent, I feel that the takeaway from this incident has been preached to IT for a very long time: “Test your backups!” That applies especially when the backup is presented as a fully redundant failover system.