2025 Data Center Failures: What Outages Revealed About Resilience in the AI Era
Key Highlights
- Physical infrastructure failures, such as fires and cooling system breakdowns, remain a primary source of outages, highlighting the need for sovereign-grade facility resilience and comprehensive testing of full-site loss scenarios.
- Control-plane dependencies, especially within hyperscale cloud providers, can cause widespread disruptions, emphasizing the importance of independent observability and robust failover mechanisms.
- Operational practices, including maintenance procedures and redundancy management, significantly impact reliability; reducing redundancy during maintenance can increase vulnerability to failures.
- Densification of racks and increased power and cooling demands in AI facilities narrow operational margins, making small errors more likely to cause significant service impacts.
- Resilience in the AI era is about operational capability—designing for predictable failure, effective recovery, and continuous service restoration, not just preventing outages.
2025 was a consequential year for the data center industry. As artificial intelligence moved from experimentation to industrial-scale deployment, the rise of AI factories, purpose-built AI data centers, and unprecedented investment in power- and cooling-intensive infrastructure commanded most of the industry’s attention, and much of the media narrative.
Less visible, but no less important, was a steady drumbeat of operational failures across the year. Beyond growing public and political pushback against data center development, 2025 saw a series of incidents ranging from localized facility outages to large-scale systemic disruptions. Taken together, these events offer clear lessons about the limits of resilience assumptions, and about where reliability models are being tested hardest in the AI era.
South Korea: National Information Resources Service (NIRS) Data Center Fire and Cascading E-Government Failures
In late September 2025, a fire at South Korea’s state-run National Information Resources Service (NIRS) data center in Daejeon disrupted hundreds of government digital services, triggering a multi-day restoration effort. Reporting indicated that 647 systems and services, supporting agencies ranging from emergency services to customs, were affected by the outage. While dozens of services were restored during the initial remediation phase, broader disruption persisted as recovery efforts continued.
This was not a familiar “cloud region hiccup,” of the type that has periodically affected commercial cloud platforms. Instead, it was a sovereign digital services event: the cascading failure of a tightly interdependent e-government ecosystem. The incident exposed how modernization efforts can concentrate operational risk when architecture, failover strategies, and operational runbooks are not designed to withstand a full facility-level disaster. Coverage emphasized both the breadth of affected agencies and the staged pace of restoration: classic indicators of shared platforms, deep service dependencies, and recovery queues prioritizing foundational systems over individual applications.
Even without full forensic details being made public, the failure pattern is familiar to experienced data center professionals:
- Facility incidents impair shared infrastructure. When a building or critical rooms (power distribution, UPS and battery areas, generator systems, or network cores) are compromised, outages propagate rapidly because many nominally “redundant” components remain physically co-located.
- Control-plane, identity, and shared services amplify the blast radius. E-government platforms commonly depend on shared identity services, API gateways, logging systems, and back-end registries. When those foundational layers fail, applications that appear “unaffected” can still break.
- Recovery is orchestration-heavy. Restoring digital services is not the same as re-energizing racks. It requires dependency mapping, data-integrity validation, and carefully staged bring-up (see the sketch after this list). Most recovery plans are not designed for what is effectively a from-scratch restoration of hundreds of interdependent services.
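To make the “orchestration-heavy” point concrete, the sketch below groups services into restoration waves from a dependency map, so shared foundations such as identity and DNS come up before the applications that rely on them. It is a minimal illustration using a hypothetical service inventory (the names and dependencies are invented for the example), not a description of NIRS’s actual tooling.

```python
from collections import defaultdict, deque

# Hypothetical service-dependency inventory: service -> services it depends on.
# In a real restoration effort this would come from a CMDB or dependency map.
DEPENDS_ON = {
    "identity":       [],
    "dns":            [],
    "api-gateway":    ["identity", "dns"],
    "civil-registry": ["identity"],
    "customs-portal": ["api-gateway", "civil-registry"],
    "emergency-calls": ["dns", "identity"],
}

def restoration_waves(depends_on):
    """Group services into bring-up waves: a service starts only after
    everything it depends on has been restored (Kahn's algorithm)."""
    indegree = {svc: len(deps) for svc, deps in depends_on.items()}
    dependents = defaultdict(list)
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(svc)

    wave = [svc for svc, deg in indegree.items() if deg == 0]
    waves = []
    while wave:
        waves.append(sorted(wave))
        next_wave = []
        for svc in wave:
            for child in dependents[svc]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_wave.append(child)
        wave = next_wave

    if sum(len(w) for w in waves) != len(depends_on):
        raise ValueError("dependency cycle detected; manual intervention needed")
    return waves

for i, group in enumerate(restoration_waves(DEPENDS_ON), start=1):
    print(f"Wave {i}: {', '.join(group)}")
```

Even this toy example shows why restoring 647 interdependent systems is measured in days rather than hours: each wave can only begin once the previous one is verified healthy.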
Key Takeaways
- Physical-layer resilience remains foundational to digital sovereignty. Governments pursuing “sovereign cloud” strategies still require sovereign-grade facility separation, not merely logical or contractual isolation.
- Fire segmentation and battery strategy are national continuity issues. Continuity planning must assume exposure to smoke, suppression systems, and thermal damage, not just loss of utility power or single-component failures.
- Test full facility loss the way cyber incidents are tested. Many organizations regularly rehearse single-system or localized failures, but few practice scenarios in which the primary site is physically unavailable for days.
CyrusOne: The CME Group Trading Halt (Aurora/Chicago Area) and What It Revealed About “Five Nines” Under AI-Era Constraints
Over Thanksgiving week, a failure at a CyrusOne-operated data center in the Chicago area forced CME Group to halt trading for a prolonged period, widely reported as exceeding 10 hours. Coverage pointed to a cooling failure tied to the chiller plant as the triggering event, with immediate market impact as futures platforms were taken offline and colocated market participants experienced knock-on disruptions.
Subsequent reporting indicated that CyrusOne bolstered cooling capacity at the CHI1 facility following the incident, adding backup capacity to the chiller plant that coverage identified as the point of failure. That response is notable: rather than treating the outage as an isolated anomaly, the operator moved to address what it implicitly acknowledged as a material reliability gap. Financial press reports further underscored the severity of the incident, noting that some firms were forced to turn away trading and clearing activity: an outcome that elevates a “site issue” into a market-structure concern.
Root-Cause Category: Mechanical and Thermal Single-Point Failure
What matters here is not simply that cooling failed, but what the failure revealed structurally:
- Cooling is now a first-class dependency, not a facilities afterthought. AI-era densification, even in mixed-use colocation halls, compresses tolerance windows and raises the penalty for any thermal excursion.
- “N+1” can obscure common-mode risk. Redundant chillers provide limited protection if failures originate in shared controls, piping headers, power feeds, condenser water systems, or operational procedures that inadvertently take multiple paths down together (a back-of-envelope illustration follows this list).
- Financial-market colocation is a distinct risk class. The stakes extend beyond SLA penalties to market continuity and systemic confidence.
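To show why “N+1” can under-deliver, the snippet below uses the standard beta-factor common-cause model: even a small fraction of shared-cause failures quickly dominates the probability that both redundant paths go down together. The numbers are purely illustrative assumptions, not data from the CHI1 incident.

```python
# Back-of-envelope beta-factor sketch (illustrative numbers only, not CyrusOne data):
# with a duplex cooling path, the "both paths fail" probability is dominated by the
# common-cause fraction (shared controls, headers, procedures), not by the
# independent failure probability of each unit.

def duplex_failure_probability(p_unit: float, beta: float) -> float:
    """Probability that both of two redundant units are down on demand,
    using the standard beta-factor common-cause model."""
    independent_part = ((1 - beta) * p_unit) ** 2   # both fail independently
    common_cause_part = beta * p_unit               # one shared event takes both out
    return independent_part + common_cause_part

p_unit = 0.02   # assumed per-demand failure probability of one chiller train
for beta in (0.0, 0.02, 0.05, 0.10):
    p_sys = duplex_failure_probability(p_unit, beta)
    print(f"beta={beta:.2f}  P(both chiller paths down) ~ {p_sys:.5f}")
```

With these assumed inputs, a 5 percent common-cause fraction makes the redundant pair roughly three times more likely to fail together than the independent-failure math alone would suggest.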
This incident stands out because it mapped cleanly to a specific and well-understood failure mode: loss of cooling capacity. The consequences were neither abstract nor theoretical.
- A physical plant failure halted trading.
- In a year dominated by concern over digital and software risk, the proximate cause was mechanical.
- The corrective action was also physical: add cooling capacity and revisit plant design and operational controls.
The CyrusOne/CME outage is a reminder that the industry’s reliability envelope remains bounded by thermodynamics and fluid movement, not software alone. Amid a year dominated by discussion of AI factories and massive electrical build-outs, this incident served as a blunt counterpoint: even with power available, you can still lose the room.
AWS: The October 2025 US-EAST-1 Disruption and the Anatomy of a Hyperscale “Region Event”
On October 20, 2025, AWS experienced a significant disruption centered on its US-EAST-1 region, with widespread downstream effects across major internet services. Initial coverage described the event as an AWS outage that contributed to broad website and application disruptions, while subsequent reporting detailed a prolonged recovery period marked by backlogs that persisted even after AWS declared “normal operations” restored.
Independent outage analyses from multiple sources characterized the incident as a region-scale disruption lasting more than 15 hours, affecting services relied upon by large platforms, including Slack and others. The event renewed debate about systemic dependence on a small number of hyperscale cloud providers. AWS attributed the disruption to an “operational” issue within US-EAST-1, with later disclosures pointing to internal automation and software failures as contributing factors.
Commentary from impacted vendors, including blog coverage from Akamai, framed the incident as a case study in control-plane fragility. The analysis underscored how failures within foundational services can cascade well beyond the originating fault domain.
A typical availability-zone outage is painful but generally containable. A true “region event” becomes existential for customers because:
- Control-plane dependencies define the blast radius. When shared services such as DNS, identity, messaging backbones, or orchestration layers degrade, ostensibly resilient multi-AZ architectures can still falter.
- Customer architectures are often region-concentrated. Many enterprises operate multi-AZ but single-region deployments to satisfy latency, data-gravity, compliance, or cost constraints. When region-level shared services are impaired, assumed resilience is quickly tested.
- Recovery becomes a queuing problem. Once large backlogs of messages or operations accumulate, recovery is no longer about fixing a fault, but about safely draining and reconciling distributed systems (sketched below). Media reporting on extended processing delays was consistent with this long-tail failure mode.
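As a sketch of that queuing problem, the snippet below replays a hypothetical backlog of queued operations idempotently and at a throttled rate, which is the general shape of a safe post-outage drain. It is not AWS’s recovery mechanism; the function and names are invented for illustration.

```python
import time
from collections import deque

# Hypothetical post-outage backlog drainer: replay queued operations at a
# bounded rate, skipping anything already applied, so recovery does not
# overwhelm downstream services that are themselves still warming up.

def drain_backlog(backlog: deque, apply_fn, already_applied: set,
                  max_per_second: float = 50.0) -> int:
    """Replay a backlog of (op_id, payload) items idempotently and at a
    throttled rate. Returns the number of operations actually applied."""
    applied = 0
    interval = 1.0 / max_per_second
    while backlog:
        op_id, payload = backlog.popleft()
        if op_id in already_applied:        # idempotency check: skip duplicates
            continue
        apply_fn(op_id, payload)            # e.g. write to the recovered datastore
        already_applied.add(op_id)
        applied += 1
        time.sleep(interval)                # crude throttle to protect dependencies
    return applied

# Example usage with an in-memory stand-in for the downstream system:
store = {}
backlog = deque([(i, {"value": i * 10}) for i in range(5)] + [(2, {"value": 20})])
count = drain_backlog(backlog, lambda op_id, p: store.update({op_id: p}),
                      already_applied=set(), max_per_second=100.0)
print(f"applied {count} operations, skipped duplicates; store size = {len(store)}")
```

The throttle and the idempotency check are the whole point: draining a multi-hour backlog at full speed, or replaying the same operation twice, can extend an outage rather than end it.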
Lessons From the Outage
- Multi-region architecture is essential for truly critical services, but it is not a panacea. Effective resilience requires tested failover mechanisms, well-designed data-replication strategies, and the ability to decouple critical dependencies from a single region.
- Plan explicitly for degraded operation. When authorization services are unstable or queues back up, applications must have defined behaviors: fail closed, operate read-only, or shed non-critical functionality (a minimal sketch follows this list).
- Independent observability matters. Third-party internet telemetry often provides customers with clearer timelines and impact assessments than waiting for provider postmortems.
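The following is a minimal sketch of pre-defined degraded modes, assuming hypothetical health signals (an auth-service check and a queue-depth threshold) rather than any provider’s API. The point is that fail-closed and read-only behaviors are decided and tested in advance, not improvised mid-incident.

```python
from enum import Enum

# Minimal sketch of explicit degraded modes, driven by hypothetical health
# signals for an auth dependency and a regional message queue.

class Mode(Enum):
    NORMAL = "normal"            # full functionality
    READ_ONLY = "read-only"      # serve read paths, reject writes
    FAIL_CLOSED = "fail-closed"  # refuse requests rather than guess

def select_mode(auth_healthy: bool, queue_depth: int, queue_limit: int = 10_000) -> Mode:
    """Pick a pre-defined degraded mode instead of improvising during an incident."""
    if not auth_healthy:
        return Mode.FAIL_CLOSED   # cannot verify callers: refuse, don't guess
    if queue_depth > queue_limit:
        return Mode.READ_ONLY     # shed writes until the backlog drains
    return Mode.NORMAL

def handle_request(method: str, mode: Mode) -> str:
    if mode is Mode.FAIL_CLOSED:
        return "503 retry later"
    if mode is Mode.READ_ONLY and method != "GET":
        return "503 writes temporarily disabled"
    return "200 ok"

# Example: auth is up, but the regional queue is badly backed up.
mode = select_mode(auth_healthy=True, queue_depth=250_000)
print(mode.value, "|", handle_request("POST", mode), "|", handle_request("GET", mode))
```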
Beyond the Headline Outages: A Pattern of Smaller Failures With Outsized Lessons
Beyond the most widely reported incidents of 2025, the year also saw a steady stream of smaller (but no less instructive) failures across data center operations and digital infrastructure services. While these events drew less attention individually, together they reinforce the same underlying themes of operational fragility, common-mode risk, and the consequences of degraded redundancy.
May 31, 2025 – Equinix: SG1 Singapore Outage During Maintenance
A major incident affected Equinix’s SG1 facility in Singapore, where two data halls went offline during a maintenance window. The disruption lasted more than an hour and triggered widespread downstream impacts across local providers and services.
This followed a familiar failure chain: maintenance activity combined with temporarily reduced redundancy and an equipment fault. When redundancy is intentionally dialed down, facilities operate closer to their failure envelope; any additional fault can quickly become service-impacting.
May 22, 2025 – Digital Realty–Leased Facility: X Outage Linked to Hillsboro, Oregon Fire
According to reporting, a fire at a Hillsboro, Oregon data center operated by Digital Realty and used by X was linked to a major global outage of the platform. The incident underscored how a tenant’s global service availability can hinge on localized facility-level failures.
Subsequent reporting indicated authorities were investigating the origin of the fire, with early accounts pointing to a power-system cabinet. Though nominally a “contained-area” incident, the event demonstrated how fires affecting power distribution, suppression response, or adjacent critical systems can still trigger cascading shutdowns even when the fire itself is controlled.
May 19–20, 2025 – Oracle Cloud Infrastructure: Europe Outage Reports
OCI customers reported a multi-hour outage affecting European regions, with user reports and outage trackers documenting service disruption across multiple days. Trade press cited hours-long service impacts.
The incident reinforced a theme seen repeatedly in 2025 cloud outages: many are effectively region-scale dependency failures, and customer failover does not always behave as expected under real-world stress. It also served as another reminder of the need to test recovery processes regularly, not just in theory but under degraded conditions.
June 12 and July 18, 2025 – Google Cloud: Major Service Incidents
Google Cloud experienced at least two widely analyzed incidents in 2025:
- June 12, 2025: A poorly documented internal issue affected many applications dependent on GCP services. Independent internet telemetry gave customers insight into why specific applications were failing, highlighting the gap between provider status reporting and customer-visible impact.
- July 18, 2025: A separate incident was documented through Google’s public status reporting with a detailed timeline. Together, the events illustrated how shared services, control-plane components, and network dependencies can still produce large, correlated failures even when compute is geographically distributed.
November 18 and December 5, 2025 – Cloudflare: Global Edge and Service Outages
Cloudflare published postmortems for two significant incidents:
- November 18, 2025: A software bug triggered failures across multiple Cloudflare services.
- December 5, 2025: A separate incident caused significant disruption for a subset of customers and traffic, resolved in approximately 25 minutes, according to Cloudflare.
While not tied to a single data center failure, Cloudflare’s edge network effectively functions as a globally distributed, internet-facing data center. These outages matter to data center customers because they are dependency failures: workloads may remain operational while applications are effectively unreachable.
October 9, 2025 – Regional MSP / Colocation: Centron UPS Fire in Germany
German MSP Centron reported a data center incident near Nuremberg involving a fire in low-voltage distribution and UPS systems following maintenance work. The event reinforced the recurring maintenance-linked risk pattern, this time centered on the power chain rather than cooling infrastructure.
Common Threads Across 2025 Data Center Failures
Across these incidents, root causes and failure dynamics consistently fell into three broad categories:
- Maintenance combined with reduced redundancy and procedural or human error.
- Facility-level incidents with platform-wide or global blast radius.
- Control-plane and dependency failures at hyperscale and edge providers.
Together, they point to a common conclusion: minimizing downtime in the future is less about eliminating failure entirely than about designing for predictable, recoverable failure modes. Other commonalities across incidents included the following points:
Physical Failures Are Back in the Spotlight
The South Korea data center fire and the CyrusOne/CME cooling failure served as blunt reminders that:
- Battery rooms, UPS systems, switchgear, and thermal plants remain tier-zero infrastructure.
- A single facility incident can exceed the blast radius of many cyber events.
Cloud-Scale Outages Are Control-Plane Stories
The AWS US-EAST-1 disruption reinforced that regional shared services and automation layers often represent the true single points of failure, even when applications are nominally “distributed.”
Reliability Is About Recoverability, Not Just Redundancy
Modern resilience planning increasingly hinges on whether organizations can answer fundamental operational questions:
- Can systems operate in a degraded mode?
- Can failover occur without manual heroics?
- Can services be restored cleanly, without data corruption or inconsistent state?
AI-Era Densification Raises the Cost of Small Mistakes
As rack densities continue to climb, tolerance for cooling and power excursions shrinks. Issues once considered minor plant events can now translate into immediate workload and customer impact. While delays in AI training may be tolerable in some contexts, interruptions to inference services carry far broader commercial consequences.
In the AI era, reliability margins are thinner, and the operational discipline required to preserve them has never been higher.
Resilience as a Measurable Capability
Taken together, the failures of 2025 make one point unmistakable: in an AI-driven data center economy, resilience is no longer a theoretical design goal or a contractual SLA; it is an operational capability that must perform under real stress.
As density rises and dependency chains lengthen, the industry’s margin for error continues to narrow. The organizations that succeed will not be those that assume outages can be engineered away, but those that design facilities, platforms, and operating models to fail predictably, recover cleanly, and restore service without improvisation.
In the AI era, reliability is no longer about avoiding failure; it is about proving, repeatedly, that recovery works.
At Data Center Frontier, we talk the industry talk and walk the industry walk. In that spirit, DCF Staff members may occasionally use AI tools to assist with content. Elements of this article were created with help from OpenAI's GPT5.