Cloud Outages Sharpen Focus on Uptime and Reliability

March 22, 2022
Reliability is in the spotlight after major outages in 2021 for some of the Internet economy’s marquee names. Our DCF Data Center Executive Roundtable panel of industry experts examines uptime in the cloud computing era.

Uptime is always job one for data center operators, but it remains a challenge for the largest tech platforms, as shown in Apple’s lengthy outage on Monday. Reliability was already in the spotlight after major outages in 2021 for Amazon Web Services and Facebook. In today’s edition of the DCF Data Center Executive Roundtable, our panel of industry thought leaders examines uptime in the cloud computing era.

Our panelists include Sean Farney of Kohler Power Systems, Michael Goh from Iron Mountain Data Centers, DartPoints’ Brad Alexander, Amber Caramella of Netrality Data Centers and Infrastructure Masons, and Peter Panfil of Vertiv. The conversation is moderated by Rich Miller, the founder and editor of Data Center Frontier. Here’s today’s discussion:

Data Center Frontier: Last year’s major service outages at Facebook and Amazon Web Services sharpened the focus on data center reliability. As companies embrace the benefits of cloud and hybrid IT architectures, what are the key strategies for ensuring uptime?

Sean Farney, Kohler Power Systems: I recall a fascinating dinner conversation with Ray Ozzie years ago while at Microsoft. In it, he shared that the original design tenet for Azure was to move redundancy and resilience way up the stack from the physical layer to the application layer so that software would obviate outages. So if utility power failed in one facility, the data center manager would simply roll bits over to a different facility, removing the cost and complexity of redundant power systems, for example. It’s a noble idea and is working to a great extent with many CDN services in production today.

However, with the increasing complexity of interdependent applications and the rapidity of data set growth, there are stateful information services that must be homed in a single location. For this reason, the best way to ensure uptime in the cloud – which is just another data center – is to design, build and properly maintain multiple levels of redundancy across all key points of failure in a system. N+1, 2N, Concurrent Maintainability, etc. are table stakes for operators beholden to Service Level Agreements.
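To put rough numbers on redundancy levels such as N+1 and 2N, here is a minimal sketch of the availability arithmetic. It is an illustration rather than anything the panelists described: it assumes every unit fails independently with the same availability, and the function names and the 99% figure are hypothetical.

```python
from math import comb

HOURS_PER_YEAR = 8760


def k_of_n_availability(unit_availability: float, needed: int, installed: int) -> float:
    """Probability that at least `needed` of `installed` units are up.

    Assumes identical, interchangeable units that fail independently,
    which is a simplification of real power and cooling plants.
    """
    a = unit_availability
    return sum(
        comb(installed, up) * a**up * (1 - a) ** (installed - up)
        for up in range(needed, installed + 1)
    )


def two_path_availability(path_availability: float) -> float:
    """Availability of a 2N design modeled as two independent, complete paths."""
    return 1 - (1 - path_availability) ** 2


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    unit = 0.99  # hypothetical availability of a single unit
    load = 2     # hypothetical number of units needed to carry the load (N)

    n_only = k_of_n_availability(unit, load, load)
    n_plus_1 = k_of_n_availability(unit, load, load + 1)
    two_n = two_path_availability(n_only)

    for label, value in [("N", n_only), ("N+1", n_plus_1), ("2N", two_n)]:
        print(f"{label}: availability {value:.6f}, "
              f"about {annual_downtime_hours(value):.1f} hours down per year")
```

Real facilities see correlated failures such as shared fuel supply, common control systems and human error, so figures from a model like this are an upper bound on what redundancy alone delivers.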

Equipment like the venerable but ever-reliable diesel generator will continue to be in high demand for many years to come because we can trust it to provide backup power, flawlessly. And amid a data center building boom, Kohler is seeing just this – unparalleled demand for proven and reliable power products.

Michael Goh, Iron Mountain Data Centers: From a colocation service provider perspective, the fundamental requirement is data center uptime. While the data center industry is experiencing widespread growth, it’s also adapting to a more complex playing field, with evolving efficiency and sustainability requirements alongside a challenging supply chain.

Hiring and retaining qualified staff, monitoring, and increasing the level of automation in the data center to reduce the chance of human error are key strategies for ensuring uptime. Having comprehensive operations procedures and disaster recovery plans in place is also key.

From an end user perspective, it’s important not to put all your eggs in one basket. We see customers adopting a hybrid cloud strategy in which they mix workloads across colocation and cloud providers, as well as embracing multi-cloud platforms. This inevitably drives up the complexity of the customer’s infrastructure and has indirectly contributed to rising demand in the managed cloud services segment.

Amber Caramella, Infrastructure Masons and Netrality: The key to uptime begins with a fault-tolerant design that mitigates single points of failure, as the foundation of infrastructure. Companies that work closely with their vendors to collaborate and play a role in the planning, design, and build phase have greater resiliency in their network.

A preventive maintenance strategy, including regularly scheduled maintenance check-ins executed by your data center operations team, ensures power and cooling systems are running at optimal levels and helps determine when systems need to be replaced or upgraded. Monitoring and reporting will notify operators if a system is down. Real-time reporting is essential to address issues before they escalate and cause system outages.


Brad Alexander, DartPoints: Horizontal scalability not only reduces the risk of lost data, but it also ensures that there is no single point of failure – this is a huge safety net. This same principle is also what makes a multi-cloud, multi-provider solution so attractive to companies that are focused on reliability. A multi-cloud platform adds an extra layer of protection. If one provider experiences an infrastructure breakdown or is victim to a cyber-attack, companies with more than one provider can quickly switch to the other provider or back everything up to a private cloud to secure important data.
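The "switch to the other provider" idea can be sketched in a few lines. This is a simplified illustration, not DartPoints' implementation: the provider names and health-check callables are hypothetical stand-ins for whatever probes an operator actually runs.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersDown(Exception):
    """Raised when no provider in the list passes its health check."""


def pick_healthy_provider(
    providers: Sequence[Tuple[str, Callable[[], bool]]]
) -> str:
    """Return the name of the first provider whose health check passes.

    `providers` is an ordered list of (name, health_check) pairs, with the
    preferred provider first. A check that raises is treated as unhealthy.
    """
    for name, is_healthy in providers:
        try:
            if is_healthy():
                return name
        except Exception:
            # A failing probe counts as an outage; move on to the next provider.
            continue
    raise AllProvidersDown("no provider passed its health check")


if __name__ == "__main__":
    # Hypothetical probes: the primary cloud is down, the secondary is up.
    providers = [
        ("primary-cloud", lambda: False),
        ("secondary-cloud", lambda: True),
        ("private-cloud", lambda: True),
    ]
    print(pick_healthy_provider(providers))
```

In practice the hard part is not the selection logic but keeping data and configuration synchronized across providers so that the fallback is usable when it is needed.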

Geographic and safety awareness are also important contributing factors to uptime and data security. Locations at lower risk for natural disasters such as hurricanes and earthquakes minimize geographic risk, which makes them attractive to colocation tenants. Data center locations should be carefully evaluated based on climate as well as environmental conditions and the probability of a natural disaster.

Network visibility and control help avoid issues before they occur, which is why application intelligence is a key component of reliability for service providers. It gives them the power to collect reliable, actionable application data for more effective monitoring and security. Intelligent applications also understand the proper flow of data and can detect traffic that might indicate a threat. This protects confidential information from application security attacks.

When it comes to uptime and business continuity, the number one and number two threats are human error and flaws in security and procedures. The end user is typically the most overlooked threat to a business, and is a common entry point for ransomware, malware, and phishing, as well as the source of data leaks from social engineering attacks. Companies shouldn’t assume that network hardware and security software will protect against end user mistakes. My best advice is to train, reinforce, test, review, and train some more.


Peter Panfil, Vertiv: To meet growing demand for services, cloud operators have to balance speed of deployment, cost, reliability and sustainability. In some cases, infrastructure redundancy has been sacrificed to achieve lower build costs, which can backfire if downtime causes the market to lose confidence in the reliability of cloud services.

Two trends have emerged that enable operators to achieve their speed and cost goals without compromising reliability. One is value engineering of high-utilization critical power architectures that maintain redundancy while eliminating stranded capacity and maximizing efficiency. The other is the availability of modular prefabricated data centers, which can be deployed in less time than is possible using traditional construction methods while delivering extremely low PUEs and high availability.

NEXT: Our roundtable panel discusses the state of the data center supply chain. 

About the Author

Rich Miller

I write about the places where the Internet lives, telling the story of data centers and the people who build them. I founded Data Center Knowledge, the data center industry's leading news site. Now I'm exploring the future of cloud computing at Data Center Frontier.
