Welcome to our seventh Data Center Executive Roundtable, a quarterly feature showcasing the insights of thought leaders on the state of the data center industry, and where it is headed. In our Second Quarter 2017 roundtable, we will examine four topics: How to eliminate data center downtime like the recent outage at British Airways, how the industry is being affected by robust M&A activity, opportunities for innovation in electrical infrastructure, and how the data center industry can adapt to an increasingly multi-cloud world.
Here’s a look at our distinguished panel:
- Andrew Schaap, Chief Executive Officer of Aligned Energy, which designs and operates highly efficient multi-tenant data centers.
- Shay Demmons, EVP and General Manager of RunSmart software. Demmons is responsible for running all facets of the software business unit including sales, marketing, product development and customer service.
- Robert McClary, Chief Operating Officer of FORTRUST, a colocation provider headquartered in Denver, Colorado.
- Bob Woolley, Vice President of Critical Facilities Engineering and Design at RagingWire Data Centers. Woolley is responsible for the teams that design, develop and engineer RagingWire’s portfolio of data centers.
The conversation is moderated by Rich Miller, the founder and editor of Data Center Frontier. Each day this week we will present a Q&A with these executives on one of our key topics. We begin our discussion by asking our panel to address recent headlines about data center service outages.
Data Center Frontier: The recent British Airways data center outage caused widespread disruption to the airline’s operations, with early estimates placing its business impact at more than 80 million pounds ($104 million US). What are the most effective ways to eliminate these types of outages?
Andrew Schaap: It has been widely reported how much data center outages cost businesses, not just in capital losses but also in damage to reputation and brand over the long term. To help avoid these critical disruptions – not just in the airline industry, but across all business sectors – it’s important for company leadership to take an in-depth look at their existing technology systems, both hardware and software, and consider updating those systems to meet the demands of today’s ever-evolving digital world.
According to a 2016 Gartner report, digital business, 24/7 operations, cyberattacks and more business disruptions are switching the conversation from recovery to resilience. Continuous monitoring of your resilience will allow IT to take proactive measures to improve it without the risk of failure. But even today, a year after the report was published, businesses expect resiliency to be built into the application. Perhaps one of the most effective ways to eliminate outages is for management to select specific availability zones that are best suited to meet their needs. If one area has overflow, administrators can distribute that traffic out to other locations.
Bob Woolley, RagingWire Data Centers: According to published accounts, the British Airways incident was caused by an operator at their Boadicea House (BoHo) data center. The operator improperly disconnected and then reconnected system power in such a way that it caused a power surge that damaged some IT equipment, taking a number of production systems offline. The ascribed cause of the failure was operator error.
While human error may have precipitated the incident, it’s clear that there were other contributing factors. Facilities such as BoHo are designed to tolerate the loss of a single electrical feed, and its protective devices should prevent power surges from reaching the IT equipment. Moreover, the BoHo facility is only one of several data centers that support BA’s critical operations. Failover to a secondary facility should have occurred automatically, but didn’t.
The evidence points to a cascading series of errors that resulted in a catastrophic failure, which is typical of major data center incidents. It may be convenient to point to a single cause such as human error, but every link needs to be strong for the chain to maintain integrity.
The lesson learned from this outage is that a simple error can compound into a large scale failure. Better procedures and training could minimize future human errors, but a component failure might produce the same result in the future if other remedies are not enacted.
The root cause is only part of the story. First, emphasis should be on the proper operation of the failover scheme between data centers to protect against ANY facility failure. Second, the design of the electrical system should preclude the ability to harm the critical load through a switching procedure. Of course, proper training and change control methodology are also essential.
Shay Demmons, RunSmart: The British Airways outage highlights the exposure that comes from not having a well-thought-out operational plan that accounts for failures and other expected operating conditions. Failures are part of a well-conceived operational plan, and it is the responsibility of the core infrastructure team to have the people, processes, and technology in place to identify and react to this wide range of operating conditions. All applications and services depend on this core infrastructure, and yet a single human error or cyberattack can wreak havoc on an enterprise and its core business unless the operating plan includes provisions for those conditions and the optimal responses.
One of the most basic yet readily available technologies today is that associated with detection and response. Many data centers lack this basic real-time monitoring of infrastructure, which is the first step to truly eliminating these sorts of outages. And once a condition is detected, most data centers lack the ability to respond to it. The concept of a software-defined data center (SDDC) includes both the detection of failures and automated response capabilities. The power of integration across the IT and facilities layers can be leveraged by these software-defined structures, allowing business continuity goals to be met. Bottom line: No longer can IT and infrastructure management remain siloed systems. They need to complement each other and respond together to mitigate disruption with limited (or no) human intervention.
For example, after receiving real-time information that a system is failing or being removed from service for maintenance purposes, the automated response needs to determine the best course of action to shed demand and maintain a level of business services consistent with the business itself. This may include preemptively shutting down non-critical servers, throttling equipment, bursting into the cloud, or turning on other assets. With a well-conceived software-defined data center which includes failure detection and automated response, operational changes can happen quickly and automatically without human intervention.
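The detect-and-respond loop Demmons describes can be sketched as a simple rule engine. The following is a minimal illustration only; the asset names, power figures, per-server draw, and shedding order are all assumptions made for the sketch, not RunSmart’s actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Asset:
    """A hypothetical data center workload (names and flags are illustrative)."""
    name: str
    critical: bool
    running: bool = True

def respond_to_failure(assets, capacity_kw, demand_kw):
    """After a detected failure reduces available capacity, shed
    non-critical load first, then burst any remaining excess to the cloud."""
    actions = []
    # Shut down non-critical servers until demand fits within capacity.
    for asset in assets:
        if demand_kw <= capacity_kw:
            break
        if asset.running and not asset.critical:
            asset.running = False
            demand_kw -= 5  # assumed per-server draw, in kW
            actions.append(f"shut down {asset.name}")
    if demand_kw > capacity_kw:
        # Load shedding alone was not enough; burst to another site or cloud.
        actions.append("burst remaining load to cloud")
    return actions

# Example: a feed failure halves available capacity to 20 kW while demand is 28 kW.
fleet = [Asset("db-primary", critical=True),
         Asset("batch-01", critical=False),
         Asset("batch-02", critical=False)]
print(respond_to_failure(fleet, capacity_kw=20, demand_kw=28))
```

In practice such rules would be driven by real-time telemetry rather than static figures, but the shape of the logic is the same: detect a condition, rank responses against business priorities, and act without waiting for a human.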
Robert McClary: I would say two things along these lines. First of all, focus the most time and effort on the “most likely” causes of downtime; if the statistics are anywhere near true and 60 to 80 percent of outages occur due to human error, then, certainly, not enough time is spent on correcting human error. Secondly, another major cause of downtime, statistically, is poor maintenance and lifecycle strategy. These two items make up the majority of the most likely causes of downtime, and both are avoidable with effort. A lot of people believe that they can design around human error. And what happens is, when you start trying to design around human error, you create a lot of complexity in the infrastructure design, and then start creating more problems by overcomplicating designs.
It’s a self-fulfilling prophecy. You try to design around human error but you end up causing more of it. We aren’t spending enough time trying to eliminate or mitigate human error, we are spending time and money on complex designs to eliminate human error and we’re causing complexity that causes more human error. Focusing on eliminating human error is much less expensive than designing around it. We just refuse to admit that human error is not inevitable, and that it can be minimized or eliminated. We just don’t want to do it, it’s too hard, it’s psychological; it’s not comfortable. Equipment failures are also avoidable. Comprehensive predictive and preventive maintenance is about ten times cheaper over the long haul than corrective maintenance. We think we save time and money but end up paying a bigger price on the backend.
NEXT: How is ongoing M&A activity impacting the data center industry?
Keep pace with the fast-moving world of data centers and cloud computing by following us on Twitter and Facebook, connecting with me on LinkedIn, and signing up for our weekly newsletter using the form below: