ORLANDO, Fla. – You know the Tier of your data center, and you probably know what your availability has been. But what’s the probability that your data center will experience a failure within the next 12 months? Do you know?
Steve Fairfax believes you should. At the 7×24 Exchange Spring conference earlier this month, Fairfax outlined his proposal for a new data center metric to provide a simple way to understand the risk of future failure. The Class metric would express the probability, in percentage terms, that a facility will fail in one year of operations.
“Critical facilities are for the risk averse,” said Fairfax, the President of MTechnology. “A data center is a giant insurance policy.
“Yet this risk-averse industry has no metric for risk,” he added. “Executives use metrics to help make sense of complex decisions. I think there’s a real need (for a new metric). Class lets us talk about risk and probability without using the word ‘failure.’ ”
Fairfax specializes in risk analysis, and has spent decades studying failures in complex mission-critical systems. At MTechnology, he adapted safety engineering best practices from the nuclear power industry, known as probabilistic risk assessment (PRA), and applied them to data center risk modeling. PRA uses computer calculations to analyze what could go wrong, how a chain of failures could combine to endanger a system, and how design decisions can minimize the impact of these scenarios.
The Class proposal seeks to take the benefits of probability analysis and express them in a simple measure of future risk.
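Fairfax didn't spell out a formula in his talk, but the idea of turning a modeled failure rate into a one-year failure probability is standard reliability math. A minimal sketch, assuming failures follow a constant-rate (Poisson) process — the rate value here is purely illustrative:

```python
import math

def class_metric(failures_per_year: float) -> float:
    """Probability (in percent) of at least one failure in 12 months,
    assuming a constant failure rate (Poisson process)."""
    return 100 * (1 - math.exp(-failures_per_year))

# A facility modeled at one failure every 10 years (rate = 0.1/year):
print(round(class_metric(0.1), 1))  # -> 9.5
```

Note that for rare failures the probability is close to the rate itself, which is what makes a percentage-per-year figure intuitive to non-engineers.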
Why do we need a new metric? Fairfax argues that the Uptime Institute’s Tier System is primarily a measure of redundancy (the duplication of critical components), while IEEE Standard 3006.7 focuses on reliability (an indication of how long a system will operate before it fails).
Fairfax is a fan of IEEE 3006.7, but says it doesn’t translate well to the business world. “It’s very technical and detailed,” he said. “It’s written by engineers and for engineers. Class is for the rest of us. It’s a way to talk about risk. This metric should be easy to understand and use. It should be intuitive.”
Defining the probability of failure helps customers make informed decisions about the consequences of design decisions, and align their business accordingly. A company running mobile gaming apps will have a different risk profile than a financial services data center. Fairfax asserts that probability is a better tool for making decisions than redundancy or availability.
“Let’s start a conversation about what is the right amount of risk,” he said. “Not everyone needs the ultimate data center and the ultimate reliability.”
Not ‘One Metric to Rule Them All’
After Fairfax outlined his proposal in a 45-minute morning keynote, the 7×24 Exchange convened an afternoon panel of industry thought leaders to discuss the state of metrics in the data center industry, and how the Class metric might fit. The panelists emphasized that Class could be useful as a supplement to existing metrics, rather than a replacement. There is no “one metric to rule them all,” but rather a diverse offering of metrics that provide different views of performance.
In particular, panelists said the Class proposal was not an effort to replace the Tier System, developed by the late Ken Brill at the Uptime Institute. The tier system classifies four levels of data center reliability, defined by the equipment and design of the facility, and has become central to discussions of how to plan and design enterprise data centers.
“Ken Brill was trying to bring order to chaos,” said Fairfax. “Ken proposed a classification system that helped us to try and make sense of this. The Tier system has put data centers into four big buckets, and did the industry a great service. But it doesn’t really measure risk. I don’t think it’s a competitive thing.”
Peter Gross of Bloom Energy, a leading voice in data center design, seconded that sentiment.
“No single document has done more to improve our industry than the tier system,” said Gross. “But it’s not uncommon for a Tier II facility to be more reliable than Tier III. That’s crazy.”
“We don’t have a lot of metrics in this industry,” said Gross. “Both PUE (Power Usage Effectiveness) and the Tier System have contributed significantly to improving reliability and efficiency. But in a way, in a sophisticated industry like ours, having only two metrics is pathetic. Metrics are complicated. PUE and Tiers have succeeded because they’re simple. People have a difficult time understanding the concept of probability.”
The Problem With Availability
What about availability? Fairfax says availability is a misleading metric, because in practice it measures a data center’s ability to deliver power to customers, rather than the actual operation of customer equipment. The exalted “five nines” availability – uptime of 99.999%, equating to just five minutes of downtime per year – is often a misnomer, he says.
“There is no such thing as a five-minute system outage,” said Fairfax. “If there was an award for abuse of statistics, this would win it.”
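The “five nines” arithmetic itself is easy to verify. A quick sketch of the availability-to-downtime conversion (using a 365.25-day year, which is why the figure comes out slightly above five minutes):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines."""
    availability = 1 - 10 ** -nines
    return MINUTES_PER_YEAR * (1 - availability)

print(round(downtime_minutes(5), 2))  # five nines -> ~5.26 minutes/year
print(round(downtime_minutes(3), 1))  # three nines -> ~526 minutes/year
```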
Gross noted that even a brief power loss can lead to hardware failures and database recovery challenges that translate into lengthy customer outages.
“There is no correlation between the availability of M&E (mechanical and electrical equipment) and the availability of a data center,” he said. “People don’t care about the availability of their electrical systems; they care about the availability of their computer systems. What’s the availability of my compute, my network, and my storage? You can have a one second power loss, and it takes 24 hours to get the business back online.”
Availability also doesn’t reflect the number of outages, as noted in a blog by Schneider Electric. “Consider two data centers that are both considered 99.999% available,” Schneider’s Susan Hartman wrote. “In one year, Data Center A loses power once, for 5 minutes. Data Center B loses power 10 times, but for only 30 seconds each time. … The data center that loses power 10 times will have a far greater MTR (Mean Time to Recover) than the one that lost power only once – and hence probably far from a 99.999% availability rating.”
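The Schneider example is easy to reproduce: measured purely on minutes of power lost, the two facilities are indistinguishable, even though one suffers ten separate recovery events. A small sketch (outage figures taken from the quote above):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

# Data Center A: one 5-minute outage; Data Center B: ten 30-second outages.
dc_a_downtime = 1 * 5.0
dc_b_downtime = 10 * 0.5

avail_a = 1 - dc_a_downtime / MINUTES_PER_YEAR
avail_b = 1 - dc_b_downtime / MINUTES_PER_YEAR
print(f"{avail_a:.5%}  {avail_b:.5%}")  # both ~99.99905%
```

The raw power availability is identical; it is only once recovery time for the IT systems is counted, as Hartman notes, that Data Center B's effective availability collapses.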
“The idea is to diminish the pursuit of ‘five nines’ availability,” said Fairfax. “Availability cannot measure risk. We allow our facilities to increase risk to improve some other metric.”
8 Million Ways to Fail
Fairfax says that probability analysis provides a more complete picture of the interaction between components of the data center building, power distribution and cooling system.
Probability and Class can be calculated using a number of methods:
- Fault trees that diagram how a failure in one component can impact other elements of the system.
- Reliability Block Diagrams, which create a model of the system flow to assess dependencies.
- “Monte Carlo” simulations using software tools (including MATLAB and Excel add-ins) to model all possible outcomes of failures in complex systems.
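As a toy illustration of the Monte Carlo approach — the topology and failure probabilities below are assumed for the sake of the sketch, not taken from MTechnology's models — one can simulate many years of operation of a simple utility/generator/UPS chain and count the years in which the load is dropped:

```python
import random

random.seed(42)

# Illustrative annual probabilities (assumptions, not measured data):
P_UTILITY = 0.50   # at least one utility outage during the year
P_GEN     = 0.05   # generator fails to start or carry the load on demand
P_UPS     = 0.02   # UPS fails to ride through the transfer

def year_has_outage() -> bool:
    """One simulated year: the load drops only if the utility fails AND
    one of the backup layers also fails during the event."""
    utility_out = random.random() < P_UTILITY
    gen_fails = random.random() < P_GEN
    ups_fails = random.random() < P_UPS
    return utility_out and (gen_fails or ups_fails)

trials = 100_000
failures = sum(year_has_outage() for _ in range(trials))
print(f"Estimated annual failure probability: {failures / trials:.2%}")
```

With these assumed inputs the estimate lands around 3.5% — a real PRA model replaces this three-component toy with thousands of components and their dependencies.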
MTechnology’s work on fault tree analysis has identified up to 8 million failure modes. Most of these have a very small probability of occurring, but they add up, Fairfax said, adding that four or five components are involved in 85 percent of failures.
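The way millions of tiny probabilities "add up" can be seen with one line of arithmetic: the chance that at least one of N independent modes fires is 1 minus the product of each mode not firing. The per-mode probability below is an assumed round number, chosen only to show the effect:

```python
# Assumed for illustration: 8 million independent failure modes,
# each with a one-in-100-million chance per year.
n_modes = 8_000_000
p_each = 1e-8

p_any = 1 - (1 - p_each) ** n_modes
print(f"{p_any:.1%}")  # roughly 7.7%
```

Individually negligible modes can thus still dominate the system-level risk, which is why pruning them by hand is unreliable and computer-based PRA is used instead.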
What kind of insights can be gained from this approach? Predictive modeling of design decisions can prevent data center builders from investing large amounts of money on equipment that isn’t likely to improve their uptime.
“Our analysis suggests that adding one more generator has vastly more benefit than a second utility line,” said Fairfax. “There’s no benefit to a second utility feed. Spend that money on a second generator instead. There’s some real money to be saved.”
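One plausible reason for this result, hedged as a sketch rather than Fairfax's actual analysis: two feeds from the same grid share common-cause failures (storms, grid-wide events), while on-site generators fail largely independently. With illustrative numbers:

```python
# All figures are assumptions for illustration only.
p_feed = 0.10   # annual probability a utility feed is lost
beta = 0.80     # fraction of feed outages hitting BOTH feeds (common cause)
p_gen = 0.05    # probability a generator fails on demand

# Annual probability of a total power loss under each design
# (simplified: load is lost only if every layer fails together).
baseline = p_feed * p_gen                                        # 1 feed, 1 gen
add_feed = (beta * p_feed + ((1 - beta) * p_feed) ** 2) * p_gen  # 2 feeds, 1 gen
add_gen = p_feed * p_gen ** 2                                    # 1 feed, 2 gens

print(f"baseline: {baseline:.4f}  +feed: {add_feed:.4f}  +gen: {add_gen:.5f}")
```

Under these assumptions the second feed barely moves the failure probability, because the common-cause term dominates, while the second generator cuts it by a factor of twenty — consistent with the direction of Fairfax's claim, though not a substitute for a full fault-tree model.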
So what’s next for the proposal? In his keynote, Fairfax proposed that the 7×24 Exchange develop Class as a critical facility metric. Bob Cassiliano, Chairman and CEO of the 7×24 Exchange, invited conference attendees to provide feedback on the proposal, which will guide the group’s decision on whether to pursue establishing Class as a new metric. The 7×24 Exchange is the leading knowledge exchange among professionals who design, build, operate and maintain enterprise mission critical infrastructures.