Data Center Health Management

Nov. 22, 2016
In this week’s Voices of the Industry, Jeff Klaus, General Manager of Intel Data Center Solutions, discusses an approach to data center health management.

JEFF KLAUS, Intel

As winter approaches, I’m sure you’ve already heard more sneezes and coughs circulating throughout the office. Just as regular checkups are important to your health and well-being, the same can be said for your data center’s health. Preventative measures are critical to avoiding outages and downtime.

Just how critical is Data Center Health Management?

Consider the Delta Air Lines data center outage that occurred this past August, which grounded more than 2,000 flights over three days and cost the company $150 million. Or the data center outage that Southwest Airlines experienced, which also lasted three days and is estimated to have caused at least $177 million in lost passenger revenue.

That, my friends, is nothing to sneeze at.

Given that financial prognosis, how can companies afford not to take a preventative health management approach in their data centers, catching issues before they spiral out of control and become untreatable?

Yes, the numbers cited above are extraordinarily high and point out that major airlines have a lot more at stake when designing and managing critical infrastructure than most other data center operators. But the risks involving outages do not discriminate. All data center facilities across every industry sector run similar risks when left unprotected by a sound health management approach. According to a study by the Ponemon Institute, the average cost of a single data center outage today is about $730,000. Of the 60-plus data center operators surveyed for the study, the costliest outage reported caused the data center operator to lose approximately $2.4 million.

To be sure, today’s data center operators face significant long-term challenges and daily uncertainties.

Among these: How can they know when a server’s components fail? Is it necessary to check the LEDs manually? How far in advance can a data center manager anticipate fan failures in the facility? Moreover, with thousands of heterogeneous servers in the typical data center, operators need a tool to access and control these servers to maintain full availability.

Add to that the expense of hardware keyboard-video-mouse (KVM) switches, the need to receive failure reports and the need to know without question when a service call to a remote data center is necessary, and maintaining data center health can become a Sisyphean task.

Intel Virtual Gateway provides a remote control for your data center. It is a cross-platform, virtual keyboard-video-mouse (KVM) tool used for maintaining the health of data center hardware. Because it relies on a firmware-based capability embedded directly in the server, Intel Virtual Gateway eliminates the need for complicated and expensive KVM infrastructure.

Health management in the data center has four main pillars: monitoring, analytics, diagnostics and remediation. Let’s take a closer look at the capabilities specific to each of these requirements (all of which are supported by Intel Virtual Gateway).

Monitoring

  • Identifies the root cause of failures, with health details down to the component level
  • Creates a device failure report with severity and failure details
  • Uses hardware failure trending to better predict when components will need to be replaced
  • Provides failure rate and MTTR analysis per server model, component, etc.
  • Provides prediction of future server failures
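
As a rough illustration of the failure rate and MTTR analysis listed above, the following sketch computes both figures per server model from a list of repair records. It is not Intel Virtual Gateway code: the record format, the function names and the sample numbers are all hypothetical, and MTTR is taken here to mean mean time to repair.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FailureRecord:
    """One repaired hardware failure (hypothetical record format)."""
    server_model: str
    component: str
    hours_to_repair: float


def mttr_by_model(records):
    """Return the mean time to repair, in hours, for each server model."""
    totals = defaultdict(lambda: [0.0, 0])  # model -> [sum of hours, count]
    for rec in records:
        totals[rec.server_model][0] += rec.hours_to_repair
        totals[rec.server_model][1] += 1
    return {model: hours / count for model, (hours, count) in totals.items()}


def failure_rate_by_model(records, fleet_size, period_days):
    """Return failures per server per year for each model.

    fleet_size maps a model name to the number of servers of that model
    in service during the observation period.
    """
    counts = defaultdict(int)
    for rec in records:
        counts[rec.server_model] += 1
    years = period_days / 365.0
    return {
        model: counts[model] / (size * years)
        for model, size in fleet_size.items()
    }


# Made-up sample data, for illustration only.
records = [
    FailureRecord("model-a", "fan", 2.0),
    FailureRecord("model-a", "disk", 4.0),
    FailureRecord("model-b", "psu", 6.0),
]
fleet = {"model-a": 100, "model-b": 50}

print(mttr_by_model(records))
print(failure_rate_by_model(records, fleet, period_days=365))
```

Grouping by server model matches the bullet above. The same pattern would work for grouping by component by keying on `component` instead.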

Diagnostics

  • Supports server diagnostics and troubleshooting
  • Checks BIOS settings and BIOS configuration
  • Analyzes server logs
  • Makes or verifies configuration changes
  • Uses both out-of-band (OOB) access via KVM and in-band (IB) access via SSH, RDP and VNC

Remediation

  • Can remotely power servers on and off
  • Lets operators create groups of servers and assign power tasks to them
  • Can stagger turn-on to keep from overloading racks
  • Can schedule and automate an individual or group power task
  • Provides virtual media (vMedia) for remote OS provisioning and installation
  • Links server failures to IT workload and/or workflow management systems
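
To make the staggered turn-on idea concrete, here is a minimal sketch of one way such a schedule could be planned. It is an assumption-laden illustration, not how Intel Virtual Gateway implements the feature: the function name, the batch size and the delay are hypothetical, and the sketch only computes start offsets rather than issuing any power commands.

```python
def staggered_power_on(servers_by_rack, per_rack_batch=2, delay_seconds=30):
    """Return a sorted list of (start_offset_seconds, server) pairs.

    At most per_rack_batch servers in any one rack share a start offset;
    each following batch in that rack starts delay_seconds later, so the
    inrush load on a rack is spread over time.
    """
    schedule = []
    for servers in servers_by_rack.values():
        for index, server in enumerate(servers):
            offset = (index // per_rack_batch) * delay_seconds
            schedule.append((offset, server))
    return sorted(schedule)


# Hypothetical rack layout, for illustration only.
racks = {
    "rack-1": ["s1", "s2", "s3"],
    "rack-2": ["s4"],
}

for offset, server in staggered_power_on(racks):
    print(f"t+{offset:>3}s  power on {server}")
```

Racks are staggered independently in this sketch, on the assumption that the limit being protected is per rack. A facility-wide limit would need a single shared counter instead.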

Through ongoing monitoring, analytics, diagnostics and remediation, data center operators can employ a health management approach to addressing the risk of costly downtime and outages. Think of Intel Virtual Gateway as “an apple a day” for the health and well-being of your facility.

Submitted by Jeff Klaus, GM of Intel Data Center Solutions.

