Using Out-of-Band Real-Time Power, Thermal, and Utilization Analysis in HPC

Oct. 5, 2020
George Clement, Senior Application Engineer at Intel, highlights a case study that shows the importance of data center management for gaining greater insight into power demand, thermal efficiency, server utilization, and capacity planning in an HPC environment.

George Clement, Senior Application Engineer, Intel

Recently, the Institute for Health Metrics and Evaluation (IHME) at the University of Washington faced a challenge: instrumenting their HPC environment. They needed to ensure the data center was providing enough cooling and power for the COVID-19 work they were supporting, and that the racks were provisioned and maintained in an optimal physical configuration.

Finally, they needed the solution to be simple and focused on giving their operations team actionable data at the right time.

“We learned a lot from other products about realistic maximum power consumption, but it was only relevant for historical data, it couldn’t provide us the real-time alerting. If something goes wrong in the data center, right now, other products couldn’t tell us that. [This] was easy to plug in, and easy to get the data and analysis from our machines immediately. The alerts and power limitations were set up within a day.”
— Vern Harbers, Technical Project Manager, Infrastructure, IHME, University of Washington

Key Learnings

The Institute for Health Metrics and Evaluation (IHME), an independent global health research center at the University of Washington, worked with the Intel Data Center Solutions team to improve the availability and performance of its COVID-19 HPC cluster by targeting several key elements as they implemented their solution:

  • Using out-of-band (OOB) interfaces (exposed on each server platform) and standard software packages to build a solution that monitored more than 600 servers in the High-Performance Computing (HPC) data center environment at the University’s colocation facility.
  • Collecting real-time health, power, and thermal data from the HPC servers the team supports, with no additional hardware or software. This enabled IT staff to better plan and manage capacity and utilization in racks without complicating their solution stack.
  • Using the real-time data for alerting and long-term planning by storing it and aggregating it into meaningful groups (for IHME, these were racks, rows, and rooms).
  • Accessing each server platform’s power control features through a central system. This allowed the team to ensure that rack loading was efficient and balanced without having to rely on estimates or bench-testing workloads.
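
To make the aggregation and alerting idea in the list above more concrete, here is a minimal Python sketch. It is not IHME's or Intel's implementation, which the case study does not describe. The `Reading` schema, the function names, the rack budget, the inlet temperature limit, and the sample values are all hypothetical, and the readings are assumed to have already been collected from each server's out-of-band interface by some other tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One out-of-band sample for a single server (hypothetical schema)."""
    server: str
    room: str
    row: str
    rack: str
    power_w: float       # instantaneous power draw in watts
    inlet_temp_c: float  # inlet air temperature in degrees Celsius


def aggregate_power(readings, level):
    """Sum power draw per group; level is 'rack', 'row', or 'room'."""
    keys = {
        "room": lambda r: (r.room,),
        "row": lambda r: (r.room, r.row),
        "rack": lambda r: (r.room, r.row, r.rack),
    }
    key = keys[level]
    totals = defaultdict(float)
    for r in readings:
        totals[key(r)] += r.power_w
    return dict(totals)


def rack_alerts(readings, rack_budget_w, max_inlet_c):
    """Return alert strings for racks over budget and servers running hot."""
    alerts = []
    for rack, watts in sorted(aggregate_power(readings, "rack").items()):
        if watts > rack_budget_w:
            alerts.append(
                f"rack {'/'.join(rack)}: {watts:.0f} W exceeds "
                f"{rack_budget_w:.0f} W budget"
            )
    for r in readings:
        if r.inlet_temp_c > max_inlet_c:
            alerts.append(
                f"server {r.server}: inlet {r.inlet_temp_c:.1f} C exceeds "
                f"{max_inlet_c:.1f} C limit"
            )
    return alerts


# Made-up sample data for illustration only; not taken from the case study.
SAMPLE = [
    Reading("node001", "room1", "rowA", "rack01", 410.0, 24.5),
    Reading("node002", "room1", "rowA", "rack01", 395.0, 25.0),
    Reading("node003", "room1", "rowA", "rack02", 520.0, 31.2),
    Reading("node004", "room1", "rowB", "rack07", 300.0, 23.8),
]

if __name__ == "__main__":
    print(aggregate_power(SAMPLE, "row"))
    for alert in rack_alerts(SAMPLE, rack_budget_w=800.0, max_inlet_c=30.0):
        print(alert)
```

Keying each group by its full location (room, row, rack) keeps racks with the same label in different rows from being merged, and the same per-rack totals could in principle inform where to apply power limits when balancing rack loading.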

(Graph: Courtesy of Intel)

Impressions from Team

Using these techniques, the IHME staff reported that they gained greater insight into power demand, thermal efficiency, server utilization, and capacity planning in their HPC environment. They had the solution up and working within hours of rollout. They were also able to compile and aggregate actionable, real-time data from their collection of servers quickly and consistently using OOB interfaces.
