Your servers aren’t always happy. When they’re unhappy, Amir Michael thinks you should know it. The same goes for whether your servers are happier and more productive than the servers in your neighbors’ data center.
Michael, a veteran of the data center teams at Google and Facebook and a co-founder of the Open Compute Project, founded Coolan to develop software that can extend the benefits of Big Data analytics beyond the hyperscale computing sector. Coolan is designed as a community-based analytics solution, in which the collective experience of data center operators can be harnessed to spot trends that can benefit the industry, including both end users and IT vendors.
Coolan is applying machine learning to discover problems that impact the total cost of ownership (TCO) of data center infrastructure. Like “unhappy server days,” for example. That’s the metric Coolan uses to track how often a server is undergoing maintenance, and how long it is out of service (mean time to repair, or MTTR).
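The article does not give Coolan's exact formulas, so the following is only a rough sketch of how the two metrics could be derived from a maintenance log. It assumes an "unhappy server day" is any calendar day on which a server is in a maintenance window, and that MTTR is the average length of those windows. The event tuples and function names are hypothetical.

```python
from datetime import datetime, timedelta

def unhappy_server_days(events):
    """Count distinct (server, calendar day) pairs touched by a maintenance window.

    Assumed definition, not Coolan's published one.
    """
    days = set()
    for server_id, start, end in events:
        day = start.date()
        while day <= end.date():
            days.add((server_id, day))
            day += timedelta(days=1)
    return len(days)

def mean_time_to_repair(events):
    """Average length of the maintenance windows, in hours."""
    if not events:
        return 0.0
    total_seconds = sum((end - start).total_seconds() for _, start, end in events)
    return total_seconds / len(events) / 3600.0

# Hypothetical maintenance log: (server id, out of service, back in service)
events = [
    ("srv-01", datetime(2015, 6, 1, 9, 0), datetime(2015, 6, 1, 15, 0)),
    ("srv-01", datetime(2015, 6, 10, 22, 0), datetime(2015, 6, 11, 4, 0)),
    ("srv-02", datetime(2015, 6, 3, 8, 0), datetime(2015, 6, 5, 8, 0)),
]

print(unhappy_server_days(events))   # 6
print(mean_time_to_repair(events))   # 20.0
```

In this made-up log, the overnight repair on srv-01 counts as two unhappy days even though it lasted only six hours, which is one reason the two metrics are worth tracking separately.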
“I got the sense that by not measuring this, we were missing an important piece of the picture,” said Michael, the CEO and co-founder of Coolan. “No one had a really good system or metric for these things.”
Tracking maintenance and failure rates helps operators understand the reliability (and thus the TCO) of servers from different vendors – and provides data on what it costs to own hardware (either on premises or in a colo facility) versus running workloads in the cloud.
These analytics will grow in importance as Open Compute hardware gains broader adoption. Many enterprise companies are interested in the benefits of open hardware, but don’t have the luxury of building in-house tools for operations analysis, as Google and Facebook do.
Harnessing The Power of Community
With his new venture, Michael is seeking to replicate a key feature of the Open Compute Project, the open source hardware initiative that drew upon a collective approach to innovation in server design.
Coolan is designed as a community-based analytics solution, in which the data of many users can be pooled and analyzed to spot trends that can benefit the entire industry – like comparing the reliability of servers from traditional vendors like Dell and HP versus Open Compute products.
“No one actually publishes that data,” said Michael. “The consumers of infrastructure need to be informed.”
That information leads to interesting questions and disruptive ideas. For example: Could detailed failure data for servers and components allow users to hold hardware vendors to uptime agreements, much like the service level agreements (SLAs) used to compensate users for data center downtime?
“It seemed like an opportunity to shake things up, and that’s what I like to do,” said Michael.
A History of Innovation
Michael was previously a member of the data center engineering teams at Google and Facebook as those companies began building their own server hardware and data centers. He helped Facebook develop custom hardware and power distribution designs for its own data centers, which formed the nucleus of the Open Compute Project (OCP) and the open hardware movement.
In 2013, Michael formed Coolan with his brother Yoni, a software engineer with experience at Silver Spring Networks and Practice Fusion, and former Facebook colleague and investor Jonathan Heiliger. The company is based in San Mateo, Calif., and recently raised seed funding.
Coolan’s analytics software is currently in beta, with pilot programs underway with several end users, representing thousands of servers and tens of thousands of components. The product runs a data collector on each server and offers its analytics tools in a software-as-a-service (SaaS) model, with reports available through an online portal or APIs. Alerts and trouble tickets can be generated automatically.
The software runs more than 40 different types of analyses, including downtime, unhappy server days, reboot rates and many more. Coolan correlates these with the data center operating environment (including server inlet temperature and humidity) to look for trends.
“We’re providing intelligence and insight into how hardware is running,” said Michael.
How Big Data Benefits the Bottom Line
These analytics are important because they can evaluate the economics of more aggressive environmental set points – for example, running inlet temperatures of 80 degrees, instead of 75 degrees – and allow further refinement. They can also spot trends among different vendors’ hardware under various conditions, like which servers have the fewest failures in high heat or humidity.
These insights can guide procurement teams in selecting vendors, and help operations teams optimize the data center environment.
If companies elect to opt in to the data sharing features, Coolan will anonymize the data and establish a “community benchmark” for its key metrics, allowing customers to gauge how they perform versus others using similar equipment and data center environments.
“That’s really where customers begin to benefit from each other,” said Amir Michael, who noted that the community benchmarks can identify failure rates when equipment reaches a certain age and alert customers.
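As an illustration only, a community benchmark of this kind could boil down to comparing one operator's metric against a pool of anonymized peer values. The sketch below assumes the metric is an annual server failure rate and that lower is better. The function name and all figures are invented for the example and do not come from Coolan.

```python
from statistics import median

def share_of_peers_worse(own_rate, peer_rates):
    """Percentage of anonymized peers whose failure rate is higher than ours."""
    if not peer_rates:
        raise ValueError("need at least one peer value to benchmark against")
    worse = sum(1 for rate in peer_rates if rate > own_rate)
    return 100.0 * worse / len(peer_rates)

# Hypothetical annual failure rates contributed by opted-in operators.
# No operator identifiers are kept alongside the values.
peer_rates = [0.02, 0.035, 0.05, 0.08]
own_rate = 0.04

print(share_of_peers_worse(own_rate, peer_rates))  # 50.0
print(median(peer_rates))                          # the community midpoint
```

A real benchmark would also need to group peers by similar equipment and operating environment, as the article describes, before a comparison like this means much.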
Why Elderly Servers Are Problematic
Coolan provided an example of its analytics capabilities with a recent blog post that examined how to calculate the best time to refresh server hardware, evaluating the cost and performance of existing units against newer servers with improved power and energy efficiency.
“Aging infrastructure costs more than you might think,” the blog post noted. “Old machines lingering in a data center exact a hidden cost – with each new generation of hardware, servers become more powerful and energy efficient. Over time, the total cost of ownership drops through reduced energy bills, a lower risk of downtime, and improved IT performance. Figuring out when to invest in new servers doesn’t have to be a guessing game, though.”
The answer? In many cases, a three-year refresh cycle will yield the best results. Coolan offers a downloadable spreadsheet to help companies test their own TCO assumptions.
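Coolan's spreadsheet is not reproduced here, but the basic shape of a refresh calculation can be sketched in a few lines. The model below is a deliberate simplification and not Coolan's: it assumes a fixed purchase price, an annual operating cost that grows by a fixed percentage as a server ages, and no resale value. Every number is invented for illustration, and different inputs will produce a different answer.

```python
def annualized_cost(purchase_price, first_year_opex, opex_growth, years):
    """Average yearly cost of keeping one server for `years` years.

    Operating cost in year k is first_year_opex * (1 + opex_growth) ** (k - 1),
    a stand-in for rising energy, repair, and downtime costs as hardware ages.
    """
    opex = sum(first_year_opex * (1 + opex_growth) ** k for k in range(years))
    return (purchase_price + opex) / years

def best_refresh_interval(purchase_price, first_year_opex, opex_growth, max_years):
    """Refresh interval (in whole years) with the lowest annualized cost."""
    return min(
        range(1, max_years + 1),
        key=lambda n: annualized_cost(purchase_price, first_year_opex, opex_growth, n),
    )

# Hypothetical inputs: $6,000 server, $1,500 first-year operating cost,
# operating cost rising 50% per year of age.
for n in range(1, 6):
    print(n, annualized_cost(6000, 1500, 0.5, n))

print(best_refresh_interval(6000, 1500, 0.5, 6))  # 3
```

With these particular made-up inputs the cheapest interval happens to be three years. Lowering the growth rate pushes the optimum later, which is why it is worth testing your own assumptions.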
Amir Michael says most commercial tools available today are developed for enterprise IT, rather than “scale-out” computing. Coolan can be used as a diagnostic and failure tracking tool, but it also offers features of a configuration management database (CMDB) or an asset management tool.
What has Coolan learned from its early feedback?
“The biggest surprise is that the state of operations is not as advanced as I thought,” said Amir Michael. “We’ve seen a range of solutions, from spreadsheets to SQL databases to full asset management tools. It’s eye-opening to see how much variation there is between those who are automating really well, and those who aren’t.
“Some customers have really poor utilizations,” he added. “Some are buying server hardware almost as though they are deploying instances.”
Michael believes the sweet spot for Coolan will be customers whose server counts are growing faster than the staff needed to support them.
“Once you have 1,000 servers, you have events on a regular basis,” he said. “It becomes a problem when you have someone dealing with this on a regular basis. We want to allow these administrators to focus on their jobs, rather than tracking down dead servers.”
Or even servers that are having an unhappy day.