There are multiple issues related to controlling the air intake temperature that reaches your IT equipment; however, they generally fall into two major areas. The first is the method of controlling the supply air temperature that leaves the cooling units. The second is controlling the airflow to avoid or minimize the mixing of supply and return air before it enters the IT equipment.
Temperature control in data center cooling systems has traditionally been based on sensing the temperature of the air returning to the cooling units, with the set-point typically set and controlled on each cooling unit individually. When the return air temperature rises above the set-point, the unit simply begins its cooling cycle. In most such systems, if the set-point is 70°F, the cooling unit lowers the air temperature by 18-20°F, resulting in supply air temperatures of 50-52°F.
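The traditional return-air scheme can be sketched in a few lines. This is a hedged illustration only; the set-point and coil temperature drop are the figures from the text, and the function names are hypothetical, not from any real controller.

```python
# Sketch of legacy return-air control: the unit compares return-air
# temperature to its local set-point and runs the cooling cycle when
# the set-point is exceeded. Values are illustrative, per the article.
RETURN_SETPOINT_F = 70.0   # return-air set-point on the cooling unit
COIL_DELTA_F = 19.0        # assumed 18-20 F drop across the cooling coil

def cooling_cycle_on(return_temp_f: float) -> bool:
    """Simple on/off control keyed to return air, as in an older CRAC."""
    return return_temp_f > RETURN_SETPOINT_F

def supply_temp_f(return_temp_f: float) -> float:
    """Supply temperature while the unit is actively cooling."""
    return return_temp_f - COIL_DELTA_F

print(supply_temp_f(70.0))  # -> 51.0, i.e. the 50-52 F range noted above
```

Note that nothing in this loop references the IT intake temperature at all, which is precisely the drawback discussed next.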
This simple return-temperature-based control method has been used for over 50 years. Its inherent drawback, chronic overcooling of the supply air, has essentially been overlooked by most facility managers, and it was and still is considered a normal and “safe” operating practice for most data centers. Yet it wastes energy and, despite the very low supply air temperatures, does not really ensure that the IT equipment receives enough airflow or that intake temperatures stay within the recommended range, primarily due to poor airflow management.
This article is the third in a series on data center cooling taken from the Data Center Frontier Special Report on Data Center Cooling Standards (Getting Ready for Revisions to the ASHRAE Standard)
To improve energy efficiency, there has more recently been a trend toward sensing and controlling supply air temperatures, either at the output of the cooling system, in the underfloor plenum, in the cold aisle, or at the cabinet level. Supply air temperatures can be controlled relatively easily in newer CRAHs, which can continuously modulate the chilled water flow rate from 0-100% and may also vary fan speeds to maintain the supply temperature, adapting more efficiently to heat load changes. It is more difficult to implement supply-based temperature control in DX CRAC units, which need to cycle their internal compressors on and off and are therefore limited to only a few stages of cooling (depending on the number of compressors). While there have been some recent developments utilizing variable-speed compressors, the majority of installed CRACs have only simple on-off compressor control.
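The contrast with the return-air scheme is that a CRAH can modulate continuously against a supply set-point. A minimal proportional-only sketch, assuming an illustrative set-point and gain (neither is from the article, and real CRAH controllers are considerably more sophisticated):

```python
# Hypothetical sketch of supply-air control in a modern CRAH: the chilled
# water valve position modulates continuously (0-100%) to hold the supply
# set-point, rather than cycling a compressor on and off.
SUPPLY_SETPOINT_F = 65.0   # assumed supply-air set-point
GAIN = 0.25                # valve fraction per degree F of error (assumed)

def valve_position(supply_temp_f: float) -> float:
    """Proportional control: open the valve as supply drifts above set-point."""
    error = supply_temp_f - SUPPLY_SETPOINT_F
    return min(1.0, max(0.0, GAIN * error))

print(valve_position(65.0))  # -> 0.0 (at set-point, valve closed)
print(valve_position(69.0))  # -> 1.0 (4 F high: valve fully open)
```

A DX CRAC with two compressors, by contrast, can only approximate this curve with three discrete cooling stages (off, half, full), which is why supply-based control is harder to implement there.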
There is a wide variety of opinions and recommendations about the best, most energy-efficient way to control supply temperature: centralized control of individual CRAC/CRAH units, control by averaged supply air using under-floor sensors or sensors in the cold aisles, or a combination of inputs. Of course, that leads to the inevitable questions of where and how many sensors are required, and how the control system should respond if some areas are too warm while others are too cold. Nonetheless, more sophisticated control systems that can optimize and adapt to these challenges are under ongoing development.
Setting aside the supply temperature control strategies, one of the keys to ensuring proper IT equipment intake temperatures is airflow management. While the basic Hot Aisle – Cold Aisle arrangement has been generally adopted as the commonly accepted cabinet layout, it is only the beginning of ensuring the temperature and amount of airflow to the IT equipment. There are varying methods and levels of airflow management schemes to avoid recirculation or bypass airflow. These range from the most basic recommendations, such as blanking plates in the cabinets to prevent exhaust recirculation within the cabinet, all the way up to complete aisle containment systems (hot or cold). Minimizing the mixing of cold supply air with warm IT exhaust air (both within the cabinets and between the aisles) reduces or negates the need to unnecessarily overcool supply temperatures to make certain that the air entering the IT equipment is within the desired operational range (either recommended or allowable).
In an ideal world, the airflow through the cooling system would perfectly match the airflow requirements (CFM) of the IT gear, even as those requirements change dynamically; the result would be no bypass air and no recirculation, a perfectly optimized airflow balance. However, the increasingly dynamic nature of server heat loads also means that server fan rates span a very wide CFM range. This level of airflow control may be accomplished in very tightly enclosed systems, such as row-level containment systems. However, it is typically not feasible for most multi-tenant colocation data centers with centralized cooling, which use a common open return airflow path to provide flexibility to various customers.
Moreover, even well-managed dedicated enterprise facilities still need the flexibility to adapt to major additions and upgrades of IT equipment, as well as regular operational moves and changes. Temperatures can vary greatly across the various cold aisles, depending on the power density and airflow in uncontained aisles. In particular, there is typically temperature stratification from the bottom to the top of the racks, which can range from only a few degrees to 20°F or more in cases of poor airflow management. Higher temperatures at the tops of racks and the ends of aisles, as well as any “hot spots,” are one of the reasons that many sites still keep supply temperatures low: to ensure that these problem areas remain within the desired temperature range.
Server Reliability versus Ambient Temperature – the X-Factor
Besides introducing and defining the expanded “allowable” temperature ranges, the 2011 guidelines compiled the projected failure rates provided by the various IT equipment manufacturers and used them as the basis for the “X-Factor” mentioned previously. This provided a statistical projection of the relative failure rate vs. temperature, described in the 2011 guidelines as a method “to allow the reader to do their own failure rate projections for their respective data center locale.” The higher allowable temperature ranges were meant to promote the use of “free cooling” wherever possible, to help improve facility cooling system energy efficiency, and the X-Factor was provided to help operators assess the projected failure-rate impact of operating in the expanded ranges. The guidelines also provided a list of time-weighted X-Factors for major cities in the US, as well as Europe and Asia.
Upon first inspection, the X-Factor seems to run counter to everything that previous editions of the ASHRAE guidelines held as sacrosanct for maximum reliability: a constant temperature of 68°F (20°C), tightly controlled 7x24x365. That may still be the preferred target temperature for many facilities seeking to minimize risk, but this data is used as the baseline X-Factor reference point for A2-rated volume servers, wherein 68°F (20°C) is assigned a reference risk value of 1.0. The risk factor increases with temperature above 68°F and decreases below it (*see footnote regarding humidity).
However, part of the confusion about the X-Factor stems from the fact that the increase (or decrease) only represents a statistical deviation from the existing equipment failure rate (assuming the server was continuously operating at the 68°F reference). It is important to note that the underlying baseline failure rate (i.e., xx failures per 1000 units per year) is not disclosed and is already a “built-in” IT failure risk. This “normal” failure rate is a fact of life which must be dealt with in any IT architecture. For example, suppose the undisclosed failure rate from a given server manufacturer is 4 failures per 1000 servers per year. If those servers were operated at 90°F (X-Factor of 1.48) for a year, the statistically projected failure rate would be about 6 servers per 1000 per year, which should not have a significant operational impact. That is why the projected and actual historic failure rate of any particular server needs to be understood in the context of the impact of a failure, and evaluated when making operating temperature decisions.
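The arithmetic above can be made explicit. The baseline rate of 4 failures per 1000 servers per year is the article's hypothetical example (real manufacturer rates are not disclosed), and the 1.48 figure is the A2 X-Factor at 90°F cited in the text:

```python
# Hedged illustration of the X-Factor arithmetic. The baseline failure
# rate is the article's hypothetical example, not a disclosed figure.
BASELINE_FAILURES_PER_1000 = 4.0   # assumed annual rate at the 68 F reference
X_FACTOR_90F = 1.48                # relative risk at 90 F, per the text

def projected_failures(fleet_size: int, x_factor: float) -> float:
    """Scale the baseline annual failure rate by the X-Factor."""
    return fleet_size * (BASELINE_FAILURES_PER_1000 / 1000.0) * x_factor

print(round(projected_failures(1000, X_FACTOR_90F)))  # -> 6 servers per year
```

The point of the exercise is that the X-Factor multiplies an already-present background failure rate; it does not introduce a new failure mode.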
*X-Factor Footnote: It should be noted that all of the above is based only on dry-bulb temperatures and does not take into account the effects of pollution and humidity introduced by the use of airside economizers. With waterside economizers, however, the IT intake air is not subjected to the wider humidity variations or pollution of outside air; those effects are ignored in these calculations, and avoiding them should improve reliability compared to direct exposure to outside air.
Hidden Exposure – Rate of Change
While the wider temperature ranges get the most attention, one of the less-noticed operational parameters in the TC 9.9 thermal guidelines is the rate of temperature change. Since 2008 it has been specified as 36°F (20°C) per hour (except for tape-based back-up systems, which are limited to less than 9°F/5°C per hour). While expressed “per hour,” this should be evaluated in minutes or even seconds, especially when working with modern high-density servers (such as a rack of 5 kW blade servers with a high delta-T): even a 10°F rise occurring in 5 minutes (i.e., 2°F per minute) effectively represents a 120°F-per-hour rate of rise, and could result in internal component damage to servers from thermal shock.
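The conversion in that example is simple but easy to overlook in monitoring. A small sketch, using the guideline limit and the rise figures from the text:

```python
# Convert a short-interval temperature rise to an hourly rate, for
# comparison against the TC 9.9 limit of 36 F (20 C) per hour.
RATE_LIMIT_F_PER_HOUR = 36.0

def hourly_rate(delta_f: float, minutes: float) -> float:
    """Extrapolate a rise observed over 'minutes' to a per-hour rate."""
    return delta_f * (60.0 / minutes)

rate = hourly_rate(10.0, 5.0)        # 10 F rise in 5 minutes
print(rate)                          # -> 120.0 F per hour
print(rate > RATE_LIMIT_F_PER_HOUR)  # -> True: well beyond the guideline
```

In other words, a monitoring system that samples only hourly could report full compliance while servers experience a rate of rise more than three times the limit.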
While operating at 90°F may result in a statistical increase of a few more server failures a year, a sudden rise in temperature from the loss of cooling, whether due to equipment failure or to the 5-15 minute compressor restart delays after a utility failure while generator power comes on-line, could prove catastrophic to all the IT equipment. It is therefore just as important, if not more so, to ensure a stable temperature in the event of a cooling system incident.
The possibility of a loss or interruption of cooling is another operational reason to keep supply temperatures low: it increases thermal ride-through time in the event of a brief cooling outage, especially for higher-density cabinets, where an event of only a few minutes could cause unacceptably high IT intake temperatures. There are several ways to minimize or mitigate this risk, such as increased cooling system redundancy, rapid-restart chillers, and/or thermal storage systems. Ultimately this is still a technical and business decision for the operator.
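A rough sensible-heat estimate shows why ride-through windows are so short. Every value here is an illustrative assumption (room volume, load, and allowable rise are not from the article), and the model ignores the thermal mass of equipment and building materials, so real ride-through is somewhat longer:

```python
# Rough, hedged estimate of thermal ride-through: how long the room air
# alone can absorb the full IT load before intake temperatures exceed an
# allowable rise. Sensible-heat-of-air approximation only; all inputs
# below are illustrative assumptions, not figures from the article.
AIR_DENSITY_KG_M3 = 1.2    # approximate density of air
AIR_CP_J_KG_K = 1005.0     # specific heat of air

def ride_through_seconds(room_m3: float, load_kw: float,
                         allowed_rise_c: float) -> float:
    """Seconds until room air warms by allowed_rise_c under full load."""
    air_mass_kg = AIR_DENSITY_KG_M3 * room_m3
    return air_mass_kg * AIR_CP_J_KG_K * allowed_rise_c / (load_kw * 1000.0)

# Hypothetical 500 m^3 room, 100 kW IT load, 5 C allowable rise:
print(round(ride_through_seconds(500, 100, 5)))  # -> 30 seconds
```

Even with generous assumptions the answer comes out in seconds to a few minutes, which is why compressor restart delays of 5-15 minutes demand redundancy or thermal storage rather than reliance on room air alone.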
Next week we will explore the impact of new servers and Energy Star IT equipment. If you prefer you can download the Data Center Frontier Special Report on Data Center Cooling Standards in PDF format from the Data Center Frontier White Paper Library courtesy of Compass Data Centers. Click here for a copy of the report.