Top-level energy and environmental dashboard for data center monitoring.
Have you ever asked yourself why the automobile's dashboard looks the way it does? The four top-level gauges are generally the speedometer, the fuel gauge, the engine-temperature gauge, and the clock. They provide the most important information a driver needs to know to stay out of trouble and keep moving. A second tier of warning icons is hidden from view until something goes wrong with the vehicle (e.g., check engine icon) or the behavior of the passengers (e.g., fasten seatbelt icon). This hierarchy is a time-tested way of arranging the information provided to the driver.
A data center facility should provide an adequate thermal IT-equipment environment while minimizing infrastructure energy usage. Curbing energy consumption in energy intensive data centers is important for economic reasons and ensuring a satisfactory operating thermal environment is important for protecting IT-equipment from failure.
However, there is a perceived conflict between these two important goals. The most scrutinized link is "air management," which essentially is about keeping cold and hot air from mixing. Cold supply air from the air handler should enter the heat-generating IT-equipment without mixing with ambient air and the hot exhaust air should return to the air handler without mixing. Managing the cold and hot air streams in data centers is important both for infrastructure energy management and IT-equipment thermal management.
Air management has great potential to make data centers more energy efficient. Correctly implemented air management also has great potential to improve the thermal IT-equipment conditions. The information required to implement effective air management can be obtained by monitoring only three data entities: infrastructure energy efficiency, thermal IT-equipment conditions, and air management effectiveness. A number of useful metrics have been developed over the past years. This paper lays out a rationale analogous to that of the automobile dashboard, as well as the justification for selecting three of those metrics for the proposed top-level energy and environmental dashboard for data center monitoring.
DATA CENTER MONITORING
The old adage "you can't manage what you don't measure" was never truer than it is for data centers. As data center operators take action to improve the energy efficiency of their data centers, they need data to give them visibility into this dynamic environment. Without these data, they are unable to make informed decisions regarding data center optimization.
Historically, data centers have been over-provisioned, especially when it comes to cooling. As data center operators improve cooling efficiency, the safety margin created by the over-provisioning is removed. With a greatly reduced response-time cushion, effective monitoring becomes a mission-critical application for data centers. Such monitoring systems must provide near-real-time data collection and an alarm notification and escalation system. Monitoring systems must be robust, enterprise-wide systems that are capable of providing multi-site reporting and analysis.
To fully realize the benefits of monitoring, these systems must provide an effective way to access data, including:
* Statistical analysis--covariance, regression, etc.
* Conditional alarming--notifications based on complex conditional exceptions
* Access via standard Business Intelligence tools, including Excel
Monitoring systems are the basis for:
* Continuous improvement cycles
* Improved operational effectiveness
* Availability--downtime avoidance
* Capacity planning
* Utility rebates.
Typical capabilities of monitoring systems include:
* Ability to monitor and record granular data for data center devices
* Support for metrics such as DCiE, RCI, and RTI
* Simple data access--ODBC, reporting, dashboards, SQL, etc.
* Capable of running in a high availability environment
* Near-real-time access
* Alarm escalation and management
Monitoring systems get their data either from instrumentation built into data center equipment or from separate sensors and meters. Most data center devices such as PDUs, UPSs, CRACs, and intelligent power strips are capable of transmitting data via communication protocols such as Modbus, SNMP, and BACnet. A wide range of power meters, environmental sensors, pressure sensors, and similar devices is available in the market. Sufficient instrumentation is a prerequisite to effective monitoring.
The level of instrumentation determines the level of detail with which performance data can be measured. For example, the computation of basic DCiE simply requires a data point for the total amount of power coming into the data center and a data point for the total amount of power going to the IT equipment. With additional instrumentation, however, a more detailed DCiE can be computed.
With more performance data collected, there is a need to intelligently summarize the information. Performance metrics play a key role in making sense of data from comprehensive and continuous monitoring of data center devices. The next sections discuss three such metrics.
Metrics and the ability to monitor and track the performance of data centers are integral to successful operation. One of the most powerful features of metrics is the capability of trending complex data over time. Generally speaking, a "metric" is defined as a standard for measuring or evaluating something. All three top-level metrics that were selected to be included in the proposed energy and environmental dashboard act in accordance with this definition.
Data Center infrastructure Efficiency (DCiE) is a metric used to determine the energy efficiency of a data center. The Rack Cooling Index (RCI) is a measure of how well the IT-equipment is cooled within the manufacturers' specifications. Since a thermal guideline becomes truly useful when there is an unbiased and objective way of determining the operating compliance with the guideline, the RCI is included in the ASHRAE Thermal Guideline (ASHRAE 2008) for purposes of showing compliance. Finally, the Return Temperature Index (RTI) is a measure of the performance of the air-management system.
These metrics are individually used in the DOE's "DC Pro" data center energy assessment software tool suite (DOE 2009a) as well as in the Data Center Certified Energy Practitioner (DC-CEP) Program (DOE 2009b). They reduce a great amount of data to understandable numbers that can easily be trended and analyzed. The rationale and definition of the three metrics are presented next.
DATA CENTER INFRASTRUCTURE EFFICIENCY (DCiE)
Data center infrastructure efficiency (DCiE) and the power usage effectiveness (PUE) have become commonly used metrics for data center efficiency. The PUE is essentially the reciprocal of DCiE. These metrics were developed by members of the Green Grid, which is an industry group focused on data center energy efficiency. One benefit of using the DCiE rather than the PUE is that it has an easily understood scale of 0-100% (Green Grid 2008).
$$\mathrm{DCiE} = \frac{\text{IT-Equipment Power}}{\text{Total Facility Power}} \times 100\% \qquad (1)$$
Standard guidelines for the use and reporting of these metrics have been developed by the Green Grid. All DCiE measurements should be reported with subscripts that identify (1) the accuracy of the measurements, (2) the averaging period of the measurements (e.g., yearly, monthly, weekly, daily), and (3) the frequency of the measurement (e.g., monthly, weekly, daily, continuous). For the purpose of the proposed dashboard, the user can select both the averaging period and the frequency (limited by the actual measurement frequency).
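A minimal sketch of how a user-selected averaging period might be applied, assuming interval energy readings are available for both the IT equipment and the total facility. The `EnergySample` type and the `dcie_over_period` function are hypothetical names introduced here, and forming the ratio of energy sums (rather than averaging per-interval ratios) is an assumption of this sketch, not a rule quoted from the Green Grid guidelines.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EnergySample:
    """One metering interval (hypothetical structure, not from the paper)."""
    it_kwh: float        # energy delivered to the IT equipment in the interval
    facility_kwh: float  # total facility energy in the same interval


def dcie_over_period(samples: Iterable[EnergySample]) -> float:
    """Return DCiE in percent for the whole period as a ratio of energy sums."""
    samples = list(samples)
    total_it = sum(s.it_kwh for s in samples)
    total_facility = sum(s.facility_kwh for s in samples)
    if total_facility <= 0:
        raise ValueError("total facility energy must be positive")
    return 100.0 * total_it / total_facility


# Two made-up daily readings: 1200/2400 kWh and 600/1600 kWh.
readings = [EnergySample(1200.0, 2400.0), EnergySample(600.0, 1600.0)]
print(round(dcie_over_period(readings), 1))  # 45.0
```

With these made-up readings the period value is 45.0%, whereas the mean of the two daily ratios (50% and 37.5%) would be 43.75%; which convention applies should be stated alongside the reported averaging period.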
Table 1 shows ratings of the DCiE. A value of 100% simply indicates 100% efficiency, i.e., all energy is used by the IT-equipment (ideal). However, a typical value is only 50% (EPA 2007). State-of-the-Art installations have values around 85% (Google 2009).
Table 1. Rating of the DCiE

Rating                 DCiE (%)
Ideal (maximum)        100
State-of-the-Art       85
Best Practice          70
Improved Operations    60
Current Trend          55
Typical (average)      50
The DCiE allows data center operators to quickly estimate the energy efficiency of their data centers and determine whether any energy efficiency improvements need to be made. DCiE will represent infrastructure energy efficiency on the proposed dashboard.
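The ratings in Table 1 and the reciprocal relationship to PUE can be sketched as a small lookup. This is an illustration only: treating each table value as the lower bound of its rating band is an assumption, and the function names and the "Below typical" label are hypothetical.

```python
# Values from Table 1, highest first. Using each value as the lower bound
# of a rating band is an assumption made for this sketch.
DCIE_RATINGS = [
    (100.0, "Ideal (maximum)"),
    (85.0, "State-of-the-Art"),
    (70.0, "Best Practice"),
    (60.0, "Improved Operations"),
    (55.0, "Current Trend"),
    (50.0, "Typical (average)"),
]


def _check_dcie(dcie_percent: float) -> None:
    """Reject values outside the 0-100% scale of the DCiE."""
    if not 0.0 < dcie_percent <= 100.0:
        raise ValueError("DCiE must be in the range (0, 100] percent")


def dcie_rating(dcie_percent: float) -> str:
    """Return the Table 1 rating whose lower bound the value reaches."""
    _check_dcie(dcie_percent)
    for lower_bound, label in DCIE_RATINGS:
        if dcie_percent >= lower_bound:
            return label
    return "Below typical"  # hypothetical label, not part of Table 1


def pue_from_dcie(dcie_percent: float) -> float:
    """PUE is the reciprocal of DCiE expressed as a fraction."""
    _check_dcie(dcie_percent)
    return 100.0 / dcie_percent


print(dcie_rating(80.0))    # Best Practice
print(pue_from_dcie(50.0))  # 2.0
```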
RACK COOLING INDEX (RCI)
The main task for a data center facility is to provide an adequate equipment environment; therefore, a relevant metric for IT-equipment intake temperatures should be used to gauge the thermal environment. The Rack Cooling Index (RCI) is a measure of how effectively equipment racks are cooled within a given thermal guideline, both at the high end and at the low end of the temperature range (Herrlin 2005). Specifically, the RCI is a performance metric explicitly designed to gauge compliance with the thermal guidelines of ASHRAE (2008) and NEBS (Telcordia 2001, 2006) for a given data center. The index is included in the ASHRAE thermal guideline for purposes of showing compliance.
Both guidelines use recommended and allowable ranges. The recommended intake temperature range is a statement of reliability (facility operation) whereas the allowable range is a statement of functionality (equipment testing). The numerical values of the recommended and allowable ranges depend on the applied environmental guideline. In the ASHRAE specification, the recommended and allowable temperature ranges are 64.4-80.6°F (18-27°C) and 59.0-89.6°F (15-32°C), respectively.
Over-temperature conditions exist once one or more intake temperatures exceed the maximum recommended temperature. Similarly, under-temperature conditions exist when intake temperatures drop below the minimum recommended temperature. The RCI "compresses" the equipment intake temperatures into two numbers--the RCI_HI and the RCI_LO. An RCI_HI of 100% means no over-temperatures, whereas an RCI_LO of 100% means no under-temperatures. Both numbers at 100% mean that all temperatures are within the recommended temperature range--i.e., absolute compliance. The lower the percentage, the greater the probability (risk) that intake temperatures are above the maximum allowable or below the minimum allowable, respectively. A value below 90% is often characterized as "poor."
Figure 1 provides a graphical representation of the RCI_HI (the RCI_LO is analogous). The bold curve is the intake temperature distribution for all N intakes; the temperatures have been arranged in order of increasing temperature. The Total Over-Temperature represents a summation of all over-temperatures (triangular area). The Maximum Allowable Over-Temperature is also defined in the figure (rectangular area). The definition of RCI_HI is as follows:
[FIGURE 1 OMITTED]
$$\mathrm{RCI}_{HI} = \left[1 - \frac{\text{Total Over-Temp}}{\text{Max Allowable Over-Temp}}\right] \times 100\% \qquad (2)$$
Table 2 shows the proposed rating of the RCI based on numerous numerical analyses (Herrlin 2007). Lawrence Berkeley National Laboratory (LBNL) is also in the process of benchmarking this performance metric. The risk for temperatures above (below) the maximum (minimum) allowable temperature increases with declining values. A warning flag "*" appended to the index indicates that one or several intake temperatures are above (below) the allowable range. The index value for the intake temperatures shown in Figure 1 is RCI_HI = 95%*.

Table 2. Proposed Rating of the RCI

Proposed Rating    RCI
Ideal              100%
Good               ≥95% to <100%
Acceptable         ≥90% to <95%
Poor               <90%
The RCI provides an unbiased and objective way of quantifying the quality of an air management design from a thermal perspective. The RCI will represent the thermal equipment environment on the proposed dashboard.
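The two indices can be sketched in a few lines, assuming the Maximum Allowable Over-Temperature (the rectangular area in Figure 1) equals the number of intakes times the gap between the maximum allowable and maximum recommended temperatures, and analogously at the low end. That reading of the omitted figure, the default ASHRAE limits in °C, and all names below are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RciResult:
    rci_hi: float   # percent
    rci_lo: float   # percent
    hi_flag: bool   # True if any intake is above the maximum allowable
    lo_flag: bool   # True if any intake is below the minimum allowable


def rack_cooling_index(intake_temps_c: Sequence[float],
                       min_rec: float = 18.0, max_rec: float = 27.0,
                       min_all: float = 15.0, max_all: float = 32.0) -> RciResult:
    """Compute RCI_HI and RCI_LO from rack intake temperatures in deg C."""
    n = len(intake_temps_c)
    if n == 0:
        raise ValueError("at least one intake temperature is required")
    # Sum of excursions beyond the recommended range (the "triangular" areas).
    total_over = sum(t - max_rec for t in intake_temps_c if t > max_rec)
    total_under = sum(min_rec - t for t in intake_temps_c if t < min_rec)
    # Assumed "rectangular" areas: every intake at the allowable limit.
    max_allowable_over = n * (max_all - max_rec)
    max_allowable_under = n * (min_rec - min_all)
    return RciResult(
        rci_hi=100.0 * (1.0 - total_over / max_allowable_over),
        rci_lo=100.0 * (1.0 - total_under / max_allowable_under),
        hi_flag=any(t > max_all for t in intake_temps_c),
        lo_flag=any(t < min_all for t in intake_temps_c),
    )


def rci_rating(rci_percent: float) -> str:
    """Map an index value to the proposed rating in Table 2."""
    if rci_percent >= 100.0:
        return "Ideal"
    if rci_percent >= 95.0:
        return "Good"
    if rci_percent >= 90.0:
        return "Acceptable"
    return "Poor"


# Made-up intake temperatures: two intakes exceed 27 deg C by 1 and 2 degrees.
result = rack_cooling_index([20.0, 22.0, 25.0, 27.0, 28.0, 29.0])
print(round(result.rci_hi, 1), round(result.rci_lo, 1))  # 90.0 100.0
```

In this made-up case the total over-temperature is 3°C against a maximum allowable of 6 x 5 = 30°C, so RCI_HI is 90%; no intake leaves the allowable range, so neither warning flag is set.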
RETURN TEMPERATURE INDEX (RTI)
The Return Temperature Index (RTI) is a measure of the net level of by-pass air or net level of recirculation air in the equipment room (Herrlin 2007 and 2008). Both effects are detrimental to the overall energy and thermal performance of the space. By-pass air does not contribute to the cooling of the electronic equipment, and it depresses the return air temperature. Recirculation, on the other hand, is one of the main reasons for hot spots or areas significantly hotter than the ambient temperature. Thus, the RTI provides a link between energy usage (DCiE) and the thermal IT-equipment environment (RCI).
The Return Temperature Index (RTI) is a measure of the performance of the air-management system and how well it controls by-pass and recirculation air. Deviations from 100% are generally an indication of declining performance. The index is defined as follows:
$$\mathrm{RTI} = \frac{\Delta T_{AHU}}{\Delta T_{Equip}} \times 100\% = \frac{V_{Equip}}{V_{AHU}} \times 100\% \qquad (3)$$
where
RTI = Return temperature index
$\Delta T_{AHU}$ = Temperature drop across the air-handler units (airflow weighted average)
$\Delta T_{Equip}$ = Temperature rise across the IT-equipment (airflow weighted average)
$V_{AHU}$ = Total airflow rate through the air-handler units
$V_{Equip}$ = Total airflow rate through the IT-equipment
Since the temperature rise across the IT-equipment provides the potential for high return temperatures, it makes sense to normalize the RTI with regard to this entity. In other words, the RTI provides a measure of the actual utilization of the available temperature differential. Consequently, a low return air temperature is not necessarily a sign of poor air management. If the IT-equipment only provides a modest temperature rise, the return air temperature cannot be expected to be high. Many legacy servers and other electronic systems have a temperature rise of only 10°F (6°C) whereas new blade servers can have a temperature differential of 50°F (28°C).
The equation shows the intrinsic link between energy and thermal management. The RTI is also the ratio of total airflow through the IT-equipment to the total airflow through the air handlers. The interpretation of the index is now straightforward (see Table 3): A value above 100% suggests net recirculation air, which elevates the return air temperature. Unfortunately, this also means elevated equipment intake temperatures. A value below 100% suggests net by-pass air; cold air by-passes the electronic equipment and is returned directly to the air handler, reducing the return temperature. This may happen when the supply airflow is increased to combat hot-spots or if there are leaks in the raised floor.

Table 3. Interpretation of the RTI

Interpretation           RTI
Balanced                 100%
Net Recirculation Air    >100%
Net By-Pass Air          <100%
There might be a number of legitimate reasons to operate below or above 100%. For example, some air-distribution schemes are designed to provide a certain level of air mixing (recirculation) to provide an even equipment intake temperature. Some overhead air-distribution systems are designed to operate this way. Raised-floor cooling, on the other hand, often needs some excess air to function properly.
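As a rough illustration of Equation 3 and Table 3, the sketch below computes an airflow-weighted temperature rise, the RTI, and its interpretation. The function names and the `tolerance` band around 100% are hypothetical; Table 3 itself defines "Balanced" as exactly 100%.

```python
from typing import Sequence


def airflow_weighted_mean(delta_ts: Sequence[float],
                          airflows: Sequence[float]) -> float:
    """Average temperature differences, weighting each by its airflow rate."""
    if len(delta_ts) != len(airflows) or not delta_ts:
        raise ValueError("need equally sized, non-empty sequences")
    total_airflow = sum(airflows)
    if total_airflow <= 0:
        raise ValueError("total airflow must be positive")
    return sum(dt * v for dt, v in zip(delta_ts, airflows)) / total_airflow


def rti_percent(delta_t_ahu: float, delta_t_equip: float) -> float:
    """RTI = (temperature drop across AHUs / rise across IT equipment) x 100%."""
    if delta_t_equip <= 0:
        raise ValueError("equipment temperature rise must be positive")
    return 100.0 * delta_t_ahu / delta_t_equip


def interpret_rti(rti: float, tolerance: float = 1.0) -> str:
    """Classify per Table 3; the tolerance band is an assumption of this sketch."""
    if rti > 100.0 + tolerance:
        return "Net Recirculation Air"
    if rti < 100.0 - tolerance:
        return "Net By-Pass Air"
    return "Balanced"


# Made-up data: two equipment groups with different temperature rises and airflows.
dt_equip = airflow_weighted_mean([10.0, 14.0], [1.0, 3.0])  # 13.0
rti = rti_percent(delta_t_ahu=10.0, delta_t_equip=dt_equip)
print(round(rti, 1), interpret_rti(rti))  # 76.9 Net By-Pass Air
```

Whether a deviation from 100% is a problem depends on the air-distribution scheme, as noted above, so the classification is a prompt for investigation rather than a verdict.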
Finally, as stated in the Introduction, an RCI analysis should be accompanied by an energy analysis; in this case by using the DCiE and the RTI. Improving the RCI can lead to an energy penalty. The DCiE and RTI can help evaluate how severe such a penalty may be. RTI will represent air management on the proposed dashboard. There we have it! The next section discusses the proposed dashboard.
The proposed dashboard consists of four gauges: one for energy efficiency (DCiE), two for IT-equipment intake temperatures (RCI_HI and RCI_LO), and one for air management effectiveness (RTI). Again, the last one provides the intrinsic link between the energy gauge and the temperature gauges. All gauges are based on non-dimensional performance metrics to intelligently summarize and trend a large amount of data and avoid operator fatigue. The gauges share a common feature of most automobile fuel gauges, that is, an analog gauge for the current status and a warning icon for out-of-bound conditions. The dashboard also has access to detailed data when needed.
The readings shown on the gauges of the preproduction dashboard (Figure 2) could be interpreted as follows:
[FIGURE 2 OMITTED]
* DCiE of 80% is near state-of-the-art
* RCI_HI of 97% is considered good
* RCI_LO of 81% indicates an over-cooled space (<90% is often considered poor)
* RTI of 77% indicates an under-utilization of the available equipment temperature differential AND an over-ventilated space; the air-handler airflow is about 1.30 times the IT-equipment airflow (1/0.77), i.e., by-pass air of about 30%.
The overall goal is to move all four needles towards the 12 o'clock position (100%). The crux of the matter is knowing what corrective actions may be needed. In this example, improved air management could reduce the by-pass air (increase RTI), reduce the fan energy (improve DCiE), and increase the supply air temperature (raise RCI_LO).
The alarm levels are user-defined, as is the coloring of the dials in green, orange, and red to indicate good, acceptable, and poor operation, respectively. In addition, the operator can select both the averaging period and the sampling frequency. A second tier of gauges includes data of higher granularity. Second-tier data also include trending of the utilized metrics.
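A minimal sketch of how user-defined bands might drive the dial coloring, assuming a gauge where higher values are better. The `GaugeBands` type and `gauge_color` function are hypothetical; the example bands reuse the RCI thresholds from Table 2, and the readings are those quoted for Figure 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GaugeBands:
    """User-defined lower limits for a higher-is-better gauge (hypothetical)."""
    good_min: float        # at or above this value the dial shows green
    acceptable_min: float  # at or above this value (but below good) it shows orange


def gauge_color(value: float, bands: GaugeBands) -> str:
    """Return the dial color for a reading: green, orange, or red."""
    if value >= bands.good_min:
        return "green"
    if value >= bands.acceptable_min:
        return "orange"
    return "red"


# Bands taken from Table 2: good >= 95%, acceptable >= 90%, poor below that.
RCI_BANDS = GaugeBands(good_min=95.0, acceptable_min=90.0)

readings = {"RCI_HI": 97.0, "RCI_LO": 81.0}
for name, value in readings.items():
    print(name, gauge_color(value, RCI_BANDS))
# RCI_HI green
# RCI_LO red
```

A one-sided scheme like this suits the RCI and DCiE gauges; the RTI gauge would need bands on both sides of 100%, since deviations in either direction matter.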
A glance at the proposed dashboard provides instant visual information on the operational status of infrastructure energy efficiency (DCiE), IT-equipment intake air temperature compliance (RCI_HI and RCI_LO), and air management effectiveness (RTI). The dashboard is not only a monitoring tool but also a diagnostic tool for reconfiguring the site and resolving air management issues. Furthermore, the alarm functionality provides important information about out-of-bound conditions. All this striking simplicity is made possible by utilizing the selected performance metrics.
A data center facility should provide an adequate thermal IT-equipment environment while minimizing infrastructure energy usage. Air management has great potential to make data centers more energy efficient, and correctly implemented it also has great potential to simultaneously improve the thermal IT-equipment conditions. This paper has presented the rationale for a top-level energy and environmental dashboard for data center monitoring, consisting of four gauges: one for infrastructure energy efficiency, two for IT-equipment intake air temperature compliance, and one for air management effectiveness. A glance at the dashboard provides instant visual information on the operational status of the data center. All this simplicity is made possible by utilizing selected non-dimensional performance metrics (DCiE, RCI_HI, RCI_LO, and RTI) to intelligently summarize a large amount of data and avoid operator fatigue. The three data center metrics have been described in some detail in this paper. The basis for the metrics is comprehensive and continuous monitoring of select data center devices. Utilizing the proposed dashboard makes monitoring and managing data center energy efficiency and IT-equipment thermal conditions a less daunting task, and it provides the capability of improving both.
REFERENCES
ASHRAE. 2008. Special publication, thermal guidelines for data processing environments. American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., Atlanta, GA.
DOE. 2009a. "DC Pro" energy assessment software tools suite. http://www1.eere.energy.gov/industry/saveenergynow/dc_pro.html.
DOE. 2009b. Data Center Certified Energy Practitioner (DC-CEP) Program. http://www1.eere.energy.gov/industry/saveenergynow/cep_program.html.
EPA. 2007. Report to Congress on Server and Data Center Energy Efficiency Public Law 109-431. August 2007.
Google. 2009. Insights into Google's PUE results, efficient data center summit, April 1, 2009, Google, Mountain View, CA. http://www.google.com/corporate/green/datacenters/summit.html.
Green Grid. 2008. Green grid data center power efficiency metrics: PUE and DCiE.
Herrlin, M.K. 2005. Rack cooling effectiveness in data centers and telecom central offices: The rack cooling index (RCI). ASHRAE Transactions 111(2), American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., Atlanta, GA.
Herrlin, M.K. 2007. Improved data center energy efficiency and thermal performance by advanced airflow analysis. Digital Power Forum 2007, San Francisco, CA, September 10-12.
Herrlin, M.K. 2008. Airflow and cooling performance of data centers: Two performance metrics. ASHRAE Transactions 114(2), American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., Atlanta, GA.
Telcordia. 2006. (Kluge, R.) Generic requirements NEBS GR-63-CORE, NEBS requirements: Physical Protection, Issue 3, March 2006, Telcordia Technologies, Inc., Piscataway, NJ.
Telcordia. 2001. (Herrlin, M.K.) Generic Requirements NEBS GR-3028-CORE, Thermal Management in Telecommunications Central Offices, Issue 1, December 2001, Telcordia Technologies, Inc., Piscataway, NJ.
DISCUSSION
H. Ezzat Khalifa, Professor, Syracuse University, Syracuse, New York: I suggest adding ambient conditions to the trend chart to help interpret changes in chiller or CRAC power.
Craig Compiano: Excellent suggestion, since one needs to track the correlation between outside air temperature and data center cooling metrics. I would add that it is equally desirable to track IT load against these metrics.
Magnus K. Herrlin is president at ANCIS Incorporated, San Francisco, CA. Craig M. Compiano is president at Modius Inc., San Francisco, CA.
Author: Herrlin, Magnus K.; Compiano, Craig M.
Date: Jan 1, 2010