Airflow Management in a Liquid-Cooled Data Center

ABSTRACT

Electronics densification is continuing at an unrelenting pace at the server, rack, and facility levels. With increasing facility density levels, airflow management has become a major challenge and concern. Hot spots, air short-circuiting, and inadequate tile airflow are a few of the issues that are complicating airflow management.

This paper focuses on a thermal management approach that simplifies facility airflow management in a cost-effective and efficient manner. Implementation of the technology was undertaken with the DOE's Pacific Northwest National Laboratory. Under the effort, a single 8.2 kW rack of HP rx2600 servers was converted from air cooling to liquid cooling. The liquid-cooling solution employs spray modules that indirectly cool the processors and remove the processor heat load directly to the facility water and not the facility air. An infrared camera was used to measure the temperature distributions over the rear doors of the liquid-cooled rack and several air-cooled racks, as well as the internal areas of a liquid-cooled and an air-cooled server. A tile hood was also used to measure the airflow rate out of all of the perforated tiles in the data center.

The air exiting an 8.2 kW air-cooled rack located in a best-case facility location reached a maximum of 34°C, and the air exiting an air-cooled rack located in a worst-case location reached a maximum of 44°C, while the air exiting the liquid-cooled rack was 10°C to 20°C cooler, reaching a maximum of 24°C. The air is delivered to the tiles at approximately 14°C. The thermal gradient over the air-cooled racks was approximately 10°C (hottest servers at the top), while that over the liquid-cooled rack was less than half of that, on the order of 3°C-4°C. The image of the internal areas of the air-cooled server showed some significant hot spots on the power pods and memory, while these were significantly diminished for the liquid-cooled server. The tile airflow measurements revealed that the vast majority of the tiles delivered approximately 725 cfm, with five tiles delivering between 1,300 and 1,480 cfm.

This paper provides further details on the study and will analyze the manner in which facility airflow management complexity and cost can be reduced for a liquid-cooled facility.

INTRODUCTION

Electronics densification is continuing at an unrelenting pace at the server, rack, and facility levels. While the most common rack power level is still in the 2-3 kW range (Rasmussen 2005), several vendors are offering racks upwards of 20 kW. ASHRAE's Datacom Equipment Power Trends and Cooling Applications (ASHRAE 2005) provides additional details with regard to equipment power trends. In particular, Figure 3.12 of that book provides an update to the original Uptime Institute power trend chart.

As the capital, construction, and operating costs of facilities continue to climb, data center managers are forced to push for more computationally dense and productive facilities. This forces the data center manager to seek out racks of increasingly higher power levels and to drive facility densities upward. With increasing rack and facility density levels, airflow management has become a major challenge and concern. Data center hot spots, air short-circuiting, and inadequate tile airflow are a few of the challenges that are now plaguing today's data centers. These conditions are making it increasingly difficult for data center managers to maintain recommended server or rack inlet conditions. ASHRAE's Thermal Guidelines for Data Processing Environments offers guidelines covering items such as rack inlet temperatures and humidities for equipment racks in data centers (ASHRAE 2004).

The area of data center airflow management, as a means to ensure proper server or rack inlet conditions, is now getting a lot of attention in the academic and industrial research communities. Researchers are attacking the problem from many different angles, with the overall objective of ensuring that server or rack inlet conditions meet manufacturers' specifications.

Sharma et al. (2004) have proposed a supply heat index (SHI) for use in the design and optimization of air-cooled data centers. The study reports results from the first comprehensive heat transfer and fluid flow experiments in a production-level data center. The authors use the experimental results from the study to calculate the SHI under varying conditions and demonstrate the utility of this dimensionless parameter to ensure proper rack inlet conditions.

Wang (2004) has proposed a new door design to prevent hot air from being recirculated into the tops of the racks. Wang proposes a mostly solid rack door with perforations restricted to the base of the rack door. This allows the door to pull in chilled air from the perforated tiles and not from the hot air stratified toward the tops of the racks. For a given 3.5 kW rack, Wang shows a reduction in temperature rise over the rack, i.e., the temperature difference between the air issuing from the perforated tile and that entering the equipment at the top of the rack, from 12°C down to 4°C. Wang acknowledges that this design is susceptible to high inlet air velocities, which increase the potential to entrain particulate contaminants.

Bhopte et al. (2005) studied the minimization of rack inlet air temperatures via a multi-variable optimization study. The variables studied were data center floor plenum depth, floor tile placement, and ceiling height. The authors showed a significant effect of all three variables on rack inlet air temperature. Future study is suggested in the areas of computer room air-conditioning (CRAC) unit placement, CRAC flow rates, and floor tile resistances.

Schmidt et al. (2005) have designed a water-cooled (rack) rear door heat exchanger (RDHX) to extract a large portion of the rack heat load from the exhaust air before it is placed back into the data center. The RDHX relies on a cooling distribution unit (CDU) to deliver water above the dew point of a given facility. For a demonstration rack with six IBM BladeCenters (25 kW rack simulated), the RDHX was shown to remove 50%-60% of the heat from the exhaust air while simultaneously lowering the exhaust air temperature 25°C-30°C. Schmidt et al. also demonstrated a favorable total cost of ownership for this solution.

Heydari and Sabounchi (2004) propose refrigeration-assisted hot spot cooling of data centers by placing refrigeration/fan-coil heat exchanger units over the hot spots. The authors combined thermal hydraulic modeling of the refrigeration system with computational fluid dynamic (CFD) analysis of the data center airflow. Their analytical results show a reduction in data center hot spots.

An alternative to managing the data center airflow is provided through the use of so-called "refrigerated" racks. Such racks are totally enclosed and include an air-to-liquid heat exchanger. The air inside the rack is cooled when it passes over the air-to-liquid heat exchanger and is then delivered to the servers.

The authors of this paper propose a technology that significantly reduces facility airflow management challenges. The approach is to spot-cool the microprocessors with dielectric liquid-cooled cold plates. This approach allows approximately 45% of the rack computing heat load to be rejected directly to the facility chilled water (and eventually directly to the cooling towers). By reducing the amount of heat dissipated directly to the facility ambient, there is a dramatic reduction in the volume of airflow required per rack and, in turn, for a full facility populated with liquid-cooled racks. The present paper focuses on an analysis of the impact that the technology has on facility airflow management. Infrared images of the rear door of a liquid-cooled rack and of several air-cooled racks are used in the analysis.

HARDWARE SETUP

This study was conducted using several air-cooled racks and a single liquid-cooled rack in the Molecular Sciences Computing Facility (MSCF) of Pacific Northwest National Laboratory (PNNL). An infrared camera was used to take the infrared images used in the study. The following sections provide further detail on the liquid-cooling hardware and the infrared camera.

Liquid-Cooling Hardware

This study was conducted on racks of servers housed in MSCF's supercomputer. The supercomputer consists of 84 racks of air-cooled HP rx2600 2U servers (dual processor, 1.5 GHz IA64 server). Under a DOE Energy Smart Data Center study, a single rack has been converted from air cooling to liquid cooling. As part of the conversion, the air-cooled fan heatsinks were removed and replaced with ISR spray module kits (SMKs). A single spray module and a converted server are shown in Figure 1.

Each SMK is supplied with conditioned dielectric coolant that is used to keep the processors cool. The heat absorbed from the processors converts the single-phase coolant supplied into a two-phase mixture. All SMKs have a dielectric coolant supply line leading to them from a server manifold and a return line leading away from them. The server manifold, in turn, has a supply line leading to it from a rack supply manifold (see Figure 1) and a return line leading away from it to the rack return manifold. These supply and return lines connect to their respective manifolds via quick disconnects. The return manifold returns the two-phase mixture from all SMKs to a thermal management unit (TMU) sitting under the raised floor, underneath the rack (this unit is also designed to mount in a standard 19 in. rack). The TMU consists of a pump, a reservoir, a controller, power supplies, and a liquid-to-liquid heat exchanger. The liquid-to-liquid heat exchanger is supplied with facility water, which condenses all the vapor and provides a subcooled single-phase liquid. The TMU then supplies the conditioned coolant to the supply manifold.

[FIGURE 1 OMITTED]

The liquid-cooled rack is installed on MSCF's main floor as part of the supercomputing cluster.

Infrared Camera

A FLIR Systems ThermaCAM S45 was used to take the infrared images for this study. The camera is designed for research and development and scientific applications and produces high-resolution (320 x 240 pixel), high-quality images. The camera has a thermal sensitivity of 0.08°C at 30°C (i.e., it can resolve temperature differences as small as 0.08°C) and can record temperatures in the range of -40°C to +1,500°C (up to a maximum of +2,000°C with additional hardware). The camera has a field of view of 24° x 18°, which allowed approximately one-third of a rack door to be imaged from roughly five feet away. The camera has an accuracy of ±2°C or ±2% of the reading.

TEST METHODOLOGY

The testing consisted primarily of thermal performance testing, software benchmarking, facility airflow measurements, and infrared imaging. The following sections provide further details.

Test Conditions

The spray modules in the servers were supplied with PF5050, a Fluorinert dielectric coolant. PF5050 has the following approximate properties at 1 atm and room temperature: boiling point of 30°C, specific heat of 1,048 J/(kg·K), viscosity of 4.69 x 10^-4 kg/(m·s), thermal conductivity of 0.056 W/(m·K), and latent heat of vaporization of 102.9 kJ/kg. The coolant was delivered at an atomizing pressure of approximately 20 psid across the atomizers, and the system pressure was maintained at roughly one atmosphere.
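As a point of reference (not a calculation from the paper), the latent heat listed above sets a lower bound on the coolant flow required per processor. The following minimal sketch in C assumes the roughly 85 W per processor reported later in this paper and assumes complete vaporization of the sprayed liquid; real spray modules vaporize only a fraction of the flow, so actual flow rates would be higher.

/*
 * Minimal sketch (not from the paper): lower-bound estimate of the PF5050
 * flow needed to absorb a processor heat load by latent heat alone, using
 * the fluid properties listed above.
 */
#include <stdio.h>

int main(void)
{
    const double h_fg  = 102.9e3; /* latent heat of vaporization, J/kg       */
    const double q_cpu = 85.0;    /* assumed heat load per processor, W      */

    /* mass flow if every drop evaporated: m_dot = Q / h_fg */
    double m_dot = q_cpu / h_fg;  /* kg/s */
    printf("Lower-bound coolant flow: %.3f g/s per processor\n",
           m_dot * 1000.0);
    return 0;
}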

The facility was designed as an air-cooled facility. Sixteen air-handling units (AHUs) located on the periphery of the data center deliver 13°C-15°C chilled air to all the racks. Chilled air is drawn in through the front of all the servers and is exhausted out the backs. The heated air mixes with the residual air that is not passed through the servers and returns to the AHUs (no special ducting is used).

For the purposes of this study, PNNL provided a chilled water supply line to the rack and also provided a water return line. The TMU deploys a Fluorinert-to-water heat exchanger, which uses the facility water. The facility delivers chilled water at a temperature as low as 7°C at a supply pressure of 50 psid and a calculated flow rate of 6.5 gpm.

Benchmark Routine

In order to get the air-cooled and liquid-cooled racks to their maximum operating temperatures, several benchmark routines were run. While the system cases were closed, both High Performance Linpack (HPL) (Petitet et al. 2004; Dongarra et al. 1979, 2003) and Stream2 (McCalpin 2005) were executed. For the test where the servers' internal temperatures were monitored, the interconnect was disconnected in order to open the server case. This precluded running HPL in the default environment within the time constraints of the test. Instead of HPL, a small C program was constructed to exercise similar portions of the processor.

A separate instance of HPL, with two processes, was executed on each dual-processor server. The test cycled through different sizes of N, including 3000, 4000, and 5000, with a P of 1, a Q of 2, and NBs of 8, 16, and 32. Unfortunately, the programs were compiled on a stand-alone system where the optimizing Intel compiler was not available due to licensing issues, so an older version of the GNU Compiler Collection (GCC) was used (Red Hat release 2.96). This limited performance to only 50% of peak (±5%, or around 3 Gflops/CPU).

The Stream2 benchmark, run with NMIN = 3 and NMAX = 2,000,000, was used to observe the thermal behavior of the memory chips and supporting systems under load.

The test program used to exercise the processors when the system case was open executed two nested loops of floating-point multiply-add instructions, with periodic divides to rescale the data and avoid overflows. The data were blocked such that the computation would efficiently pipeline onto all four of the Itanium floating-point units. The overall measure of work performed was similar to HPL (50% of peak, ±5%) as measured by the pfmon program (HP 2005). Two instances of this program were executed per node in order to exercise both CPUs. For the desired purpose of exercising the CPU, this was deemed to be functionally equivalent to running HPL in terms of creating a temperature increase.
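The source of the actual test program is not reproduced here; the following is only an illustrative sketch in C of the structure described above (nested loops of floating-point multiply-adds with periodic rescaling). The block size, iteration counts, rescale interval, and rescale threshold are assumptions chosen for illustration, not the values used in the study.

/*
 * Illustrative sketch of a CPU-exercise loop in the spirit of the test
 * program described above: nested loops of floating-point multiply-adds
 * with periodic divides to rescale the data and avoid overflow.
 */
#include <stdio.h>

#define BLOCK 256       /* small block chosen to pipeline well (assumed) */
#define OUTER 100000L   /* outer repetitions (illustrative)              */

int main(void)
{
    static double a[BLOCK], b[BLOCK], c[BLOCK];

    for (int i = 0; i < BLOCK; i++) {
        a[i] = 1.0001; b[i] = 1.0002; c[i] = 1.0;
    }

    for (long it = 0; it < OUTER; it++) {
        for (int i = 0; i < BLOCK; i++)
            c[i] = c[i] * a[i] + b[i];        /* multiply-add kernel */

        if (it % 1000 == 999)                 /* periodic rescale    */
            for (int i = 0; i < BLOCK; i++)
                c[i] /= (c[i] > 1.0e6) ? c[i] : 1.0;
    }

    printf("checksum: %f\n", c[0]);           /* defeat dead-code elimination */
    return 0;
}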

Facility Airflow Rate Measurements

A TSI Model 8373 AccuBalance tile hood was used to measure the airflow rate from all of the perforated and high-percentage open grate tiles at PNNL. The tile hood is capable of measuring flow rates in the range of 30 to 2,000 cfm, to within ±5 cfm or ±5% of the reading. The tile hood can be used to measure airflow rates within a temperature range of 0°C to 60°C, with a resolution of 0.1°C and an accuracy of ±0.5°C. The flow rates from all the tiles and grates were measured. Results are presented in the section entitled "Airflow Management in the Molecular Sciences Computing Facility."

Infrared Imaging

The manner in which the server racks were fully exercised has been described in detail in the "Benchmark Routine" section. A given rack was allowed to run HPL for at least 30 minutes before an image was taken. Given the size of the camera's field of view and the width of the data center aisles (approximately four feet), only one-third of a rack's door could be imaged at once. In addition, the images were taken at an angle to the plane of the rack door.

In the initial stages of the imaging, a type-T thermocouple was placed on a rack door. The thermocouple was read with a calibrated Fluke handheld thermocouple reader. The camera was then focused on the thermocouple, and the camera's image probe was placed over the top of the thermocouple. This was done to ensure that the camera settings were correct and that there was minimal deviation between the temperature shown at the probe and that recorded by the thermocouple. In all cases, the camera measured within 2°C of the thermocouple.

Images were taken over the entire surface of several rack doors as well as the fronts of several racks. In the case of the liquid-cooled rack, the door was opened and several images were taken of the supply and return tubing. Images were also taken of the internal areas of an air-cooled server and of a liquid-cooled server. The images of the internal areas were taken while the servers were running HPL. Several images were also taken of the perforated tiles.

RESULTS AND DISCUSSION

The primary focus of this study was to quantify the projected positive impact of liquid-cooling technology on PNNL's facility ambient environmental conditions. One of the means selected to do this was through the use of infrared imaging.

Infrared images were taken of several air-cooled racks of servers and the liquid-cooled rack of servers. In addition, images were taken of the internal areas of an air-cooled and liquid-cooled server as well as of several perforated tiles. The images and associated results are discussed in the following sections.

Comparison of Air-Cooled and Liquid-Cooled Racks

Prior to taking any infrared images, several racks were chosen for the study. One rack, referred to as a "worst-case" or "hottest" rack, was selected from a group of racks located in the hottest spot in the data center. These racks were located 3-4 tiles away from the outlet of an AHU. This placement resulted in an extremely high air velocity under these racks, meaning that the volumetric airflow rate issuing from the dedicated tiles was less than optimal (see the airflow discussion in the "Airflow Management in the Molecular Sciences Computing Facility" section). At least one rack, referred to as a "best-case" or "coolest" rack, was located in the coolest part of the data center. This rack was located between two other racks and was well positioned with respect to the AHUs. The liquid-cooled rack was intentionally placed in a less-than-optimal location, on the end of a row of racks, where it was subject to hot air recirculation around the front of the rack. As discussed in the "Benchmark Routine" section, HPL was run on all servers in order to get them to their maximum operating temperatures.

Figures 2 and 3 show infrared (IR) images of the inlet and outlet of a worst-case rack of air-cooled servers. In Figure 3, the camera's probe indicates a rack outlet temperature of 43.4°C, with a maximum recorded temperature of 44°C. A rough analysis of this image shows a temperature gradient of at least 10°C over the surface area of the rack shown. Comparing Figures 2 and 3, and using the temperatures indicated at the probe locations (approximately two-thirds of the way up the rack), shows that the air temperature rises approximately 20°C across the rack. This temperature rise will differ based upon the location of the probe.

Figures 4 and 5 show IR images of the inlet and outlet of the liquid-cooled rack of servers. In Figure 5, the camera's probe is actually in error (the temperature has been blacked out), but inspection of the temperature scale suggests that the temperature at the location of the probe is approximately 22°C, with a maximum recorded temperature of 24°C. The image does not contain enough information to allow an estimation of the gradient over the full door, but the field measurements showed a highly uniform temperature distribution over this door. Comparing Figures 4 and 5, and using the temperatures indicated at the probe locations, shows that the air temperature rises approximately 7°C across the rack. The temperature rise of 20°C across the air-cooled rack is 187% greater than the 7°C rise for the liquid-cooled rack. It should be noted that the temperature rise across the two racks is highly dependent upon the location of the temperature probing point. Figure 6 shows an image of the rear of a given liquid-cooled server (the rear door has been opened). In particular, the image shows the coolant supply and return lines. The temperatures of approximately 20°C for the supply line and 30°C for the return line are consistent with the temperatures measured by the temperature sensors utilized by the liquid-cooling system.
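As a quick check of the comparison above, using only the rounded probe-point values quoted in the text (the study's 187% figure presumably reflects the unrounded measurements), the relative difference can be computed directly; a minimal sketch in C:

/* Quick check of the quoted comparison: the ~20 C rise across the
 * air-cooled rack versus the ~7 C rise across the liquid-cooled rack.
 * Uses only the rounded probe-point values given in the text. */
#include <stdio.h>

int main(void)
{
    double dt_air    = 20.0;   /* C, worst-case air-cooled rack */
    double dt_liquid =  7.0;   /* C, liquid-cooled rack         */

    double pct_greater = (dt_air - dt_liquid) / dt_liquid * 100.0;
    printf("Air-cooled rise exceeds liquid-cooled rise by %.0f%%\n",
           pct_greater);       /* ~186 percent with the rounded values */
    return 0;
}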

[FIGURE 2 OMITTED]

[FIGURE 3 OMITTED]

[FIGURE 4 OMITTED]

[FIGURE 5 OMITTED]

[FIGURE 6 OMITTED]

[FIGURE 7 OMITTED]

Figure 7 shows an image of the rear doors of two air-cooled racks in a best-case location in the data center. These racks are in the middle of the data center and are supplied with chilled air from CRACs on two opposing walls. The rack on the left-hand side is in a favorable position, as it is sandwiched between two other racks. Estimating from the temperature indicated by the probe placed on the rack on the end, the temperature on the rear door of this rack, at the same height as the probe, is approximately 32°C. It is also safe to assume that the maximum temperature for the area of the door shown is 34°C. Comparison of Figures 5 and 7 shows that the air-cooled rack outlet air, for either of the two racks shown in Figure 7, is at least 10°C hotter than the outlet air for the liquid-cooled rack. Comparison of Figures 3 and 5 shows that the worst-case air-cooled rack outlet air is at least 20°C hotter than the outlet air for the liquid-cooled rack.

Comparison of Air-Cooled and Liquid-Cooled Servers

Figure 8 shows the inside of an air-cooled server, while Figure 9 shows the inside of a liquid-cooled server. For both servers, the server lid was removed and the image taken immediately thereafter. Removing the server lid compromises the airflow over the server components, but taking the image rapidly upon opening the lid provides a relatively good comparison of the internal temperatures of the two servers.

Comparison of the two figures shows the components inside the air-cooled server running significantly hotter than those of the liquid-cooled server. This is particularly evident for the dual inline memory modules (DIMMs). Testing of three different servers running Burn P6 showed the air-cooled DIMMs running 3.3°C to 9.7°C hotter than the DIMMs in the liquid-cooled server (data are not shown in this paper). By removing approximately 170 W for the two processors (an average of 85 W per processor), or roughly 45% of the average total server power dissipation, directly to liquid, the server internal ambient runs significantly cooler. A cooler server internal ambient results in a cooler motherboard, a cooler server chassis, and cooler components. An additional benefit is that the total server power dissipation decreases with a reduced internal server ambient.
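For perspective, the quoted numbers imply an average total server power of roughly 170 W / 0.45, about 380 W; this is a derived figure, not one stated in the paper. A minimal sketch in C of that arithmetic:

/* Quick check of the power split quoted above: ~170 W of processor load
 * described as roughly 45% of the average total server power. The implied
 * total (~378 W per server) is derived, not stated in the paper. */
#include <stdio.h>

int main(void)
{
    double cpu_watts = 170.0;  /* two processors at ~85 W each */
    double fraction  = 0.45;   /* share of total server power  */

    printf("Implied average total server power: %.0f W\n",
           cpu_watts / fraction);
    return 0;
}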

Airflow Management in the Molecular Sciences Computing Facility

A liquid-cooled data center currently does not exist. The objective of this paper is to analyze and discuss airflow management for a single rack of liquid-cooled servers. The results for a single rack of servers are used to perform a rough scale-up to a full-scale liquid-cooled data center in the section entitled "Scale-Up to a Liquid-Cooled Data Center."

[FIGURE 8 OMITTED]

[FIGURE 9 OMITTED]

As discussed previously, a TSI Model 8373 tile hood was used to measure the airflow rate for each of the perforated and grated tiles installed at PNNL. In total, 110 tiles are installed. Table 1 shows the distribution of airflow rate for all the tiles, while Figure 10 shows an infrared image of several perforated tiles. The majority of the tiles provide approximately 725 cfm of airflow, with the grated tiles at approximately 1,400 cfm. The tile directly in front of the best-case air-cooled rack discussed in "Comparison of Air-Cooled and Liquid-Cooled Racks" provides 675 cfm, while the tile in front of the liquid-cooled rack provides 680 cfm. Each rack receives air from an average of 1.5 tiles. In total, the facility provides roughly 76,788 cfm from 16 CRAC units.

Airflow management challenges in a data center arise in a number of different ways. For example, racks located too close to CRACs may experience very low to negative static pressure at their tiles, thereby receiving very limited airflow. Racks located at the end of a row may be subjected to hot air recirculation around the side of the rack from the rack exhaust to the rack inlet. Other racks may recirculate rack exhaust air over the top of the racks if poor facility airflow patterns do not effectively return this air to the CRACs (see Figure 3 in Wang [2004]). Sharma et al. (2004) have proposed a supply heat index (SHI) as a means of gauging rack and facility airflow recirculation and air delivery design. The index is defined as

SHI = (T_rack,in - T_CRAC) / (T_rack,out - T_CRAC),    (1)

where

T_rack,in = rack inlet air temperature,

T_rack,out = rack outlet air temperature, and

T_CRAC = temperature of the air as supplied by the CRACs (tile exit temperature used).

For this index, the higher the value, the greater the airflow recirculation and the poorer the air delivery design. The index values are comparable only for identical racks under identical work loads operating in identical airflow conditions. For their studies, Sharma et al. (2004) do not report values much higher than 0.5.
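As an illustration of Equation 1 (not a reproduction of the study's data), a minimal sketch in C of the SHI calculation follows. The tile exit temperature of roughly 14°C comes from the measurements reported earlier; the rack inlet and outlet temperatures below are hypothetical placeholders, since the per-server temperatures behind Table 2 are not listed in the text.

/*
 * Minimal sketch of the SHI calculation in Equation 1. The rack inlet and
 * outlet temperatures are illustrative placeholders.
 */
#include <stdio.h>

static double shi(double t_rack_in, double t_rack_out, double t_crac)
{
    return (t_rack_in - t_crac) / (t_rack_out - t_crac);
}

int main(void)
{
    double t_crac = 14.0;   /* C, approximate tile exit temperature        */
    double t_in   = 20.0;   /* C, hypothetical rack inlet at one server    */
    double t_out  = 34.0;   /* C, hypothetical rack outlet at that server  */

    printf("SHI = %.3f\n", shi(t_in, t_out, t_crac));
    return 0;
}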

[FIGURE 10 OMITTED]

[FIGURE 11 OMITTED]

The SHI has been calculated for an air-cooled rack located on the end of a row (see the right-hand rack in Figure 7) and for the liquid-cooled rack at PNNL. Figure 11 shows the facility locations of the air-cooled rack (end of row) and the liquid-cooled rack. The results from the calculation of the SHI have been tabulated in Table 2. The index has been calculated for the bottom, middle, and top servers. With the exception of the top liquid-cooled server, the air-cooled servers have SHI values significantly higher than those of the liquid-cooled servers. The high values of SHI for the air-cooled servers support the idea that a significant amount of heated rack exhaust air is being recirculated around the side of the rack and re-entrained into the front of the rack. An additional contributor to this difference is the fact that the liquid-cooled rack needs significantly less airflow than the equivalent air-cooled rack. A lower airflow rate requirement reduces the chance of hot air recirculation.

Figure 12 presents airflow requirements for blade servers and standard IT equipment. This chart can also be used to place the airflow rate requirement for PNNL's enterprise servers in perspective. The most common airflow rate of 725 cfm per tile at PNNL is indicated on the horizontal axis. Using direct measurements of the power dissipated by all the liquid-cooled servers (while running HPL) and the measured air temperature rise over each server (multiple thermocouples at both the inlet and outlet of each server), an energy balance over each server provided the required airflow rate per server. The average server airflow rate for the rack was then calculated; this value is indicated on the horizontal axis of Figure 12. A similar energy balance was conducted for the air-cooled rack. The result of this calculation indicates that the air-cooled rack needs 543 cfm of air, which is 83% more volumetric airflow than the 300 cfm needed by the liquid-cooled rack. This result is based upon a semi-empirical energy balance across each rack combined with actual test data. It supports the idea that, due to the lower volumetric airflow rate requirement, the liquid-cooled rack is much less susceptible to hot exhaust air recirculation, even though it is also located on the end of a row of racks. The lower airflow rate requirement for the liquid-cooled rack also means that it will be much less likely to recirculate hot exhaust air in from above the rack or to be strongly affected by the low tile flow at locations very close to CRACs. The lower airflow rate required by the liquid-cooled rack will also allow PNNL to get back to more reasonable flow rates for perforated tiles, as indicated in Figure 12.
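A minimal sketch in C of the type of per-server energy balance described above, with the required volumetric flow obtained from Q = rho * Vdot * cp * dT. The air density, specific heat, and the example power and temperature-rise values are illustrative assumptions, not the measured data behind Figure 12.

/*
 * Sketch of the per-server energy balance described above. All inputs are
 * illustrative; the actual study used measured server power and measured
 * inlet/outlet air temperatures.
 */
#include <stdio.h>

#define CFM_PER_M3S 2118.88                   /* 1 m^3/s expressed in cfm */

static double required_cfm(double q_watts, double dt_c)
{
    const double rho = 1.15;                  /* kg/m^3, warm data-center air */
    const double cp  = 1005.0;                /* J/(kg*K)                     */
    double v_dot = q_watts / (rho * cp * dt_c);   /* m^3/s */
    return v_dot * CFM_PER_M3S;
}

int main(void)
{
    /* Example: a server rejecting 250 W to air across a 12 C rise. */
    printf("Required airflow: %.0f cfm per server\n",
           required_cfm(250.0, 12.0));
    return 0;
}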

[FIGURE 12 OMITTED]

Figure 12 includes the note "Full board cooling." This refers to a new implementation of liquid cooling that is currently being developed. In this implementation, the full board would be cooled, allowing PNNL to address emerging problem areas such as the memory and communications chips. This approach offers an additional opportunity to further reduce the growing requirement for large volumetric airflow rates in data centers.

Scale-Up to a Liquid-Cooled Data Center

As part of the DOE's Energy Smart Data Center program, the authors have used the results of this study to investigate the feasibility of scaling up to a full-scale liquid-cooled data center. The analysis was conducted for the current 2U servers, for 1U servers, and for dual Opteron blade servers. For each scale-up exercise, PNNL's system architects conducted an analysis to ensure that the correct supercomputer system balance was maintained to allow them to run current production jobs. Results for the 2U and 1U servers are discussed in this paper.

A full inventory of all of PNNL's hardware was taken. The primary hardware consisted of 2U servers ("thin" node racks with only compute servers and "fat" node racks with an additional 2U of storage per 2U server), interconnect switch racks, storage racks, and network equipment. Using the power dissipation numbers provided by the system vendor, the supercomputer's total power dissipation was calculated at 590 kW. PNNL's current facility uses a combination of thin node and fat node server racks for a total of 84 racks. They also employ 24 racks of interconnect switches. The theoretical computational capacity for the facility is 11.232 TeraFlops.

For the 2U server scale-up, it was assumed that all the processors would be cooled with spray modules and that each rack would be cooled by a single thermal management unit (no change to the total number of racks). Using the measured average power dissipation per CPU, it was estimated that the CPU load for the facility's 1,994 CPUs would be approximately 156 kW, or 26% of the supercomputer's total power dissipation, which includes network and storage power. Since the facility requires 76,788 cfm for the full 590 kW, a linear scaling shows that four fewer CRACs are needed if 156 kW is rejected directly to the process chilled water and not the facility air, keeping in mind that the pumping power required to operate a liquid-cooled facility is equal to the power used by 1.4 CRACs. Schmidt et al. (2005) use a similar argument in their scale-up study for IBM's CoolBlue rear door heat exchanger. While the total cost of ownership and COP benefits of scaling up to a liquid-cooled facility were favorable, they were not as attractive as those of a scale-up using 1U servers. No benefits were assumed for the space freed up by the removal of four CRACs or for the lower airflow requirement for the liquid-cooled racks.
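A minimal sketch in C of the linear-scaling argument above, using only the values quoted in the text (590 kW total, 156 kW moved to water, 76,788 cfm from 16 CRAC units, and a pumping overhead equivalent to 1.4 CRACs):

/*
 * Sketch of the linear-scaling argument: if 156 kW of the 590 kW total load
 * is rejected directly to chilled water, the facility airflow (and hence the
 * CRAC count) scales down proportionally, against a pumping overhead charged
 * as the equivalent of 1.4 CRACs. All inputs are the values quoted above.
 */
#include <stdio.h>

int main(void)
{
    const double total_kw     = 590.0;    /* supercomputer power          */
    const double cpu_kw       = 156.0;    /* load moved to water          */
    const double total_cfm    = 76788.0;  /* facility airflow             */
    const int    crac_count   = 16;
    const double pump_penalty = 1.4;      /* CRAC-equivalents of pumping  */

    double cfm_per_crac = total_cfm / crac_count;
    double cfm_saved    = total_cfm * (cpu_kw / total_kw);

    printf("Airflow no longer needed: %.0f cfm, i.e. roughly %.1f CRACs,\n"
           "against a pumping overhead equivalent to %.1f CRACs\n",
           cfm_saved, cfm_saved / cfm_per_crac, pump_penalty);
    return 0;
}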

Before conducting the scale-up to 1U servers, an HP rx1620 1U server was converted to liquid cooling to verify that it could be efficiently cooled. For the conversion, the processors were liquid-cooled similarly to the 2U servers. The liquid-cooled server was cooled with a Fluorinert-to-water thermal management unit. The liquid-cooled server was investigated over a range of fluid temperatures in order to demonstrate that the server could still be effectively cooled when rejecting to water with a temperature as high as 30°C. The reason for rejecting to such warm water was to show the ability to bypass PNNL's chillers and to reject the heat directly to the cooling tower water; PNNL's highest summer water temperatures are unlikely to exceed 30°C. Additional test results for the 1U server and other platforms are provided in Cader and Regimbal (2005). Rejecting to cooling tower water would increase the COP by simply removing the water chiller power load. This affects the process for cooling the processors but does not affect the remaining air-cooled components. Therefore, the air temperature and airflow rate within the data center would need to be maintained.

The scale-up to the 1U servers showed that the current supercomputer balance can be maintained with 69 server racks (a combination of thin node and fat node racks), 16 racks of node switches, and 8 racks of top switches. For the switch to 1U servers, it was assumed that the current Itanium2 processors would be retained, meaning that the computational capacity of 11.232 TFlops would be achieved with 15 fewer racks of servers. By rejecting the 156 kW of CPU power to the cooling tower water, four CRACs can be removed. In addition, the removal of 15 racks means that the facility footprint can be significantly reduced, allowing the removal of additional CRACs or the addition of other computational resources. Scale-up assuming 1U servers resulted in a 22% increase in facility COP relative to the current air-cooled facility and a payback time ranging from 0.5 to 2.8 years. The range in payback depends upon the assumptions made, with 0.5 years taking advantage of the fact that PNNL can increase computational capacity without increasing the facility footprint.

The scale-up exercise has highlighted the benefits of a liquid-cooled data center at the facility level. The scale-up was conducted in a relatively conservative fashion. It is clear from the results that the reduction in required airflow rate for the facility will dramatically reduce the facility airflow management challenges. The reduced airflow rate that must be delivered to a rack will also allow data center operators to deploy significantly higher power (density) racks.

CONCLUSION

Under funding from the DOE's Energy Smart Data Center program, an analysis of the airflow management in PNNL's Molecular Sciences Computing Facility was conducted. As part of the analysis, several high-performance air-cooled racks and a single liquid-cooled rack were investigated. High Performance Linpack was run on the racks of servers while thermal data, airflow rate data, and infrared images were captured. The results of the study were also used to conduct an initial study of the feasibility of scaling up to at least one vision of a full-scale liquid-cooled data center.

The infrared images show that the exhaust air from the liquid-cooled rack is 10°C-20°C cooler than the exhaust air from the air-cooled racks investigated. The measured data also showed that the air temperature rise across the hottest air-cooled rack investigated was 187% greater than that across the liquid-cooled rack. The SHI for the majority of the air-cooled servers analyzed was significantly higher than that for the liquid-cooled servers used in the comparison. The high value of SHI for the air-cooled rack supported the idea that the rack was re-entraining a significant amount of hot exhaust air. This was also supported by the fact that energy balances over the air-cooled rack and the liquid-cooled rack showed that the air-cooled rack needed 83% more airflow. The scale-up study showed a favorable result when using liquid-cooled 1U servers.

The results indicate multiple benefits for the liquid-cooled rack. Key among the benefits are (1) cooler air exhausting from the liquid-cooled rack into the facility ambient; (2) a significantly lower airflow rate requirement for the liquid-cooled rack, which has the effect of reducing the amount of airflow short-circuiting; (3) fewer CRACs; and (4) little to no limitation on the data center placement of liquid-cooled racks. By rejecting the heat directly to the facility's chilled water, or even directly to the cooling tower water, the challenges of facility airflow management are dramatically reduced.

While significant challenges remain, scale-up to a liquid-cooled data center appears to be feasible. There are still challenges with implementing a liquid-cooled facility, including the stigma attached to using water near computing equipment. Historical precedent indicates that this stigma can be overcome; the Cray-2 supercomputer and IBM mainframes, for example, have long histories in data center computing. Key techniques for mitigating the perceived risks associated with water include advanced plumbing and leak detection technology, which, when integrated at the facility level, will mitigate the risk of operating coolant water in the data center.

ACKNOWLEDGMENTS

This research was performed in part using the Molecular Science Computing Facility in the William R. Wiley Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by the US Department of Energy's Office of Biological and Environmental Research and located at the Pacific Northwest National Laboratory, operated for the Department of Energy by Battelle.

The assistance of Andrew Wolf (ISR) and Kevin Fox (PNNL) is acknowledged.

REFERENCES

ASHRAE. 2004. Thermal Guidelines for Data Processing Environments. Atlanta: American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc.

ASHRAE. 2005. Datacom Equipment Power Trends and Cooling Applications. Atlanta: American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc.

Bhopte, S., D. Agonafer, R. Schmidt, and B. Sammakia. 2005. Optimization of data center room layout to minimize rack inlet air temperature. Proceedings of InterPACK05, San Francisco, CA, July 18-22.

Cader, T., and K. Regimbal. 2005. Energy smart data center. InterPack05, San Francisco, CA, July 18-22.

Dongarra, J., J. Bunch, C. Moler, and G.W. Stewart. 1979. Linpack Users Guide. Philadelphia: Siam.

Dongarra, J., P. Luszczek, and A. Petitet. 2003. The Linpack benchmark: Past, present, and future. Concurrency and Computation: Practice and Experience Journal 15:1-18.

HP. 2005. Perfmon project. http://www.hpl.hp.com/research/linux/perfmon/. Hewlett-Packard Development Company.

Heydari, A., and P. Sabounchi. 2004. Refrigeration-assisted spot cooling of a high heat density data center. Proceedings of Itherm 2004, Las Vegas, NV, June 1-4.

McCalpin, J. 2005. Stream: Sustainable Memory Bandwidth in High Performance Computers. Computer Science Department, University of Virginia. http://www.cs.virginia.edu/stream/stream2/.

Petitet, A., R.C. Whaley, J. Dongarra, and A. Cleary. 2004. HPL--A portable implementation of the high-performance Linpack benchmark for distributed-memory computers, version 1.0a. Innovative Computing Laboratory, University of Tennessee Computer Science Department. http://www.netlib.org/benchmark/hpl/.

Rasmussen, N. 2005. Cooling strategies for ultra-high density racks and blade servers. APC White Paper #46. http://www.apcmedia.com/salestools/SADE5-TNRK6_R4_EN.pdf.

Schmidt, R., R.C. Chu, M. Ellsworth, M. Iyengar, and D. Porter. 2005. Maintaining datacom rack inlet air temperatures with water-cooled heat exchanger. Proceedings of InterPACK05, San Francisco, CA, July 18-22.

Sharma, R., C. Bash, C. Patel, and M. Beitelmal. 2004. Experimental investigation of design and performance of data centers. Proceedings of Itherm 2004, Las Vegas, NV, June 1-4.

Wang, D. 2004. A passive solution to a difficult data center environmental problem. Proceedings of Itherm 2004, Las Vegas, NV, June 1-4.

Tahir Cader, PhD

Levi Westra

Kevin Regimbal

Ryan Mooney

Tahir Cader is the technical director and Levi Westra is a mechanical engineer in the High Performance Computing Group, Isothermal Systems Research, Liberty Lake, Washington. Kevin Regimbal is the Information Technology manager and Ryan Mooney is a technical specialist for MSCF Operations/EMSL at Pacific Northwest National Laboratory, Richland, Washington.
Table 1. Distribution of Airflow Per Tile Throughout PNNL's Data Center

Airflow Rate (cfm)    550    600    650    700    750    800    more

Number of Tiles         0     11     20     55     11      5       5

Table 2. Supply Heat Index for an Air-Cooled Rack of Servers and a
Liquid-Cooled Rack of Servers

                   Air-Cooled Servers    Liquid-Cooled Servers

Bottom server            0.342                   0.068
Middle server            0.635                   0.395
Top server               0.062                   0.166