Case study: using simulation techniques to optimize a migration in an existing mission critical data center.
Owner-operators of modern data centers are under constant pressure to reliably provide power, space, and cooling to IT loads that are defined only well after the facility has been designed and built. It is not uncommon for business demands to go against original design goals and common best practices. Business demands change IT configurations in various ways, including the form factor, utilization, device types, and layout of the devices. Even a well-designed data center with adequate controls and automation cannot anticipate and cope with every change it sees as information technologies mature and business demands change.
One example of a common data center change is load migration. Migration, by practice, requires that loads that have been supported in one location be moved to another location, which may not have been designed or tested to support said load. These migrations need to happen without any disruption to IT uptime. In an ideal migration, the new location must be able to support the migrated loads with existing available power and cooling capacities such that minimal troubleshooting or facility modification is required.
In this paper, we use a real migration effort under way at Citigroup to show that migration plans can be studied prior to implementation using environmental modeling. Environmental modeling involves the use of 3D simulation software to study the physical condition of the room in terms of power, space, and cooling. Computational Fluid Dynamics (CFD) is used to predict cooling performance. This prediction data is then populated into a new Performance Indicator, developed by The Green Grid (TGG), with which an operator can ensure adequate capacity utilization for both current and future states without risk to critical IT loads.
Citigroup is undertaking a major renovation project in its existing data center facility in NYC. The project entails a complete overhaul of the mechanical and electrical infrastructure, as well as a complete refresh of the legacy whitespace. The existing whitespace currently houses live IT equipment that supports Citi's thousands of applications. The main challenge for the Citi team is accommodating a complete refresh of the whitespace "in place", with live equipment in operation. In order to accomplish this, small sections of existing whitespace must be vacated, renovated to modern standards, and then populated with migrated IT equipment from other sections of vacating whitespace, so that the renovation-juggling act can continue. This project examines one such migration from the existing data center to a newly refreshed space.
Critical Load to be Migrated
To accurately simulate this migration scenario, the loads in the original location first need to be modeled in full detail. Figure 1 shows a sampling of rows planned to be migrated. The IT assets and cabinets are defined in their exact pre-migration location, U-slot, and configuration using an import of the IT asset inventory maintained by the operational team.
Each IT device type is modeled to represent the unique airflow and thermal aspects of that device. Power draw data, recorded at the cabinet level, is used to proportionally scale the IT equipment power based on each vendor's nameplate power.
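The proportional scaling described above can be sketched as follows. This is a minimal illustration and not the modeling tool's actual procedure: the function name `scale_to_cabinet_power`, the device names, and the wattages are hypothetical, and the sketch assumes one measured power value per cabinet.

```python
def scale_to_cabinet_power(nameplate_w, measured_cabinet_w):
    """Split a measured cabinet power draw across its devices.

    Each device receives a share of the measured cabinet power in
    proportion to its vendor nameplate power (hypothetical helper).
    """
    total_nameplate = sum(nameplate_w.values())
    if total_nameplate <= 0:
        raise ValueError("cabinet has no nameplate power to scale")
    factor = measured_cabinet_w / total_nameplate
    return {device: watts * factor for device, watts in nameplate_w.items()}


# Illustrative cabinet: nameplate values are assumptions, not data from the study.
cabinet_nameplate = {"server-a": 750.0, "server-b": 750.0, "switch-1": 500.0}

# Nameplate total is 2000 W and the measured draw is 1000 W,
# so every device is scaled by a factor of 0.5.
scaled = scale_to_cabinet_power(cabinet_nameplate, measured_cabinet_w=1000.0)
for device, watts in scaled.items():
    print(f"{device}: {watts:.1f} W")
```

The scaled device powers always sum to the measured cabinet power, which keeps the model's total load consistent with the recorded data.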
For this migration, the initial intent was to attempt a "Lift and Shift" migration, wherein the cabinets from their original location are quite literally lifted from their old location after disconnecting power and network connectivity, and shifted to their new location. Further, the intent was to place cabinets in the same position relative to the center of aisle Network frame. Cost was the major consideration in this decision. Firstly, new cabinets would not need to be purchased; secondly, premeasured and terminated network cabling was able to be reused, a substantial saving on copper cabling; and lastly the manpower and time required to restack and reconnect existing IT equipment in new cabinets would be saved. The original cabinets are 42U high.
As such, the final destination of the migrated cabinets was determined based on factors such as physical space for an entire row of cabinets and their associated networking frame. Power was readily available, and cooling was assumed to be available. Figure 2 schematically shows the "Lift and Shift" migration plan.
Designed Room for Migrated Loads
The designed room for the migrated loads is a raised floor data center that utilizes an unconstrained hot aisle/cold aisle layout and 48U high cabinets. Table 1 summarizes the room data. Figure 3 shows the design model, complete with technology and facility infrastructure.
Modeling and Analysis Plan
During the design process for this room, an environmental simulation had previously been performed to ensure that the cooling infrastructure would be able to support the total planned IT load. In that study, due to lack of information about the technology profile, the load was assumed to be server loads of homogeneous density, form factor, and exhaust fan characteristics. This model will be used to establish the expected performance for the room.
Comparing the design model against the migration plan, the two differ significantly in IT configuration in terms of U-slot, vent inlet and outlet locations for servers, cabinet ventilation characteristics, power consumed, and power distribution. As such, the results of the original model are no longer valid, and a new calculation will be run for the room with only the "Lift and Shift" cabinets in place.
Additional models will be run to ensure that after this initial migration, the facility will be able to support future installations up to the remaining design capacity of 1079 kW. This will be studied with two "Fill-to-Capacity" models: One model filling the remaining floor space with design cabinets at the original cabinet design load (not reaching the full room design load), and another model filling remaining available U-slots for both the design cabinets and the migration cabinets with representative loads to maximize the facility capacity.
To determine the success of the migration, the CFD analysis is used to evaluate the effectiveness of cooling delivery to the IT equipment. Eq. 1, taken from The Green Grid's Performance Indicator (WI14-011), is a performance score that describes cooling effectiveness for the present state of the data center. It is the fraction of IT load, weighted by device power consumption, whose inlet temperature is within the ASHRAE T.C. 9.9 recommended temperature range.
$$\text{IT Thermal Conformance} = \frac{\text{Eq. Load}\,(T_{\text{max inlet}} < 80.6\,^{\circ}\text{F under normal operating conditions})}{\text{Total Eq. Load}} \qquad \text{(eq. 1)}$$
This Citigroup data center is considered mission critical and as such also requires resilient operations, meaning the critical equipment should function at suitable temperatures during potential cooling failure and maintenance scenarios. The following equation, as stated in The Green Grid WI14-011, is used to determine the IT Thermal Resilience with regard to the ASHRAE T.C. 9.9 allowable temperature range.
$$\text{IT Thermal Resilience} = \frac{\text{Eq. Load}\,(T_{\text{max inlet}} < 89.6\,^{\circ}\text{F under worst-case failure conditions})}{\text{Total Eq. Load}} \qquad \text{(eq. 2)}$$
IT Thermal Conformance and IT Thermal Resilience describe the cooling performance of the facility in its current loading profile, and Capacity Lost describes the potential loss of capacity utilization projecting from current loading profile to full facility design load.
Although eq. 1 and eq. 2 describe the cooling performance after any change in the data center, including installations and migrations, they do not describe how well the data center can be filled. The cooling performance of a projected filled state also needs to be evaluated, in case an installation results in a decrease in future performance. Eq. 3 is a modified version of eq. 1 and describes the performance of a filled future state.
$$\text{IT Thermal Conformance Filled} = \frac{\text{Eq. Load}\,(T_{\text{max inlet}} < 80.6\,^{\circ}\text{F under normal operating conditions})}{\text{Total Eq. Load Filled}} \qquad \text{(eq. 3)}$$
The "Filled" in this equation represents all IT equipment that would be installed to reach the full loading capacity of the data center. In the CFD environment, we simulate that growth by filling the CFD model to the full intent of the facility, using "test loads" that make sense based on typical loading. Because any score under 100% for Compliance represents a loss in capacity and a loss in capital expenditure, any such score is considered an unsuccessful installation.
Planning for migrations also incurs additional losses not related to cooling performance. In eq. 3, the Total Eq. Load Filled is based on how much the data center can be filled in any given future end-of-life scenario, taking into account design infrastructure capacity. Migrations can also reduce the Total Eq. Load Filled if space restrictions prevent the fill from reaching infrastructure capacity. Eq. 3 describes the total loss of data center capacity due to IT Thermal Conformance at a Filled state, and eq. 4 describes the total losses in capacity in the data center, including space and planning issues that may result in a loss of capacity.
$$\text{Capacity Lost Ratio} = \frac{\text{Design Capacity} - \text{IT Thermal Conformance Filled} \times \text{Total Eq. Load Filled}}{\text{Design Capacity}} \qquad \text{(eq. 4)}$$
Together, eqs. 1-4 quantify the performance of the white space from the perspective of providing adequate cooling to the equipment and retaining maximum useful capacity for the future of the data center. For a total view of performance, The Green Grid's Performance Indicator also considers energy efficiency; however, it was not considered as part of this study.
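The four scores can be computed from per-device simulation output along the following lines. This is a sketch under stated assumptions, not the Performance Indicator tooling itself: the `Device` record, the function names, and the sample values are hypothetical. The thresholds are the ASHRAE recommended limit of 80.6 °F (27 °C) and the allowable limit of 89.6 °F (32 °C), and eq. 4 is read as (Design Capacity - Conformance Filled x Total Eq. Load Filled) / Design Capacity.

```python
from dataclasses import dataclass

RECOMMENDED_MAX_F = 80.6  # recommended inlet limit used in eq. 1 and eq. 3
ALLOWABLE_MAX_F = 89.6    # allowable inlet limit used in eq. 2


@dataclass
class Device:
    """One IT device in a simulated state (hypothetical record)."""
    power_kw: float
    max_inlet_f: float  # maximum inlet temperature predicted by the model


def load_fraction_below(devices, limit_f):
    """Power-weighted fraction of IT load with max inlet below limit_f."""
    total_kw = sum(d.power_kw for d in devices)
    if total_kw <= 0:
        raise ValueError("no IT load in this state")
    ok_kw = sum(d.power_kw for d in devices if d.max_inlet_f < limit_f)
    return ok_kw / total_kw


def capacity_lost_ratio(design_kw, conformance_filled, filled_load_kw):
    """Eq. 4 as read above: share of design capacity that cannot be used."""
    return (design_kw - conformance_filled * filled_load_kw) / design_kw


# Illustrative values only; none of these numbers come from the study.
normal_state = [Device(4.0, 75.0), Device(4.0, 78.0), Device(2.0, 82.0)]
failure_state = [Device(4.0, 84.0), Device(4.0, 88.0), Device(2.0, 91.0)]

conformance = load_fraction_below(normal_state, RECOMMENDED_MAX_F)  # eq. 1
resilience = load_fraction_below(failure_state, ALLOWABLE_MAX_F)    # eq. 2

# For eq. 3 the same function is applied to the filled-state model;
# here the normal state simply stands in for a filled state.
conformance_filled = load_fraction_below(normal_state, RECOMMENDED_MAX_F)
lost = capacity_lost_ratio(design_kw=20.0,
                           conformance_filled=conformance_filled,
                           filled_load_kw=10.0)                     # eq. 4

print(f"conformance={conformance:.2f} resilience={resilience:.2f} lost={lost:.2f}")
```

Weighting by power rather than by device count means that one hot high-density device lowers the score more than one hot low-power device, which matches the intent of measuring usable capacity.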
For the design room, Figure 4 shows environmental modeling results for normal and failure operating conditions. Table 2 summarizes the performance results of the design before any considerations for the migration plan are made:
IT Thermal Conformance and IT Thermal Resilience are scored out of a maximum of 1, with 1.0 being the optimal score. Capacity Lost Ratio is optimal at a score of 0. The results in Table 2 show that the room obtains optimal performance using the above-mentioned loading profile assumptions. Operational changes such as implementing a migration plan can degrade the score due to changes in said loading profiles. The following sections analyze operational plans by evaluating the performance in a simulated environment and assessing alternative scenarios prior to physical implementation of the plans.
LIFT AND SHIFT
Figure 2 shows an example of how the cabinets are being migrated in a "Lift and Shift" fashion from an old data center to the new data center. In this migration, the migrated cabinets maintain their distance from the network cabinet and the servers are untouched. The power consumed in each cabinet in the new location is the same as the power consumed in the old location. Figure 5 shows the power consumed per cabinet in this scenario for the data center after the migration.
After the "Lift and Shift" migration plan is implemented in the model, the data center is loaded to a total of 371 kW of IT load. Although capacity utilization is low, there may still be losses in data center performance. Using a combination of thermal modeling results and eq. 1, 2, 3, and 4, we can quantify the change in performance of the facility due to the "Lift and Shift" migration plan. Figure 6 colors the individual IT devices based on ASHRAE T.C. 9.9 thermal compliance under both normal operating conditions and a failure condition.
Figure 6 shows that a majority of IT equipment conforms with the ASHRAE T.C. 9.9 recommended thermal guidelines under normal operation. Locations in Figure 6 that report a loss of conformance are easily addressable cooling issues typically caused by networking switches mounted in rack and tack cabinets. Eq. 1 quantifies the total loss of cooling conformance in this configuration.
In the resilience evaluation, the redundant cooling units are turned off in the virtual environment so that cooling capacity matches the IT demand, reaching an N state for cooling capacity. Figure 6 highlights the cooling units that have been turned off to create that scenario. There are a number of locations where cooling is not properly delivered, indicating a loss of resilient cooling supplied to specific locations in this configuration. Eq. 2 quantifies that loss of cooling resilience.
FILL TO CAPACITY: CABINET FILL
To evaluate potential future capacity lost, the models need to consider the full intent of the room. The full intent of the room is constrained by a number of factors, including total room load and total power per cabinet. The first "Fill to Capacity" scenario is created by filling the empty floor locations with homogeneous 4.85 kW cabinets (the original design load) filled top to bottom with server gear. Note that this fill methodology fills the floor space, but only to a load of 834 kW, not to the room power limit of 1079 kW. Figure 7 shows the results of the filled room plotted as compliance with the ASHRAE T.C. 9.9 thermal guidelines.
Eq. 3 and 4 are used with the results of this model to quantify the loss of capacity. The plot in Figure 7 looks quite healthy; however, the capacity lost ratio is 0.24. In this case, the capacity lost ratio is heavily weighted toward the power constraint, which is expected because the existing cabinets are not filled. Of the 0.24, only 0.02 is due to cooling performance (noted by all the green/compliant servers) and 0.22 is power related, because the full facility capacity was not utilized.
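The split between power-related and cooling-related loss follows from eq. 4 if it is read as (Design Capacity - Conformance Filled x Total Eq. Load Filled) / Design Capacity. The sketch below illustrates that decomposition. The function name is hypothetical, and the filled-state conformance of 0.98 is an assumed value chosen for illustration, since the text does not report it; the 1079 kW and 834 kW figures are from the text. Because the reported split is rounded, the sketch is not expected to match it digit for digit.

```python
def split_capacity_loss(design_kw, filled_load_kw, conformance_filled):
    """Split eq. 4 into a power/space part and a cooling part.

    power part   = design capacity that the fill never reaches
    cooling part = filled load that is not thermally conformant
    The two parts add up to the capacity lost ratio of eq. 4.
    """
    power_part = (design_kw - filled_load_kw) / design_kw
    cooling_part = (1.0 - conformance_filled) * filled_load_kw / design_kw
    return power_part, cooling_part


DESIGN_KW = 1079.0          # room design capacity (from the text)
CABINET_FILL_KW = 834.0     # load reached by the cabinet fill (from the text)
ASSUMED_CONFORMANCE = 0.98  # assumption for illustration, not a reported result

power_part, cooling_part = split_capacity_loss(
    DESIGN_KW, CABINET_FILL_KW, ASSUMED_CONFORMANCE
)
total_lost = power_part + cooling_part

print(f"power-related loss:   {power_part:.3f}")
print(f"cooling-related loss: {cooling_part:.3f}")
print(f"capacity lost ratio:  {total_lost:.3f}")
```

Under this reading, the power-related part depends only on how far the fill falls short of design capacity, so it can be checked before any CFD run is made.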
FILL TO CAPACITY: U-SLOT FILL
Because the previous capacity lost evaluation is heavily weighted toward losses due to unused power, an additional Fill to Capacity model is considered to study the case of making full use of the facility capacity. This is done using a two-stage fill process that uses the remaining space in the data center as realistically as possible. The first stage is to take the "Lift and Shift" cabinets and fill them with representative servers. For this case, servers of 100 W at 2U height, a realistic assumption, are used to fill the remaining spaces (Figure 8) in the "Lift and Shift" cabinets.
With the remaining facility capacity, the second stage is to fill the remaining cabinet locations with cabinets dense enough to reach a filled room load of 1079 kW. The expectation is that this should completely eliminate any capacity loss due to unused power. The previous cabinet fill used cabinets of 4.85 kW, which came from the original design. To reach the full capacity of the room, the new cabinets used for this fill scenario are at 7.4 kW. Figure 9 compares the two fill power profiles against each other. The cooling performance of the final Fill to Capacity model is shown in Figure 10.
Using equations 3 and 4 again, a new capacity lost ratio of 0.12 is calculated. While this method eliminates lost capacity due to stranded power, the entire 0.12 is now capacity stranded by cooling, a significant increase over the 0.02 cooling-related loss of the cabinet fill. This highlights a major issue with the proposed Lift and Shift plan, particularly for future capacity.
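To put the 0.12 ratio in physical terms, eq. 4 can be inverted. The sketch below assumes eq. 4 is read as (Design Capacity - Conformance Filled x Total Eq. Load Filled) / Design Capacity and that the U-slot fill reaches the full 1079 kW; the function names are hypothetical, and the outputs are implied by the rounded ratio rather than reported by the study.

```python
def implied_conformance_filled(design_kw, filled_load_kw, lost_ratio):
    """Solve eq. 4 for the filled-state conformance given a lost ratio."""
    if filled_load_kw <= 0:
        raise ValueError("filled load must be positive")
    return design_kw * (1.0 - lost_ratio) / filled_load_kw


def stranded_capacity_kw(design_kw, lost_ratio):
    """Design capacity, in kW, that cannot be used at the given lost ratio."""
    return design_kw * lost_ratio


DESIGN_KW = 1079.0   # room design capacity (from the text)
FILLED_KW = 1079.0   # assumption: the U-slot fill reaches full design load
LOST_RATIO = 0.12    # capacity lost ratio reported for the U-slot fill

conformance = implied_conformance_filled(DESIGN_KW, FILLED_KW, LOST_RATIO)
stranded_kw = stranded_capacity_kw(DESIGN_KW, LOST_RATIO)

print(f"implied filled-state conformance: {conformance:.2f}")
print(f"stranded capacity: {stranded_kw:.1f} kW")
```

When the fill reaches the full design load, the implied conformance is simply one minus the lost ratio, so roughly 12% of the filled load would sit above the recommended inlet temperature.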
A study of the air paths to the above non-compliant servers shows hot air recirculation from the hot aisle to the cold aisles. These breakdowns are a symptom of the power-cooling-airflow imbalances that occur with the constantly changing IT that enterprise data centers have to cope with on an ongoing basis.
Initial considerations for migrating load from one data center to another may revolve around potential savings in labor and material from reusing existing infrastructure. However, a closer look at a Lift and Shift migration shows that while the migration will certainly work on Day 1, considerable losses become apparent in the future as the data center becomes more fully utilized.
The authors recognize that the hot spots caused by this imbalance can likely be solved by purchasing additional cooling infrastructure or containment solutions. However, the enhanced visibility now available through simulation offers a way to fully utilize the upfront investment in data center capacity. Without the foresight provided by advanced modeling tools and new analysis methodologies like the Performance Indicator, potentially dangerous migrations will go unchecked, permanently strand capacity, and degrade returns on upfront capital investments.
FUTURE WORK AND OPTIMIZATION
The "Lift and Shift" migration plan results in a reduction of performance in the data center with both fill methods. In order to increase the performance of the data center, a Cabinet Restack migration plan is under consideration, wherein the existing IT equipment is restacked into higher-U cabinets. While the IT equipment's form factor, air discharge, utilization, and device type still differ from the design case, it would be possible to achieve cabinet densities closer to that of the design case.
The authors thank Mark Seymour from Future Facilities and The Green Grid for use of the Performance Indicator (PI) metric.
ASHRAE. 2011. ASHRAE T.C. 9.9, Thermal Guidelines for Data Processing Environments. American Society of Heating, Refrigerating and Air-Conditioning Engineers.
TGG. 2016. WI14-011, The Performance Indicator: Assessing & Visualizing Data Center Cooling Performance. The Green Grid.
Christian Pastrana, PE, is VP of Data Center Planning and Critical Systems at Citigroup, NY, NY.
Tom Wu is an Applications Engineer at Future Facilities Inc, NY, NY.
Caption: Figure 1--Cabinets in their Pre-Migration Configuration
Caption: Figure 2--Lift and Shift Migration Plan
Caption: Figure 3--Designed Data Center with Homogeneous Loads
Caption: Figure 4--Data Center Under Normal and Failure Conditions
Caption: Figure 5--Power Density and Loading Location after Lift and Shift Migration
Caption: Figure 6--ASHRAE Temperature Compliance in Normal Operating and in Failure Condition
Caption: Figure 7--ASHRAE Temperature Compliance in a Filled to Capacity State with All Cooling On
Caption: Figure 8--Example of a Cabinet Fill to Increase Space and Power Utilization in Filled to Capacity State
Caption: Figure 9--Power Comparison of the two Fill Plans
Caption: Figure 10--ASHRAE Thermal Compliance for 2-Stage Fill Plan
Table 1. Design Intent of Room

Property                    Value
Room Capacity               1079 kW
Total Size                  10,775 sq. ft.
Cooling Delivery Type       Raised Floor
Cooling Redundancy          N+2
Design Density              100 W/sq. ft.
Design Power per Cabinet    4.85 kW/cabinet

Table 2. Performance Results of Design

Performance Criterion       Score
IT Thermal Conformance      1.00
IT Thermal Resilience       1.00
Capacity Lost Ratio         0.00

Table 3. Performance Results of Lift and Shift

Performance Criterion       Score
IT Thermal Conformance      0.99
IT Thermal Resilience       1.00
Author: Pastrana, Christian; Wu, Tom
Date: Jan 1, 2017