1 Introduction
The scale of the
data center (
DC) industry has been rapidly growing in response to the ever-increasing cloud computing and storage demands [
3]. Such growth brings substantial challenges for DC operations, particularly in avoiding operational risks and reducing energy costs of the mission-critical infrastructure. Currently, DCs are mostly operated in a reactive way by the
data center infrastructure management (
DCIM) system with feedback controllers [
2,
42]. The DCIM provides operators with deployed sensor measurements so that they can respond properly to abnormalities and failures. However, traditional DCIM does not provide the accurate prediction capabilities desired for proactive DC management. Such capabilities would enable operators to perform various what-if analyses, e.g., whether raising certain temperature setpoints can improve energy efficiency without causing server overheating.
We consider
predictive digital twins for the desired capability extension [
33]. Computational fluid dynamics (
CFD) modeling is a primary technique to characterize the thermodynamics in data halls [
27]. A CFD model can estimate the air velocity and temperature distributions in a given space by solving the
Navier–Stokes (
NS) and energy balance equations [
4]. It has been adopted in the offline optimization for DC energy cost reduction and thermal risk management [
28]. However, the accuracy and simulation speed of CFD models in general do not meet the online analysis requirements for two reasons [
30]. First, the assumptions or simplifications made in the offline phase may distort the online results. Second, solving the governing NS equations may require lengthy computing time, from hours to days.
The accuracy of a CFD model is mainly determined by the accuracy and completeness of the given boundary conditions. A model with incomplete boundary conditions may diverge from the ground truth. For example, as reported in [
34,
41], an uncalibrated CFD model can yield temperature prediction errors of up to 5°C. Such low accuracy impedes using the CFD model for operational adjustments that pursue energy efficiency without incurring thermal risk. Unfortunately, obtaining complete and accurate boundary conditions is often challenging due to (1) the large number of parameters in the boundary spaces and (2) the labor-intensive and error-prone manual calibration process for these parameters. For instance, each server in a data hall may have its own passing-through air flow rate characteristics due to its internal fan control logic. Such information is often unavailable in the server hardware's specification and can only be estimated empirically or collected manually via
in situ measurement. As a result, rough settings of the server air flow rates can significantly degrade the CFD model's prediction accuracy. Existing heuristic approaches [
10,
29] (evolution strategies, genetic algorithms, simulated annealing, etc.) can be applied to calibrate these boundary conditions. However, these approaches in general require many search iterations, e.g., hundreds as shown in Section
5, to find accurate settings for the system configuration parameters. In each iteration, the CFD model is solved with the candidate parameters. When the CFD model is built for a large-scale data hall with millions of mesh cells, the iterative search may incur unacceptable computation time. As such, the existing search-based approaches scale poorly with the granularity of the CFD model.
To advance automatic calibration, we propose Kalibre, a neural surrogate-assisted approach to calibrate data hall CFD models of increasing scale and complexity. With the help of a trainable neural net, Kalibre avoids directly solving the CFD model for parameter search by iterating four key steps. First, the “coarse” surrogate is trained on CFD-generated data to align with the “fine” CFD model in the locality of the current system state. Second, the trained surrogate is re-optimized by updating the system configuration, which is also part of the neural net's trainable variables, to maximize the consistency between the surrogate's predictions and the ground-truth sensor measurements. Third, the updated system configuration is fed back to the CFD model for refinement. Finally, the ground-truth sensor measurements are used to validate the refined CFD model. Kalibre thus offloads the fine-grained configuration search to the surrogate. Vis-à-vis the existing heuristic approaches that solve the CFD model in every configuration search step, Kalibre solves the CFD model much less frequently, merely to provide feedback to the surrogate.
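The four-step loop above can be sketched as follows. A toy linear "CFD" model stands in for the expensive solver, and a linear least-squares fit stands in for the trainable neural surrogate; all names, sizes, and numbers are illustrative, not the article's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the expensive CFD model (illustrative only): it maps
# 3 boundary-condition parameters to temperatures at 5 sensor locations.
A = rng.normal(size=(5, 3))
TRUE_THETA = np.array([1.2, 0.4, 2.0])    # boundary conditions to recover

def cfd_solve(theta):
    return A @ theta                       # pretend each call takes hours

sensor_truth = cfd_solve(TRUE_THETA)       # ground-truth sensor measurements

theta = np.ones(3)                         # rough initial configuration
X, Y = [], []                              # CFD-generated training set
for _ in range(10):
    y_cfd = cfd_solve(theta)               # Steps 3-4: refine and validate
    if np.mean(np.abs(y_cfd - sensor_truth)) < 1e-3:
        break
    # Step 1: sample near the current state and "train" the surrogate W
    # (a linear least-squares fit stands in for the neural net).
    X += [theta.copy(), theta + rng.normal(scale=0.1, size=3)]
    Y += [y_cfd, cfd_solve(X[-1])]
    W, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(Y), rcond=None)
    # Step 2: freeze the surrogate and re-optimize the configuration so
    # its predictions match the sensors (closed form here for brevity;
    # Kalibre updates the configuration by gradient-based training).
    theta, *_ = np.linalg.lstsq(W.T, sensor_truth, rcond=None)
```

In this toy setting the loop converges after a handful of CFD solves, illustrating why offloading the search to the surrogate is cheaper than solving the CFD model at every candidate configuration.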
The implementation of Kalibre faces two challenges. First, the training data for the neural surrogate is limited, since generating such data with the CFD model is compute-intensive. Second, designing a surrogate that captures the high-dimensional feature space of a data hall is challenging. Piecemeal solutions that address these two challenges separately tend to conflict, i.e., a deeper neural surrogate that better captures the complex feature space may require a large amount of CFD-generated training data. Without proper consideration, the computational cost of training an accurate surrogate might exceed that of directly calibrating the CFD model. To address the challenges, we incorporate prior knowledge and sensor measurements to adaptively generate training data in each of Kalibre's iterations. With this adaptive design, the surrogate update is guided toward searching for better configurations in the locality rather than toward ensuring global optimality. Compared with the random training data sampling adopted by the vanilla approach, the adaptive sampling enabled by the prior knowledge improves the efficiency of using data generated by the CFD simulations. To approximate the temperatures at locations with sensors, we design a knowledge-based neural surrogate that captures the spatial thermal relation that a sensor measurement is mostly affected by the settings of the nearby facilities. We implement Kalibre and apply it to calibrate the CFD models of two production data halls, each sized at hundreds of square meters and hosting thousands of servers. The calibrated CFD models achieve
mean absolute errors (
MAEs) of 0.57°C and 0.88°C in predicting the temperatures at tens of cold/hot aisle positions in each hall, respectively. In contrast, the heuristic configuration search and the vanilla neural net-based surrogate approach achieve MAEs of around 1.46
\(\sim\)2.2°C with the same calibration computation time as Kalibre. We also invite a domain expert to manually fine-calibrate the two CFD models, yielding MAEs of 0.98°C and 1.16°C, respectively. As previous research [
20] has shown that increasing the air temperature is a common practice to reduce cooling energy, the high prediction accuracy achieved by Kalibre is beneficial for data center energy optimization while ensuring thermal safety constraints. For example, according to the ASHRAE standard [
5], the server inlet temperature is not allowed to exceed 27°C to prevent overheating. Therefore, an accurate predictive model can be used to explore a less conservative policy that achieves more energy saving.
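The locality prior behind the knowledge-based surrogate can be illustrated with a small sketch. Here a binary mask, derived from hypothetical server and sensor positions and a made-up 2 m radius (none of which come from the article), restricts each sensor's prediction to nearby facilities:

```python
import numpy as np

# Hypothetical layout: 4 servers and 2 sensors on a 2-D floor plan;
# positions and the 2 m radius are made up for illustration.
server_xy = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
sensor_xy = np.array([[0.5, 1.0], [5.5, 1.0]])

# Connectivity mask: mask[i, j] = 1 iff server j is within 2 m of
# sensor i, encoding the prior that a sensor reading is mostly
# affected by the settings of nearby facilities.
dist = np.linalg.norm(sensor_xy[:, None, :] - server_xy[None, :, :], axis=-1)
mask = (dist < 2.0).astype(float)

# In a neural surrogate, the mask would zero the weights between distant
# server inputs and a sensor's output head; masking a dense weight
# matrix W is a minimal stand-in for that structure.
W = np.ones_like(mask)
flow = np.array([0.8, 1.1, 0.9, 1.2])   # per-server air flow settings
pred = (mask * W) @ flow                 # each sensor sees only its neighbors
```

Because distant servers are masked out, the surrogate has fewer effective parameters to fit, which is what reduces its demand for CFD-generated training data.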
Although the calibrated CFD models achieve high-fidelity temperature prediction, their high computation overhead still challenges online usage. During online usage, the prediction should run on low-end computing devices with short response times, such that potential thermal alarms can be prevented ahead of time. A possible workaround is to adopt Kalibre's neural surrogate for real-time temperature prediction. However, the neural surrogate does not provide a full-fledged temperature field approximation; for example, it is incapable of predicting the temperatures at locations without sensors. To address the high computation overhead, we extend Kalibre to Kalibreduce by integrating a model reduction technique developed based on the
proper orthogonal decomposition (
POD) [
11]. The POD method describes a full field profile with a linear combination of a set of spatial basis functions, i.e., the POD modes, and their corresponding coefficients. While previous studies have investigated the POD for low-order data hall modeling [
25,
32], they assume that the boundary conditions of the CFD are well calibrated. Thus, the POD prediction results are only compared with the original CFD predictions instead of sensor-measured data. Based on our calibrated CFD models, we further evaluate the reduced POD models' performance against sensor data. The reduced POD models achieve comparable MAEs of 0.84°C and 0.98°C, respectively, while taking only 0.53 and 0.76 seconds to reconstruct the temperature field.
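As a minimal illustration of the POD idea (not the article's implementation), the modes can be obtained from the singular value decomposition of a snapshot matrix, and a field is then reconstructed as a short linear combination of the leading modes. All sizes and the hidden rank-3 structure below are illustrative:

```python
import numpy as np

# Snapshot matrix: columns are temperature fields from CFD runs. A hidden
# rank-3 structure mimics the low-dimensional thermal behavior.
rng = np.random.default_rng(1)
basis = rng.normal(size=(1000, 3))              # 1000 mesh cells
snapshots = basis @ rng.normal(size=(3, 8))     # 8 CFD snapshots

# POD modes are the left singular vectors of the snapshot matrix.
U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
modes = U[:, :3]                                # keep the dominant modes

# A new field lying in the span of the snapshots is reconstructed by
# projecting onto the modes; the projections are the POD coefficients.
field = basis @ np.array([0.5, -1.0, 2.0])      # a "new" CFD solution
coeffs = modes.T @ field
approx = modes @ coeffs
```

In practice the snapshots are usually mean-centered and the number of retained modes is chosen by the energy captured by the singular values; the toy example keeps exactly the underlying rank, so the reconstruction is essentially exact.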
In summary, this article develops a systematic framework to evolve the data hall CFD models into high-fidelity and real-time digital twins. We incorporate prior knowledge to address the CFD accuracy and speed problems through surrogate-assisted model calibration and POD-based model reduction, respectively. The contributions of this article are summarized as follows:
–
We formulate the model calibration and reduction problems and propose a systematic solution to both.
–
We develop a surrogate-assisted approach that incorporates prior knowledge to solve the model calibration problem. The calibration requires less human effort than the manual baseline and fewer CFD simulations than search-based algorithms.
–
We further reduce the order of the calibrated CFD models using the POD method and the energy balance principle to accelerate simulation.
–
We conduct extensive evaluations on two industry-grade data halls hosting thousands of servers. The calibrated CFD models achieve MAEs of 0.57°C and 0.88°C, respectively. The reduced-order models achieve comparable performance with MAEs of 0.84°C and 0.98°C, respectively, while taking only 0.53 and 0.76 seconds to reconstruct the temperature field.
Article organization: The rest of this article is structured as follows. Section
2 reviews the related work. Section
3 formulates the calibration and reduction problems. Section
4 presents our proposed approach. Section
5 evaluates the calibration and reduction performances. Section
6 discusses several issues. Section
7 concludes this article.
2 Related Work
This section reviews the relevant studies in DC thermal modeling, CFD model calibration, CFD model reduction, and knowledge-based methods. Table
1 categorizes the existing thermal models, model calibration and model reduction approaches, and their temperature prediction errors. In what follows, we discuss these existing studies in detail.
\(\blacksquare\) DC thermal modeling. A variety of modeling techniques have been proposed for thermal management in data halls. They can be broadly categorized into white-box [
7,
9,
23,
37,
40], black-box [
19,
43], and grey-box [
17] methods. The CFD models are representative white-box models, in that they capture the thermodynamic laws followed by the physical processes. However, the CFD models are computationally expensive due to their iterative solving process. To reduce the computation overhead, reduced-order models are often used as alternatives. For example, the
fast fluid dynamics (
FFD) is proposed to accelerate the solving process in [
37] and the
heat recirculation matrix (
HRM) is fitted to predict server inlet/outlet temperatures in [
9]. Another alternative is to use black-box data-driven models to learn a thermal map in the data hall. For example, the Weatherman system [
19] predicts the steady-state temperatures of certain server blocks using a neural net consisting of two hidden layers. In [
43], a
long short-term memory (
LSTM) network is designed to predict server CPU temperature. Although these data-driven models are fast and suitable for real-time prediction, they often perform poorly in cases not covered by the training data. For instance, these models cannot well capture the thermal processes during cooling system failures, because training data for such failure scenarios is generally lacking. Grey-box models integrate physical laws and sensor data for temperature forecasting. For instance, in [
17], the grey-box model ThermoCast is proposed based on simplified thermodynamics and fitted with historical data. However, such grey-box methods often rely on specific assumptions about the system dynamics and may not be transferable to other data halls.
\(\blacksquare\) CFD model calibration. To ensure fidelity, CFD models are often manually calibrated by human experts through a trial-and-error process. For example, the CFD models in [
7,
23] are manually fine-calibrated by a human expert. However, the manual approach is labor-intensive and only suitable for small-scale testbeds. Heuristic search methods [
10,
29] can be adopted for automatic calibration, but they often require many iterations. As the mesh complexity increases for the modeled data hall, the CFD model’s solving time may increase from hours to days. Surrogate-assisted calibration [
15] speeds up the parametric search of those compute-intensive and non-differentiable models. It builds a lightweight surrogate of the original model and then uses the surrogate to guide the parameter search. The surrogate design is application-specific [
21,
22,
26,
41]. For example, a response surface methodology based on radial basis functions has been studied for CFD models [
26]. Among these studies, data-driven surrogates are advantageous for their fast forward evaluation. However, the design of surrogate-assisted optimization faces a general challenge in balancing the surrogate's fidelity against the computation overhead of generating its training data by executing the original compute-intensive model. A possible solution is to improve the local approximation of the data-driven surrogate via proper training data selection. Unfortunately, few studies are dedicated to investigating this in the context of CFD modeling for large-scale data halls.
\(\blacksquare\) CFD model reduction. To accelerate the thermal simulation of a data hall, several approaches have been proposed to reduce the computational complexity of the CFD model. They can be classified into partial-reserved and full-reserved approaches. The partial-reserved approaches simulate the effects of certain parameters on temperature only at certain discrete points, such as the server inlets/outlets [
9] or the cold/hot aisles [
40]. To maintain spatial resolution, the full-reserved reductions are desirable for a complete temperature field approximation. The POD-based method is a representative full-reserved approach. It approximates the temperature field with a set of orthogonal basis functions and corresponding coefficients. Existing studies [
25,
32] have shown that the POD-based methods approximate well the original CFD models built for small-scale data halls. However, they assume the boundary conditions of the CFD models are calibrated and only evaluate the POD's accuracy against the CFD-simulated results. As such, the POD's performance against real sensor measurements has not been systematically investigated.
\(\blacksquare\) Knowledge-based methods. Knowledge-based modeling incorporates empirical methods or first principles to improve model approximation with less data. For neural nets, the knowledge can be any extra information about the modeled function beyond the function’s inputs/outputs used as training samples [
6]. Several studies have shown that knowledge-based neural nets exhibit better extrapolation capabilities while requiring less training data than vanilla neural nets. In [
35], the neural net is trained by learning a loss function capturing a physical constraint expressed in closed form. This method is also applied in neural surrogate modeling for fluid flows without using any simulator-generated data [
36]. For POD-based reduction, the knowledge can be used to develop equations for solving the POD coefficients with new boundary conditions. The knowledge-based methods include the Galerkin projection [
25] and the heat flux matching process [
32]. In this article, we adopt the principle of energy balance to build a linear equation system at locally specified regions to solve for the POD coefficients.
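The coefficient-solving step can be sketched as follows. As a simplified stand-in for the energy-balance equations, the sketch assembles its small linear system from values at a handful of observed cells and solves it by least squares; the modes, positions, and coefficients are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, k = 500, 3
# Orthonormal stand-in POD modes (via QR here; in practice they come
# from the snapshot decomposition of CFD results).
modes = np.linalg.qr(rng.normal(size=(n_cells, k)))[0]

true_coeffs = np.array([2.0, -0.5, 1.5])
field = modes @ true_coeffs             # unknown full temperature field

# Assemble a small linear system at a few specified cells (e.g., sensed
# locations) and solve for the k coefficients, instead of projecting the
# unavailable full field onto the modes.
observed = rng.choice(n_cells, size=10, replace=False)
A = modes[observed, :]                  # 10 equations, k = 3 unknowns
b = field[observed]
coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
reconstructed = modes @ coeffs          # full-field estimate
```

The key point is that only a local linear system of size proportional to the (small) number of retained modes must be solved online, which is what makes sub-second temperature field reconstruction feasible.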
Our prior work [
40] proposed the knowledge-based neural surrogate calibration method and evaluated its effectiveness on two CFD models built for industry-grade data halls. In this article, we further reduce the order of the calibrated CFD models to accelerate their simulation speed and evaluate the performance of the reduced-order models using physical sensor measurements. The reduced-order model is developed based on the POD technique and can be efficiently solved by adopting the energy balance principle of the modeled data hall.
6 Discussions
We now discuss several noteworthy issues. First, the primary purpose of Kalibre's surrogate is to improve the efficiency of the parameter search. The surrogate does not provide a full-fledged approximation of the CFD model; for instance, unlike the CFD model, it does not model the temperatures at locations without sensors. Thus, only the calibrated CFD model or the reduced-order model shall be used as a digital twin for run-time temperature evaluation of the modeled data hall. Second, the surrogate architecture described in this article is for data halls installed with hot-aisle containments and server blanking panels; thus, heat recirculation is not considered. To address data halls without hot-aisle containments, heat recirculation and temperature mixing effects should be added to the neural surrogate's design. Third, due to the computation overhead, the training data used to solve the POD modes are generated using boundary conditions from only a subset of historical measurements. To extrapolate the reduced-order model to other cases, it is important to generate extra CFD samples under various boundary configurations. Fourth, this article mainly focuses on temperature prediction. For other types of prediction, the proposed approach can be extended to address their calibration and reduction problems with proper surrogate designs. For instance, if air flow rate sensors are deployed, Kalibre and Kalibreduce can be extended to calibrate the CFD model for predicting the air velocity distribution.