1 Introduction
The scale of the
data center (
DC) industry has been rapidly growing in response to the ever-increasing cloud computing and storage demands [
3]. Such growth brings substantial challenges for DC operations, particularly in avoiding operational risks and reducing energy costs of the mission-critical infrastructure. Currently, DCs are mostly operated in a reactive way by the
data center infrastructure management (
DCIM) system with feedback controllers [
2,
42]. The DCIM provides operators with deployed sensor measurements so that they can respond properly to abnormalities and failures. However, traditional DCIM does not provide the accurate prediction capabilities desired for proactive DC management. Such capabilities would enable operators to perform various what-if analyses, e.g., whether raising certain temperature setpoints can improve energy efficiency without causing server overheating.
We consider
predictive digital twins for the desired capability extension [
33]. Computational fluid dynamics (
CFD) modeling is a primary technique to characterize the thermodynamics in data halls [
27]. A CFD model can estimate the air velocity and temperature distributions in a given space by solving the
Navier–Stokes (
NS) and energy balance equations [
4]. It has been adopted in the offline optimization for DC energy cost reduction and thermal risk management [
28]. However, the accuracy and simulation speed of CFD models in general do not meet the online analysis requirements for two reasons [
30]. First, the assumptions or simplifications made in the offline phase may distort the online results. Second, solving the governing NS equations may require lengthy computing time, from hours to days.
The accuracy of a CFD model is mainly determined by the accuracy and completeness of the given boundary conditions. A model with incomplete boundary conditions may diverge from the ground truth. For example, as reported in [
34,
41], an uncalibrated CFD model can yield temperature prediction errors of up to 5°C. Such low accuracy impedes using the CFD model for operational adjustments that pursue energy efficiency without incurring thermal risk. Unfortunately, obtaining complete and accurate boundary conditions is often challenging due to (1) the large number of parameters in the boundary spaces and (2) the labor-intensive and error-prone manual calibration process for these parameters. For instance, each server in a data hall may have its own passing-through air flow rate characteristics due to its internal fan control logic. Such information is often unavailable in the server hardware's specification and can only be estimated empirically or collected manually via
in situ measurement. As a result, rough settings of the server air flow rates can significantly degrade the CFD model's prediction accuracy. Existing heuristic approaches [
10,
29] (evolution strategies, genetic algorithms, simulated annealing, etc.) can be applied to calibrate these boundary conditions. However, these approaches in general require many search iterations, e.g., hundreds as shown in Section
5, to find accurate settings for the system configuration parameters. In each iteration, the CFD model is solved with the candidate parameters. When the CFD model is built for a large-scale data hall with millions of mesh cells, the iterative search may incur unacceptable computation time. As such, the existing search-based approaches scale poorly with the granularity of the CFD model.
To advance automatic calibration, we propose Kalibre, a neural surrogate-assisted approach to calibrate data hall CFD models of increasing scale and complexity. With the help of a trainable neural net, Kalibre avoids directly solving the CFD model for parameter search by iterating four key steps. First, the “coarse” surrogate is trained on CFD-generated data to align with the “fine” CFD model in the locality of the current system state. Second, the trained surrogate is re-optimized by updating the system configuration, which is also part of the neural net's trainable variables, to maximize the consistency between the surrogate's predictions and the ground-truth sensor measurements. Third, the updated system configuration is fed back to the CFD model for refinement. Finally, the ground-truth sensor measurements are used to validate the refined CFD model. Kalibre thus offloads the fine-grained configuration search to the surrogate. Vis-à-vis the existing heuristic approaches that solve the CFD model in every configuration search step, Kalibre solves the CFD model much less frequently, merely to provide feedback to the surrogate.
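The four-step loop above can be sketched as follows. A toy linear "CFD" model stands in for the expensive solver, and a linear least-squares fit stands in for the trainable neural surrogate; all names, sizes, and numbers are illustrative, not the article's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the expensive CFD model (illustrative only): it maps
# 3 boundary-condition parameters to temperatures at 5 sensor locations.
A = rng.normal(size=(5, 3))
TRUE_THETA = np.array([1.2, 0.4, 2.0])    # boundary conditions to recover

def cfd_solve(theta):
    return A @ theta                       # pretend each call takes hours

sensor_truth = cfd_solve(TRUE_THETA)       # ground-truth sensor measurements

theta = np.ones(3)                         # rough initial configuration
X, Y = [], []                              # CFD-generated training set
for _ in range(10):
    y_cfd = cfd_solve(theta)               # Steps 3-4: refine and validate
    if np.mean(np.abs(y_cfd - sensor_truth)) < 1e-3:
        break
    # Step 1: sample near the current state and "train" the surrogate W
    # (a linear least-squares fit stands in for the neural net).
    X += [theta.copy(), theta + rng.normal(scale=0.1, size=3)]
    Y += [y_cfd, cfd_solve(X[-1])]
    W, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(Y), rcond=None)
    # Step 2: freeze the surrogate and re-optimize the configuration so
    # its predictions match the sensors (closed form here for brevity;
    # Kalibre updates the configuration by gradient-based training).
    theta, *_ = np.linalg.lstsq(W.T, sensor_truth, rcond=None)
```

In this toy setting the loop converges after a handful of CFD solves, illustrating why offloading the search to the surrogate is cheaper than solving the CFD model at every candidate configuration.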
The implementation of Kalibre faces two challenges. First, the training data for the neural surrogate is limited, since generating such data with the CFD model is compute-intensive. Second, designing a surrogate that captures the high-dimensional feature space of a data hall is challenging. Piecemeal solutions that address these two challenges separately tend to conflict, i.e., a deeper neural surrogate that better captures the complex feature space may require a large amount of CFD-generated training data. Without proper consideration, the computational cost of training an accurate surrogate might exceed that of directly calibrating the CFD model. To address the challenges, we incorporate prior knowledge and sensor measurements to adaptively generate training data in each of Kalibre's iterations. With this adaptive design, the surrogate update is guided toward searching for better configurations in the locality rather than toward ensuring global optimality. Compared with the random training data sampling adopted by the vanilla approach, the adaptive sampling enabled by the prior knowledge improves the efficiency of using data generated by the CFD simulations. To approximate the temperatures at locations with sensors, we design a knowledge-based neural surrogate that captures the spatial thermal relation that a sensor measurement is mostly affected by the settings of the nearby facilities. We implement Kalibre and apply it to calibrate the CFD models of two production data halls, each sized at hundreds of square meters and hosting thousands of servers. The calibrated CFD models achieve
mean absolute errors (
MAEs) of 0.57°C and 0.88°C in predicting the temperatures at tens of cold/hot aisle positions in each hall, respectively. In contrast, the heuristic configuration search and the vanilla neural net-based surrogate approach achieve MAEs of around 1.46
\(\sim\)2.2°C with the same calibration computation time as Kalibre. We also invite a domain expert to manually fine-calibrate the two CFD models, yielding MAEs of 0.98°C and 1.16°C, respectively. As previous research [
20] has shown that increasing the air temperature is a common practice to reduce cooling energy, the high prediction accuracy achieved by Kalibre is beneficial for data center energy optimization while ensuring thermal safety constraints. For example, according to the ASHRAE standard [
5], the server inlet temperature is not allowed to exceed 27°C to prevent overheating. Therefore, an accurate predictive model can be used to explore a less conservative policy that achieves more energy saving.
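The locality prior behind the knowledge-based surrogate can be illustrated with a small sketch. Here a binary mask, derived from hypothetical server and sensor positions and a made-up 2 m radius (none of which come from the article), restricts each sensor's prediction to nearby facilities:

```python
import numpy as np

# Hypothetical layout: 4 servers and 2 sensors on a 2-D floor plan;
# positions and the 2 m radius are made up for illustration.
server_xy = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
sensor_xy = np.array([[0.5, 1.0], [5.5, 1.0]])

# Connectivity mask: mask[i, j] = 1 iff server j is within 2 m of
# sensor i, encoding the prior that a sensor reading is mostly
# affected by the settings of nearby facilities.
dist = np.linalg.norm(sensor_xy[:, None, :] - server_xy[None, :, :], axis=-1)
mask = (dist < 2.0).astype(float)

# In a neural surrogate, the mask would zero the weights between distant
# server inputs and a sensor's output head; masking a dense weight
# matrix W is a minimal stand-in for that structure.
W = np.ones_like(mask)
flow = np.array([0.8, 1.1, 0.9, 1.2])   # per-server air flow settings
pred = (mask * W) @ flow                 # each sensor sees only its neighbors
```

Because distant servers are masked out, the surrogate has fewer effective parameters to fit, which is what reduces its demand for CFD-generated training data.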
Although the calibrated CFD models achieve high-fidelity temperature prediction, their high computation overhead still challenges online usage. During online usage, the prediction should run on low-end computing devices with short response times, such that potential thermal alarms can be prevented ahead of time. A possible workaround is to adopt Kalibre's neural surrogate for real-time temperature prediction. However, the neural surrogate does not provide a full-fledged temperature field approximation; for example, it is incapable of predicting the temperatures at locations without sensors. To address the high computation overhead, we extend Kalibre to Kalibreduce by integrating a model reduction technique developed based on the
proper orthogonal decomposition (
POD) [
11]. The POD method describes a full field profile with a linear combination of a set of spatial basis functions, i.e., the POD modes, and their corresponding coefficients. While previous studies have investigated the POD for low-order data hall modeling [
25,
32], they assume that the boundary conditions of the CFD are well calibrated. Thus, the POD prediction results are only compared with the original CFD predictions instead of sensor-measured data. Based on our calibrated CFD models, we further evaluate the reduced POD models' performance against sensor data. The reduced POD models achieve comparable MAEs of 0.84°C and 0.98°C, respectively, while taking only 0.53 and 0.76 seconds to reconstruct the temperature field.
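As a minimal illustration of the POD idea (not the article's implementation), the modes can be obtained from the singular value decomposition of a snapshot matrix, and a field is then reconstructed as a short linear combination of the leading modes. All sizes and the hidden rank-3 structure below are illustrative:

```python
import numpy as np

# Snapshot matrix: columns are temperature fields from CFD runs. A hidden
# rank-3 structure mimics the low-dimensional thermal behavior.
rng = np.random.default_rng(1)
basis = rng.normal(size=(1000, 3))              # 1000 mesh cells
snapshots = basis @ rng.normal(size=(3, 8))     # 8 CFD snapshots

# POD modes are the left singular vectors of the snapshot matrix.
U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
modes = U[:, :3]                                # keep the dominant modes

# A new field lying in the span of the snapshots is reconstructed by
# projecting onto the modes; the projections are the POD coefficients.
field = basis @ np.array([0.5, -1.0, 2.0])      # a "new" CFD solution
coeffs = modes.T @ field
approx = modes @ coeffs
```

In practice the snapshots are usually mean-centered and the number of retained modes is chosen by the energy captured by the singular values; the toy example keeps exactly the underlying rank, so the reconstruction is essentially exact.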
In summary, this article develops a systematic framework to evolve the data hall CFD models into high-fidelity and real-time digital twins. We incorporate prior knowledge to address the CFD accuracy and speed problems through surrogate-assisted model calibration and POD-based model reduction, respectively. The contributions of this article are summarized as follows:
–
We formulate the model calibration and reduction problems and propose a systematic solution to both.
–
We develop a surrogate-assisted approach that incorporates prior knowledge to solve the model calibration problem. The calibration requires less human effort than the manual baseline and fewer CFD simulations than search-based algorithms.
–
We further reduce the order of the calibrated CFD models using the POD method and the energy balance principle to accelerate simulation.
–
We conduct extensive evaluations on two industry-grade data halls hosting thousands of servers. The calibrated CFD models achieve MAEs of 0.57°C and 0.88°C, respectively. The reduced-order models achieve comparable performance with MAEs of 0.84°C and 0.98°C, respectively, while taking only 0.53 and 0.76 seconds to reconstruct the temperature field.
Article organization: The rest of this article is structured as follows. Section
2 reviews the related work. Section
3 formulates the calibration and reduction problems. Section
4 presents our proposed approach. Section
5 evaluates the calibration and reduction performances. Section
6 discusses several issues. Section
7 concludes this article.
2 Related Work
This section reviews the relevant studies in DC thermal modeling, CFD model calibration, CFD model reduction, and knowledge-based methods. Table
1 categorizes the existing thermal models, model calibration and model reduction approaches, and their temperature prediction errors. In what follows, we discuss these existing studies in detail.
\(\blacksquare\) DC thermal modeling. A variety of modeling techniques have been proposed for thermal management in data halls. They can be broadly categorized into white-box [
7,
9,
23,
37,
40], black-box [
19,
43], and grey-box [
17] methods. The CFD models are representative white-box models, in that they capture the thermodynamic laws followed by the physical processes. However, the CFD models are computationally expensive due to their iterative solving process. To reduce the computation overhead, reduced-order models are often used as alternatives. For example, the
fast fluid dynamics (
FFD) is proposed to accelerate the solving process in [
37] and the
heat recirculation matrix (
HRM) is fitted to predict server inlet/outlet temperatures in [
9]. Another alternative is to use black-box data-driven models to learn a thermal map in the data hall. For example, the Weatherman system [
19] predicts the steady-state temperatures of certain server blocks using a neural net consisting of two hidden layers. In [
43], a
long short-term memory (
LSTM) network is designed to predict server CPU temperature. Although these data-driven models are fast and suitable for real-time prediction, they often perform poorly in cases not covered by the training data. For instance, these models cannot well capture the thermal processes during cooling system failures, because training data for such failure scenarios is generally lacking. Grey-box models integrate physical laws and sensor data for temperature forecasting. For instance, in [
17], the grey-box model ThermoCast is proposed based on simplified thermodynamics and fitted with historical data. However, such grey-box methods often rely on specific assumptions about the system dynamics and may not be transferable to other data halls.
\(\blacksquare\) CFD model calibration. To ensure fidelity, CFD models are often manually calibrated by human experts through a trial-and-error process. For example, the CFD models in [
7,
23] are manually fine-calibrated by a human expert. However, the manual approach is labor-intensive and only suitable for small-scale testbeds. Heuristic search methods [
10,
29] can be adopted for automatic calibration, but they often require many iterations. As the mesh complexity increases for the modeled data hall, the CFD model’s solving time may increase from hours to days. Surrogate-assisted calibration [
15] speeds up the parametric search of those compute-intensive and non-differentiable models. It builds a lightweight surrogate of the original model and then uses the surrogate to guide the parameter search. The surrogate design is application-specific [
21,
22,
26,
41]. For example, a response surface methodology based on radial basis functions has been studied for CFD models [
26]. Among these studies, data-driven surrogates are advantageous for their fast forward evaluation. However, the design of surrogate-assisted optimization faces a general challenge in balancing the surrogate's fidelity against the computation overhead of generating its training data by executing the original compute-intensive model. A possible solution is to improve the local approximation of the data-driven surrogate via proper training data selection. Unfortunately, few studies are dedicated to investigating this in the context of CFD modeling for large-scale data halls.
\(\blacksquare\) CFD model reduction. To accelerate the thermal simulation of a data hall, several approaches have been proposed to reduce the computational complexity of the CFD model. They can be classified into partial-reserved and full-reserved approaches. The partial-reserved approaches simulate the effects of certain parameters on temperature only at certain discrete points, such as the server inlets/outlets [
9] or the cold/hot aisles [
40]. To maintain spatial resolution, the full-reserved reductions are desirable for a complete temperature field approximation. The POD-based method is a representative full-reserved approach. It approximates the temperature field with a set of orthogonal basis functions and corresponding coefficients. Existing studies [
25,
32] have shown that the POD-based methods approximate well the original CFD models built for small-scale data halls. However, they assume the boundary conditions of the CFD models are calibrated and only evaluate the POD's accuracy against the CFD-simulated results. As such, the POD's performance against real sensor measurements has not been systematically investigated.
\(\blacksquare\) Knowledge-based methods. Knowledge-based modeling incorporates empirical methods or first principles to improve model approximation with less data. For neural nets, the knowledge can be any extra information about the modeled function beyond the function’s inputs/outputs used as training samples [
6]. Several studies have shown that knowledge-based neural nets exhibit better extrapolation capabilities while requiring less training data than vanilla neural nets. In [
35], the neural net is trained by learning a loss function capturing a physical constraint expressed in closed form. This method is also applied in neural surrogate modeling for fluid flows without using any simulator-generated data [
36]. For POD-based reduction, the knowledge can be used to develop equations for solving the POD coefficients with new boundary conditions. The knowledge-based methods include the Galerkin projection [
25] and the heat flux matching process [
32]. In this article, we adopt the principle of energy balance to build a linear equation system at locally specified regions to solve for the POD coefficients.
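The coefficient-solving step can be sketched as follows. As a simplified stand-in for the energy-balance equations, the sketch assembles its small linear system from values at a handful of observed cells and solves it by least squares; the modes, positions, and coefficients are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, k = 500, 3
# Orthonormal stand-in POD modes (via QR here; in practice they come
# from the snapshot decomposition of CFD results).
modes = np.linalg.qr(rng.normal(size=(n_cells, k)))[0]

true_coeffs = np.array([2.0, -0.5, 1.5])
field = modes @ true_coeffs             # unknown full temperature field

# Assemble a small linear system at a few specified cells (e.g., sensed
# locations) and solve for the k coefficients, instead of projecting the
# unavailable full field onto the modes.
observed = rng.choice(n_cells, size=10, replace=False)
A = modes[observed, :]                  # 10 equations, k = 3 unknowns
b = field[observed]
coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
reconstructed = modes @ coeffs          # full-field estimate
```

The key point is that only a local linear system of size proportional to the (small) number of retained modes must be solved online, which is what makes sub-second temperature field reconstruction feasible.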
Our prior work [
40] proposed the knowledge-based neural surrogate calibration method and evaluated its effectiveness on two CFD models built for industry-grade data halls. In this article, we further reduce the order of the calibrated CFD models to accelerate their simulation speed and evaluate the performance of the reduced-order models using physical sensor measurements. The reduced-order model is developed based on the POD technique and can be efficiently solved by adopting the energy balance principle of the modeled data hall.
6 Discussions
We now discuss several noteworthy issues. First, the primary purpose of Kalibre's surrogate is to improve the efficiency of the parameter search. The surrogate does not provide a full-fledged approximation of the CFD model; for instance, unlike the CFD model, it does not model the temperatures at locations without sensors. Thus, only the calibrated CFD model or the reduced-order model shall be used as a digital twin for run-time temperature evaluation of the modeled data hall. Second, the surrogate architecture described in this article is for data halls installed with hot-aisle containments and server blanking panels; thus, heat recirculation is not considered. To address data halls without hot-aisle containments, heat recirculation and temperature mixing effects should be added to the neural surrogate's design. Third, due to the computation overhead, the training data used to solve the POD modes are generated using boundary conditions from only a subset of historical measurements. To extrapolate the reduced-order model to other cases, it is important to generate extra CFD samples under various boundary configurations. Fourth, this article mainly focuses on temperature prediction. For other types of prediction, the proposed approach can be extended to address their calibration and reduction problems with proper surrogate designs. For instance, if air flow rate sensors are deployed, Kalibre and Kalibreduce can be extended to calibrate the CFD model for predicting the air velocity distribution.