Deep Reinforcement Learning For HVAC Control in Smart Buildings
[Figure 1 diagram: in the learning loop, the Q network observes the last state of the real building or of the EnergyPlus model through the BCVTB interface, issues a control action, and receives the next state after the building state transition.]
Figure 1: Our deep reinforcement learning (DRL) based framework for HVAC control and evaluation. The details of the building state transition are defined in Section 2. The details of the DRL learning and control process are presented in Section 3.
framework and an efficient heuristic variant, and (3) facilitating algorithm training and evaluation with a co-simulation framework. Figure 1 illustrates our DRL-based framework for HVAC control and evaluation. During building operation, it learns an effective control policy based on sensing data input, without relying on any thermal dynamics model. For offline training and validation of the algorithm, we leverage detailed building dynamics models built in the widely-adopted EnergyPlus simulation tool [4]. Simulation results demonstrate that the proposed framework is able to significantly reduce the energy cost while meeting the room temperature requirements. It should be noted that while the detailed EnergyPlus models are highly accurate and suitable for offline training and validation, their high complexity makes them unsuitable for real-time control.

In summary, the main contributions of our paper include:
• We formulate HVAC control operations as a Markov decision process, with definitions of the system state, control action and reward function. These are the key concepts in our data-driven HVAC control approach using DRL.
• We develop a DRL-based algorithm for minimizing the building energy cost while maintaining a comfortable temperature for building tenants. For higher scalability, we further propose a heuristic variant for efficient control of complex multiple-zone systems.
• We develop a co-simulation framework based on EnergyPlus for offline training and validation of our DRL-based algorithms, with real-world weather and time-of-use pricing data. Our experiment results demonstrate that the proposed DRL-based algorithms can achieve 20%–70% energy cost reduction when compared with a rule-based baseline control strategy.
The remainder of the paper is organized as follows. Section 2 presents our MDP formulation for the HVAC control operation. Section 3 presents our DRL-based HVAC control algorithms. Section 4 shows the experimental results, and Section 5 concludes the paper.

2. MDP FORMULATION FOR DRL-BASED BUILDING HVAC CONTROL

The building HVAC system is operated to maintain a desired temperature within each zone, based on the current zone temperature and outside environment disturbances. The zone temperature at the next time step is determined only by the current system state, the environment disturbances, and the conditioned air input from the HVAC system; it is independent of the previous states of the building. Therefore, the HVAC control operation can be treated as a Markov decision process. Next, we formulate the key concepts in this process to facilitate our DRL-based HVAC control algorithm.

Control actions: We consider a building that has z temperature zones and is equipped with a VAV (variable air volume) HVAC system. The VAV terminal box at each zone provides conditioned air (typically at a constant temperature) with an air flow rate that can be chosen from multiple discrete levels, denoted as F = {f_1, f_2, ..., f_m}. Therefore, the entire action space A = {A_1, A_2, ..., A_n} of the building HVAC control includes all possible combinations of air flow rates across the zones, i.e., n = m^z. Clearly, the dimension of the action space increases rapidly with the number of zones and air flow rate levels, which greatly increases the training time and degrades the control performance. In Section 3.3, we introduce a multi-level control heuristic for multiple zones to combat this challenge.
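To make the combinatorial growth concrete, the short Python sketch below enumerates the joint action space as the Cartesian product of per-zone flow levels; the flow-level values and zone count are illustrative assumptions, not the settings used in the paper.

from itertools import product

# Hypothetical example: m = 5 discrete flow-rate levels and z = 4 zones.
flow_levels = [0.0, 0.1, 0.2, 0.3, 0.4]          # F = {f_1, ..., f_m}
num_zones = 4

# Each joint action assigns one flow level to every zone.
action_space = list(product(flow_levels, repeat=num_zones))

print(len(action_space))          # 5**4 = 625 joint actions, i.e. n = m**z
print(action_space[0])            # e.g. (0.0, 0.0, 0.0, 0.0)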
System states: The optimal control action is determined based on the observation of the current system state. In this work, we consider the current (physical) time, the zone temperatures and the environment disturbances (i.e., ambient temperature and solar irradiance intensity) to determine the optimal control action. In particular, incorporating current time information in the state enables the DRL algorithm to adapt to time-related activities, such as time-varying temperature requirements, electricity price, occupant activities and equipment operation in the building. For environment disturbances, instead of using only the current ambient temperature and solar irradiance, we also take into account a multi-step forecast of the weather data. This is important because the weather pattern can vary significantly. Considering a short sequence of weather forecast data enables our DRL algorithm to capture the trend of the environment, perform proactive control and adapt to time-variant systems.
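A minimal sketch of how such a state vector could be assembled is shown below. The field layout (time of day, per-zone temperatures, current disturbances, and a multi-step weather forecast) follows the description above, while the helper name build_state and all numeric values are illustrative assumptions rather than the paper's implementation.

def build_state(hour_of_day, zone_temps, outdoor_temp, solar_irradiance,
                forecast_temps, forecast_solar):
    """Concatenate time, zone temperatures and (current + forecast) disturbances."""
    state = [hour_of_day]
    state += list(zone_temps)                      # one entry per zone
    state += [outdoor_temp, solar_irradiance]      # current disturbances
    state += list(forecast_temps)                  # multi-step weather forecast
    state += list(forecast_solar)
    return state

# Illustrative values: 4 zones, a 3-step forecast.
s_t = build_state(hour_of_day=14,
                  zone_temps=[23.5, 24.1, 22.8, 23.9],
                  outdoor_temp=31.0, solar_irradiance=620.0,
                  forecast_temps=[31.5, 32.0, 31.2],
                  forecast_solar=[640.0, 600.0, 480.0])
print(len(s_t))   # dimension of the state vector: 1 + 4 + 2 + 3 + 3 = 13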
Rewards function: The goal of the DRL algorithm is to minimize the total energy cost while maintaining the temperature of each zone within a desired range, by taking appropriate control actions at each control step.
Following the standard Q-learning update rule, the Q-value estimate is updated as

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + η (r_{t+1} + γ max_{a_{t+1}} Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t))    (3)

where η ∈ (0, 1] represents the learning rate of the value estimates during the training process. Equation (3) should converge to the optimal value Q*(s_t, a_t) over time under the MDP environment.
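For reference, a minimal tabular version of the update in Equation (3) could look as follows; the dictionary-based Q table, the learning rate and the discount factor are illustrative choices, not the paper's settings.

from collections import defaultdict

Q = defaultdict(float)        # Q[(state, action)] -> value estimate
eta, gamma = 0.1, 0.9         # learning rate and discount factor (illustrative)

def q_learning_update(s_t, a_t, r_next, s_next, actions):
    """One application of Equation (3)."""
    best_next = max(Q[(s_next, a)] for a in actions)
    td_error = r_next + gamma * best_next - Q[(s_t, a_t)]
    Q[(s_t, a_t)] += eta * td_error

# Example transition with two on/off actions.
q_learning_update(s_t="warm", a_t="on", r_next=-1.0, s_next="cool", actions=["on", "off"])
print(Q[("warm", "on")])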
Building control sequence: Our DRL algorithm interacts with the building environment during operation, or with the EnergyPlus model via the BCVTB interface (a Ptolemy II based platform that enables co-simulation across different models [26]) during offline training and validation.

As shown in Figure 2, we use a separate control step Δt_c = kΔt_s to represent the control frequency of the DRL algorithm. Every Δt_c time, as shown in Equation (4), the DRL algorithm observes the building state and updates the control action. Between two control time steps, the control action used to operate the HVAC system remains the same as the last updated action. Meanwhile, as described by Equation (5), the building receives the control signal and enters its next state every Δt_s time, which represents the building simulation or sensor sampling frequency.

a_t = f_DRL(s_{t−Δt_s})    (4)

s_t = f_ENV(s_{t−Δt_s}, a_{t−Δt_s})    (5)

[Figure 2: the control sequence of the DRL algorithm, which observes the state and updates the control action every control step Δt_c = kΔt_s (Equation (4)), while the building state transitions every simulation/sampling step Δt_s (Equation (5)).]
same as the last updated action. While in Equation (5), the
building receives the control signal and enters its next state Being consistent with the Q-learning update process (3),
every ∆ts time, which represents the building simulation or the target value Q∗ (st , ait ) in the neural network can be esti-
sensor sampling frequency. mated by Equation (7) when using gradient descent, where
Q values are approximated by the neural network.
at = fDRL (st−∆ts ) (4)
Q∗ (st , at ) = rt+1 + γ max Q(st+1 , at+1 ) (7)
st = fEN V (st−∆ts , at−∆ts ) (5) at+1
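A small numerical sketch of Equations (6) and (7) is given below, treating the network outputs for s_t and s_{t+1} as plain arrays; all values are made up for illustration, and the target for actions other than the one taken is left at the current estimate (as formalized later in Equation (10)).

import numpy as np

gamma = 0.9
q_current = np.array([0.2, -0.5, 0.1])    # network output Q(s_t, a) for n = 3 actions
q_next    = np.array([0.0, 0.3, -0.2])    # network output Q(s_{t+1}, a)
r_next    = -0.4                          # immediate reward r_{t+1}
a_taken   = 1                             # index of the action actually taken

# Equation (7): target value for the taken action.
q_target = q_current.copy()
q_target[a_taken] = r_next + gamma * q_next.max()

# Equation (6): mean squared error over the n outputs.
n = len(q_current)
loss = np.sum((q_target - q_current) ** 2) / (2 * n)
print(loss)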
Training data pre-processing: The input state vector s_t consists of various types of features of the building. The range of values for each feature can vary significantly. To facilitate the learning process, we scale the feature values to a similar range before feeding the input state to the neural network. In this work, we scale the input state vector to the range [0, 1] as shown in (8), where x represents a feature in the input state. The minimum and maximum values of each feature can be estimated from historical observations.

x' = (x − min(x)) / (max(x) − min(x))    (8)
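Equation (8) is standard min-max scaling; a brief sketch with illustrative historical bounds:

import numpy as np

# Illustrative historical bounds for each feature (e.g. hour, zone temp, outdoor temp).
x_min = np.array([0.0, 15.0, -10.0])
x_max = np.array([23.0, 30.0, 45.0])

def scale_state(x):
    """Equation (8): map each feature to [0, 1] using historical min/max."""
    return (x - x_min) / (x_max - x_min)

print(scale_state(np.array([14.0, 23.5, 31.0])))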
For the output units, a linear layer is used to infer the Q-value estimates from the hidden units. However, if we directly use the reward function (1) to calculate the target Q-value as shown in (7), it may result in a large variance in the target value. During backward propagation, the corresponding bias factor in the last linear layer may then dominate the derivative of the loss function, which prevents the weights in earlier layers from learning the optimal value. In order to overcome this limitation, we calculate the target value by first shrinking the original immediate reward with a factor ρ and then clipping the result if it is smaller than −1, as shown in Equation (9).

target_val(s_{t−1}, a_{t−1}) = max[ r_t/ρ + γ max_{a_t} Q(s_t, a_t), −1 ]    (9)

In this way, we squash the original target value, which has a large variance, into the range [−1, 0]. The underlying principle is that it does not help to know exactly how bad the worse control actions are; we only need to identify which control actions are better.

Training of the neural network: As shown in Figure 1, the one-step state transition process is represented by a tuple (s_{t−1}, a_{t−1}, r_t, s_t), which includes the previous state, the previous action, the immediate reward and the current state. The target vector of the neural network can be calculated by Equation (10), where the target value associated with a_{t−1}, i.e., target_val(s_{t−1}, a_{t−1}), is calculated by Equation (9). For the other control actions, the target value is set to the current value estimate of that action.

target(s_{t−1}) = { target_val(s_{t−1}, a)  if a = a_{t−1};  Q(s_{t−1}, a)  otherwise }    (10)

Next, the target vector is compared with the current inference output of the neural network to calculate the approximation error. Then, we use the RMSprop [8] method to update the parameters of the neural network.
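A compact sketch of Equations (9) and (10), building the per-action target vector from the current network outputs, is shown below; the values of ρ, γ and the example arrays are illustrative.

import numpy as np

gamma, rho = 0.9, 10.0                     # discount and reward-shrinking factor (illustrative)

def target_val(r_t, q_curr_state):
    """Equation (9): shrink the reward by rho and clip the target at -1."""
    return max(r_t / rho + gamma * q_curr_state.max(), -1.0)

def target_vector(q_prev_state, a_prev, r_t, q_curr_state):
    """Equation (10): replace only the entry of the taken action a_{t-1}."""
    v = q_prev_state.copy()                # keep current estimates for other actions
    v[a_prev] = target_val(r_t, q_curr_state)
    return v

q_prev = np.array([-0.3, -0.6])            # Q(s_{t-1}, a) for two actions
q_curr = np.array([-0.2, -0.4])            # Q(s_t, a)
print(target_vector(q_prev, a_prev=0, r_t=-3.0, q_curr_state=q_curr))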
3.2 DRL Algorithm Design

Our DRL-based HVAC control algorithm is presented in Algorithm 1. The outer loop controls the number of training episodes, while the inner loop performs HVAC control at each simulation time step within one training episode.

Initial setup: During the learning process, the recent transition tuples (s_{t−1}, a_{t−1}, r_t, s_t) are stored in the memory M, from which mini-batches of samples are generated for neural network training. At the beginning, we first initialize the memory M as an empty set. Then, we initialize the weights w of the neural network similarly to [7]. As shown in Equation (7), updating the neural network weights requires the target value, which itself depends on the weights of the neural network. To break this dependency loop between the target value and the weights w, in line 3 a separate neural network Q̂ is created for calculating the target value, similarly to [12]. This network Q̂ is periodically updated by copying the parameters from the network Q during the learning process. In line 4, the variable a stores the control action taken in the last control step, and s_pre and s_cur represent the building state in the previous and current control time steps, respectively.

Algorithm 1 DRL-based HVAC Control Algorithm
1: Initialize memory M = ∅
2: Initialize neural network Q with parameters w
3: Copy neural network Q and store it as Q̂(·|ŵ)
4: Initialize control action a, states s_pre and s_cur
5: for m := 1 to N do
6:   Reset building environment to initial state
7:   for t_s := 0 to L do
8:     if t_s mod k == 0 then
9:       s_cur ← current observation
10:      r = reward(s_pre, a, s_cur)
11:      M ← (s_pre, a, r, s_cur)
12:      Draw mini-batch (s, a, r, s′) ← M
13:      Target vectors v ← target(s)
14:      Train Q(·|w) with s, v
15:      Every dΔt_c steps, Q̂(·|ŵ) ← Q(·|w)
16:      ε = max(ε − Δε, ε_min)
17:      a = A_i with i = random(n), with probability ε; a = argmax_ã Q(s_cur, ã), otherwise
18:      s_pre ← s_cur
19:    end if
20:    Execute action a in building environment
21:  end for
22: end for

Learning process: Within each training episode, line 8 determines whether the current time step t_s is a control time step. As discussed in Section 2, the control step Δt_c is k times the simulation step Δt_s. If t_s is a control time step, the algorithm performs training and determines the new control action (lines 9 to 18). Otherwise, the building maintains the current control action.

During the learning process, in line 9 we first observe the state at the current control time step. Then, the immediate reward is calculated by Equation (1). Next, in line 11 the state transition tuple is stored in the memory. Then, a mini-batch of transition tuples is drawn randomly from the memory. Lines 13 to 14 follow Equation (10) to calculate the target vector and update the weights of the neural network Q using the RMSprop back-propagation method [8]. In line 15, the network Q̂ is updated with the current weights of the network Q every d control time steps. This Q̂ network is then used for inferring the target value for the next d control steps.

Next, in lines 16 to 18 the network Q is utilized to determine the next control action. The ε-greedy policy is used to select the control action based on the output of Q. The algorithm has a probability ε of exploring the action space by randomly selecting an available action; otherwise, it chooses the action with the maximum value estimate. After each training step, in line 16 the exploration rate ε is gradually decreased until it reaches a lower bound ε_min. In this way, the DRL algorithm is more likely to try different control actions at the beginning; as the training process proceeds, the DRL algorithm has a higher chance of following the learned policy. Finally, in line 18 the current state is assigned to s_pre to prepare for the next training step.
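The exploration step in lines 16 to 17 of Algorithm 1 can be sketched as follows; the decay amount, the bounds and the example Q-values are illustrative stand-ins rather than the tuned values used in the experiments.

import random

epsilon, epsilon_min, delta_eps = 1.0, 0.1, 0.01   # illustrative exploration schedule

def epsilon_greedy(q_values):
    """Lines 16-17: decay epsilon, then explore with probability epsilon."""
    global epsilon
    epsilon = max(epsilon - delta_eps, epsilon_min)
    if random.random() < epsilon:
        return random.randrange(len(q_values))        # random action A_i, i = random(n)
    return max(range(len(q_values)), key=lambda i: q_values[i])   # argmax_a Q(s_cur, a)

print(epsilon_greedy([-0.4, -0.2, -0.9]))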
3.3 Heuristic Adaptation for Multiple Zones

We present a heuristic mechanism that adapts our DRL algorithm for multi-zone HVAC control. As discussed in Section 2, the action space has a cardinality of m^z, which increases exponentially with the number of zones in the building. Training a neural network with such a large number of outputs is inefficient or even infeasible in practice.

In our heuristic, instead of using a single neural network to approximate the Q-values of all control actions in the building, we separately train a neural network for each zone using Algorithm 1. Each neural network is responsible for approximating the Q-values of one zone. At each time step, all networks receive the state of the building and then determine the control action for each zone separately. After executing the control action, the temperature violation penalty for each zone is calculated similarly to Equation (1). The electricity cost of each zone is calculated by Equation (11), which apportions the total cost in proportion to the air flow demand of each zone.

cost_i = cost · u_i / (Σ_i u_i)    (11)

where cost denotes the total electricity cost of the building and u_i represents the air flow rate in zone i. Although the total electricity cost is not exactly a linear function of the air flow rate, we can still heuristically estimate the amount of cost contributed by each zone by following Equation (11).
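Equation (11) simply apportions the measured total cost in proportion to each zone's air flow; a short sketch with illustrative flow values:

def zone_costs(total_cost, flows):
    """Equation (11): split the total electricity cost in proportion to zone air flows."""
    total_flow = sum(flows)
    if total_flow == 0:
        return [0.0] * len(flows)       # no air flow: no cost attributed
    return [total_cost * u / total_flow for u in flows]

# Illustrative: $12 total cost, air flow rates u_i for four zones.
print(zone_costs(12.0, [0.4, 0.1, 0.3, 0.2]))   # [4.8, 1.2, 3.6, 2.4]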
4. EXPERIMENTAL RESULTS

4.1 Experiment Setup

We demonstrate the effectiveness of our DRL-based algorithms through simulations in EnergyPlus. We train the DRL algorithms offline in this co-simulation environment with real-world weather and time-of-use electricity pricing data.

We evaluate the performance of our DRL algorithms by comparing them with a rule-based HVAC control strategy (similar to the one in [22]) and the conventional RL (Q-learning) method. In the rule-based approach, the HVAC system is operated by an on-off control strategy such that if the zone temperature exceeds the cooling setpoint (i.e., 24°C in our experiment), the room is cooled at the maximum air flow rate. If the temperature drops below a certain threshold², the air flow in the zone is turned off. In our experiments, for both the baseline approaches and the DRL algorithms, the conditioned air temperature supplied by the HVAC system is set to 10°C.

² We find that setting this threshold to 20°C helps the rule-based approach minimize the temperature violation rate.
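The rule-based baseline described above amounts to a simple per-zone on-off rule; the sketch below reflects our reading of it, using the 24°C cooling setpoint and 20°C cut-off threshold from the text, an assumed maximum flow value, and the assumption that the previous flow setting is held between the two thresholds.

COOLING_SETPOINT = 24.0    # degrees C, from the experiment setup
OFF_THRESHOLD = 20.0       # degrees C, threshold used by the rule-based baseline
MAX_FLOW = 0.5             # assumed maximum air flow rate, illustrative

def baseline_control(zone_temp, current_flow):
    """On-off rule: full cooling above the setpoint, off below the threshold."""
    if zone_temp > COOLING_SETPOINT:
        return MAX_FLOW
    if zone_temp < OFF_THRESHOLD:
        return 0.0
    return current_flow        # otherwise keep the previous flow setting

print(baseline_control(25.1, 0.0))   # -> 0.5 (cool at maximum air flow)
print(baseline_control(19.5, 0.5))   # -> 0.0 (turn the air flow off)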
4.2 Experiment Results

A. Effectiveness of DRL control algorithms in meeting temperature requirements: We evaluate the performance of our DRL algorithms with three buildings modeled in EnergyPlus, which have 1 zone, 4 zones and 5 zones, respectively. The HVAC system can provide a multi-level air flow rate for each zone. In this work, we test our DRL algorithms with two-level (i.e., on-off control) and five-level air flow control, where the levels are evenly distributed between the minimum and maximum air flow rate of the HVAC system. Figure 4 shows the zone temperature in the 1-zone and 4-zone buildings in August, where our regular DRL algorithm in Algorithm 1 performs on-off control to operate the HVAC system. We can see that after training, the DRL algorithm is quite effective in maintaining the zone temperature within the desired range.

[Figure 4: zone temperature (°C) over the days of the month before and after training, together with the comfort temperature band; panels (a) 1 zone and (b) 4 zones.]

[Figure 5 plot: panels (a) 1 zone and (b) 4 zones.]
Figure 5: Q value in 1-zone and 4-zone buildings

As discussed in Section 3.3, the action space increases exponentially with the number of zones. For the 5-zone building, the total number of actions exceeds 3000 with 5-level air flow rate control, which would be intractable for our regular DRL algorithm. Therefore, we leverage the efficient heuristic method in Section 3.3 to perform multi-level control in multi-zone buildings. Figure 6 compares the average frequency of temperature violations of the baseline strategy, conventional Q-learning, our regular DRL algorithm with on-off control, and the heuristic DRL algorithm with 5-level control. We can see that the DRL algorithms are able to keep the percentage of temperature violations at a low level.

[Figure 6 bar chart: temperature violation rate (0–6%) of the baseline, Q learning, DRL regular on-off and DRL heuristic 5-level controllers for the 1-zone, 4-zone and 5-zone buildings in Area 1 and Area 2.]
Figure 6: Comparison of temperature violation rate between our DRL algorithm, baseline approach and Q learning
overview of mini–batch gradient descent. http://www.cs.
toronto.edu/˜tijmen/csc321/slides/lecture slides lec6.pdf.
B. Effectiveness of DRL algorithms in energy cost
[9] B. Li and L. Xia. A multi-grid reinforcement learning
reduction: Figure 7 shows the comparison of average daily method for energy conservation and comfort of HVAC in
electricity cost of our DRL control algorithms, conventional buildings. pages 444–449, 2015.
Q learning and the baseline approach. The percentage of [10] Y. Ma and et al. Model predictive control for the operation
cost reduction achieved by DRL algorithms compared with of building cooling systems. IEEE Transactions on Control
the baseline approach is marked in the figure. Systems Technology, 20(3):796–803, 2012.
[11] M. Maasoumy and et al. Model-based hierarchical optimal
$
Baseline Q learning DRL regular on-off DRL heuristic 5-level control design for HVAC systems. DSCC, 2011.
400
10.6% [12] V. Mnih and et al. Human-level control through deep
300 16.2% 19.1%
23.3%
4.9% 18.9%
reinforcement learning. Nature 518.7540, 2015.
200 22.2% 35.1% [13] National Solar Radiation Data Base. http://rredc.nrel.gov.
56.8% 25.9%
100 71.2% 52.9% [14] D. Nikovski, J. Xu, and M. Nonaka. A method for
0 computing optimal set-point schedules for HVAC systems.
1-zone Area 1 1-zone Area 2 4-zone Area 1 4-zone Area 2 5-zone Area 1 5-zone Area 2
REHVA World Congress CLIMA, 2013.
Figure 7: Comparison of energy cost between our DRL al- [15] F. Oldewurtel and et al. Energy efficient building climate
gorithms, baseline approach and Q learning control using stochastic model predictive control and
We can see that our regular DRL algorithm can achieve significant energy cost reduction compared with the baseline approach and conventional Q-learning. The efficient heuristic DRL can leverage multi-level control in multi-zone buildings to achieve further reduction. Furthermore, the DRL algorithms are more effective in reducing energy cost for Area 2, since the learning process is more effective under its milder weather profile. Compared with the 5-zone building, the DRL algorithms achieve more reduction for the 1-zone and 4-zone buildings. That is likely because the 5-zone building is more sensitive to outside disturbances and is hence more challenging for the learning process.

5. CONCLUSIONS

This paper presents a deep reinforcement learning based data-driven approach to control building HVAC systems. A co-simulation framework based on EnergyPlus is developed for offline training and validation of the DRL-based approach. Experiments with detailed EnergyPlus models and real weather and pricing data demonstrate that the DRL-based algorithms (including the regular DRL algorithm and a heuristic adaptation for efficient multi-zone control) are able to significantly reduce the energy cost while maintaining the room temperature within the desired range.

Acknowledgments. The authors gratefully acknowledge the support from the National Science Foundation award CCF-1553757 and the Riverside Public Utilities.

6. REFERENCES
[1] E. Barrett and S. Linder. Autonomous HVAC Control, A Reinforcement Learning Approach. Springer, 2015.
[2] L. Bottou. Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT, 2010.
[3] G. T. Costanzo et al. Experimental analysis of data-driven control for a building heating system. CoRR, abs/1507.03638, 2015.
[4] EnergyPlus. https://energyplus.net/.
[5] D. Ernst et al. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 2005.
[6] P. Fazenda et al. Using reinforcement learning to optimize occupant comfort and energy usage in HVAC systems. Journal of Ambient Intelligence and Smart Environments, pages 675–690, 2014.
[7] K. He et al. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. IEEE International Conference on Computer Vision, 2015.
[8] G. Hinton, N. Srivastava, and K. Swersky. Lecture 6a: Overview of mini-batch gradient descent. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
[9] B. Li and L. Xia. A multi-grid reinforcement learning method for energy conservation and comfort of HVAC in buildings. Pages 444–449, 2015.
[10] Y. Ma et al. Model predictive control for the operation of building cooling systems. IEEE Transactions on Control Systems Technology, 20(3):796–803, 2012.
[11] M. Maasoumy et al. Model-based hierarchical optimal control design for HVAC systems. DSCC, 2011.
[12] V. Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540), 2015.
[13] National Solar Radiation Data Base. http://rredc.nrel.gov.
[14] D. Nikovski, J. Xu, and M. Nonaka. A method for computing optimal set-point schedules for HVAC systems. REHVA World Congress CLIMA, 2013.
[15] F. Oldewurtel et al. Energy efficient building climate control using stochastic model predictive control and weather predictions. ACC, 2010.
[16] S. J. Olivieri et al. Evaluation of commercial building demand response potential using optimal short-term curtailment of heating, ventilation, and air-conditioning loads. Journal of Building Performance Simulation, 2014.
[17] D. Ormoneit and Ś. Sen. Kernel-based reinforcement learning. Machine Learning, 49(2):161–178, 2002.
[18] M. Riedmiller. Neural Fitted Q Iteration – First Experiences with a Data Efficient Neural Reinforcement Learning Method. Springer, 2005.
[19] D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 2016.
[20] SCE. https://www.sce.com/NR/sc3/tm2/pdf/CE281.pdf.
[21] ASHRAE. Standard 55-2004: Thermal environmental conditions for human occupancy. ASHRAE Inc., 2004.
[22] D. Urieli and P. Stone. A learning agent for heat-pump thermostat control. AAMAS, 2013.
[23] U.S. DoE. Buildings energy data book.
[24] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[25] T. Wei, Q. Zhu, and M. Maasoumy. Co-scheduling of HVAC control, EV charging and battery usage for building energy efficiency. ICCAD, 2014.
[26] M. Wetter. Co-simulation of building energy and control systems with the Building Controls Virtual Test Bed. Journal of Building Performance Simulation, 2011.
[27] L. Yang et al. Reinforcement learning for optimal control of low exergy buildings. Applied Energy, 2015.