Deep Reinforcement Learning For HVAC Control in Smart Buildings

Deep Reinforcement Learning for Building HVAC Control
Tianshu Wei Yanzhi Wang Qi Zhu

University of California, Syracuse University University of California,
Riverside ywang393@syr.edu Riverside
twei002@ucr.edu qzhu@ece.ucr.edu
ABSTRACT authors propose a system model that is bilinear in inputs,
Buildings account for nearly 40% of the total energy con- states and weather parameters, and model the control opti-
sumption in the United States, about half of which is used by mization as a sequential linear programming (SLP) problem.
the HVAC (heating, ventilation, and air conditioning) sys- The work in [11] models building thermal behavior as RC
tem. Intelligent scheduling of building HVAC systems has networks and proposes a tracking linear-quadratic regulator
the potential to significantly reduce the energy cost. How- (LQR) for HVAC control. The work in [25] uses the similar
ever, the traditional rule-based and model-based strategies building model as [11], and develops an MPC-based algo-
are often inefficient in practice, due to the complexity in rithm for co-scheduling HVAC control with other demands
building thermal dynamics and heterogeneous environment and supplies. The performance and reliability of these ap-
disturbances. In this work, we develop a data-driven approaches depend highly on the accuracy of the building ther-
proach that leverages the deep reinforcement learning (DRL) mal dynamics model, which also has to be efficiently solved
technique, to intelligently learn the effective strategy for op- using mathematical tools for practical runtime control [27].
erating the building HVAC systems. We evaluate the per- However, the building temperature is affected by many fac-
formance of our DRL algorithm through simulations using tors, including building structure and materials, surrounding
the widely-adopted EnergyPlus tool. Experiments demon- environment (e.g., ambient temperature, humidity, and solar
strate that our DRL-based algorithm is more effective in radiation intensity), and internal heat gains from occupants,
energy cost reduction compared with the traditional rule- lighting systems and other equipment. As a result, the build-
based approach, while maintaining the room temperature ing temperature often exhibits randomized behaviors under
within desired range. an incomplete modeling. Overall, it is often intractable to
develop a building dynamics model that is both accurate and
efficient enough for effective runtime HVAC control.
1. INTRODUCTION Therefore, some recent works have started developing data-
Buildings account for nearly 40% of the total energy con- driven approaches that leverage real-time data inputs for
sumption in the United States and 70% of the electricity us- reinforcement learning (RL) based HVAC control. In [1, 9,
age [23]. Among various types of building energy loads, such 14], classical Q-learning techniques are presented, which use
as HVAC, lighting, appliances and electric vehicle charging, the tabular Q value function and are not suitable for control
the HVAC system consumes around 50% of the energy us- problems with large state space. In [6], the authors propose
age. The thermal flywheel effect allows the building oper- a neural fitted RL method through the interaction with ten-
ator to perform pre-cooling/per-heating to shift the HVAC ants to determine the optimal temperature setting point.
energy demand while still meeting the requirements on room This approach is only evaluated with single-zone buildings,
temperature [16]. As the energy usage in buildings is often in which the heat transfer process is modeled by simple
charged with time-of-use price (where energy prices vary in form of differential equations. The work in [3] develops a
different time periods), the scheduling flexibility from HVAC model-assisted batch RL approach, where randomized trees
system provides great potential for reducing building energy are used to approximate the action value and decide sim-
cost and improving grid energy efficiency and stability. ple on-off control strategies. These RL-based methods with
In the literature, many approaches have been proposed function approximation work for the continuous state situa-
to control building HVAC systems for energy efficiency [15, tion. However, their batch update mechanism exhibits high
11, 10, 25]. These approaches typically employ a simpli- computational cost throughout the learning process.
fied building thermal dynamics model during runtime con- The recently proposed deep reinforcement learning (DRL)
trol to predict buildings’ temperature evolution. In [10], technique, which has been shown successful in playing Atari
the authors develop a nonlinear model of the overall cooling and Go games [12, 19], emerges as a powerful data-driven
system, including chillers, cooling towers and thermal stor- method for solving complex control problems. The DRL
age banks, and present a model predictive control (MPC) technique can handle large state space by building a deep
scheme for minimizing energy consumption. In [15], the neural network to relate the value estimates and associated
state-action pairs, thereby overcoming the shortcoming of
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed conventional RL. This paper is the first (to the best of au-
for profit or commercial advantage and that copies bear this notice and the full cita- thors’ knowledge) to apply DRL technique for HVAC con-
tion on the first page. Copyrights for components of this work owned by others than trol. Our work achieves good performance and high scalabil-
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
publish, to post on servers or to redistribute to lists, requires prior specific permission ity by (1) formulating the HVAC operation as a Markov de-
and/or a fee. Request permissions from permissions@acm.org. cision process (MDP) 1 , (2) developing a DRL-based control
DAC ’17, June 18-22, 2017, Austin, TX, USA
1

c 2017 ACM. ISBN 978-1-4503-4927-7/17/06. . . $15.00 Our MDP formulation is general and can be time-variant
DOI: http://dx.doi.org/10.1145/3061639.3062224 as well.
For offline training
and validation During operation Control action 𝑎%
𝜖-greedy exploration
Building Environment Control loop & exploitation
Current state 𝑠% ...

Update parameters 𝑤
from sensing data
…
…
…
BCVTB
…
𝒔 Learning loop
EnergyPlus Real building (Last state) 𝑄 Network
Action value error

building model HVAC system Current Q value
𝑄(𝑠, 𝑎)
(𝑠%&' ,𝑎%&' ,𝑠% ) 𝒂 Copy
(Action) 𝜕𝐿
Rewards 𝜕𝑤
𝑟% = 𝑐𝑜𝑠𝑡 𝑎%&', 𝑠%&' + 𝑝𝑒𝑛𝑎𝑙𝑡𝑦 𝑠% ... Future rewards L
(𝑠%&' ,𝑎%&' , 𝑟%, 𝑠% ) 𝒔/ 𝛾max 𝑄"(𝑠 / , 𝑎′) Gradient of Loss
…
…
State transition
…
>?
…
(Next state)
Experience replay 𝑄" Network Add

𝒓
memory (Reward)
Target Q value
Store historical transitions Mini-batch of transitions 𝑟 + 𝛾max
?
𝑄"(𝑠 / , 𝑎′)
>
Figure 1: Our deep reinforcement learning (DRL) based framework for HVAC control and evaluation. The details of building
state transition are defined in Section 2. The details of DRL learning and control process are presented in Section 3.
framework and an efficient heuristic variant, and (3) facili- 2. MDP FORMULATION FOR DRL-BASED
tating algorithm training and evaluation with a co-simulation BUILDING HVAC CONTROL
framework. Figure 1 illustrates our DRL-based framework
for HVAC control and evaluation. During building oper- The building HVAC system is operated to maintain a de-
ation, it learns an effective control policy based on sens- sired temperature within each zone, based on current tem-
ing data input, without relying on any thermal dynamics perature and outside environment disturbances. The zone
model. For offline training and validation of the algorithm, temperature at next time step is only determined by the
we leverage detailed building dynamics model built in the current system state and environment disturbances, and the
widely-adopted EnergyPlus simulation tool [4]. Simulation conditioned air input from the HVAC system. It is indepen-
results demonstrate that the proposed framework is able to dent from the previous states of the building. Therefore, the
significantly reduce the energy cost while meeting the room HVAC control operation can be treated as a Markov decision
temperature requirements. It should be noted that while process. Next, we formulate the key concepts in this process
the detailed EnergyPlus models are highly accurate and suit- to facilitate our DRL-based HVAC control algorithm.
able for offline training and validation, their high complexity Control actions: We consider a building that has z tem-
makes them unsuitable for real-time control. perature zones and is equipped with a VAV (variable air flow
In summary, the main contributions of our paper include: volume) HVAC system. The VAV terminal box at each zone
• We formulate HVAC control operations as a Markov de- provides conditioned air (typically at a constant tempera-
cision process, with definitions of system state, control ture) with an air flow rate that can be chosen from multiple
action and reward function. These are the key concepts discrete levels, denoted as F = {f 1 , f 2 , ..., f m }. Therefore,
in our data-driven HVAC control approach using DRL. the entire action space A = {A1 , A2 , ..., An } of the building
• We develop a DRL based algorithm for minimizing the HVAC control includes all possible combinations of air flow
building energy cost while maintaining comfortable tem- rate for every zone, i.e., n = mz . Clearly, the dimension
perature for building tenants. For higher scalability, we of action space will increase rapidly with larger number of
further propose a heuristic variant for efficient control for zones and air flow rate levels, which will then greatly increase
complex multiple-zone systems. the training time and degrade the control performance. In
• We develop a co-simulation framework based on Ener- Section 3.3, we introduce a multi-level control heuristic for
gyPlus for offline training and validation of our DRL- multiple zones to combat this challenge.
based algorithms, with real-world weather and time-of-
use pricing data. Our experiment results demonstrate System states: The optimal control action is determined
that the proposed DRL-based algorithms can achieve up based on the observation of the current system state. In
to 20% − 70% energy cost reduction when compared with this work, we consider current (physical) time, zone temper-
a rule-based baseline control strategy. ature and environment disturbances (i.e. ambient temper-
ature and solar irradiance intensity) to determine the op-
The remainder of the paper is organized as follows. Sec- timal control action. In particular, incorporating current
tion 2 presents our MDP formulation for the HVAC control time information in the state enables the DRL algorithm to
operation. Section 3 presents our DRL-based HVAC control adapt to time related activities, such as time-varying tem-
algorithms. Section 4 shows the experimental results and perature requirements, electricity price, occupant activities
Section 5 concludes the paper. and equipment operation in the building. For environment
disturbances, instead of just using current ambient tempera-
ture and solar irradiance, we also take into account of multi- 𝑡 𝑡 + Δ𝑡3
step forecast of weather data. This is important because (4) DRL (4) DRL
observe update observe update
the weather pattern can vary significantly. Considering a …
𝑎+,-+. 𝑎+ 𝑎+/(0,1)-+. 𝑎+/0-+.
short sequence of weather forecast data enables our DRL
DRL algorithm
algorithm to capture the trend of the environment, perform
proactive control and adapt to time-variant systems.
𝑡 − Δ𝑡$ 𝑡 … 𝑡 + (k − 1)Δ𝑡$ 𝑡 + kΔ𝑡$ …
Rewards function: The goal of the DRL algorithm is to
minimize the total energy cost while maintaining the tem-
perature of each zone within a desired range, by taking a (5) (5) (5)
sequence of actions {a1 , a2 , . . . , at }, where at ∈ A. After

Building 𝑠+,-+. 𝑠+ … 𝑠+/(0,1)-+. 𝑠+/0-+.
environment
taking an action at−1 at state st−1 , the building will evolve
into a new state st and the DRL algorithm will receive an Figure 2: Building control sequence with DRL algorithm
immediate reward rt , as calculated below in Equation (1).
z
X 3. DRL-BASED HVAC CONTROL
rt = −cost(at−1 , st−1 ) − λ ([Tti − T it ]+ + [T it − Tti ]+ ) (1)
i=1 3.1 Value Function Approximation
which includes the energy cost of the last control action at−1 The combination of possible values of each feature in the
and the total penalty of temperature violation. We use neg- state vector forms a very large state space. In practice,
ative rewards as our DRL algorithm will maximize the to- it is more efficient to use generalization methods, such as
tal reward. It should be noted that the goal of minimizing randomized trees [5], kernel-based method [17] and neural
energy cost contradicts the goal of maintaining desired tem- networks [18] to approximate the Q-value. In this work,
perature, and the reward function tries to balance the two. we use the artificial neural network to approximate the Q-
During the operation of HVAC systems, we want to max- value calculated by Equation (3). As shown in Figure 3, we
imize the accumulative reward R = ∞
P i−1 r
t+i , where
adopt a similar neural network structure as in [12]. With this
i=1 γ
γ ∈ [0, 1] is a decay factor that controls the window length structure, the Q-value estimates for all control actions can
when maximizing the reward. We use Q∗ (st , at ), i.e., the op- be calculated by performing one forward pass (inference) in
timal value, to represent the maximum accumulative reward the neural network. This can greatly improve the efficiency
we can obtain by taking action at in state st . Q∗ (st , at ) can when selecting actions with the -greedy policy. The input
be calculated by Bellman Equation (2) in a recursive fashion. features of the network are the environment state that is
defined in Section 2. The rectified linear unit (ReLU) is
Q∗ (st , at ) := E[rt+1 + γ max Q∗ (st+1 , at+1 )|st , at ] (2) used as the activation function for hidden layers, and the
at+1
linear layer is used for inferring action value at the output.
The state transition in buildings is stochastic, because the input layer hidden layer hidden layer output layer
zone temperature is affected by various disturbances, which (state) (ReLU) (ReLU) (Linear) loss function
cannot be accurately measured. In this work, we update the

value estimates by following the Q-learning [24] method, as 𝑡
... 𝑄(𝑠$%&, 𝑎&$%&)
shown in Equation (3).
𝑇9:+; * )
𝑄(𝑠$%&, 𝑎$%&
Qt+1 (st ,at ) := Qt (st , at ) 𝑠$%&
…
…
Squared error
𝑇:<$
…
+ η(rt+1 + γ max Qt (st+1 , at+1 ) − Qt (st , at )) (3)
…
at+1
+ )
L
𝑄=<+ 𝑄(𝑠$%&, 𝑎$%&
where η ∈ (0, 1] represents the learning rate of value es-
timates during the training process. Equation (3) should 𝑤, 𝑏 𝑤, 𝑏
converge to the optimal value Q∗ (st , at ) over time under the 𝑟$ + 𝛾max𝑄4 (𝑠$ , 𝑎$ )
23
MDP environment.
Building control sequence: Our DRL algorithm interacts Figure 3: Structure of the neural network utilized in our
with the building environment during operation, or with the DRL framework
EnergyPlus model via the BCVTB (a Ptolemy II platform We use the mean squared error between the target Q-
that enables co-simulation across different models [26]) in- value and the inferred output of neural network as loss func-
terface during offline training and validation. tion (6), where n denotes the number of possible control
As shown in Figure 2, we use a separate control step actions. Parameters (weights) in the neural network are
∆tc = k∆ts to represent the control frequency of DRL al- updated by the mini-batch gradient descent method w :=
gorithm. Every ∆tc time, as shown in Equation (4), the w − α∆w [2], where α is the learning rate and ∆w = ∂w ∂L
.
DRL algorithm will observe the building state and update n
the control action. Between two control time steps, the con- 1 X ∗
L= [Q (st , ait ) − Q(st , ait )]2 (6)
trol action used to operate the HVAC system remains the 2n i=1
same as the last updated action. While in Equation (5), the
building receives the control signal and enters its next state Being consistent with the Q-learning update process (3),
every ∆ts time, which represents the building simulation or the target value Q∗ (st , ait ) in the neural network can be esti-
sensor sampling frequency. mated by Equation (7) when using gradient descent, where
Q values are approximated by the neural network.
at = fDRL (st−∆ts ) (4)
Q∗ (st , at ) = rt+1 + γ max Q(st+1 , at+1 ) (7)
st = fEN V (st−∆ts , at−∆ts ) (5) at+1
Training data pre-processing: The input state vector st from the network Q during the learning process. In line 4,
consists of various types of features in the building. The the variable a stores the control action in the last step, and
range of value for each feature can vary significantly. To spre and scur represent the building state in the previous
facilitate the learning process, we scale the feature values to and current control time steps, respectively.
a similar range before feeding the input state to the neural
network. In this work, we scale the input state vector to Algorithm 1 DRL-based HVAC Control Algorithm
the range [0, 1] as shown in (8), where x represents a feature 1: Initialize memory M = [empty set]
in the input state. The minimum and maximum values for 2: Initialize neural network Q with parameters w
each feature can be estimated from historical observations. 3: Copy neural network Q and store as Q̂(·|ŵ)
x − min(x) 4: Initialize control action a, state spre and scur
x0 = (8)
max(x) − min(x) 5: for m := 1 to N do
For output units, the linear layer is used to infer Q-value esti- 6: Reset building environment to initial state
mates from hidden units. However, if we directly use reward 7: for ts := 0 to L do
function (1) to calculate the target Q-value as shown in (7), 8: if ts mod k == 0 then
it may result in a large variance in the target value. Dur- 9: scur ← current observation
ing backward propagation, the corresponding bias factor in 10: r = reward(spre , a, scur )
the last linear layer may dominate the derivative of the loss 11: M ← (spre , a, r, scur )
function, which will prevent weights in earlier layers from 12: Draw mini-batch (s, a, r, s0 ) ← M
learning the optimal value. In order to overcome this lim- 13: Target vectors v ← target(s)
itation, we calculate the target value by first shrinking the 14: Train Q(·|w) with s, v
original immediate reward with a factor ρ and then clipping 15: Every d∆tc steps, Q̂(·|ŵ) ← Q(·|w)
it if the target is smaller than −1, as shown in Equation (9). 16: ( i − ∆, min )
= max(
rt A ∈ A | i = random(n) probability
target val(st−1 , at−1 ) = max[ + γmaxQ(st , at ), −1] (9) 17: a= argmaxQ(scur , ã) otherwise
ρ at
ã
In this way, we squash the original target value with a large 18: spre ← scur
variance to the range [−1, 0]. The underlying principle is
19: end if
that while it does not help to know which control actions are
20: Execute action a in building environment
worse, we focus on which control actions are better.
21: end for
Training of the neural network: As shown in Figure 1,
22: end for
the one-step state transition process is represented by a tu-
ple (st−1 , at−1 , rt , st ), which includes previous state, previ-
ous action, immediate reward and current state. The target Learning process: Within each training episode, line 8
vector of the neural network can be calculated by Equa- determines whether the current time step ts is a control
tion (10), where the target value associated with at−1 , i.e., time step. As discussed in Section 2, the control step ∆tc
target val(st−1 , at−1 ), is calculated by Equation (9). For is k times of the simulation step ∆ts . If ts is a control time
other control actions, the target value is set to the current step, the algorithm will perform training and determine the
value estimate of that action. new control action (line 9 to 18). Otherwise, the building

target val(st−1 , a) if a = at−1 will maintain the current control action.
target(st−1 ) = (10) During the learning process, in line 9 we first observe the
Q(st−1 , a) otherwise
state at current control time step. Then, the immediate re-
Next, the target vector is compared with current inference ward is calculated by Equation (1). Next, in line 11 the state
output of the neural network to calculate the approximation transition tuple is stored in memory. Then, a mini-batch
error. Then, we use the RMSprop [8] method to update of transition tuples are drawn randomly from the memory.
parameters in the neural network. Lines 13 to 14 follow Equation (10) to calculate the target
vector and update weights in neural network Q by using the
3.2 DRL Algorithm Design RM Sprop Back-propagation method [8]. In line 15, the net-
Our DRL-based HVAC control algorithm is presented in work Q̂ will be updated with current weights in network Q
Algorithm 1. The outer loop controls the number of training in every d control time steps. Then, this Q̂ network is used
episodes, while the inner loop performs HVAC control at for inferring the target value for the next d control steps.
each simulation time step within one training episode. Next, from line 16 to 18 the network Q is utilized to de-
Initial setup: During the learning process, the recent tran- termine the next control action. The -greedy policy is used
sition tuples (st−1 , at−1 , rt , st ) are stored in the memory M , to select the optimal control action based on the output of
from which a mini-batch of samples will be generated for Q. The algorithm has a probability to explore the action
neural network training. At the beginning, we first initialize space by randomly selecting an available action; otherwise,
memory M as an empty set. Then, we initialize weights w it will choose the action with the maximum value estimate.
in the neural network similar as in [7]. As shown in Equa- After each training process, in line 16 the exploration rate
tion (7), updating neural network weights requires the target will gradually decrease until reaching at a lower bound min .
value, which also depends on weights in the neural network. In this way, the DRL algorithm is more likely to try differ-
To break this dependency loop between target value and ent control actions at the beginning. As the training process
weights w, in line 3, a separate neural network Q̂ is created proceeds, the DRL algorithm will have a higher chance to
for calculating the target value similar as in [12]. This net- follow the learned policy. Finally, in line 18 the current state
work Q̂ will be periodically updated by copying parameters is assigned to spre to prepare for the next training process.
3.3 Heuristic Adaption for Multiple Zones exceeds the cooling setpoint (i.e. 24◦ C in our experiment),
We present a heuristic mechanism that adapts our DRL the room will be cooled at the maximum air flow rate. If
algorithm for multi-zone HVAC control. As discussed in Sec- the temperature drops below certain threshold 2 , the air flow
tion 2, the action space has a cardinality of mz , which in- in the zone will be turned off. In our experiment, for both
creases exponentially with the number of zones in the build- baseline approaches and DRL algorithms, the conditioned
ing. Training a neural network with such a large number of air temperature from the HVAC system is set to 10◦ C.
outputs is inefficient or even infeasible in practice.
In our heuristic, instead of using a single neural network 4.2 Experiment Results
to approximate the Q-values of all control actions in the A. Effectiveness of DRL control algorithms in meet-
building, we separately train a neural network for each zone ing temperature requirements: We evaluate the perfor-
using Algorithm 1. Each neural network is responsible for mance of our DRL algorithms with three buildings modeled
approximating the Q-value in one zone. At each time step, in EnergyPlus, which have 1 zone, 4 zones and 5 zones, re-
all networks will receive the state of buildings, and then de- spectively. The HVAC system can provide multi-level air
termine the control action for each zone separately. After ex- flow rate for each zone. In this work, we test our DRL algo-
ecuting the control action, the temperature violation penalty rithms with two-level (i.e. on-off control) and five-level air
for each zone is calculated similarly as Equation (1). The flow control, where each level is evenly distributed between
electricity cost in each zone is calculated by Equation (11), the minimum and maximum air flow rate of the HVAC sys-
which is proportional to the air flow demand in each zone tem. Figure 4 shows the zone temperature in the 1-zone and
based on the total cost. 4-zone building in August, where our regular DRL algorithm
ui in Algorithm 1 performs on-off control to operate the HVAC
costi = cost · P (11) system. We can see that after training the DRL algorithm is
i ui
quite effective in maintaining the zone temperature within
where cost denotes the total electricity cost in the building the desired range.
and ui represents the air flow rate in each zone. Although 27.0
the total electricity cost is not exactly a linear function of the
Temperature ℃
24.0
air flow rate, we can still heuristically estimate the amount
of cost contributed by each zone by following Equation (11). 21.0
18.0
4. EXPERIMENTAL RESULTS Before training After training Comfort temperature
15.0
1 6 11 16 21 26 31
Days
4.1 Experiment Setup
(a) 1 zone
We demonstrate the effectiveness of our DRL-based al-
27.0
gorithms through simulations in EnergyPlus. We train the
Temperature ℃
DRL algorithms on weather profiles of summer days in two 24.0
areas, obtained from the National Solar Radiation Data 21.0

Base [13]. The weather data from Area 1 (Riverside) has 18.0
intensive solar radiation and large variance in temperature, Zone 1 Zone 2 Zone 3 Zone 4 Comfort temperature
15.0
while Area 2 (Los Angeles) has a milder weather profile. We 1 6 11 16 21 26 31
Days
use the practical time-of-use price from the Southern Califor-
nia Edison [20] to calculate buildings’ electricity cost. The (b) 4 zones
desired temperature range is between 19◦ C and 24◦ C based Figure 4: Effectiveness of our regular DRL algorithm in
on the ASHRAE standard [21]. There are 4 hidden layers in maintaining comfort temperature
the neural network. The network layout and other parame- Figure 5 shows the average Q value of our regular DRL al-
ters in our DRL algorithms are listed in Table 1. We train gorithm throughout the learning process. At the beginning,
our DRL algorithm using 100 episodes (months) of data. In the Q value is very small due to the large penalty caused by
practice, the training process can be facilitated by building frequent temperature violations. The Q value will gradually
accurate EnergyPlus models. increase as the DRL algorithm learns the effective strategy
Table 1: Parameter settings in DRL Algorithms to maintain the zone temperature within the desired range.
Eventually, the Q value will stabilize when the DRL algo-
∆ts 1 min ∆tc 15 min rithm learns the policy to avoid temperature violation and
k 15 d 48 ∗ 5 minimize the electricity cost.
mini-batch 48 memory size 48 ∗ 31
0.5 0.5
η 0.99 α 0.003 Q value Q value
0.0 0.0
ρ 1000 λ 100
-0.5 -0.5
min 0.1 N 100
-1.0 -1.0
T 19◦ C T 24◦ C
-1.5 -1.5 1 50 100 150 200
number of neurons 50, 100, 200, 400 1 50 100
Months
150 200
Months
We evaluate the performance of our DRL algorithms by (a) 1 zone (b) 4 zones
comparing them with a rule-based HVAC control strategy Figure 5: Q value in 1-zone and 4-zone buildings
(similarly as the one in [22]) and the conventional RL method.
In the rule-based approach, the HVAC system is operated by 2
We find out that setting it to 20◦ C helps the rule-based
an on-off control strategy such that if the zone temperature approach minimize temperature violation rate.
As discussed in Section 3.3, the action space will expo- Acknowledgments. The authors gratefully acknowledge
nentially increase with the number of zones. For the 5-zone the support from the National Science Foundation award
building, the total number of actions is more than 3000 with CCF-1553757 and the Riverside Public Utilities.
5-level air flow rate control, which would be intractable for
our regular DRL algorithm. Therefore, we leverage the effi- 6. REFERENCES
cient heuristic method in Section 3.3 to perform multi-level
[1] E. Barrett and S. Linder. Autonomous HVAC Control, A
control in multi-zone buildings. Figure 6 compares the av- Reinforcement Learning Approach. Springer, 2015.
erage frequency of temperature violations of the baseline [2] L. Bottou. Large-scale machine learning with stochastic
strategy, conventional Q learning, our regular DRL algo- gradient descent. Proceedings of COMPSTAT. 2010.
rithm with on-off control, and the heuristic DRL algorithm [3] G. T. Costanzo and et al. Experimental analysis of
with 5-level control. We can see that the DRL algorithms data-driven control for a building heating system. CoRR,
are able to keep the percentage of temperature violations at abs/1507.03638, 2015.
a low level. [4] EnergyPlus. https://energyplus.net/.
[5] D. Ernst and et al. Tree-based batch mode reinforcement
Baseline Q learning DRL regular on-off DRL heuristic 5-level learning. Journal of Machine Learning Research, 2005.
6.0%
[6] P. Fazenda and et al. Using reinforcement learning to
4.0% optimize occupant comfort and energy usage in hvac
2.0% systems. Journal of Ambient Intelligence and Smart
Environments, pages 675–690, 2014.
0.0%
1-zone Area 1 1-zone Area 2 4-zone Area 1 4-zone Area 2 5-zone Area 1 5-zone Area 2
[7] K. He and et al. Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. IEEE
Figure 6: Comparison of temperature violation rate between International Conference on Computer Vision, 2015.
our DRL algorithm, baseline approach and Q learning [8] G. Hinton, N. Srivastava, and K. Swersky. Lecture 6a
overview of mini–batch gradient descent. http://www.cs.
toronto.edu/˜tijmen/csc321/slides/lecture slides lec6.pdf.
B. Effectiveness of DRL algorithms in energy cost
[9] B. Li and L. Xia. A multi-grid reinforcement learning
reduction: Figure 7 shows the comparison of average daily method for energy conservation and comfort of HVAC in
electricity cost of our DRL control algorithms, conventional buildings. pages 444–449, 2015.
Q learning and the baseline approach. The percentage of [10] Y. Ma and et al. Model predictive control for the operation
cost reduction achieved by DRL algorithms compared with of building cooling systems. IEEE Transactions on Control
the baseline approach is marked in the figure. Systems Technology, 20(3):796–803, 2012.
[11] M. Maasoumy and et al. Model-based hierarchical optimal
$
Baseline Q learning DRL regular on-off DRL heuristic 5-level control design for HVAC systems. DSCC, 2011.
400
10.6% [12] V. Mnih and et al. Human-level control through deep
300 16.2% 19.1%
23.3%
4.9% 18.9%
reinforcement learning. Nature 518.7540, 2015.
200 22.2% 35.1% [13] National Solar Radiation Data Base. http://rredc.nrel.gov.
56.8% 25.9%
100 71.2% 52.9% [14] D. Nikovski, J. Xu, and M. Nonaka. A method for
0 computing optimal set-point schedules for HVAC systems.
1-zone Area 1 1-zone Area 2 4-zone Area 1 4-zone Area 2 5-zone Area 1 5-zone Area 2
REHVA World Congress CLIMA, 2013.
Figure 7: Comparison of energy cost between our DRL al- [15] F. Oldewurtel and et al. Energy efficient building climate
gorithms, baseline approach and Q learning control using stochastic model predictive control and
weather predictions. ACC, 2010.
We can see that our regular DRL algorithm can achieve [16] S. J. Olivieri and et al. Evaluation of commercial building
significant energy cost reduction compared with the base- demand response potential using optimal short-term
line approach and conventional Q learning. The efficient curtailment of heating, ventilation, and air-conditioning
loads. Journal of Building Performance Simulation, 2014.
heuristic DRL can leverage multi-level control in multi-zone
[17] D. Ormoneit and Ś. Sen. Kernel-based reinforcement
buildings to achieve further reduction. Furthermore, the learning. Machine Learning, 49(2):161–178, 2002.
DRL algorithms are more effective in reducing energy cost [18] M. Riedmiller. Neural Fitted Q Iteration – First
for Area 2, since the learning process is more effective with a Experiences with a Data Efficient Neural Reinforcement
milder weather profile. Compared with the 5-zone building, Learning Method. Springer, 2005.
the DRL algorithms achieve more reduction for 1-zone and [19] D. Silver and et al. Mastering the game of go with deep
4-zone buildings. That is likely because the 5-zone building neural networks and tree search. Nature, 529(7587), 2016.
is more sensitive to outside disturbances and hence more [20] SCE. https://www.sce.com/NR/sc3/tm2/pdf/CE281.pdf.
challenging for the learning process. [21] A. Standard. Standard 55-2004-thermal environmental
conditions for human occupancy. ASHRAE Inc., 2004.
[22] D. Urieli and P. Stone. A learning agent for heat-pump
5. CONCLUSIONS thermostat control. AAMAS, 2013.
This paper presents a deep reinforcement learning based [23] U.S. DoE. Buildings energy data book.
data-driven approach to control building HVAC systems. [24] C. J. Watkins and P. Dayan. Q-learning. Machine learning,
8(3-4):279–292, 1992.
A co-simulation framework based on EnergyPlus is devel-
[25] T. Wei, Q. Zhu, and M. Maasoumy. Co-scheduling of
oped for offline training and validation of the DRL-based ap- HVAC control, EV charging and battery usage for building
proach. Experiments with detailed EnergyPlus models and energy efficiency. ICCAD, 2014.
real weather and pricing data demonstrate that the DRL- [26] M. Wetter. Co-simulation of building energy and control
based algorithms (including the regular DRL algorithm and systems with the building controls virtual test bed. Journal
a heuristic adaptation for efficient multi-zone control) are of Building Performance Simulation, 2011.
able to significantly reduce energy cost while maintaining [27] L. Yang and et al. Reinforcement learning for optimal
the room temperature within desired range. control of low exergy buildings. Applied Energy, 2015.

Deep Reinforcement Learning For HVAC Control in Smart Buildings

Uploaded by

Copyright:

Available Formats

Deep Reinforcement Learning For HVAC Control in Smart Buildings

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Deep Reinforcement Learning For HVAC Control in Smart Buildings

Uploaded by

Copyright:

Available Formats

Deep Reinforcement Learning for Building HVAC Control

Tianshu Wei Yanzhi Wang Qi Zhu

Current state 𝑠% ...

Action value error

Experience replay 𝑄" Network Add

sequence of actions {a1 , a2 , . . . , at }, where at ∈ A. After

cannot be accurately measured. In this work, we update the

DRL algorithms on weather profiles of summer days in two 24.0

areas, obtained from the National Solar Radiation Data 21.0

You might also like