AWD3: Dynamic Reduction of the Estimation Bias
Dogan C. Cicek* , Enes Duran* , Baturay Saglam, Kagan Kaya, Furkan Mutlu, Suleyman S. Kozat†
Electrical and Electronics Engineering Department, Bilkent University, Ankara, Turkey
{cicek; enesd; baturay; burak.mutlu; kozat}@ee.bilkent.edu.tr; kagan.kaya@ug.bilkent.edu.tr
*Equal contribution, †IEEE Senior Member
Abstract—Value-based deep Reinforcement Learning (RL) algorithms suffer from the estimation bias primarily caused by
function approximation and temporal difference (TD) learning.
This problem induces faulty state-action value estimates and
therefore harms the performance and robustness of the learning
algorithms. Although several techniques were proposed to tackle,
learning algorithms still suffer from this bias. Here, we introduce
a technique that eliminates the estimation bias in off-policy
continuous control algorithms using the experience replay mechanism. We adaptively learn the weighting hyper-parameter beta in
the Weighted Twin Delayed Deep Deterministic Policy Gradient
algorithm. Our method is named Adaptive-WD3 (AWD3). We
show through continuous control environments of OpenAI gym
that our algorithm matches or outperforms the state-of-the-art
off-policy policy gradient learning algorithms.
Index Terms—deep RL, estimation bias, deterministic policy
gradient, TD3 algorithm, WD3 algorithm
I. INTRODUCTION
Reinforcement learning (RL) studies how an agent interacts
with its environment to learn a good policy via optimizing
the cumulative sum of delayed rewards. This field gained
considerable attention recently due to exciting developments
such as controlling continuous systems [1] and achieving
superhuman level performance in Atari games [2]. Despite
these promising developments, deep RL agents suffer from
some issues precluding agents from performing a broad range
of tasks [3]. One of the well-known problems is the estimation
bias in value-based deep RL algorithms [4], [5].
Estimation bias (overestimation and underestimation) is the systematic tendency to predict values that diverge from the true values. Using neural networks as function approximators introduces unavoidable noise due to the imprecision of the estimator. Temporal difference (TD) learning, which uses subsequent value estimates to update the current estimate, further amplifies this bias [6]. Recent works show that Q-learning, a TD learning method, is susceptible to overestimation originating from the maximization operation over noisy value estimates [3], [5].
The estimation bias in discrete action spaces has been widely studied [7], [8]. Hasselt et al. reveal that using a single value estimator causes overestimation. They propose the Double DQN algorithm, which introduces an additional function estimator, the target network [2], and proves more effective than the vanilla version. Similarly, the DDPG algorithm uses a target network to mitigate overestimation in continuous control tasks. Despite its effectiveness in some tasks, the Q-value estimator network (the critic) is still susceptible to overestimation, especially in large state spaces [1].
On the other hand, Fujimoto et al. introduced the TD3 algorithm, which adds a second critic network and takes the minimum of the pair to estimate the action-value function [4]. Although this minimization handles the overestimation effectively, it introduces underestimation in some cases. The agent's pessimism about state values often harms its performance. In response, He et al. proposed a weighted average of two terms: the minimum and the mean of the two approximators. This proposition (the WD3 algorithm) improved
the precision of action-value estimators and contributed to the
learning process [9]. Unfortunately, it introduces a weighting hyper-parameter β that is challenging to tune. Moreover, the proposed range (0, 1) of the weighting hyper-parameter is insufficient: in some cases, even taking the minimum or the average of the two value approximators leaves a residual estimation bias. Additionally, a fixed weighting value does not suit non-stationary environments, because it implicitly assumes that the same weighting is valid throughout the whole learning process. Hence, a further study of the estimation bias in continuous action spaces is needed.
Here, we focus on the problem of the estimation bias in the
continuous action spaces. We present a technique to update
the weighting hyper-parameter β in the WD3 algorithm. Our
main motivation is to dynamically combine two opposite sides
to balance the estimation bias. In this way, the need for
tuning the weighting hyper-parameter is eliminated. We name our algorithm Adaptive Weighted Delayed Deep Deterministic Policy Gradient (AWD3). Our major contributions are:
• We empirically demonstrate the estimation bias in the
former algorithms. We show the necessity to further
expand the range of the weighting hyper-parameter β.
• We introduce a mechanism to update β. Through simulations on OpenAI gym environments, we show that our approach performs better than other state-of-the-art policy gradient algorithms. In these simulations, our approach estimates the state-action values more accurately than the predecessor algorithms.
II. BACKGROUND
We consider the standard RL paradigm where an agent
interacts with an environment in discrete time-steps to learn
reward-optimal behavior [10]. The standard RL paradigm
is formalized as a Markov Decision Process (MDP). At each discrete time-step t, given a state s ∈ S, the agent selects an action a ∈ A(s) according to its policy π and receives a reward signal r and the new state s′ from the environment. The return is the discounted sum of rewards

R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i),

where γ is the discount factor denoting the importance given to future rewards.
the future rewards. Next subsections introduce related policy
gradient RL algorithms.
A. Deep Deterministic Policy Gradient (DDPG)
The DDPG algorithm combines the Deterministic Policy Gradient algorithm with function approximation [11]. DDPG agents learn a Q-function together with a deterministic policy over a continuous action space. The algorithm contains four neural networks, the actor and the critic along with their target networks. The critic
network is used to learn the Q function. The actor network
is responsible for outputting a deterministic action given the
state (π : S → A). The actor network is updated through the
deterministic policy gradient algorithm:
∇_φ J(φ) = E_{s∼p_π} [ ∇_a Q(s, a; θ)|_{a=π(s;φ)} ∇_φ π(s; φ) ].   (1)
The motivation here is to update the parameters of the actor
network to maximize the critic network output. Therefore, this
algorithm renders greedy policies; it is analogous to the maximization operator in discrete action spaces. The actor and critic target networks are both updated softly with a small τ value,

θ′ ← τθ + (1 − τ)θ′,   (2)

which increases the stability of the agents via slower updates.
Despite the use of target networks, the DDPG algorithm overestimates the Q-values, especially in large state spaces [11].
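To make the update rules concrete, a minimal PyTorch sketch of the actor update (1) and the soft target update (2) is given below. This is an illustrative sketch, not the authors' code; the actor, critic, their targets, and the optimizer are assumed to be standard torch.nn modules and optimizers.

```python
import torch

def ddpg_update_step(actor, critic, actor_target, critic_target,
                     actor_optimizer, states, tau=0.005):
    # Deterministic policy gradient (1): ascend the critic's value of the
    # actor's action, i.e. minimize the negative mean Q-value.
    actor_loss = -critic(states, actor(states)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

    # Soft target update (2): theta' <- tau * theta + (1 - tau) * theta'.
    with torch.no_grad():
        for net, target in ((actor, actor_target), (critic, critic_target)):
            for p, p_t in zip(net.parameters(), target.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)
```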
B. Twin Delayed Deep Deterministic Policy Gradient (TD3)
One of the successors to DDPG is the TD3 algorithm, which regularizes the Q-network update to eliminate overestimation [4]. The algorithm utilizes two independently initialized critics, the minimum of which is taken to compute the target for the Q-network update. Aside from the additional critic, the algorithm also delays the update of the actor to prevent overestimation. The label for the critic update is
y = r + γ min_{i=1,2} Q_{θ′_i}(s′, π(s′; φ′)).   (3)
The periodic update of the actor and the utilization of two critics along with the minimum operator decouple the actor and critic updates to eliminate the overestimation error. Fujimoto et al. show that TD3 significantly outperforms the DDPG algorithm by correcting the estimation error [4]. Although this method eliminates the detrimental overestimation in the DDPG algorithm [11], the issue with the TD3 algorithm is that the minimum of two critics introduces an underestimation bias (6), [9].
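A minimal sketch of the clipped double-Q target in (3), including target policy smoothing, is given below. This is our illustrative reading, not the authors' implementation; the target networks are assumed PyTorch modules and not_dones masks terminal transitions.

```python
import torch

@torch.no_grad()
def td3_target(rewards, next_states, not_dones, actor_target,
               critic1_target, critic2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    next_actions = actor_target(next_states)
    noise = (torch.randn_like(next_actions) * noise_std).clamp(-noise_clip, noise_clip)
    next_actions = next_actions + noise
    # Clipped double-Q target (3): take the minimum of the two target critics.
    q1 = critic1_target(next_states, next_actions)
    q2 = critic2_target(next_states, next_actions)
    return rewards + gamma * not_dones * torch.min(q1, q2)
```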
C. Weighted Delayed Deep Deterministic Policy Gradient
(WD3)
The WD3 algorithm underscores that taking the minimum of
two critics leads to underestimation. In response, a weighted
average of the minimum and the average value of the two
target critic networks is used to update critic networks:
Q′ ← β min_{i=1,2} Q_{θ′_i}(s′, ã) + ((1 − β)/2) Σ_{i=1,2} Q_{θ′_i}(s′, ã),   β ∈ (0, 1).   (4)
They introduce a new hyper-parameter β for the weighted averaging. While the average of the two critics overestimates the true action-value function, the minimum of the two underestimates it. The motivation is to obtain an accurate estimate of the state-action values by combining these two extremes.
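Analogously to the TD3 sketch above, the weighted target in (4) can be sketched as follows; β = 0.75 is only a placeholder value, and next_actions is assumed to be the (already smoothed) target action.

```python
import torch

@torch.no_grad()
def wd3_target(rewards, next_states, not_dones, next_actions,
               critic1_target, critic2_target, beta=0.75, gamma=0.99):
    q1 = critic1_target(next_states, next_actions)
    q2 = critic2_target(next_states, next_actions)
    # Weighted combination (4): beta * min of the critics + (1 - beta) * their average.
    q_combined = beta * torch.min(q1, q2) + (1.0 - beta) * 0.5 * (q1 + q2)
    return rewards + gamma * not_dones * q_combined
```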
III. ESTIMATION BIAS PHENOMENON
The estimation bias and the effect of using the minimum operator in continuous action spaces have not been studied extensively. This section focuses on the theoretical and empirical aspects of the estimation bias in continuous state-action spaces.
A. Overestimation
Overestimation indicates the cases when the approximated
value exceeds the true value. One reason is the maximum
operator in Q-learning [3]. Selecting greedy or near-greedy actions in Q-learning leads to chronic overestimation. Similarly, the deterministic policy gradient update (1) causes overestimation in continuous state-action spaces [4].
B. Underestimation
Underestimation happens when the approximated values
are below the true values. In response to the overestimation in
the DDPG algorithm, the TD3 algorithm takes the minimum
of the estimated values of the two critics [4]. Let Q*(s, a) denote the true Q-value and assume the estimation errors of the two critics are correlated Gaussian random variables:

Q_{θ_i}(s, a) − Q*(s, a) = D_i ∼ N(μ_i, σ_i²),   i = 1, 2.   (5)
Then the expectation of the minimum of the estimation errors becomes

E[min(D_1, D_2)] = μ_1 Φ((μ_2 − μ_1)/σ) + μ_2 Φ((μ_1 − μ_2)/σ) − σ φ((μ_1 − μ_2)/σ),   (6)

where σ = √(σ_1² + σ_2² − 2ρσ_1σ_2). The terms φ(·), Φ(·) signify
the PDF and CDF of the standard normal distribution. He et al. treat the estimation biases of the two critics as independent random variables, which is inadequate because the networks are updated with shared transition data [9]. Therefore, we include the term ρ denoting the correlation coefficient. Equation (6) reveals the underestimation bias that arises when the minimum of the two estimates is taken. The next section provides empirical results.
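As a quick numerical sanity check (not from the paper, with arbitrary example values for μ_i, σ_i, and ρ), (6) can be verified by Monte Carlo sampling:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu1, mu2, s1, s2, rho = 0.5, 0.3, 1.0, 0.8, 0.6  # arbitrary example values

# Draw correlated errors (D1, D2) and estimate E[min(D1, D2)] empirically.
cov = [[s1 ** 2, rho * s1 * s2], [rho * s1 * s2, s2 ** 2]]
d = rng.multivariate_normal([mu1, mu2], cov, size=1_000_000)
empirical = d.min(axis=1).mean()

# Closed-form value from (6).
sigma = np.sqrt(s1 ** 2 + s2 ** 2 - 2 * rho * s1 * s2)
analytical = (mu1 * norm.cdf((mu2 - mu1) / sigma)
              + mu2 * norm.cdf((mu1 - mu2) / sigma)
              - sigma * norm.pdf((mu1 - mu2) / sigma))
print(empirical, analytical)  # the two numbers agree up to Monte Carlo error
```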
C. Empirical Demonstration of Estimation Bias
We observe the estimation bias in the TD3 and WD3
algorithms through the MuJoCo environments [12]. Fig. 1 shows the state-action value estimations along with the true values for agents trained on the Walker2d-v2 environment. The TD3 agent underestimates at the beginning but starts to overestimate towards the end. Although the WD3 agent accurately estimates the value function at the beginning, it outputs overoptimistic values towards the end. Our simulations verify the theoretical derivations in III-A and III-B. We also observe cases where the agents suffer from overestimation at the beginning of training, depending on the Xavier initialization of the networks and the signs of the reward signals [9].
The following section introduces our method.
Fig. 1: Measuring the estimation bias in the TD3 & WD3 algorithms in the Walker2d-v2 environment, averaged over 5 random seeds.
IV. ADAPTIVE WEIGHTED DELAYED DEEP DETERMINISTIC POLICY GRADIENT (ADAPTIVE WD3)
To eliminate the estimation bias, we propose a new approach that dynamically adjusts the weighting hyper-parameter responsible for balancing overestimation and underestimation. The following subsections provide the key points.
A. Motivation
Here are some advantages of a dynamic hyper-parameter β:
• The hyper-parameter β is a tool for balancing the estimation bias by combining two sides. Hence, its value is crucial for the Q-value estimation and therefore also vital for the learning process. Unfortunately, the optimal value for β is task-specific, meaning that we may need multiple runs for its optimization. In the WD3 paper, there are six runs for every environment for different β values [9]. If a more precise value is needed, additional training is required. Especially in environments where data collection is expensive or slow, tuning β becomes unappealing. The need for a hyper-parameter-agnostic version of the WD3 algorithm becomes obvious in those cases.
• In non-stationary environments where the state transition probabilities change, updating β may speed up the adaptation to the changing true value function.
B. Weighted target update
As explained in the previous sections, the DDPG and TD3 algorithms suffer from the opposite problems of overestimation and underestimation. He et al. propose to combine these two effects to achieve a balance through the weighted average of the pair of target critics shown in (4). If β = 1 throughout the training, the algorithm reduces to TD3. Both terms of (4) introduce estimation bias. The expectation of the first term is given by (6), and the expected value of the second term is E[D_1 + D_2] = μ_1 + μ_2 [13]. We aim to dynamically adjust the hyper-parameter β to zero out the expected value of the estimation bias:

E[D] = β E[min(D_1, D_2)] + ((1 − β)/2) E[D_1 + D_2] = 0.   (7)
Algorithm 1 Adaptive WD3 algorithm
Initialize networks φ, θ_i and targets θ′_i ← θ_i, φ′ ← φ for i = 1, 2
Initialize B, d, σ, σ̃, η, c, N, T, β, μ, s, t = 0
while t < T do
  Select a = π(s; φ) + ε, ε ∼ N(0, σ²), and receive r, s′
  Store transition tuple (s, a, r, s′) in B
  Sample a mini-batch of N transitions (s, a, r, s′) from B
  ã ← π(s′; φ′) + ε′, ε′ ∼ clip(N(0, σ̃²), −c, c)
  y ← r + γ [ β min_{i=1,2} Q_{θ′_i}(s′, ã) + ((1 − β)/2) Σ_{i=1,2} Q_{θ′_i}(s′, ã) ]
  Update critics θ_i ← argmin_{θ_i} N⁻¹ Σ (y − Q_{θ_i}(s, a))²
  if t mod d then
    Update φ by the deterministic policy gradient:
      ∇_φ J(φ) = N⁻¹ Σ ∇_a Q_{θ_1}(s, a)|_{a=π(s;φ)} ∇_φ π(s; φ)
    Update target networks:
      θ′_i ← ηθ_i + (1 − η)θ′_i, φ′ ← ηφ + (1 − η)φ′
  end if
  if s′ is terminal then
    Sample the last terminal transition (s, a, r, s′)
    ỹ ← β min_{i=1,2} Q_{θ′_i}(s, a) + ((1 − β)/2) Σ_{i=1,2} Q_{θ′_i}(s, a)
    β ← β − μ (r − ỹ)
  end if
  t ← t + 1, s ← s′
end while
Since both critics are trained on the same data and labels, their bias values are approximately the same, μ_1 ≈ μ_2. This simplifies (6) to E[min(D_1, D_2)] = (μ_1 + μ_2)/2 − σ/√(2π), so (7) becomes

β ((μ_1 + μ_2)/2 − σ/√(2π)) + (1 − β)(μ_1 + μ_2)/2 = 0,   β_optimal = (√(2π)/σ) · (μ_1 + μ_2)/2.   (8)
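For illustration (the bias and noise values below are arbitrary, not measured quantities from our experiments), the following sketch evaluates β_optimal from (8); the second case already falls outside the (0, 1) range proposed for WD3:

```python
import numpy as np

def beta_optimal(mu1, mu2, sigma):
    # beta that zeroes the expected bias according to (8), assuming mu1 ~ mu2.
    return np.sqrt(2.0 * np.pi) / sigma * (mu1 + mu2) / 2.0

print(beta_optimal(0.4, 0.4, 2.0))  # ~0.50, inside the WD3 range (0, 1)
print(beta_optimal(0.8, 0.8, 1.5))  # ~1.34, outside the WD3 range (0, 1)
```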
Note that the value of β_optimal may exceed 1 depending on the values of σ, μ_1, and μ_2. The actor network is updated via the gradient of the Q-value estimated by the first critic (1). Details are given in Algorithm 1; the next subsection explains our mechanism to update the weighting hyper-parameter β.
C. Updating beta
Having a dynamic β enables us to eliminate the estimation bias without giving precedence to either extreme. The critic networks are updated through the weighted outputs of the target critic networks (4), which ideally would approximate the true Q-function perfectly for all state-action pairs. However, function approximators suffer from extrapolation error [14]. Therefore, in practice, our aim is to estimate the Q-values within a negligible error interval for the important pairs.
In general, the terminal states play a vital role in training RL agents. The reward signal given at termination typically has a larger magnitude than that of other transitions. Moreover, success or failure is often defined on the terminal states. For these reasons, we update the value of β to eliminate the estimation bias at the terminal states.
Fig. 2: Learning curves for the OpenAI gym continuous control tasks: (a) Ant-v2, (b) Hopper-v2, (c) Walker2d-v2, (d) BipedalWalker-v3, (e) Humanoid-v2, (f) InvertedDoublePendulum-v2, (g) LunarLanderContinuous-v2. The shaded region represents half a standard deviation of the average evaluation over 5 trials. Curves are smoothed.
TABLE I: Max Average Return over 5 trials of 1M time-steps. BWalker-v3, Hum-v2, InvDouble-v2 and Lunar-v2 stand for BipedalWalker-v3, Humanoid-v2, InvertedDoublePendulum-v2 and LunarLanderContinuous-v2, respectively.

Algs. | Ant-v2         | Hopper-v2      | Walker2d-v2    | BWalker-v3  | Hum-v2          | InvDouble-v2 | Lunar-v2
TD3   | 4926.3 ± 859.3 | 3363.6 ± 154.3 | 3640.3 ± 738.8 | 308.0 ± 8.4 | 5286.0 ± 66.0   | 9359.7 ± 0.2 | 288.7 ± 1.7
WD3   | 4157.1 ± 762.0 | 3328.6 ± 188.5 | 4217.3 ± 685.0 | 308.7 ± 3.0 | 5218.7 ± 62.7   | 9359.8 ± 0.1 | 290.2 ± 4.3
AWD3  | 4948.0 ± 766.1 | 3372.6 ± 114.6 | 4390.3 ± 484.9 | 309.5 ± 4.2 | 5227.6 ± 102.4  | 9359.8 ± 0.1 | 293.6 ± 3.4
Recent work shows that terminations caused by exceeding the time limit should not be considered true terminations and must be handled carefully [15]. Thus, we exclude the terminal states caused by time limits when adjusting the value of β. In addition, we show the necessity of expanding the range of β both theoretically (7) and empirically (Fig. 1). Therefore, we expand the range of β. The update rule is

β ← β − μ (r − ỹ),   β ∈ [0, 2.5],   (9)

where r is the terminal transition reward, ỹ is the estimated value of the terminal state-action pair, and μ is the learning rate. Since the value of r is unbiased, β is guaranteed to converge if the function approximators converge. The next section gives the experimental details.
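Before moving to the experiments, a minimal sketch of one application of (9) is given below; the clamping to the stated range [0, 2.5] reflects our reading of the rule and is an assumption rather than a transcription of the authors' code.

```python
def update_beta(beta, terminal_reward, y_tilde, mu=1e-4, beta_min=0.0, beta_max=2.5):
    """One application of (9) after a (non-time-limit) episode termination.

    If the weighted estimate y_tilde exceeds the terminal reward (overestimation),
    beta grows and more weight is put on the pessimistic minimum term; otherwise
    beta shrinks.
    """
    beta = beta - mu * (terminal_reward - y_tilde)
    return min(max(beta, beta_min), beta_max)
```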
V. EXPERIMENTS
We evaluate AWD3 and compare its performance with the
TD3 and WD3 algorithms via the OpenAI Gym and MuJoCo
environments [12], [16]. For a fair comparison, we mainly
select the common environments mentioned in [4], [9].
A. Implementation Details
Considering reproducibility concerns [17], we explicitly share our implementation details. For a fair comparison, we select the same hyper-parameters as the TD3 and WD3 algorithms. We do not manipulate the data coming from the environments and feed it directly to the networks. We use the default reward setting of each environment. We train agents for 1 million time-steps in each setting. Every 5000 time-steps, we stop training and test the agent for 10 episodes. In test mode, agents do not apply exploration noise to the actions, and the transitions experienced in test mode are not stored in the replay buffer. Dependence on the initial parameters is eliminated via randomly sampled actions in the initial 25000 time-steps; in this exploration phase, the networks are not updated. We treat terminations caused by exceeding the time limit as normal (non-terminal) transitions in the updates. This procedure stays the same for all 5 seeds.
The AWD3 algorithm uses the same hyper-parameters
across all environments and seeds. Both the actor and the two critics have two fully connected feed-forward layers, each with 256 neurons. The networks use the ReLU activation function, except for the tanh non-linearity in the last layer of the actor network. Adam [18] is the optimizer for all networks. Transitions are uniformly sampled from the replay buffer with a batch size of 100, and the learning rate is 3e-4 for both the actor and the two critic networks. The update frequency for the actor network and the soft updates is d = 2, and we set the soft-update hyper-parameter τ = 5e-3. The critic networks are updated at each time step. The exploration noise has a Gaussian distribution ε ∼ N(0, 0.1). After the noise addition, the actions are clipped to lie in the action space of the environment. During the update of the critics, a Gaussian noise ε ∼ N(0, 0.2), clipped to [−0.5, 0.5], is added to the output of the target actor.
We also introduce new hyper-parameters for the β update mechanism. The learning rate for the β update is μ = 1e-4. The initial β for the environments studied in [9] is taken from [9]; for the other environments, β is initialized as (β_max + β_min)/2. The update of β starts after 100000 time-steps to eliminate the effect of the Xavier initialization and takes place after each episode termination, excluding time-limit-induced terminations.
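For convenience, the hyper-parameters listed above can be collected in a single configuration; this is a summary sketch for the reader, not the authors' configuration file, and the key names are our own.

```python
# Hyper-parameters as stated in the text, gathered into one place.
awd3_hyperparameters = {
    "total_timesteps": 1_000_000,
    "warmup_random_steps": 25_000,    # random actions, no network updates
    "eval_interval_steps": 5_000,
    "eval_episodes": 10,
    "hidden_layers": (256, 256),
    "batch_size": 100,
    "learning_rate": 3e-4,            # actor and both critics, with Adam
    "policy_delay": 2,                # d
    "tau": 5e-3,
    "exploration_noise_std": 0.1,
    "target_noise_std": 0.2,
    "target_noise_clip": 0.5,
    "beta_learning_rate": 1e-4,       # mu
    "beta_range": (0.0, 2.5),
    "beta_update_start_step": 100_000,
}
```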
For a fair comparison, we implement the TD3 and WD3 algorithms following their respective papers, without any additional engineering tricks. Fig. 2 shows the learning curves and Table I reports the maximum average returns. These results show that AWD3 outperforms or matches the other algorithms and is more robust to catastrophic forgetting.
Fig. 3: Empirical demonstration of the estimation bias on continuous control tasks over 5 random seeds: (a) LunarLanderContinuous TD3, (b) LunarLanderContinuous WD3, (c) LunarLanderContinuous AWD3, (d) Walker2d AWD3. The red line indicates the true state-action values, whereas the blue line shows the estimations.

B. Q-value estimation

For a comprehensive comparison, we estimate the Q-values for all three algorithms over 5 seeds. We randomly select 1000 state-action pairs from the replay buffer, feed each state-action pair to both critics, and compute the estimate according to the respective formula of each algorithm. For the true Q-values, we simulate transitions to the terminal states by the Monte Carlo method every 50000 time-steps. Fig. 3 shows the estimated Q-values along with the ground truths. Estimating the state-action values by taking the minimum of the two estimators underestimates the ground truth. Our proposal best approximates the Q-values, and this advantage improves its performance.
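A rough sketch of how such a ground-truth estimate can be obtained is shown below. This is our reading of the procedure; restore_state is a hypothetical helper standing in for whatever mechanism the simulator offers to reset to a stored state, and policy is the current deterministic actor.

```python
def monte_carlo_q(env, policy, state, action, gamma=0.99, max_steps=1000):
    """Estimate the ground-truth Q(state, action) by rolling out the current
    policy until termination and accumulating the discounted rewards."""
    env.restore_state(state)          # hypothetical helper, not a standard gym call
    obs, reward, done, _ = env.step(action)
    total, discount = reward, gamma
    for _ in range(max_steps):
        if done:
            break
        obs, reward, done, _ = env.step(policy(obs))
        total += discount * reward
        discount *= gamma
    return total
```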
VI. CONCLUSION
The estimation bias in continuous action spaces is one of
the pivotal setbacks for better performance in value-based RL
algorithms. This problem is not completely addressed by the
current state-of-the-art deterministic policy gradient algorithms
(DDPG, TD3, WD3). Here, we first show the susceptibility
of these algorithms to the estimation bias. Then, for active
elimination of the estimation bias, we introduce a mechanism
to update β in (4). Through simulations, we verify that our algorithm estimates the state-action values more precisely than the other algorithms.
We also show that the range of the weighting hyper-parameter
β in (4) may not be sufficient to zero out the expectation of the
estimation bias. We verify this derivation with empirical results
and show the necessity to expand the range of the weighting
hyper-parameter.
REFERENCES
[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
et al., “Human-level control through deep reinforcement learning,”
Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[2] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
with double q-learning,” 2015.
[3] S. Thrun and A. Schwartz, “Issues in using function approximation
for reinforcement learning,” in Proceedings of the Fourth Connectionist
Models Summer School, pp. 255–263, Hillsdale, NJ, 1993.
[4] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” 2018.
[5] H. van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and
J. Modayil, “Deep reinforcement learning and the deadly triad,” CoRR,
vol. abs/1812.02648, 2018.
[6] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[7] Q. Lan, Y. Pan, A. Fyshe, and M. White, “Maxmin q-learning: Controlling the estimation bias of q-learning,” 2020.
[8] O. Anschel, N. Baram, and N. Shimkin, “Averaged-dqn: Variance
reduction and stabilization for deep reinforcement learning,” 2017.
[9] Q. He and X. Hou, “WD3: taming the estimation bias in deep reinforcement learning,” in 32nd IEEE International Conference on Tools
with Artificial Intelligence, ICTAI 2020, Baltimore, MD, USA, November
9-11, 2020, pp. 391–398, IEEE, 2020.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.
The MIT Press, second ed., 2018.
[11] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforcement
learning,” 2019.
[12] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for modelbased control,” 2012 IEEE/RSJ International Conference on Intelligent
Robots and Systems, 2012.
[13] S. Nadarajah and S. Kotz, “Exact distribution of the max/min of two
gaussian random variables,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 16, no. 2, pp. 210–212, 2008.
[14] S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement
learning without exploration,” 2019.
[15] F. Pardo, A. Tavakoli, V. Levdik, and P. Kormushev, “Time limits in
reinforcement learning,” 2018.
[16] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman,
J. Tang, and W. Zaremba, “Openai gym,” 2016.
[17] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger,
“Deep reinforcement learning that matters,” 2019.
[18] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
2017.