AWD3: Dynamic Reduction of the Estimation Bias

Dogan C. Cicek*, Enes Duran*, Baturay Saglam, Kagan Kaya, Furkan Mutlu, Suleyman S. Kozat†
Electrical and Electronics Engineering Department, Bilkent University, Ankara, Turkey
{cicek; enesd; baturay; burak.mutlu; kozat}@ee.bilkent.edu.tr; kagan.kaya@ug.bilkent.edu.tr
* Equal contribution, † IEEE Senior Member
arXiv:2111.06780v1 [cs.LG] 12 Nov 2021

Abstract—Value-based deep Reinforcement Learning (RL) algorithms suffer from an estimation bias primarily caused by function approximation and temporal difference (TD) learning. This problem induces faulty state-action value estimates and therefore harms the performance and robustness of learning algorithms. Although several techniques have been proposed to tackle it, learning algorithms still suffer from this bias. Here, we introduce a technique that eliminates the estimation bias in off-policy continuous control algorithms using the experience replay mechanism. We adaptively learn the weighting hyper-parameter beta in the Weighted Twin Delayed Deep Deterministic Policy Gradient algorithm. Our method is named Adaptive-WD3 (AWD3). We show through continuous control environments of OpenAI Gym that our algorithm matches or outperforms the state-of-the-art off-policy policy gradient learning algorithms.

Index Terms—deep RL, estimation bias, deterministic policy gradient, TD3 algorithm, WD3 algorithm

I. INTRODUCTION

Reinforcement learning (RL) studies how an agent interacts with its environment to learn a good policy by optimizing the cumulative sum of delayed rewards. The field has gained considerable attention recently due to exciting developments such as controlling continuous systems [1] and achieving superhuman performance in Atari games [2]. Despite these promising developments, deep RL agents suffer from issues that preclude them from performing a broad range of tasks [3]. One well-known problem is the estimation bias in value-based deep RL algorithms [4], [5].

Estimation bias (overestimation and underestimation) is the chronic case of predicting faulty values that diverge from the true values. Using neural networks as function approximators leads to unavoidable noise due to the imprecision of the estimator. Temporal difference learning, which uses a subsequent value estimate to update the current estimate, further amplifies this bias [6]. Recent works show that Q-learning, a TD learning method, is susceptible to overestimation originating from the maximization over noisy value estimates [3], [5].

The estimation bias in discrete action spaces has been widely studied [7], [8]. Van Hasselt et al. reveal that using a single value estimator causes overestimation. They propose the Double DQN algorithm by introducing an additional function estimator, the target network [2], which proved more effective than the vanilla version. Similarly, the DDPG algorithm uses a target network to mitigate overestimation in continuous control tasks. Despite its effectiveness in some tasks, the Q-value estimator network (critic) remains susceptible to overestimation, especially in large state spaces [1]. Fujimoto et al., on the other hand, introduced the TD3 algorithm, which adds a second critic network and takes the minimum of the pair to estimate the action-value function [4]. Although minimization is an effective way to handle overestimation, it introduces underestimation in some cases. The agent's pessimism about state values often harms its performance. In response to this, He et al.
proposed a weighted average of two terms, the minimum and the average of the two approximators. This proposition (the WD3 algorithm) improved the precision of action-value estimates and benefited the learning process [9]. Unfortunately, it introduces a weighting hyper-parameter β that is challenging to tune. Moreover, the proposed range of the weighting hyper-parameter, (0, 1), is insufficient: in some cases even taking the minimum or the average of the two value approximators still leads to estimation bias. Additionally, a fixed weighting value does not cover non-stationary environments, because it implicitly assumes that this weighting ratio is valid throughout the whole learning process. Hence, a further study of the estimation bias in continuous action spaces is needed.

Here, we focus on the problem of estimation bias in continuous action spaces. We present a technique to update the weighting hyper-parameter β in the WD3 algorithm. Our main motivation is to dynamically combine the two opposite sides to balance the estimation bias. In this way, the need for tuning the weighting hyper-parameter is eliminated. We name our algorithm Adaptive Weighted Delayed Deep Deterministic Policy Gradient (AWD3). Our major contributions are:

• We empirically demonstrate the estimation bias in the former algorithms and show the necessity of further expanding the range of the weighting hyper-parameter β.
• We introduce a mechanism to update β. Through simulations on OpenAI Gym environments, we show that our approach performs better than other state-of-the-art policy gradient algorithms and estimates the state-action values more accurately than its predecessor algorithms.

II. BACKGROUND

We consider the standard RL paradigm in which an agent interacts with an environment in discrete time-steps to learn reward-optimal behavior [10]. The standard RL paradigm is formalized as a Markov Decision Process (MDP). At each discrete time-step t with a given state s ∈ S, the agent selects an action a ∈ A(s_t) according to its policy π and receives a reward signal r and the new state s′ from the environment. The return is the discounted sum of rewards,

    R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i),

where γ is the discount factor denoting the importance given to future rewards. The next subsections introduce the related policy gradient RL algorithms.

A. Deep Deterministic Policy Gradient (DDPG)

The DDPG algorithm combines the Deterministic Policy Gradient algorithm with function approximation [11]. DDPG agents learn a Q-function together with a deterministic policy in continuous action spaces. The algorithm contains four neural networks, an actor and a critic along with their target networks. The critic network is used to learn the Q-function, while the actor network outputs a deterministic action given the state (π : S → A). The actor network is updated through the deterministic policy gradient:

    ∇_φ J(φ) = E_{s∼p_π} [ ∇_a Q(s, a; θ)|_{a=π(s;φ)} ∇_φ π(s; φ) ].    (1)

The motivation is to update the parameters of the actor network to maximize the critic network output; the algorithm therefore yields greedy policies and is analogous to the maximum operator in discrete action spaces. The actor and critic target networks are both updated softly with a small τ value,

    θ′ ← τθ + (1 − τ)θ′,    (2)

which increases the stability of the agents via slower updates. Despite the usage of the target networks, the DDPG algorithm overestimates the Q-values, especially in large state spaces [11].
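To make the updates in (1) and (2) concrete, here is a minimal PyTorch-style sketch. This is our own illustration, not the authors' implementation: actor, critic and their targets are assumed to be torch.nn.Module instances with the signatures shown, and actor_opt is assumed to be an optimizer over the actor's parameters.

```python
import torch

def ddpg_actor_and_target_update(actor, critic, actor_target, critic_target,
                                 actor_opt, states, tau=0.005):
    # Deterministic policy gradient (Eq. 1): ascend the critic's value of pi(s)
    # by descending its negation.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target updates (Eq. 2): theta' <- tau*theta + (1 - tau)*theta'
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```

Negating the critic's output turns ordinary gradient descent on actor_loss into gradient ascent on the estimated Q-value, which is the greedy behavior discussed above.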
B. Twin Delayed Deep Deterministic Policy Gradient (TD3)

One of the successors to DDPG is the TD3 algorithm, which regularizes the Q-network update to eliminate overestimation [4]. The algorithm utilizes two independently initialized critics and takes their minimum to compute the target for the Q-network update. Aside from the additional critic, the algorithm also delays the update of the actor to prevent overestimation. The label for the critic update is

    y = r + γ min_{i=1,2} Q_{θ′_i}(s′, π(s′; φ′)).    (3)

The periodic update of the actor and the utilization of two critics together with the minimum operator decouple the actor and critic to eliminate the overestimation error. Fujimoto et al. show that TD3 significantly outperforms the DDPG algorithm by correcting the estimation error [4]. Although this method eliminates the detrimental overestimation in the DDPG algorithm [11], the issue with TD3 is that the minimum of two critics introduces an underestimation bias (6), [9].

C. Weighted Delayed Deep Deterministic Policy Gradient (WD3)

The WD3 algorithm underscores that taking the minimum of two critics leads to underestimation. In response, a weighted average of the minimum and the average of the two target critic networks is used to update the critic networks:

    Q′ ← β min_{i=1,2} Q_{θ′_i}(s′, ã) + ((1 − β)/2) Σ_{i=1,2} Q_{θ′_i}(s′, ã),   β ∈ (0, 1).    (4)

A new hyper-parameter β is introduced for the weighted averaging. While the average of the two critics overestimates the true action-value function, their minimum underestimates it. The motivation is to obtain an accurate estimate of the state-action values by combining these two extremes.

III. ESTIMATION BIAS PHENOMENON

The estimation bias and the effect of using the minimum operator in continuous action spaces have not been studied extensively from a theoretical standpoint. This section focuses on the theoretical and empirical aspects of the estimation bias in continuous state-action spaces.

A. Overestimation

Overestimation refers to the case where the approximated value exceeds the true value. One cause is the maximum operator in Q-learning [3]: selecting greedy or near-greedy actions over noisy estimates leads to chronic overestimation. Similarly, the deterministic policy gradient update (1) causes overestimation in continuous state-action spaces [4].

B. Underestimation

Underestimation occurs when the approximated values fall below the true values. In response to the overestimation in the DDPG algorithm, the TD3 algorithm takes the minimum of the estimated values of the two critics [4]. Let Q*(s, a) denote the true Q-value and assume the estimation errors of the two critics are two correlated Gaussian random variables:

    Q_{θ_i}(s, a) − Q*(s, a) = D_i ∼ N(μ_i, σ_i²),   i = 1, 2.    (5)

Then the expectation of the estimation bias becomes

    E[min(D_1, D_2)] = μ_1 Φ((μ_2 − μ_1)/σ) + μ_2 Φ((μ_1 − μ_2)/σ) − σ φ((μ_1 − μ_2)/σ),    (6)

where σ = √(σ_1² + σ_2² − 2ρσ_1σ_2), and φ(·), Φ(·) denote the PDF and CDF of the standard normal distribution. He et al. treat the estimation biases of the two critics as independent random variables, which is not adequate given the shared transition data used in the updates of the networks [9]; we therefore include the correlation coefficient ρ. Equation (6) reveals the underestimation bias that arises when the minimum of two estimates is taken. The next section provides empirical results.

C. Empirical Demonstration of Estimation Bias

We observe the estimation bias of the TD3 and WD3 algorithms in MuJoCo environments [12].
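Before turning to the empirical results, the underestimation predicted by (6) can be sanity-checked numerically. The NumPy/SciPy sketch below is our own illustration (variable names are ours): it compares the closed form against a Monte Carlo estimate for zero-mean, correlated critic errors, and the negative value illustrates the bias introduced by the minimum operator even when both critics are individually unbiased.

```python
import numpy as np
from scipy.stats import norm

def expected_min_bias(mu1, mu2, s1, s2, rho):
    # Closed form of Eq. (6): E[min(D1, D2)] for correlated Gaussian errors.
    sigma = np.sqrt(s1**2 + s2**2 - 2.0 * rho * s1 * s2)
    return (mu1 * norm.cdf((mu2 - mu1) / sigma)
            + mu2 * norm.cdf((mu1 - mu2) / sigma)
            - sigma * norm.pdf((mu1 - mu2) / sigma))

# Unbiased critics (mu1 = mu2 = 0) with correlated errors still yield a
# negative expected minimum, i.e. underestimation.
mu1, mu2, s1, s2, rho = 0.0, 0.0, 1.0, 1.0, 0.5
cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
samples = np.random.default_rng(0).multivariate_normal([mu1, mu2], cov, size=1_000_000)
print(expected_min_bias(mu1, mu2, s1, s2, rho))  # analytic, about -0.399
print(samples.min(axis=1).mean())                # Monte Carlo estimate
```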
Fig. 1 shows the state-action value estimates along with the true values for agents trained on the Walker2d-v2 environment. The TD3 agent underestimates at the beginning but starts to overestimate towards the end. Although the WD3 agent accurately estimates the value function at the beginning, it outputs overoptimistic values towards the end. Our simulations verify the theoretical derivations in III-A and III-B. We also observe cases where the agents suffer from overestimation at the beginning of training, depending on the Xavier initialization of the networks and the signs of the reward signals [9]. The following section introduces our method.

Fig. 1: Measuring the estimation bias of the TD3 and WD3 algorithms in the Walker2d-v2 environment, averaged over 5 different random seeds.

IV. ADAPTIVE WEIGHTED DELAYED DEEP DETERMINISTIC POLICY GRADIENT (ADAPTIVE WD3)

To eliminate the estimation bias, we propose a new approach that dynamically adjusts the weighting hyper-parameter responsible for balancing overestimation and underestimation. The following sections provide the key points.

A. Motivation

A dynamic hyper-parameter β offers the following advantages:

• The hyper-parameter β balances the estimation bias by combining the two sides, so its value is crucial for the Q-value estimation and therefore vital for the learning process. Unfortunately, the optimal value of β is task-specific, meaning that multiple runs may be needed to optimize it; in the WD3 paper there are six runs per environment for different β values [9], and a more precise value would require additional training. Especially in environments where data collection is expensive or slow, tuning β becomes unappealing, and the need for a hyper-parameter-agnostic version of the WD3 algorithm becomes obvious.
• In non-stationary environments where the state transition probabilities change, updating β may speed up adaptation to the changing true value function.

B. Weighted target update

As explained in the previous sections, the DDPG and TD3 algorithms suffer from opposite problems, overestimation and underestimation. He et al. propose combining these two effects to achieve a balance through the weighted average of the pair of target critics shown in (4); if β = 1 throughout training, the algorithm reduces to TD3. Both terms of (4) introduce estimation bias: the expectation of the first term is given by (6), and the expected value of the second term is E[D_1 + D_2] = μ_1 + μ_2 [13]. We aim to dynamically adjust the hyper-parameter β to zero out the expected value of the estimation bias:

    E[D] = β E[min(D_1, D_2)] + ((1 − β)/2) E[D_1 + D_2] = 0.    (7)
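To make the weighted target in (4), whose bias the condition (7) balances, concrete, here is a minimal PyTorch-style sketch. It is ours, not the authors' code: q1_target and q2_target stand for the two target critic networks, and the function signature is an assumption.

```python
import torch

def weighted_td_target(reward, next_state, next_action,
                       q1_target, q2_target, beta, gamma=0.99):
    """Weighted target of Eq. (4): beta * min plus (1 - beta)/2 times the sum."""
    with torch.no_grad():
        q1 = q1_target(next_state, next_action)
        q2 = q2_target(next_state, next_action)
        q_weighted = beta * torch.min(q1, q2) + (1.0 - beta) / 2.0 * (q1 + q2)
        return reward + gamma * q_weighted
```

Setting β = 1 recovers the TD3 target in (3), while β = 0 gives the plain average of the two critics.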
Algorithm 1 Adaptive WD3

    Initialize networks φ, θ_i and targets θ′_i ← θ_i, φ′ ← φ for i = 1, 2
    Initialize B, d, σ, σ̃, η, c, N, T, β, μ; s, t = 0
    while t < T do
        Select a = π(s; φ) + ε, ε ∼ N(0, σ²), and receive r, s′
        Store transition tuple (s, a, r, s′) in B
        Sample a mini-batch of N transitions (s, a, r, s′) from B
        ã ← π(s′; φ′) + ε′, ε′ ∼ clip(N(0, σ̃²), −c, c)
        y ← r + γ [β min_{i=1,2} Q_{θ′_i}(s′, ã) + ((1 − β)/2) Σ_{i=1,2} Q_{θ′_i}(s′, ã)]
        Update critics: θ_i ← argmin_{θ_i} N⁻¹ Σ (y − Q_{θ_i}(s, a))²
        if t mod d = 0 then
            Update φ by the deterministic policy gradient:
                ∇_φ J(φ) = N⁻¹ Σ ∇_a Q_{θ_1}(s, a)|_{a=π(s;φ)} ∇_φ π(s; φ)
            Update target networks:
                θ′_i ← ηθ_i + (1 − η)θ′_i,  φ′ ← ηφ + (1 − η)φ′
        end if
        if s′ is terminal then
            Sample the last terminal transition (s, a, r, s′)
            ỹ ← β min_{i=1,2} Q_{θ′_i}(s, a) + ((1 − β)/2) Σ_{i=1,2} Q_{θ′_i}(s, a)
            β ← β − μ(r − ỹ)
        end if
        t ← t + 1, s ← s′
    end while

Since both critics are trained on the same data and labels, their bias values are approximately the same, μ_1 ≈ μ_2. This simplifies (6), and (7) becomes

    β ((μ_1 + μ_2)/2 − σ/√(2π)) + (1 − β) (μ_1 + μ_2)/2 = 0,    (8)

    β_optimal = √(2π) (μ_1 + μ_2) / (2σ).

Note that β_optimal may exceed 1 depending on the values of σ, μ_1, and μ_2. The actor network is updated via the gradient of the Q-value estimated by the first critic (1). Details are given in Algorithm 1. The next section explains our mechanism for updating the weighting hyper-parameter β.

C. Updating beta

Having a dynamic β enables us to eliminate the estimation bias without giving precedence to either extreme. The critic networks are updated through the weighted outputs of the target critic networks (4), which in the ideal case would perfectly approximate the true Q-function for all state-action pairs. However, function approximators suffer from extrapolation error [14]. Therefore, in practice, our aim is to estimate Q-values within a negligible error interval for the important pairs. The terminal states generally play a vital role in training RL agents: the reward signal given at termination usually has a larger magnitude than that of other transitions, and success or failure is sometimes defined on the terminal states. Accordingly, we update the value of β to eliminate the estimation bias at the terminal states.

Recent work shows that terminations caused by time limits should not be treated as true terminations and must be processed carefully [15]. Thus, we exclude terminal states caused by time limits when adjusting the value of β. In addition, we reveal the necessity of expanding the range of β both theoretically (7) and empirically (Fig. 1), and therefore expand its range. The update rule is

    β ← β − μ (r − ỹ),   β ∈ [0, 2.5],    (9)

where r is the terminal transition reward, ỹ is the estimated value of the terminal state-action pair, and μ is the learning rate. Since the value of r is unbiased, β is guaranteed to converge if the function approximators converge.
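As an illustration of the update rule (9) together with the expanded range [0, 2.5], here is a minimal sketch. It is our own (names are illustrative); weighted_td_value stands for the ỹ computed from the two target critics as in Algorithm 1.

```python
def update_beta(beta, terminal_reward, weighted_td_value,
                lr=1e-4, beta_min=0.0, beta_max=2.5):
    """One step of Eq. (9): move beta against the terminal-state estimation error.

    If the weighted estimate overshoots the (unbiased) terminal reward, beta grows,
    putting more weight on the pessimistic minimum; if it undershoots, beta shrinks
    toward the optimistic average.
    """
    beta = beta - lr * (terminal_reward - weighted_td_value)
    return min(max(beta, beta_min), beta_max)

# Example: the terminal value is overestimated by 2.0, so beta increases slightly.
print(update_beta(beta=1.0, terminal_reward=5.0, weighted_td_value=7.0))  # 1.0002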
The next section gives the experimental details.

V. EXPERIMENTS

We evaluate AWD3 and compare its performance with the TD3 and WD3 algorithms on the OpenAI Gym and MuJoCo environments [12], [16]. For a fair comparison, we mainly select the common environments used in [4], [9].

A. Implementation Details

Considering reproducibility concerns [17], we explicitly share our implementation details. For a fair comparison, we use the same hyper-parameters as the TD3 and WD3 algorithms. We do not manipulate the data coming from the environments and feed it directly to the networks, and we use the default reward setting of each environment. We train agents for 1 million time-steps in each setting. Every 5000 time-steps, we pause training and evaluate the agent for 10 episodes; in evaluation mode, agents do not apply exploration noise to their actions, and the experienced transitions are not stored in the replay buffer. Dependence on the initial parameters is reduced by taking randomly sampled actions for the first 25000 time-steps, during which the networks are not updated. Terminations caused by the time limit are treated as ordinary (non-terminal) transitions in the updates. This procedure is identical for all 5 seeds.

The AWD3 algorithm uses the same hyper-parameters across all environments and seeds. Both the actor and the two critics have two fully connected feed-forward layers with 256 neurons each. The networks use the ReLU activation function, except for the tanh non-linearity in the last layer of the actor network, and are trained with the Adam optimizer [18]. Transitions are uniformly sampled from the replay buffer with a batch size of 100, and the learning rate is 3e-4 for both the actor and the two critic networks. The actor and the soft target updates use a frequency of d = 2 with the soft update hyper-parameter τ = 5e-3, while the critic networks are updated at every time-step. Exploration noise is Gaussian, ε ∼ N(0, 0.1), and the noisy actions are clipped to the action space of the environment. During the critic update, Gaussian noise ε ∼ N(0, 0.2) clipped to [−0.5, 0.5] is added to the output of the target actor.

We also introduce new hyper-parameters as a result of the β update mechanism. The learning rate for the β update is 1e-4. The initial β is taken from [9] where available; for the other environments, β is initialized to (β_max + β_min)/2. The β update starts after 100000 time-steps to eliminate the effect of the Xavier initialization and takes place after each episode termination, excluding time-limit-induced terminations. For a fair comparison, we implement the TD3 and WD3 algorithms following their respective papers, without any engineering tricks.

Fig. 2 shows the learning curves. These results show that AWD3 outperforms or matches the other algorithms and is more robust to catastrophic forgetting. Table I reports the maximum average returns.

Fig. 2: Learning curves for the OpenAI Gym continuous control tasks: (a) Ant-v2, (b) Hopper-v2, (c) Walker2d-v2, (d) BipedalWalker-v3, (e) Humanoid-v2, (f) InvertedDoublePendulum-v2, (g) LunarLanderContinuous-v2. The shaded region represents half a standard deviation of the average evaluation over 5 trials. Curves are smoothed.

TABLE I: Max average return over 5 trials of 1M time-steps. BWalker-v3, Hum-v2, InvDouble-v2 and Lunar-v2 stand for BipedalWalker-v3, Humanoid-v2, InvertedDoublePendulum-v2 and LunarLanderContinuous-v2, respectively.

    Environment     TD3              WD3              AWD3
    Ant-v2          4926.3 ± 859.3   4157.1 ± 762.0   4948.0 ± 766.1
    Hopper-v2       3363.6 ± 154.3   3328.6 ± 188.5   3372.6 ± 114.6
    Walker2d-v2     3640.3 ± 738.8   4217.3 ± 685.0   4390.3 ± 484.9
    BWalker-v3      308.0 ± 8.4      308.7 ± 3.0      309.5 ± 4.2
    Hum-v2          5286.0 ± 66.0    5218.7 ± 62.7    5227.6 ± 102.4
    InvDouble-v2    9359.7 ± 0.2     9359.8 ± 0.1     9359.8 ± 0.1
    Lunar-v2        288.7 ± 1.7      290.2 ± 4.3      293.6 ± 3.4
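For convenience, the hyper-parameter settings listed in Section V-A can be collected in a single configuration object. The sketch below is ours, with illustrative names rather than the authors' code; the discount factor is an assumption (a standard value, not stated in the text).

```python
from dataclasses import dataclass

@dataclass
class AWD3Config:
    # Shared TD3/WD3/AWD3 settings from Section V-A.
    total_timesteps: int = 1_000_000
    start_timesteps: int = 25_000      # random actions, no network updates
    eval_freq: int = 5_000             # evaluate every 5000 steps
    eval_episodes: int = 10
    batch_size: int = 100
    hidden_units: int = 256            # two layers for the actor and each critic
    actor_lr: float = 3e-4
    critic_lr: float = 3e-4
    tau: float = 5e-3                  # soft target update rate
    policy_delay: int = 2              # d
    discount: float = 0.99             # assumed; not stated in the text
    expl_noise_std: float = 0.1        # exploration noise N(0, 0.1)
    target_noise_std: float = 0.2      # target policy smoothing N(0, 0.2)
    target_noise_clip: float = 0.5     # clipped to [-0.5, 0.5]
    # AWD3-specific settings.
    beta_lr: float = 1e-4
    beta_min: float = 0.0
    beta_max: float = 2.5
    beta_update_start: int = 100_000   # no beta updates before this step

config = AWD3Config()
print(config.policy_delay, config.beta_lr)
```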
B. Q-value estimation

For a comprehensive comparison, we estimate the Q-values of all three algorithms over 5 seeds. We randomly select 1000 state-action pairs from the replay buffer, feed each state and action to both critics, and compute the estimate according to the respective formula of each algorithm. For the true Q-values, we simulate transitions to the terminal states with the Monte Carlo method every 50000 time-steps. Fig. 3 shows the estimated Q-values along with the ground truths. We see that estimating the state-action values by taking the minimum of the two estimators underestimates the ground truth, while our proposal best approximates the Q-values, which contributes to its performance.

Fig. 3: Empirical demonstration of the estimation bias on continuous control tasks over 5 random seeds: (a) LunarLanderContinuous TD3, (b) LunarLanderContinuous WD3, (c) LunarLanderContinuous AWD3, (d) Walker2d AWD3. The red line indicates the true state-action values, whereas the blue line shows the estimates.

VI. CONCLUSION

The estimation bias in continuous action spaces is one of the pivotal setbacks to better performance in value-based RL algorithms, and it is not completely addressed by the current state-of-the-art deterministic policy gradient algorithms (DDPG, TD3, WD3). Here, we first show the susceptibility of these algorithms to the estimation bias. Then, to actively eliminate the estimation bias, we introduce a mechanism to update β in (4). Through simulations, we verify that our algorithm estimates state-action values more precisely than the other algorithms. We also show that the range of the weighting hyper-parameter β in (4) may not be sufficient to zero out the expectation of the estimation bias; we verify this derivation with empirical results and demonstrate the necessity of expanding the range of the weighting hyper-parameter.

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[2] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," 2015.
[3] S. Thrun and A. Schwartz, "Issues in using function approximation for reinforcement learning," in Proceedings of the Fourth Connectionist Models Summer School, pp. 255–263, Hillsdale, NJ, 1993.
[4] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," 2018.
[5] H. van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil, "Deep reinforcement learning and the deadly triad," CoRR, vol. abs/1812.02648, 2018.
[6] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[7] Q. Lan, Y. Pan, A. Fyshe, and M. White, "Maxmin Q-learning: Controlling the estimation bias of Q-learning," 2020.
[8] O. Anschel, N. Baram, and N. Shimkin, "Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning," 2017.
[9] Q. He and X. Hou, "WD3: Taming the estimation bias in deep reinforcement learning," in 32nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2020), Baltimore, MD, USA, November 9-11, 2020, pp. 391–398, IEEE, 2020.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, second ed., 2018.
[11] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," 2019.
[12] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
[13] S. Nadarajah and S. Kotz, "Exact distribution of the max/min of two Gaussian random variables," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 2, pp. 210–212, 2008.
[14] S. Fujimoto, D. Meger, and D. Precup, "Off-policy deep reinforcement learning without exploration," 2019.
[15] F. Pardo, A. Tavakoli, V. Levdik, and P. Kormushev, "Time limits in reinforcement learning," 2018.
[16] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," 2016.
[17] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep reinforcement learning that matters," 2019.
[18] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2017.