Dueling Network Architectures for Deep Reinforcement Learning

Ziyu Wang                      ziyu@google.com
Tom Schaul                     schaul@google.com
Matteo Hessel                  mtthss@google.com
Hado van Hasselt               hado@google.com
Marc Lanctot                   lanctot@google.com
Nando de Freitas               nandodefreitas@gmail.com

Google DeepMind, London, UK

arXiv:1511.06581v3 [cs.LG] 5 Apr 2016

Abstract

In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.

1. Introduction

Over the past years, deep learning has contributed to dramatic advances in scalability and performance of machine learning (LeCun et al., 2015). One exciting application is the sequential decision-making setting of reinforcement learning (RL) and control. Notable examples include deep Q-learning (Mnih et al., 2015), deep visuomotor policies (Levine et al., 2015), attention with recurrent networks (Ba et al., 2015), and model predictive control with embeddings (Watter et al., 2015). Other recent successes include massively parallel frameworks (Nair et al., 2015) and expert move prediction in the game of Go (Maddison et al., 2015), which produced policies matching those of Monte Carlo tree search programs, and squarely beat a professional player when combined with search (Silver et al., 2016).

In spite of this, most of the approaches for RL use standard neural networks, such as convolutional networks, MLPs, LSTMs and autoencoders. The focus in these recent advances has been on designing improved control and RL algorithms, or simply on incorporating existing neural network architectures into RL methods. Here, we take an alternative but complementary approach of focusing primarily on innovating a neural network architecture that is better suited for model-free RL. This approach has the benefit that the new network can be easily combined with existing and future algorithms for RL. That is, this paper advances a new network (Figure 1), but uses already published algorithms.

Figure 1. A popular single stream Q-network (top) and the dueling Q-network (bottom). The dueling network has two streams to separately estimate (scalar) state-value and the advantages for each action; the green output module implements equation (9) to combine them. Both networks output Q-values for each action.
The proposed network architecture, which we name the dueling architecture, explicitly separates the representation of state values and (state-dependent) action advantages. The dueling architecture consists of two streams that represent the value and advantage functions, while sharing a common convolutional feature learning module. The two streams are combined via a special aggregating layer to produce an estimate of the state-action value function Q as shown in Figure 1. This dueling network should be understood as a single Q network with two streams that replaces the popular single-stream Q network in existing algorithms such as Deep Q-Networks (DQN; Mnih et al., 2015). The dueling network automatically produces separate estimates of the state value function and advantage function, without any extra supervision.

Intuitively, the dueling architecture can learn which states are (or are not) valuable, without having to learn the effect of each action for each state. This is particularly useful in states where its actions do not affect the environment in any relevant way. To illustrate this, consider the saliency maps shown in Figure 2.[1] These maps were generated by computing the Jacobians of the trained value and advantage streams with respect to the input video, following the method proposed by Simonyan et al. (2013). (The experimental section describes this methodology in more detail.) The figure shows the value and advantage saliency maps for two different time steps. In one time step (leftmost pair of images), we see that the value network stream pays attention to the road and in particular to the horizon, where new cars appear. It also pays attention to the score. The advantage stream, on the other hand, does not pay much attention to the visual input because its action choice is practically irrelevant when there are no cars in front. However, in the second time step (rightmost pair of images) the advantage stream pays attention as there is a car immediately in front, making its choice of action very relevant.

[1] https://www.youtube.com/playlist?list=PLVFXyCSfS2Pau0gBh0mwTxDmutywWyFBP

Figure 2. See, attend and drive: Value and advantage saliency maps (red-tinted overlay) on the Atari game Enduro, for a trained dueling architecture. The value stream learns to pay attention to the road. The advantage stream learns to pay attention only when there are cars immediately in front, so as to avoid collisions.

In the experiments, we demonstrate that the dueling architecture can more quickly identify the correct action during policy evaluation as redundant or similar actions are added to the learning problem.

We also evaluate the gains brought in by the dueling architecture on the challenging Atari 2600 testbed. Here, an RL agent with the same structure and hyper-parameters must be able to play 57 different games by observing image pixels and game scores only. The results illustrate vast improvements over the single-stream baselines of Mnih et al. (2015) and van Hasselt et al. (2015). The combination of prioritized replay (Schaul et al., 2016) with the proposed dueling network results in the new state-of-the-art for this popular domain.

1.1. Related Work

The notion of maintaining separate value and advantage functions goes back to Baird (1993). In Baird's original advantage updating algorithm, the shared Bellman residual update equation is decomposed into two updates: one for a state value function, and one for its associated advantage function. Advantage updating was shown to converge faster than Q-learning in simple continuous time domains in (Harmon et al., 1995). Its successor, the advantage learning algorithm, represents only a single advantage function (Harmon & Baird, 1996).

The dueling architecture represents both the value V(s) and advantage A(s, a) functions with a single deep model whose output combines the two to produce a state-action value Q(s, a). Unlike in advantage updating, the representation and algorithm are decoupled by construction. Consequently, the dueling architecture can be used in combination with a myriad of model-free RL algorithms.

There is a long history of advantage functions in policy gradients, starting with (Sutton et al., 2000). As a recent example of this line of work, Schulman et al. (2015) estimate advantage values online to reduce the variance of policy gradient algorithms.

There have been several attempts at playing Atari with deep reinforcement learning, including Mnih et al. (2015); Guo et al. (2014); Stadie et al. (2015); Nair et al. (2015); van Hasselt et al. (2015); Bellemare et al. (2016) and Schaul et al. (2016). The results of Schaul et al. (2016) are the current published state-of-the-art.
2. Background

We consider a sequential decision making setup, in which an agent interacts with an environment E over discrete time steps; see Sutton & Barto (1998) for an introduction. In the Atari domain, for example, the agent perceives a video s_t consisting of M image frames: s_t = (x_{t-M+1}, ..., x_t) \in S at time step t. The agent then chooses an action from a discrete set a_t \in A = {1, ..., |A|} and observes a reward signal r_t produced by the game emulator.

The agent seeks to maximize the expected discounted return, where we define the discounted return as R_t = \sum_{\tau=t}^{\infty} \gamma^{\tau-t} r_\tau. In this formulation, \gamma \in [0, 1] is a discount factor that trades off the importance of immediate and future rewards.

For an agent behaving according to a stochastic policy \pi, the values of the state-action pair (s, a) and the state s are defined as follows:

    Q^\pi(s, a) = E[ R_t | s_t = s, a_t = a, \pi ],  and
    V^\pi(s) = E_{a \sim \pi(s)}[ Q^\pi(s, a) ].                                    (1)

The preceding state-action value function (Q function for short) can be computed recursively with dynamic programming:

    Q^\pi(s, a) = E_{s'}[ r + \gamma E_{a' \sim \pi(s')}[ Q^\pi(s', a') ] | s, a, \pi ].

We define the optimal Q^*(s, a) = max_\pi Q^\pi(s, a). Under the deterministic policy a = argmax_{a' \in A} Q^*(s, a'), it follows that V^*(s) = max_a Q^*(s, a). From this, it also follows that the optimal Q function satisfies the Bellman equation:

    Q^*(s, a) = E_{s'}[ r + \gamma max_{a'} Q^*(s', a') | s, a ].                   (2)

We define another important quantity, the advantage function, relating the value and Q functions:

    A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s).                                           (3)

Note that E_{a \sim \pi(s)}[ A^\pi(s, a) ] = 0. Intuitively, the value function V measures how good it is to be in a particular state s. The Q function, however, measures the value of choosing a particular action when in this state. The advantage function subtracts the value of the state from the Q function to obtain a relative measure of the importance of each action.

2.1. Deep Q-networks

The value functions as described in the preceding section are high dimensional objects. To approximate them, we can use a deep Q-network: Q(s, a; \theta) with parameters \theta. To estimate this network, we optimize the following sequence of loss functions at iteration i:

    L_i(\theta_i) = E_{s,a,r,s'}[ ( y_i^{DQN} - Q(s, a; \theta_i) )^2 ],            (4)

with

    y_i^{DQN} = r + \gamma max_{a'} Q(s', a'; \theta^-),                            (5)

where \theta^- represents the parameters of a fixed and separate target network. We could attempt to use standard Q-learning to learn the parameters of the network Q(s, a; \theta) online. However, this estimator performs poorly in practice. A key innovation in (Mnih et al., 2015) was to freeze the parameters of the target network Q(s', a'; \theta^-) for a fixed number of iterations while updating the online network Q(s, a; \theta_i) by gradient descent. (This greatly improves the stability of the algorithm.) The specific gradient update is

    \nabla_{\theta_i} L_i(\theta_i) = E_{s,a,r,s'}[ ( y_i^{DQN} - Q(s, a; \theta_i) ) \nabla_{\theta_i} Q(s, a; \theta_i) ].

This approach is model free in the sense that the states and rewards are produced by the environment. It is also off-policy because these states and rewards are obtained with a behavior policy (epsilon greedy in DQN) different from the online policy that is being learned.

Another key ingredient behind the success of DQN is experience replay (Lin, 1993; Mnih et al., 2015). During learning, the agent accumulates a dataset D_t = {e_1, e_2, ..., e_t} of experiences e_t = (s_t, a_t, r_t, s_{t+1}) from many episodes. When training the Q-network, instead of only using the current experience as prescribed by standard temporal-difference learning, the network is trained by sampling mini-batches of experiences from D uniformly at random. The sequence of losses thus takes the form

    L_i(\theta_i) = E_{(s,a,r,s') \sim U(D)}[ ( y_i^{DQN} - Q(s, a; \theta_i) )^2 ].

Experience replay increases data efficiency through re-use of experience samples in multiple updates and, importantly, it reduces variance as uniform sampling from the replay buffer reduces the correlation among the samples used in the update.
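To make the preceding equations concrete, the following is a minimal PyTorch sketch (not from the paper) of the DQN target (5) and loss (4) for a uniformly sampled minibatch. It assumes that `online_net` and `target_net` are Q-networks returning a (batch, |A|) tensor of Q-values, and that `dones` marks terminal transitions, a detail the expectations above leave implicit.

# Minimal sketch, under the assumptions stated above.
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a; theta_i) for the actions actually taken.
    q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y^DQN = r + gamma * max_a' Q(s', a'; theta^-), with the target network frozen.
        next_q_max = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * next_q_max
    return F.mse_loss(q, y)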
2.2. Double Deep Q-networks

The previous section described the main components of DQN as presented in (Mnih et al., 2015). In this paper, we use the improved Double DQN (DDQN) learning algorithm of van Hasselt et al. (2015). In Q-learning and DQN, the max operator uses the same values to both select and evaluate an action. This can therefore lead to overoptimistic value estimates (van Hasselt, 2010). To mitigate this problem, DDQN uses the following target:

    y_i^{DDQN} = r + \gamma Q(s', argmax_{a'} Q(s', a'; \theta_i); \theta^-).       (6)

DDQN is the same as for DQN (see Mnih et al. (2015)), but with the target y_i^{DQN} replaced by y_i^{DDQN}. The pseudo-code for DDQN is presented in Appendix A.
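A minimal sketch of the DDQN target (6), under the same assumptions as the DQN sketch above: the online network selects the greedy action and the frozen target network evaluates it.

# Minimal sketch, not from the paper.
import torch

def ddqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # argmax_a' Q(s', a'; theta_i) from the online network.
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Q(s', argmax ...; theta^-) from the target network.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q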

2.3. Prioritized Replay

A recent innovation in prioritized experience replay (Schaul et al., 2016) built on top of DDQN and further improved the state-of-the-art. Their key idea was to increase the replay probability of experience tuples that have a high expected learning progress (as measured via the proxy of absolute TD-error). This led to both faster learning and to better final policy quality across most games of the Atari benchmark suite, as compared to uniform experience replay.

To strengthen the claim that our dueling architecture is complementary to algorithmic innovations, we show that it improves performance for both the uniform and the prioritized replay baselines (for which we picked the easier to implement rank-based variant), with the resulting prioritized dueling variant holding the new state-of-the-art.

3. The Dueling Network Architecture

The key insight behind our new architecture, as illustrated in Figure 2, is that for many states, it is unnecessary to estimate the value of each action choice. For example, in the Enduro game setting, knowing whether to move left or right only matters when a collision is imminent. In some states, it is of paramount importance to know which action to take, but in many other states the choice of action has no repercussion on what happens. For bootstrapping based algorithms, however, the estimation of state values is of great importance for every state.

To bring this insight to fruition, we design a single Q-network architecture, as illustrated in Figure 1, which we refer to as the dueling network. The lower layers of the dueling network are convolutional as in the original DQNs (Mnih et al., 2015). However, instead of following the convolutional layers with a single sequence of fully connected layers, we instead use two sequences (or streams) of fully connected layers. The streams are constructed such that they have the capability of providing separate estimates of the value and advantage functions. Finally, the two streams are combined to produce a single output Q function. As in (Mnih et al., 2015), the output of the network is a set of Q values, one for each action.

Since the output of the dueling network is a Q function, it can be trained with the many existing algorithms, such as DDQN and SARSA. In addition, it can take advantage of any improvements to these algorithms, including better replay memories, better exploration policies, intrinsic motivation, and so on.

The module that combines the two streams of fully-connected layers to output a Q estimate requires very thoughtful design.

From the expressions for advantage Q^\pi(s, a) = V^\pi(s) + A^\pi(s, a) and state-value V^\pi(s) = E_{a \sim \pi(s)}[ Q^\pi(s, a) ], it follows that E_{a \sim \pi(s)}[ A^\pi(s, a) ] = 0. Moreover, for a deterministic policy, a^* = argmax_{a' \in A} Q(s, a'), it follows that Q(s, a^*) = V(s) and hence A(s, a^*) = 0.

Let us consider the dueling network shown in Figure 1, where we make one stream of fully-connected layers output a scalar V(s; \theta, \beta), and the other stream output an |A|-dimensional vector A(s, a; \theta, \alpha). Here, \theta denotes the parameters of the convolutional layers, while \alpha and \beta are the parameters of the two streams of fully-connected layers.

Using the definition of advantage, we might be tempted to construct the aggregating module as follows:

    Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha).     (7)

Note that this expression applies to all (s, a) instances; that is, to express equation (7) in matrix form we need to replicate the scalar V(s; \theta, \beta) |A| times.

However, we need to keep in mind that Q(s, a; \theta, \alpha, \beta) is only a parameterized estimate of the true Q-function. Moreover, it would be wrong to conclude that V(s; \theta, \beta) is a good estimator of the state-value function, or likewise that A(s, a; \theta, \alpha) provides a reasonable estimate of the advantage function.

Equation (7) is unidentifiable in the sense that given Q we cannot recover V and A uniquely. To see this, add a constant to V(s; \theta, \beta) and subtract the same constant from A(s, a; \theta, \alpha). This constant cancels out, resulting in the same Q value. This lack of identifiability is mirrored by poor practical performance when this equation is used directly.

To address this issue of identifiability, we can force the advantage function estimator to have zero advantage at the chosen action. That is, we let the last module of the network implement the forward mapping

    Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + ( A(s, a; \theta, \alpha) - max_{a' \in A} A(s, a'; \theta, \alpha) ).     (8)

Now, for a^* = argmax_{a' \in A} Q(s, a'; \theta, \alpha, \beta) = argmax_{a' \in A} A(s, a'; \theta, \alpha), we obtain Q(s, a^*; \theta, \alpha, \beta) = V(s; \theta, \beta). Hence, the stream V(s; \theta, \beta) provides an estimate of the value function, while the other stream produces an estimate of the advantage function.

An alternative module replaces the max operator with an average:

    Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + ( A(s, a; \theta, \alpha) - (1/|A|) \sum_{a'} A(s, a'; \theta, \alpha) ).     (9)

On the one hand this loses the original semantics of V and A because they are now off-target by a constant, but on the other hand it increases the stability of the optimization: with (9) the advantages only need to change as fast as the mean, instead of having to compensate any change to the optimal action's advantage in (8). We also experimented with a softmax version of equation (8), but found it to deliver similar results to the simpler module of equation (9). Hence, all the experiments reported in this paper use the module of equation (9).

Note that while subtracting the mean in equation (9) helps with identifiability, it does not change the relative rank of the A (and hence Q) values, preserving any greedy or epsilon-greedy policy based on Q values from equation (7). When acting, it suffices to evaluate the advantage stream to make decisions.

It is important to note that equation (9) is viewed and implemented as part of the network and not as a separate algorithmic step. Training of the dueling architectures, as with standard Q networks (e.g. the deep Q-network of Mnih et al. (2015)), requires only back-propagation. The estimates V(s; \theta, \beta) and A(s, a; \theta, \alpha) are computed automatically without any extra supervision or algorithmic modifications.

As the dueling architecture shares the same input-output interface with standard Q networks, we can recycle all learning algorithms with Q networks (e.g., DDQN and SARSA) to train the dueling architecture.
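The aggregating module of equations (8) and (9) amounts to a few tensor operations. Below is a minimal PyTorch sketch (not the authors' implementation), operating on a batch of value estimates v of shape (batch, 1) and advantage estimates a of shape (batch, |A|).

# Minimal sketch of the aggregating module; names are ours.
import torch

def aggregate_max(v, a):
    # Equation (8): Q = V + (A - max_a' A).
    return v + (a - a.max(dim=1, keepdim=True).values)

def aggregate_mean(v, a):
    # Equation (9): Q = V + (A - mean_a' A); the module used in the paper's experiments.
    return v + (a - a.mean(dim=1, keepdim=True))

Subtracting the mean rather than the max leaves the relative ordering of the Q values unchanged while giving each advantage a more stable target, which is why equation (9) is preferred in practice.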
4. Experiments

We now show the practical performance of the dueling network. We start with a simple policy evaluation task and then show larger scale results for learning policies for general Atari game-playing.

4.1. Policy evaluation

We start by measuring the performance of the dueling architecture on a policy evaluation task. We choose this particular task because it is very useful for evaluating network architectures, as it is devoid of confounding factors such as the choice of exploration strategy, and the interaction between policy improvement and policy evaluation.

In this experiment, we employ temporal difference learning (without eligibility traces, i.e., \lambda = 0) to learn Q values. More specifically, given a behavior policy \pi, we seek to estimate the state-action value Q^\pi(., .) by optimizing the sequence of costs of equation (4), with target

    y_i = r + \gamma E_{a' \sim \pi(s')}[ Q(s', a'; \theta_i) ].

The above update rule is the same as that of Expected SARSA (van Seijen et al., 2009). We, however, do not modify the behavior policy as in Expected SARSA.

To evaluate the learned Q values, we choose a simple environment where the exact Q^\pi(s, a) values can be computed separately for all (s, a) in S x A. This environment, which we call the corridor, is composed of three connected corridors. A schematic drawing of the corridor environment is shown in Figure 3. The agent starts from the bottom left corner of the environment and must move to the top right to get the largest reward. A total of 5 actions are available: go up, down, left, right and no-op. We also have the freedom of adding an arbitrary number of no-op actions. In our setup, the two vertical sections both have 10 states while the horizontal section has 50.

We use an epsilon-greedy policy as the behavior policy \pi, which chooses a random action with probability epsilon or an action according to the optimal Q function argmax_{a \in A} Q^*(s, a) with probability 1 - epsilon. In our experiments, epsilon is chosen to be 0.001.

We compare a single-stream Q architecture with the dueling architecture on three variants of the corridor environment with 5, 10 and 20 actions respectively. The 10 and 20 action variants are formed by adding no-ops to the original environment. We measure performance by Squared Error (SE) against the true state values: \sum_{s \in S, a \in A} ( Q(s, a; \theta) - Q^\pi(s, a) )^2. The single-stream architecture is a three-layer MLP with 50 units on each hidden layer. The dueling architecture is also composed of three layers. After the first hidden layer of 50 units, however, the network branches off into two streams, each of them a two-layer MLP with 25 hidden units. The results of the comparison are summarized in Figure 3.

The results show that with 5 actions, both architectures converge at about the same speed. However, when we increase the number of actions, the dueling architecture performs better than the traditional Q-network. In the dueling network, the stream V(s; \theta, \beta) learns a general value that is shared across many similar actions at s, hence leading to faster convergence. This is a very promising result because many control tasks with large action spaces have this property, and consequently we should expect that the dueling network will often lead to much faster convergence than a traditional single-stream network. In the following section, we will indeed see that the dueling network results in substantial gains in performance in a wide range of Atari games.
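As an illustration of the policy-evaluation target used above, the following hedged PyTorch sketch computes y = r + gamma * E_{a' ~ pi(s')}[Q(s', a'; theta_i)] for an epsilon-greedy behavior policy defined with respect to a fixed Q* (passed in here as q_star_next); the helper names are ours, not the paper's.

# Minimal sketch, under the assumptions stated above.
import torch

def epsilon_greedy_probs(q_values, epsilon):
    # pi(a|s): epsilon spread uniformly over actions, plus (1 - epsilon) on the greedy action.
    num_actions = q_values.shape[1]
    probs = torch.full_like(q_values, epsilon / num_actions)
    greedy = q_values.argmax(dim=1)
    probs[torch.arange(q_values.shape[0]), greedy] += 1.0 - epsilon
    return probs

def expected_sarsa_target(q_online_next, q_star_next, rewards, gamma=0.99, epsilon=0.001):
    # q_online_next: Q(s', .; theta_i) from the network being evaluated, shape (batch, |A|).
    # q_star_next:   Q*(s', .) used only to define the behavior policy pi.
    pi = epsilon_greedy_probs(q_star_next, epsilon)
    expected_q = (pi * q_online_next).sum(dim=1)
    return rewards + gamma * expected_q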
Figure 3. (a) The corridor environment. The star marks the starting state. The redness of a state signifies the reward the agent receives upon arrival. The game terminates upon reaching either reward state. The agent's actions are going up, down, left, right and no action. Plots (b), (c) and (d) show the squared error for policy evaluation with 5, 10, and 20 actions on a log-log scale. The dueling network (Duel) consistently outperforms a conventional single-stream network (Single), with the performance gap increasing with the number of actions.

4.2. General Atari Game-Playing

We perform a comprehensive evaluation of our proposed method on the Arcade Learning Environment (Bellemare et al., 2013), which is composed of 57 Atari games. The challenge is to deploy a single algorithm and architecture, with a fixed set of hyper-parameters, to learn to play all the games given only raw pixel observations and game rewards. This environment is very demanding because it is both comprised of a large number of highly diverse games and the observations are high-dimensional.

We follow closely the setup of van Hasselt et al. (2015) and compare to their results using single-stream Q-networks. We train the dueling network with the DDQN algorithm as presented in Appendix A. At the end of this section, we incorporate prioritized experience replay (Schaul et al., 2016).

Our network architecture has the same low-level convolutional structure of DQN (Mnih et al., 2015; van Hasselt et al., 2015). There are 3 convolutional layers followed by 2 fully-connected layers. The first convolutional layer has 32 8x8 filters with stride 4, the second 64 4x4 filters with stride 2, and the third and final convolutional layer consists of 64 3x3 filters with stride 1. As shown in Figure 1, the dueling network splits into two streams of fully connected layers. The value and advantage streams both have a fully-connected layer with 512 units. The final hidden layers of the value and advantage streams are both fully-connected, with the value stream having one output and the advantage as many outputs as there are valid actions.[2] We combine the value and advantage streams using the module described by Equation (9). Rectifier non-linearities (Fukushima, 1980) are inserted between all adjacent layers.

[2] The number of actions ranges between 3 and 18 in the ALE environment.

We adopt the optimizers and hyper-parameters of van Hasselt et al. (2015), with the exception of the learning rate, which we chose to be slightly lower (we do not do this for double DQN as it can deteriorate its performance). Since both the advantage and the value stream propagate gradients to the last convolutional layer in the backward pass, we rescale the combined gradient entering the last convolutional layer by 1/sqrt(2). This simple heuristic mildly increases stability. In addition, we clip the gradients to have their norm less than or equal to 10. This clipping is not standard practice in deep RL, but common in recurrent network training (Bengio et al., 2013).

To isolate the contributions of the dueling architecture, we re-train DDQN with a single stream network using exactly the same procedure as described above. Specifically, we apply gradient clipping, and use 1024 hidden units for the first fully-connected layer of the network so that both architectures (dueling and single) have roughly the same number of parameters. We refer to this re-trained model as Single Clip, while the original trained model of van Hasselt et al. (2015) is referred to as Single.

As in (van Hasselt et al., 2015), we start the game with up to 30 no-op actions to provide random starting positions for the agent. To evaluate our approach, we measure the improvement in percentage (positive or negative) in score over the better of human and baseline agent scores:

    ( Score_Agent - Score_Baseline ) / ( max{ Score_Human, Score_Baseline } - Score_Random ).     (10)

We took the maximum over human and baseline agent scores as it prevents insignificant changes from appearing as large improvements when neither the agent in question nor the baseline are doing well. For example, an agent that achieves 2% human performance should not be interpreted as two times better when the baseline agent achieves 1% human performance. We also chose not to measure performance in terms of percentage of human performance alone because a tiny difference relative to the baseline on some games can translate into hundreds of percent in human performance difference.
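Putting the pieces together, here is a minimal PyTorch sketch of the dueling network described above (not the authors' released model). It assumes the standard 84x84, four-frame Atari input inherited from Mnih et al. (2015); the 1/sqrt(2) gradient rescaling is realized with a backward hook on the shared features, which is one possible way to implement the heuristic.

# Minimal sketch under the assumptions stated above.
import math
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        # Convolutional trunk shared by both streams (as in DQN).
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        conv_out = 64 * 7 * 7  # assumes 84x84 input frames
        # Value stream: 512 hidden units -> scalar V(s).
        self.value = nn.Sequential(nn.Linear(conv_out, 512), nn.ReLU(), nn.Linear(512, 1))
        # Advantage stream: 512 hidden units -> one output per action.
        self.advantage = nn.Sequential(nn.Linear(conv_out, 512), nn.ReLU(), nn.Linear(512, num_actions))

    def forward(self, x):
        features = self.conv(x).flatten(start_dim=1)
        if features.requires_grad:
            # Rescale the combined gradient entering the last convolutional layer by 1/sqrt(2).
            features.register_hook(lambda g: g / math.sqrt(2.0))
        v = self.value(features)
        a = self.advantage(features)
        # Equation (9): Q = V + (A - mean A).
        return v + (a - a.mean(dim=1, keepdim=True))

# During optimization, gradients would additionally be clipped to norm 10 before each update, e.g.:
# torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=10.0)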
Figure 4. Improvements of dueling architecture over the baseline Single network of van Hasselt et al. (2015), using the metric described in Equation (10). Bars to the right indicate by how much the dueling network outperforms the single-stream network.

Table 1. Mean and median scores across all 57 Atari games, measured in percentages of human performance.

                        30 no-ops              Human Starts
                     Mean      Median        Mean      Median
  Prior. Duel Clip   591.9%    172.1%        567.0%    115.3%
  Prior. Single      434.6%    123.7%        386.7%    112.9%
  Duel Clip          373.1%    151.5%        343.8%    117.1%
  Single Clip        341.2%    132.6%        302.8%    114.1%
  Single             307.3%    117.8%        332.9%    110.9%
  Nature DQN         227.9%     79.1%        219.6%     68.5%

The results for the wide suite of 57 games are summarized in Table 1. Detailed results are presented in the Appendix.

Using this 30 no-ops performance measure, it is clear that the dueling network (Duel Clip) does substantially better than the Single Clip network of similar capacity. It also does considerably better than the baseline (Single) of van Hasselt et al. (2015). For comparison we also show results for the deep Q-network of Mnih et al. (2015), referred to as Nature DQN.

Figure 4 shows the improvement of the dueling network over the baseline Single network of van Hasselt et al. (2015). Again, we see that the improvements are often very dramatic.

As shown in Table 1, Single Clip performs better than Single. We verified that this gain was mostly brought in by gradient clipping. For this reason, we incorporate gradient clipping in all the new approaches.

Duel Clip does better than Single Clip on 75.4% of the games (43 out of 57). It also achieves higher scores compared to the Single baseline on 80.7% (46 out of 57) of the games. Of all the games with 18 actions, Duel Clip is better 86.6% of the time (26 out of 30). This is consistent with the findings of the previous section. Overall, our agent (Duel Clip) achieves human level performance on 42 out of 57 games. Raw scores for all the games, as well as measurements in human performance percentage, are presented in the Appendix.

Robustness to human starts. One shortcoming of the 30 no-ops metric is that an agent does not necessarily have to generalize well to play the Atari games. Due to the deterministic nature of the Atari environment, from a unique starting point, an agent could learn to achieve good performance by simply remembering sequences of actions.

To obtain a more robust measure, we adopt the methodology of Nair et al. (2015). Specifically, for each game, we use 100 starting points sampled from a human expert's trajectory. From each of these points, an evaluation episode is launched for up to 108,000 frames. The agents are evaluated only on rewards accrued after the starting point. We refer to this metric as Human Starts.

As shown in Table 1, under the Human Starts metric, Duel Clip once again outperforms the single stream variants. In particular, our agent does better than the Single baseline on 70.2% (40 out of 57) of the games, and on games of 18 actions, Duel Clip is better 83.3% of the time (25 out of 30).

Combining with Prioritized Experience Replay. The dueling architecture can be easily combined with other algorithmic improvements. In particular, prioritization of the experience replay has been shown to significantly improve performance of Atari games (Schaul et al., 2016). Furthermore, as prioritization and the dueling architecture address very different aspects of the learning process, their combination is promising. So in our final experiment, we investigate the integration of the dueling architecture with prioritized experience replay. We use the prioritized variant of DDQN (Prior. Single) as the new baseline algorithm, which replaces the uniform sampling of experience tuples with rank-based prioritized sampling. We keep all the parameters of the prioritized replay as described in (Schaul et al., 2016), namely a priority exponent of 0.7, and an annealing schedule on the importance sampling exponent from 0.5 to 1. We combine this baseline with our dueling architecture (as above), and again use gradient clipping (Prior. Duel Clip).

Note that, although orthogonal in their objectives, these extensions (prioritization, dueling and gradient clipping) interact in subtle ways. For example, prioritization interacts with gradient clipping, as sampling transitions with high absolute TD-errors more often leads to gradients with higher norms. To avoid adverse interactions, we roughly re-tuned the learning rate and the gradient clipping norm on a subset of 9 games. As a result of rough tuning, we settled on 6.25 x 10^-5 for the learning rate and 10 for the gradient clipping norm (the same as in the previous section).

When evaluated on all 57 Atari games, our prioritized dueling agent performs significantly better than both the prioritized baseline agent and the dueling agent alone. The full mean and median performance against the human performance percentage is shown in Table 1. When initializing the games using up to 30 no-op actions, we observe mean and median scores of 591% and 172% respectively. The direct comparison between the prioritized baseline and prioritized dueling versions, using the metric described in Equation (10), is presented in Figure 5.
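For orientation, the following is a rough NumPy sketch of the rank-based prioritized sampling used by the Prior. baselines, with the priority exponent and an importance-sampling exponent as quoted above; the actual implementation of Schaul et al. (2016) uses a more efficient precomputed, segment-based scheme, so this is only illustrative.

# Rough, assumption-laden sketch of rank-based prioritized sampling.
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.7, beta=0.5, rng=np.random):
    n = len(td_errors)
    # Rank transitions by |TD-error|; priority p_i = 1 / rank(i), probability P(i) proportional to p_i^alpha.
    ranks = np.empty(n, dtype=np.int64)
    ranks[np.argsort(-np.abs(td_errors))] = np.arange(1, n + 1)
    priorities = (1.0 / ranks) ** alpha
    probs = priorities / priorities.sum()
    idx = rng.choice(n, size=batch_size, p=probs)
    # Importance-sampling weights w_i = (N * P(i))^(-beta), normalized by the maximum weight.
    weights = (n * probs[idx]) ** (-beta)
    weights /= (n * probs.min()) ** (-beta)
    return idx, weights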
Figure 5. Improvements of dueling architecture over Prioritized DDQN baseline, using the same metric as Figure 4. Again, the dueling architecture leads to significant improvements over the single-stream baseline on the majority of games.

The combination of prioritized replay and the dueling network results in vast improvements over the previous state-of-the-art in the popular ALE benchmark.

Saliency maps. To better understand the roles of the value and the advantage streams, we compute saliency maps (Simonyan et al., 2013). More specifically, to visualize the salient part of the image as seen by the value stream, we compute the absolute value of the Jacobian of V-hat with respect to the input frames: |\nabla_s V-hat(s; \theta)|. Similarly, to visualize the salient part of the image as seen by the advantage stream, we compute |\nabla_s A-hat(s, argmax_{a'} A-hat(s, a'); \theta)|. Both quantities are of the same dimensionality as the input frames and therefore can be visualized easily alongside the input frames.

Here, we place the gray scale input frames in the green and blue channel and the saliency maps in the red channel. All three channels together form an RGB image. Figure 2 depicts the value and advantage saliency maps on the Enduro game for two different time steps. As observed in the introduction, the value stream pays attention to the horizon where the appearance of a car could affect future performance. The value stream also pays attention to the score. The advantage stream, on the other hand, cares more about cars that are on an immediate collision course.
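A minimal autograd sketch of the saliency computation just described (not the authors' code); value_fn and advantage_fn are assumed to expose the two streams of a trained dueling network before aggregation.

# Minimal sketch under the assumptions stated above.
import torch

def saliency_maps(value_fn, advantage_fn, frames):
    s = frames.clone().requires_grad_(True)            # input frames, shape (1, C, H, W)
    v = value_fn(s).sum()
    value_saliency = torch.autograd.grad(v, s)[0].abs()

    s = frames.clone().requires_grad_(True)
    a = advantage_fn(s)
    a_star = a.max(dim=1).values.sum()                 # A-hat evaluated at its argmax action
    advantage_saliency = torch.autograd.grad(a_star, s)[0].abs()
    # Both maps have the same shape as the input frames, so they can be overlaid on them
    # (e.g. in the red channel, with the grayscale frames in the green and blue channels).
    return value_saliency, advantage_saliency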
5. Discussion

The advantage of the dueling architecture lies partly in its ability to learn the state-value function efficiently. With every update of the Q values in the dueling architecture, the value stream V is updated; this contrasts with the updates in a single-stream architecture, where only the value for one of the actions is updated and the values for all other actions remain untouched. This more frequent updating of the value stream in our approach allocates more resources to V, and thus allows for better approximation of the state values, which in turn need to be accurate for temporal-difference-based methods like Q-learning to work (Sutton & Barto, 1998). This phenomenon is reflected in the experiments, where the advantage of the dueling architecture over single-stream Q networks grows when the number of actions is large.

Furthermore, the differences between Q-values for a given state are often very small relative to the magnitude of Q. For example, after training with DDQN on the game of Seaquest, the average action gap (the gap between the Q values of the best and the second best action in a given state) across visited states is roughly 0.04, whereas the average state value across those states is about 15. This difference in scales means that small amounts of noise in the updates can lead to reorderings of the actions, and thus make the nearly greedy policy switch abruptly. The dueling architecture, with its separate advantage stream, is robust to such effects.

6. Conclusions

We introduced a new neural network architecture that decouples value and advantage in deep Q-networks, while sharing a common feature learning module. The new dueling architecture, in combination with some algorithmic improvements, leads to dramatic improvements over existing approaches for deep RL in the challenging Atari domain. The results presented in this paper are the new state-of-the-art in this popular domain.

References

Ba, J., Mnih, V., and Kavukcuoglu, K. Multiple object recognition with visual attention. In ICLR, 2015.

Baird, L.C. Advantage updating. Technical Report WL-TR-93-1146, Wright-Patterson Air Force Base, 1993.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.

Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., and Munos, R. Increasing the action gap: New operators for reinforcement learning. In AAAI, 2016. To appear.

Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. Advances in optimizing recurrent networks. In ICASSP, pp. 8624-8628, 2013.

Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36:193-202, 1980.

Guo, X., Singh, S., Lee, H., Lewis, R. L., and Wang, X. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In NIPS, pp. 3338-3346, 2014.

Harmon, M.E. and Baird, L.C. Multi-player residual advantage learning with general function approximation. Technical Report WL-TR-1065, Wright-Patterson Air Force Base, 1996.

Harmon, M.E., Baird, L.C., and Klopf, A.H. Advantage updating applied to a differential game. In Tesauro, G., Touretzky, D.S., and Leen, T.K. (eds.), NIPS, 1995.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436-444, 2015.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.

Lin, L.J. Reinforcement learning for robots using neural networks. PhD thesis, School of Computer Science, Carnegie Mellon University, 1993.

Maddison, C. J., Huang, A., Sutskever, I., and Silver, D. Move evaluation in Go using deep convolutional neural networks. In ICLR, 2015.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In ICLR, 2016.

Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Stadie, B. C., Levine, S., and Abbeel, P. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.

Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, 1998.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pp. 1057-1063, 2000.

van Hasselt, H. Double Q-learning. NIPS, 23:2613-2621, 2010.

van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461, 2015.
van Seijen, H., van Hasselt, H., Whiteson, S., and Wiering, M. A theoretical and empirical analysis of Expected Sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 177-184, 2009.

Watter, M., Springenberg, J. T., Boedecker, J., and Riedmiller, M. A. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, 2015.
A. Double DQN Algorithm

Algorithm 1: Double DQN Algorithm.

input: D -- empty replay buffer; \theta -- initial network parameters; \theta^- -- copy of \theta
input: N_r -- replay buffer maximum size; N_b -- training batch size; N^- -- target network replacement frequency
for episode e in {1, 2, ..., M} do
    Initialize frame sequence x <- ()
    for t in {0, 1, ...} do
        Set state s <- x, sample action a ~ \pi_B
        Sample next frame x_t from environment E given (s, a), receive reward r, and append x_t to x
        if |x| > N_f then delete oldest frame x_tmin from x end
        Set s' <- x, and add transition tuple (s, a, r, s') to D, replacing the oldest tuple if |D| >= N_r
        Sample a minibatch of N_b tuples (s, a, r, s') ~ Unif(D)
        Construct target values, one for each of the N_b tuples:
            Define a_max(s'; \theta) = argmax_{a'} Q(s', a'; \theta)
            y_j = r                                                 if s' is terminal
            y_j = r + \gamma Q(s', a_max(s'; \theta); \theta^-)     otherwise
        Do a gradient descent step with loss || y_j - Q(s, a; \theta) ||^2
        Replace target parameters \theta^- <- \theta every N^- steps
    end
end
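For reference, a compact and hedged Python sketch of the inner update of Algorithm 1 (not the authors' implementation). It assumes the replay buffer is a list or deque of (s, a, r, s', done) tuples with tensor states, and it omits frame stacking and epsilon-greedy acting.

# Hedged sketch of Algorithm 1's inner update, under the assumptions stated above.
import random
import torch
import torch.nn.functional as F

def double_dqn_update(online_net, target_net, optimizer, replay, batch_size=32, gamma=0.99):
    batch = random.sample(replay, batch_size)                    # (s, a, r, s', done) ~ Unif(D)
    s, a, r, s2, done = zip(*batch)
    s, s2 = torch.stack(s), torch.stack(s2)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta)
    with torch.no_grad():
        a_max = online_net(s2).argmax(dim=1, keepdim=True)       # a_max(s'; theta)
        bootstrap = target_net(s2).gather(1, a_max).squeeze(1)   # Q(s', a_max; theta^-)
        y = r + gamma * (1.0 - done) * bootstrap                 # y_j; reduces to r when s' is terminal
    loss = F.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Elsewhere in the training loop, the target network is refreshed every N^- steps:
# target_net.load_state_dict(online_net.state_dict())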

Table 2. Raw scores across all games. Starting with 30 no-op actions.

Games                  No. Actions   Random      Human       DQN         DDQN        Duel        Prior.      Prior. Duel.
Alien 18 227.8 7,127.7 1,620.0 3,747.7 4,461.4 4,203.8 3,941.0
Amidar 10 5.8 1,719.5 978.0 1,793.3 2,354.5 1,838.9 2,296.8
Assault 7 222.4 742.0 4,280.4 5,393.2 4,621.0 7,672.1 11,477.0
Asterix 9 210.0 8,503.3 4,359.0 17,356.5 28,188.0 31,527.0 375,080.0
Asteroids 14 719.1 47,388.7 1,364.5 734.7 2,837.7 2,654.3 1,192.7
Atlantis 4 12,850.0 29,028.1 279,987.0 106,056.0 382,572.0 357,324.0 395,762.0
Bank Heist 18 14.2 753.1 455.0 1,030.6 1,611.9 1,054.6 1,503.1
Battle Zone 18 2,360.0 37,187.5 29,900.0 31,700.0 37,150.0 31,530.0 35,520.0
Beam Rider 9 363.9 16,926.5 8,627.5 13,772.8 12,164.0 23,384.2 30,276.5
Berzerk 18 123.7 2,630.4 585.6 1,225.4 1,472.6 1,305.6 3,409.0
Bowling 6 23.1 160.7 50.4 68.1 65.5 47.9 46.7
Boxing 18 0.1 12.1 88.0 91.6 99.4 95.6 98.9
Breakout 4 1.7 30.5 385.5 418.5 345.3 373.9 366.0
Centipede 18 2,090.9 12,017.0 4,657.7 5,409.4 7,561.4 4,463.2 7,687.5
Chopper Command 18 811.0 7,387.8 6,126.0 5,809.0 11,215.0 8,600.0 13,185.0
Crazy Climber 9 10,780.5 35,829.4 110,763.0 117,282.0 143,570.0 141,161.0 162,224.0
Defender 18 2,874.5 18,688.9 23,633.0 35,338.5 42,214.0 31,286.5 41,324.5
Demon Attack 6 152.1 1,971.0 12,149.4 58,044.2 60,813.3 71,846.4 72,878.6
Double Dunk 18 -18.6 -16.4 -6.6 -5.5 0.1 18.5 -12.5
Enduro 9 0.0 860.5 729.0 1,211.8 2,258.2 2,093.0 2,306.4
Fishing Derby 18 -91.7 -38.7 -4.9 15.5 46.4 39.5 41.3
Freeway 3 0.0 29.6 30.8 33.3 0.0 33.7 33.0
Frostbite 18 65.2 4,334.7 797.4 1,683.3 4,672.8 4,380.1 7,413.0
Gopher 8 257.6 2,412.5 8,777.4 14,840.8 15,718.4 32,487.2 104,368.2
Gravitar 18 173.0 3,351.4 473.0 412.0 588.0 548.5 238.0
H.E.R.O. 18 1,027.0 30,826.4 20,437.8 20,130.2 20,818.2 23,037.7 21,036.5
Ice Hockey 18 -11.2 0.9 -1.9 -2.7 0.5 1.3 -0.4
James Bond 18 29.0 302.8 768.5 1,358.0 1,312.5 5,148.0 812.0
Kangaroo 18 52.0 3,035.0 7,259.0 12,992.0 14,854.0 16,200.0 1,792.0
Krull 18 1,598.0 2,665.5 8,422.3 7,920.5 11,451.9 9,728.0 10,374.4
Kung-Fu Master 14 258.5 22,736.3 26,059.0 29,710.0 34,294.0 39,581.0 48,375.0
Montezuma’s Revenge 18 0.0 4,753.3 0.0 0.0 0.0 0.0 0.0
Ms. Pac-Man 9 307.3 6,951.6 3,085.6 2,711.4 6,283.5 6,518.7 3,327.3
Name This Game 6 2,292.3 8,049.0 8,207.8 10,616.0 11,971.1 12,270.5 15,572.5
Phoenix 8 761.4 7,242.6 8,485.2 12,252.5 23,092.2 18,992.7 70,324.3
Pitfall! 18 -229.4 6,463.7 -286.1 -29.9 0.0 -356.5 0.0
Pong 3 -20.7 14.6 19.5 20.9 21.0 20.6 20.9
Private Eye 18 24.9 69,571.3 146.7 129.7 103.0 200.0 206.0
Q*Bert 6 163.9 13,455.0 13,117.3 15,088.5 19,220.3 16,256.5 18,760.3
River Raid 18 1,338.5 17,118.0 7,377.6 14,884.5 21,162.6 14,522.3 20,607.6
Road Runner 18 11.5 7,845.0 39,544.0 44,127.0 69,524.0 57,608.0 62,151.0
Robotank 18 2.2 11.9 63.9 65.1 65.3 62.6 27.5
Seaquest 18 68.4 42,054.7 5,860.6 16,452.7 50,254.2 26,357.8 931.6
Skiing 3 -17,098.1 -4,336.9 -13,062.3 -9,021.8 -8,857.4 -9,996.9 -19,949.9
Solaris 18 1,236.3 12,326.7 3,482.8 3,067.8 2,250.8 4,309.0 133.4
Space Invaders 6 148.0 1,668.7 1,692.3 2,525.5 6,427.3 2,865.8 15,311.5
Star Gunner 18 664.0 10,250.0 54,282.0 60,142.0 89,238.0 63,302.0 125,117.0
Surround 5 -10.0 6.5 -5.6 -2.9 4.4 8.9 1.2
Tennis 18 -23.8 -8.3 12.2 -22.8 5.1 0.0 0.0
Time Pilot 10 3,568.0 5,229.2 4,870.0 8,339.0 11,666.0 9,197.0 7,553.0
Tutankham 8 11.4 167.6 68.1 218.4 211.4 204.6 245.9
Up and Down 6 533.4 11,693.2 9,989.9 22,972.2 44,939.6 16,154.1 33,879.1
Venture 18 0.0 1,187.5 163.0 98.0 497.0 54.0 48.0
Video Pinball 9 16,256.9 17,667.9 196,760.4 309,941.9 98,209.5 282,007.3 479,197.0
Wizard Of Wor 10 563.5 4,756.5 2,704.0 7,492.0 7,855.0 4,802.0 12,352.0
Yars’ Revenge 18 3,092.9 54,576.9 18,098.9 11,712.6 49,622.1 11,357.0 69,618.1
Zaxxon 18 32.5 9,173.3 5,363.0 10,163.0 12,944.0 10,469.0 13,886.0

Table 3. Raw scores across all games. Starting with Human starts.
Games                  No. Actions   Random      Human       DQN         DDQN        Duel        Prior.      Prior. Duel.
Alien 18 128.3 6,371.3 634.0 1,033.4 1,486.5 1,334.7 823.7
Amidar 10 11.8 1,540.4 178.4 169.1 172.7 129.1 238.4
Assault 7 166.9 628.9 3,489.3 6,060.8 3,994.8 6,548.9 10,950.6
Asterix 9 164.5 7,536.0 3,170.5 16,837.0 15,840.0 22,484.5 364,200.0
Asteroids 14 871.3 36,517.3 1,458.7 1,193.2 2,035.4 1,745.1 1,021.9
Atlantis 4 13,463.0 26,575.0 292,491.0 319,688.0 445,360.0 330,647.0 423,252.0
Bank Heist 18 21.7 644.5 312.7 886.0 1,129.3 876.6 1,004.6
Battle Zone 18 3,560.0 33,030.0 23,750.0 24,740.0 31,320.0 25,520.0 30,650.0
Beam Rider 9 254.6 14,961.0 9,743.2 17,417.2 14,591.3 31,181.3 37,412.2
Berzerk 18 196.1 2,237.5 493.4 1,011.1 910.6 865.9 2,178.6
Bowling 6 35.2 146.5 56.5 69.6 65.7 52.0 50.4
Boxing 18 -1.5 9.6 70.3 73.5 77.3 72.3 79.2
Breakout 4 1.6 27.9 354.5 368.9 411.6 343.0 354.6
Centipede 18 1,925.5 10,321.9 3,973.9 3,853.5 4,881.0 3,489.1 5,570.2
Chopper Command 18 644.0 8,930.0 5,017.0 3,495.0 3,784.0 4,635.0 8,058.0
Crazy Climber 9 9,337.0 32,667.0 98,128.0 113,782.0 124,566.0 127,512.0 127,853.0
Defender 18 1,965.5 14,296.0 15,917.5 27,510.0 33,996.0 23,666.5 34,415.0
Demon Attack 6 208.3 3,442.8 12,550.7 69,803.4 56,322.8 61,277.5 73,371.3
Double Dunk 18 -16.0 -14.4 -6.0 -0.3 -0.8 16.0 -10.7
Enduro 9 -81.8 740.2 626.7 1,216.6 2,077.4 1,831.0 2,223.9
Fishing Derby 18 -77.1 5.1 -1.6 3.2 -4.1 9.8 17.0
Freeway 3 0.1 25.6 26.9 28.8 0.2 28.9 28.2
Frostbite 18 66.4 4,202.8 496.1 1,448.1 2,332.4 3,510.0 4,038.4
Gopher 8 250.0 2,311.0 8,190.4 15,253.0 20,051.4 34,858.8 105,148.4
Gravitar 18 245.5 3,116.0 298.0 200.5 297.0 269.5 167.0
H.E.R.O. 18 1,580.3 25,839.4 14,992.9 14,892.5 15,207.9 20,889.9 15,459.2
Ice Hockey 18 -9.7 0.5 -1.6 -2.5 -1.3 -0.2 0.5
James Bond 18 33.5 368.5 697.5 573.0 835.5 3,961.0 585.0
Kangaroo 18 100.0 2,739.0 4,496.0 11,204.0 10,334.0 12,185.0 861.0
Krull 18 1,151.9 2,109.1 6,206.0 6,796.1 8,051.6 6,872.8 7,658.6
Kung-Fu Master 14 304.0 20,786.8 20,882.0 30,207.0 24,288.0 31,676.0 37,484.0
Montezuma’s Revenge 18 25.0 4,182.0 47.0 42.0 22.0 51.0 24.0
Ms. Pac-Man 9 197.8 15,375.0 1,092.3 1,241.3 2,250.6 1,865.9 1,007.8
Name This Game 6 1,747.8 6,796.0 6,738.8 8,960.3 11,185.1 10,497.6 13,637.9
Phoenix 8 1,134.4 6,686.2 7,484.8 12,366.5 20,410.5 16,903.6 63,597.0
Pitfall! 18 -348.8 5,998.9 -113.2 -186.7 -46.9 -427.0 -243.6
Pong 3 -18.0 15.5 18.0 19.1 18.8 18.9 18.4
Private Eye 18 662.8 64,169.1 207.9 -575.5 292.6 670.7 1,277.6
Q*Bert 6 183.0 12,085.0 9,271.5 11,020.8 14,175.8 9,944.0 14,063.0
River Raid 18 588.3 14,382.2 4,748.5 10,838.4 16,569.4 11,807.2 16,496.8
Road Runner 18 200.0 6,878.0 35,215.0 43,156.0 58,549.0 52,264.0 54,630.0
Robotank 18 2.4 8.9 58.7 59.1 62.0 56.2 24.7
Seaquest 18 215.5 40,425.8 4,216.7 14,498.0 37,361.6 25,463.7 1,431.2
Skiing 3 -15,287.4 -3,686.6 -12,142.1 -11,490.4 -11,928.0 -10,169.1 -18,955.8
Solaris 18 2,047.2 11,032.6 1,295.4 810.0 1,768.4 2,272.8 280.6
Space Invaders 6 182.6 1,464.9 1,293.8 2,628.7 5,993.1 3,912.1 8,978.0
Star Gunner 18 697.0 9,528.0 52,970.0 58,365.0 90,804.0 61,582.0 127,073.0
Surround 5 -9.7 5.4 -6.0 1.9 4.0 5.9 -0.2
Tennis 18 -21.4 -6.7 11.1 -7.8 4.4 -5.3 -13.2
Time Pilot 10 3,273.0 5,650.0 4,786.0 6,608.0 6,601.0 5,963.0 4,871.0
Tutankham 8 12.7 138.3 45.6 92.2 48.0 56.9 108.6
Up and Down 6 707.2 9,896.1 8,038.5 19,086.9 24,759.2 12,157.4 22,681.3
Venture 18 18.0 1,039.0 136.0 21.0 200.0 94.0 29.0
Video Pinball 9 20,452.0 15,641.1 154,414.1 367,823.7 110,976.2 295,972.8 447,408.6
Wizard Of Wor 10 804.0 4,556.0 1,609.0 6,201.0 7,054.0 5,727.0 10,471.0
Yars’ Revenge 18 1,476.9 47,135.2 4,577.5 6,270.6 25,976.5 4,687.4 58,145.9
Zaxxon 18 475.0 8,443.0 4,412.0 8,593.0 10,164.0 9,474.0 11,320.0

Table 4. Normalized scores across all games. Starting with 30 no-op actions.
Games                  DQN        DDQN       Duel       Prior.     Prior. Duel.
Alien 20.2% 51.0% 61.4% 57.6% 53.8%
Amidar 56.7% 104.3% 137.1% 107.0% 133.7%
Assault 781.0% 995.1% 846.5% 1433.7% 2166.0%
Asterix 50.0% 206.8% 337.4% 377.6% 4520.1%
Asteroids 1.4% 0.0% 4.5% 4.1% 1.0%
Atlantis 1651.2% 576.1% 2285.3% 2129.3% 2366.9%
Bank Heist 59.7% 137.6% 216.2% 140.8% 201.5%
Battle Zone 79.1% 84.2% 99.9% 83.8% 95.2%
Beam Rider 49.9% 81.0% 71.2% 139.0% 180.6%
Berzerk 18.4% 44.0% 53.8% 47.2% 131.1%
Bowling 19.8% 32.7% 30.8% 18.0% 17.1%
Boxing 732.5% 762.1% 827.1% 795.5% 823.1%
Breakout 1334.5% 1449.2% 1194.5% 1294.3% 1266.6%
Centipede 25.9% 33.4% 55.1% 23.9% 56.4%
Chopper Command 80.8% 76.0% 158.2% 118.4% 188.1%
Crazy Climber 399.1% 425.2% 530.1% 520.5% 604.6%
Defender 131.3% 205.3% 248.8% 179.7% 243.1%
Demon Attack 659.6% 3182.8% 3335.0% 3941.6% 3998.3%
Double Dunk 557.7% 607.9% 866.5% 1723.3% 280.5%
Enduro 84.7% 140.8% 262.4% 243.2% 268.0%
Fishing Derby 163.8% 202.4% 260.7% 247.7% 251.1%
Freeway 104.0% 112.5% 0.1% 114.0% 111.3%
Frostbite 17.1% 37.9% 107.9% 101.1% 172.1%
Gopher 395.4% 676.7% 717.5% 1495.6% 4831.3%
Gravitar 9.4% 7.5% 13.1% 11.8% 2.0%
H.E.R.O. 65.1% 64.1% 66.4% 73.9% 67.1%
Ice Hockey 76.9% 70.0% 96.4% 103.2% 89.6%
James Bond 270.1% 485.4% 468.8% 1869.8% 286.0%
Kangaroo 241.6% 433.8% 496.2% 541.3% 58.3%
Krull 639.3% 592.3% 923.1% 761.6% 822.2%
Kung-Fu Master 114.8% 131.0% 151.4% 174.9% 214.1%
Montezuma’s Revenge 0.0% 0.0% 0.0% 0.0% 0.0%
Ms. Pac-Man 41.8% 36.2% 89.9% 93.5% 45.5%
Name This Game 102.8% 144.6% 168.1% 173.3% 230.7%
Phoenix 119.2% 177.3% 344.5% 281.3% 1073.3%
Pitfall! -0.8% 3.0% 3.4% -1.9% 3.4%
Pong 114.0% 117.8% 118.2% 117.1% 118.0%
Private Eye 0.2% 0.2% 0.1% 0.3% 0.3%
Q*Bert 97.5% 112.3% 143.4% 121.1% 139.9%
River Raid 38.3% 85.8% 125.6% 83.6% 122.1%
Road Runner 504.7% 563.2% 887.4% 735.3% 793.3%
Robotank 631.5% 643.7% 645.1% 617.5% 259.5%
Seaquest 13.8% 39.0% 119.5% 62.6% 2.1%
Skiing 31.6% 63.3% 64.6% 55.6% -22.3%
Solaris 20.3% 16.5% 9.1% 27.7% -9.9%
Space Invaders 101.6% 156.3% 412.9% 178.7% 997.2%
Star Gunner 559.3% 620.5% 924.0% 653.4% 1298.3%
Surround 26.5% 43.2% 86.9% 114.6% 67.6%
Tennis 231.3% 6.8% 186.2% 153.2% 153.2%
Time Pilot 78.4% 287.2% 487.5% 338.9% 239.9%
Tutankham 36.3% 132.5% 128.1% 123.7% 150.1%
Up and Down 84.7% 201.1% 397.9% 140.0% 298.8%
Venture 13.7% 8.3% 41.9% 4.5% 4.0%
Video Pinball 1113.7% 1754.3% 555.9% 1596.2% 2712.2%
Wizard Of Wor 51.0% 165.2% 173.9% 101.1% 281.1%
Yars’ Revenge 29.1% 16.7% 90.4% 16.1% 129.2%
Zaxxon 58.3% 110.8% 141.3% 114.2% 151.6%

Table 5. Normalized scores across all games. Starting with Human Starts.
Games                  DQN        DDQN       Duel       Prior.     Prior. Duel.
Alien 8.1% 14.5% 21.8% 19.3% 11.1%
Amidar 10.9% 10.3% 10.5% 7.7% 14.8%
Assault 719.2% 1275.9% 828.6% 1381.5% 2334.4%
Asterix 40.8% 226.2% 212.7% 302.8% 4938.4%
Asteroids 1.6% 0.9% 3.3% 2.5% 0.4%
Atlantis 2128.0% 2335.5% 3293.9% 2419.0% 3125.3%
Bank Heist 46.7% 138.8% 177.8% 137.3% 157.8%
Battle Zone 68.5% 71.9% 94.2% 74.5% 91.9%
Beam Rider 64.5% 116.7% 97.5% 210.3% 252.7%
Berzerk 14.6% 39.9% 35.0% 32.8% 97.1%
Bowling 19.2% 30.9% 27.5% 15.1% 13.7%
Boxing 648.2% 677.0% 711.2% 666.7% 728.5%
Breakout 1341.9% 1396.7% 1559.0% 1298.3% 1342.4%
Centipede 24.4% 23.0% 35.2% 18.6% 43.4%
Chopper Command 52.8% 34.4% 37.9% 48.2% 89.5%
Crazy Climber 380.6% 447.7% 493.9% 506.5% 508.0%
Defender 113.2% 207.2% 259.8% 176.0% 263.2%
Demon Attack 381.6% 2151.6% 1734.8% 1888.0% 2261.9%
Double Dunk 622.5% 982.5% 948.7% 1998.7% 328.7%
Enduro 86.2% 158.0% 262.7% 232.7% 280.5%
Fishing Derby 91.8% 97.7% 88.8% 105.7% 114.5%
Freeway 105.1% 112.5% 0.6% 113.1% 110.2%
Frostbite 10.4% 33.4% 54.8% 83.2% 96.0%
Gopher 385.3% 727.9% 960.8% 1679.2% 5089.7%
Gravitar 1.8% -1.6% 1.8% 0.8% -2.7%
H.E.R.O. 55.3% 54.9% 56.2% 79.6% 57.2%
Ice Hockey 79.2% 70.8% 82.4% 92.5% 99.4%
James Bond 198.2% 161.0% 239.4% 1172.4% 164.6%
Kangaroo 166.6% 420.8% 387.8% 457.9% 28.8%
Krull 528.0% 589.7% 720.8% 597.7% 679.8%
Kung-Fu Master 100.5% 146.0% 117.1% 153.2% 181.5%
Montezuma’s Revenge 0.5% 0.4% -0.1% 0.6% -0.0%
Ms. Pac-Man 5.9% 6.9% 13.5% 11.0% 5.3%
Name This Game 98.9% 142.9% 186.9% 173.3% 235.5%
Phoenix 114.4% 202.3% 347.2% 284.0% 1125.1%
Pitfall! 3.7% 2.6% 4.8% -1.2% 1.7%
Pong 107.6% 110.9% 110.0% 110.1% 108.6%
Private Eye -0.7% -1.9% -0.6% 0.0% 1.0%
Q*Bert 76.4% 91.1% 117.6% 82.0% 116.6%
River Raid 30.2% 74.3% 115.9% 81.3% 115.3%
Road Runner 524.3% 643.2% 873.7% 779.6% 815.1%
Robotank 863.3% 868.7% 913.3% 824.4% 341.1%
Seaquest 10.0% 35.5% 92.4% 62.8% 3.0%
Skiing 27.1% 32.7% 29.0% 44.1% -31.6%
Solaris -8.4% -13.8% -3.1% 2.5% -19.7%
Space Invaders 86.7% 190.8% 453.1% 290.8% 685.9%
Star Gunner 591.9% 653.0% 1020.3% 689.4% 1431.0%
Surround 24.7% 76.9% 91.1% 103.2% 62.9%
Tennis 220.8% 92.1% 175.0% 109.6% 55.6%
Time Pilot 63.7% 140.3% 140.0% 113.2% 67.2%
Tutankham 26.2% 63.3% 28.1% 35.2% 76.4%
Up and Down 79.8% 200.0% 261.8% 124.6% 239.1%
Venture 11.6% 0.3% 17.8% 7.4% 1.1%
Video Pinball 987.2% 2351.6% 709.5% 1892.3% 2860.5%
Wizard Of Wor 21.5% 143.8% 166.6% 131.2% 257.6%
Yars’ Revenge 6.8% 10.5% 53.7% 7.0% 124.1%
Zaxxon 49.4% 101.9% 121.6% 112.9% 136.1%
Mean 219.6% 332.9% 343.8% 386.7% 567.0%
Median 68.5% 110.9% 117.1% 112.9% 115.3%
