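This paper considers the problem of learning probability distributions through Bellman dynamics in distributional reinforcement learning. Previous work learned a finite number of statistics of each return distribution with neural networks, but that approach is constrained by the functional form of the return distribution, which limits its expressive power, and it is difficult to maintain the predefined statistics. To remove these restrictions, the paper proposes learning deterministic pseudo-samples of the return distribution using Maximum Mean Discrepancy (MMD), a hypothesis-testing technique. This implicitly matches all moments between the return distribution and its Bellman target, which guarantees convergence of the distributional Bellman operator, and the paper also presents a finite-sample analysis of the distribution approximation. In experiments, the proposed method outperforms the standard distributional RL baselines and sets a new record on Atari games among agents that do not use distributed training.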
⚫ Introduction
▪ Estimation of the probability distribution
▪ Limitation of distribution estimation in conventional Reinforcement Learning(RL)
⚫ Distributional RL via Moment Matching(MMDQN)
▪ Backgrounds
⚫ Experiment results
Estimation of the probability distribution
⚫ Moments of a function are quantitative measures related to the shape of the function's graph
▪ A probability density function (PDF) is commonly summarized by four measures: mean, variance, skewness, and kurtosis
• Mean: first moment, $\mu = E[X]$
• Variance: second central moment, $\sigma^2 = E[(X-\mu)^2]$
• Skewness: third standardized moment, $\gamma = E[((X-\mu)/\sigma)^3]$
• Kurtosis: fourth standardized moment, $\mathrm{Kurt}[X] = E[(X-\mu)^4] / (E[(X-\mu)^2])^2$
⚫ Methods for estimating a PDF $f(x)$ in machine learning
▪ Explicit methods
• Fix predetermined statistics, each given by a functional $\zeta: f(x) \mapsto \mathbb{R}$
– Median of $f(x)$: $F^{-1}(1/2)$; mean of $f(x)$: $\int x\,P(x)\,dx$
• Estimate the predetermined statistics from collected data
▪ Implicit methods
• Estimate $f(x; \theta)$ directly from collected data (GAN, VAE, …)
[Figure: shapes of the normal distribution for different $\mu$, $\sigma$]
[Figure: distribution shapes for different $\gamma$, $\mathrm{Kurt}[X]$]
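The four moment-based measures listed on this slide can be estimated directly from samples. The following is a minimal NumPy sketch, assuming i.i.d. samples and the population (biased) formulas; the helper name `sample_moments` is illustrative and does not come from the slides.

```python
import numpy as np

def sample_moments(x):
    """Return (mean, variance, skewness, kurtosis) of a 1-D sample.

    Uses population (biased) formulas that mirror the definitions:
    central moments divided by the matching power of the variance.
    """
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    centered = x - mu
    var = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / var ** 1.5
    kurt = np.mean(centered ** 4) / var ** 2
    return float(mu), float(var), float(skew), float(kurt)

# Samples from a normal distribution with mean 1 and standard deviation 2.
rng = np.random.default_rng(0)
mu, var, skew, kurt = sample_moments(rng.normal(loc=1.0, scale=2.0, size=200_000))
```

For a normal distribution the estimates should land near skewness 0 and kurtosis 3, whatever the mean and variance are.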
Limitation of distribution estimation in conventional Reinforcement Learning
⚫ Naïve approaches to estimating the return
▪ Because the return $G = \sum_t \gamma^t r_{t+1}$ is stochastic, conventional RL approximates $G$ by its expectations $Q(s, a)$ and $V(s)$
• $V(s) = E[\sum_t \gamma^t r_{t+1} \mid s]$
• $Q(s, a) = E[\sum_t \gamma^t r_{t+1} \mid s, a]$
▪ These functions are generally estimated with the standard Bellman operator $\mathcal{T}_B$, which uses only the mean
• $(\mathcal{T}_B V)(s) = E[r(s, \pi(s))] + \gamma E[V(s')]$
• $(\mathcal{T}_B Q)(s, a) = E[r(s, a)] + \gamma E[Q(s', a')]$
▪ The variance, skewness, and kurtosis of the return $G$ are easily neglected
[Figure: two return distributions plotted over value [-]; they have the same mean, but they are not the same distribution]
[Figure: a complex return distribution plotted over value [-]; such distributions could be modeled in this framework]
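To make the "same mean, different distribution" point concrete, here is a small sketch with two hypothetical discrete return distributions. The values and probabilities are invented for illustration, and `mean_and_variance` is an illustrative helper name.

```python
import numpy as np

def mean_and_variance(values, probs):
    """Exact mean and variance of a discrete distribution."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mean = float(np.sum(probs * values))
    var = float(np.sum(probs * (values - mean) ** 2))
    return mean, var

# Hypothetical returns: one concentrated around 5, one split between 0 and 10.
safe = mean_and_variance([4.0, 5.0, 6.0], [0.25, 0.5, 0.25])
risky = mean_and_variance([0.0, 10.0], [0.5, 0.5])
```

Both distributions have mean 5, so an expectation-based $Q(s, a)$ cannot tell them apart, while their variances (0.5 and 25) differ by a factor of 50.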
Backgrounds (I)
⚫ Basics of probability space $(\Omega, \Sigma, P)$
▪ Sample space $\Omega$
• In a probability space, $\Omega$ is the set of all possible outcomes $\omega$ and is called the sample space
▪ $\sigma$-algebra $\Sigma$
• A $\sigma$-algebra $\Sigma$ is a set of subsets $A$ of $\Omega$ s.t.:
– $\emptyset \in \Sigma$
– If $A \in \Sigma$, then $A^C \in \Sigma$
– If $A_1, A_2, \ldots \in \Sigma$, then $\bigcup_{i=1}^{\infty} A_i \in \Sigma$
– Examples: $\{\emptyset, \Omega\}$ is a $\sigma$-algebra, and $\{\emptyset, E, E^C, \Omega\}$ with $E \subset \Omega$ is a $\sigma$-algebra
• $B(\mathbb{R})$ is the smallest $\sigma$-algebra containing every open interval in $I = \{(a, b) \mid a, b \in \mathbb{R}, a < b\}$. It is called the Borel $\sigma$-algebra, and its elements are Borel sets
– $B(\mathbb{R}) = \sigma(I)$
▪ Random variable
• Let $(\Omega, \Sigma)$ and $(E, \mathcal{E})$ be measurable spaces and $f$ a function from $\Omega$ to $E$. The function $f$ is called a measurable function iff $f^{-1}(\mathcal{E}) \subset \Sigma$
• If $f$ is measurable from $(\Omega, \Sigma)$ to $(E, \mathcal{E})$, it is called a Borel function or a random variable (RV)
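On a finite sample space the $\sigma$-algebra axioms can be checked mechanically. The sketch below assumes a finite $\Omega$, where closure under pairwise unions already implies closure under countable unions; `is_sigma_algebra` is an illustrative name.

```python
from itertools import combinations

def is_sigma_algebra(omega, family):
    """Check the sigma-algebra axioms for a family of subsets of a finite set."""
    omega = frozenset(omega)
    family = {frozenset(s) for s in family}
    # Every member must be a subset of omega, and the empty set must belong.
    if any(not s <= omega for s in family) or frozenset() not in family:
        return False
    # Closed under complement.
    if any(omega - s not in family for s in family):
        return False
    # Closed under pairwise unions (sufficient when the family is finite).
    return all(a | b in family for a, b in combinations(family, 2))

omega = {1, 2, 3}
trivial = is_sigma_algebra(omega, [set(), omega])
generated = is_sigma_algebra(omega, [set(), {1}, {2, 3}, omega])
broken = is_sigma_algebra(omega, [set(), {1}, omega])  # complement of {1} is missing
```

The second family has the form $\{\emptyset, E, E^C, \Omega\}$ with $E = \{1\}$, matching the example in the list above.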
Backgrounds (II)
⚫ Basics of probability space $(\Omega, \Sigma, P)$
▪ Probability distribution
• Let $(\Omega, \Sigma, \mu)$ be a measure space and $f$ a measurable function from $(\Omega, \Sigma)$ to $(E, \mathcal{E})$. The pushforward measure, denoted $f_\#\mu$, $f_\#(\mu)$, or $\mu \circ f^{-1}$, is the measure on $\mathcal{E}$ defined as
– $f_\#\mu(Y) = f_\#(\mu)(Y) = (\mu \circ f^{-1})(Y) = \mu(f \in Y) = \mu(f^{-1}(Y)), \quad Y \in \mathcal{E}$
• If $\mu = P$ is a probability measure and $f$ is a random variable, then $P \circ f^{-1}$ is called the distribution (or the law) of $f$
– $P$ is a probability distribution, i.e. a probability measure
– In practice, one often uses $(E, \mathcal{E}) = (\mathbb{R}^n, B(\mathbb{R}^n))$, so that $f_\#\mu: B(\mathbb{R}^n) \to [0, 1]$
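For a discrete measure, the pushforward simply moves each outcome's mass to its image under $f$. The sketch below assumes a measure stored as a dictionary from outcomes to masses; `pushforward` is an illustrative name.

```python
from collections import defaultdict

def pushforward(mu, f):
    """Pushforward f_# mu of a discrete measure mu given as {outcome: mass}."""
    out = defaultdict(float)
    for outcome, mass in mu.items():
        out[f(outcome)] += mass  # mass of outcome is carried to f(outcome)
    return dict(out)

# A fair die as the probability measure P, and the random variable
# f(k) = "k is even". The result is the law P o f^{-1} of f.
P = {k: 1 / 6 for k in range(1, 7)}
law = pushforward(P, lambda k: k % 2 == 0)
```

Here `law` puts (up to floating-point rounding) mass 1/2 on `True` and 1/2 on `False`, which is $P(f^{-1}(\{\text{True}\}))$ and $P(f^{-1}(\{\text{False}\}))$.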
Backgrounds (III)
⚫ Distributional RL
▪ In distributional RL, the cumulative return of a chosen action at a state is modeled by its full distribution, written $Z_\theta(s, a)$, rather than only by its expectation
• The model can therefore capture the intrinsic randomness of the return (higher-order moments, multi-modality of the state-action value), not just its first moment
▪ Existing distributional RL algorithms
• C51: predetermines the support of the return distribution and then trains a categorical distribution on it
• QR-DQN: introduces quantile regression to avoid predetermined supports and to narrow the theory-practice gap
• IQN: samples quantiles to train the full quantile function, and takes the risk of the policy into account
• IDAC: trains the full return distribution directly with an adversarial network, and trains the policy with semi-implicit methods
⚫ Overview of the MMDQN
▪ Unlike existing distributional RL methods, MMDQN makes no assumption about predetermined statistics and can learn unrestricted statistics
• For implementation, however, MMDQN learns deterministic pseudo-samples of the return distribution by minimizing the MMD
– The authors use the Dirac mixture $\hat{\mu}_\theta(s, a) = \frac{1}{N}\sum_{i=1}^{N} \delta_{Z_\theta(s,a)_i}$ to approximate $\mu^\pi(s, a)$
▪ The authors analyze the distributional Bellman operator and establish sufficient conditions for its contraction in MMD
• They show that $\mathcal{T}^\pi$ is a contraction when the kernel is the unrectified kernel $k_\alpha(x, y) := -\|x - y\|^\alpha, \ \forall \alpha \in \mathbb{R}, \ \forall x, y \in \mathcal{X}$
• For practical purposes, however, they demonstrate that the commonly used Gaussian kernel performs better in this framework: $k(x, y) := \exp\left(-\frac{(x-y)^2}{2\sigma^2}\right)$
▪ MMDQN with a Gaussian kernel mixture achieved state-of-the-art results on the 55 Atari 2600 games
• For a fair comparison, they used the same architecture as DQN and QR-DQN
⚫ Problem setting
▪ For any policy $\pi$, let $\mu^\pi = \mathrm{law}(Z^\pi)$ be the distribution of the return RV $Z^\pi(s, a) := \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)$
▪ $\mathcal{T}^\pi \mu(s, a) := \int_{\mathcal{S}} \int_{\mathcal{A}} \int_{\mathbb{R}} (f_{\gamma,r})_\# \mu(s', a') \, \mathcal{R}(dr \mid s, a) \, \pi(da' \mid s') \, P(ds' \mid s, a)$, where $f_{\gamma,r}(z) := r + \gamma z, \ \forall z$, and $(f_{\gamma,r})_\# \mu(s', a')$ is the pushforward measure of $\mu(s', a')$ by $f_{\gamma,r}$
⚫ Algorithmic approach
▪ The authors use the Dirac mixture $\hat{\mu}_\theta(s, a) = \frac{1}{N}\sum_{i=1}^{N} \delta_{Z_\theta(s,a)_i}$ to approximate $\mu^\pi(s, a)$
• They refer to the deterministic samples $Z_\theta(s, a)$ as particles
▪ The goal of the algorithm reduces to learning the particles $Z_\theta(s, a)$ that approximate $\mu^\pi(s, a)$
• To this end, the particles $Z_\theta(s, a)$ are deterministically evolved to minimize the MMD between the approximate distribution and its distributional Bellman target
⚫ Maximum Mean Discrepancy (MMD)
▪ Let $\mathcal{F}$ be a reproducing kernel Hilbert space (RKHS) associated with a continuous kernel $k(\cdot, \cdot)$ on $\mathcal{X}$
▪ The MMD between $p \in P(\mathcal{X})$ and $q \in P(\mathcal{X})$ is defined as
• $\mathrm{MMD}(p, q; \mathcal{F}) := \sup_{f \in \mathcal{F}: \|f\| \le 1} \left(\mathbb{E}[f(Z)] - \mathbb{E}[f(W)]\right) = \left\| \int k(x, \cdot)\, p(dx) - \int k(x, \cdot)\, q(dx) \right\|_{\mathcal{F}} = \left(\mathbb{E}[k(Z, Z')] + \mathbb{E}[k(W, W')] - 2\,\mathbb{E}[k(Z, W)]\right)^{1/2}$, where $Z, Z' \sim p$ and $W, W' \sim q$ are independent
▪ In practice, the squared MMD is estimated with the biased estimator $\mathrm{MMD}_b^2$ from empirical samples $\{z_i\}_{i=1}^{N} \sim p$ and $\{w_i\}_{i=1}^{M} \sim q$
• $\mathrm{MMD}_b^2(\{z_i\}, \{w_i\}; k) = \frac{1}{N^2}\sum_{i,j} k(z_i, z_j) + \frac{1}{M^2}\sum_{i,j} k(w_i, w_j) - \frac{2}{NM}\sum_{i,j} k(z_i, w_j)$
▪ The authors used the Gaussian kernel $k_h(x, y) = \exp\left(-\frac{(x-y)^2}{h}\right)$ for the objective $\mathrm{MMD}_b^2$
• They offered the following intuition:
– The first term serves as a repulsive force that pushes the particles $\{Z_\theta(s, a)_i\}$ away from each other, preventing them from collapsing into a single mode
– The third term acts as an attractive force that pulls the particles $\{Z_\theta(s, a)_i\}$ closer to their target particles $\{\mathcal{T}Z_i\}$
• They used the kernel mixture trick with $K$ kernels that differ in bandwidth $h$
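The biased estimator above translates almost line by line into NumPy. This is a minimal sketch, not the authors' implementation: it assumes scalar particles, a Gaussian kernel of the form $\exp(-(x-y)^2/h)$ (so $h = 2\sigma^2$ in the other parameterization), and a kernel mixture formed by summing the $K$ kernels. The function names and default bandwidths are illustrative.

```python
import numpy as np

def gaussian_mixture_kernel(x, y, bandwidths):
    """Pairwise sum of Gaussian kernels exp(-(x_i - y_j)^2 / h) over all h."""
    d2 = (x[:, None] - y[None, :]) ** 2          # squared distances, shape (len(x), len(y))
    return sum(np.exp(-d2 / h) for h in bandwidths)

def mmd2_biased(z, w, bandwidths=(1.0, 2.0, 4.0)):
    """Biased estimate of the squared MMD between samples z ~ p and w ~ q."""
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=float)
    k_zz = gaussian_mixture_kernel(z, z, bandwidths).mean()  # (1/N^2) sum k(z_i, z_j)
    k_ww = gaussian_mixture_kernel(w, w, bandwidths).mean()  # (1/M^2) sum k(w_i, w_j)
    k_zw = gaussian_mixture_kernel(z, w, bandwidths).mean()  # (1/NM) sum k(z_i, w_j)
    return float(k_zz + k_ww - 2.0 * k_zw)

same = mmd2_biased([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
far = mmd2_biased([0.0, 0.0, 0.0], [3.0, 3.0, 3.0], bandwidths=(2.0,))
```

Identical samples give an estimate of 0, and the estimate grows as the two particle sets move apart. In a training loop the same expression would be written in an autodiff framework so that the gradient with respect to the particles `z` can be taken.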
⚫ Pseudo code
▪ The hyperparameters are given as input, and the target network is initialized to match the main network
▪ At every step, a transition is sampled from the replay buffer
• In the control setting, the action is inferred from the policy ($\epsilon$-greedy)
• In policy evaluation, the action is selected with the estimated return distribution $\hat{\mu}(s', a')$
▪ For each of the $N$ particles, the target return value $\mathcal{T}Z_i$ is computed
▪ The $\mathrm{MMD}_b$ objective is computed and backpropagated with SGD
[Figure: pseudo code of MMDQN]
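The target-construction step can be sketched as follows. This is a hedged illustration rather than the authors' pseudo code: it assumes the next action is chosen greedily by the mean of the target network's particles, that the target particles are obtained with $f_{\gamma,r}(z) = r + \gamma z$ from the problem setting, and that terminal transitions do not bootstrap. The `done` flag and the function name `target_particles` are added assumptions.

```python
import numpy as np

def target_particles(next_particles, reward, gamma, done):
    """Build target particles from target-network outputs.

    next_particles: array of shape (num_actions, N), the particles
    Z(s', a') for every action a' (hypothetical layout).
    """
    next_particles = np.asarray(next_particles, dtype=float)
    # Greedy action: the one whose particles have the largest mean.
    a_star = int(np.argmax(next_particles.mean(axis=1)))
    # Apply z -> r + gamma * z to each particle; skip bootstrapping if terminal.
    target = reward + gamma * (1.0 - float(done)) * next_particles[a_star]
    return a_star, target

z_next = [[0.0, 1.0, 2.0], [3.0, 3.0, 3.0]]
a_star, target = target_particles(z_next, reward=1.0, gamma=0.5, done=False)
_, terminal_target = target_particles(z_next, reward=1.0, gamma=0.5, done=True)
```

The resulting `target` array would then be treated as a constant and compared with the current particles through the MMD objective.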
Experiment results (I)
⚫ Comparison with previous methods
▪ Experiments show that MMDQN outperforms previous methods on 55 Atari 2600 games (OpenAI Gym env)
• C51
[Tables, annotated as first group and second group: median and mean of the best *HN scores; median and mean of the *HN scores]
*HN scores: human-normalized scores
Experiment results (II)
⚫ Comparison with the previous method
▪ Experiments show that MMDQN outperforms the previous method (QR-DQN) on 55 Atari 2600 games (OpenAI Gym env)
[Figure: online training curves for MMDQN (3 seeds) and QR-DQN (2 seeds)]
Experiment results (III)
⚫ Ablation study
▪ Two sets of ablation studies were performed to answer the following questions
• (a): Which kernel gives MMDQN the best performance?
– A mixture of Gaussian kernels with different bandwidths gives the best performance
• (b): What number of particles gives MMDQN the best performance?
– Using more than 50 particles gives better performance, and using 200 particles gives the most stable performance
[Figure: sensitivity of MMDQN on the 6 tuning games w.r.t. (a) the kernel choice and (b) the number of particles N]
⚫ Basics of measure theory
▪ Measurable space $(X, \Sigma)$
• A pair $(X, \Sigma)$ is a measurable space if $X$ is a set and $\Sigma$ is a nonempty $\sigma$-algebra of subsets of $X$
• A measurable space allows us to define a function that assigns real values to the abstract elements of $\Sigma$
▪ Measure $\mu$
• Let $(X, \Sigma)$ be a measurable space. A set function $\mu$ defined on $\Sigma$ is called a measure iff it has the following properties
– $0 \le \mu(A) \le \infty$ for any $A \in \Sigma$
– $\mu(\emptyset) = 0$
– For any sequence of pairwise disjoint sets $\{A_n\} \subset \Sigma$ such that $\bigcup_{n=1}^{\infty} A_n \in \Sigma$, we have $\mu\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} \mu(A_n)$
▪ Measure space
• A triplet $(X, \Sigma, \mu)$ is a measure space if $(X, \Sigma)$ is a measurable space and $\mu: \Sigma \to [0, \infty]$ is a measure
• If $\mu(X) = 1$, then $\mu$ is a probability measure, usually denoted $P$, and the measure space is a probability space
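⚫ Basics of measure theory
▪ Measurable function
• Let $(\Omega, \Sigma)$ and $(\Lambda, G)$ be measurable spaces and $f$ a function from $\Omega$ to $\Lambda$. The function $f$ is called a measurable function from $(\Omega, \Sigma)$ to $(\Lambda, G)$ iff $f^{-1}(G) \subset \Sigma$
▪ Random variable
• A random variable $X$ is a measurable function from the probability space $(\Omega, \Sigma, P)$ into the probability space $(\mathcal{X}, B_{\mathcal{X}}, P_{\mathcal{X}})$, where $\mathcal{X} \subset \mathbb{R}$ is the range of $X$, $B_{\mathcal{X}}$ is the Borel $\sigma$-algebra on $\mathcal{X}$, and $P_{\mathcal{X}}$ is the probability measure (distribution) on $\mathcal{X}$
– Specifically, $X: \Omega \to \mathcal{X}$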
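The measurability condition $f^{-1}(G) \subset \Sigma$ can be verified by brute force when both spaces are finite. The sketch below assumes finite sets and explicit families of subsets; `is_measurable` and the example spaces are illustrative.

```python
def is_measurable(f, omega, sigma, target_family):
    """Check that the preimage of every set in target_family lies in sigma."""
    sigma = {frozenset(s) for s in sigma}
    for g in target_family:
        preimage = frozenset(w for w in omega if f(w) in g)
        if preimage not in sigma:
            return False
    return True

omega = {1, 2, 3, 4}
sigma = [set(), {1, 2}, {3, 4}, omega]   # a sigma-algebra on omega
G = [set(), {0}, {1}, {0, 1}]            # the power set of {0, 1}

# Constant on the blocks {1, 2} and {3, 4}: measurable.
block_indicator = is_measurable(lambda w: 0 if w <= 2 else 1, omega, sigma, G)
# Parity splits the blocks, so the preimage {2, 4} is not in sigma.
parity = is_measurable(lambda w: w % 2, omega, sigma, G)
```

The parity map fails because $\Sigma$ cannot distinguish 1 from 2, so no function that separates them is measurable with respect to it.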