Distributional Reinforcement Learning
via Moment Matching
(MMDQN)
*백승언, 주정헌, 박혜진
12 Feb, 2023
⚫ Introduction
▪ Estimation of the probability distribution
▪ Limitation of distribution estimation in conventional Reinforcement Learning(RL)
⚫ Distributional RL via Moment Matching(MMDQN)
▪ Backgrounds
▪ MMDQN
⚫ Experiment results
Contents
Introduction
⚫ Moments of a function are quantitative measures related to the shape of the function’s graph
▪ Probability distribution functions (PDF) are commonly summarized with four measures: mean, variance, skewness, and kurtosis (a numerical sketch follows at the end of this slide)
• Mean : first moment of the PDF, 𝜇 = 𝐸[𝑋]
• Variance : second (central) moment of the PDF, 𝜎² = 𝐸[(𝑋 − 𝜇)²], so 𝜎 = (𝐸[(𝑋 − 𝜇)²])^{1/2}
• Skewness : third (standardized) moment of the PDF, 𝛾 = 𝐸[((𝑋 − 𝜇)/𝜎)³]
• Kurtosis : fourth (standardized) moment of the PDF, 𝐾𝑢𝑟𝑡[𝑋] = 𝐸[(𝑋 − 𝜇)⁴] / (𝐸[(𝑋 − 𝜇)²])²
⚫ Estimation methods of a PDF 𝒇(𝒙) in machine learning
▪ Explicit methods
• Determine predetermined statistics as a functional 𝜁: 𝑓(𝑥) → ℝ
– Median of 𝑓(𝑥): 𝐹^{−1}(1/2); Mean of 𝑓(𝑥): ∫_𝒳 𝑥 𝑃(𝑑𝑥)
• Estimate the predetermined statistics from the collected data
▪ Implicit methods
• Estimate 𝑓(𝑥; 𝜃) directly from the collected data (GAN, VAE, …)
Estimation of the probability distribution
Various shapes of normal dist. based on 𝝁, 𝝈
Various shapes of dist. based on 𝜸, 𝑲𝒖𝒓𝒕[𝒙]
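As a quick illustration of the four moments above, here is a minimal sketch (assuming NumPy; the gamma-distributed sample data is purely illustrative, not from the paper) that estimates them from samples:

```python
import numpy as np

# Illustrative samples from a right-skewed distribution (not from the paper)
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=100_000)

mu = x.mean()                                    # first moment: mean
sigma = x.std()                                  # sqrt of the second central moment
skew = np.mean(((x - mu) / sigma) ** 3)          # third standardized moment
kurt = np.mean((x - mu) ** 4) / np.mean((x - mu) ** 2) ** 2  # fourth standardized moment

print(f"mean={mu:.3f}, std={sigma:.3f}, skewness={skew:.3f}, kurtosis={kurt:.3f}")
```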
⚫ Naïve approaches to estimating the return
▪ Due to the stochastic nature of the return 𝐺 = Σ_𝑡 𝛾^𝑡 𝑟_{𝑡+1}, in conventional RL 𝐺 is approximated with 𝑄(𝑠, 𝑎) and 𝑉(𝑠)
• 𝑉(𝑠) = 𝐸[Σ_𝑡 𝛾^𝑡 𝑟_{𝑡+1} | 𝑠], 𝑄(𝑠, 𝑎) = 𝐸[Σ_𝑡 𝛾^𝑡 𝑟_{𝑡+1} | 𝑠, 𝑎]
▪ These functions 𝑄(𝑠, 𝑎) and 𝑉(𝑠) are generally estimated with the standard Bellman operator 𝒯_𝐵, which uses only the mean value
• (𝒯_𝐵 𝑉)(𝑠) = 𝐸[𝑟(𝑠, 𝜋(𝑠))] + 𝛾𝐸[𝑉(𝑠′)], (𝒯_𝐵 𝑄)(𝑠, 𝑎) = 𝐸[𝑟(𝑠, 𝑎)] + 𝛾𝐸[𝑄(𝑠′, 𝑎′)]
Limitation of distribution estimation in conventional Reinforcement Learning
Variance, skewness, and kurtosis of the return 𝑮 are easily neglected
[Figure: return distributions with the same mean but different shapes; they are not the same]
[Figure: a complex return distribution that can be modeled in the distributional framework]
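To make the limitation concrete, here is a small sketch (the values are made up for illustration) of two discrete return distributions that share the same mean but have very different variance, so an expectation-only 𝑄-value cannot distinguish them:

```python
import numpy as np

# Two hypothetical return distributions with identical means
returns_a, probs_a = np.array([9.0, 10.0, 11.0]), np.array([0.25, 0.50, 0.25])    # narrow
returns_b, probs_b = np.array([-20.0, 10.0, 40.0]), np.array([0.25, 0.50, 0.25])  # wide, risky

def mean_and_var(z, p):
    mean = np.sum(p * z)
    return mean, np.sum(p * (z - mean) ** 2)

print(mean_and_var(returns_a, probs_a))  # (10.0, 0.5)
print(mean_and_var(returns_b, probs_b))  # (10.0, 450.0) -> same Q(s, a), very different risk
```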
Distributional RL via Moment Matching
(MMDQN)
https://arxiv.org/abs/2007.12354
⚫ Basics of probability space (𝛀, 𝚺, 𝑷)
▪ Sample space Ω
• In a probability space, Ω is the set of all possible outcomes; each element (event) is denoted 𝜔, and Ω is called the sample space
▪ 𝜎-algebra Σ
• A 𝜎-algebra Σ is a set of subsets 𝜔 of Ω s.t.:
– ∅ ∈ Σ
– If 𝜔 ∈ Σ, then 𝜔^𝐶 ∈ Σ
– If 𝜔_1, 𝜔_2, … ∈ Σ, then ∪_{𝑖=1}^{∞} 𝜔_𝑖 ∈ Σ
– Example: 𝐴 = {∅, Ω} is a 𝜎-algebra, and for any 𝐸 ⊆ Ω, 𝐴 = {∅, 𝐸, 𝐸^𝐶, Ω} is a 𝜎-algebra
▪ Random variable
• 𝐵(ℝ) is the smallest 𝜎-algebra containing the open intervals 𝐼 = {(𝑎, 𝑏) | 𝑎, 𝑏 ∈ ℝ, 𝑎 < 𝑏}; it is called the Borel 𝜎-algebra (Borel set)
• If 𝑓 is measurable from (Ω, Σ) to (𝐸, ℰ), it is called a Borel function or a random variable (RV)
– Let (Ω, Σ) and (𝐸, ℰ) be measurable spaces and 𝑓 a function from (Ω, Σ) to (𝐸, ℰ); 𝑓 is called a measurable function iff 𝑓^{−1}(ℰ) ⊂ Σ
[Diagram: 𝑓 maps Ω to 𝐸; for every 𝑌 ∈ ℰ, the preimage 𝑓^{−1}(𝑌) belongs to Σ]
Backgrounds (I)
⚫ Basics of probability space (𝛀, 𝚺, 𝑷)
▪ Probability distribution
• Let (Ω, Σ, 𝜇) be a measure space and 𝑓 a measurable function from (Ω, Σ) to (𝐸, ℰ). The pushforward measure, denoted 𝑓#𝜇 or 𝜇 ∘ 𝑓^{−1}, is the measure on ℰ defined as
– 𝑓#𝜇(𝑌) = (𝜇 ∘ 𝑓^{−1})(𝑌) = 𝜇({𝑓 ∈ 𝑌}) = 𝜇(𝑓^{−1}(𝑌)), 𝑌 ∈ ℰ
• If 𝜇 = 𝑃 is a probability measure and 𝑓 is a random variable, then 𝑃 ∘ 𝑓^{−1} is called the distribution (or the law) of 𝑓 and is denoted by 𝑃_𝑋
– 𝑃_𝑋 is a probability distribution (a probability measure)
– In practice, one often uses (𝐸, ℰ) = (ℝ^𝑛, 𝐵(ℝ^𝑛)), so that 𝑓#𝜇: 𝐵(ℝ^𝑛) → [0, 1]
Backgrounds (II)
[Diagram: the random variable 𝑓 pushes the probability measure 𝑃 on (Ω, Σ) forward to its distribution 𝑓#𝑃 = 𝑃 ∘ 𝑓^{−1} on ℝ]
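A minimal sketch of the pushforward idea for an empirical (sample-based) measure, using the affine map 𝑓_{𝛾,𝑟}(𝑧) = 𝑟 + 𝛾𝑧 that reappears later in the distributional Bellman operator (all numbers here are illustrative assumptions):

```python
import numpy as np

# Empirical measure mu = (1/N) * sum_i delta_{z_i}, represented by its atoms
z = np.array([0.0, 1.0, 2.0, 5.0])    # hypothetical return samples

# Pushforward by f(z) = r + gamma * z: apply f to every atom;
# the mass (1/N per atom) is carried along unchanged
r, gamma = 1.0, 0.99
pushforward_atoms = r + gamma * z      # atoms of (f_{gamma,r})_# mu
print(pushforward_atoms)               # [1.   1.99  2.98  5.95]
```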
⚫ Distributional RL
▪ In distributional RL, the cumulative return of a chosen action at a state is modeled with its full distribution rather than its expectation, e.g. by 𝑁 particles whose empirical distribution (1/𝑁) Σ_{𝑖=1}^{𝑁} 𝛿_{𝜃_𝑖(𝑠,𝑎)} approximates the distribution of 𝑍_𝜃(𝑠, 𝑎) (a code sketch of this particle representation follows at the end of this slide)
• So the model can capture the intrinsic randomness of the return instead of only its first-order moment (higher-order moments, multi-modality of the state-action value distribution)
▪ Existing distributional RL algorithms
• C51: pre-determines the supports of the return distribution and trains a categorical distribution over them
• QR-DQN: introduces quantile regression to avoid pre-determined supports and to decrease the theory-practice gap
• IQN: samples quantile fractions to train the full quantile function, and can account for the risk of the policy
• IDAC: trains the full return distribution directly using an adversarial network, and trains the policy based on semi-implicit methods
Backgrounds (III)
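For the particle (Dirac mixture) representation mentioned in the first bullet, here is a hedged PyTorch-style sketch of a network whose output for each action is a set of 𝑁 particles; the small MLP and its sizes are illustrative assumptions, not the architecture used in the papers:

```python
import torch
import torch.nn as nn

class ParticleQNetwork(nn.Module):
    """Outputs N particles per action; their empirical distribution
    (1/N) * sum_i delta_{theta_i(s, a)} approximates the return distribution."""
    def __init__(self, state_dim: int, num_actions: int, num_particles: int = 200):
        super().__init__()
        self.num_actions, self.num_particles = num_actions, num_particles
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions * num_particles),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        out = self.net(state)                                       # (batch, A * N)
        return out.view(-1, self.num_actions, self.num_particles)   # (batch, A, N)

# The implied Q-values are the particle means; the greedy action maximizes them
net = ParticleQNetwork(state_dim=4, num_actions=2)
particles = net(torch.zeros(1, 4))      # (1, 2, 200)
q_values = particles.mean(dim=-1)       # (1, 2)
action = q_values.argmax(dim=-1)
```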
⚫ Overview of the MMDQN
▪ Unlike existing distributional RL methods, MMDQN makes no assumption about predetermined statistics and can learn unrestricted statistics
• For implementation, however, deterministic pseudo-samples of the return distribution are learned via MMD
– The authors use the Dirac mixture 𝜇̂_𝜃(𝑠, 𝑎) = (1/𝑁) Σ_{𝑖=1}^{𝑁} 𝛿_{𝑍_{𝜃,𝑖}(𝑠,𝑎)} to approximate 𝜇^𝜋(𝑠, 𝑎)
▪ The authors analyze the distributional Bellman operator and establish sufficient conditions for its contraction in the MMD metric
• They show that 𝒯^𝜋 is a contraction when the kernel is the unrectified kernel 𝑘_𝛼(𝑥, 𝑦) ≔ −‖𝑥 − 𝑦‖^𝛼, ∀𝑥, 𝑦 ∈ 𝒳
• However, for practical reasons, they demonstrate that the commonly used Gaussian kernel 𝑘(𝑥, 𝑦) ≔ exp(−‖𝑥 − 𝑦‖²/(2𝜎²)) performs better in this framework (both kernels are sketched in code at the end of this slide)
▪ MMDQN with a Gaussian kernel mixture achieved state-of-the-art results on the 55 Atari 2600 games
• For a fair comparison, they used the same network architecture as DQN and QR-DQN
MMDQN (I)
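A small sketch of the two kernels discussed above, written as plain Python functions; the parameter values are illustrative assumptions:

```python
import numpy as np

def unrectified_kernel(x, y, alpha=1.0):
    """k_alpha(x, y) = -|x - y|^alpha, the kernel used in the contraction analysis."""
    return -np.abs(x - y) ** alpha

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-(x - y)^2 / (2 * sigma^2)), the kernel used in practice."""
    return np.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

print(unrectified_kernel(0.0, 2.0))   # -2.0
print(gaussian_kernel(0.0, 2.0))      # exp(-2) ~= 0.135
```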
⚫ Problem setting
▪ For any policy 𝜋, let 𝜇^𝜋 = law(𝑍^𝜋) be the distribution of the return RV, where 𝑍^𝜋(𝑠, 𝑎) ≔ Σ_{𝑡=0}^{∞} 𝛾^𝑡 𝑅(𝑠_𝑡, 𝑎_𝑡)
▪ The distributional Bellman operator is (𝒯^𝜋 𝜇)(𝑠, 𝑎) ≔ ∫_𝒮 ∫_𝒜 ∫_ℝ (𝑓_{𝛾,𝑟})#𝜇(𝑠′, 𝑎′) ℛ(𝑑𝑟 | 𝑠, 𝑎) 𝜋(𝑑𝑎′ | 𝑠′) 𝑃(𝑑𝑠′ | 𝑠, 𝑎), where 𝑓_{𝛾,𝑟}(𝑧) ≔ 𝑟 + 𝛾𝑧, ∀𝑧, and (𝑓_{𝛾,𝑟})#𝜇(𝑠′, 𝑎′) is the pushforward measure of 𝜇(𝑠′, 𝑎′) by 𝑓_{𝛾,𝑟}
⚫ Algorithmic approach
▪ The authors use the Dirac mixture 𝜇̂_𝜃(𝑠, 𝑎) = (1/𝑁) Σ_{𝑖=1}^{𝑁} 𝛿_{𝑍_{𝜃,𝑖}(𝑠,𝑎)} to approximate 𝜇^𝜋(𝑠, 𝑎)
• They refer to the deterministic samples 𝑍_{𝜃,𝑖}(𝑠, 𝑎) as particles
▪ The algorithmic goal reduces to learning particles 𝑍_𝜃(𝑠, 𝑎) that approximate 𝜇^𝜋(𝑠, 𝑎)
• To this end, the particles 𝑍_𝜃(𝑠, 𝑎) are deterministically evolved to minimize the MMD distance between the approximate distribution and its distributional Bellman target (a small numerical sketch of the target follows below)
MMDQN (II)
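A minimal NumPy sketch of how the Bellman target particles could be formed in the control setting; the transition values are made up, and selecting 𝑎* greedily from the particle means is an assumption about the control case (compare the pseudo-code slide later):

```python
import numpy as np

gamma, r = 0.99, 1.0                                 # hypothetical discount and reward

# Target-network particles at s' for each action: shape (num_actions, N)
z_next = np.array([[0.0, 2.0, 4.0, 6.0],             # action 0
                   [1.0, 1.0, 1.0, 1.0]])            # action 1 (illustrative)

# Control setting: pick a* greedily w.r.t. the particle means (implied Q-values)
a_star = z_next.mean(axis=1).argmax()                # -> action 0 (mean 3.0 > 1.0)

# Target particles: T^hat Z_i = r + gamma * Z_i(s', a*), a pushforward by f_{gamma,r}
target_particles = r + gamma * z_next[a_star]
print(target_particles)                              # [1.   2.98  4.96  6.94]
```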
⚫ Maximum Mean Discrepancy(MMD)
▪ Let ℱ be a Reproducing Kernel Hilbert Space(RKHS) associated with a continuous kernel 𝑘(⋅,⋅) on 𝒳.
▪ The MMD between 𝑝 ∈ 𝑃(𝒳) and 𝑞 ∈ 𝑃(𝒳) is defined as
• MMD(𝑝, 𝑞; ℱ) ≔ sup_{𝑓∈ℱ: ‖𝑓‖≤1} (𝔼[𝑓(𝑍)] − 𝔼[𝑓(𝑊)]) = ‖∫_𝒳 𝑘(𝑥, ⋅) 𝑝(𝑑𝑥) − ∫_𝒳 𝑘(𝑥, ⋅) 𝑞(𝑑𝑥)‖_ℱ = (𝔼[𝑘(𝑍, 𝑍′)] + 𝔼[𝑘(𝑊, 𝑊′)] − 2𝔼[𝑘(𝑍, 𝑊)])^{1/2}, where 𝑍, 𝑍′ ~ 𝑝 and 𝑊, 𝑊′ ~ 𝑞 are independent
▪ In practice, MMD is estimated (with bias) by MMD_𝑏 from empirical samples {𝑧_𝑖}_{𝑖=1}^{𝑁} ~ 𝑝 and {𝑤_𝑖}_{𝑖=1}^{𝑀} ~ 𝑞
• MMD_𝑏²({𝑧_𝑖}, {𝑤_𝑖}; 𝑘) = (1/𝑁²) Σ_{𝑖,𝑗} 𝑘(𝑧_𝑖, 𝑧_𝑗) + (1/𝑀²) Σ_{𝑖,𝑗} 𝑘(𝑤_𝑖, 𝑤_𝑗) − (2/𝑁𝑀) Σ_{𝑖,𝑗} 𝑘(𝑧_𝑖, 𝑤_𝑗)
▪ The authors use the Gaussian kernel 𝑘(𝑥, 𝑦) = exp(−(𝑥 − 𝑦)²/ℎ) in the MMD_𝑏 objective (a code sketch follows at the end of this slide)
• They give the following intuition:
– The first term serves as a repulsive force that pushes the particles {𝑍_𝜃(𝑠, 𝑎)_𝑖} away from each other, preventing them from collapsing into a single mode
– The third term acts as an attractive force that pulls the particles {𝑍_𝜃(𝑠, 𝑎)_𝑖} closer to their target particles {𝒯̂𝑍_𝑖}
• They use a kernel mixture trick with 𝐾 Gaussian kernels that differ only in bandwidth ℎ
MMDQN (III)
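A minimal sketch of the biased estimator MMD_𝑏² above with a mixture of Gaussian kernels; the bandwidth set and the sample values are illustrative assumptions, not the settings from the paper:

```python
import numpy as np

def gaussian_mixture_kernel(x, y, bandwidths=(1.0, 10.0, 100.0)):
    """Sum of Gaussian kernels exp(-(x - y)^2 / h) over several bandwidths h."""
    d2 = (x[:, None] - y[None, :]) ** 2            # pairwise squared distances
    return sum(np.exp(-d2 / h) for h in bandwidths)

def mmd_b_squared(z, w, kernel=gaussian_mixture_kernel):
    """(1/N^2) sum_ij k(z_i, z_j) + (1/M^2) sum_ij k(w_i, w_j) - (2/NM) sum_ij k(z_i, w_j)."""
    n, m = len(z), len(w)
    return (kernel(z, z).sum() / n ** 2
            + kernel(w, w).sum() / m ** 2
            - 2.0 * kernel(z, w).sum() / (n * m))

# Example: current particles vs. their Bellman target particles (illustrative values)
z = np.array([1.0, 2.0, 3.0])      # Z_theta(s, a)
w = np.array([1.5, 2.5, 3.5])      # target particles
print(mmd_b_squared(z, w))
```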
⚫ Pseudo code
▪ Hyper-parameters are taken as input, and the target network is initialized to match the main network
▪ At every step, a transition is sampled from the replay buffer
• In the control setting, the action is inferred from the (𝜖-greedy) policy
• In policy evaluation, the action is selected with the estimated return distribution 𝜇̂(𝑠′, 𝑎′)
▪ For each of the 𝑁 statistics, the target return value 𝒯̂𝑍_𝑖 is computed
▪ The MMD_𝑏 objective is computed and the network is updated via SGD (a code sketch of one update step follows below)
MMDQN (IV)
Pseudo code of MMDQN
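In the spirit of the pseudo code, here is a hedged PyTorch-style sketch of a single MMDQN update step; the ParticleQNetwork from the earlier sketch, the kernel bandwidths, and the batch layout are all illustrative assumptions rather than the authors' exact implementation:

```python
import torch

def mmdqn_update(online_net, target_net, optimizer, batch, gamma=0.99,
                 bandwidths=(1.0, 10.0, 100.0)):
    s, a, r, s_next, done = batch                        # tensors sampled from the replay buffer

    # Target particles: T^hat Z_i = r + gamma * Z_i(s', a*; theta^-)
    with torch.no_grad():
        z_next = target_net(s_next)                      # (B, A, N)
        a_star = z_next.mean(dim=-1).argmax(dim=-1)      # greedy w.r.t. particle means
        z_star = z_next[torch.arange(len(s)), a_star]    # (B, N)
        target = r[:, None] + gamma * (1.0 - done[:, None]) * z_star

    z = online_net(s)[torch.arange(len(s)), a]           # particles of the taken actions, (B, N)

    # Biased MMD^2 between particles and targets, Gaussian kernel mixture
    def k(x, y):
        d2 = (x[:, :, None] - y[:, None, :]) ** 2        # (B, N, N) pairwise squared distances
        return sum(torch.exp(-d2 / h) for h in bandwidths)

    loss = (k(z, z).mean(dim=(1, 2)) + k(target, target).mean(dim=(1, 2))
            - 2.0 * k(z, target).mean(dim=(1, 2))).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```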
Experiment Results
⚫ Comparison with previous methods
▪ Experiments show the superior performance of MMDQN compared to previous methods on 55 Atari 2600 games (OpenAI Gym environments)
• DQN
• C51
• RAINBOW
• FQF
• PRIOR
• QR-DQN
• IQN
Experiment results (I)
Median and mean of best *HN scores
Median and mean of the *HN scores
*HN scores: Human Normalized score
➔ First group
➔ Second group
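For reference, the human-normalized (HN) score used in these tables is conventionally computed as below; the per-game random and human reference scores in the example are placeholders, not actual values:

```python
def human_normalized_score(agent, random, human):
    """HN score = (agent - random) / (human - random); 1.0 corresponds to human level."""
    return (agent - random) / (human - random)

# Placeholder reference scores for one hypothetical game
print(human_normalized_score(agent=5000.0, random=200.0, human=7000.0))  # ~= 0.71
```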
⚫ Comparison with the previous method
▪ Experiments show the superior performance of MMDQN compared to the previous method (QR-DQN) on 55 Atari 2600 games (OpenAI Gym environments)
• MMDQN (3 seeds)
• QR-DQN (2 seeds)
Experiment results (II)
Online training curves for MMDQN and QR-DQN
⚫ Ablation study
▪ Two sets of ablation studies were performed to answer the following questions
• (a): Which kernel gives MMDQN the best performance?
– A mixture of Gaussian kernels with different bandwidths displays the best performance
• (b): How many particles give MMDQN the best performance?
– Using more than 50 particles demonstrates better performance, and 200 particles exhibits the most stable performance
Experiment results (III)
The sensitivity of MMDQN in the 6 tuning games w.r.t (a): the kernel choice, and (b): the number of particles N
Thank you!
Q&A
Appendix
⚫ Basics of measure theory
▪ Measurable space (𝑋, Σ)
• A pair (𝑋, Σ) is a 𝑚𝑒𝑎𝑠𝑢𝑟𝑎𝑏𝑙𝑒 𝑠𝑝𝑎𝑐𝑒 if 𝑋 is a set and Σ is a nonempty 𝜎 − 𝑎𝑙𝑔𝑒𝑏𝑟𝑎 of subsets of 𝑋
• A measurable space allows us to define a function that assigns real numbered values to the abstract elements of Σ
▪ Measure 𝜇
• Let (𝑋, Σ) be a measurable space; a set function 𝜇 defined on Σ is called a measure iff it has the following properties
– 0 ≤ 𝜇(𝐴) ≤ ∞ for any 𝐴 ∈ Σ
– 𝜇(∅) = 0
– For any sequence of pairwise disjoint sets {𝐴_𝑛} ⊂ Σ such that ∪_{𝑛=1}^{∞} 𝐴_𝑛 ∈ Σ, we have 𝜇(∪_{𝑛=1}^{∞} 𝐴_𝑛) = Σ_{𝑛=1}^{∞} 𝜇(𝐴_𝑛)
▪ Measure space
• A triplet (𝑋, Σ, 𝜇) is a measure space if (𝑋, Σ) is a measurable space and 𝜇: Σ → [0, ∞] is a measure
• If 𝜇(𝑋) = 1, then 𝜇 is a probability measure, usually denoted 𝑃, and the measure space is a probability space
Backgrounds
⚫ Basics of measure theory
▪ Measurable function
• Let (Ω, Σ) and (Λ, 𝐺) be measurable spaces and 𝑓 a function from Ω to Λ. The function 𝑓 is called a measurable function from (Ω, Σ) to (Λ, 𝐺) iff 𝑓^{−1}(𝐺) ⊂ Σ
▪ Random variable
• A random variable 𝑿 is a measurable function from the probability space (Ω, Σ, 𝑃) into the probability space (𝒳, 𝐵_𝒳, 𝑃_𝒳), where 𝒳 ⊆ ℝ is the range of 𝑿, 𝐵_𝒳 is the Borel 𝜎-algebra of 𝒳, and 𝑃_𝒳 is the probability measure (distribution) on 𝒳
– Specifically, 𝑿: Ω → 𝒳
Backgrounds