This paper considers the problem of learning return distributions through Bellman dynamics in distributional reinforcement learning. Prior work learns a finite set of statistics of each return distribution with a neural network, which restricts the functional form of the return distribution, limits its expressiveness, and makes it difficult to maintain the predefined statistics. To remove this restriction, the paper proposes learning deterministic (pseudo-)samples of the return distribution using Maximum Mean Discrepancy (MMD), a technique from hypothesis testing. By implicitly matching all moments between the return distribution and its Bellman target, the method guarantees convergence of the distributional Bellman operator, and the paper provides a finite-sample analysis of the distribution approximation. Experiments show that the proposed method outperforms standard distributional RL baselines and achieves state-of-the-art scores on Atari games among non-distributed agents.
Contents
⚫ Introduction
▪ Estimation of the probability distribution
▪ Limitation of distribution estimation in conventional Reinforcement Learning (RL)
⚫ Distributional RL via Moment Matching (MMDQN)
▪ Backgrounds
▪ MMDQN
⚫ Experiment results
Estimation of the probability distribution
⚫ Moments of a function are quantitative measures related to the shape of the function’s graph
▪ Probability distribution functions (PDF) are generally represented with four measures: mean, variance, skewness, and kurtosis (illustrated in the numerical sketch after this slide)
• Mean: first moment of the PDF, μ = E[X]
• Variance: second (central) moment of the PDF, σ² = E[(X − μ)²]
• Skewness: third standardized moment of the PDF, γ = E[((X − μ)/σ)³]
• Kurtosis: fourth standardized moment of the PDF, Kurt[X] = E[(X − μ)⁴] / (E[(X − μ)²])²
⚫ Estimation methods of a PDF f(x) in machine learning
▪ Explicit methods
• Determine predetermined statistics as functionals ζ: f(x) → ℝ
– e.g., median of f(x): F⁻¹(1/2); mean of f(x): ∫𝒳 x f(x) dx
• Estimation of the predetermined statistics from the collected data
▪ Implicit methods
• Estimation of f(x; θ) directly from the collected data (GAN, VAE, …)
[Figure: various shapes of a distribution depending on (μ, σ) and on (γ, Kurt[x])]
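As a minimal, hedged sketch (not from the slides or the paper), the four moments above can be estimated from samples as follows; the gamma-distributed "return" sample and all names here are illustrative assumptions.

```python
# Illustrative sketch: estimating mean, variance, skewness, and kurtosis
# from a 1-D sample array. Names and the gamma example are assumptions.
import numpy as np

def four_moments(samples: np.ndarray):
    """Return (mean, variance, skewness, kurtosis) of a 1-D sample array."""
    mu = samples.mean()                              # first moment
    var = ((samples - mu) ** 2).mean()               # second central moment
    sigma = np.sqrt(var)
    skew = (((samples - mu) / sigma) ** 3).mean()    # third standardized moment
    kurt = ((samples - mu) ** 4).mean() / var ** 2   # fourth standardized moment
    return mu, var, skew, kurt

rng = np.random.default_rng(0)
g = rng.gamma(shape=2.0, scale=1.0, size=100_000)    # a skewed "return" sample
print(four_moments(g))   # skewness > 0 and kurtosis > 3 for a gamma distribution
```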
Limitation of distribution estimation in conventional Reinforcement Learning
⚫ Naïve approaches in estimation of return
▪ Due to the stochastic nature of the return G = Σₜ γᵗ rₜ₊₁, in conventional RL, G is approximated with Q(s, a) and V(s)
• V(s) = E[ Σₜ γᵗ rₜ₊₁ | s ],  Q(s, a) = E[ Σₜ γᵗ rₜ₊₁ | s, a ]
▪ These functions Q(s, a) and V(s) are generally estimated with the ordinary Bellman operator 𝒯ᴮ, which uses only the mean value
• (𝒯ᴮ V)(s) = E[ r(s, π(s)) ] + γ E[ V(s′) ],  (𝒯ᴮ Q)(s, a) = E[ r(s, a) ] + γ E[ Q(s′, a′) ]
Variance, skewness, and kurtosis of the return G are easily neglected (see the sketch below)
[Figure: two return distributions with the same mean but different shapes ("these distributions have the same mean, but they are not the same!"), and a complex multi-modal return distribution that can be modeled in the distributional framework]
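As a hedged illustration of the point above (my own example, not from the slides), the sketch below builds two return samples with identical means but very different risk profiles; a mean-only value estimate cannot distinguish them.

```python
# Illustrative sketch: two return distributions with the same mean that an
# expectation-only estimate conflates. Values are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(42)
safe_returns = rng.normal(loc=1.0, scale=0.1, size=100_000)        # low variance
risky_returns = np.where(rng.random(100_000) < 0.5, -9.0, 11.0)    # bimodal: -9 or +11

print(safe_returns.mean(), risky_returns.mean())   # both ≈ 1.0
print(safe_returns.std(),  risky_returns.std())    # ≈ 0.1 vs ≈ 10.0
# A value function V(s) = E[G] assigns these two behaviors the same value,
# while a distributional estimate keeps the variance / multi-modality.
```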
Backgrounds (I)
⚫ Basics of probability space (𝛀, 𝚺, 𝑷)
▪ Sample space Ω
• In a probability space, the set Ω of all possible outcomes ω is called the sample space
▪ σ-algebra Σ
• A σ-algebra Σ is a collection of subsets of Ω such that:
– ∅ ∈ Σ
– If A ∈ Σ, then Aᶜ ∈ Σ
– If A₁, A₂, … ∈ Σ, then ∪ᵢ₌₁^∞ Aᵢ ∈ Σ
– e.g., A = {∅, Ω} is a σ-algebra; for any E ⊂ Ω, A = {∅, E, Eᶜ, Ω} is a σ-algebra
▪ Random variable
• B(ℝ), the smallest σ-algebra containing the open intervals I = {(a, b) | a, b ∈ ℝ, a < b}, is called the Borel σ-algebra
• Let (Ω, Σ) and (E, ℰ) be measurable spaces and f a function from (Ω, Σ) to (E, ℰ). The function f is called a measurable function iff f⁻¹(ℰ) ⊂ Σ
• If f is measurable from (Ω, Σ) to (E, ℰ), it is called a Borel function or a random variable (RV)
[Figure: a measurable function f from Ω (with σ-algebra Σ) to E (with σ-algebra ℰ); the preimage f⁻¹(Y) of any Y ∈ ℰ lies in Σ]
Backgrounds (II)
⚫ Basics of probability space (𝛀, 𝚺, 𝑷)
▪ Probability distribution
• Let (Ω, Σ, μ) be a measure space and f a measurable function from (Ω, Σ) to (E, ℰ). The pushforward measure, denoted f#μ or μ ∘ f⁻¹, is the measure on ℰ defined as
– f#μ(Y) = (μ ∘ f⁻¹)(Y) = μ({f ∈ Y}) = μ(f⁻¹(Y)), Y ∈ ℰ
• If μ = P is a probability measure and f is a random variable, then P ∘ f⁻¹ is called the distribution (or the law) of f and is denoted by P_X
– P_X is also referred to as the probability distribution or probability measure of f
– In practice, one often uses (E, ℰ) = (ℝⁿ, B(ℝⁿ)), so that f#P: B(ℝⁿ) → [0, 1] (a small empirical sketch of a pushforward follows below)
[Figure: the pushforward measure f#P = P ∘ f⁻¹ on ℝ induced by f from the probability measure P on (Ω, Σ)]
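A minimal sketch (my own illustration, not from the slides): the empirical pushforward of return samples under the affine map f_{γ,r}(z) = r + γz, which is exactly the map used by the distributional Bellman operator introduced later. Names and values are assumptions.

```python
# Illustrative sketch: pushing forward an empirical return distribution
# through f_{gamma,r}(z) = r + gamma * z.
import numpy as np

def pushforward_affine(samples: np.ndarray, r: float, gamma: float) -> np.ndarray:
    """Samples of (f_{gamma,r})#mu when `samples` are drawn from mu."""
    return r + gamma * samples

z = np.random.default_rng(0).normal(size=5)          # samples from some mu(s', a')
print(pushforward_affine(z, r=1.0, gamma=0.99))      # samples from the pushforward law
```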
Backgrounds (III)
⚫ Distributional RL
▪ In distributional RL, the cumulative return of a chosen action at a state is modeled with its full distribution rather than its expectation, e.g. Z_θ(s, a) := (1/N) Σᵢ₌₁ᴺ δ_{θᵢ(s,a)} (a minimal particle-network sketch follows after this slide)
• So that the model can capture the intrinsic randomness of the return instead of only its first-order moment (high-order moments, multi-modality of the state-action value distribution)
▪ Existing distributional RL algorithms
• C51: pre-determines the supports of the return distribution and then trains a categorical distribution over them
• QR-DQN: introduces quantile regression to avoid pre-determined supports and to decrease the theory-practice gap
• IQN: samples quantile fractions, trains the full quantile function, and can take the risk of the policy into account
• IDAC: trains the full return distribution directly with an adversarial-style network, and trains the policy with semi-implicit methods
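Below is a minimal sketch (illustrative, not the authors' code) of a particle-based return distribution Z_θ(s, a) ≈ (1/N) Σᵢ δ_{θᵢ(s,a)}, from which the ordinary Q-value is recovered as the particle mean. The network sizes and names are assumptions.

```python
# Illustrative sketch: a network that outputs N deterministic return particles
# per (state, action); Q(s, a) is the mean of the particles.
import torch
import torch.nn as nn

class ParticleReturnNet(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, num_particles: int = 200):
        super().__init__()
        self.num_actions = num_actions
        self.num_particles = num_particles
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions * num_particles),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # particles: [batch, num_actions, N] — N deterministic samples per (s, a)
        return self.net(state).view(-1, self.num_actions, self.num_particles)

net = ParticleReturnNet(state_dim=4, num_actions=2)
particles = net(torch.randn(8, 4))
q_values = particles.mean(dim=-1)   # Q(s, a) = E[Z(s, a)] ≈ mean of the particles
```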
MMDQN (I)
⚫ Overview of the MMDQN
▪ Unlike existing distributional RL methods, MMDQN makes no assumption about predetermined statistics and can learn unrestricted statistics
• For implementation, however, deterministic pseudo-samples of the return distribution are learned via MMD
– The authors use the Dirac mixture μ̂_θ(s, a) = (1/N) Σᵢ₌₁ᴺ δ_{Z_θ,i(s,a)} to approximate μ^π(s, a)
▪ The authors analyze the distributional Bellman operator 𝒯^π and establish sufficient conditions for its contraction in MMD
• They show that 𝒯^π is a contraction when the kernel is the unrectified kernel k_α(x, y) := −‖x − y‖^α, α ∈ (0, 2), ∀x, y ∈ 𝒳
• For practical considerations, however, they demonstrate that the commonly used Gaussian kernel k(x, y) := exp(−(x − y)²/(2σ²)) performs better in this framework (both kernels are sketched below)
▪ MMDQN with a mixture of Gaussian kernels achieved state-of-the-art results on the 55 Atari 2600 games
• For a fair comparison, they used the same network architecture as DQN and QR-DQN
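A small sketch of the two kernels named above: the unrectified kernel used in the contraction analysis and the Gaussian kernel used in practice. The exponent and bandwidth values are illustrative assumptions.

```python
# Illustrative sketch of the kernels referenced in the analysis and in practice.
import numpy as np

def unrectified_kernel(x, y, alpha=1.0):
    # k_alpha(x, y) = -|x - y|^alpha
    return -np.abs(x - y) ** alpha

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-(x - y)^2 / (2 sigma^2))
    return np.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

print(unrectified_kernel(0.0, 1.5), gaussian_kernel(0.0, 1.5))
```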
MMDQN (II)
⚫ Problem setting
▪ For any policy π, let μ^π = law(Z^π) be the distribution of the return random variable Z^π(s, a) := Σₜ₌₀^∞ γᵗ R(sₜ, aₜ)
▪ (𝒯^π μ)(s, a) := ∫𝒮 ∫𝒜 ∫ℝ (f_{γ,r})#μ(s′, a′) ℛ(dr | s, a) π(da′ | s′) P(ds′ | s, a), where f_{γ,r}(z) := r + γz, ∀z, and (f_{γ,r})#μ(s′, a′) is the pushforward measure of μ(s′, a′) by f_{γ,r}
⚫ Algorithmic approach
▪ The authors use the Dirac mixture μ̂_θ(s, a) = (1/N) Σᵢ₌₁ᴺ δ_{Z_θ,i(s,a)} to approximate μ^π(s, a)
• They refer to the deterministic samples Z_θ(s, a) as particles
▪ The goal of the algorithm reduces to learning the particles Z_θ(s, a) so that μ̂_θ(s, a) approximates μ^π(s, a)
• To this end, the particles Z_θ(s, a) are deterministically evolved to minimize the MMD distance between the approximate distribution and its distributional Bellman target (see the target-construction sketch below)
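The following is a minimal sketch (illustrative, not the authors' code) of building the distributional Bellman target particles 𝒯Zᵢ = r + γ Z_target,i(s′, a*) for one batch of transitions, with a* chosen greedily from the particle means. Tensor shapes and names are assumptions.

```python
# Illustrative sketch: constructing Bellman target particles for a batch.
import torch

def bellman_target_particles(reward: torch.Tensor,        # [batch]
                             done: torch.Tensor,           # [batch], 1.0 if terminal
                             next_particles: torch.Tensor, # [batch, actions, N]
                             gamma: float = 0.99) -> torch.Tensor:
    # Greedy action from the mean of the target particles (the Q-value).
    next_q = next_particles.mean(dim=-1)                        # [batch, actions]
    a_star = next_q.argmax(dim=-1)                               # [batch]
    z_next = next_particles[torch.arange(len(a_star)), a_star]   # [batch, N]
    # T Z_i = r + gamma * Z_i(s', a*), with zero bootstrap at terminal states.
    return reward.unsqueeze(-1) + gamma * (1.0 - done).unsqueeze(-1) * z_next
```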
MMDQN (II)
⚫ Maximum Mean Discrepancy(MMD)
▪ Let ℱ be a Reproducing Kernel Hilbert Space(RKHS) associated with a continuous kernel 𝑘(⋅,⋅) on 𝒳.
▪ The MMD between p ∈ 𝒫(𝒳) and q ∈ 𝒫(𝒳) is defined as
• MMD(p, q; ℱ) := sup_{f∈ℱ: ‖f‖≤1} ( 𝔼[f(Z)] − 𝔼[f(W)] ) = ‖ ∫𝒳 k(x, ⋅) p(dx) − ∫𝒳 k(x, ⋅) q(dx) ‖_ℱ = ( 𝔼[k(Z, Z′)] + 𝔼[k(W, W′)] − 2𝔼[k(Z, W)] )^{1/2}, where Z, Z′ ~ p and W, W′ ~ q are mutually independent
▪ In practice, MMD is estimated (with bias) by MMD_b from empirical samples {zᵢ}ᵢ₌₁ᴺ ~ p and {wᵢ}ᵢ₌₁ᴹ ~ q
• MMD_b²({zᵢ}, {wᵢ}; k) = (1/N²) Σᵢ,ⱼ k(zᵢ, zⱼ) + (1/M²) Σᵢ,ⱼ k(wᵢ, wⱼ) − (2/NM) Σᵢ,ⱼ k(zᵢ, wⱼ)
▪ The authors use the Gaussian kernel k(x, y) = exp(−(x − y)²/h) in the MMD_b objective
• They give the following intuition:
– The first term serves as a repulsive force that pushes the particles {Z_θ(s, a)ᵢ} away from each other, preventing them from collapsing into a single mode
– The third term acts as an attractive force that pulls the particles {Z_θ(s, a)ᵢ} toward their target particles {𝒯Zᵢ}
• They use the kernel-mixture trick with K Gaussian kernels that differ only in bandwidth h (a sketch of this estimator follows below)
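Below is a minimal sketch (illustrative, not the authors' released code) of the biased MMD² estimator with a mixture of Gaussian kernels over particle sets. The bandwidth values and tensor shapes are assumptions.

```python
# Illustrative sketch: biased MMD^2 between two particle sets with a
# Gaussian kernel mixture (several bandwidths summed, none tuned).
import torch

def gaussian_kernel_mixture(x: torch.Tensor, y: torch.Tensor,
                            bandwidths=(1.0, 2.0, 4.0, 8.0)) -> torch.Tensor:
    # x: [..., N], y: [..., M]  ->  kernel matrix [..., N, M]
    d2 = (x.unsqueeze(-1) - y.unsqueeze(-2)) ** 2
    return sum(torch.exp(-d2 / h) for h in bandwidths)

def mmd2_biased(z: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Biased MMD^2 between particle sets z ~ p and w ~ q (last dim = particles)."""
    k_zz = gaussian_kernel_mixture(z, z).mean(dim=(-2, -1))
    k_ww = gaussian_kernel_mixture(w, w).mean(dim=(-2, -1))
    k_zw = gaussian_kernel_mixture(z, w).mean(dim=(-2, -1))
    return k_zz + k_ww - 2.0 * k_zw

# Example: particles of the current estimate vs. Bellman-target particles.
z = torch.randn(32, 200)               # Z_theta(s, a) particles, batch of 32
w = 1.0 + 0.99 * torch.randn(32, 200)  # stand-in target particles
loss = mmd2_biased(z, w).mean()        # averaged over the batch, then backpropagated
```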
MMDQN (III)
⚫ Pseudo code
▪ Hyper-parameters are taken as input, and the target network is initialized to match the main network
▪ At every step, a transition is sampled from the replay buffer
• In the control setting, the next action is inferred from the (ϵ-greedy) policy
• In the policy-evaluation setting, the next action is selected using the estimated return distribution μ̂(s′, a′)
▪ For each of the N statistics (particles), the target return values 𝒯Zᵢ are computed
▪ The MMD_b objective is computed and minimized with SGD (a minimal training-step sketch follows below)
[Figure: pseudo code of MMDQN]
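The following is a hedged sketch of one MMDQN-style update step, composing the helper functions from the earlier sketches (ParticleReturnNet, bellman_target_particles, mmd2_biased, all assumed to be in scope); it is an illustration of the loop described above, not the authors' implementation. Batch layout and optimizer settings are assumptions.

```python
# Illustrative sketch: one update step of an MMDQN-style agent, assuming the
# helpers from the previous sketches are defined in the same module.
import torch

def mmdqn_update(batch, online_net, target_net, optimizer, gamma=0.99):
    s, a, r, s_next, done = batch            # tensors sampled from the replay buffer
    # Particles of the current estimate for the taken actions: [batch, N]
    z = online_net(s)[torch.arange(len(a)), a]
    with torch.no_grad():                    # Bellman target uses the target network
        tz = bellman_target_particles(r, done, target_net(s_next), gamma)
    loss = mmd2_biased(z, tz).mean()         # biased MMD^2 between particle sets
    optimizer.zero_grad()
    loss.backward()                          # gradient step on the MMD objective
    optimizer.step()
    return loss.item()
```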
Experiment results (I)
⚫ Comparison with previous methods
▪ Experiments show the superior performance of MMDQN compared to previous methods on 55 Atari 2600 games (OpenAI Gym env)
• DQN, C51, RAINBOW, FQF, PRIOR, QR-DQN, IQN
[Tables: median and mean of the best *HN scores, and median and mean of the *HN scores; *HN score: Human-Normalized score (its definition is sketched below)]
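The slides only name the Human-Normalized (HN) score; the sketch below shows its standard definition from the Atari literature (assumed here, with made-up example scores).

```python
# Illustrative sketch: the commonly used Human-Normalized score.
def human_normalized_score(agent: float, random: float, human: float) -> float:
    # 0.0 = random-level play, 1.0 = human-level play
    return (agent - random) / (human - random)

print(human_normalized_score(agent=2500.0, random=150.0, human=7000.0))  # ≈ 0.34
```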
Experiment results (II)
⚫ Comparison with the previous method
▪ Experiments show the superior performance of MMDQN compared to the previous method (QR-DQN) on the 55 Atari 2600 games (OpenAI Gym env)
• MMDQN (3 seeds) vs. QR-DQN (2 seeds)
[Figure: online training curves for MMDQN and QR-DQN]
Experiment results (III)
⚫ Ablation study
▪ Two sets of ablation studies were performed to answer the following questions
• (a): Which kernel gives MMDQN the best performance?
– A mixture of Gaussian kernels with different bandwidths displays the best performance
• (b): How many particles give MMDQN the best performance?
– Using more than 50 particles yields better performance, and 200 particles exhibits the most stable performance
[Figure: sensitivity of MMDQN on the 6 tuning games w.r.t. (a) the kernel choice and (b) the number of particles N]
Backgrounds
⚫ Basics of measure theory
▪ Measurable space (𝑋, Σ)
• A pair (𝑋, Σ) is a 𝑚𝑒𝑎𝑠𝑢𝑟𝑎𝑏𝑙𝑒 𝑠𝑝𝑎𝑐𝑒 if 𝑋 is a set and Σ is a nonempty 𝜎 − 𝑎𝑙𝑔𝑒𝑏𝑟𝑎 of subsets of 𝑋
• A measurable space allows us to define a function that assigns real numbered values to the abstract elements of Σ
▪ Measure 𝜇
• Let (X, Σ) be a measurable space; a set function μ defined on Σ is called a measure iff it has the following properties:
– 0 ≤ μ(A) ≤ ∞ for any A ∈ Σ
– μ(∅) = 0
– For any sequence of pairwise disjoint sets {Aₙ} ⊂ Σ such that ∪ₙ₌₁^∞ Aₙ ∈ Σ, we have μ(∪ₙ₌₁^∞ Aₙ) = Σₙ₌₁^∞ μ(Aₙ)
▪ Measure space
• A triplet (X, Σ, μ) is a measure space if (X, Σ) is a measurable space and μ: Σ → [0, ∞] is a measure
• If μ(X) = 1, then μ is a probability measure, usually denoted P, and the measure space is a probability space
Backgrounds
⚫ Basics of measure theory
▪ Measurable function
• Let (Ω, Σ) and (Λ, G) be measurable spaces and f a function from Ω to Λ. The function f is called a measurable function from (Ω, Σ) to (Λ, G) iff f⁻¹(G) ⊂ Σ
▪ Random variable
• A random variable X is a measurable function from the probability space (Ω, Σ, P) into the probability space (𝒳, B_𝒳, P_𝒳), where 𝒳 ⊆ ℝ is the range of X, B_𝒳 is the Borel σ-algebra on 𝒳, and P_𝒳 is the probability measure (distribution) on 𝒳
– Specifically, X: Ω → 𝒳