THESIS FOR THE DEGREE OF LICENTIATE OF ENGINEERING
Sample Efficient Bayesian Reinforcement
Learning
DIVYA GROVER
Division of Data Science and AI
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Göteborg, Sweden 2020
Sample Efficient Bayesian Reinforcement Learning
DIVYA GROVER
Copyright © DIVYA GROVER, 2020
Thesis for the degree of Licentiate of Engineering
ISSN 1652-876X
Technical Report No. 213L
Division of Data Science and AI
Department of Computer Science and Engineering
Chalmers University of Technology
SE-412 96 Göteborg, Sweden
Phone: +46 (0)31 772 10 00
Author e-mail: divya.grover@chalmers.se
Cover:
This shows how to build intelligent agents capable of taking actions under uncertainty.
Printed by Chalmers Reproservice
Göteborg, Sweden 2020
To my parents, Monila and Suraj Grover.
ABSTRACT
Artificial Intelligence (AI) has been an active field of research for over a century now. The research field of AI may be grouped into various tasks that are
expected from an intelligent agent, two major ones being learning & inference, and planning. The act of storing new knowledge is known as learning,
while inference refers to the act of extracting conclusions given the agent's limited
knowledge base. The two are tightly knit by the design of the agent's knowledge base. The
process of deciding long-term actions or plans given the current knowledge is
called planning.
Reinforcement Learning (RL) brings together these two tasks by posing a seemingly benign question “How to act optimally in an unknown environment?”.
This requires the agent to learn about its environment as well as plan actions
given its current knowledge about it. In RL, the environment can be represented
by a mathematical model and we associate an intrinsic value to the actions that
the agent may choose.
In this thesis, we present a novel Bayesian algorithm for the problem of RL.
Bayesian RL is a widely explored area of research but is constrained by scalability and performance issues. We provide first steps towards a rigorous analysis
of these types of algorithms. Bayesian algorithms are characterized by the belief
that they maintain over their unknowns, which is updated based on the collected
evidence. This is different from the traditional approach in RL in terms of problem formulation and formal guarantees. Our novel algorithm combines aspects
of planning and learning due to its inherent Bayesian formulation. It does so in
a more scalable fashion, with formal PAC guarantees. We also give insights on
the application of the Bayesian framework to the estimation of the model and value,
in a joint work on Bayesian backward induction for RL.
Keywords: Bayesian Reinforcement Learning, Decision Making under Uncertainty
ACKNOWLEDGMENTS
I would not have been able to write this licentiate thesis without the support of
many people around me. I want to thank a few of them here for the time they
gave me.
First, I would like to thank my advisor Christos Dimitrakakis for his enormous
support. Your support and guidance has been invaluable to me. You fill the gaps
in my knowledge and your sharp counter-examples have helped me wrap my
head around very many things. Thanks for encouraging me to collaborate, e.g.,
the Harvard trip that you arranged for me. Also thank you for helping me out
with this thesis. Next, I would like to thank my co-supervisor Daniil Ryabko for his support. I recall
our many email exchanges, and his expertise in Bandits that helped me with some
proofs. I would like to thank my co-author Debabrota Basu for his continued support
and encouragement to pursue the right problems. Furthermore, I would like to
express my sincere gratitude to Frans A. Oliehoek for taking his time and energy
to read this work and lead the discussions of my licentiate seminar.
I cannot forget to thank Devdatt Dubashi for our many interactions, as a guide,
colleague and examiner. I recall our chats on India, Chalmers, Sweden and
everything in-between. I am grateful to Aristide Tossou for his bits of advice
about life in Sweden; Hannes Eriksson and Emilio Jorge for the many fruitful discussions on RL; Shirin Tavara for being a very supportive office-mate.
Furthermore, I show my gratitude to the many people whose names are not mentioned here but worked behind the scenes to help me. Finally, I am indebted
to my wife, family and friends who increased my motivation to continue this
thesis.
LIST OF PUBLICATIONS
This thesis is based on the following manuscripts.
• Divya Grover, Christos Dimitrakakis. "Deeper & Sparser Exploration". In the 35th International Conference on Machine Learning, Exploration in RL workshop, Stockholm, Sweden, July 10-15, 2018.
• Divya Grover, Debabrota Basu, Christos Dimitrakakis. "Bayesian Reinforcement Learning via Deep, Sparse Sampling". In the 23rd International Conference on Artificial Intelligence and Statistics, Palermo, Italy, June 3-5, 2020.
The following manuscript is under review.
• Christos Dimitrakakis, Hannes Eriksson, Emilio Jorge, Divya Grover, Debabrota Basu. "Inferential Induction: Joint Bayesian Estimation of MDPs and Value Functions". arXiv preprint arXiv:2002.03098.
Contents

I  EXTENDED SUMMARY

1  Introduction

2  Background
   2.1  Preliminaries
        2.1.1  Markov Decision Process (MDP)
        2.1.2  Bayes Adaptive MDP (BAMDP)
   2.2  Discussion
        2.2.1  Bandit problem
        2.2.2  Model-based Bayesian RL
        2.2.3  POMDP literature
        2.2.4  Bayesian value function

3  Efficient Bayesian RL
   3.1  Deep, Sparse Sampling
   3.2  Bayesian backward induction (BBI)

4  Concluding Remarks

II  PUBLICATIONS

List of Figures
   2.1  Full tree expansion
   3.1  Deeper & Sparser tree expansion
Part I
EXTENDED SUMMARY
Chapter 1
Introduction
The field of Operations Research can be seen as a precursor to modern AI.
It is a discipline that deals with the application of advanced analytical methods in making better decisions. During the world wars, many problems ranging
from project planning, network optimization, resource allocation, resource assignment and scheduling were studied within OR. The techniques used to solve
them were extensively studied by J. Von Neumann, A. Wald, Bellman and many
others. These are now colloquially referred to as Dynamic Programming (DP)
techniques. DP is arguably the most important method for dealing with a large
set of mathematical problems, known as decision making problems.
Decision making:
Decision making refers to those situations where an algorithm must interact
with a system to achieve a desired objective. A fundamental characteristic of
such problems is the feedback effect that these interactions have on the system.
In AI, this is analogous to the situation where an autonomous agent acts in an
environment. In many cases, we model this situation with a mathematical model
that encapsulates the basics of an interacting system.
Decision making problems can be divided into two types, depending on the level of
difficulty in solving them. The first is decision making under no uncertainty, which
refers to the situation where we have full knowledge of the system's1 behaviour.
Even in this case, deciding how to act is not a trivial problem. This process
of developing long-term actions is known as planning.
Decision making under uncertainty:
The second type is decision making under uncertainty. In real-world processes,
along with inherent (aleatoric) uncertainty2, there also exists uncertainty in our
knowledge (epistemic uncertainty) about them. Reinforcement Learning (RL) is an important
problem in this category. According to Duff [2002], "RL attempts to import
concepts from classical decision theory and utility theory to the domain of abstract agents operating in uncertain environments. It lies at the intersection
of control theory, operations research, artificial intelligence and animal learning." Its first successful application was the development of a state-of-the-art Backgammon-playing AI [Tesauro, 1994]. More recent successes include
game-playing AI [Mnih et al., 2015, Silver et al., 2017].
Planning in this general setting requires taking into account future events and
observations that may change our conclusions. Typically, this involves creating
long-term plans covering possible future eventualities, i.e. when planning under
uncertainty, we also need to take into account the possible future knowledge that
could be generated while acting. Executing actions also involves trying out new
things, to gather more information, but it is hard to tell whether this information
will be beneficial. The choice between acting in a manner that is known to
produce good results, or experimenting with something new, is known as the
exploration-exploitation dilemma. It is central to RL research.
Exploration-Exploitation:
Consider the problem of choosing your education stream for your long-term
career. Let’s say you are inclined towards Engineering. However, Project Management has recently been growing quite popular and is financially more rewarding. It is tempting to try it out! But there is a risk involved. It may turn
out to be much worse than Engineering, in which case you will regret switching streams. On the other hand, it could also be much better. What should
you do? It all depends on how much information you have about either career
choice and how many more years you are willing to spend to get a degree. If
you already have a PhD, then it is probably a better idea to go with Engineering.
However, if you just finished your bachelor's degree, Project Management may
be a good bet. If you are lucky, you will get a much higher salary for the remainder of your life, while otherwise you would miss out only by a year, making the
potential risk quite small.
1 Potentially stochastic.
2 Like the randomness associated with skating on ice.
Bayesian Reinforcement Learning:
One way to approach the exploration-exploitation dilemma is to take decisions
that explicitly take into account the uncertainty, both in the present and in the future. One may use the Bayesian approach for this; essentially, any algorithm is
Bayesian in nature if it maintains probabilistic beliefs on quantities of interest
and updates them using the collected evidence. Formulating the RL problem in a
Bayesian framework is known as Bayesian RL. Planning trees are data structures used in various planning algorithms. A belief tree is a planning tree used
to plan actions while explicitly taking into account their future effects, as in
the Bayesian RL setting.
Main contribution:
Our main contribution is a novel Bayesian RL algorithm, for planning in belief
trees. We perform a thorough analysis of this algorithm by giving a performance
bound for it.
Thesis outline:
In Chapter 2, we formally define the terms of interest and introduce the necessary background that will help understand the remainder of this thesis. In
Chapter 3, we summarize the contributions of this thesis. Section (3.1) presents the algorithm we introduced, discussing its practical benefit and providing a theoretical guarantee for it. Section (3.2) discusses our other work,
performed jointly, that takes an orthogonal approach to the same Bayesian RL
problem. Chapter 4 concludes the thesis and discusses some interesting future
work. The remainder of this thesis is a reprint of the full versions of the papers [Grover
and Dimitrakakis, 2018, Grover et al., 2019, Dimitrakakis et al., 2020].
Chapter 2
Background
We divide this chapter into two sections. We introduce the necessary preliminaries for this work in section (2.1), followed by an in-depth discussion of
many possible approaches to Bayesian RL in section (2.2).
2.1 Preliminaries
We first define the mathematical model used to describe an environment, known
as a Markov Decision Process (MDP). We then define a specific type of MDP,
called the Bayes Adaptive MDP, that arises when we are uncertain about the underlying environment, which is the focus of this thesis.
2.1.1 Markov Decision Process (MDP)
A Markov Decision Process (MDP) is a discrete-time stochastic process that provides a formal framework for RL problems.
Definition 1 (MDP). An MDP µ = (S, A, P, R) is composed of a state space
S, an action space A, a reward distribution R and a transition function P.
The transition function $P \triangleq P_\mu(s_{t+1} \mid s_t, a_t)$ dictates the distribution over next
states $s_{t+1}$ given the present state-action pair $(s_t, a_t)$. The reward distribution
$R \triangleq P_\mu(r_{t+1} \mid s_t, a_t)$ dictates the obtained reward, which belongs to the interval
[0, 1]. We shall also use $P_\mu(r_{t+1}, s_{t+1} \mid s_t, a_t)$ to denote the joint distribution of
rewards and next states of the MDP µ.
A policy π belonging to a policy space Π is an algorithm for selecting actions
given the present state and previous observations. The objective of an agent is
to find the policy π that maximizes the sum of discounted rewards, averaged over
all uncertainties.
The value function of a policy π for an MDP µ is the expected sum of discounted
rewards obtained from time t to T while selecting actions in the MDP µ:
$$V^{\pi,T}_{\mu,t}(s) = \mathbb{E}^\pi_\mu\left( \sum_{k=1}^{T} \gamma^k r_{t+k} \;\middle|\; s_t = s \right), \qquad (2.1)$$
where γ ∈ (0, 1] is called the discount factor and $\mathbb{E}^\pi_\mu$ denotes the expectation
under the trajectory generated by a policy π acting on the MDP µ. Let us
define the infinite-horizon discounted value function of a policy π on an MDP
µ as $V^\pi_\mu \triangleq \lim_{T \to \infty} V^{\pi,T}_{\mu,0}$. Now, we define the optimal value function to be
$V^*_\mu \triangleq \max_\pi V^\pi_\mu$, and the optimal policy to be $\pi^*_\mu \triangleq \arg\max_\pi V^\pi_\mu$.
A note on policies: The largest policy set, denoted $\Pi_T$, can have a cardinality
of $|A|^T$. Some other important policy sets are:
1. Stationary policies: All policies that are consistent over time, i.e.,
$\pi(a_t \mid s_t, s_{t-1}, \ldots, s_{t-k}) = \pi(a_{t'} \mid s_{t'}, s_{t'-1}, \ldots, s_{t'-k}) \;\; \forall t, t'$.
2. K-order Markov policies: All policies that only depend on the previous K
states, i.e., $\pi(a_t \mid s_t, s_{t-1}, \ldots, s_0) = \pi(a_t \mid s_t, s_{t-1}, \ldots, s_{t-K})$. They may or
may not be stationary.
3. Deterministic policies: Policies with a trivial action distribution, i.e., $\pi(a_t = a \mid \ldots) = 1$ for an action a.
We focus in this thesis on history-dependent policies, commonly referred to as
adaptive policies.
Define the Bellman operator $B^\pi_\mu : V \to V$ as follows:
$$B^\pi_\mu V(s) \triangleq \mathbb{E}^\pi_\mu(r) + \gamma \sum_{s' \in S} P^\pi_\mu(s' \mid s)\, V(s').$$
It allows us to compute the value function recursively via $V^\pi_{\mu,t} = B^\pi_\mu V^\pi_{\mu,t+1}$.
Repeated application of the relevant Bellman operator1 on the initial value function is referred to here as backward induction.
If the MDP is known, the optimal policy and value function are computable via
backward induction (also known as value iteration). Another important result is
that there exists an optimal policy in the set of 1-order deterministic stationary
Markov policies for an MDP [Puterman, 1994].
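To make the backward-induction computation concrete, here is a minimal sketch of value iteration for a small, known tabular MDP; the transition tensor `P`, reward matrix `R` and discount `gamma` below are illustrative placeholders and not taken from the thesis.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Backward induction (value iteration) for a known tabular MDP.

    P: array of shape (S, A, S), P[s, a, s'] = P(s' | s, a)
    R: array of shape (S, A), expected immediate reward in [0, 1]
    Returns the optimal value function V* and a greedy policy.
    """
    V = np.zeros(P.shape[0])
    while True:
        # Bellman optimality operator: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * P @ V            # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Toy two-state, two-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 0.1],
              [0.5, 1.0]])
V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```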
2.1.2 Bayes Adaptive MDP (BAMDP)
In reality, the underlying MDP is unknown to the RL algorithm, which gives
rise to the exploration-exploitation dilemma. Bayesian Reinforcement Learning (BRL), specifically the information state formulation [Duff, 2002], provides
a framework to quantify this trade-off using a Bayesian representation.
Following the Bayesian formulation, we maintain a belief distribution βt over
the possible MDP models µ ∈ M.2 Starting with an appropriate prior belief
β0 (µ), we obtain a sequence of posterior beliefs βt (µ) that represent our subjective belief over the MDPs at time t, depending on the latest observations. By
Bayes’ rule, the posterior belief at time t + 1 is
βt+1 (µ) , R
Pµ (rt+1 , st+1 |st , at )βt (µ)
.
0
(r , s |s , a )β (µ0 )dµ0
P
M µ t+1 t+1 t t t
(2.2)
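As an illustration of how the posterior update in eq. (2.2) can be carried out in practice, the sketch below assumes a discrete MDP with an independent Dirichlet prior on each transition row (and, for simplicity, ignores the reward likelihood); this conjugate belief model is a common choice, but only one of many that the framework admits.

```python
import numpy as np

class DirichletBelief:
    """Belief beta_t over discrete MDPs: a Dirichlet over each (s, a) transition row."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        self.alpha = np.full((n_states, n_actions, n_states), prior_count)

    def update(self, s, a, s_next):
        # Conjugate posterior update: observing (s, a, s') adds one pseudo-count.
        self.alpha[s, a, s_next] += 1.0

    def marginal_transition(self, s, a):
        # Posterior-predictive transition probabilities for (s, a).
        return self.alpha[s, a] / self.alpha[s, a].sum()

    def sample_mdp(self):
        # Draw one transition model mu ~ beta_t (used e.g. by Thompson sampling).
        return np.array([[np.random.dirichlet(self.alpha[s, a])
                          for a in range(self.alpha.shape[1])]
                         for s in range(self.alpha.shape[0])])

belief = DirichletBelief(n_states=3, n_actions=2)
belief.update(s=0, a=1, s_next=2)
print(belief.marginal_transition(0, 1))
```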
Now, we define the Bayesian value function v analogously to the MDP value
function:
$$v^\pi_\beta(s) \triangleq \int_M V^\pi_\mu(s)\,\beta(\mu)\,d\mu. \qquad (2.3)$$
The Bayesian value function is the expected utility of the decision maker according to its current belief β and policy π for selecting future actions from
state s. The optimal policy for the Bayesian value function can be adaptive in
general. For completeness, we also define the Bayes-optimal utility $v^*_\beta(s)$, i.e.
the utility of the Bayes-optimal policy:
$$v^*_\beta(s) \triangleq \max_{\pi \in \Pi} \int_M V^\pi_\mu(s)\,\beta(\mu)\,d\mu. \qquad (2.4)$$
1 The BAMDP has a slightly different one.
2 More precisely, we can define a measurable space $(M, \mathfrak{M})$, where M is the possible set of MDPs and $\mathfrak{M}$ is a suitable σ-algebra.
It is well known that by combining the original MDP's state $s_t$ and belief $\beta_t$
into a hyper-state $\omega_t$, we obtain another MDP called the Bayes Adaptive MDP
(BAMDP). The optimal policy for a BAMDP is the same as the Bayes-optimal
policy for the corresponding MDP.
Definition 2 (BAMDP). A Bayes Adaptive Markov Decision Process (BAMDP)
$\tilde{\mu} \triangleq (\Omega, A, \nu, \tau)$ is a representation for an unknown MDP µ = (S, A, P, R)
with a space of information states Ω = S × B, where B is an appropriate set
of belief distributions on M. At time t, the agent observes the information state
$\omega_t = (s_t, \beta_t)$ and takes action $a_t \in A$. We denote the transition distribution as
$\nu(\omega_{t+1} \mid \omega_t, a_t)$, the reward distribution as $\tau(r_{t+1} \mid \omega_t, a_t)$, and A as the common
action space.
For each st+1 , the next hyper-state ωt+1 = (st+1 , βt+1 ) is uniquely determined
since βt+1 is unique given (ωt , st+1 ) and can be computed using eq. (2.2).
Therefore the information state ωt preserves the Markov property. This allows us to treat the BAMDP as an infinite-state MDP with ν(ωt+1 |ωt , at ), and
τ (rt+1 |ωt , at ) defined as the corresponding transition and reward distributions
respectively. The transition and reward distributions are defined as the marginal
distributions
$$\nu(\omega_{t+1} \mid \omega_t, a_t) \triangleq \int_M P_\mu(s_{t+1} \mid s_t, a_t)\,\beta_t(\mu)\,d\mu,$$
$$\tau(r_{t+1} \mid \omega_t, a_t) \triangleq \int_M P_\mu(r_{t+1} \mid s_t, a_t)\,\beta_t(\mu)\,d\mu.$$
Though the Bayes-optimal policy is generally adaptive in the original MDP, it
is Markov with respect to the hyper-state of the BAMDP. In other words, ωt
represents a sufficient statistic for the observed history.
Since the BAMDP is an MDP on the space of hyper-states, we can use value
iteration starting from the set of terminal hyper-states ΩT and proceeding backwards from horizon T to t following
$$V^*_t(\omega) = \max_{a \in A}\; \mathbb{E}[r \mid \omega, a] + \gamma \sum_{\omega' \in \Omega_{t+1}} \nu(\omega' \mid \omega, a)\, V^*_{t+1}(\omega'), \qquad (2.5)$$
where $\Omega_{t+1}$ is the reachable set of hyper-states from hyper-state $\omega_t$. Equation (2.4) implies Equation (2.5) and vice-versa3, i.e. $v^*_\beta(s) = V^*_0(\omega)$ for
ω = (s, β). Hence, we can obtain Bayes-optimal policies through backward
induction. Due to the large hyper-state space, this is only feasible for a small
finite horizon in practice, as shown in Algorithm 1.
Algorithm 1 FHTS (Finite Horizon Tree Search)
Parameters: Horizon T
Input: current hyper-state ω_h and depth h.
if h = T then
    return V(ω_h) = 0
end if
for all actions a do
    for all next states s_{h+1} do
        β_{h+1} = UpdatePosterior(ω_h, s_{h+1}, a)   (eq. 2.2)
        ω_{h+1} = (s_{h+1}, β_{h+1})
        V(ω_{h+1}) = FHTS(ω_{h+1}, h + 1)
    end for
end for
Q(ω_h, a) = 0
for all ω_{h+1}, a do
    Q(ω_h, a) += ν(ω_{h+1} | ω_h, a) × V(ω_{h+1})
end for
return max_a Q(ω_h, a)
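A direct, exponential-cost implementation of Algorithm 1 can be sketched as a recursive function over hyper-states. The belief object is assumed to expose a posterior-predictive `marginal_transition`, a posterior-mean `mean_reward` and a copy-on-write `updated` method; these interface names are ours, for illustration only, and folding the immediate reward in via its posterior mean is a simplification.

```python
def fhts(state, belief, depth, horizon, actions, states, gamma=0.95):
    """Finite Horizon Tree Search over hyper-states omega = (state, belief).

    Exhaustively expands every action and every next state, so the cost grows
    exponentially with (horizon - depth); feasible only for tiny horizons.
    """
    if depth == horizon:
        return 0.0
    best = float("-inf")
    for a in actions:
        q = belief.mean_reward(state, a)          # posterior-mean immediate reward
        for s_next in states:
            # Posterior-predictive probability nu(omega' | omega, a) of this successor.
            p = belief.marginal_transition(state, a)[s_next]
            if p == 0.0:
                continue
            next_belief = belief.updated(state, a, s_next)   # eq. (2.2)
            q += gamma * p * fhts(s_next, next_belief, depth + 1,
                                  horizon, actions, states, gamma)
        best = max(best, q)
    return best
```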
2.2 Discussion
In this section, we first discuss the motivation for the Bayesian RL formulation.
We do this by making an analogy to the Bandit problem, for which the theory
is much clearer and more developed. Then we discuss algorithms that directly
attack the BAMDP problem, followed by a POMDP perspective on Bayesian
RL. Finally, we discuss Bayesian value function algorithms, which are the premise of our work on Bayesian backward induction.
3 The equivalence can be obtained by expanding the integral in eq. (2.4) using the definition of the value function and applying Bayes' rule to its second term. This gives the desired recursive equation.
Figure 2.1: Full tree expansion.
2.2.1 Bandit problem
Consider a very simple MDP, with only a single state and multiple actions.
This is known as the stochastic multi-armed bandit problem and is well studied in
Bandit theory. RL in multi-state MDPs, in addition to the exploration-exploitation
dilemma, presents the difficulties of delayed reward and non-independence of
consecutive samples from the MDP process. Bandit theory, without
these additional difficulties, presents a cleaner view of the dilemma.
A stochastic multi-armed bandit model, denoted by ν = (ν_1, ..., ν_K), is a collection of K arms (or actions in MDP notation), where each arm ν_a,
when selected, generates a reward from a probability distribution4. The agent interacts with a Bandit MDP by choosing at each time t an arm A_t to play. This
action results in a realization X_t from the respective distribution ν_{A_t}. The distribution is assumed to be parameterized by θ ∈ Θ. The optimality of an agent is
defined in terms of regret:
$$R_\theta \triangleq T\mu^* - \sum_{t=0}^{T} X_t,$$
where $\mu^*$ is the mean of the optimal arm $\nu^*$.
4 From here onwards, we refer to ν_i as both the arm and its underlying reward distribution.
Frequentist and Bayesian analysis: A fundamental difference between the Frequentist and Bayesian approaches lies in their respective treatment of the unknown parameters. The Frequentist interpretation of the Bandit problem assumes there
is a fixed unknown θ_a associated with each arm, while the Bayesian interpretation
assumes that θ_a itself is generated from a prior distribution Ψ over the set Θ.
These two views also have different goals:
1. An agent that is optimal in the Frequentist sense chooses to minimize $R^\pi_\theta$ for all θ ∈ Θ, that is, find $\pi^* = \arg\min_\pi \max_\theta \mathbb{E}^\pi_\theta[R_\theta]$.
2. An agent optimal in the Bayesian sense chooses to minimize the expected regret $\mathbb{E}^\pi_{\Psi,\theta}[R_\theta]$, that is, find $\pi^* = \arg\min_\pi \mathbb{E}^\pi_{\Psi,\theta}[R_\theta]$ for a known prior Ψ.
The more general question of finding $\pi^* = \arg\min_\pi \max_\Psi \mathbb{E}^\pi_{\Psi,\theta}[R_\theta]$ is
open, but we expect the bounds to be worse than the Frequentist ones
because nature's optimal strategy may not be deterministic.
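As a concrete illustration of the Bayesian treatment of the bandit problem, the sketch below runs Thompson sampling with a Beta(1, 1) prior on a Bernoulli bandit and reports the empirical regret $R_\theta$; the arm means and horizon are illustrative choices, not taken from the thesis.

```python
import numpy as np

def thompson_bernoulli(true_means, horizon, rng):
    """Thompson sampling with independent Beta(1, 1) priors on each arm."""
    K = len(true_means)
    successes = np.ones(K)   # Beta alpha parameters
    failures = np.ones(K)    # Beta beta parameters
    total_reward = 0.0
    for _ in range(horizon):
        theta = rng.beta(successes, failures)      # sample one model per arm
        arm = int(np.argmax(theta))                # act greedily w.r.t. the sample
        reward = rng.random() < true_means[arm]
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    # Regret relative to always pulling the best arm: R = T mu* - sum_t X_t.
    return horizon * max(true_means) - total_reward

rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.7]      # illustrative arm means
regrets = [thompson_bernoulli(true_means, horizon=1000, rng=rng) for _ in range(20)]
print("mean empirical regret:", np.mean(regrets))
```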
The celebrated Gittins index [Gittins, 1979] gives a solution to the Bayesian formulation of the discounted Bandit problem. According to Kaufmann [2014] (sec.
1.3.4), Chang and Lai [1987] developed closed-form asymptotic approximations to the Gittins index. These approximations take the form of explicit bonus
terms added to the point estimate of the mean θ̂, for Gaussian bandits. Bonus terms
added to point estimates are essential in proving Frequentist optimality for Bandits [Cappé et al., 2013] and MDPs [Jaksch et al., 2010, Tossou et al., 2019], using typical arguments of the Optimism in the Face of Uncertainty (OFU) principle. She
shows experimentally that the Finite Horizon Gittins index actually performs much
better in terms of the Frequentist regret than algorithms designed to be optimal for
it (e.g. KL-UCB). We conjecture that even for RL in MDPs, the Bayes-optimal
policy would inherently explore enough without the need for additional explicit
optimism. This connection may also be seen in a previous attempt of Duff and
Barto [1997], where they try to use the Gittins index for BAMDPs.
2.2.2 Model-based Bayesian RL
The BAMDP was initially investigated by Silver [1963] and Martin [1967]. The
problem of computational intractability of the Bayes-optimal solution motivated researchers to design approximate techniques. These are referred to as
Bayesian RL (BRL) algorithms. The BRL algorithms discussed here are all model-based. Ghavamzadeh et al. [2015] compile a survey of BRL algorithms. They
can be further classified based on whether they directly approximate the belief
tree structure (lookahead) or not (myopic). Hence we first discuss BRL algorithms based on their design and then their theoretical motivation.
We classify them into two categories based on their functioning: Myopic and
Lookahead.
Myopic: Myopic algorithms do not explicitly take into account the information
to be gained by future actions, and yet may still be able to learn efficiently. One
example of such an algorithm is Thompson sampling [Thompson, 1933], which
maintains a posterior distribution over models, samples one of them and then
chooses the optimal policy for the sample. A reformulation of this for BRL was
investigated in [Strens, 2000]. The Best Of Sampled Set (BOSS) [Asmuth et al.,
2009] algorithm generalizes this idea to a multi-sample optimistic approach.
At first look, BEB [Kolter and Ng, 2009] seems to work directly on the Bayesian
value function (eq. 2.5), but it simply adds an explicit bonus term to the mean-MDP estimates of the state-action value. Similar to BEB, MMBI [Dimitrakakis,
2011] assumes a constant belief and therefore rolls the hyper-state value into
the state-action value, but unlike BEB, it directly approximates the Bayesian
value function. This assumption removes the exponential dependence (due to
the path-dependent belief) on the planning horizon. Then, backward induction is performed using the value of the next-step optimal adaptive policy. The final output
is the stationary policy obtained at the root through backward induction5. The take-home point is that, even for a constant belief, the mean-MDP policy is not the best
adaptive policy.
5 Although it shouldn't be hard to store the intermediate optimal constant-belief adaptive policy, since it is computed anyway (step 8, Algo. 1).
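The following self-contained sketch illustrates the myopic, Thompson-sampling style of model-based BRL discussed above (in the spirit of [Thompson, 1933, Strens, 2000]): at the start of each episode one MDP is drawn from a Dirichlet belief, its optimal policy is computed, and that policy is followed while the posterior is updated. The environment sizes, rewards and episode lengths are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 4-state, 2-action environment; the true dynamics are hidden from the agent.
n_S, n_A = 4, 2
P_true = rng.dirichlet(np.ones(n_S), size=(n_S, n_A))      # true transitions
R = rng.uniform(0.0, 1.0, size=(n_S, n_A))                  # reward means, assumed known here
alpha = np.ones((n_S, n_A, n_S))                            # Dirichlet belief over transitions

def solve(P, R, gamma=0.95, iters=300):
    """Value iteration on a (sampled) tabular MDP; returns a greedy policy."""
    V = np.zeros(n_S)
    for _ in range(iters):
        Q = R + gamma * P @ V
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

s = 0
for _ in range(30):
    # Thompson sampling: draw one MDP from the posterior and plan for it (myopic BRL).
    P_sampled = np.array([[rng.dirichlet(alpha[si, ai]) for ai in range(n_A)]
                          for si in range(n_S)])
    policy = solve(P_sampled, R)
    for _ in range(50):
        a = policy[s]
        s_next = rng.choice(n_S, p=P_true[s, a])
        alpha[s, a, s_next] += 1.0                           # posterior update, eq. (2.2)
        s = s_next
```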
Lookahead: Lookahead algorithms take into account the effect of their future
actions on their knowledge about the environment and quantify its benefit
for current decision making. The simplest algorithm is to calculate and solve
the BAMDP up to some horizon T, as outlined in Algorithm 1 and illustrated in
Figure 2.1. Sparse sampling [Kearns et al., 1999] is a simple modification to it,
which instead only iterates over a set of sampled states. Kearns' algorithm, when
applied to the BAMDP belief tree6, would still have to consider all primitive actions.
Wang et al. [2005] improved upon this by using Thompson sampling to only
consider a subset of promising actions. The high branching factor of the belief tree
still makes planning with a deep horizon computationally expensive. Thus more
scalable algorithms, such as BFS3 [Asmuth and Littman, 2011], BOLT [Araya
et al., 2012] and BAMCP [Guez et al., 2012], were proposed. Similar to [Wang
et al., 2005], BFS3 also selects a subset of actions but with an optimistic action selection strategy, though the backups are still performed using the Bellman
equation. BOLT instead includes optimism in the transition function. BAMCP
takes a Monte-Carlo approach to sparse lookahead in the belief-augmented version
of the Markov decision process. BAMCP also uses optimism for action selection.
Unlike BFS3, the next set of hyper-states is sampled from an MDP sampled
at the root7. Since posterior inference is expensive for any non-trivial belief
model, BAMCP further applies lazy sampling and a rollout policy, inspired by
their application in tree search problems [Kocsis and Szepesvári, 2006].
Analysis: We discuss here BEB and BOLT, which have theoretical results similar to ours. Both are PAC-BAMDP and derive their result by achieving a certain
level of certainty about (s, a) tuples, similar to Kearns and Singh [1998], who
define such tuples as 'known'. BEB's authors rely on how many (s, a) tuples are
already 'known' to prove their result. They go on to prove that both the finite-horizon Bayes-optimal policy and BEB's policy decay the exploration rate so fast
that they are not PAC-MDP8, providing a 3-state MDP counterexample.
BOLT's authors prove that if the probability of unknown states is small enough,
BOLT's Bayesian value function is close to optimal; if not, then such events (of
seeing 'unknown' (s, a) tuples) occur only a limited number of times (by contradiction), and this required amount of exploration is ensured by BOLT's optimism. The authors also extend BEB's PAC-BAMDP result to the infinite-horizon
case. The problem with the above approach of 'knowing' all the (s, a) tuples
enough is that the Bayes-optimal policy no longer remains interesting; Martin
[1967] proves that in such a case, the Bayes-optimal policy approaches the optimal policy of the underlying MDP. This leads to an unfaithful approximation of the
true Bayes-optimal policy. A better approach is to prove near-optimality without such an assumption, e.g., BOP [Fonteneau et al., 2013] uses an upper bound on
the Bayesian value for branch-and-bound tree search. In practice, their exponential
dependence on the branching factor is still quite strong (Proposition 2). We, on
the other hand, beat the state-of-the-art by making a reasonable assumption on the
belief convergence rate. We formalize this idea in [Grover et al., 2019], which
approaches Bayes optimality with a constraint only on the computational budget.
BAMCP is also successful in practice due to computational reasons, that is, it
is a Monte Carlo technique which samples a large number of nodes from the belief
tree. Its result, however, is only asymptotically optimal.
6 We freely use the term 'tree' or 'belief tree' to denote the planning tree generated by the algorithms in the hyper-state space of the BAMDP.
7 Note that ideally the next observations should be sampled from P(s_{t+1} | ω_t) instead of P(s_{t+1} | ω^o_t), i.e. the next-state marginal at the root belief.
8 This is the first result on the Frequentist nature of the Bayes-optimal policy.
2.2.3 POMDP literature
A Partially Observable MDP (POMDP) is a well-studied [Åström, 1965, Sondik,
1978, Kaelbling et al., 1998] generalization of the MDP. Consider the field of robotics,
where even though the dynamics of a robot may be known, there is uncertainty
in the sensory input, i.e., the current state of the environment. Such problems
are modeled by a POMDP. It is an MDP where we maintain a distribution over
the possible states and plan accordingly. A natural idea would be to apply the
already existing POMDP literature to the BAMDP. A significant effort was made
by Duff [2002], who shows an almost mechanical translation of the POMDP
alpha-vector9 formulation to the BAMDP (sec. 5.3). He notes that due to the belief
having continuous support in the BAMDP, in contrast to the discrete support (over
states) in the POMDP, fundamental differences10 arise in the application of Monahan's algorithm [Monahan, 1982]. He comments (sec. 5.3.3) how the alpha
functions obtained through backward induction are just a mixture of the alpha functions at the previous iteration11, by extension of which any closed set of functions representing
the Bayesian value initially will imply a function from the same family locally at the
root belief. Therefore the "idea of characterizing the value function in terms of a
finite set of elements generalizes from the POMDP case". However, he quickly
points out how this approach is computationally infeasible: exact methods for
POMDPs [Sondik, 1978, Kaelbling et al., 1998] crucially depend on eliminating the exponentially growing alpha vectors with respect to the planning horizon,
and for this, they solve a set of linear equation constraints. These constraints
in the BAMDP case turn into integral constraints, which usually do not have an easily computable solution. Hence, the curse of exponential memory usage with
respect to planning depth still remains. Poupart et al. [2006] claim to show
that alpha functions in the BAMDP are multivariate polynomials in shape, but their
main Theorem only relies on backward induction as proof. It is unclear to us
how the initial alpha functions should be multivariate polynomials. Duff [2002]
goes on to argue for and develop a general finite-state (memory) controller method
for both the POMDP and BAMDP problems. This approach holds much promise
and may be a future research direction of this thesis. One key observation usually missed is that belief convergence does not exist in POMDPs12 and hence we
miss out by blindly using POMDP algorithms.
9 A vector, $\alpha : [0, 1]^{|S|} \to \mathbb{R}$, compactly representing the Bayesian value over all of belief space.
10 Alpha vectors become alpha functions, and their dot product with the belief becomes an integral.
11 More precisely, the reward is added to this mixture by the definition of the Bellman operator.
12 In the BAMDP, the belief over transition probabilities converges as we plan, while no such analogy exists when considering a belief simply over MDP states.
2.2.4 Bayesian value function
Estimating the Bayesian state-value function distribution directly is another interesting approach. Unlike BAMDP algorithms, the algorithms here do not compute the
hyper-state value function. Algorithms in this category either directly estimate the
state-value distribution from the data or rely on a Bayesian formulation of backward induction. They come in both model-free [Dearden et al., 1998, Engel et al., 2003] and model-based [Dearden et al., 1999, Dimitrakakis, 2011,
Deisenroth et al., 2009] flavours. In the model-based approach, the uncertainty in
the state-value is taken care of by maintaining a distribution over models, although
the similarity with the BAMDP ends there.
Model-free: Bayesian Q-learning [Dearden et al., 1998] attempts to model the
Bayesian Q-value directly with a parametric distribution. They propose online
learning using the myopic Value of Perfect Information (VPI) as the action selection
strategy, which essentially gives the expected (belief-averaged) advantage of
choosing an action over the others. They then propose two ways, both based
on bootstrapping, to address the crux of the problem: delayed rewards, i.e., no
direct access to Q-value samples. In practice, their algorithm changes policy
too often to actually get a good estimate by bootstrapping. They mention this
problem in their follow-up [Dearden et al., 1999]: "to avoid the problem faced
by model-free exploration methods, that need to perform repeated actions to
propagate values from one state to another". The reader is directed to section (2.4)
of [Duff, 2002] for a survey of other similar non-Bayesian attempts. Engel
et al. [2003] propose a Gaussian Process prior on the value function, combined
with a temporal-difference-motivated data likelihood. This is discussed further
in [Dimitrakakis et al., 2020].
Model-based: Dearden et al. [1999] propose a model-based follow-up to address the problems in Bayesian Q-learning. The main algorithm (sec. 5.1) computes a Monte Carlo upper bound on the Bayesian value function, which they
take as a substitute for the optimal Bayesian Q-value13. They address the complexity of sampling and solving multiple models by two re-weighting approaches:
importance sampling (sec. 5.2) and particle filtering (sec. 5.3). Although they
do mention a Bayesian Bellman update (sec. 5.4), it is not clearly described (it
seems like a mean-field approximation), and they do not investigate it experimentally. PILCO [Deisenroth et al., 2013] develops an analytic method similar
in nature to [Dearden et al., 1999], where they use a GP prior with an assumption of normally distributed inputs (sec. 3.2, 4) to predict a closed-form Bayesian estimate
of the objective function (multi-step cost, eq. 11). They use analytical gradients
for policy optimization (sec. 3.3). Deisenroth et al. [2009] take a more direct
approach by developing backward induction for Gaussian Process (GP) priors
over models.
In [Dimitrakakis et al., 2020] we develop a Bayesian backward induction framework for joint estimation of model and value function distributions.
13 The state-action value function is commonly known as the Q-value.
Chapter 3
Efficient Bayesian Reinforcement Learning
The following sections will outline the contributions of this thesis.
In Section (3.1), we develop a sampling based approach to belief tree approximation. We go a step further than previous works by sampling policies instead
of actions to curb the branching factor and reduce complexity. We analyze it,
giving a performance bound, and experimentally validate it on different discrete
environments. My individual contribution is the development of the initial idea
with my supervisor, implementation and experiments, and contributing to its
theoretical analysis.
In Section (3.2), we propose a fully Bayesian, backward induction approach
for joint estimation of model and value function distributions. My individual
contribution to this work is the baseline implementation, verifying correct functioning of algorithms and drawing comparison to BAMDP techniques.
3.1 Deep, Sparse Sampling
We propose a novel BAMDP algorithm, called Deep, Sparse Sampling (DSS),
with the help of the insights developed in section (2.2.2). During planning, it focuses on reducing the branching factor by considering K-step policies instead
of primitive actions. These policies are generated through (possibly approximate) Thompson sampling1 over MDP models. This approach is rounded off by
using Sparse sampling [Kearns et al., 1999]. The reduced branching factor allows us to build a deeper tree. Figure 3.1 shows a planning tree expanded by
DSS2. The intuition for why this might be desirable is that if the belief changes
slowly enough, an adaptive policy that is constructed out of K-step stationary policies will still be approximately optimal. This intuition is supported by
the theoretical analysis: we prove that our algorithm results in nearly-optimal
planning under certain mild assumptions regarding the belief. The freedom to
choose a policy generator allows the algorithm to scale smoothly: we choose Policy Iteration (PI) or a variant of Real Time Dynamic Programming (RTDP)
depending on the size of the environment.
Algorithm:
The core idea of the DSS algorithm is to plan in the belief tree, not at the individual action level, but at the level of K-step policies. Figure 3.1 illustrates
this concept graphically. Algorithm 2 is called with the current state s and
belief β as input, with additional parameters controlling how the tree is approximated. The algorithm then generates the tree and calculates the value of
each policy candidate recursively (for H stages or episodes), in the following
manner:
1. Line 6: Generate N MDPs from the current belief βt , and for each MDP
µi use the policy generator P : µ → π to generate a policy πi . This gives
a policy set Πβ with |Πβ | = N .
2. Lines 10-18: Run each policy for K steps, collecting the total K-step discounted reward R in the BAMDP. Note that we sample the reward and next
state from the marginal (Lines 13-14), and also update the posterior (Line 16).
3. Lines 19-21: Make a recursive call to DSS at the end of the K steps. Repeat the
process just described M times. This gives an M-sample estimate of that policy's utility $v^\pi_\beta$.
1 We refer to the optimal policy of the sampled model as the Thompson sample (TS) policy.
2 Subscripts denote the planning depth. Superscripts on a hyper-state index the policy and belief, respectively.
Algorithm 2 DSS
1:  Parameters: Number of stages H, steps K, no. of policies N, no. of samples per policy M, policy generator P
2:  Input: hyper-state ω_h = (s_h, β_h), depth h.
3:  if h = KH then
4:      return V(ω_h) = 0
5:  end if
6:  Π_{β_h} = {P(µ_i) | µ_i ∼ ω_h, i ∈ Z, i ≤ N}
7:  for all π ∈ Π_{β_h} do
8:      Q(ω_h, π) = 0
9:      for 1 to M do
10:         R = 0, c = γ^h, k = 0
11:         ω_k = ω_h, s_k = s_h, β_k = β_h, a_k = π(s_h)
12:         for k = 1, . . . , K do
13:             s_{k+1} ∼ ν(ω_{k+1} | ω_k, a_k)
14:             r_{k+1} ∼ τ(r_{k+1} | ω_k, a_k)
15:             R += c × r_{k+1};  c = c × γ
16:             β_{k+1} = UpdatePosterior(ω_k, s_{k+1}, a_k)   (from eq. 2.2)
17:         end for
18:         Q(ω_h, π) += R + DSS(ω_K, h + K)
19:     end for
20:     Q(ω_h, π) /= M
21: end for
22: return arg max_π Q(ω_h, π)
Note that the fundamental control unit that we are trying to find here is a policy,
hence Q-values are defined over (ω_t, π) tuples. Since we now have policies at
any given tree node, we re-branch only after running those policies for K steps.
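A compact sketch of the DSS recursion in Algorithm 2 is given below. The belief object is assumed to support `sample_mdp`, `sample_transition_and_reward` (draws from the marginals ν and τ), `updated` (eq. 2.2) and `copy`, and `policy_generator` stands in for the PI/RTDP solvers mentioned above; these interface names are ours, for illustration only. For simplicity the function returns the estimated value of the best sampled policy, with the root-level arg max recovered separately.

```python
import numpy as np

def dss(omega, depth, *, H, K, N, M, gamma, policy_generator):
    """Deep, Sparse Sampling: branch over N K-step Thompson-sample policies, not actions.

    omega = (state, belief) is the current hyper-state; depth counts primitive steps.
    Returns the estimated Bayesian value of the best sampled policy at this node.
    """
    state, belief = omega
    if depth == K * H:
        return 0.0
    # Line 6: generate a policy set from N MDP models sampled from the current belief.
    policies = [policy_generator(belief.sample_mdp()) for _ in range(N)]
    q = np.zeros(N)
    for i, policy in enumerate(policies):
        for _ in range(M):
            s, b = state, belief.copy()
            ret, discount = 0.0, gamma ** depth
            # Lines 10-17: roll the policy out for K steps through the BAMDP marginals.
            for _ in range(K):
                a = policy[s]
                s_next, r = b.sample_transition_and_reward(s, a)  # draws from nu and tau
                ret += discount * r
                discount *= gamma
                b = b.updated(s, a, s_next)                       # posterior update, eq. (2.2)
                s = s_next
            # Line 18: recursive call at the end of the K-step segment.
            q[i] += ret + dss((s, b), depth + K, H=H, K=K, N=N, M=M,
                              gamma=gamma, policy_generator=policy_generator)
        q[i] /= M
    return float(q.max())
```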
Figure 3.1: Deeper & Sparser tree expansion.
Hence we can increase the effective depth of the belief tree up to HK for the
same computational budget. This allows for deeper lookahead and ensures that
the approximation error propagated is also smaller, as the error is discounted by
$\gamma^{HK}$ instead of $\gamma^H$. We elaborate on this effect in the analysis below.
Analysis:
Since our goal is to prove the near-Bayes optimality of DSS, we focus on the
effects of DSS parameters for planning in belief tree. DSS eliminates the necessity to try all actions at every node by making certain assumptions about the
belief:
Assumption 1. The belief $\beta_h$ in the planning tree is such that $\epsilon_h \leq \epsilon_0/h$, where
$h \geq 1$, $\epsilon_h \triangleq \|\hat{\beta}_h - \beta_h\|_1$ and $\hat{\beta}_h$ is the constant-belief approximation at the
start of episode h.
The first assumption states that as we go deeper in the planning tree, the belief
error reduces. The intuition is that if the belief concentrates at a certain rate,
then so does the error of the Bayes utility for any Markov policy, by virtue of its
definition.
Assumption 2 (Bounded correlation). Given some constant $C \in \mathbb{R}^+$, define
$D(\mu, \mu') \triangleq \max_{s,a} \|P_\mu(\cdot \mid s, a) - P_{\mu'}(\cdot \mid s, a)\|_1$. We have:
$$\beta_t(\mu)\,\beta_t(\mu') \leq \frac{C}{D(\mu, \mu')}$$
The second assumption states that the belief correlation across similar MDPs is
higher than across dissimilar ones, due to the inverse dependence on the distance
between MDPs. This helps us prove a bound on the Bayesian value of the DSS policy
compared to the K-step optimal policy.
Then, our algorithm finds a near-Bayes-optimal policy as stated in Theorem 1:
Theorem 1. Under Assumptions 1 and 2, $\forall s \in S$,
$$v^{DS}_\beta(s) \;\geq\; v^*_\beta(s) \;-\; 2\epsilon_0 K \ln\frac{1}{1-\gamma^K} \;-\; \frac{2(KC + \gamma^K)}{1-\gamma} \;-\; \sqrt{\frac{\ln M/\delta}{2N(1-\gamma)^2}}$$
with probability 1 − δ. Here, T is the horizon, divided by the parameter K into H
stages, i.e., T = KH. In addition, at each node of the sparse tree, we evaluate
N policies M times each.
At the same time, DSS is significantly less expensive than basic Sparse sampling [Kearns et al., 1999], which would take $O((|A|M)^T)$ calls to the generative BAMDP model, while DSS requires only $O((NM)^{T/K})$ calls for a T-horizon
problem.
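For intuition about this scaling claim, the snippet below simply evaluates the two call-count expressions for one arbitrary, illustrative parameter setting.

```python
# Generative-model calls: basic Sparse sampling vs. DSS, for a T-horizon problem.
A, M, N, T, K = 4, 3, 3, 12, 4          # illustrative sizes only

sparse_sampling_calls = (A * M) ** T     # O((|A| M)^T)
dss_calls = (N * M) ** (T // K)          # O((N M)^(T/K))

print(f"Sparse sampling: {sparse_sampling_calls:.3e} calls")
print(f"DSS:             {dss_calls:.3e} calls")
```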
Experiments:
We refer the reader to the original publication [Grover et al., 2019] for comparisons to the current state-of-the-art.
3.2 Bayesian backward induction (BBI)
Simple BI:
Since this work is based on backward induction, we first consider the case of
estimating the mean value function under MDP uncertainty. We show here how
the most common approximation, the mean MDP, is related to the BAMDP value
function.
Consider the backward induction equation for the Bayesian value:
$$
\begin{aligned}
V^\pi_t(s_t, \beta_t) &= \int_M V^\pi_\mu(s_t)\,\beta_t(\mu)\,d\mu \\
&= \int_M \mathbb{E}_\mu[r_{t+1}]\,\beta_t(\mu)\,d\mu + \gamma \int_M \sum_{s' \in s_{t+1}} P_\mu(s' \mid s_t, a_t)\, V^\pi_\mu(s')\,\beta_t(\mu)\,d\mu \\
&= \int_M \mathbb{E}_\mu[r_{t+1}]\,\beta_t(\mu)\,d\mu + \gamma \sum_{s' \in s_{t+1}} \int_M V^\pi_\mu(s')\, P_\mu(s' \mid s_t, a_t)\,\beta_t(\mu)\,d\mu \qquad (3.1)
\end{aligned}
$$
The mean-MDP approximation of eq. (3.1) is obtained by taking a mean-field approximation of the inner part of the second term:
$$\int_M V^\pi_\mu(s')\, P_\mu(s' \mid s_t, a_t)\, P(\mu)\,d\mu \;\approx\; \int_M V^\pi_\mu(s')\, P(\mu)\,d\mu \int_M P_\mu(s' \mid s_t, a_t)\, P(\mu)\,d\mu$$
Define $\bar{V}(s') \triangleq \int_M V^\pi_\mu(s')\, P(\mu)\,d\mu$ and $\nu(s' \mid s, a) \triangleq \int_M P_\mu(s' \mid s, a)\, P(\mu)\,d\mu$.
Substituting the above expression back into eq. (3.1):
$$V^\pi_t(s_t, \beta_t) = \int_M \mathbb{E}_\mu[r_{t+1}]\,\beta_t(\mu)\,d\mu + \gamma \sum_{s' \in s_{t+1}} \nu(s' \mid s_t, \pi_t)\,\bar{V}^\pi_t(s') \qquad (3.2)$$
Equation (3.2) gives the mean-MDP approximation under the constant-belief assumption, since then we only keep track of |S| values at each iteration. MMBI
[Dimitrakakis, 2011] works directly on eq. (3.1), giving much better results.
These approximations only give the mean of the state-value distribution, although it would be more useful to get the full distribution.
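A minimal numerical sketch of the constant-belief, mean-MDP approximation in eq. (3.2): plan by finite-horizon backward induction in the posterior-mean MDP of a Dirichlet belief, with known rewards assumed for simplicity. This is only the mean-MDP baseline, not the MMBI algorithm itself, and the counts and rewards below are illustrative.

```python
import numpy as np

def mean_mdp_backward_induction(alpha, R, horizon, gamma=0.95):
    """Finite-horizon backward induction in the posterior-mean MDP.

    alpha: Dirichlet counts of shape (S, A, S) representing the (frozen) belief.
    R:     known expected rewards of shape (S, A).
    Keeps only |S| values per stage, as noted for eq. (3.2).
    """
    nu = alpha / alpha.sum(axis=2, keepdims=True)   # nu(s' | s, a): mean transition model
    V = np.zeros(alpha.shape[0])                    # V_T = 0
    for _ in range(horizon):
        Q = R + gamma * nu @ V                      # eq. (3.2) with a constant belief
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

# Illustrative belief counts and rewards for a 3-state, 2-action problem.
alpha = np.ones((3, 2, 3)) + np.random.default_rng(2).integers(0, 5, size=(3, 2, 3))
R = np.array([[0.1, 0.0], [0.4, 0.2], [0.0, 1.0]])
V_bar, greedy = mean_mdp_backward_induction(alpha, R, horizon=20)
print(V_bar, greedy)
```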
Distribution over value functions:
Consider the value function V , with V = (V1 , . . . , VT ) for finite-horizon problems, and some prior belief β over MDPs, and some previously collected data
D = (s1 , a1 , r1 , . . . , st , at , rt ) from some policy π. Then the posterior value
function distribution can be written in terms of the MDP posterior:
$$P_\beta(V \mid D) = \int_M P_\mu(V)\; d\beta(\mu \mid D). \qquad (3.3)$$
Note that this is different from eq. (2.3), which gives the expected state-value
under the model distribution, while here we get the full state-value distribution
for the given belief. The empirical measure $\hat{P}^E_{MC}$ defined below corresponds to
the standard Monte-Carlo estimate
$$\hat{P}^E_{MC}(B) \triangleq N_\mu^{-1} \sum_{k=1}^{K} \mathbb{1}\left\{ v^{(k)} \in B \right\}, \qquad (3.4)$$
where 1 {} is the indicator function. In practice, this can be implemented via
Algorithm 3. The problem with this approach is the computational cost associated with it.
Algorithm 3 Monte-Carlo Estimation of Value Function Distributions
1: Select a policy π.
2: for k = 1, . . . , N_µ do
3:     Sample an MDP µ^(k) ∼ β.
4:     Calculate v^(k) = V^π_{µ^(k)}.
5: end for
6: return P̂_MC({v^(k)})
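Algorithm 3 can be sketched directly: sample MDPs from the belief, evaluate the fixed policy exactly on each sample, and treat the resulting values as an empirical distribution as in eq. (3.4). The Dirichlet belief over transitions and the known reward matrix are illustrative assumptions, and evaluation here is for the infinite-horizon discounted criterion.

```python
import numpy as np

def policy_value(P, R, policy, gamma=0.95):
    """Exact policy evaluation on one sampled MDP: V = (I - gamma P_pi)^-1 r_pi."""
    n = P.shape[0]
    P_pi = P[np.arange(n), policy]            # (S, S) transition matrix under the policy
    r_pi = R[np.arange(n), policy]            # (S,) expected rewards under the policy
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def mc_value_distribution(alpha, R, policy, n_samples=500, rng=None):
    """Monte-Carlo estimate of the posterior value-function distribution (Algorithm 3)."""
    rng = rng or np.random.default_rng()
    samples = []
    for _ in range(n_samples):
        # Sample an MDP mu^(k) ~ beta from the Dirichlet belief over transitions.
        P = np.array([[rng.dirichlet(row) for row in sa_rows] for sa_rows in alpha])
        samples.append(policy_value(P, R, policy))
    return np.array(samples)                  # empirical measure over v^(k) = V^pi_{mu^(k)}

alpha = np.ones((3, 2, 3))                    # illustrative uniform belief
R = np.array([[0.2, 0.8], [0.5, 0.1], [0.0, 1.0]])
policy = np.array([1, 0, 1])
vs = mc_value_distribution(alpha, R, policy, n_samples=200)
print("posterior mean value:", vs.mean(axis=0), "std:", vs.std(axis=0))
```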
Bayesian Backward Induction (BBI):
We propose a framework that inductively calculates $P^\pi_\beta(V_i \mid D)$ from
$P^\pi_\beta(V_{i+1} \mid D)$ for $i \geq t$:
$$P^\pi_\beta(V_i \mid D) = \int_{\mathcal{V}} P^\pi_\beta(V_i \mid V_{i+1}, D)\; d\,P^\pi_\beta(V_{i+1} \mid D). \qquad (3.5)$$
Let $\psi_{i+1}$ be a (possibly approximate) representation of $P^\pi_\beta(V_{i+1} \mid D)$. Then
the remaining problem is to define the term $P^\pi_\beta(V_i \mid V_{i+1}, D)$ appropriately and
calculate the complete distribution.
Link distribution: $P(V_i \mid V_{i+1}, D)$
A simple idea for dealing with the term linking the two value functions is to
marginalize over the MDP as follows:
$$P^\pi_\beta(V_i \mid V_{i+1}, D) = \int_M P^\pi_\mu(V_i \mid V_{i+1})\; d\,P^\pi_\beta(\mu \mid V_{i+1}, D). \qquad (3.6)$$
This equality holds because, given µ, $V_i$ is uniquely determined by the policy
π and $V_{i+1}$ through the Bellman operator. However, it is crucial to note that
$P^\pi_\beta(\mu \mid V_{i+1}, D) \neq P_\beta(\mu \mid D)$, as knowing the value function gives information
about the MDP.3
In order to maintain a correct estimate of uncertainty, we must specify an appropriate conditional distribution $P^\pi_\beta(\mu \mid V_{i+1}, D)$. We focus on the idea of maintaining an approximation $\psi_i$ of the value function distributions and combining this
with the MDP posterior through an appropriate kernel, as detailed below.
Conditional MDP distribution: $P(\mu \mid V_{i+1}, D)$
The other important design decision concerns the distribution $P^\pi_\beta(\mu \mid V^{(k)}_{i+1}, D)$.
Expanding this term, we obtain, for any subset of MDPs $A \subseteq M$:
$$P^\pi_\beta(\mu \in A \mid V_{i+1}, D) = \frac{\int_A P^\pi_\mu(V_{i+1})\, d\beta(\mu \mid D)}{\int_M P^\pi_\mu(V_{i+1})\, d\beta(\mu \mid D)}, \qquad (3.7)$$
since $P^\pi_\mu(V_{i+1} \mid D) = P^\pi_\mu(V_{i+1})$, as µ, π are sufficient for calculating $V_{i+1}$.
Inference and experiments:
We refer the reader to the full paper [Dimitrakakis et al., 2020] for details on
the inference procedure and experiments.
3 Assuming otherwise results in a mean-field approximation.
Chapter 4
Concluding Remarks
In this thesis, we introduced a novel BAMDP algorithm, Deep, Sparse Sampling
(DSS), which is the author's primary contribution. We showed its superior performance experimentally and also analyzed its theoretical properties. We also
jointly proposed a Bayesian backward induction approach for estimating the
state-value and model distributions.
A natural extension to this thesis would be in the direction of bounded memory
(insufficient statistics) controllers. We believe it to be a promising avenue to
prove stronger results for BAMDP algorithms. While a general regret bound
for the Bayes-optimal policy is an open question, counter-intuitively, the analysis with insufficient statistics may be simpler, since there will only be a finite
number of beliefs. The main idea for BAMDP would be to optimally tune the
branching factor and depth of the planning tree depending on the accuracy of the
statistics. We can then analyze the effect on optimality relative to an oracle with
sufficient statistics. Next, we can analyze how close approximate algorithms
(DSS, Sparse sampling etc.) are to the best insufficient statistic policy under different computational constraints. For completeness, we shall also analyze the
regret for insufficient-statistic versions of the upper-bound algorithms relying
on concentration inequalities. We expect to get qualitatively different bounds
compared to standard analysis, due to the use of insufficient statistics.
Bibliography
Mauricio Araya, Olivier Buffet, and Vincent Thomas. Near-optimal brl using
optimistic local transitions. arXiv preprint arXiv:1206.4613, 2012.
J. Asmuth, L. Li, M. L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In UAI 2009, 2009.
John Asmuth and Michael L Littman. Approaching Bayes-optimality using
Monte-Carlo tree search. In Proc. 21st Int. Conf. Automat. Plan. Sched.,
Freiburg, Germany, 2011.
Karl Johan Åström. Optimal control of markov processes with incomplete
state information. Journal of Mathematical Analysis and Applications, 10
(1):174–205, 1965.
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos,
Gilles Stoltz, et al. Kullback–leibler upper confidence bounds for optimal
sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.
Fu Chang and Tze Leung Lai. Optimal stopping and dynamic allocation. Advances in Applied Probability, 19(4):829–853, 1987.
Richard Dearden, Nir Friedman, and Stuart J. Russell. Bayesian Q-learning.
In AAAI/IAAI, pages 761–768, 1998. URL citeseer.ist.psu.edu/dearden98bayesian.html.
Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration. In Kathryn B. Laskey and Henri Prade, editors, Proceedings of
the 15th Conference on Uncertainty in Artificial Intelligence (UAI-99), pages
29
BIBLIOGRAPHY
30
150–159, San Francisco, CA, July 30–August 1 1999. Morgan Kaufmann,
San Francisco, CA.
Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE transactions
on pattern analysis and machine intelligence, 37(2):408–423, 2013.
M.P. Deisenroth, C.E. Rasmussen, and J. Peters. Gaussian process dynamic
programming. Neurocomputing, 72(7-9):1508–1524, 2009.
Christos Dimitrakakis. Robust bayesian reinforcement learning through tight
lower bounds. In European Workshop on Reinforcement Learning, page
arXiv:1106.3651v2. Springer, 2011.
Christos Dimitrakakis, Hannes Eriksson, Emilio Jorge, Divya Grover, and Debabrota Basu. Inferential induction: Joint bayesian estimation of mdps and
value functions. arXiv preprint arXiv:2002.03098, 2020.
Michael O Duff and Andrew G Barto. Local bandit approximation for optimal
learning problems. In Advances in Neural Information Processing Systems,
pages 1019–1025, 1997.
Michael O’Gordon Duff. Optimal Learning Computational Procedures for
Bayes-adaptive Markov Decision Processes. PhD thesis, University of Massachusetts at Amherst, 2002.
Yaakov Engel, Shie Mannor, and Ron Meir. Bayes meets bellman: The gaussian
process approach to temporal difference learning. In ICML 2003, 2003.
Raphael Fonteneau, Lucian Buşoniu, and Rémi Munos. Optimistic planning
for belief-augmented markov decision processes. In 2013 IEEE Symposium
on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL),
pages 77–84. IEEE, 2013.
Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al.
Bayesian reinforcement learning: A survey. Foundations and Trends® in
Machine Learning, 8(5-6):359–483, 2015.
John C Gittins. Bandit processes and dynamic allocation indices. Journal of the
Royal Statistical Society: Series B (Methodological), 41(2):148–164, 1979.
Divya Grover and Christos Dimitrakakis. Deeper and sparser sampling. Exploration in Reinforcement Learning Workshop, ICML, 2018.
Divya Grover, Debabrota Basu, and Christos Dimitrakakis. Bayesian reinforcement learning via deep, sparse sampling. arXiv preprint arXiv:1902.02661,
2019.
Arthur Guez, David Silver, and Peter Dayan. Efficient bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information
Processing Systems, pages 1025–1033, 2012.
Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds
for reinforcement learning. Journal of Machine Learning Research, 11:
1563–1600, 2010.
Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning
and acting in partially observable stochastic domains. Artificial intelligence,
101(1-2):99–134, 1998.
Emilie Kaufmann. Analyse de stratégies Bayésiennes et fréquentistes pour l’allocation séquentielle de ressources. PhD thesis, Paris, ENST, 2014.
Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in
polynomial time. In Proc. 15th International Conf. on Machine Learning, pages 260–268. Morgan Kaufmann, San Francisco, CA, 1998. URL
citeseer.ist.psu.edu/kearns98nearoptimal.html.
Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling
algorithm for near-optimal planning in large Markov decision processes. In
Thomas Dean, editor, IJCAI, pages 1324–1331. Morgan Kaufmann, 1999.
ISBN 1-55860-613-0.
Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In
European conference on machine learning, pages 282–293. Springer, 2006.
J Zico Kolter and Andrew Y Ng. Near-bayesian exploration in polynomial
time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 513–520. ACM, 2009.
James John Martin. Bayesian decision problems and Markov chains. Wiley,
1967.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
George E Monahan. State of the art—a survey of partially observable Markov
decision processes: theory, models, and algorithms. Management Science,
28(1):1–16, 1982.
P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete
Bayesian reinforcement learning. In ICML 2006, pages 697–704. ACM Press
New York, NY, USA, 2006.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New Jersey, US, 1994.
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja
Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian
Bolton, et al. Mastering the game of go without human knowledge. Nature,
550(7676):354–359, 2017.
Edward A Silver. Markovian decision processes with uncertain transition probabilities or rewards. Technical report, Massachusetts Institute of Technology, Operations Research Center, 1963.
Edward J Sondik. The optimal control of partially observable markov processes over the infinite horizon: Discounted costs. Operations research, 26
(2):282–304, 1978.
Malcolm Strens. A bayesian framework for reinforcement learning. In ICML,
pages 943–950, 2000.
Gerald Tesauro. Td-gammon, a self-teaching backgammon program, achieves
master-level play. Neural computation, 6(2):215–219, 1994.
W.R. Thompson. On the Likelihood that One Unknown Probability Exceeds
Another in View of the Evidence of two Samples. Biometrika, 25(3-4):
285–294, 1933.
Aristide Tossou, Debabrota Basu, and Christos Dimitrakakis. Near-optimal optimistic reinforcement learning using empirical bernstein inequalities. arXiv
preprint arXiv:1905.12425, 2019.
Tao Wang, Daniel Lizotte, Michael Bowling, and Dale Schuurmans. Bayesian
sparse sampling for on-line reward optimization.
In ICML ’05, pages
956–963, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi:
http://doi.acm.org/10.1145/1102351.1102472.