
Bayes-Adaptive POMDPs

Stéphane Ross, McGill University, Montréal, Qc, Canada (sross12@cs.mcgill.ca)
Brahim Chaib-draa, Laval University, Québec, Qc, Canada (chaib@ift.ulaval.ca)
Joelle Pineau, McGill University, Montréal, Qc, Canada (jpineau@cs.mcgill.ca)

Abstract

Bayesian Reinforcement Learning has generated substantial interest recently, as it provides an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However, most investigations of Bayesian reinforcement learning to date focus on standard Markov Decision Processes (MDPs). Our goal is to extend these ideas to the more general Partially Observable MDP (POMDP) framework, where the state is a hidden variable. To address this problem, we introduce a new mathematical model, the Bayes-Adaptive POMDP. This new model allows us to (1) improve knowledge of the POMDP domain through interaction with the environment, and (2) plan optimal sequences of actions which can trade off between improving the model, identifying the state, and gathering reward. We show how the model can be finitely approximated while preserving the value function. We describe approximations for belief tracking and planning in this model. Empirical results on two domains show that the model estimate and the agent's return improve over time, as the agent learns better model estimates.

1 Introduction
In many real-world systems, uncertainty can arise in both the prediction of the system's behavior and the observability of the system's state. Partially Observable Markov Decision Processes (POMDPs) take both kinds of uncertainty into account and provide a powerful model for sequential decision making under these conditions. However, most POMDP solution methods assume that the model is known a priori, which is rarely the case in practice. For instance, in robotics the POMDP must exactly reflect the uncertainty in the robot's sensors and actuators. These parameters are rarely known exactly and must often be approximated by a human designer, so that even if this approximate POMDP could be solved exactly, the resulting policy might not be optimal. We therefore seek a decision-theoretic planner which can take into account the uncertainty over model parameters during the planning process, and which can learn the values of these unknown parameters from experience.
Bayesian Reinforcement Learning has investigated this problem in the context of fully observable MDPs [1, 2, 3]. An extension to POMDPs has recently been proposed [4], but this method relies on heuristics to select actions that will improve the model, thus forgoing any theoretical guarantee on the quality of the approximation, and it requires an oracle that can be queried to provide the current state.
In this paper, we draw inspiration from the Bayes-Adaptive MDP framework [2], which is formulated to provide an optimal solution to the exploration-exploitation trade-off. To extend these ideas to POMDPs, we face two challenges: (1) how to update the Dirichlet parameters when the state is a hidden variable, and (2) how to approximate the infinite-dimensional belief space to perform belief monitoring and compute the optimal policy. This paper tackles both problems jointly. The first problem is solved by including the Dirichlet parameters in the state space and maintaining belief states over these parameters. We address the second by bounding the space of Dirichlet parameters to a finite subspace necessary for ε-optimal solutions.

We provide theoretical results for bounding the state space while preserving the value function, and we use these results to derive approximate solving and belief-monitoring algorithms. We compare several belief approximations in two problem domains. Empirical results show that the agent is able to learn good POMDP models and improve its return as it learns better model estimates.

2 POMDP
A POMDP is defined by finite sets of states $S$, actions $A$ and observations $Z$. It has transition probabilities $\{T^{sas'}\}_{s,s' \in S, a \in A}$, where $T^{sas'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$, and observation probabilities $\{O^{saz}\}_{s \in S, a \in A, z \in Z}$, where $O^{saz} = \Pr(z_t = z \mid s_t = s, a_{t-1} = a)$. The reward function $R : S \times A \to \mathbb{R}$ specifies the immediate reward obtained by the agent. In a POMDP, the state is never observed. Instead the agent perceives an observation $z \in Z$ at each time step, which (along with the action sequence) allows it to maintain a belief state $b \in \Delta S$. The belief state specifies the probability of being in each state given the history of actions and observations experienced so far, starting from an initial belief $b_0$. It can be updated at each time step using Bayes' rule:
$$b_{t+1}(s') = \frac{O^{s' a_t z_{t+1}} \sum_{s \in S} T^{s a_t s'} b_t(s)}{\sum_{s'' \in S} O^{s'' a_t z_{t+1}} \sum_{s \in S} T^{s a_t s''} b_t(s)}.$$
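To make this update concrete, the following is a minimal Python sketch of $\tau(b, a, z)$ for a generic POMDP with a known model; the array layout (T[s, a, s'] and O[s', a, z]) is our own illustrative assumption, not something specified in the paper.

import numpy as np

def belief_update(b, a, z, T, O):
    """Bayes-rule belief update b' = tau(b, a, z) for a known POMDP.

    b: belief over states, shape (|S|,)
    a: action index, z: observation index
    T: transition probabilities, T[s, a, s'] = Pr(s' | s, a)
    O: observation probabilities, O[s', a, z] = Pr(z | s', a)
    """
    # Predict: Pr(s' | b, a) = sum_s T[s, a, s'] b(s)
    predicted = b @ T[:, a, :]
    # Correct: weight by the likelihood of observing z in each s'
    unnormalized = O[:, a, z] * predicted
    norm = unnormalized.sum()
    if norm == 0.0:
        raise ValueError("Observation has zero probability under this belief.")
    return unnormalized / norm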

A policy $\pi : \Delta S \to A$ indicates how the agent should select actions as a function of the current belief. Solving a POMDP involves finding the optimal policy $\pi^*$ that maximizes the expected discounted return over the infinite horizon. The return obtained by following $\pi^*$ from a belief $b$ is defined by Bellman's equation:
$$V^*(b) = \max_{a \in A} \left[ \sum_{s \in S} b(s) R(s, a) + \gamma \sum_{z \in Z} \Pr(z \mid b, a)\, V^*(\tau(b, a, z)) \right],$$
where $\tau(b, a, z)$ is the new belief after performing action $a$ and receiving observation $z$, and $\gamma \in [0, 1)$ is the discount factor.
Exact solution algorithms [5] are usually intractable, except on small domains with only a few states, actions and observations. Various approximate algorithms, both offline [6, 7, 8] and online [9], have been proposed to tackle increasingly large domains. However, all these methods require full knowledge of the POMDP model, which is a strong assumption in practice. Some approaches do not require knowledge of the model, as in [10], but these approaches generally require a lot of data and do not address the exploration-exploitation trade-off.

3 Bayes-Adaptive POMDP
In this section, we introduce the Bayes-Adaptive POMDP (BAPOMDP) model, an optimal decision-theoretic framework for learning and planning in POMDPs under parameter uncertainty. Throughout, we assume that the state, action, and observation spaces are finite and known, but that the transition and observation probabilities are unknown or partially known. We also assume that the reward function is known, as it is generally specified by the user for the specific task at hand, but the model can easily be generalised to learn the reward function as well.

To model the uncertainty on the transition parameters $T^{sas'}$ and observation parameters $O^{saz}$, we use Dirichlet distributions, which are probability distributions over the parameters of multinomial distributions. Given $\phi_i$, the number of times event $e_i$ has occurred over $n$ trials, the probabilities $p_i$ of each event follow a Dirichlet distribution, i.e. $(p_1, \dots, p_k) \sim \mathrm{Dir}(\phi_1, \dots, \phi_k)$. This distribution represents the probability that a discrete random variable behaves according to some probability distribution $(p_1, \dots, p_k)$, given that the counts $(\phi_1, \dots, \phi_k)$ have been observed over $n$ trials ($n = \sum_{i=1}^k \phi_i$). Its probability density function is defined by $f(p, \phi) = \frac{1}{B(\phi)} \prod_{i=1}^k p_i^{\phi_i - 1}$, where $B$ is the multinomial beta function. The expected value of $p_i$ is $E(p_i) = \frac{\phi_i}{\sum_{j=1}^k \phi_j}$.
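As a small worked example (ours, not from the paper): with $k = 2$ and counts $(\phi_1, \phi_2) = (5, 3)$,
$$E(p_1) = \frac{5}{5+3} = 0.625, \qquad E(p_2) = \frac{3}{5+3} = 0.375,$$
and observing event $e_1$ once more simply increments $\phi_1$, giving the posterior $\mathrm{Dir}(6, 3)$ with $E(p_1) = 6/9 \approx 0.667$.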

3.1 The BAPOMDP Model

The BAPOMDP is constructed from the model of the POMDP with unknown parameters. Let $(S, A, Z, T, O, R, \gamma)$ be that model. The uncertainty on the distributions $T^{sa\cdot}$ and $O^{s'a\cdot}$ can be represented by experience counts: $\phi^a_{ss'}$ represents the number of times the transition $(s, a, s')$ occurred, and similarly $\psi^a_{s'z}$ is the number of times observation $z$ was made in state $s'$ after doing action $a$. Let $\phi$ be the vector of all transition counts and $\psi$ be the vector of all observation counts.
Given the count vectors $\phi$ and $\psi$, the expected transition probability for $T^{sas'}$ is $T^{sas'}_{\phi} = \frac{\phi^a_{ss'}}{\sum_{s'' \in S} \phi^a_{ss''}}$, and similarly for $O^{s'az}$: $O^{s'az}_{\psi} = \frac{\psi^a_{s'z}}{\sum_{z' \in Z} \psi^a_{s'z'}}$.
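A minimal Python sketch of these expectations, assuming the counts are stored as NumPy arrays phi[s, a, s'] and psi[s', a, z] (an illustrative layout of our own choosing, not specified in the paper):

import numpy as np

def expected_transition(phi):
    """T_phi[s, a, s'] = phi[s, a, s'] / sum over s'' of phi[s, a, s'']."""
    return phi / phi.sum(axis=2, keepdims=True)

def expected_observation(psi):
    """O_psi[s', a, z] = psi[s', a, z] / sum over z' of psi[s', a, z']."""
    return psi / psi.sum(axis=2, keepdims=True)

# Example with 2 states and 1 action: prior counts (5, 3) give the 0.625 expectation above.
phi = np.array([[[5.0, 3.0]], [[2.0, 6.0]]])   # shape (|S|, |A|, |S|)
print(expected_transition(phi)[0, 0])           # -> [0.625 0.375]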

The objective of the BAPOMDP is to learn an optimal policy, such that actions are chosen to maximize reward, taking into account both state and parameter uncertainty. To model this, we follow the Bayes-Adaptive MDP framework and include the $\phi$ and $\psi$ vectors in the state of the BAPOMDP. Thus, the state space $S'$ of the BAPOMDP is defined as $S' = S \times T \times O$, where $T = \{\phi \in \mathbb{N}^{|S|^2|A|} \mid \forall (s,a),\ \sum_{s' \in S} \phi^a_{ss'} > 0\}$ represents the space in which $\phi$ lies and $O = \{\psi \in \mathbb{N}^{|S||A||Z|} \mid \forall (s,a),\ \sum_{z \in Z} \psi^a_{sz} > 0\}$ represents the space in which $\psi$ lies. The action and observation sets of the BAPOMDP are the same as in the original POMDP. The transition and observation functions of the BAPOMDP must capture how the state and count vectors $\phi, \psi$ evolve after every time step. Consider an agent in a given state $s$ with count vectors $\phi$ and $\psi$, which performs action $a$, causing it to move to state $s'$ and observe $z$. Then the vector $\phi'$ after the transition is defined as $\phi' = \phi + \delta^a_{ss'}$, where $\delta^a_{ss'}$ is a vector full of zeroes, with a 1 for the count $\phi^a_{ss'}$; the vector $\psi'$ after the observation is defined as $\psi' = \psi + \delta^a_{s'z}$, where $\delta^a_{s'z}$ is a vector full of zeroes, with a 1 for the count $\psi^a_{s'z}$. Note that the probabilities of such transitions and observations occurring must be defined by considering all models and their probabilities as specified by the current Dirichlet distributions, which turn out to be their expectations. Hence, we define $T'$ and $O'$ to be:

$$T'((s, \phi, \psi), a, (s', \phi', \psi')) = \begin{cases} T^{sas'}_{\phi}\, O^{s'az}_{\psi}, & \text{if } \phi' = \phi + \delta^a_{ss'} \text{ and } \psi' = \psi + \delta^a_{s'z}, \\ 0, & \text{otherwise.} \end{cases} \qquad (1)$$

$$O'((s, \phi, \psi), a, (s', \phi', \psi'), z) = \begin{cases} 1, & \text{if } \phi' = \phi + \delta^a_{ss'} \text{ and } \psi' = \psi + \delta^a_{s'z}, \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$

Note here that the observation probabilities are folded into the transition function, and that the ob-
servation function becomes deterministic. This happens because a state transition in the BAPOMDP
automatically specifies which observation is acquired after transition, via the way the counts are
incremented. Since the counts do not affect the reward, the reward function of the BAPOMDP is de-
fined as R′ ((s, φ, ψ), a) = R(s, a); the discount factor of the BAPOMDP remains the same. Using
these definitions, the BAPOMDP has a known model specified by the tuple (S ′ , A, Z, T ′ , O′ , R′ , γ).
The belief state of the BAPOMDP represents a distribution over both states and count values. The model is learned by simply maintaining this belief state, as the distribution will concentrate over the most likely models, given the prior and experience so far. If $b_0$ is the initial belief state of the unknown POMDP, and the count vectors $\phi_0 \in T$ and $\psi_0 \in O$ represent the prior knowledge on this POMDP, then the initial belief of the BAPOMDP is $b'_0(s, \phi, \psi) = b_0(s)$ if $(\phi, \psi) = (\phi_0, \psi_0)$, and $0$ otherwise. After actions are taken, the uncertainty on the POMDP model is represented by mixtures of Dirichlet distributions (i.e. mixtures of count vectors).
Note that the BAPOMDP is in fact a POMDP with a countably infinite state space. Hence the belief update function and optimal value function are still defined as in Section 2. However, these functions now require summations over $S' = S \times T \times O$. Maintaining the belief state is practical only if the number of states with non-zero probability is finite. We prove this in the following theorem:
Theorem 3.1. Let $(S', A, Z, T', O', R', \gamma)$ be a BAPOMDP constructed from the POMDP $(S, A, Z, T, O, R, \gamma)$. If $S$ is finite, then at any time $t$, the set $S'_{b'_t} = \{\sigma \in S' \mid b'_t(\sigma) > 0\}$ has size $|S'_{b'_t}| \le |S|^{t+1}$.

Proof. Proof available in [11]. Proceeds by induction from b′0 .

The proof of this theorem suggests that it is sufficient to iterate over $S$ and $S'_{b'_{t-1}}$ in order to compute the belief state $b'_t$ when an action and observation are taken in the environment. Hence, Algorithm 3.1 can be used to update the belief state.

function τ(b, a, z)
    Initialize b′ as a 0 vector.
    for all (s, φ, ψ, s′) ∈ S′_b × S do
        b′(s′, φ + δ^a_{ss′}, ψ + δ^a_{s′z}) ← b′(s′, φ + δ^a_{ss′}, ψ + δ^a_{s′z}) + b(s, φ, ψ) T^{sas′}_φ O^{s′az}_ψ
    end for
    return normalized b′

Algorithm 3.1: Exact Belief Update in BAPOMDP.
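The following is a hedged Python sketch of Algorithm 3.1. The representation is our own, not the paper's: a belief is a dictionary mapping hyperstates (s, phi, psi) to probabilities, with phi and psi kept as hashable count structures, and T_exp, O_exp and increment are hypothetical helpers computing $T^{sas'}_\phi$, $O^{s'az}_\psi$ and the count increment $\phi + \delta$.

def bapomdp_belief_update(b, a, z, n_states, T_exp, O_exp, increment):
    """Exact BAPOMDP belief update (Algorithm 3.1) over a dict-based belief.

    b: dict mapping (s, phi, psi) -> probability; phi and psi are hashable counts.
    T_exp(phi, s, a, s_next): expected transition probability T_phi^{s a s'}.
    O_exp(psi, s_next, a, z): expected observation probability O_psi^{s' a z}.
    increment(counts, index): returns a copy of `counts` with that entry + 1.
    """
    b_next = {}
    for (s, phi, psi), p in b.items():
        for s_next in range(n_states):
            w = p * T_exp(phi, s, a, s_next) * O_exp(psi, s_next, a, z)
            if w == 0.0:
                continue
            phi_next = increment(phi, (s, a, s_next))   # phi + delta^a_{s s'}
            psi_next = increment(psi, (s_next, a, z))   # psi + delta^a_{s' z}
            key = (s_next, phi_next, psi_next)
            b_next[key] = b_next.get(key, 0.0) + w
    norm = sum(b_next.values())
    return {k: v / norm for k, v in b_next.items()}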

3.2 Exact Solution for BAPOMDP in Finite Horizons

The value function of a BAPOMDP for finite horizons can be represented by a finite set $\Gamma$ of functions $\alpha : S' \to \mathbb{R}$, as in standard POMDPs. For example, an exact solution can be computed using dynamic programming (see [5] for more details):


$$\begin{array}{l}
\Gamma^a_1 = \{\alpha^a \mid \alpha^a(s, \phi, \psi) = R(s, a)\},\\
\Gamma^{a,z}_t = \{\alpha^{a,z}_i \mid \alpha^{a,z}_i(s, \phi, \psi) = \gamma \sum_{s' \in S} T^{sas'}_{\phi} O^{s'az}_{\psi}\, \alpha_i(s', \phi + \delta^a_{ss'}, \psi + \delta^a_{s'z}),\ \alpha_i \in \Gamma_{t-1}\},\\
\Gamma^a_t = \Gamma^a_1 \oplus \Gamma^{a,z_1}_t \oplus \Gamma^{a,z_2}_t \oplus \cdots \oplus \Gamma^{a,z_{|Z|}}_t \quad \text{(where $\oplus$ is the cross-sum operator)},\\
\Gamma_t = \bigcup_{a \in A} \Gamma^a_t.
\end{array} \qquad (3)$$
Note here that the definition of $\alpha^{a,z}_i(s, \phi, \psi)$ is obtained from the fact that $T'((s,\phi,\psi), a, (s',\phi',\psi'))\, O'((s,\phi,\psi), a, (s',\phi',\psi'), z) = 0$ except when $\phi' = \phi + \delta^a_{ss'}$ and $\psi' = \psi + \delta^a_{s'z}$. The optimal policy is extracted as usual: $\pi_\Gamma(b) = \operatorname{argmax}_{\alpha \in \Gamma} \sum_{\sigma \in S'_b} \alpha(\sigma) b(\sigma)$. In practice, it will be impossible to compute $\alpha^{a,z}_i(s, \phi, \psi)$ for all $(s, \phi, \psi) \in S'$. In order to compute these more efficiently, we show in the next section that the infinite state space can be reduced to a finite state space, while still preserving the value function to arbitrary precision for any horizon $t$.
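To make the dynamic program of Eq. (3) concrete, here is a hedged Python sketch in which each α is represented lazily as a closure over hyperstates (s, phi, psi). R, T_exp, O_exp and increment are the same hypothetical helpers as in the earlier sketches, and no pruning of dominated α-functions is attempted; this is an illustration of the recursion, not the paper's implementation.

import itertools

def initial_alphas(actions, R):
    """Gamma_1 of Eq. (3): one alpha per action, with alpha^a(s, phi, psi) = R(s, a)."""
    return [lambda s, phi, psi, a=a: R(s, a) for a in actions]

def backup(Gamma_prev, actions, observations, n_states, R, T_exp, O_exp, increment, gamma):
    """One step of the dynamic program in Eq. (3): builds Gamma_t from Gamma_{t-1}."""
    def alpha_az(a, z, alpha_i):
        # alpha_i^{a,z}(s, phi, psi) of Eq. (3)
        def f(s, phi, psi):
            return gamma * sum(
                T_exp(phi, s, a, sn) * O_exp(psi, sn, a, z)
                * alpha_i(sn, increment(phi, (s, a, sn)), increment(psi, (sn, a, z)))
                for sn in range(n_states))
        return f

    Gamma_t = []
    for a in actions:
        # One candidate set Gamma_t^{a,z} per observation z.
        Gamma_az = {z: [alpha_az(a, z, ai) for ai in Gamma_prev] for z in observations}
        # Cross-sum: choose one alpha^{a,z} per observation and add them to alpha^a.
        for combo in itertools.product(*(Gamma_az[z] for z in observations)):
            Gamma_t.append(
                lambda s, phi, psi, a=a, combo=combo:
                    R(s, a) + sum(g(s, phi, psi) for g in combo))
    return Gamma_t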

4 Approximating the BAPOMDP: Theory and Algorithms


Solving a BAPOMDP exactly for all belief states is impossible in practice due to the dimensionality of the state space (in particular, the fact that the count vectors can grow unbounded). We now show how we can reduce this infinite state space to a finite state space. This allows us to compute an ε-optimal value function over the resulting finite-dimensional belief space using standard POMDP techniques. Various methods for belief tracking in the infinite model are also presented.

4.1 Approximate Finite Model

We first present an upper bound on the value difference between two states that differ only by their model estimates $\phi$ and $\psi$. This bound uses the following definitions: given $\phi, \phi' \in T$ and $\psi, \psi' \in O$, define $D^{sa}_S(\phi, \phi') = \sum_{s' \in S} \left| T^{sas'}_{\phi} - T^{sas'}_{\phi'} \right|$ and $D^{sa}_Z(\psi, \psi') = \sum_{z \in Z} \left| O^{saz}_{\psi} - O^{saz}_{\psi'} \right|$, and $N^{sa}_{\phi} = \sum_{s' \in S} \phi^a_{ss'}$ and $N^{sa}_{\psi} = \sum_{z \in Z} \psi^a_{sz}$.
Theorem 4.1. Given any $\phi, \phi' \in T$, $\psi, \psi' \in O$, and $\gamma \in (0, 1)$, then for all $t$:
$$\sup_{\alpha_t \in \Gamma_t, s \in S} |\alpha_t(s, \phi, \psi) - \alpha_t(s, \phi', \psi')| \le \frac{2\gamma \|R\|_\infty}{(1-\gamma)^2} \sup_{s,s' \in S, a \in A} \left[ D^{sa}_S(\phi, \phi') + D^{s'a}_Z(\psi, \psi') + \frac{4}{\ln(\gamma^{-e})} \left( \frac{\sum_{s'' \in S} |\phi^a_{ss''} - \phi'^a_{ss''}|}{(N^{sa}_{\phi} + 1)(N^{sa}_{\phi'} + 1)} + \frac{\sum_{z \in Z} |\psi^a_{s'z} - \psi'^a_{s'z}|}{(N^{s'a}_{\psi} + 1)(N^{s'a}_{\psi'} + 1)} \right) \right]$$

Proof. Proof available in [11] finds a bound on a 1-step backup and solves the recurrence.

We now use this bound on the α-vector values to approximate the space of Dirichlet parameters within a finite subspace. We use the following definitions: given any $\epsilon > 0$, define $\epsilon' = \frac{\epsilon(1-\gamma)^2}{8\gamma\|R\|_\infty}$, $\epsilon'' = \frac{\epsilon(1-\gamma)^2 \ln(\gamma^{-e})}{32\gamma\|R\|_\infty}$, $N^{\epsilon}_S = \max\left( \frac{|S|(1+\epsilon')}{\epsilon'}, \frac{1}{\epsilon''} \right) - 1$ and $N^{\epsilon}_Z = \max\left( \frac{|Z|(1+\epsilon')}{\epsilon'}, \frac{1}{\epsilon''} \right) - 1$.
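As a sanity check on how these thresholds behave, here is a direct Python transcription of the definitions above (the numeric inputs in the example are arbitrary illustrations, not values taken from the paper):

import math

def finite_space_thresholds(eps, gamma, r_max, n_states, n_obs):
    """Compute eps', eps'', N_S^eps and N_Z^eps from the definitions above.

    r_max plays the role of ||R||_infinity.
    """
    eps1 = eps * (1 - gamma) ** 2 / (8 * gamma * r_max)                                   # eps'
    eps2 = eps * (1 - gamma) ** 2 * math.log(gamma ** (-math.e)) / (32 * gamma * r_max)   # eps''
    n_s = max(n_states * (1 + eps1) / eps1, 1 / eps2) - 1
    n_z = max(n_obs * (1 + eps1) / eps1, 1 / eps2) - 1
    return eps1, eps2, n_s, n_z

# Example (arbitrary numbers): a two-state, two-observation problem with eps = 1.
print(finite_space_thresholds(eps=1.0, gamma=0.75, r_max=100.0, n_states=2, n_obs=2))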

Theorem 4.2. Given any $\epsilon > 0$ and $(s, \phi, \psi) \in S'$ such that $\exists a \in A, s' \in S$ with $N^{s'a}_{\phi} > N^{\epsilon}_S$ or $N^{s'a}_{\psi} > N^{\epsilon}_Z$, then $\exists (s, \phi', \psi') \in S'$ such that $\forall a \in A, s' \in S$, $N^{s'a}_{\phi'} \le N^{\epsilon}_S$ and $N^{s'a}_{\psi'} \le N^{\epsilon}_Z$, and $|\alpha_t(s, \phi, \psi) - \alpha_t(s, \phi', \psi')| < \epsilon$ holds for all $t$ and $\alpha_t \in \Gamma_t$.

Proof. Proof available in [11].

Theorem 4.2 suggests that if we want a precision of $\epsilon$ on the value function, we just need to restrict the space of Dirichlet parameters to count vectors $\phi \in \tilde{T}_\epsilon = \{\phi \in \mathbb{N}^{|S|^2|A|} \mid \forall a \in A, s \in S,\ 0 < N^{sa}_{\phi} \le N^{\epsilon}_S\}$ and $\psi \in \tilde{O}_\epsilon = \{\psi \in \mathbb{N}^{|S||A||Z|} \mid \forall a \in A, s \in S,\ 0 < N^{sa}_{\psi} \le N^{\epsilon}_Z\}$. Since $\tilde{T}_\epsilon$ and $\tilde{O}_\epsilon$ are finite, we can define a finite approximate BAPOMDP as the tuple $(\tilde{S}_\epsilon, A, Z, \tilde{T}_\epsilon, \tilde{O}_\epsilon, \tilde{R}_\epsilon, \gamma)$, where $\tilde{S}_\epsilon = S \times \tilde{T}_\epsilon \times \tilde{O}_\epsilon$ is the finite state space. To define the transition and observation functions over that finite state space, we need to make sure that when the count vectors are incremented, they stay within the finite space. To achieve this, we define a projection operator $P_\epsilon : S' \to \tilde{S}_\epsilon$ that simply projects every state in $S'$ to its closest state in $\tilde{S}_\epsilon$.
Definition 4.1. Let $d : S' \times S' \to \mathbb{R}$ be defined such that:
$$d(s, \phi, \psi, s', \phi', \psi') = \begin{cases} \dfrac{2\gamma\|R\|_\infty}{(1-\gamma)^2} \sup\limits_{s,s' \in S, a \in A} \left[ D^{sa}_S(\phi, \phi') + D^{s'a}_Z(\psi, \psi') + \dfrac{4}{\ln(\gamma^{-e})}\left( \dfrac{\sum_{s'' \in S} |\phi^a_{ss''} - \phi'^a_{ss''}|}{(N^{sa}_{\phi} + 1)(N^{sa}_{\phi'} + 1)} + \dfrac{\sum_{z \in Z} |\psi^a_{s'z} - \psi'^a_{s'z}|}{(N^{s'a}_{\psi} + 1)(N^{s'a}_{\psi'} + 1)} \right) \right], & \text{if } s = s', \\[8pt] \dfrac{8\gamma\|R\|_\infty}{(1-\gamma)^2}\left(1 + \dfrac{4}{\ln(\gamma^{-e})}\right) + \dfrac{2\|R\|_\infty}{1-\gamma}, & \text{otherwise.} \end{cases}$$

Definition 4.2. Let $P_\epsilon : S' \to \tilde{S}_\epsilon$ be defined as $P_\epsilon(s) = \arg\min_{s' \in \tilde{S}_\epsilon} d(s, s')$.

The function $d$ uses the bound defined in Theorem 4.1 as a distance between states that differ only by their $\phi$ and $\psi$ vectors, and uses an upper bound on that value when the states differ. Thus $P_\epsilon$ always maps states $(s, \phi, \psi) \in S'$ to some state $(s, \phi', \psi') \in \tilde{S}_\epsilon$. Note that if $\sigma \in \tilde{S}_\epsilon$, then $P_\epsilon(\sigma) = \sigma$. Using $P_\epsilon$, the transition and observation functions are defined as follows:

$$\tilde{T}_\epsilon((s, \phi, \psi), a, (s', \phi', \psi')) = \begin{cases} T^{sas'}_{\phi}\, O^{s'az}_{\psi}, & \text{if } (s', \phi', \psi') = P_\epsilon(s', \phi + \delta^a_{ss'}, \psi + \delta^a_{s'z}), \\ 0, & \text{otherwise.} \end{cases} \qquad (4)$$

$$\tilde{O}_\epsilon((s, \phi, \psi), a, (s', \phi', \psi'), z) = \begin{cases} 1, & \text{if } (s', \phi', \psi') = P_\epsilon(s', \phi + \delta^a_{ss'}, \psi + \delta^a_{s'z}), \\ 0, & \text{otherwise.} \end{cases} \qquad (5)$$
These definitions are the same as those in the infinite BAPOMDP, except that we now add an extra projection to make sure that the incremented count vectors stay in $\tilde{S}_\epsilon$. Finally, the reward function $\tilde{R}_\epsilon : \tilde{S}_\epsilon \times A \to \mathbb{R}$ is defined as $\tilde{R}_\epsilon((s, \phi, \psi), a) = R(s, a)$.
Theorem 4.3 bounds the value difference between α-vectors computed with this finite model and α-vectors computed with the original model.
Theorem 4.3. Given any $\epsilon > 0$, $(s, \phi, \psi) \in S'$ and $\alpha_t \in \Gamma_t$ computed from the infinite BAPOMDP, let $\tilde{\alpha}_t$ be the α-vector representing the same conditional plan as $\alpha_t$ but computed with the finite BAPOMDP $(\tilde{S}_\epsilon, A, Z, \tilde{T}_\epsilon, \tilde{O}_\epsilon, \tilde{R}_\epsilon, \gamma)$; then $|\tilde{\alpha}_t(P_\epsilon(s, \phi, \psi)) - \alpha_t(s, \phi, \psi)| < \frac{\epsilon}{1-\gamma}$.

Proof. Proof available in [11]. Solves a recurrence over the 1-step approximation in Thm. 4.2.

Because the state space is now finite, solution methods from the literature on finite POMDPs can in principle be applied, including in particular the equations for $\tau(b, a, z)$ and $V^*(b)$ presented in Section 2. In practice, however, even though the state space is finite, it will generally be very large for small $\epsilon$, so that solving the finite model may still be intractable, even for small domains. We therefore favor a faster online solution approach, as described below.

4.2 Approximate Belief Monitoring

As shown in Theorem 3.1, the number of states with non-zero probability grows exponentially with the planning horizon, so exact belief monitoring can quickly become intractable. We now discuss different particle-based approximations that allow polynomial-time belief tracking.
Monte Carlo sampling: Monte Carlo sampling algorithms have been widely used for sequential
state estimation [12]. Given a prior belief b, followed by action a and observation z, the new belief
b′ is obtained by first sampling K states from the distribution b, then for each sampled s a new state
s′ is sampled from T (s, a, ·). Finally, the probability O(s′ , a, z) is added to b′ (s′ ) and the belief b′
is re-normalized. This will capture at most $K$ states with non-zero probabilities. In the context of BAPOMDPs, we use a slight variation of this method, where $(s, \phi, \psi)$ is first sampled from $b$, and then a next state $s' \in S$ is sampled from the normalized distribution $T^{sa\cdot}_{\phi} O^{\cdot az}_{\psi}$. The probability $1/K$ is then added directly to $b'(s', \phi + \delta^a_{ss'}, \psi + \delta^a_{s'z})$.
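A sketch of this Monte Carlo variant in Python, reusing the hypothetical dict-based belief representation and the T_exp, O_exp and increment helpers from the exact-update sketch above:

import random

def monte_carlo_belief_update(b, a, z, K, n_states, T_exp, O_exp, increment):
    """Particle approximation of the BAPOMDP belief update with K samples."""
    hyperstates = list(b.keys())
    weights = [b[h] for h in hyperstates]
    b_next = {}
    for _ in range(K):
        # Sample a hyperstate (s, phi, psi) from the current belief.
        s, phi, psi = random.choices(hyperstates, weights=weights, k=1)[0]
        # Sample s' proportionally to T_phi^{s a s'} O_psi^{s' a z}.
        probs = [T_exp(phi, s, a, sn) * O_exp(psi, sn, a, z) for sn in range(n_states)]
        if sum(probs) == 0.0:
            continue  # this particle is inconsistent with observation z
        s_next = random.choices(range(n_states), weights=probs, k=1)[0]
        key = (s_next, increment(phi, (s, a, s_next)), increment(psi, (s_next, a, z)))
        b_next[key] = b_next.get(key, 0.0) + 1.0 / K
    if not b_next:
        raise ValueError("No particle was consistent with the observation.")
    norm = sum(b_next.values())
    return {k: v / norm for k, v in b_next.items()}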
Most Probable: Alternatively, we can do the exact belief update at a given time step, but then only
keep the K most probable states in the new belief b′ and renormalize b′ .
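This approximation is a one-liner on top of the exact update sketched earlier (again using the same hypothetical dict representation):

def most_probable_belief_update(b, a, z, K, n_states, T_exp, O_exp, increment):
    """Exact update followed by truncation to the K most probable hyperstates."""
    b_exact = bapomdp_belief_update(b, a, z, n_states, T_exp, O_exp, increment)
    top = sorted(b_exact.items(), key=lambda kv: kv[1], reverse=True)[:K]
    norm = sum(p for _, p in top)
    return {h: p / norm for h, p in top}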
Weighted Distance Minimization: The two previous methods only try to approximate the distribution $\tau(b, a, z)$. In practice, however, what matters most is the agent's expected reward. Hence, instead of keeping the $K$ most likely states, we can keep $K$ states which best approximate the belief's value. As in the Most Probable method, we do an exact belief update, but here we fit the posterior distribution using a greedy K-means-style procedure, where the distance is defined as in Definition 4.1, weighted by the probability of the state being removed. See [11] for algorithmic details.

4.3 Online Planning

While the finite model presented in Section 4.1 can be used to find provably near-optimal policies
offline, this will likely be intractable in practice due to the very large state space required to ensure
good precision. Instead, we turn to online lookahead search algorithms, which have been proposed
for solving standard POMDPs [9]. Our approach simply performs dynamic programming over all the
beliefs reachable within some fixed finite planning horizon from the current belief. The action with
highest return over that finite horizon is executed and then planning is conducted again on the next
belief. To further limit the complexity of the online planning algorithm, we use the approximate belief monitoring methods detailed above. The overall complexity is in $O((|A||Z|)^D C_b)$, where $D$ is the planning horizon and $C_b$ is the complexity of updating the belief.
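A minimal recursive sketch of this D-step lookahead (our own illustration, not the paper's code): expected_reward(b, a) computes the expected immediate reward under belief b, and transition(b, a, z) returns a pair (b_next, p_z) with p_z = Pr(z | b, a); any of the approximate belief updates above can be plugged into transition.

def lookahead_value(b, depth, actions, observations, expected_reward, transition, gamma):
    """D-step lookahead: returns the best (value, action) from belief b."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        value = expected_reward(b, a)
        for z in observations:
            b_next, p_z = transition(b, a, z)
            if p_z == 0.0:
                continue
            future, _ = lookahead_value(b_next, depth - 1, actions, observations,
                                        expected_reward, transition, gamma)
            value += gamma * p_z * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action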

5 Empirical Results

We begin by evaluating the different belief approximations introduced above. To do so, we use a simple online d-step lookahead search, and compare the overall expected return and model accuracy in two different problems: the well-known Tiger domain [5] and a new domain called Follow. Given $T^{sas'}$ and $O^{s'az}$, the exact probabilities of the (unknown) POMDP, the model accuracy is measured in terms of the weighted sum of L1 distances, denoted WL1, between the exact model and the probable models in a belief state $b$:
$$\mathrm{WL1}(b) = \sum_{(s,\phi,\psi) \in S'_b} b(s, \phi, \psi)\, \mathrm{L1}(\phi, \psi)$$
$$\mathrm{L1}(\phi, \psi) = \sum_{a \in A} \sum_{s' \in S} \left[ \sum_{s \in S} \left| T^{sas'}_{\phi} - T^{sas'} \right| + \sum_{z \in Z} \left| O^{s'az}_{\psi} - O^{s'az} \right| \right] \qquad (6)$$
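A direct Python transcription of Eq. (6), reusing expected_transition and expected_observation from the earlier sketch and assuming the true model is given as arrays T_true[s, a, s'] and O_true[s', a, z] (again an illustrative layout of our own):

import numpy as np

def wl1(b, T_true, O_true):
    """Weighted L1 model error of Eq. (6) for a dict-based BAPOMDP belief b."""
    total = 0.0
    for (s, phi, psi), p in b.items():
        T_exp = expected_transition(np.asarray(phi, dtype=float))
        O_exp = expected_observation(np.asarray(psi, dtype=float))
        # Summing elementwise absolute differences over all (s, a, s') and (s', a, z)
        # is exactly the double sum of Eq. (6).
        l1 = np.abs(T_exp - T_true).sum() + np.abs(O_exp - O_true).sum()
        total += p * l1
    return total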

5.1 Tiger

In the Tiger problem [5], we consider the case where the transition and reward parameters are known, but the observation probabilities are not. Hence, there are four unknown parameters: $O_{Ll}$, $O_{Lr}$, $O_{Rl}$, $O_{Rr}$ (where $O_{Lr}$ stands for $\Pr(z = \text{hear right} \mid s = \text{tiger left}, a = \text{Listen})$). We define the observation count vector $\psi = (\psi_{Ll}, \psi_{Lr}, \psi_{Rl}, \psi_{Rr})$. We consider a prior of $\psi_0 = (5, 3, 3, 5)$, which specifies an expected sensor accuracy of $5/(5+3) = 62.5\%$ (instead of the correct 85%) in both states. Each simulation consists of 100 episodes. Episodes terminate when the agent opens a door, at which point the POMDP state (i.e. the tiger's position) is reset, but the distribution over count vectors is carried over to the next episode.
Figures 1 and 2 show how the average return and model accuracy evolve over the 100 episodes
(results are averaged over 1000 simulations), using an online 3-step lookahead search with varying
belief approximations and parameters. Returns obtained by planning directly with the prior and ex-
act model (without learning) are shown for comparison. Model accuracy is measured on the initial
belief of each episode. Figure 3 compares the average planning time per action taken by each ap-
proach. We observe from these figures that the results for the Most Probable and Weighted Distance
approximations are very similar and perform well even with few particles (lines are overlapping in
many places, making Weighted Distance results hard to see). On the other hand, the performance
of Monte Carlo is significantly affected by the number of particles, and it had to use many more particles (64) to obtain an improvement over the prior. This may be due to the sampling error that is introduced when using fewer samples.

[Figure 1: Return with different belief approximations. Figure 2: Model accuracy (WL1) with different belief approximations. Figure 3: Planning time per action (ms) with different belief approximations. The curves compare the exact model, the prior model, Most Probable (2), Monte Carlo (64), and Weighted Distance (2).]

5.2 Follow

We propose a new POMDP domain, called Follow, inspired by an interactive human-robot task. It
is often the case that such domains are particularly subject to parameter uncertainty (due to the difficulty of modelling human behavior); this environment thus motivates the utility of the Bayes-Adaptive POMDP in a very practical way. The goal of the Follow task is for a robot to continuously follow one
of two individuals in a 2D open area. The two subjects have different motion behavior, requiring the
robot to use a different policy for each. At every episode, the target person is selected randomly with
P r = 0.5 (and the other is not present). The person’s identity is not observable (except through their
motion). The state space has two features: a binary variable indicating which person is being fol-
lowed, and a position variable indicating the person’s position relative to the robot (5 × 5 square grid
with the robot always at the center). Initially, the robot and person are at the same position. Both the
robot and the person can perform five motion actions {NoAction, North, East, South, West}.
The person follows a fixed stochastic policy (stationary over space and time), but the parameters of
this behavior are unknown. The robot perceives observations indicating the person’s position rela-
tive to the robot: {Same, North, East, South, West, Unseen}. The robot perceives the correct observation with Pr = 0.8 and Unseen with Pr = 0.2. The reward is R = +1 if the robot and person are at the same position (central grid cell), R = 0 if the person is one cell away from the robot, and R = −1 if the person is two cells away. The task terminates if the person reaches a distance of 3 cells from the robot, which also incurs a reward of −20. We use a discount factor of 0.9.
When formulating the BAPOMDP, the robot's motion model (deterministic), the observation probabilities and the rewards are assumed to be known. We maintain a separate count vector for each person, representing the number of times they move in each direction, i.e. $\phi^1 = (\phi^1_{NA}, \phi^1_N, \phi^1_E, \phi^1_S, \phi^1_W)$ and $\phi^2 = (\phi^2_{NA}, \phi^2_N, \phi^2_E, \phi^2_S, \phi^2_W)$. We assume a prior $\phi^1_0 = (2, 3, 1, 2, 2)$ for person 1 and $\phi^2_0 = (2, 1, 3, 2, 2)$ for person 2, while in reality person 1 moves with probabilities $(0.3, 0.4, 0.2, 0.05, 0.05)$ and person 2 with probabilities $(0.1, 0.05, 0.8, 0.03, 0.02)$. We run 200
simulations, each consisting of 100 episodes (of at most 10 time steps). The count vectors’ distri-
butions are reset after every simulation, and the target person is reset after every episode. We use a
2-step lookahead search for planning in the BAPOMDP.
Figures 4 and 5 show how the average return and model accuracy evolve over the 100 episodes (aver-
aged over the 200 simulations) with different belief approximations. Figure 6 compares the planning
time taken by each approach. We observe from these figures that the results for the Weighted Distance approximation are much better, both in terms of return and model accuracy, even with fewer particles (16). Monte Carlo fails to provide any improvement over the prior model, which indicates it would require many more particles. Running Weighted Distance with 16 particles requires less time than both Monte Carlo and Most Probable with 64 particles, showing that it can be more time-efficient for the performance it provides in complex environments.

[Figure 4: Return with different belief approximations. Figure 5: Model accuracy (WL1) with different belief approximations. Figure 6: Planning time per action (ms) with different belief approximations. The curves compare the exact model, the prior model, Most Probable (64), Monte Carlo (64), and Weighted Distance (16).]

6 Conclusion
The objective of this paper was to propose a preliminary decision-theoretic framework for learning
and acting in POMDPs under parameter uncertainty. This raises a number of interesting challenges,
including (1) defining the appropriate model for POMDP parameter uncertainty, (2) approximating
this model while maintaining performance guarantees, (3) performing tractable belief updating, and
(4) planning action sequences which optimally trade-off exploration and exploitation.
We proposed a new model, the Bayes-Adaptive POMDP, and showed that it can be approximated to ε-precision by a finite POMDP. We provided practical approaches for belief tracking and online planning in this model, and validated these using two experimental domains. Results in the Follow problem showed that our approach is able to learn the motion patterns of two (simulated) individuals. This suggests interesting applications in human-robot interaction, where it is often essential that
we be able to reason and plan under parameter uncertainty.

Acknowledgments
This research was supported by the Natural Sciences and Engineering Research Council of Canada
(NSERC) and the Fonds Québécois de la Recherche sur la Nature et les Technologies (FQRNT).

References
[1] R. Dearden, N. Friedman, and N. Andre. Model-based Bayesian exploration. In UAI, 1999.
[2] M. Duff. Optimal Learning: Computational Procedure for Bayes-Adaptive Markov Decision Processes.
PhD thesis, University of Massachusetts, Amherst, USA, 2002.
[3] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proc. ICML, 2006.
[4] R. Jaulmes, J. Pineau, and D. Precup. Active learning in partially observable Markov decision processes. In ECML, 2005.
[5] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic
domains. Artificial Intelligence, 101:99–134, 1998.
[6] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: an anytime algorithm for POMDPs. In
IJCAI, pages 1025–1032, Acapulco, Mexico, 2003.
[7] M. Spaan and N. Vlassis. Perseus: randomized point-based value iteration for POMDPs. JAIR, 24:195–
220, 2005.
[8] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In UAI, Banff, Canada, 2004.
[9] S. Paquet, L. Tobin, and B. Chaib-draa. An online POMDP algorithm for complex multiagent environ-
ments. In AAMAS, 2005.
[10] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research (JAIR), 15:319–350, 2001.
[11] S. Ross, B. Chaib-draa, and J. Pineau. Bayes-Adaptive POMDPs. Technical Report SOCS-TR-2007.6, McGill University, 2007.
[12] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods In Practice. Springer, 2001.
