Model Based Bayesian Exploration
cost of doing a potentially suboptimal action. This measure is computed from probability distributions over the Q-values of actions.

In this paper, we show how to use the posterior distribution over possible models to estimate the distribution of possible Q-values, and then use these to select actions. This use of models allows us to avoid the problem faced by model-free exploration methods, such as the one used by Dearden et al., that need to perform repeated actions to propagate values from one state to another. The main question is how to estimate these Q-values from our distribution of possible models. We present several methods of stochastic sampling to approximate these Q-value distributions. We then evaluate the performance of the resulting Bayesian learning agents on test environments that are designed to fool many exploration methods.
In Section 2 we briefly review the definition of MDPs and the definition of reinforcement learning problems. In Section 3 we discuss a Bayesian approach for learning models. In Section 4 we review the notion of Q-value distributions and the use of value of information for directing exploration. In Section 5 we propose several sampling methods for estimating Q-value distributions based on the uncertainty about the underlying model. In Section 6 we discuss several approaches for generalizing from the samples we get from the aforementioned methods, and how this generalization can improve our algorithms. In Section 7 we compare our methods to Prioritized Sweeping (Moore & Atkeson 1993), a well known model-based reinforcement learning procedure.
2 Background date its approximation of the value function. Each value-
We assume the reader is familiar with the basic concepts of MDPs (see, e.g., (Kaelbling, Littman & Moore 1996)). We will use the following notation: an MDP is a 4-tuple, (S, A, p_T, p_R), where S is a set of states, A is a set of actions, p_T(s, a, t) is a transition model that captures the probability of reaching state t after we execute action a at state s, and p_R(s, a, r) is a reward model that captures the probability of receiving reward r after executing a at state s. For the remainder of this paper, we assume that possible rewards are a finite subset R of the real numbers.

In this paper, we focus on infinite-horizon MDPs with a discount factor γ. The agent's aim is to maximize the expected discounted total reward it receives. Equivalently, we can compute an optimal value function V* and a Q-function Q*. These functions satisfy the Bellman equations:

    V*(s) = max_{a ∈ A} Q*(s, a),

where

    Q*(s, a) = E_{p_R(s,a,r)}[r] + γ Σ_{s' ∈ S} p_T(s, a, s') V*(s').
If the agent has access to V* or Q*, it can optimize its expected reward by choosing the action a at s that maximizes Q*(s, a). Given a model, we can compute Q* using a variety of methods, including value iteration. In this method we repeatedly update an estimate Q of Q* by applying the Bellman equations to get new values of Q for some (or all) of the states.
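For concreteness, a minimal value-iteration sketch over a known tabular model might look as follows; the dictionary-based MDP layout, the tolerance, and the function name are our own illustrative choices, not details from the paper.

```python
# Minimal value-iteration sketch for a known tabular MDP.
# The data layout (dicts keyed by (s, a)) and names are illustrative choices.
import itertools

def value_iteration(states, actions, p_T, exp_R, gamma=0.95, tol=1e-6):
    """p_T[(s, a)] maps next states to probabilities; exp_R[(s, a)] is the
    expected immediate reward. Returns the Q-function as a dict."""
    V = {s: 0.0 for s in states}
    Q = {(s, a): 0.0 for s, a in itertools.product(states, actions)}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                Q[(s, a)] = exp_R[(s, a)] + gamma * sum(
                    p * V[t] for t, p in p_T[(s, a)].items())
            new_v = max(Q[(s, a)] for a in actions)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return Q
```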
Reinforcement learning procedures attempt to achieve an optimal policy when the agent does not know p_T and p_R. Since we do not know the dynamics of the underlying MDP, we cannot compute the Q-value function directly. However, we can estimate it. In model-free approaches one usually estimates Q by treating each step in the environment as a sample from the underlying dynamics. These samples are then used for performing updates of the Q-values based on the Bellman equations. In model-based reinforcement learning one usually directly estimates p_T(s, a, t) and p_R(s, a, r). The standard approach is then to act as though these approximations are correct, compute Q*, and use it to choose actions.
A standard problem in learning is balancing between planning (i.e., choosing a policy) and execution. Ideally, the agent would compute the optimal value function for its model of the environment each time it updates it. This scheme is unrealistic since finding the optimal policy for a given model is a non-trivial computational task. Fortunately, we can approximate this scheme if we notice that the approximate model changes only slightly at each step. We can hope that the value function from the previous model can be easily "repaired" to reflect these changes. This approach was pursued in the DYNA (Sutton 1990) framework, where after the execution of an action, the agent updates its model of the environment, and then performs some bounded number of value propagation steps to update its approximation of the value function. Each value-propagation step locally enforces the Bellman equation by setting V(s) ← max_a Q(s, a), where Q(s, a) = E_{p̂_R(s,a,r)}[r] + γ Σ_{s' ∈ S} p̂_T(s, a, s') V(s'), p̂_T(s, a, s') and p̂_R(s, a, r) are the agent's approximate model, and V is the agent's approximation of the value function.

This raises the question of which states should be updated. Prioritized Sweeping (Moore & Atkeson 1993) is a method that estimates to what extent states would change their value as a consequence of new knowledge of the MDP dynamics or of previous value propagations. States are assigned priorities based on the expected size of changes in their values, and states with the highest priority are the ones for which we perform value propagation.
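A rough sketch of such a prioritized-sweeping-style update loop appears below; the priority threshold, the predecessor bookkeeping, and the data structures are illustrative assumptions, not details taken from Moore & Atkeson's algorithm.

```python
# Rough prioritized-sweeping sketch: propagate value changes backwards,
# processing states in order of how much their value is expected to change.
# Data layout and the priority threshold are illustrative assumptions.
import heapq

def one_step_backup(s, V, actions, p_T, exp_R, gamma):
    """Return the Bellman-backed-up Q-values and state value for s."""
    q = {a: exp_R[(s, a)] + gamma * sum(p * V[t] for t, p in p_T[(s, a)].items())
         for a in actions}
    return q, max(q.values())

def prioritized_sweeping(V, Q, actions, p_T, exp_R, predecessors, changed_state,
                         gamma=0.95, max_updates=50, theta=1e-4):
    """Propagate the effect of a model change at changed_state backwards.
    predecessors[t] is the set of states with some action leading to t."""
    queue = [(0.0, changed_state)]            # (negative priority, state)
    for _ in range(max_updates):
        if not queue:
            break
        _, s = heapq.heappop(queue)
        q, v = one_step_backup(s, V, actions, p_T, exp_R, gamma)
        for a in actions:
            Q[(s, a)] = q[a]
        V[s] = v
        # States that can reach s may now be out of date; queue them with
        # priority equal to the size of their potential value change.
        for pred in predecessors[s]:
            _, new_v = one_step_backup(pred, V, actions, p_T, exp_R, gamma)
            change = abs(new_v - V[pred])
            if change > theta:
                heapq.heappush(queue, (-change, pred))
    return V, Q
```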
3 Bayesian Model Learning

In this section we describe how to maintain a Bayesian posterior distribution over MDPs given our experiences in the environment. At each step in the environment, we start at state s, choose an action a, and then observe a new state t and a reward r. We summarize our experience by a sequence of experience tuples (s, a, r, t).

A Bayesian approach to this learning problem is to maintain a belief state over the possible MDPs. Thus, a belief state μ defines a probability density P(M | μ). Given an experience tuple (s, a, r, t) we can compute the posterior belief state, which we denote μ ∘ (s, a, r, t), by Bayes rule:

    P(M | μ ∘ (s, a, r, t)) ∝ P((s, a, r, t) | M) P(M | μ)
                            = p_R(s, a, r | M) p_T(s, a, t | M) P(M | μ).

Thus, the Bayesian approach starts with some prior probability distribution over all possible MDPs (we assume that the sets of possible states, actions and rewards are delimited in advance). As we gain experience, the approach focuses the mass of the posterior distribution on those MDPs in which the observed experience tuples are most probable.

An immediate question is whether we can represent these prior and posterior distributions over an infinite number of MDPs. We show that this is possible by adopting results from Bayesian learning of probabilistic models, such as Bayesian networks (Heckerman 1998). Under carefully chosen assumptions, we can represent such priors and posteriors in any of several compact manners. We discuss one such choice below.
To formally represent our problem, we consider the parameterization of MDPs. The simplest parameterization is table based, where there are parameters θ^T_{s,a,t} and θ^R_{s,a,r} for the transition and reward models. Thus, for each choice of s and a, the parameters θ^T_{s,a} = {θ^T_{s,a,t} : t ∈ S} define a distribution over possible states, and the parameters θ^R_{s,a} = {θ^R_{s,a,r} : r ∈ R} define a distribution over possible rewards. (The methods we describe are easily extended to other parameterizations. In particular, we can consider continuous distributions, e.g., Gaussians, over rewards. For clarity of discussion, we focus on multinomial distributions throughout the paper.)

We say that our prior satisfies parameter independence if it has the product form:

    P(θ | μ) = Π_{s,a} P(θ^T_{s,a} | μ) P(θ^R_{s,a} | μ).    (1)

Thus, the prior distribution over the parameters of each local probability term in the MDP is independent of the prior over the others. It turns out that this form is maintained as we incorporate evidence at each stage in the learning:

Proposition 3.1: If the belief state P(θ | μ) satisfies parameter independence, then P(θ | μ ∘ (s, a, r, t)) also satisfies parameter independence.

As a consequence, the posterior after we incorporate an arbitrarily long sequence of experience tuples also has the product form of (1).

Parameter independence allows us to reformulate the learning problem as a collection of unrelated local learning problems. In each of these, we have to estimate a probability distribution over all states or all rewards. The question is how to learn these distributions. We can use well-known Bayesian methods for learning standard distributions such as multinomials or Gaussian distributions (Degroot 1986).
For the case of discrete multinomials, which we have assumed in our transition and reward models, we can use Dirichlet priors to represent Pr(θ^T_{s,a}) and Pr(θ^R_{s,a}). These priors are conjugate, and thus the posterior after each observed experience tuple will also be a Dirichlet distribution. In addition, Dirichlet distributions can be described using a small number of hyper-parameters. See Appendix A for a review of Dirichlet priors and their properties.

In the case of most MDPs studied in reinforcement learning, we expect the transition model to be sparse: there are only a few states that can result from a particular action at a particular state. Unfortunately, if the state space is large, learning with a Dirichlet prior can require many examples to recognize that most possible states are highly unlikely. This problem is addressed by a recent method of learning sparse-multinomial priors (Friedman & Singer 1999). Without going into details, sparse-multinomial priors have the same general properties as Dirichlet priors, but assume that the observed outcomes come from some small subset of the set of possible ones. The sparse Dirichlet priors make predictions as though only the observed outcomes are possible, except that they also assign some probability to novel outcomes. In the MDP setting, a novel outcome is a transition to a state t that was not reached from s previously by executing a. See Appendix A for a brief summary of sparse-multinomial priors and their properties.

For both the Dirichlet and its sparse-multinomial extension, we need to maintain the number of times, N(s, a, t), that state t is observed after executing action a at state s, and similarly, N(s, a, r) for rewards. With the prior distributions over the parameters of the MDP, these counts define a posterior distribution over MDPs. This representation allows us both to predict the probability of the next transition and reward, and also to compute the probability of every possible MDP and to sample from the distribution of MDPs.
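For concreteness, the count-based bookkeeping described above can be sketched as follows for a single (s, a) pair. This is a minimal illustration assuming a plain Dirichlet prior with a fixed hyper-parameter α; the class name and data layout are our own, not the paper's.

```python
# Minimal Dirichlet-posterior sketch for one (s, a) pair.
# Keeps transition counts, predicts the next state, and samples a
# multinomial from the posterior; alpha is an assumed fixed hyper-parameter.
import numpy as np

class DirichletTransitionModel:
    def __init__(self, states, alpha=1.0):
        self.states = list(states)
        self.alpha = alpha
        self.counts = {t: 0 for t in self.states}   # N(s, a, t)

    def update(self, t):
        """Incorporate one observed transition to state t."""
        self.counts[t] += 1

    def predictive(self):
        """Posterior-mean prediction: (alpha + N_t) / (L*alpha + N)."""
        totals = np.array([self.alpha + self.counts[t] for t in self.states],
                          dtype=float)
        return dict(zip(self.states, totals / totals.sum()))

    def sample(self, rng=None):
        """Draw one transition distribution from the Dirichlet posterior."""
        rng = rng or np.random.default_rng()
        theta = rng.dirichlet(
            [self.alpha + self.counts[t] for t in self.states])
        return dict(zip(self.states, theta))
```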
To summarize, we assumed parameter independence, and that for each prior in (1) we have either a Dirichlet or a sparse-multinomial prior. The consequence is that the posterior can be represented compactly. This enables us to estimate a distribution over MDPs at each stage.

It is easy to extend this discussion to more compact parameterizations of the transition and reward models. For example, if each state is described by several attributes, we might use a Bayesian network to capture the MDP dynamics. Such a structure requires fewer parameters and thus we can learn it with fewer examples. Nonetheless, much of the above discussion and the conclusions about parameter independence and Dirichlet priors apply to these models (Heckerman 1998).

Standard model-based learning methods maintain a point estimate of the model. These point estimates are often close to the mean prediction of the Bayesian method. However, these point estimates do not capture the uncertainty about the model. In this paper, we examine how knowledge of this uncertainty can be exploited to improve exploration.
When we observe an experience tuple (s, a, r, t), we only change the posterior over θ^T_{s,a} and θ^R_{s,a}. Thus, instead of re-weighting the sample M^i, we can update, or repair, it by re-sampling θ^T_{s,a} and θ^R_{s,a}. If the original sample M^i was sampled from Pr(M | μ), then it easily follows that the repaired M^i is sampled from Pr(M | μ ∘ (s, a, r, t)).
Of course, once we modify M^i, its Q-value function changes. However, all of these changes are consequences of the new values of the dynamics at (s, a). Thus, we can use prioritized sweeping to update the Q-value computed for M^i. This sweeping performs several Bellman updates to correct the values of states that are affected by the change in the model.
This suggests the following algorithm. Initially, we sample k MDPs from our prior belief state. At each step we:

- Observe an experience tuple (s, a, r, t).
- Update Pr(θ^T_{s,a}) by t, and Pr(θ^R_{s,a}) by r.
- For each i = 1, ..., k, sample θ^{T,i}_{s,a} and θ^{R,i}_{s,a} from the new Pr(θ^T_{s,a}) and Pr(θ^R_{s,a}), respectively.
- For each i = 1, ..., k, run a local instantiation of prioritized sweeping to update the Q-value function of M^i (see the sketch after this list).
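A rough sketch of this per-step loop, reusing the DirichletTransitionModel sketch above and assuming a hypothetical reward-model counterpart and per-MDP prioritized-sweeping routine, might look as follows; it illustrates the bookkeeping and is not the authors' implementation.

```python
# Illustrative per-step loop for the repair-based approach: update the
# posterior, re-sample the local parameters of each sampled MDP M^i, and
# repair each MDP's Q-values with a local prioritized-sweeping pass.
# `posterior`, `sampled_mdps`, and `prioritized_sweeping` are assumed helpers.

def bayesian_step(posterior, sampled_mdps, s, a, r, t, k):
    # 1. Observe the experience tuple and update the belief state.
    posterior.transition[(s, a)].update(t)
    posterior.reward[(s, a)].update(r)

    # 2. Repair each sampled MDP by re-sampling only the changed parameters.
    for i in range(k):
        mdp = sampled_mdps[i]
        mdp.p_T[(s, a)] = posterior.transition[(s, a)].sample()
        mdp.exp_R[(s, a)] = posterior.reward[(s, a)].expected_value()

        # 3. Locally propagate the change with prioritized sweeping.
        mdp.V, mdp.Q = prioritized_sweeping(
            mdp.V, mdp.Q, mdp.actions, mdp.p_T, mdp.exp_R,
            mdp.predecessors, changed_state=s)

    # The k values Q^i(s, a) now form a sample from the q_{s,a} distribution,
    # which can be used for VPI-based action selection.
    return [sampled_mdps[i].Q for i in range(k)]
```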
Thus, our approach is quite similar to standard model-based learning with prioritized sweeping, but instead of running one instantiation of prioritized sweeping, we run k instantiations in parallel, one for each sampled MDP. The repair to the sampled MDPs ensures that they constitute a sample from the current belief state, and the local instantiations of prioritized sweeping ensure that the Q-values computed in each of these MDPs are a good approximation to the true values.

As with the other approaches we have described, after we invoke the k prioritized sweeping instances we use the k samples from each q_{s,a} to select the next actions using VPI computations.
Figure 1 shows a single run of learning where the actions selected were fixed and each of the three methods was used to estimate the Q-values of a state. Initially the means and variances are very high, but as the agent gains more experience, the means converge on the true value of the state, and the variances tend towards zero. These results suggest that the repair and importance sampling approaches both provide reasonable approximations to naive global sampling.

5.4 Local Sampling

Until now we have considered using global samples of MDPs. An alternative approach is to try to maintain for each (s, a) an estimate of the Q-value distribution, and to update these estimates locally.
In our current setting, the terms q_{s',a'} are random variables that depend on our current estimate of the Q-value distributions. The probabilities p_T(s, a, s') are also random variables that depend on our posterior over θ^T_{s,a}, and finally the expected reward E_{p_R(s,a,r)}[r] is also a random variable that depends on the posterior over θ^R_{s,a}. Thus, we can sample from q_{s,a} by jointly sampling from all of these distributions, i.e., q_{s',a'} for all states, p_T(s, a, s') and p_R(s, a, r), and then computing the Q-value. If we repeat this sampling step k times, we get k samples from a single Bellman iteration for q_{s,a}.

Starting with our beliefs about the model and about the Q-value distribution of all states, we can sample from the distribution of q_{s,a}. To make this procedure manageable, we assume that we can sample from each q_{s',a'} independently. This assumption does not hold in general MDPs, since the distributions of different Q-values are correlated (by the Bellman equation). However, we might hope that the exponential decay will weaken these dependencies.

We are now left with the question of how to use the k samples from q_{s,a}. The simplest approach is to use the samples as a representation of our approximation of the distribution of q_{s,a}. We can compute the mean and VPI from a set of samples, as we did in the global sampling approach. Similarly, we can re-sample from this representation by randomly choosing one of the points. This results in a method that is similar to recent sampling methods that have been used successfully in monitoring complex dynamic processes (Kanazawa, Koller & Russell 1995).

This gives us a method for performing a Bellman update on our Q-value distributions. To get a good estimate of these distributions we need to repeat these updates. Here we can use a prioritized-sweeping-like algorithm that performs updates based on an estimate of which Q-value distribution can be most affected by the updates to other Q-value distributions.
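One such sampled Bellman backup might be sketched as follows; the per-(s, a) sample stores and helper names are illustrative assumptions, not the paper's code.

```python
# Sketch of one sampled Bellman backup for the local-sampling approach:
# draw the local dynamics from the posterior and draw successor Q-values
# from their current sample-based distributions, k times.
import random

def sample_q_backup(s, a, posterior, q_samples, actions, gamma=0.95, k=20):
    """q_samples[(s', a')] is a list of samples representing the current
    approximation of the distribution of q_{s',a'}."""
    new_samples = []
    for _ in range(k):
        p_T = posterior.transition[(s, a)].sample()      # sampled dynamics
        exp_r = posterior.reward[(s, a)].sample_mean()   # sampled exp. reward
        total = exp_r
        for s_next, prob in p_T.items():
            # Sample each successor Q-value independently (an approximation:
            # the true Q-value distributions are correlated).
            q_next = max(random.choice(q_samples[(s_next, a_next)])
                         for a_next in actions)
            total += gamma * prob * q_next
        new_samples.append(total)
    return new_samples
```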
Figure 2: Samples, Gaussian approximation, and kernel estimates of a Q-value distribution after 100, 300, and 700 steps of naive global sampling on the same run as Figure 1.
6 Generalization and Smoothing

In the approaches described above we generated samples from the Q-value distributions, and effectively used a collection of points to represent the approximation to the Q-value distribution. A possible problem with this representation approach is that we use a fairly simplistic representation to describe a complex distribution. This suggests that we should generalize from the k samples by using standard generalization methods.

One simple approach is to fit a Gaussian to the samples; this requires only the first two moments of the sample, and allows simple generalization. Unfortunately, because of the max() terms in the Bellman equations, we expect the Q-value distribution to be skewed in the positive direction. If this skew is strong, then fitting a Gaussian would be a poor generalization from the sample.

At the other end of the spectrum are non-parametric approaches. One of the simplest is kernel estimation (see for example (Bishop 1995)). In this approach, we approximate the distribution over Q(s, a) by a sum of Gaussians with a fixed variance, one for each sample. This approach can be effective if we are careful in choosing the variance parameter. Too small a variance will lead to a spiky distribution; too large a variance will lead to an overly smooth and flat distribution. We use a simple rule for estimating the kernel width as a function of the mean (squared) distance between points. The rule is motivated by a leave-one-out cross-validation estimate of the kernel width: let q^1, ..., q^k be the k samples, and let f(q^i | q^j, σ) be the Gaussian pdf with mean q^j and variance σ². We want the kernel width σ that maximizes the leave-one-out likelihood of the samples, in which each q^i is evaluated under the kernels placed on the remaining samples q^j, j ≠ i. Using Jensen's inequality, this objective can be bounded from below by the sum of individual log-likelihoods Σ_i Σ_{j≠i} log f(q^i | q^j, σ²), which is easier to maximize.

Proposition 6.1: The value of σ² that maximizes Σ_i Σ_{j≠i} log f(q^i | q^j, σ²) is the average squared distance among samples, d̄ = (1 / (k(k-1))) Σ_i Σ_{j≠i} (q^i - q^j)².
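A small sketch of this kernel-width rule and the resulting kernel estimate (a sum of fixed-width Gaussians, one per sample) is given below; the function names are ours, and the bandwidth follows the average-squared-distance choice described above.

```python
# Kernel (Parzen) estimate of a Q-value distribution from k samples, with
# the bandwidth set from the average squared pairwise distance, as above.
import math

def kernel_width(samples):
    """sigma^2 = average squared distance between distinct sample pairs."""
    k = len(samples)
    sq = sum((x - y) ** 2 for i, x in enumerate(samples)
             for j, y in enumerate(samples) if i != j)
    return math.sqrt(sq / (k * (k - 1)))

def kernel_density(samples, sigma=None):
    """Return a function estimating the pdf of q_{s,a} from the samples."""
    sigma = sigma or kernel_width(samples)
    k = len(samples)

    def pdf(x):
        return sum(math.exp(-(x - q) ** 2 / (2 * sigma ** 2))
                   / (sigma * math.sqrt(2 * math.pi)) for q in samples) / k
    return pdf
```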
Of course, there are many other generalization methods we might consider using here, such as mixture distributions. However, these two approaches provide us with initial ideas on the effect of smoothing in this context.

We must also compute the VPI of a set of generalized distributions made up of Gaussians or kernel estimates. This is simply a matter of solving the integral given in Equation 2, where Pr(q_{s,a} = x) is computed from the generalized probability distribution for state s and action a. This integration can be simplified to a term whose main cost is an evaluation of the cdf of a Gaussian distribution (e.g., see (Russell & Wefald 1991)). This function is implemented in most language libraries (e.g., using the erf() function in the C library), and thus the computation can be done quite efficiently.
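To illustrate why only a Gaussian cdf is needed, here is a sketch of the VPI computation when q_{s,a} is approximated by a single Gaussian. It assumes the value-of-perfect-information gain of Dearden et al. (1998): learning the current best action's Q-value is useful when it falls below the second-best mean, and learning a non-best action's Q-value is useful when it exceeds the best mean. The closed form below is the standard expression for these truncated expectations and is our reconstruction, not a formula quoted from the paper.

```python
# VPI of action a at state s when q_{s,a} is approximated by N(mu, sigma^2).
# Uses the gain definition from Dearden et al. (1998); the closed form is
# E[max(0, t - X)] (or E[max(0, X - t)]) for Gaussian X, which needs only
# the Gaussian cdf and pdf.
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gaussian_vpi(mu, sigma, best_mean, second_best_mean, is_best_action):
    if is_best_action:
        # Gain only if this (currently best) action turns out worse than
        # the second best: E[max(0, second_best_mean - X)].
        t = second_best_mean
        z = (t - mu) / sigma
        return (t - mu) * normal_cdf(z) + sigma * normal_pdf(z)
    else:
        # Gain only if this action turns out better than the current best:
        # E[max(0, X - best_mean)].
        t = best_mean
        z = (mu - t) / sigma
        return (mu - t) * normal_cdf(z) + sigma * normal_pdf(z)
```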
Figure 2 shows the effects of Gaussian approximation and kernel estimation smoothing (using the computed kernel width) on the sample values used to generate the Q-value distributions in Figure 1 for three different time steps. Early in the run Gaussian approximation produces a very poor approximation because the samples are quite widely spread and very skewed, while kernel estimation provides a much better approximation to the observed distribution. For this reason, we expect kernel estimation to perform better than Gaussian approximation for computing VPI.

7 Experimental Results

Figure 3 shows two domains of the type on which we have tested our algorithms. Each is a maze in which the agent begins at the point marked S and must collect the flag F and deliver it to the goal G. The agent receives a reward of 1 for each flag it collects and then moves to the goal state, and the problem is then reset. If the agent enters the square marked T (a trap) it receives a reward of -10. Each action (up, down, left, right) succeeds with probability 0.9 if that direction is clear, and with probability 0.1 moves the agent perpendicular to the desired direction. The "trap" domain has 18 states, the "maze" domain 56.

We evaluate the algorithms by computing the average (over 10 runs) future discounted reward received by the agent. We use this measure rather than the value of the learned policy because exploratory agents rarely actually follow either the greedy policy they have discovered or their current exploration policy for very long. For comparison we use prioritized sweeping (Moore & Atkeson 1993) with the T_bored parameter optimized for each problem.

Figure 4 shows the performance of a representative sample of our algorithms on the trap domain. Unless they are based on a very small number of samples, all of the Bayesian exploration methods outperform prioritized sweeping.
Figure 5: Comparison of Q-value estimation techniques on the larger maze domain.

Figure 6: The effects of smoothing techniques on performance in the large maze domain.

8 Discussion

This paper makes two main contributions. First, we show how to maintain Bayesian belief states about MDPs. We show that this can be done in a simple manner by using ideas that appear in Bayesian learning of probabilistic models. Second, we discuss how to use the Bayesian belief state to choose actions in a way that balances exploration and exploitation. We adapt the value of information approach of Dearden et al. (1998) to this model-based setup and show how to approximate the Q-value distributions needed for making these choices.

A recent approach to exploration that is related to our work is that of Kearns and Singh (1998). Their approach divides the set of states into two groups. The known states are ones for which the learner is quite confident about the transition probabilities. That is, the learner believes that its estimate of the transition probabilities is close enough to the true distribution. All other states are considered unknown states. In Kearns and Singh's proposal, the learner constructs a policy over the known states. This policy takes into account both exploitation and the possibility of finding better rewards in unknown states (which are considered as highly rewarding). When it finds itself in an unknown state, the agent chooses actions randomly. The algorithm proceeds in phases; after each one it reclassifies the states and recomputes the policy on the known states. Kearns and Singh's proposal is significant in that it is the first one for which near-optimal performance can be guaranteed in polynomial time.
A Dirichlet and Sparse-Multinomial Priors

Let X be a random variable that can take L possible values from a set Σ. Without loss of generality, let Σ = {1, ..., L}. We are given a training set D that contains the outcomes of N independent draws x^1, ..., x^N of X from an unknown multinomial distribution P*. The multinomial estimation problem is to find a good approximation for P*.

This problem can be stated as the problem of predicting the outcome x^{N+1} given x^1, ..., x^N. Given a prior distribution over the possible multinomial distributions, the Bayesian estimate averages the predictions of all multinomials, weighted by their posterior probability. The standard choice of prior is the Dirichlet prior, P(θ) ∝ Π_i θ_i^{α_i - 1}, with hyper-parameters α_1, ..., α_L. Letting N_i denote the number of occurrences of symbol i in D, the Dirichlet prediction is

    P(x^{N+1} = i | D) = (N_i + α_i) / (N + Σ_j α_j),    (4)

and the posterior P(θ | D) is again a Dirichlet distribution with hyper-parameters α_i + N_i. Procedures for sampling from these distributions can be found in (Ripley 1987).

Friedman and Singer (1999) introduce a structured prior that captures our uncertainty about the set of "feasible" values of X. Define a random variable V that takes values from the set 2^Σ of possible subsets of Σ. The intended semantics for this variable is that if we know the value of V, then θ_i > 0 iff i ∈ V.

Clearly, the hypothesis V = Σ' (for Σ' ⊆ Σ) is consistent with the training data only if Σ' contains all the indices i for which N_i > 0. We denote by Σ^o the set of observed symbols; that is, Σ^o = {i : N_i > 0}, and we let k^o = |Σ^o|.
Suppose we know the value of V. Given this assumption, we can define a Dirichlet prior over possible multinomial distributions θ if we use the same hyper-parameter α for each symbol in V. Formally, we define the prior:

    P(θ | V) ∝ Π_{i ∈ V} θ_i^{α - 1}    (with Σ_i θ_i = 1 and θ_i = 0 for all i ∉ V).

Using Eq. (4), we have that:

    P(x^{N+1} = i | D, V) = (N_i + α) / (N + |V| α)  if i ∈ V,  and 0 otherwise.    (6)

Now consider the case where we are uncertain about the actual set of feasible outcomes. We construct a two-tiered prior over the values of V. We start with a prior over the size of V, and assume that all sets of the same cardinality have the same prior probability. We let the random variable S denote the cardinality of V. We assume that we are given a distribution P(S = k) for k = 1, ..., L. We define the prior over sets to be P(V | S = k) = (L choose k)^{-1}. This prior is a sparse-multinomial with parameters α and Pr(S = k).
Friedman and Singer show how we can efficiently predict using this prior.

Theorem A.1 (Friedman & Singer 1999): Given a sparse-multinomial prior, the probability of the next symbol is

    P(x^{N+1} = i | D) = C(D, L) (N_i + α) / (N + k^o α)    if i ∈ Σ^o,
    P(x^{N+1} = i | D) = (1 - C(D, L)) / (L - k^o)          if i ∉ Σ^o,

where

    C(D, L) = Σ_{k ≥ k^o} P(S = k | D) (N + k^o α) / (N + k α).

Moreover,

    P(S = k | D) = m_k / Σ_{k' ≥ k^o} m_{k'},

where

    m_k = P(S = k) · k! / (k - k^o)! · Γ(kα) / Γ(kα + N)

and Γ(x) = ∫_0^∞ t^{x-1} e^{-t} dt is the gamma function.

We can think of C(D, L) as a scaling factor that we apply to the Dirichlet prediction that assumes that we have seen all of the feasible symbols. The quantity 1 - C(D, L) is the probability mass assigned to novel (i.e., unseen) outcomes.
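A direct transcription of this predictive rule might look like the following sketch; the prior over S and the hyper-parameter α are assumed inputs, and lgamma is used for numerical stability.

```python
# Sketch of the sparse-multinomial predictive distribution of Theorem A.1.
# counts[i] = N_i for the observed symbols; L is the total number of symbols;
# prior_S(k) is the assumed prior P(S = k); alpha is the Dirichlet parameter.
import math

def sparse_multinomial_predict(counts, L, alpha, prior_S):
    N = sum(counts.values())
    k_obs = len([i for i, n in counts.items() if n > 0])

    # m_k = P(S=k) * k!/(k-k_obs)! * Gamma(k*alpha)/Gamma(k*alpha + N)
    def log_m(k):
        return (math.log(prior_S(k))
                + math.lgamma(k + 1) - math.lgamma(k - k_obs + 1)
                + math.lgamma(k * alpha) - math.lgamma(k * alpha + N))

    log_ms = {k: log_m(k) for k in range(max(k_obs, 1), L + 1)}
    norm = max(log_ms.values())
    weights = {k: math.exp(v - norm) for k, v in log_ms.items()}
    total = sum(weights.values())
    post_S = {k: w / total for k, w in weights.items()}      # P(S=k | D)

    # Scaling factor C(D, L) and the resulting predictions.
    C = sum(post_S[k] * (N + k_obs * alpha) / (N + k * alpha) for k in post_S)
    pred = {}
    for i in range(1, L + 1):
        if counts.get(i, 0) > 0:
            pred[i] = C * (counts[i] + alpha) / (N + k_obs * alpha)
        else:
            pred[i] = (1.0 - C) / (L - k_obs)
    return pred
```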
In some of the methods discussed above we need to sample a parameter vector from a sparse-multinomial prior. Probable parameter vectors according to such a prior are sparse, i.e., contain few non-zero entries. The choice of the non-zero entries among the outcomes that were not observed is done with uniform probability. This presents a complication, since each sample will depend on some unobserved states. To "smooth" this behaviour we sample from the distribution over V combined with the novel event. We sample a value of k from P(S = k | D). We then sample from a Dirichlet distribution of dimension k where the first k^o elements are assigned hyper-parameters α + N_i, and the rest are assigned hyper-parameter α. The sampled vector of probabilities describes the probability of outcomes in Σ^o and of additional k - k^o events. We combine these latter probabilities to be the probability of the novel event.
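The sampling procedure just described might be sketched as follows, assuming a posterior P(S = k | D) computed as in the previous sketch; again this is an illustrative reconstruction rather than the authors' code.

```python
# Sketch of sampling a transition distribution from a sparse-multinomial
# posterior: draw k ~ P(S=k | D), draw a k-dimensional Dirichlet, and lump
# the mass of the k - k_obs unobserved dimensions into a single novel event.
import numpy as np

def sample_sparse_multinomial(counts, post_S, alpha, rng=None):
    """counts maps observed symbols to N_i > 0; post_S maps k to P(S=k | D)."""
    rng = rng or np.random.default_rng()
    observed = list(counts.keys())
    k_obs = len(observed)

    ks = list(post_S.keys())
    k = rng.choice(ks, p=[post_S[kk] for kk in ks])

    hyper = [alpha + counts[i] for i in observed] + [alpha] * (k - k_obs)
    theta = rng.dirichlet(hyper)

    probs = dict(zip(observed, theta[:k_obs]))
    probs["<novel>"] = float(theta[k_obs:].sum())   # mass of unseen outcomes
    return probs
```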
References

Andre, D., Friedman, N. & Parr, R. (1997), Generalized prioritized sweeping, in 'Advances in Neural Information Processing Systems', Vol. 10.

Bishop, C. M. (1995), Neural Networks for Pattern Recognition, Oxford University Press, Oxford.

Dearden, R., Friedman, N. & Russell, S. (1998), Bayesian Q-learning, in 'Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98)'.

Degroot, M. H. (1986), Probability and Statistics, 2nd edn, Addison-Wesley, Reading, Mass.

Friedman, N. & Singer, Y. (1999), Efficient Bayesian parameter estimation in large discrete domains, in 'Advances in Neural Information Processing Systems 11', MIT Press, Cambridge, Mass.

Heckerman, D. (1998), A tutorial on learning with Bayesian networks, in M. I. Jordan, ed., 'Learning in Graphical Models', Kluwer, Dordrecht, Netherlands.

Howard, R. A. (1966), 'Information value theory', IEEE Transactions on Systems Science and Cybernetics SSC-2, 22-26.

Kaelbling, L. P., Littman, M. L. & Moore, A. W. (1996), 'Reinforcement learning: A survey', Journal of Artificial Intelligence Research 4, 237-285.

Kanazawa, K., Koller, D. & Russell, S. (1995), Stochastic simulation algorithms for dynamic probabilistic networks, in 'Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI-95)', Morgan Kaufmann, Montreal.

Kearns, M. & Singh, S. (1998), Near-optimal performance for reinforcement learning in polynomial time, in 'Proceedings of the Fifteenth Int. Conf. on Machine Learning', Morgan Kaufmann.

Koller, D. & Fratkina, R. (1998), Using learning for approximation in stochastic processes, in 'Proceedings of the Fifteenth International Conference on Machine Learning', Morgan Kaufmann, San Francisco, Calif.

Moore, A. W. & Atkeson, C. G. (1993), 'Prioritized sweeping: reinforcement learning with less data and less time', Machine Learning 13, 103-130.

Ripley, B. D. (1987), Stochastic Simulation, Wiley, NY.

Russell, S. J. & Wefald, E. H. (1991), Do the Right Thing: Studies in Limited Rationality, MIT Press, Cambridge, Mass.

Sutton, R. S. (1990), Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, in 'Proceedings of the Seventh Int. Conf. on Machine Learning', Morgan Kaufmann, pp. 216-224.