Hector Geffner, Blai Bonet: A Concise Introduction to Models and Methods for Automated Planning. Morgan & Claypool (2013)
Hector Geffner
ICREA and Universitat Pompeu Fabra, Barcelona, Spain
Blai Bonet
Universidad Simón Bolívar, Caracas, Venezuela
Morgan & Claypool Publishers
Copyright © 2013 by Morgan & Claypool
DOI 10.2200/S00513ED1V01Y201306AIM022
Lecture #22
Series Editors: Ronald J. Brachman, Yahoo! Labs
William W. Cohen, Carnegie Mellon University
Peter Stone, University of Texas at Austin
Series ISSN
Synthesis Lectures on Artificial Intelligence and Machine Learning
Print 1939-4608 Electronic 1939-4616
ABSTRACT
Planning is the model-based approach to autonomous behavior where the agent behavior is derived
automatically from a model of the actions, sensors, and goals. The main challenges in planning are
computational as all models, whether featuring uncertainty and feedback or not, are intractable in the
worst case when represented in compact form. In this book, we look at a variety of models used in
AI planning, and at the methods that have been developed for solving them. The goal is to provide
a modern and coherent view of planning that is precise, concise, and mostly self-contained, without
being shallow. For this, we make no attempt at covering the whole variety of planning approaches,
ideas, and applications, and focus on the essentials. The target audience of the book is students and
researchers interested in autonomous behavior and planning from an AI, engineering, or cognitive
science perspective.
KEYWORDS
planning, autonomous behavior, model-based control, plan generation and recognition,
MDP and POMDP planning, planning with incomplete information and sensing, action
selection, belief tracking, domain-independent problem solving
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.1 Challenges and Open Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.2 Planning, Scalability, and Cognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Preface
Planning is a central area in Artificial Intelligence concerned with the automated generation of behav-
ior for achieving goals. Planning is also one of the oldest areas in AI with the General Problem Solver
being the first automated planner and one of the first AI programs [Newell et al., 1959]. Like other areas of AI, planning has changed a great deal in recent years, becoming more rigorous, more empirical,
and more diverse. Planners are currently seen as automated solvers for precise classes of mathematical
models represented in compact form, that range from those where the state of the environment is fully
known and actions have deterministic effects, to those where the state of the environment is partially
observable and actions have stochastic effects. In all cases, the derivation of the agent behavior from the
model is computationally intractable, and hence a central challenge in planning is scalability. Planning
methods must exploit the structure of the given problems, and their performance is assessed empiri-
cally, often in the context of planning competitions that in recent years have played an important role
in the area.
In this book, we look at a variety of models used in AI planning and at the methods that have
been developed for solving them. The goal is to provide a modern and coherent view of planning that is
precise, concise, and mostly self-contained, without being shallow. For this, we focus on the essentials
and make no attempt at covering the whole variety of planning approaches, ideas, and applications.
Moreover, our view of the essentials is not neutral, having chosen to emphasize the ideas that we find
most basic in a model-based setting. A more comprehensive treatment of planning, circa 2004, can be
found in the planning textbook by Ghallab et al. [2004]. Planning is also covered at length in the AI
textbook by Russell and Norvig [2009].
The book is organized into eight chapters. Chapter 1 is about planning as the model-based approach to autonomous behavior, in contrast to approaches where behaviors are learned, evolved, or
specified by hand. Chapters 2 and 3 are about the most basic model in planning, classical planning,
where a goal must be reached from a fully known initial state by applying actions with deterministic
effects. Classical planners can currently find solutions to problems over huge state spaces, yet many
problems do not comply with these restrictions. The rest of the book addresses such problems in two
ways: one is by automatically translating non-classical problems into classical ones; the other is by
defining native planners for richer models. Chapter 4 thus focuses on reductions for dealing with soft
goals, temporally extended goals, incomplete information, and a slightly different task: goal recogni-
tion. Chapter 5 is about planning with incomplete information and partial observability in a logical
setting where uncertainty is represented by sets of states. Chapters 6 and 7 cover probabilistic planning
where actions have stochastic effects, and the state is either fully or partially observable. In all cases,
we distinguish between offline solution methods that derive the complete control offline, and online
solution methods that derive the control as needed, by interleaving planning and execution, thinking
and doing. Chapter 8 is about open problems.
We are grateful to many colleagues, co-authors, teachers, and students. Among our teachers,
we would like to mention Judea Pearl, who was the Ph.D. advisor of both of us at different times, and
always a role model as a person and as a scientist. Among our students, we thank in particular Hector
Palacios, Emil Keyder, Alex Albore, Miquel Ramírez, and Nir Lipovetzky, on whose work we have
drawn for this book. The book is based on tutorials and courses on planning that one of us (Hector) has been giving over the last few years, more recently at the ICAPS Summer School (Thessaloniki,
2009; São Paulo, 2012; Perugia, 2013), the International Joint Conference on AI (IJCAI, Barcelona,
2011), La Sapienza, Università di Roma (2010), and the Universitat Pompeu Fabra (2012). We thank
the students for the feedback and our colleagues for the invitations and their hospitality. Thanks also
to Alan Fern who provided useful and encouraging feedback on a first draft of the book.
A book, even if it is a short one, is always a good excuse for remembering the loved ones.
To the kids: wanderer, there is no path. To Lito, the eternal flame; to Marina, much more than two; to the whole family; to the memory of the old man, the old lady, the bobe, and the dearly loved compañeros – Hector
To Iker and Natalia, for all their help and love, and to the whole family for their support. To the memory of Josefina Gorgal Caamaño and iaia Francisca Prat – Blai
CHAPTER 1
Planning and Autonomous Behavior
Planning is the model-based approach to autonomous behavior where the agent selects the action to
do next using a model of how actions and sensors work, what is the current situation, and what is the
goal to be achieved. In this chapter, we contrast programming, learning, and model-based approaches
to autonomous behavior, and present some of the models in planning that will be considered in more
detail in the following chapters. These models are all general in the sense that they are not bound
to specific problems or domains. This generality is intimately tied to the notion of intelligence, which
requires the ability to deal with new problems. The price for generality is computational: planning over
these models when represented in compact form is intractable in the worst case. A main challenge in
planning is thus the automated exploitation of problem structure for scaling up to large and meaningful
instances that cannot be handled by brute force methods.
Figure 1.1: Autonomous Behavior in the Wumpus World: What to do next? (The grid in the figure marks the cells adjacent to the pits and the wumpus with Breeze and Stench.)
The locations of the gold, the pits, and the wumpus are not known to the agent, but each emits a signal that can be perceived by the agent when in the same cell (gold) or in a contiguous cell (pits and wumpus). The agent control must specify the action to be done by the agent as a function of the observations gathered. The three basic approaches for obtaining such a controller are to write it by hand, to learn it from interactions with a Wumpus simulator, or to derive it from a model representing the initial situation, the actions, the sensors, and the goals.
While planning is often defined as the branch of AI concerned with the "synthesis of plans of action to achieve goals," planning is best conceived as the model-based approach to action selection, a view that defines more clearly the role of planning in intelligent autonomous systems. The distinction that the philosopher Daniel Dennett makes between "Darwinian," "Skinnerian," and "Popperian" creatures [Dennett, 1996] mirrors quite closely the distinction between hardwired (programmed) agents, agents that learn, and agents that use models, respectively. The contrast between the first and the latter corresponds also to the distinction made in AI between reactive and deliberative systems, as long as deliberation is not reduced to logical reasoning. Indeed, as we will see, the inferences captured by model-based methods that scale up are not logical but heuristic, and follow from relaxations and approximations of the problem being solved.
• a non-deterministic state transition function F(a, s) for s ∈ S and a ∈ A(s), where F(a, s) is non-empty and s' ∈ F(a, s) stands for the possible successor states of state s after an action a ∈ A(s) is done,
• a sensor model O(s, a) ⊆ O, where o ∈ O(s, a) means that token o may be observed in the (possibly hidden) state s if a was the last action done, and
Figure 1.2: A planner takes a compact representation of a planning problem over a certain class of models (clas-
sical, conformant, contingent, MDP, POMDP) and automatically produces a controller. For fully and partially
observable models, the controller is closed-loop, meaning that the action selected depends on the observations
gathered. For non-observable models like classical and conformant planning, the controller is open-loop, meaning
that it is a fixed action sequence.
model O(s, a), which does not depend on a but just on the hidden state s. Namely, O contains eight observation tokens o, corresponding to the possible combinations of the three booleans stench, breeze, and bright, so that if s is a state where the agent is next to a pit and a wumpus but not in the same cell as the gold, then o ∈ O(s, a) iff o represents the combination where stench and breeze are true, and bright is false. The action costs for the problem, c(a, s), can all be assumed to be 1, and in addition, no action can be done by the agent when he is not alive.
A partially observable planner is a program that accepts compact descriptions of instances of
the model above, like the one for the Wumpus, and automatically outputs the control (Figure 1.2). As
we will see, planners come in two forms: offline and online. In the first case, the behavior specifies the
agent response to each possible situation that may result; in the second case, the behavior just specifies
the action to be done in the current situation. These types of control, unlike the control that results
in classical planning, are closed-loop: the actions selected usually depend on the observation tokens
received.
Offline solutions of partially observable problems are not fixed action sequences as in classical
planning, as observations need to be taken into account for selecting actions. Mathematically, thus,
these solutions are functions mapping the stream of past actions and observations into actions, or more
conveniently, functions mapping belief states into actions. The belief state that results after a given stream of actions and observations represents the set of states that are deemed possible at that point, and due to the Markovian state-transition dynamics, it summarizes all the information about the past that is relevant for selecting the action to do next. Moreover, since the initial belief state b0 is given, corresponding to the set of possible initial states S0, a solution function π, usually called the control policy, does not need to be defined over all possible beliefs, but just over the beliefs that can be produced from the actions determined by the policy from the initial belief state b0 and the observations that may result. Such partial policies can be represented by a directed graph rooted at b0, where nodes stand for belief states, edges stand for actions ai or observations oi, and the branches in the graph from b0 stand for the streams of actions and observations a0, o0, a1, o1, ..., called executions, that are possible. The policy solves the problem when all these possible executions end up in belief states where the goal is true.2
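As an illustration of how such beliefs are tracked in the logical setting, the following sketch (our own, with assumed function names, not code from the book) progresses a belief state, represented as a set of states, through an action and an observation using the functions F(a, s) and O(s, a) introduced above.

    # Sketch of belief progression in the logical (set-based) setting.
    # F(a, s): set of possible successor states of s under action a.
    # O(s, a): set of observation tokens compatible with state s after action a.
    # The function names and representation are illustrative assumptions.

    def progress(belief, a, o, F, O):
        """Belief that results from doing action a in `belief` and then observing o."""
        after_action = {s2 for s in belief for s2 in F(a, s)}
        return {s2 for s2 in after_action if o in O(s2, a)}

    # Tiny example: a 'look' action that leaves the hidden state unchanged.
    F = lambda a, s: {s}
    O = lambda s, a: {"bright"} if s == "gold-here" else {"dark"}
    b0 = {"gold-here", "gold-elsewhere"}
    print(progress(b0, "look", "bright", F, O))   # {'gold-here'}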
The models above are said to be logical as they only encode and keep track of what is possible or not. In probabilistic models, on the other hand, each possibility is weighted by a probability measure. A probabilistic version of the partially observable model above can be obtained by replacing the set of possible initial states S0, the set of possible successor states F(a, s), and the set of possible observation tokens O(s, a), by probability distributions: a prior P(s) on the states s ∈ S0 that are initially possible,
2 We will make this all formal and precise in Chapter 5.
transition probabilities Pa(s'|s) encoding the likelihood that s' is the state that follows s after a, and observation probabilities Pa(o|s) encoding the likelihood that o is the token that results in the state s when a is the last action done.
The model that results from changing the sets S0, F(a, s), and O(s, a) in the partially observable model by the probability distributions P(s), Pa(s'|s), and Pa(o|s) is known as a Partially Observable Markov Decision Process or POMDP [Kaelbling et al., 1998]. The advantage of representing uncer-
tainty by probabilities rather than sets is that one can then talk about the expected cost of a solution
as opposed to the cost of the solution in the worst case. Indeed, there are many meaningful problems
that have infinite cost in the worst case but perfectly well-defined expected costs. These include, for
example, the problem of preparing an omelette with an infinite collection of eggs that may be good
or bad with non-zero probabilities, but that can be picked up and sensed one at a time. Indeed, while
the scope of probabilistic models is larger than the scope of logical models, we will consider both, as
the latter are simpler, and the computational ideas are not all that different.
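In the probabilistic setting, the belief over states is updated with the standard Bayes filter built from these distributions. The sketch below is ours (the book develops the formal treatment in later chapters); it assumes the transition and observation probabilities are given as functions Pt(a, s, s2) and Po(a, s, o).

    # Sketch of the standard Bayesian belief update for the probabilistic model:
    # prediction with Pa(s'|s), correction with Pa(o|s), then normalization.
    # Pt and Po are assumed interfaces, not the book's code.

    def update(belief, a, o, Pt, Po, states):
        """New belief over states after doing a in `belief` and observing o."""
        new_b = {}
        for s2 in states:
            predicted = sum(belief[s] * Pt(a, s, s2) for s in states)
            new_b[s2] = Po(a, s2, o) * predicted
        total = sum(new_b.values())
        return {s2: p / total for s2, p in new_b.items()} if total > 0 else new_b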
A fully observable model is a partially observable model where the state of the system is fully
observable, i.e., where O = S and O(s, a) = {s}. In the logical setting such models are known as Fully
Observable Non-Deterministic models, abbreviated FOND. In the probabilistic setting, they are known
as Fully Observable Markov Decision Processes or MDPs [Bertsekas, 1995].
Finally, an unobservable model is a partially observable model where no relevant information
about the state of the system is available. This can be expressed through a sensor model O containing a single dummy token o that is "observed" in all states, i.e., O(s, a) = O(s', a) = {o} for all s, s',
and a. In planning, such models are known as conformant, and they are defined exactly like partially
observable problems but with no sensor model. Since there are no (true) observations, the solution
form of conformant planning problems is like the solution form of classical planning problems: a fixed
action sequence. The difference between classical and conformant plans, however, is that the former
must achieve the goal for the given initial state and unique state-transitions, while the latter must
achieve the goal in spite of the uncertainty in the initial situation and dynamics, for any possible initial
state and any state transition that is possible. As we will see, conformant problems are an interesting stepping stone on the way from classical to partially observable planning.
In the book, we will consider each of these models in turn, some useful special cases, and some
variations. This variety of models is the result of several orthogonal dimensions: uncertainty in the
initial system state (fully known or not), uncertainty in the system dynamics (deterministic or not),
the type of feedback (full, partial or no state feedback), and whether uncertainty is represented by sets
of states or probability distributions.
Figure 1.3: The graph corresponding to a simple planning problem involving three blocks with initial and goal situations as shown. The actions allow moving a clear block on top of another clear block or to the table. The size of the complete graph for this domain is exponential in the number of blocks. A plan for the problem is shown by the path in red.
For a problem described in terms of variables that can take two possible values, the number of nodes in the graph to search can be in the order of 2^n, where n is the number of variables. In particular, if the problem involves 30 variables, this means 1,073,741,824 nodes, and if the problem involves 100 variables, it means more than 10^30 nodes. In order to get a concrete idea of what exponential growth means, if it takes one second to generate 10^7 nodes (a realistic estimate given current technology), it would take more than 10^23 seconds to generate 10^30 nodes. This is, however, almost one million times the estimated age of the universe.3

3 The age of the universe is estimated at 13.7 × 10^9 years approximately. Visiting 2^100 nodes at 10^7 nodes a second would take in the order of 10^15 years, as 2^100/(10^7 · 60 · 60 · 24 · 365) = 4.01969368 × 10^15.

A more vivid illustration of the complexity inherent to the planning problem can be obtained by considering a well known domain in AI: the Blocks World. Figure 1.3 shows an instance of this domain where blocks A, B, and C, initially arranged so that A is on B, and B and C are on the table, must be rearranged so that B is on C, and C is on A. The actions allow moving a clear block (a block with no block on top) on top of another clear block or onto the table. The problem can be easily expressed as a classical planning problem where the variables are the block locations: blocks can be on the table or on top of another block. The figure shows the graph associated with the problem, whose solution is a path connecting the node representing the initial situation with a node representing a goal situation.

The number of states in a Blocks World problem with n blocks is exponential in n, as the states include all the n! possible towers of n blocks plus additional combinations of lower towers. Thus, a planner
able to solve arbitrary Blocks World instances should be able to search for paths over huge graphs. This
is a crisp computational challenge that is very different from writing a domain-specific Blocks World
solver—namely, a program for solving any instance of this specific domain. Such a program could
follow a domain-specific strategy, like placing all misplaced blocks on the table first, in order, from top
to bottom, then moving these blocks to their destination in order again, this time, from the bottom
up. This program will solve any instance of the Blocks World but will be completely useless in other domains. The challenge in planning is to achieve both generality and scalability. That is, a classical planner must accept a description of any problem in terms of a set of variables whose initial values are known, a set of actions that change the values of these variables deterministically, and a set of goals defined over these variables. The planner is domain-general or domain-independent in the sense that
it does not know what the variables, actions, and domain stand for, and for any such description it
must decide effectively which actions to do in order to achieve the goals.
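To make this interface concrete, here is a minimal sketch (our own illustration, not the book's code) of such a description: states as sets of facts, actions given by preconditions, add, and delete lists, and a blind breadth-first search over the induced graph.

    from collections import deque

    # A STRIPS-like action: applicable when its preconditions hold in the state;
    # applying it deletes and adds facts. This encoding is a common convention,
    # used here only to illustrate the domain-independent planner interface.
    class Action:
        def __init__(self, name, pre, add, delete):
            self.name = name
            self.pre, self.add, self.delete = frozenset(pre), frozenset(add), frozenset(delete)

        def applicable(self, state):
            return self.pre <= state

        def apply(self, state):
            return frozenset((state - self.delete) | self.add)

    def bfs_plan(init, goals, actions):
        """Blind breadth-first search for a plan from init to a state satisfying goals."""
        init, goals = frozenset(init), frozenset(goals)
        frontier, seen = deque([(init, [])]), {init}
        while frontier:
            state, plan = frontier.popleft()
            if goals <= state:
                return plan
            for a in actions:
                if a.applicable(state):
                    s2 = a.apply(state)
                    if s2 not in seen:
                        seen.add(s2)
                        frontier.append((s2, plan + [a.name]))
        return None

Blind search of this kind does not scale to large instances, of course; the heuristics discussed in Chapter 2 are what make domain-independent search feasible.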
For classical planning, as for the other planning models that we will consider, the general prob-
lem of coming up with a plan is NP-hard [Bylander, 1994, Littman et al., 1998]. In Computer Science,
an NP-hard problem (non-deterministic polynomial-time hard) is a problem that is at least as hard
as any NP-complete problem; these are problems that can be solved in polynomial time by a non-
deterministic Turing Machine but which are widely believed not to admit polynomial-time solutions
on deterministic machines [Sipser, 2006]. The complexity of planning and related models has been
used as evidence for contesting the possibility of general planning and reasoning abilities in humans
or machines [Tooby and Cosmides, 1992]. The complexity of planning, however, just implies that no
planner can efficiently solve every problem from every domain, not that a planner cannot solve an
infinite collection of problems from seen and unseen domains, and hence be useful to an acting agent.
This is indeed the way modern AI planners are empirically evaluated and ranked in the AI planning competitions, where they are tried over domains that the planners' authors have never seen. Thus, far
from representing an insurmountable obstacle, the twin requirements of generality and scalability have
been addressed head on in AI planning research, and have resulted in simple but powerful computa-
tional principles that make domain-general planning feasible. The computational challenge of achieving both scalability and generality over a broad class of intractable models has actually come to characterize a lot of the research work in AI, which has increasingly focused on the development of
effective algorithms or solvers for a wide range of tasks and models (Figure 1.4); tasks and models that
include SAT and SAT-variants like Weighted-Max SAT and Weighted Model Counting, Bayesian
Networks, Constraint Satisfaction, Answer Set Programming, General Game Playing, and Classi-
cal, MDP, and POMDP Planning. This is all work driven by theory and experiments, with regularly
held competitions used to provide focus, to assess progress, and to sort out the ideas that work best
empirically [Geffner, 2013a].
1.4 EXAMPLES
We consider next a simple navigation scenario to illustrate how different types of planning problems
call for different planning models and different solution forms. The general scenario is shown in Figure 1.5, where the agent marked as A has to reach the goal marked as G. The four actions available let the agent move one unit in each one of the four cardinal directions, as long as there is no wall. Actions that lead the agent into a wall have no effect. The question is how the agent should select the
actions for achieving the goal with certainty under different knowledge and sensing conditions. In all
Figure 1.4: Models and Solvers: Research work in AI has increasingly focused on the formulation and devel-
opment of solvers for a wide range of models. A solver takes the representation of a model instance as input, and
automatically computes its solution in the output. Some of the models considered are SAT, Bayesian Networks,
General Game Playing, and Classical, MDP, and POMDP Planning. All of these models are intractable when represented in compact form. The main challenge is scaling up.
cases, we assume that the agent knows the map, including where the walls and the goals are. In the
simplest case, the actions are assumed to be deterministic and the initial agent location known. In this
case, the agent faces a classical planning problem whose solution is a path in the grid joining the initial
agent location and the goal. On the other hand, if the actions have effects that can only be predicted
probabilistically but the state of the problem—the agent location—is always observable, the problem
becomes an MDP planning problem. The solution to this problem is no longer an action sequence, which cannot guarantee that the goal will be achieved with certainty, but a policy assigning one of the four possible actions to each one of the states. The number of steps to reach the goal can no longer
be determined with certainty but there is then an expected number of steps to reach the goal that
can be determined, as the policy and the action model induce a probability distribution over all the
possible paths in the grid. Policies that ensure that the goal is eventually achieved with certainty are
called proper policies. Interestingly, we will see that the exact transition probabilities are not relevant
for defining or computing proper policies; all we need to know for this are which state transitions are
possible (probability different from zero) and which ones are not (probability equal to zero). The problem variation in which the actions have stochastic effects but the location of the agent cannot be fully observed is a POMDP planning problem, whose general solution is neither a fixed action sequence, which ignores the observations, nor a policy prescribing the action to do in each state, which assumes that the state is observable. It is rather a policy that maps belief states into actions. In the POMDP setting,
a belief state is a probability distribution over the states that are deemed possible. These probability distributions summarize all the information about the past that is relevant for selecting the action to do next. The initial belief state has to be given, and the current belief state is determined by the actions
done, the observations gathered, and the information in the model. On the other hand, if uncertainty
about the initial situation, the system dynamics and the feedback are represented by sets of states as
opposed to probability distributions over states, we obtain a partially observable planning problem which
is the logical counterpart of POMDPs. In such problems, the number of possible belief states (sets
of states deemed possible) is finite, although exponential in the number of states, and hence doubly
exponential in the number of problem variables. Finally, if the agent must reach the goal with certainty
but there is uncertainty about the initial state or about the next state dynamics, and there is no feedback
of any type, the problem that the agent faces is a conformant problem. For the problem shown in the
figure, if the actions are deterministic and the agent knows that it is initially somewhere in the room
on the left, a conformant solution can be obtained as follows: the agent moves up five times, until
it knows with certainty that it is somewhere on the top row, then it moves right three times until it
knows with certainty that it is exactly at the top right corner of the left room. With all uncertainty
gone, the agent can then find a path to the goal from that corner.
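The argument can be checked mechanically by progressing the set of possible locations under the action sequence. The toy sketch below assumes a 5x5 left room (the dimensions in Figure 1.5 differ, so the number of moves differs too); since moves are deterministic and bumping into a wall has no effect, the same sequence shrinks the belief to a single cell from any starting position.

    # Toy conformant check on an assumed 5x5 room: after enough 'up' and 'right'
    # moves, the set of possible cells collapses to the top-right corner.

    def move(cell, action, width=5, height=5):
        x, y = cell
        dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
        nx, ny = x + dx, y + dy
        # Bumping into a wall leaves the agent where it is.
        return (nx, ny) if 0 <= nx < width and 0 <= ny < height else (x, y)

    belief = {(x, y) for x in range(5) for y in range(5)}   # the agent could be anywhere
    for action in ["up"] * 4 + ["right"] * 4:
        belief = {move(c, action) for c in belief}
    print(belief)   # {(4, 4)}: all uncertainty about the location is gone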
Figure 1.5: Variations on a planning problem: Agent, marked as A, must reach the goal marked G, by moving
one cell at a time under different knowledge and sensing conditions.
A completely different planning example is shown in Figure 1.6 for a problem inspired on the
use of deictic representations [Ballard et al., 1997, Chapman, 1989], where a visual-marker or “eye”
(the circle on the lower left) must be placed on top of a green block by moving it one cell at a time.
The location of the green block is not known initially, and the observations are just whether the cell with the mark contains a green block (G), a non-green block (B), or neither (C), and whether such a cell is at the level of the table (T) or not (–). The problem is a partially observable planning problem
and the solution to it can be expressed by means of control policies mapping beliefs into actions. An
alternative way to represent solutions to these types of problems is by means of finite-state controllers,
such as the one shown on the right of Figure 1.6. This finite-state controller has two internal states, the initial controller state q0 and a second controller state q1. An arrow q → q' labeled o/a in the controller indicates that, upon obtaining the observation o in the controller state q, the action a should be performed, moving to the controller state q', which may be equal to q or not, and where the same action selection
mechanism is applied. The reader can verify that the finite-state controller searches for a tower with
a green block from left to right, going all the way up to the top block in each tower, then going all
the way down to the table, and iterating in this manner until the visual marker appears on top of a
block that is green. Finite-state controllers provide a very compact and convenient representation of
the actions to be selected by an autonomous system, and for this reason they are commonly used in
practice for controlling robots or non-playing characters in video games [Buckland, 2004, Mataric,
2007, Murphy, 2000]. While these controllers are normally written by hand, we will show later that
they can be obtained automatically using planners. Indeed, the controller shown in the figure
has been derived in this way using a classical planner over a suitable transformation of the partially
observable problem shown on the left [Bonet et al., 2009]. It is actually quite remarkable that the
finite-state controller that has been obtained in this manner is not only good for solving the original
problem on the left, but also for an infinite number of variations of it. It can actually be shown that the controller will successfully solve any modification of the problem resulting from changes in either the dimensions of the grid, the number of blocks, or their configuration. Thus, in spite of appearances, the power
[Figure 1.6, right: a two-state controller with states q0 and q1 and edges labeled TC/Right, TB/Up, –B/Up, –B/Down, –C/Down, and TB/Right.]
Figure 1.6: Left: Problem where a visual-marker (mark on the lower left cell) must be placed on top of a green block by just observing what is on the marked cell. Right: Finite-state controller obtained with a classical planner from a suitable translation. The controller solves the problem and any variation of the problem that results from changes in the number or configuration of blocks.
of classical planners shouldn’t be underestimated. Often we will be able to solve non-classical planning
problems " using
DPNQMFUFMZ EJĊFSFOU
classical plannersQMBOOJOH
by meansFYBNQMF JT TIPXO
of feasible, JO 'JHVSFtransformations.
well-defined GPS B QSPCMFN JOTQJSFE PO UIF
VTF PG EFJDUJD SFQSFTFOUBUJPOT <#BMMBSE FU BM
$IBQNBO
>
XIFSF B WJTVBMNBSLFS PS AFZF
UIF DJSDMF PO UIF MPXFS MFGU
NVTU CF QMBDFE PO UPQ PG B HSFFO CMPDL CZ NPWJOH JU POF DFMM BU B UJNF
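The execution of such a finite-state controller is simple to sketch. The generic executor below is our own illustration; the concrete transition table of the controller in Figure 1.6 is not reproduced here.

    # Generic executor for a finite-state controller: `table` maps a pair
    # (controller_state, observation) to a pair (action, next_controller_state).

    def run_controller(table, observe, act, q0="q0", max_steps=1000):
        q = q0
        for _ in range(max_steps):
            o = observe()                 # e.g., an observation token like "TB" or "-C"
            if (q, o) not in table:       # no outgoing edge for this observation: stop
                return
            a, q = table[(q, o)]
            act(a)                        # e.g., "Up", "Down", "Left", "Right"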
1.5 GENERALIZED PLANNING: PLANS VS. GENERAL STRATEGIES

The visual-marker problem illustrates two differences that are crucial in planning. One is the difference between a solution to a problem instance and a solution to a family of instances. The second is the difference between expressing the solution to a problem and finding a solution to the problem in the first place. For example, a solution to a particular Blocks World instance, with blocks A, B, and C on the table, that must be stacked in order with C on top, may be the action sequence pick(B), stack(B,A), pick(C), stack(C,B). On the other hand, the general strategy of putting all blocks on the table, in order from
stack(C,B). On the other hand, the general strategy of putting all blocks on the table, in order from
top to bottom, followed by putting all blocks on their destination in order from the bottom to the
top, works for all Blocks World instances, regardless of the number and names of the blocks. Planning
in AI, and in particular what is called domain-independent planning, has been mostly focused on
models and methods for expressing and solving single planning instances. On the other hand, the work
in planning driven by applications over certain specific domains has usually focused on languages like
Hierarchical Task Networks [Erol et al., 1994] for expressing by hand the strategies for solving any problem
in the given domain. The problem of computing general domain strategies has not been tackled by
automated methods, because the problem appears to be too hard in general. Yet, ideally, this is where
we would like to get, at least on domains that admit compact general solutions [Srivastava et al.,
2011a]. In a recent formulation, a form of generalized planning of this type has been shown to be
EXPSPACE-Complete [Hu and de Giacomo, 2011]. In this formulation, all instances are assumed
to share the same set of actions and observations, and a general solution is a function mapping streams
of observations into actions. In some cases, such functions can be conveniently expressed as policies
that map suitable combinations of observables, called features, into actions. The crucial question is how
to get such features and policies effectively, in particular over domains that admit compact solutions
over the right features. An early approach that does this in the blocks world constructs a pool of possible
features from the primitive domain predicates and a simple grammar (a description logic), and then
looks for compact rule-based policies over some of such features using supervised learning algorithms
[Fern et al., 2003, Martin and Geffner, 2000]. Another approach, used for playing a challenging game
in real-time, represents the policies that map observables into actions by means of neural networks
whose topology and weights are found by a form of evolutionary search [Stanley et al., 2005]. Both of
these approaches are aimed at general domain policies that map state features into actions, which are
not tied to specific instances.
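The general Blocks World strategy mentioned earlier (put the blocks on the table from the top down, then build the goal towers from the bottom up) is easy to write by hand; the sketch below uses our own representation, with `on` and `goal_on` mapping each block to what it sits on ("table" or another block).

    # Sketch of the general Blocks World strategy: unstack everything onto the
    # table, then build the goal towers from the bottom up. Illustration only.

    def general_strategy(on, goal_on):
        plan = []
        def height(b):
            return 0 if on[b] == "table" else 1 + height(on[b])
        # 1. Unstack: move blocks to the table, from the top of each tower down.
        for b in sorted(on, key=height, reverse=True):
            if on[b] != "table":
                plan.append(("move-to-table", b))
                on[b] = "table"
        # 2. Build: place each block on its goal, from the bottom of each goal tower up.
        def goal_height(b):
            return 0 if goal_on[b] == "table" else 1 + goal_height(goal_on[b])
        for b in sorted(goal_on, key=goal_height):
            if goal_on[b] != "table" and on[b] != goal_on[b]:
                plan.append(("move", b, goal_on[b]))
                on[b] = goal_on[b]
        return plan

    # Example: A on B, B and C on the table; goal: C on A and B on C (as in Figure 1.3).
    # Returns [('move-to-table', 'A'), ('move', 'C', 'A'), ('move', 'B', 'C')].
    print(general_strategy({"A": "B", "B": "table", "C": "table"},
                           {"A": "table", "C": "A", "B": "C"}))

The point of the surrounding discussion is precisely that strategies like this one are written by hand; deriving them automatically is the open challenge.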
In a famous exchange during the 80s about “universal plans,” Matthew Ginsberg attacked the
value of the idea of general plans on computational grounds [Ginsberg, 1989]. A universal plan is a
strategy for solving not just one planning instance, but many instances, and more specifically, all those
instances that can be obtained by just changing the initial state of the problem [Schoppers, 1987].
While the solutions to such generalized planning problems can be expressed as policies mapping states
into actions, as for MDPs, Ginsberg argued that the size of such universal plans would often be just too
large; exponential at least for families of problems that are NP-hard to solve. In order to illustrate this
point, Ginsberg conjectured that no compact universal plan could be defined for a specific problem
that he called the fruitcake problem, where blocks were to be placed on a tower to spell the word
"fruitcake." The challenge, however, was answered by David Chapman who came up with an elegant
reactive architecture, basically a circuit in line with his “Pengi” system [Agre and Chapman, 1987], that
in polynomial time solved the problem [Chapman, 1989]. Chapman went on to claim that Blockhead,
his system, solved the fruitcake problem, and solved it easily, with no search or planning, thus raising
doubts not only about the value of "universal plans" but also about the need for planning itself. This was an
interesting debate, and if we bring it here it is because it relates to the two key distinctions mentioned
at the beginning of this section: one between a solution to a problem instance and a solution to a family of
instances, the other between expressing a solution and finding a solution. From this perspective, Chapman
is right that general strategies for solving many classes of interesting problems can often be encoded
through compact representations, like Pengi-like circuits, yet the challenge is in coming up with such
strategies automatically. Without this ability, it cannot be said that Blockhead is solving the problem—
it’s Chapman. Blockhead simply executes the policy crafted by Chapman that is not general and doesn’t
apply to other domains.
1.6 HISTORY
The first AI planner and one of the first AI programs was introduced by Newell and Simon in the 50s [Newell and Simon, 1961, Newell et al., 1959]. This program, called GPS, for General Problem
Solver, introduced a technique called means-ends analysis where differences between the current state
and the goal were identified and mapped into operators that decreased those differences. The STRIPS
system [Fikes and Nilsson, 1971] combined means-ends analysis with a convenient declarative action
language. Since then, the idea of means-ends analysis has been refined and extended in many ways, in
the formulation of planning algorithms that are sound (only produce plans), complete (produce a plan if
one exists), and effective (scale up to large problems). By the early 90s, the state-of-the-art planner was
UCPOP [Penberthy and Weld, 1992], an implementation of an elegant planning method known as
partial-order planning where plans are not searched either forward from the starting state or backward
from the goal, but are constructed from a decomposition scheme in which joint goals are decomposed
into subgoals that create as further subgoals the preconditions of the actions used to establish them
[McAllester and Rosenblitt, 1991, Sacerdoti, 1975, Tate, 1977]. The actions that are incorporated into
the plan are partially ordered as needed in order to resolve possible conflicts among them. Partial-order
planning algorithms are sound and complete, but do not scale up well, as there are too many choices
to make and too little guidance on how to make those choices; yet see [Nguyen and Kambhampati,
2001, Vidal and Geffner, 2006].
The situation in planning changed drastically in the mid 90s with the introduction of Graphplan
[Blum and Furst, 1995], an algorithm that appeared to have little in common with previous approaches
but scaled up much better. Graphplan builds a plan graph in polynomial time reasoning forward from
the initial state, which is then searched backward from the goal to find a plan. It was shown later
that the reason Graphplan scaled up well was a powerful admissible heuristic implicit in the plan graph [Haslum and Geffner, 2000]. The success of Graphplan prompted other approaches. In
the SAT approach [Kautz and Selman, 1996], the planning problem for a fixed planning horizon is
converted into a general satisfiability problem expressed as a set of clauses (a formula in Conjunctive
Normal Form or CNF) that is fed into state-of-the-art SAT solvers, which currently manage to solve
huge SAT instances even though the SAT problem is NP-complete.
Currently, the formulation of classical planning that appears to scale up best is based on heuristic
search, with heuristic values derived from the delete-relaxation [Bonet et al., 1997, McDermott, 1996].
In addition, state-of-the-art classical planners use information about the actions that are most “helpful”
in a state [Hoffmann and Nebel, 2001], and implicit subgoals of the problem, called landmarks, that
are also extracted automatically from the problem with methods similar to those used for deriving
heuristics [Hoffmann et al., 2004, Richter and Westphal, 2010].
Since the 90s, increasing attention has been placed on planning over non-classical models such
as MDPs and POMDPs where action effects are not fully predictable, and the state of the system
is fully or partially observable [Dean et al., 1993, Kaelbling et al., 1998]. We will consider all these
variations and others in the rest of the book. We will have less to say about Hierarchical Task Planning
or HTN planning, which, while widely used in practice, is focused on the representation of general
strategies for solving problems rather than on representing and solving the problems themselves. For a
comprehensive planning textbook, see Ghallab et al. [2004], while for a modern AI textbook covering
planning at length, see Russell and Norvig [2009].
CHAPTER 2
Classical Planning:
Full Information and
Deterministic Actions
In classical planning, the task is to drive a system from a given initial state into a goal state by applying
actions whose effects are deterministic and known. Classical planning can be formulated as a path-
finding problem over a directed graph whose nodes represent the states of the system or environment, and whose edges capture the state transitions that the actions make possible. The computational challenge in classical planning results from the number of states, and hence the size of the graph, which
are exponential in the number of problem variables. State-of-the-art methods in classical planning
search for paths in such graphs by directing the search toward the goal using heuristic functions that are
automatically derived from the problem. The heuristic functions map each state into an estimate of the
distance or cost from the state to the goal, and provide the search for the goal with a sense of direction.
In this chapter, we look at the model and languages for classical planning, and at the heuristic search
techniques that have been developed for solving it. Variations and extensions of these methods, as well
as alternative methods, will be considered in the next chapter.
SEARCH(Nodes)
while Nodes ≠ ∅ do
    Let n := SELECT-NODE(Nodes)
    Let Rest := Nodes \ {n}
    if n is a goal node then
        return EXTRACT-SOLUTION(n)
    else
        Let Children := EXPAND-NODE(n)
        Set Nodes := ADD-NODES(Children, Rest)
    end if
end while
return Unsolvable
Figure 2.1: General search schema invoked with Nodes containing the source node only. A number of familiar
search algorithms are obtained by suitable choices of the SELECT-NODE and ADD-NODES functions.
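A direct transcription of this schema (a sketch of ours, with names following the figure) makes explicit how the choice of these two functions yields the different algorithms:

    # Sketch of the general search schema of Figure 2.1; plugging in different
    # select_node / add_nodes functions yields DFS, BFS, or Best-First Search.

    def search(nodes, is_goal, expand, select_node, add_nodes, extract_solution):
        while nodes:
            n = select_node(nodes)
            rest = [m for m in nodes if m is not n]
            if is_goal(n):
                return extract_solution(n)
            nodes = add_nodes(expand(n), rest)
        return "Unsolvable"

    # Depth-first: the frontier behaves as a stack (children placed at the front).
    dfs_policy = dict(select_node=lambda ns: ns[0],
                      add_nodes=lambda children, rest: list(children) + rest)
    # Breadth-first: the frontier behaves as a queue (children placed at the back).
    bfs_policy = dict(select_node=lambda ns: ns[0],
                      add_nodes=lambda children, rest: rest + list(children))

Selecting from the frontier the node that minimizes an evaluation function f(n) instead gives Best-First Search and, with f(n) = g(n) + h(n), the A* algorithm discussed below.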
Algorithms that use the heuristic function to guide the search are called heuristic or informed search algorithms, while those in which the heuristic values play no active role during the search are called brute force or blind search algorithms [Edelkamp and Schrödl, 2012, Pearl, 1983]. The latter include algorithms like Depth-First Search, Breadth-First Search, and Uniform Cost Search, also called Dijkstra's algorithm [Cormen et al., 2009]. The former include Best-First Search, A*, and Hill Climbing. The algorithms search for paths in different ways, and have different properties concerning completeness, optimality, and time and memory complexity. We will give a quick overview of them before reviewing some useful variants.
The algorithms can all be understood as particular instances of the general schema shown in
Figure 2.1, where a search frontier called Nodes, initialized with the root node of the graph, shifts
incrementally as the graph is searched. In each iteration, two steps are performed: a selected node is
removed from the search frontier, and if the selected node is not a goal node, its children are added
to the search frontier, else the search terminates and the path to the last node selected is returned.
The nodes in the search represent the states of the problem and contain in addition bookkeeping information like a pointer to the parent node and the weight of the best path to the node so far, called the accumulated cost and denoted by the expression g(n), where n is the node. The various algorithms
arise from the representation of the search frontier Nodes, the way nodes are selected from this frontier,
and the way the children of these nodes are added to the search frontier.
Depth-First Search is the algorithm that results from implementing the search frontier Nodes as a STACK: the node that is selected is the top node in the stack, and the children nodes are added to the top of the stack. It can be shown that if the graph is acyclic, the nodes in Nodes will indeed be selected in depth-first order. Similarly, Breadth-First Search is the algorithm that results from implementing the search frontier Nodes as a QUEUE: nodes are selected from one end, and their children are added to the other end. It can be shown then that the nodes in Nodes will be selected depth-last, and more precisely, shallowest-first, which is the characteristic of Breadth-First Search. Finally, Best-First Search is the algorithm that results when the search frontier is set up as a PRIORITY QUEUE, so that the nodes selected are the ones that minimize a given evaluation function f(n). Best-First Search reduces to the well-known A* algorithm [Hart et al., 1968] when the evaluation function is defined as the sum
f(n) = g(n) + h(n) of the accumulated cost g(n) and the heuristic estimate of the cost-to-go h(n). Other known variations of Best-First Search are Greedy Best-First Search, where f(n) = h(n), and WA*, where f(n) = g(n) + W·h(n) and W is a constant larger than 1. Finally, Uniform-Cost Search or Dijkstra's algorithm corresponds basically to a Best-First Search with evaluation function f(n) = g(n), or alternatively to the A* algorithm with the heuristic function h(n) set to 0.
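In code, the difference between these best-first variants is just the evaluation function plugged into the priority queue; a sketch (the weight W = 2 is an arbitrary illustrative choice):

    # Evaluation functions f(n) for the best-first variants above, written as
    # functions of the accumulated cost g = g(n) and heuristic estimate h = h(n).
    f_astar    = lambda g, h: g + h       # A*
    f_wastar   = lambda g, h: g + 2 * h   # WA* with W = 2 (any constant W > 1)
    f_greedy   = lambda g, h: h           # Greedy Best-First Search
    f_dijkstra = lambda g, h: g           # Uniform-Cost Search / Dijkstra's algorithm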
Certain optimizations are common. In particular, Depth-First Search prunes paths that contain
cycles—namely, pairs of nodes that represent the same state (duplicate nodes), while Breadth-First and
Best-First Search keep track of the nodes that have been already selected and expanded in a CLOSED
list. Duplicates of these nodes can be pruned except when the new node has a lower accumulated cost
g(n) and the search is aimed at returning a minimum-cost path. Similarly, duplicate nodes in the
search frontier, also called the OPEN list, are avoided by just keeping in OPEN the node with the
least evaluation function.
It can be shown that all these algorithms are complete, meaning that if there is a path to a goal
node, the algorithms will find a path in finite time.1 Furthermore, some of these algorithms are optimal,
meaning that the paths returned upon termination will be optimal. These include Dijkstra's algorithm, Breadth-First Search when action costs are uniform, and A* when the heuristic h(n) is admissible, i.e., it doesn't overestimate the true optimal cost h*(n) from n to the goal for any node n. The complexity
in space of these algorithms can be described in terms of the length d of the solutions and the average number of children per node, the so-called branching factor b. The space requirement of Breadth-First Search is exponential and grows as O(b^d), where d is the length of the optimal solution, as a b-ary tree of depth d has b^d leaves, all of which can make it into the search frontier in the worst case. The space complexity of A* with admissible heuristics is in turn O(b^(C*/cmin)), where C* is the optimal cost of the problem and cmin is the minimum action cost. This is because A* may expand in the worst case all nodes n with evaluation function f(n) ≤ C*, and such nodes can be at depth C*/cmin.2 The same
bounds apply for the time complexity of the algorithms. On the other hand, the space requirements
for Depth-First Search are minor: they grow linearly with d as O(bd), as the search frontier in DFS just needs to keep track of the path to the last selected node along with the children of the ancestor nodes that have not yet been expanded.
e difference between linear and exponential memory requirements can be crucial, as algo-
rithms that require exponential memory may abort after a few seconds or minutes with an “insuf-
ficient memory” message. Since DFS is the only linear space algorithm above, extensions of DFS
have been developed that use linear memory and yet return optimal solutions. The core of some of
these algorithms is a bounded-cost variant of DFS where selected nodes n whose evaluation function f(n) = g(n) exceeds a given bound B are immediately pruned. Bounded-Cost DFS remains complete provided that the bound B is not smaller than the cost C* of an optimal solution, but it is not optimal unless B is equal to C*. Two iterative variants of Bounded-Cost DFS achieve optimality by performing several successive trials with increasing values for the bound B, until B = C*. Iterative Deepening (ID) is a sequence of Bounded-Cost DFS searches with the bound B0 for the first iteration set to 0, and the bound Bi for iteration i > 0 set to the minimum evaluation function f(n) over the nodes pruned in the previous iteration, so that at least one new node is expanded in each iteration. In its most standard form, when all action costs are equal to 1, the bound Bi in iteration i is Bi = i. ID
1 For DFS to be complete, paths containing cycles must be pruned to avoid getting trapped in a loop.
2 This bound assumes that the heuristic is consistent, as otherwise, nodes may have to be reopened an exponential number of
times in the worst case [Pearl, 1983].
combines the best elements of Depth-First Search (memory) and Dijkstra’s algorithm (optimality).
ID achieves this combination by performing multiple searches where the same node may be expanded
multiple times. Yet, asymptotically this does not affect the time complexity, which is dominated by the worst-case time of the last iteration. Iterative Deepening A* (IDA*) is a variant of ID that uses the evaluation function of A*—namely, f(n) = g(n) + h(n). As long as the heuristic is admissible, the
first solution encountered by IDA* will be optimal too, while usually performing fewer iterations than
ID and pruning many more nodes [Korf, 1985].
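A compact sketch of IDA* along these lines (ours, using an explicit stack rather than the usual recursive formulation) is shown below; each iteration is a bounded-cost DFS, and the bound grows to the smallest f-value pruned in the previous iteration.

    import math

    # Sketch of IDA*: repeated depth-first searches bounded by f(n) = g(n) + h(n).
    # successors(s) yields (successor, action_cost) pairs; h is the heuristic.

    def ida_star(start, is_goal, successors, h):
        bound = h(start)
        while True:
            next_bound = math.inf
            stack = [(start, 0, [start])]
            while stack:
                s, g, path = stack.pop()
                f = g + h(s)
                if f > bound:
                    next_bound = min(next_bound, f)   # pruned; remember its f-value
                    continue
                if is_goal(s):
                    return path, g
                for s2, cost in successors(s):
                    if s2 not in path:                # prune cycles along the current path
                        stack.append((s2, g + cost, path + [s2]))
            if next_bound == math.inf:
                return None                           # nothing was pruned: no solution
            bound = next_bound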
The number of nodes expanded by heuristic search algorithms like A* and IDA* depends on the quality of the heuristic h. A* is no better than Breadth-First Search or Dijkstra's algorithm when h(n) is uniformly 0, and IDA* is then no better than ID. Yet if the heuristic is optimal, i.e., h(n) = h*(n), where h* stands for the optimal cost from n to the goal, then both A* and IDA* will find an optimal path to the goal with no search at all, just expanding the nodes in one optimal path only (for this, though, A* must break ties in the evaluation function by favoring the nodes with the smaller heuristic, else if the problem has multiple optimal solutions, A* may keep switching from one optimal path to another). In the middle, A* and IDA* will expand fewer nodes using an admissible heuristic h1 than using an admissible heuristic h2 when h1 is higher than h2.3 A more common situation is when h1 is higher than h2 over some states, and equal to h2 over the other states. In such cases, h1 will produce no more expansions than h2 provided that ties are broken in the same way in the two cases. In the first case, the heuristic h1 is said to be more informed than h2 or to dominate h2; in the second, it is said to be at least as informed as h2 [Edelkamp and Schrödl, 2012, Pearl, 1983].
Admissible heuristics are crucial in algorithms like A* and IDA* for ensuring that the solu-
tions returned are optimal, yet they are not crucial for finding solutions fast, when there is no need
for optimality. Indeed, these algorithms will often find solutions faster by multiplying an admissible heuristic h by a constant W > 1 as in WA*. WA* can be thought of as an A* algorithm but with heuristic W·h, which is not necessarily admissible even when h is. The reason that WA* will find solutions faster than A* can be seen by considering the nodes selected for expansion given an OPEN list that contains one node n that is deep in the graph but close to the goal, e.g., g(n) = 10 and h(n) = 1, and another node n' that is shallow and far from the goal, e.g., g(n') = 2 and h(n') = 6. Among the two nodes, A* will choose n' for expansion, as f(n') = g(n') + h(n') = 2 + 6 = 8, while f(n) = g(n) + h(n) = 10 + 1 = 11. On the other hand, if W = 2, WA* chooses the node n instead, as f(n') = g(n') + W·h(n') = 2 + 2·6 = 14 and f(n) = g(n) + W·h(n) = 10 + 2·1 = 12.
The optimality of A* with admissible heuristics can be shown by contradiction. First, it is not hard to verify that until termination, the OPEN list always contains a node n on an optimal path to the goal if the problem is solvable. Then, if A* selects a goal node n' with cost f(n') that is higher than the optimal cost C*, then f(n') > C* ≥ g(n) + h(n) = f(n), as the heuristic h is admissible. Therefore, A* did not select the node with minimum f-value from OPEN, which is a contradiction.
WA* is not optimal, even when the heuristic h is admissible, yet the same argument can be used to
show that the solution returned by WA* will not exceed the optimal cost C by more than a factor of
W . is is important in practice, as for example, with W D 1:2 the algorithm may turn out to run an
order-of-magnitude faster and with much less memory than A*, yet the loss in optimality is then at
most 20%.
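The effect of the weight on the order of expansions is easy to see in a few lines of code. The sketch below only illustrates the weighted evaluation function f(n) = g(n) + W·h(n) on the two nodes of the example above; the function name and the (g, h) encoding of the nodes are assumptions made for the illustration, not part of any planner.

import heapq

def expansion_order(open_nodes, W):
    # open_nodes: dict mapping a node name to its (g, h) pair.
    # Returns the node names in the order in which WA* would pick them
    # from OPEN, i.e., by increasing f(n) = g(n) + W*h(n).
    heap = [(g + W * h, name) for name, (g, h) in open_nodes.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

nodes = {"n": (10, 1), "n_prime": (2, 6)}
print(expansion_order(nodes, W=1))   # A*  (W=1): ['n_prime', 'n']  (f = 8 vs 11)
print(expansion_order(nodes, W=2))   # WA* (W=2): ['n', 'n_prime']  (f = 12 vs 14)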
Anytime WA* [Hansen and Zhou, 2007] is an anytime optimal algorithm that basically works
exactly like WA* until a solution is found with cost C, not necessarily optimal. Rather than stopping
there, however, Anytime WA* uses the amount of time available for improving the quality of this
solution, continuing the WA* search while pruning nodes n with accumulated costs g(n) greater than
C, and updating the bound C to C' when solutions with cost C' less than C are found. Anytime WA*
can thus produce solutions more quickly and, if given enough time, produce better and better solutions
until finding an optimal solution. This, however, can only be verified when the search terminates, i.e.,
when the OPEN list becomes empty. Another interesting anytime optimal algorithm obtained as a
variation of WA* is Restarting WA* or RWA*, which performs iterated WA* searches with decreasing
weights, while keeping in memory nodes expanded in previous iterations that are re-expanded only
when a cheaper path to the node is found [Richter et al., 2010]. This is the search algorithm used
in the state-of-the-art heuristic search planner LAMA [Richter and Westphal, 2010]. Many
other heuristic search planners use Greedy Best-First Search (GBFS), which is a best-first search with evaluation
function f(n) = h(n), where the accumulated cost term g(n) has been dropped. GBFS can be seen as
a WA* algorithm with a very large constant W, and a tie-breaking rule that favors nodes n with smallest
accumulated costs g(n).
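A minimal sketch of GBFS is given below, assuming hypothetical callbacks successors(s), h(s) and is_goal(s) that describe the search problem; it is meant only to make the evaluation function and the tie-breaking on g concrete, not to reproduce any particular planner.

import heapq
import itertools

def gbfs(s0, successors, h, is_goal):
    # Best-first search with f(n) = h(n); ties broken by smaller g(n).
    # successors(s) yields triples (action, next_state, cost).
    counter = itertools.count()              # final tie-breaker: insertion order
    open_list = [(h(s0), 0, next(counter), s0, [])]
    seen = {s0}
    while open_list:
        _, g, _, s, plan = heapq.heappop(open_list)
        if is_goal(s):
            return plan
        for a, s2, c in successors(s):
            if s2 not in seen:               # simple duplicate detection
                seen.add(s2)
                heapq.heappush(open_list, (h(s2), g + c, next(counter), s2, plan + [a]))
    return None                              # no solution found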
a := argmin_{a ∈ A(s)} Q(a, s)  with  Q(a, s) = c(a, s) + h(s'),   (2.1)
where h is the heuristic function and s' is the state that is predicted to follow the action a in the state
s, i.e., s' = f(a, s). The minimization is done over the actions a applicable in s, i.e., a ∈ A(s).
This algorithm is known as the greedy algorithm or policy, and also as hill climbing search. It is
called greedy because it selects the action to be done by trusting the heuristic function h completely,
and hill climbing because, when action costs are uniform, it behaves as if Q(a, s) = h(s'), thus selecting
actions that minimize the heuristic function toward goal states that should have a zero heuristic.⁴
The main positive property of the greedy algorithm is that it is optimal if the heuristic h is
optimal, i.e., if h = h*. In addition, the algorithm uses constant memory; it doesn't keep track of a
search frontier at all, just the current state and its children. This is, however, where the good news for
the greedy algorithm ends. The algorithm in general is neither optimal nor complete; in fact, it can get
trapped in a loop, selecting actions that take it from a state s into a state s' and then back from s' to
s.
4 In the planning setting, the algorithm actually does hill descending. The name of the algorithm, however, comes from contexts
where states that maximize a given function are sought.
A common way to improve the greedy algorithm is by looking ahead from the current state s,
not just one level as done by the minimization of the Q(a, s) expression in Eq. 2.1, but several levels.
A depth-first search from s that prunes nodes deeper than a given bound H can be used to
perform this lookahead, where H is the number of levels, or planning horizon. This form of lookahead
takes time exponential in the horizon, O(b^H), where b is the branching factor of the problem. After the
lookahead, the action that is selected is the one on the path to the best leaf, with "best" defined in
terms of the evaluation function f(n) = g(n) + h(n).
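The following sketch shows one way such a depth-bounded lookahead could be organized; the callbacks actions(s), cost(a, s), succ(a, s) and h(s) are assumptions standing for the applicable actions (returned as a list), action costs, successor function and heuristic of the problem.

def lookahead_action(s, actions, cost, succ, h, H):
    # Depth-first lookahead to horizon H: return the first action on the
    # path to the best leaf, where leaves are ranked by f(n) = g(n) + h(n).
    best = {"f": float("inf"), "first": None}

    def dfs(state, g, depth, first):
        if depth == H or not actions(state):   # leaf: horizon reached or no action applies
            f = g + h(state)
            if f < best["f"]:
                best["f"], best["first"] = f, first
            return
        for a in actions(state):
            dfs(succ(a, state), g + cost(a, state), depth + 1,
                a if first is None else first)

    dfs(s, 0.0, 0, None)
    return best["first"]

Goal tests and duplicate detection are omitted so that the sketch stays close to the O(b^H) lookahead described in the text.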
This form of lookahead ensures that the action selected is the best one within the planning
horizon H for the given heuristic, yet this horizon must be kept small, else the local lookahead search
cannot be completed in real time. A useful alternative for lookahead over small time windows is to
combine a larger horizon H with a heuristic search algorithm that operates within this
horizon (deeper nodes are still pruned) as an anytime optimal algorithm. For example, the lookahead
search can be done with the A* algorithm from the current state s. Then, when time is up, whether the
search is finished or not, the action selected in s is the one leading to the best leaf, where
the leaves are both the nodes at depth H that have been generated and the nodes generated
at any other level that have not yet been expanded. Algorithms like Anytime WA* can also
be convenient for this type of anytime optimal lookahead search.
While a depth-first or best-first lookahead can improve the quality of the actions selected in the
greedy online search algorithm, neither approach guarantees completeness or optimality, except in
the trivial case where a solution and an optimal solution are within a horizon H of the seed state.
On the other hand, there is a simple fix to the greedy search that delivers both completeness and
optimality. The fix is due to Richard Korf, and the resulting algorithm is known as Learning Real Time
A* or LRTA* [Korf, 1990].
LRTA* is an extremely simple, powerful, and adaptable algorithm that, as we will see, generalizes
naturally to MDPs. LRTA* is the online greedy search algorithm with one change: once the action a
that minimizes the estimated cost-to-go term Q(a, s) from s is applied, the heuristic value h(s) is
updated to Q(a, s). The code for LRTA* is shown in Figure 2.2, where the dynamically changing
heuristic function, initially set to h(s), is denoted as V(s).

LRTA*
% h is the initial value function and V is the hash table that stores the updated
% values. When fetching a value for s in V, if V does not contain an entry for s,
% an entry is created with value h(s)
Let s := s0
while s is not a goal state do
    Evaluate each action a ∈ A(s) as: Q(a, s) := c(a, s) + V(f(a, s))
    Select the best action a, i.e., the one that minimizes Q(a, s)
    Update the value V(s) := Q(a, s)
    Set s := f(a, s)
end while
For the implementation of LRTA*, the
estimates V(s) are stored in a hash table that initially contains the heuristic value h(s0) of the initial
state s0 only. Then, when the value of a state s that is not in the table is needed, a new entry for s with
value V(s) = h(s) is allocated. These entries V(s) are updated as
V(s) := Q(a, s) = c(a, s) + V(s'),   (2.2)
where s' = f(a, s), when the action a = argmin_{a ∈ A(s)} Q(a, s) is applied in the state s. This simple
greedy algorithm combined with these updates delivers the two key properties, provided that the heuristic
h(s) is admissible and that there are no dead-ends (states from which the goal cannot be reached). First,
LRTA* will not be trapped in a loop and will eventually reach the goal. Second, if upon reaching the
goal the search is restarted from the same initial state while keeping the current heuristic function V,
and this process is repeated iteratively, LRTA* eventually converges to an optimal path to the goal. This
convergence is achieved in a finite number of iterations, namely when
the updates V(s) := min_{a ∈ A(s)} Q(a, s) do not change the value V(s) of any of the states encountered
on the way to the goal, which are then optimal.
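A direct transcription of Figure 2.2 into executable form is sketched below; the callbacks actions(s), cost(a, s), succ(a, s), h(s) and is_goal(s), as well as the step bound, are assumptions used only to make the sketch self-contained.

def lrta_star(s0, actions, cost, succ, h, is_goal, max_steps=100000):
    # One online run of LRTA*. V is the hash table of updated values,
    # lazily initialized from the heuristic h as described in the text.
    V = {}
    value = lambda s: V.get(s, h(s))
    s, trajectory = s0, []
    for _ in range(max_steps):
        if is_goal(s):
            return trajectory, V
        # Q(a, s) = c(a, s) + V(f(a, s)) for each applicable action a.
        best_a, best_q = None, float("inf")
        for a in actions(s):
            q = cost(a, s) + value(succ(a, s))
            if q < best_q:
                best_a, best_q = a, q
        V[s] = best_q              # the update of Eq. 2.2
        trajectory.append(best_a)
        s = succ(best_a, s)
    return trajectory, V           # step bound reached before the goal

Running the function repeatedly from s0 while reusing the returned table V mimics the repeated trials that lead to convergence.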
These are two remarkable properties that follow from a simple change in the greedy algorithm:
adjusting the value of the heuristic according to Eq. 2.2 over the states that are visited in the search.
Of course, LRTA*, unlike the greedy algorithm, does not run in constant space, as the updates to
the heuristic function take space in the hash table, which in the worst case can become as large as the
number of states in the problem. The value of the initial heuristic is critical for the performance of
LRTA*, both in terms of time and space, as better heuristic values mean a more focused search, and
a more focused search means more updates on the states that matter. When LRTA* is to be run once
and not until convergence, a lookahead can improve the quality of the actions selected and boost the
heuristic values of the visited states (which remain admissible if they are initially admissible). The latter
can be achieved if the states that are expanded in the lookahead search are also updated using Eq. 2.2,
and the new values are propagated up to their parents. In this way, a move from s will leave a heuristic
Figure 2.3: The sliding 15-puzzle, where the goal is to get to a configuration where the tiles are ordered by
number with the empty square last. The actions allowed are those that slide a tile into the empty square. While
the problem is not simple, the heuristic that sums the horizontal and vertical distances of each tile to its target
position is simple to compute and provides informative estimates to the goal. In planning, heuristic functions are
obtained automatically from the problem representation.
value V(s) for s that would be more informed than the value of s computed from its children using
Eq. 2.2. LSS-LRTA* is a version of LRTA* with a lookahead of this type [Koenig and Sun, 2009].
2.5 WHERE DO HEURISTICS COME FROM?
Heuristic search algorithms express a form of goal-directed search where heuristic functions are used
to guide the search toward the goal. A key question is how such heuristics can be obtained for a
given problem. A useful heuristic is one that provides good estimates of the cost to the goal and can
be computed reasonably fast. Heuristics have been traditionally devised according to the problem at
hand [Edelkamp and Schrödl, 2012, Pearl, 1983]: the Euclidean distance is a good heuristic for route
finding, the sum of the Manhattan distances of each tile to its destination is a good heuristic for
the sliding puzzles, the assignment problem heuristic is good for Sokoban [Junghanns and Schaeffer,
2001], both the assignment problem and spanning tree heuristics have been used for the Travelling
Salesman Problem [Lawler et al., 1985], and so on. The general idea that emerges from the various
problems is that heuristics h(s) can be seen as encoding the cost of reaching the goal from the state s in
a problem that is simpler than the original one [Minsky, 1961, Pearl, 1983, Simon, 1955]. For example,
the sum-of-Manhattan distances in the sliding puzzles (Figure 2.3) corresponds to the optimal cost of
a simplification of the puzzle where tiles can be moved to adjacent positions, whether these positions
are empty or not. Similarly, the Euclidean heuristic for route finding is the cost of a simplification of
the problem where straight routes are added between any pair of cities in the map. The simplified
problems are normally referred to as relaxations of the original problem. If P is the original problem,
P' is its relaxation, and P(s) and P'(s) refer to the problem and relaxation when the initial state is
set to the state s, the general idea is to set the heuristic value hP(s) associated with the problem P(s)
to the optimal cost hP'(s) of the relaxed problem P'(s). It is easy to show that if the solutions to the
original problem P(s) are also solutions of the relaxed problem P'(s), something which is natural for
most relaxations, then the heuristic hP(s) that results from the relaxation is actually admissible. This
is because any optimal solution for P(s) must also be a solution to the relaxation P'(s), whose optimal
value then cannot exceed the optimal value of P(s). On the other hand, if the heuristic hP(s) for P is
obtained from a solution to the relaxation P'(s) that is not necessarily optimal, the resulting heuristic
hP(s) is not necessarily admissible.
A key development in modern planning research was the realization that useful heuristics could
be derived automatically from the representation of the problem in a domain-independent planning
language [Bonet et al., 1997, McDermott, 1996]. It does not matter what the problem P(s) is about:
an automated relaxation P'(s) yielding informative heuristics can be obtained directly and effectively
from the representation of P(s). The result is a domain-general heuristic h(s), i.e., a heuristic that
makes the search goal-driven, no matter what the problem is about, as long as it is a problem where
deterministic actions expressed in compact form have to be used to drive the system from a known
initial state into a goal state.
for a = do(ti). The problem of doing tasks t1 and t2 starting at location l3 can then be modeled by
the tuple P = ⟨F, I, O, G⟩ where
A solution to P is an applicable action sequence that maps the state s0 = I into a state where the goals
in G are all true. In this case one such plan is the action sequence
The number of states in the problem is 2^6 = 64, as there are six boolean variables. Still, it can be shown that
many of these states are not reachable from the initial state. Indeed, the atoms at(li) for i = 1, 2, 3
are mutually exclusive and exhaustive, meaning that every state reachable from s0 by applying the
available actions makes one and only one of these atoms true. These boolean variables indeed encode
the possible values of the multivalued variable that represents the agent's location.
Planning languages featuring non-boolean variables and richer syntactic constructs than
STRIPS are also common in planning [Bäckström and Nebel, 1995, Gerevini et al., 2009]. In particular,
if X is a multivalued variable with domain DX, then the initial situation can be characterized by
a set of literals of the form X = x for each variable X, where x ∈ DX; the actions can be described
in terms of pre and postconditions expressed through these literals, and the same goes for goals. In
principle, a planning problem expressed through multivalued variables can be compiled automatically into a
problem with boolean variables only, by simple transformations such as replacing each literal X = x
by the proposition p(X = x) throughout, and by including the propositions p(X = x') for x' ≠ x in the delete
list of every action that adds p(X = x). Planning problems expressed over boolean variables, such as
STRIPS problems, can be similarly expressed in multivalued form by mapping atoms p into literals Xp = true
throughout, except in delete lists, where they are mapped into postconditions Xp = false. Often it is
possible to derive a more compact multivalued encoding of a planning problem, e.g., when a set
of atoms at(l1), ..., at(ln) is used to represent the possible locations of an object. Such a location can
be encoded through a multivalued variable L with domain DL = {l1, ..., ln}. Programs that
automatically infer invariants from a STRIPS encoding, such as sets of exhaustive and mutually exclusive
atoms, are used to transform one representation into another automatically [Helmert, 2009]. While
some of the representation languages are more natural for users, they are not necessarily more efficient
for planning, as the extra features can often be compiled away at no cost. For example, STRIPS does
not accommodate negation or negated atoms, yet when it is convenient to introduce a negated atom ¬p
in the initial situation, preconditions, or goals of a problem, it is possible to introduce a new atom p̄
for capturing ¬p. The atom p̄ has to be part of I when p ∉ I, has to be included in the Add list of
an action when p is included in the Delete list, and, vice versa, has to be included in the Delete list
when p is in the Add list. Then the atom p̄ can be used in preconditions and goals, as indeed p̄ represents
¬p over all the states that are reachable from I using the available actions, where the logical formula
p̄ ≡ ¬p can be shown to hold. In other words, the formula p̄ ≡ ¬p is an invariant in the problem.
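The compilation of a negated atom into a fresh atom can be written down in a few lines. The sketch below assumes a hypothetical STRIPS encoding as Python dictionaries and sets; the field names are illustrative only.

def compile_negation(problem, p):
    # Introduce a fresh atom not_p that mirrors the negation of p,
    # following the transformation described in the text.
    # `problem` is a hypothetical encoding: a dict with the set 'init'
    # and a dict 'actions' mapping names to dicts with 'pre', 'add', 'del' sets.
    not_p = "not_" + p
    if p not in problem["init"]:
        problem["init"].add(not_p)      # not_p holds initially iff p does not
    for act in problem["actions"].values():
        if p in act["del"]:
            act["add"].add(not_p)       # deleting p makes not_p true
        if p in act["add"]:
            act["del"].add(not_p)       # adding p makes not_p false
    return not_p                        # not_p can now appear in preconditions and goals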
One important syntactic construct that extends STRIPS and is not convenient to compile away
in general, because of a potential exponential blow-up in the size of the problem, is conditional effects
[Gazen and Knoblock, 1997, Nebel, 2000]. While Add and Delete lists represent sets of atoms that
become true and false unconditionally after an action is done, a conditional effect C → C' associated
with an action, where C and C' are sets of literals (atoms or negated atoms), says that C' will be
true right after the action if C was true right before the action. In other words, unlike an action
precondition, C does not have to be true for the action to be applicable, yet if it is true, then C' will
become true as a result of the action.
Figure 2.4 shows a description of the Blocks World domain in PDDL. PDDL is the Planning
Domain Definition Language, a language and syntax that has been used in the planning competitions
[McDermott et al., 1998]. PDDL accommodates the STRIPS language along with a number of additional
syntactic constructs in a notation that originates in the Lisp programming language. Problems in
PDDL are expressed in two parts: one about the general domain, the other about a particular domain
instance. In the domain part, the actions are described by means of schemas over generic atoms
defined using predicate names like clear, variables like ?x, and possibly constants. In the instance part,
the object names that will replace the variables are declared, along with the atoms describing the initial
state and the formula describing the goal states. The "requirements" flag in the domain definition
describes the PDDL fragment used by the encoding, which can include STRIPS, ADL extensions
featuring negation, conditional effects, function symbols, and various forms of quantification [Pednault,
1989], a hierarchy of types for controlling how variables can be substituted by object names, the
equality predicate, and so on. There are currently tens of classical planners that can be downloaded freely
from the Internet and hundreds of planning problems expressed in PDDL for use with such planners.
5 This is not true, however, in planning languages that extend STRIPS with negation and conditional effects, where the same
action may have to be applied multiple times for solving the relaxation.
2.7 DOMAIN-INDEPENDENT HEURISTICS AND RELAXATIONS
Figure 2.5: A fragment of the graph corresponding to a blocks world planning problem, with the automatically
derived heuristic values shown next to some of the nodes. The heuristic values are computed in low polynomial
time and provide the search with a sense of direction. The instance can actually be solved without any search by
just performing in each state the action that leads to the node with a lower heuristic value (closer to the goal).
The resulting plan is shown in red; helpful actions are shown in blue.
In order to get a more vivid idea of where the heuristic values shown in the figure come from,
consider the heuristic h(s) for the initial state s where block A is on B, and both B and C are on the
table. In order to get the goal "B on C" in the relaxation from the state s, two actions are needed:
one to get A out of the way to achieve the preconditions for moving B, the second to move B on
top of C. On the other hand, in order to achieve the second goal "C on A" in the relaxation from s,
just the action of moving C to A is needed. The result is a heuristic value h(s) = 3 as shown, which
actually coincides in this case with the cost of the best plan to achieve the joint goal from s in the
non-relaxed problem P(s). Nonetheless, this is just a coincidence, and indeed the best plans in the
relaxation P+(s) can be quite different from the best plans in the original problem P(s). The best plan
for P(s) is indeed unique: moving A to the table, then C on A, and finally B on C. On the other
hand, a possible optimal plan in the relaxation P+(s) is to move first C on A, then A on the table,
and finally B on C. Of course, this plan does not make any sense in the real problem, where A can't
be moved when covered by C, yet the relaxation is not aimed at capturing the real problem or the real
physics; it is aimed at producing informative but quick estimates of the cost to the goal. The reader
can verify that for the leftmost child s' of the initial state s, the costs of the problem P(s') and the
relaxation P+(s') no longer coincide. The former is 4, while the latter is 3, the difference arising from
the goal "C on A", which in the original problem must be undone and then redone. In the relaxation this
is never needed, as no atom is ever deleted.
The heuristics for classical planning that have been developed so far all assume that action
costs c(a, s) are 1 or depend at most on the action a but not on the state s, i.e., c(a, s) = c(a). In
principle, there is no problem in expressing arbitrary action costs c(a, s) in compact form, for
instance by saying that c(a, s) is 1 except when s makes both p and q true, in which case it is 100. Yet, while
such cost structures are important and are often needed, they have not been addressed systematically
in the literature so far, and thus there are not yet good heuristics for handling them in planning.
that can be computed quite efficiently in every state s visited in the search, where h_add(Pre(a); s) is
an estimate of the cost of achieving the preconditions of action a from s, defined from the expressions:

h_add(p; s) = 0 if p ∈ s, and
h_add(p; s) = min_{a ∈ O(p)} [cost(a) + h_add(Pre(a); s)] otherwise,   (2.4)

and

h_add(Pre(a); s) = Σ_{q ∈ Pre(a)} h_add(q; s).   (2.5)
In these expressions, h_add(p; s) stands for the estimated cost of achieving the atom p from s,
O(p) stands for the actions in the problem that add p, and h_add(Pre(a); s) stands for the estimated
cost of achieving the preconditions of the action a from s. Versions of the additive heuristic appear
in several planners [Bonet and Geffner, 2001, Do and Kambhampati, 2001, Smith, 2004], where the
cost of the joint condition in action preconditions (and goals) is set to the sum of the costs of each
condition in isolation. The additive heuristic h_add is neither a lower bound nor an upper bound on the
optimal cost function h* of the original problem. The reason is that the cost of achieving two atoms
jointly from a state s can be lower or higher than the sum of the costs of achieving each one of them
individually. In particular, if a is a unit-cost action with preconditions that are true in s and atoms p
and q in the Add list that are not true in s, then the heuristic h_add(s) for the goal G = {p, q} will be
2, while clearly the optimal cost h*(s) of achieving G in the problem from s is 1. Likewise, if there is
instead just one action that adds p and deletes q, and one action that adds q and deletes p, then the
heuristic h_add(s) would still be 2, while the optimal cost h*(s) of achieving G in the problem would
be infinite. G is indeed unachievable in such a problem, where the formula ¬(p ∧ q) is an invariant.
If the estimated cost of the joint condition in Eq. 2.5 is changed from the sum to the maximum,

h_max(Pre(a); s) = max_{q ∈ Pre(a)} h_max(q; s),   (2.6)

a different heuristic is obtained by setting h_max(s) to h_max(Pre(End); s) and replacing h_add by
h_max in Eq. 2.4. Since the cost of achieving several atoms from a state s can never be lower than the
cost of achieving one of them, the max-heuristic h_max, unlike the additive heuristic h_add, is admissible,
and hence potentially useful for computing optimal plans in combination with algorithms such as A*
or IDA*. Still, the max-heuristic is not informative enough in general, as it ignores all but one of the
atoms of each action precondition. In this sense, the heuristic h_add is not admissible but is better at
discriminating good from bad actions, as no precondition is left out of the computation.
The equations for h_add and h_max basically define a path-finding problem over atom space, as
opposed to the planning problem, which is a path-finding problem over the exponentially larger state
space. Indeed, any shortest-path algorithm can be used for computing these heuristics, including
Dijkstra's algorithm, Bellman-Ford, or Value Iteration [Bertsekas, 1995, Cormen et al., 2009].
A single change is needed though: while the nodes of the graph are the problem atoms, the edges,
which correspond to problem actions, are actually hyperedges rather than normal edges, as they link the
set of atoms appearing in an action precondition with each of the atoms appearing in the Add list.
So, while an edge (n, n') in a normal directed graph induces a cost c(n') ≤ c(n) + w(n, n') on the
target node n', a (directed) hyperedge ({n1, ..., nk}, n') associated with an action a induces a cost
c(n') ≤ Σ_{i=1..k} c(ni) + c(a) instead, in the additive heuristic. For any of the algorithms, the costs
c(n) can be initialized to 0 for the nodes n corresponding to atoms p that are true in s, and to ∞
for all other atoms. All the algorithms use the inequalities c(n') ≤ Σ_{i=1..k} c(ni) + c(a) as updates
of the form c(n') := min(c(n'), Σ_{i=1..k} c(ni) + c(a)), and differ in the order in which these updates
are performed and the conditions under which they are terminated. Dijkstra's algorithm, for example,
updates each node n' once, in order, according to the value of the right-hand side expression, lowest first.
The Bellman-Ford algorithm and Value Iteration do not pay the overhead required for following
this order, but end up updating nodes many times. In all cases, the computation is polynomial; it
finishes in Dijkstra's algorithm when there are no more nodes to update, and in Bellman-Ford
and Value Iteration when the updates produce no changes. For the max heuristic, the
update expression needs to be changed to c(n') := min(c(n'), max_{i=1..k} c(ni) + c(a)). An analysis of
these various methods for computing the additive heuristic is given by Liu et al. [2002].
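A minimal sketch of this computation, in the Bellman-Ford style of applying the updates until a fixed point, could look as follows; the (cost, pre, add) action encoding and the use_max switch for h_max are assumptions of the sketch.

def h_additive(state, goal, actions, use_max=False):
    # Compute h_add(s) (or h_max(s) with use_max=True) over atom space by
    # repeatedly applying the updates of Eqs. 2.4-2.6 until a fixed point.
    # `actions` is a list of (cost, pre, add) tuples; state, goal, pre, add
    # are sets of atoms.
    INF = float("inf")
    combine = max if use_max else sum
    c = {p: 0.0 for p in state}                      # atoms true in s cost 0
    changed = True
    while changed:
        changed = False
        for cost, pre, add in actions:
            pre_cost = combine(c.get(q, INF) for q in pre) if pre else 0.0
            if pre_cost == INF:
                continue                             # some precondition still unreachable
            for p in add:
                if cost + pre_cost < c.get(p, INF):
                    c[p] = cost + pre_cost
                    changed = True
    return combine(c.get(p, INF) for p in goal) if goal else 0.0

The goal value is combined with the same operator used for preconditions, i.e., summed for h_add and maximized for h_max.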
The layered graph underlying FF's heuristic is built forward from the state s through the layers

P0 = {p | p ∈ s},
Ai = {a ∈ O | Pre(a) ⊆ Pi},
P_{i+1} = Pi ∪ {p ∈ Add(a) | a ∈ Ai},

until a fixed point is reached, i.e., a layer Pn for which P_{n+1} = Pn. Here P0 contains all the atoms
in s, Ai contains all the actions whose preconditions are true in Pi, and P_{i+1} contains the positive
effects of these actions along with the atoms appearing in the previous layers Pk, k ≤ i. The resulting
layered graph cannot contain more than |F| layers, which happens only when P0 is empty and P_{i+1}
contains just one more atom than Pi. Moreover, the construction can be stopped when the goal G
first appears in a layer P_{m+1}, i.e., G ⊆ P_{m+1}, as the plan FF(s) for the relaxation P+(s) is extracted
then backward from that layer. For this, it is convenient to conceive FF(s) as a "parallel" plan made up
of actions B0 done in "parallel" at time 0, actions B1 done in "parallel" at time 1, and so on until Bm.
By actions done in parallel, we mean that the actions can be done in any order, as they will necessarily
have the same effect in the relaxation. In addition, during the construction of the graph, each atom
p that makes it for the first time into a layer P_{i+1} is tagged with one of the actions in Ai that adds p,
which is called the best supporter for p and is denoted by a_p. Clearly, there must be one such action,
else the atom p would not make it into the layer P_{i+1}. The sets of actions Bi in the relaxed plan
FF(s) are then obtained recursively backward from a set G_{i+1} of atoms, initially set to G for i = m.
Then, for i = m, m−1, ..., 0, and starting with Bi = ∅, we add to Bi the best supporter a_p of each
of the atoms p in G_{i+1} that made it into the layer P_{i+1} for the first time, and recursively set Gi to
(G_{i+1} \ Add(Bi)) ∪ Pre(Bi), where Add(Bi) and Pre(Bi) are respectively the union of the Add
and Precondition lists of the actions in Bi.
It is easy to see that the resulting plan FF(s), made up of this sequence of action sets Bi where
actions in a set can be done in any order, is a plan for the relaxation P+(s). This is because FF(s)
contains actions that add each of the goals in G, and in addition, every action in FF(s) has preconditions
that are true in the state s or are added by previous actions in the sequence. The heuristic hFF(s) is
defined as the size |FF(s)| of the plan, namely the number of actions that it contains, thus assuming
implicitly that action costs are all 1. Of course, hFF(s) could be defined instead as the sum of the
action costs c(a) for a in FF(s), yet this would not address the fact that the relaxed plan was constructed
assuming that costs were uniform.
An advantage of FF's heuristic over the additive heuristic is that it is less likely to overcount
actions. For example, if there is an action a whose preconditions hold in s with effects p and q that
do not, then the additive heuristic for the goal G = {p, q} is 2, while FF's heuristic will be 1. Still,
FF's heuristic can also overcount if there is a second action that adds q and whose preconditions hold
in s. In such a case, FF can produce a relaxed plan where the atom p is supported by the first action,
and the atom q is supported by the second action. An additional limitation of the FF heuristic is that
it assumes that all action costs are uniform. There is however a simple way to combine the benefits of
the additive and FF heuristics [Keyder and Geffner, 2008a]. For this, all that is required is to change
the definition of the best supporter a_p for each atom p in the computation of the layered graph to

a_p = argmin_{a ∈ O(p)} [c(a) + h_add(Pre(a); s)]   (2.7)
when p is not true in the state s. These best supporters obtained from the additive heuristic can then be
used to build a relaxed plan that no longer ignores action costs. The backward procedure for extracting
a plan from these best supporters proceeds as above: we collect in the relaxed plan the best supporters
for the atoms in the goal, and recursively, the best supporters of their preconditions, in all cases skipping
the atoms that are true in the state s. Actually, the same construction can also be used with the max
heuristic. Interestingly, the relaxed plan that would be obtained in this way from the max heuristic is
equivalent to the one obtained from FF's procedure, provided that ties in the selection of best supporters
are broken in the same way. This is because there is a tight correspondence between the max heuristic
and the layered graph constructed by FF. Indeed, it can easily be shown that the heuristic h_max(s) is
equal to the index i of the first propositional layer Pi that contains the goal G of the problem [Haslum
and Geffner, 2000]. This also implies that the construction of the layered graph provides an alternative
way of computing the h_max heuristic that is still polynomial, although it takes more space than the
shortest-path formulation over atom space described above.
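A sketch of this cost-sensitive relaxed-plan extraction is given below; it chooses best supporters with the additive values as in Eq. 2.7 and collects them backward from the goal. The (name, cost, pre, add) action encoding is the same assumption as in the previous sketch.

def relaxed_plan(state, goal, actions):
    # Choose best supporters with the additive heuristic (Eq. 2.7) and then
    # collect them backward from the goal, skipping atoms true in the state.
    # `actions` is a list of (name, cost, pre, add) tuples over sets of atoms.
    INF = float("inf")
    c = {p: 0.0 for p in state}
    best = {}                                      # best supporter a_p for each atom p
    changed = True
    while changed:                                 # additive values, as in the previous sketch
        changed = False
        for name, cost, pre, add in actions:
            pre_cost = sum(c.get(q, INF) for q in pre)
            if pre_cost == INF:
                continue
            for p in add:
                if cost + pre_cost < c.get(p, INF):
                    c[p] = cost + pre_cost
                    best[p] = (name, pre)
                    changed = True
    plan, pending = set(), [p for p in goal if p not in state]
    while pending:
        p = pending.pop()
        if p not in best:
            return None                            # p unreachable even in the relaxation
        name, pre = best[p]
        if name not in plan:
            plan.add(name)
            pending.extend(q for q in pre if q not in state)
    return plan                                    # set of actions of the relaxed plan

A cost-sensitive hFF value is then simply the sum of the costs of the actions in the returned set.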
Classical Planning:
Variations and Extensions
Most of the current state-of-the-art classical planners are based on the heuristic search formulation,
where plans are searched forward from the initial state using heuristics derived from the problem. This
basic idea, however, has been extended in a number of ways, such as the use of structural information,
also obtained automatically from the problem, in the form of "helpful actions" and "landmarks." In
this chapter, we look at extensions and variations of the basic framework, at the heuristics developed
for optimal planning, and at other computational approaches to classical and related forms of planning,
such as temporal and hierarchical task network planning.
Table 3.1: Some recent classical planners and their performance over competition benchmarks. Planners are FF,
Fast Downward, Probe, LAMA, BFS(f ). I is number of instances per domain, S is number of solved instances,
Q and T are the average plan lengths and times in seconds computed over problems solved by all planners.
FF FD PROBE LAMA’11 BFS(f )
Domain I S Q T S Q T S Q T S Q T S Q T
8puzzle 50 49 52.61 0.03 50 52.30 0.18 50 60.94 0.09 49 92.54 0.18 50 45.30 0.20
Barman 20 0 – – 20 197.90 84.00 20 169.30 12.93 20 192.15 8.39 20 174.45 281.28
BlocksW 50 44 39.36 66.67 50 104.24 0.46 50 43.88 0.25 50 89.96 0.41 50 54.24 2.25
Cybersec 30 4 29.50 0.73 28 36.58 859.24 24 50.73 48.29 30 35.27 880.06 28 36.92 63.79
Depots 22 22 51.82 32.72 17 110.25 91.86 22 88.88 1.45 21 43.56 3.58 22 39.56 69.11
Driver 20 16 25.00 14.52 20 50.67 1.26 20 60.17 1.49 20 46.22 1.51 18 48.06 140.93
Elevators 30 30 85.73 1.00 30 92.57 3.20 30 107.97 26.66 30 97.07 4.69 30 129.13 93.88
Ferry 50 50 27.68 0.02 50 30.08 0.09 50 44.80 0.02 50 26.86 0.08 50 31.28 0.03
Floortile 20 5 44.20 134.29 3 39.00 6.91 5 40.50 106.97 5 40.00 8.94 7 36.50 4.15
Freecell 20 20 64.00 22.95 20 61.06 26.55 20 62.44 41.26 19 67.78 27.35 20 64.39 13.00
Grid 5 5 61.00 0.27 5 61.60 4.95 5 58.00 9.64 5 70.60 4.84 5 70.60 7.70
Gripper 50 50 76.00 0.03 50 152.62 0.17 50 152.66 0.06 50 92.76 0.15 50 152.66 0.38
Logistics 28 28 41.43 0.03 28 77.11 0.18 28 55.36 0.09 28 73.64 0.17 28 87.04 0.12
Miconic 50 50 30.38 0.03 50 39.80 0.07 50 44.80 0.01 50 31.02 0.06 50 34.46 0.01
Mprime 35 34 9.53 14.82 35 8.37 9.50 35 12.97 26.67 35 8.60 10.30 35 10.17 19.30
Mystery 30 18 6.61 0.24 19 6.86 1.87 25 7.71 1.08 22 7.29 1.70 27 7.07 0.93
NoMyst 20 4 19.75 0.23 6 22.40 1.96 5 23.20 2.73 11 23.00 1.77 19 22.60 0.78
OpenSt 30 30 155.67 6.86 30 130.11 5.97 30 134.14 64.55 30 130.18 3.49 29 125.89 129.06
OpenSt6 30 30 136.17 0.38 30 222.67 5.39 30 224.00 48.89 30 140.60 4.89 30 139.13 40.19
ParcPr 30 30 42.73 0.06 27 35.79 1.97 28 70.92 0.26 30 70.54 0.28 27 70.42 6.72
Parking 20 3 88.33 945.86 20 74.86 330.76 17 143.36 685.47 19 129.57 361.19 17 83.43 562.39
Pegsol 30 30 25.50 7.61 30 25.97 0.80 30 25.17 8.60 30 26.07 2.76 30 24.20 1.17
Pipes-N 50 35 34.34 12.77 44 75.50 7.94 45 46.73 3.18 44 54.41 11.11 47 58.39 35.97
Pipes-T 50 20 31.45 87.96 40 73.33 99.06 43 54.19 88.47 41 69.83 35.28 40 39.14 216.25
PSR-s 50 42 16.92 63.05 50 14.61 0.27 50 17.20 0.07 50 14.65 0.31 48 18.14 2.57
Rovers 40 40 100.47 31.78 40 153.18 13.69 40 131.20 24.19 40 108.53 17.90 40 126.30 44.20
Satellite 20 20 37.75 0.10 20 40.90 0.78 20 37.05 0.84 20 42.05 0.78 20 36.05 1.26
Scan 30 30 31.87 70.74 28 30.04 7.30 28 25.15 5.59 28 28.04 8.14 27 29.37 7.40
Sokoban 30 26 213.38 26.61 28 204.14 12.44 25 231.52 39.63 28 231.81 184.38 23 218.52 125.12
Storage 30 18 16.28 39.17 20 17.72 3.20 21 14.56 0.07 18 24.56 8.15 20 20.94 4.34
Tidybot 20 15 63.20 9.78 15 66.00 338.14 19 52.67 33.50 16 62.60 102.52 18 63.27 207.85
Tpp 30 28 122.29 53.23 30 127.93 16.95 30 152.53 60.95 30 205.37 18.72 30 110.13 126.03
Transport 30 29 117.41 167.10 30 97.57 12.75 30 125.63 38.87 30 215.90 76.18 30 97.57 46.64
Trucks 30 11 27.09 3.84 17 26.00 0.65 8 26.75 113.54 16 24.75 0.53 15 26.50 8.59
Visitall 20 6 450.67 38.22 7 3583.86 166.35 19 411.71 9.02 20 468.00 4.68 20 339.00 4.58
WoodW 30 17 32.35 0.22 30 57.13 18.40 30 41.13 15.93 30 79.20 12.45 30 41.13 19.12
Zeno 20 20 30.60 0.17 20 37.45 2.68 20 44.90 6.18 20 35.80 4.28 20 37.70 77.56
Summary 1150 909 67.75 51.50 1037 168.60 57.78 1052 83.64 41.28 1065 86.51 48.98 1070 74.32 63.91
an evaluation function based on width considerations (Section 2.10), along with tie breakers based on
the additive and number-of-unachieved-landmark heuristics. The scalability of planners has improved
considerably over the last 15 years, with the best planners using and extending the ideas of previous
planners, such as heuristic functions, helpful actions, and landmarks. As a reference, a baseline planner
such as HSP, based solely on a Greedy Best-First Search guided by the additive heuristic, solves 789
of the problems, while LAMA, the winner of the last two competitions, solves 1,065 problems out
of a total of 1,150. In the table, I stands for the number of instances per domain, while S, Q, and
T stand for the number of instances solved, and the average plan lengths and times in seconds. The
experiments were conducted on a dual-core CPU running at 2.33 GHz with 2 GB of RAM,
with processes timing out after two hours. All of these domains and planners, including their sources,
are available on the Internet.
h^m(C, s) =
  0                                               if C ⊆ s,
  min_{a ∈ R(C)} [c(a) + h^m(Reg(a, C), s)]       if C ⊄ s and |C| ≤ m,   (3.1)
  max {h^m(C', s) : C' ⊆ C, |C'| ≤ m}             otherwise,
where Reg(a, C) stands for the regression of the set of atoms C through the action a, i.e., Reg(a, C) =
(C \ Add(a)) ∪ Prec(a), and R(C) stands for the set of actions a in the problem that add some atom in
C and delete none. The approximation captured by this definition follows from setting the estimated
cost h^m(C, s) of achieving sets C of more than m atoms to the cost of achieving the most costly subset
C' of C of at most m atoms. The h^m heuristic for a state s is h^m(s) = h^m(G, s), where G is the
goal of the problem. For m = 1, it is easy to show that h^m is equal to h_max, while for a sufficiently
large value of m that is less than or equal to the total number of variables in the problem, h^m is equal
to the optimal heuristic h*.
Pattern database (PDB) heuristics are a generalization of the PDB heuristics developed for
domain-specific heuristic search [Culberson and Schaeffer, 1998]. A PDB heuristic is a lookup table
that stores exact optimal distances for an abstraction of the problem, computed by a regression search
from the goal. The abstraction is obtained by dropping a sufficient number of atoms from the problem
so that the number of reachable states in the reduced problem fits in memory. An atom is removed
from a problem P = ⟨F, I, O, G⟩ by removing it from F, I, O, and G; i.e., the atom is removed from
precondition, delete, and add lists, from the goal and the initial situation, and from the set of problem
atoms. If s is a state over the original problem and A is the set of atoms retained in the reduced
problem, the heuristic h(s) is set to the optimal heuristic h*(s') over the reduced problem, where s'
is the state s' = s ∩ A. A key challenge in the design of PDBs is deciding which atoms to abstract
away from the problem [Haslum et al., 2007]. Two recent variations on the PDB idea for planning
are the merge-and-shrink heuristics [Helmert et al., 2007] and the structural pattern heuristics [Katz and
Domshlak, 2008b].
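The construction of such a lookup table can be sketched in a few lines, assuming the abstract (projected) state space is given explicitly through hypothetical inputs: the collection of abstract states, a successor function over them, and a goal test; unit action costs are also assumed.

from collections import deque

def build_pdb(abstract_states, succ, is_goal):
    # Exact goal distances in the abstract problem, obtained by a backward
    # breadth-first sweep from the abstract goal states (unit costs).
    preds = {}
    for s in abstract_states:                    # invert the abstract transition relation
        for s2 in succ(s):
            preds.setdefault(s2, []).append(s)
    dist = {s: 0 for s in abstract_states if is_goal(s)}
    frontier = deque(dist)
    while frontier:
        s = frontier.popleft()
        for s2 in preds.get(s, []):
            if s2 not in dist:
                dist[s2] = dist[s] + 1
                frontier.append(s2)
    return dist    # lookup table: h(s) is the entry for the projection of s onto the pattern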
Multiple admissible heuristics h1, ..., hn can be combined into a potentially more informed
admissible heuristic by taking their pointwise maximum, h(s) = max{h1(s), ..., hn(s)}, or by
partitioning the action costs [Haslum et al., 2007, Katz and Domshlak, 2008a]. A cost partitioning of a
problem P with cost function c(·) is a collection P1, ..., Pn of problems identical to P except for their
cost functions c1, ..., cn, which must be non-negative and satisfy Σ_{i=1..n} ci(a) ≤ c(a) for every action a
in P. Then, if h1, ..., hn are (arbitrary) admissible heuristics for the problems P1, ..., Pn respectively,
the additive heuristic h = h1 + ... + hn is an admissible heuristic for P. One can indeed improve
a base heuristic by doing a cost partitioning that applies the same base heuristic to every problem in
the partition; the difficulty is in the choice of the cost functions ci, 1 ≤ i ≤ n, and the number of
partitions.
The LM-Cut heuristic [Helmert and Domshlak, 2009] is a powerful admissible heuristic that
can be thought of as either a landmark heuristic or a cost partitioning heuristic based on h_max. For
determining the LM-Cut value of a state, a sequence L1, ..., Lm of (disjunctive but not necessarily
disjoint) action landmarks is computed together with cost functions c1, ..., cm providing a cost
partitioning. Like (atomic) landmarks, a disjunctive action landmark L for state s is a set of actions such
that every plan from the state s must contain one of the actions in the set. In cases where the landmarks
computed by LM-Cut are pairwise disjoint, the cost function c1 in the partition defined by LM-Cut
assigns costs c1(a) = c(a) to every action a ∈ L1 and c1(a) = 0 to a ∉ L1, costs c2(a) = c(a) to every
action a ∈ L2 \ L1 and c2(a) = 0 to a ∉ L2, and so on, while the heuristic values in such cases
become h1 = min_{a ∈ L1} c1(a), h2 = min_{a ∈ L2} c2(a), and so on. The value of the LM-Cut heuristic at
state s is the sum h1 + h2 + ... + hm which, as in any cost partitioning scheme, is guaranteed to be
admissible. The LM-Cut heuristic can be improved by computing and considering more than one
landmark at a time, exploiting a connection between action landmarks and hitting sets [Bonet and
Helmert, 2010].
• the set of actions A(s) applicable in s are the actions a ∈ O that are relevant and consistent,
namely, for which Add(a) ∩ s ≠ ∅ and Del(a) ∩ s = ∅,
• the state s' = f(a, s) that follows the application of action a ∈ A(s) in s is s' = (s \ Add(a)) ∪
Prec(a), and
The solution of this state space is, like the solution of any state model, an applicable action
sequence mapping the initial state into a goal state. Yet notice that in the regression space R(P), the
initial state is the goal of P, and the goal states are the states that must be true in the initial situation
of P. Moreover, while the states s in the regression space R(P) are defined syntactically, as in the
progression space S(P), by sets of atoms, the meaning of these sets is very different in the two cases:
states represent complete truth assignments in the progression space but partial truth assignments in the
regression space. In particular, in the initial state s0 = I of the progression space S(P), every atom that
is not in I is false, while in the initial state s0 = G of the regression space this is not so; indeed,
the goal G can be the single atom on(A, B) in a blocks world problem with three blocks A, B, and
C, and there is no reachable state of the problem where on(A, B) is true and all other atoms are false
(blocks B and C must be somewhere!). It can be shown that every state s in the regression space R(P)
stands indeed for a collection of states s' in the progression space S(P), namely all the states s' in the
progression space that include s. Alternatively, the states s in the regression can be thought of as goals
and subgoals to be achieved, with an action applied in reverse mapping one goal into another one.
The regression space R(P) is sound and complete in the sense that the solutions to R(P)
encode the solutions to the problem P but in reverse. One potential advantage of searching for plans
backward in R(P), as opposed to forward in S(P), is that it is possible to avoid computing
the heuristic from scratch in every state. This computation can represent up to 80% of the total time that heuristic
search planners take in solving a problem. In the regression search it is indeed possible to perform the
bulk of this computation just once. For example, if h(p; s0) is the estimated cost of achieving the atom
p from the initial problem state s0 according to the additive heuristic, the estimated cost h(s') from a
state s' to the goal I in the regression search can be set to the sum h(s') = Σ_{p ∈ s'} h(p; s0), where the
elements of the sum need to be computed just once from s0 and are then used to determine the heuristic
h(s') of any state s' in the regression space.
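The point is easy to see in code: the per-atom estimates are computed once and the heuristic of any regressed state is just a sum over them. The atom_costs table below is a hypothetical dict of h(p; s0) values, e.g., produced by the additive-heuristic sketch above.

def regression_heuristic(atom_costs):
    # atom_costs: dict mapping each atom p to h(p; s0), computed once from
    # the initial state s0. The returned function evaluates any regressed
    # state s' (a set of atoms) as the sum of the stored per-atom estimates.
    INF = float("inf")
    def h(subgoal):
        return sum(atom_costs.get(p, INF) for p in subgoal)
    return h

# h = regression_heuristic({"on(A,B)": 2.0, "clear(C)": 0.0})
# h({"on(A,B)", "clear(C)"})  evaluates to 2.0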
A potential problem in the regression search is that, while the regression is sound and complete,
it may contain many more dead-ends than the progression search; i.e., the regression can generate
collections of atoms that cannot be achieved jointly by any plan from the initial situation, and hence
will not lead to any solution [Bonet and Geffner, 1999]. In addition, it is not clear that the
heuristics as defined above are as informative in the backward search as in the forward search. None
of these issues has been studied thoroughly, though, yet the fact is that there are no regression-based
planners these days that can compete with the best progression-based planners examined above.
Figure 3.1: The subformulas that make up the CNF formula C(P, n) encoding a planning problem P =
⟨F, I, O, G⟩ with horizon n. Fluents p and actions a are tagged with time indices i. The subformulas can be
easily converted into clauses.
clauses encoding the persistence axioms, and the use of lower bounds for initializing the planning
horizon [Kautz and Selman, 1999]. More recently, Jussi Rintanen has introduced other refinements
that improve the performance of SAT-based planners further, including the use of a special heuristic for
variable selection in the otherwise generic SAT solver, an improved search for an adequate planning
horizon, and better memory management for dealing with the millions of clauses that often result
from planning encodings in CNF [Rintanen, 2012]. While SAT-based planners do not yet scale up
as well as the best heuristic-search planners, the gap has narrowed down considerably. Moreover, it is
well known that on domains that are inherently difficult, SAT approaches can do much better than
optimal and non-optimal heuristic search planners [Hoffmann et al., 2007].
Constraint Satisfaction Problems (CSPs) provide a generalization of SAT where the variables are
not restricted to be boolean and the constraints are not restricted to be clauses [Dechter, 2003]. General
CSP solvers can deal with multivalued variables over arbitrary constraints. In the same way that
classical planning problems with a fixed horizon can be cast as SAT problems, they can also be cast as
CSP problems [Do and Kambhampati, 2000]. In spite, and perhaps because, of the additional expressive
power afforded by the CSP formulation, CSP-based planners have not been able to keep up with
SAT-based planners, which are based on a more restricted task (SAT) over which the technology has
moved faster.
In the forms of planning considered so far, no information is given about which actions to apply or
which subgoals to pursue; rather, actions are characterized in terms of their pre and postconditions,
and the choice and ordering of the actions for solving a problem is computed automatically. HTN
planning, where HTN stands for Hierarchical Task Networks, provides a completely different way of
constructing plans [Erol et al., 1994, Ghallab et al., 2004]. In HTN planning, plans are not obtained
from a model that describes how the actions change the world; rather, actions, called tasks, are described
at several levels, with tasks at one level decomposing into tasks at a lower level, and with some tasks, called
the primitive tasks, standing for real executable actions that do not decompose further. For example,
the abstract task of taking a taxi may be decomposed into the tasks of getting to the street, stopping
a taxi, getting on board, and so on. Similarly, the task of getting on the taxi can be decomposed into
the tasks of opening the door, entering the taxi, and closing the door. Some tasks, like getting to the
airport, may admit multiple decompositions called methods, one of which may involve the tasks of
taking a taxi to the train station, and then a train from the station to the airport. In this case, the tasks
inside the methods are constrained to be one after the other. Other types of restrictions may relate the
tasks in a method as well. While the objective in classical planning is to find an action sequence that
maps the initial situation into a goal state, in HTN planning, the objective is to find a decomposition
that results in a consistent network of primitive tasks. Classical planning is model-based because it’s
based on a model of the actions, the initial situation, and goals, from which the plan is derived and
with respect to which the computed plan can be proved correct. HTN planning is not model-based
in this sense, and indeed, in HTN planning there is no clear separation between the problem that is
being solved and the strategies being used for solving it. Actually, HTNs are most commonly used for
encoding solution strategies. From a theoretical point of view, this is not good enough, as there is then
no assurance that the planning strategy encoded by the HTN leads to plans that are correct.¹ Yet, from
a practical point of view, this may be a feature rather than a bug: in many applications, humans feel
more comfortable describing the solving strategies for the domain than the domains themselves, and
place more trust in such strategies than in plans found by domain-independent planners. This may
explain why HTN planners are more common in applications than domain-independent planners.
This, however, may change as better ways are found for integrating strategy and domain descriptions,
and for coming up automatically with general and transparent strategies.
1 Indeed, in some of the knowledge-based planning competitions held so far, teams were given the planning domains in advance,
and they were free to determine and encode the strategies for solving them. It was then found that plans obtained from these
strategies were not always correct. This is because the HTN encodings were not derived from the domain descriptions but
were written by hand.
CHAPTER 4
Beyond Classical Planning: Transformations
u(π) = Σ_{p : π ⊨ p} u(p) − c(π),   (4.1)
where c(π) is given by the sum of the action costs in π, and π ⊨ p expresses that p is true in the state
that results from applying the action sequence π to the initial problem state.
A plan π for a problem with soft goals is optimal when no other plan π' has utility u(π') higher
than u(π). The utility of an optimal plan for a problem with no hard goals is never negative, as the
empty plan has non-negative utility and zero cost. The International Planning Competition held in
2008 featured a Net-Benefit Optimal track where the objective was to find optimal plans with respect
to Eq. 4.1 [Helmert et al., 2008]. Soft goal or net-benefit planning appears to be very different from
classical planning, as it involves two interrelated problems: deciding which soft goals to adopt, and
deciding on the plan for achieving them. Indeed, most of the entries in the competition developed native
planners for solving these two problems. More recently, however, it has been shown that problems P
with soft goals can be compiled into equivalent problems P' without soft goals that can then be solved
by classical planners able to handle action costs c(a) only [Keyder and Geffner, 2009]. The plans for
P and P' are the same, except for the presence of dummy actions, and the utilities of the plans for
P are inversely related to the costs of the plans for P'. Thus, optimal cost-based planners for P' yield
optimal net-benefit plans for P, while satisficing cost-based planners for P', which scale up better, yield
satisficing net-benefit plans for P.
The idea of the transformation from the problem P with soft goals into the equivalent problem
P' with hard goals only is very simple. For soft goals p associated with individual atoms, one just needs
to add new atoms p' that are made into hard goals in P' and that are achievable in one of two ways: by the
new actions collect(p), with precondition p and cost 0, or by the new actions forgo(p), with precondition
p̄, which stands for the negation of p, and cost equal to the utility u(p) of p. Additional bookkeeping
is needed in the translation so that these new actions can be done only after the normal actions of the
original problem.
More precisely, for a STRIPS problem P D hF; I; O; Gi with action costs c./ and soft goals
u./, the equivalent, compiled STRIPS problem P 0 D hF 0 ; I 0 ; O 0 ; G 0 i with action costs c 0 ./ and no
soft goals has the following components, where Fu D fp j .p 2 F / ^ .u.p/ > 0/g stands for the set of
soft goals [Keyder and Geffner, 2009]:
• F' = F ∪ {p' | p ∈ F_u} ∪ {p̄' | p ∈ F_u} ∪ {normal-mode, end-mode},
• I' = I ∪ {p̄' | p ∈ F_u} ∪ {normal-mode},
• O' = O'' ∪ {collect(p), forgo(p) | p ∈ F_u} ∪ {end},
• G' = G ∪ {p' | p ∈ F_u}, and
• c'(a) = c(a) if a ∈ O''; u(p) if a = forgo(p); and 0 if a = collect(p) or a = end.
If the STRIPS actions a are denoted as pairs ⟨Pre, Post⟩, where Pre stands for the preconditions
of a, and Post for its effects (negated atoms indicate atoms in Del(a)), the actions in the new
compiled problem P' can be expressed as:
• O'' = {⟨Pre(o) ∪ {normal-mode}, Eff(o)⟩ | o ∈ O},
• end = ⟨{normal-mode}, {end-mode, ¬normal-mode}⟩,
• collect(p) = ⟨{end-mode, p, p̄'}, {p', ¬p̄'}⟩,
• forgo(p) = ⟨{end-mode, p̄, p̄'}, {p', ¬p̄'}⟩.
The forgo and collect actions can be used only after the end action that makes the fluent end-mode
true, while the actions from the original problem P can be used only when the fluent normal-mode is
true, prior to the execution of the end action. Moreover, exactly one of {collect(p), forgo(p)} can appear
for each soft goal p in the plan, as both delete the fluent p̄' which appears in their preconditions, and
no action makes this fluent true. As there is no way to make normal-mode true again after it is deleted
Figure 4.1: Deterministic conformant problem where a robot must move with certainty from an uncertain location I into the location marked G, one cell at a time, in an n × n grid.
by the end action, all plans π' for P' have the form π' = ⟨π, end, π''⟩, where π is a plan for P and
π'' is a sequence of collect(p) and forgo(p) actions, one for each soft goal p in F_u and in any order,
the former appearing when π ⊨ p, and the latter otherwise.
The two problems P and P' are equivalent in the sense that there is a correspondence between
the plans for P and P', and corresponding plans are ranked in the same way. More specifically, for any
plan π for P, a plan π' in P' that extends π with the end action and a set of collect and forgo actions has
cost c(π') = α − u(π), where α is a constant that is independent of π and π'. Finding an optimal
(maximum utility) plan for P is therefore equivalent to finding an optimal (minimum cost) plan π'
for P'. This implies that the best plans for P can be obtained from the best plans for P', and these
can be computed with any optimal classical planner able to handle action costs.
From a computational point of view, the transformation above can be made more effective by
means of a simple trick. Recall that for a single plan π for P, there are many extensions π' in P', all
containing the same actions and having the same cost, but differing in the way the collect and forgo
actions are ordered. For efficiency purposes, it makes sense to enforce a fixed but arbitrary ordering
p_1, ..., p_m on the soft goals in P by adding the dummy hard goal p'_i as a precondition of the actions
collect(p_{i+1}) and forgo(p_{i+1}) for i = 1, ..., m−1. The result is that there is a single possible extension
π' of every plan π in P, and the space of plans to search is reduced. Interestingly, the cost-optimal
planners that entered the Optimal Sequential Track of the 2008 IPC, fed with the translations of the
problems in the Optimal Net-Benefit Track, do significantly better than the net-benefit planners that
entered that track [Keyder and Geffner, 2009].
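The compilation is mechanical. Below is a minimal Python sketch of it under assumed data structures (the Action and Strips containers, and the atom labels p' and p-pending, are choices of the sketch, not of any planner); to stay within plain STRIPS the sketch drops the negative precondition of forgo(p), which does not affect optimal plans since collecting an achieved soft goal is never more expensive than forgoing it.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    pre: frozenset     # positive preconditions
    add: frozenset     # add effects
    delete: frozenset  # delete effects
    cost: float = 1.0

@dataclass
class Strips:
    fluents: set
    init: set
    actions: list
    goals: set

def compile_soft_goals(P, utility):
    """Compile soft goals away in the style of Keyder & Geffner (2009):
    each soft goal p (utility[p] > 0) becomes a hard goal p' achieved by
    collect(p) (cost 0) or forgo(p) (cost u(p)) after the end action."""
    Fu = sorted(p for p, u in utility.items() if u > 0)
    prime = {p: p + "'" for p in Fu}            # hard-goal atoms p'
    pending = {p: p + "-pending" for p in Fu}   # bookkeeping atom deleted by collect/forgo
    F2 = set(P.fluents) | set(prime.values()) | set(pending.values()) | {"normal-mode", "end-mode"}
    I2 = set(P.init) | set(pending.values()) | {"normal-mode"}
    G2 = set(P.goals) | set(prime.values())
    A2 = [Action(a.name, a.pre | {"normal-mode"}, a.add, a.delete, a.cost) for a in P.actions]
    A2.append(Action("end", frozenset({"normal-mode"}), frozenset({"end-mode"}),
                     frozenset({"normal-mode"}), 0.0))
    for i, p in enumerate(Fu):
        # optional ordering trick: require the previous hard goal to prune permutations
        extra = frozenset() if i == 0 else frozenset({prime[Fu[i - 1]]})
        A2.append(Action("collect(%s)" % p, frozenset({"end-mode", p, pending[p]}) | extra,
                         frozenset({prime[p]}), frozenset({pending[p]}), 0.0))
        # no negative precondition on forgo: optimal plans never forgo an achieved soft goal
        A2.append(Action("forgo(%s)" % p, frozenset({"end-mode", pending[p]}) | extra,
                         frozenset({prime[p]}), frozenset({pending[p]}), utility[p]))
    return Strips(F2, I2, A2, G2)
```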
move-right: x_1 → x_2, ¬x_1;
move-right: x_2 → x_3, ¬x_2;
...
move-right: x_5 → x_6, ¬x_5.
A (deterministic) conformant problem P = ⟨F, I, O, G⟩ defines a (deterministic) conformant
state model S(P) which is like the state model for a classical problem featuring negation and conditional
effects but with one difference: there is no single initial state s_0 but a set of possible initial states
S_0. A solution for P, namely a conformant plan for P, is an action sequence that simultaneously solves
all the classical state models that result from replacing the set of possible initial states S_0 in S(P)
by each one of the states s_0 in S_0.
From a computational point of view, conformant planning can also be formulated as a path-
finding problem over a graph, but the nodes in the graph do not represent the states of the problem as
in classical planning, but belief states, where a belief state or belief is a set of states deemed possible at one
point [Bonet and Geffner, 2000]. Thus, the root node of the graph is the belief b_0 = S_0 corresponding
to the set of possible initial states, and the goal beliefs b_G are the possible non-empty sets of goal states.
Likewise, the edges correspond to the belief state transitions (b, b_a) that are possible, where b_a is the
belief state that results from applying the action a in the belief state b, characterized as:

b_a = {s' | s' ∈ F(a, s) and s ∈ b},

where F(a, s) denotes the set of states that are possible following the action a in s. Recent proposals
have advanced new heuristics for guiding the search in belief space and more compact belief state
representations [Brafman and Shani, 2012b, Bryce et al., 2006, Cimatti et al., 2004, Hoffmann and
Brafman, 2006, Rintanen, 2004b, To et al., 2011].
A different approach to deterministic conformant planning is based on the translation of con-
formant problems into classical ones [Palacios and Geffner, 2009]. The basic sound but incomplete
translation removes the uncertainty in the problem by replacing each literal L in the conformant prob-
lem P by two literals KL and K¬L, to be read as "L is known to be true" and "L is known to be false,"
respectively. If L is known to be true or known to be false in the initial situation, then the translation
will contain respectively KL or K¬L. On the other hand, if L is not known, then both KL and K¬L
will be initially false. The result is that there is no uncertainty in the initial situation of the translation,
which thus represents a classical planning problem.
More precisely, the basic translation K0 is such that if P = ⟨F, I, O, G⟩ is a deterministic con-
formant problem, the translation K0(P) is the classical planning problem K0(P) = ⟨F', I', O', G'⟩
where
• F' = {KL, K¬L | L ∈ F},
• I' = {KL | L is a unit clause in I},
• G' = {KL | L ∈ G},
• O' = O, but with each precondition L for a ∈ O replaced by KL, and each conditional effect
a: C → L replaced by a: KC → KL and a: ¬K¬C → ¬K¬L.¹
The expressions KC and ¬K¬C for C = {L_1, L_2, ...} are abbreviations for the conjunctions
{KL_1, KL_2, ...} and {¬K¬L_1, ¬K¬L_2, ...} respectively. Recall that in a classical planning problem,
atoms that are not part of the initial situation are assumed to be initially false, so if KL is not part of
I', KL will be initially false in K0(P).
¹ A conditional effect a: C → C' is equivalent to a collection of conditional effects a: C → L, one for each literal L in C'.
The only subtlety in this translation is that each conditional effect a: C → L in P is mapped
into two conditional effects in K0(P): a support effect a: KC → KL, that ensures that L is known
to be true when the condition C is known to be true, and a cancellation effect a: ¬K¬C → ¬K¬L,
that ensures that L is possible when the condition C is possible.
The translation K0(P) is sound, as every classical plan that solves K0(P) is a conformant plan for
P, but incomplete, as not all conformant plans for P are classical plans for K0(P). The meaning of
the KL literals follows a similar pattern: if a plan achieves KL in K0(P), then the same plan achieves
L with certainty in P, yet a plan may achieve L with certainty in P without making the literal KL
true in K0(P).
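A minimal Python sketch of the K0 translation follows, under an assumed representation of conformant problems (literals as strings with a leading '-' for negation, actions as dictionaries with 'pre' and 'effects' fields); it is an illustration of the support/cancellation pattern, not the code of any existing translator.

```python
def neg(L):
    """Complement of a literal written as 'p' or '-p'."""
    return L[1:] if L.startswith("-") else "-" + L

def K(L):
    """The literal KL: 'L is known to be true'."""
    return "K" + L

def k0_translation(F, I_unit_clauses, actions, G):
    """Basic K0 translation of a deterministic conformant problem into a classical one.
    Each action is a dict with 'name', 'pre' (list of literals), and 'effects'
    (list of (condition_literals, effect_literal) pairs)."""
    F2 = [K(L) for p in F for L in (p, neg(p))]
    I2 = [K(L) for L in I_unit_clauses]            # only what is known initially
    G2 = [K(L) for L in G]
    A2 = []
    for a in actions:
        effects = []
        for C, L in a["effects"]:
            effects.append(([K(Li) for Li in C], K(L)))                        # support: KC -> KL
            effects.append((["-" + K(neg(Li)) for Li in C], "-" + K(neg(L))))  # cancellation: -K-C -> -K-L
        A2.append({"name": a["name"], "pre": [K(L) for L in a["pre"]], "effects": effects})
    return F2, I2, A2, G2
```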
For completeness, the basic translation K0 is extended into a general translation scheme KT,M
where T and M are two parameters: a set of tags t and a set of merges m. A tag t ∈ T is a set (conjunc-
tion) of literals L from P whose truth value in the initial situation is not known. The tags t are used to
introduce a new class of literals KL/t in the classical problem KT,M(P) that represent the conditional
statements: "if t is initially true, then L is true." Likewise, a merge m is a non-empty collection of tags
t in T that stands for the Disjunctive Normal Form (DNF) formula ⋁_{t∈m} t. A merge m is valid when
one of the tags t ∈ m must be true in I, i.e., when

I ⊨ ⋁_{t∈m} t.   (4.3)

A merge m for a literal L in P translates into a "merge action" with effects that capture a simple form
of reasoning by cases:

⋀_{t∈m} KL/t → KL.   (4.4)
We assume that the collection of tags T always includes a tag that stands for the empty collec-
tion of literals, called the empty tag and denoted as ∅. If t is the empty tag, literals KL/t are denoted as
KL. The parametric translation scheme KT,M is the basic translation K0 "conditioned" with the tags
in T and extended with the actions that capture the merges in M. If P = ⟨F, I, O, G⟩ is a determinis-
tic conformant problem, then KT,M(P) is the classical planning problem KT,M(P) = ⟨F', I', O', G'⟩
where
• F' = {KL/t, K¬L/t | L ∈ F and t ∈ T},
• I' = {KL/t | I, t ⊨ L},
• G' = {KL | L ∈ G},
• O' = {a: KC/t → KL/t, a: ¬K¬C/t → ¬K¬L/t | a: C → L in P} ∪
  {a_{m,L}: ⋀_{t∈m} KL/t → KL | L ∈ P, m ∈ M}.
Figure 4.2: Plan recognition: which destination is the agent, initially at X, moving to after observing that he moved twice up?
a merge m is included in M such that m = S_0. While the resulting translation K_{S_0}(P) is exponential
in the number of unknown atoms in the initial situation in the worst case, there is an alternative choice
of tags and merges, called the K_i(P) translation, that is exponential in the non-negative integer i, and
that is complete for problems P that have a structural parameter w(P), called the width of P, bounded
by i. In problems defined over multivalued variables, this width often stands for the maximum number
of variables all of which are relevant to a variable appearing in an action precondition or goal. It turns
out that many conformant problems have a bounded and small width, and hence such problems can
be efficiently solved by a classical planner after a low polynomial translation [Palacios and Geffner,
2009]. The conformant plans are then obtained from the classical plans by removing the "merge"
actions. The translation-based approach, introduced initially for deterministic conformant planning,
has been extended to deterministic planning with sensing [Albore et al., 2009, Bonet and Geffner,
2011, Brafman and Shani, 2012a,b]. In Chapter 5, we will look at a related notion of width in the
more general setting of non-deterministic planning.
P(H|Obs) = P(Obs|H) P(H) / P(Obs)   (4.5)
where P(Obs|H) represents how well the hypothesis H predicts the observation Obs, P(H) stands
for how likely is the hypothesis H a priori, and P(Obs), which affects all hypotheses H equally,
measures how surprising is the observation. In our problem, the hypotheses are about the possible
destinations of the agent, and since there are no reasons to assume that one is more likely a priori than
the others, Bayes' rule yields that P(H|Obs) should be proportional to the likelihood P(Obs|H) that
measures how well H predicts Obs. Going back to the figure, and assuming that the agent is reasonably
"rational" and hence wants to achieve his goals with least cost, it's clear that A, B, and C predict Obs
better than D, E, and F, and also that B predicts Obs better than A and C. This is because there is a
single optimal plan for B that is compatible with Obs, but there are many optimal plans for A and
for C, some of which are not compatible with Obs (as when the agent moves first left or right, rather
than up). We say that a plan π is compatible with the observed action sequence Obs when the action
sequence Obs is embedded in the action sequence π, i.e., when Obs is π but with certain actions in π
omitted (not observed).
The reasoning above reduces goal recognition to Bayes' rule and how well each of the possible
goals predicts the observed action sequence. Moreover, how well a goal G predicts the sequence Obs
turns out to depend on considerations having to do with costs, and in particular, two cost measures:
the cost of achieving G through a plan compatible with the observed action sequence Obs, and the
cost of achieving G through a plan that is not compatible with Obs. We will denote the first cost
as c_P(G + Obs) and the second cost as c_P(G + ¬Obs), where P along with the observations Obs
define the plan recognition problem. That is, P is like a classical planning problem but with the actual
goal hidden and replaced by a set 𝒢 of possible goals G, i.e., P = ⟨F, I, O, 𝒢⟩. The plan recognition
problem is to infer the probability distribution P(G|Obs) over the possible goals G ∈ 𝒢, where each
possible goal G can be a (conjunctive) set of atoms.
For the plan recognition problem in Figure 4.2, the measures c_P(B + Obs) and c_P(B + ¬Obs),
encoding the costs of getting to B from X through plans compatible and incompatible with the ob-
served action sequence Obs, are 4 and 6 respectively, assuming moves in each one of the four possible
directions, each with cost 1. On the other hand, the pairs of measures (c_P(G + Obs), c_P(G + ¬Obs))
for G equal to A, J, and H are (8, 8), (8, 4), and (12, 8) respectively.
The key feature is actually the cost difference Δ(G, Obs) = c_P(G + ¬Obs) − c_P(G + Obs) for
each goal G, which can range from −∞ to +∞. It can be argued that the higher the value of
Δ(G, Obs), the better that G predicts Obs, and hence the higher the likelihood P(Obs|G). In partic-
ular, Δ(G, Obs) is +∞ when all the plans for G comply with Obs, and −∞ when none of them complies
with Obs. Values in the middle reflect how good the plans that comply and do not comply with the
observed action sequence Obs are. In our example, Δ(G, Obs) is 2 for G = B, 0 for G = A and G = C,
and −4 for the other possible goals. Hence P(Obs|G) is largest for G = B, smaller for G = A and
G = C, and smallest for the rest. The function used by Ramírez and Geffner [2010] for mapping the
cost difference Δ(G, Obs) = c_P(G + ¬Obs) − c_P(G + Obs) into the likelihoods P(Obs|G) is the
sigmoid function:
P(Obs|G) = 1 / (1 + e^{−β·Δ(G,Obs)})   (4.6)
where β is a positive constant. This expression is derived from the assumption that while the observed
agent is not perfectly rational, he is more likely to follow cheaper plans, according to a Boltzmann
distribution. The larger the value of the constant β, the more rational the agent, and the less likely that
he will follow suboptimal plans.
The target distribution P(G|Obs) over the possible goals G ∈ 𝒢 given the observation sequence
Obs can thus be obtained in three steps. First, the costs c_P(G + Obs) and c_P(G + ¬Obs) of achieving
each possible goal G with plans that are compatible and incompatible with the observed action se-
quence Obs are determined. Then, the resulting cost differences Δ(G, Obs) are plugged into Eq. 4.6
to yield the likelihoods P(Obs|G). Finally, these likelihoods are plugged into Bayes' rule (4.5) from
which the goal posterior probabilities are obtained. The probabilities P(Obs) used in Bayes' rule are
obtained by normalization (goal probabilities must add up to 1 when summed over all possible goals).
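Putting the three steps together is simple once the two cost measures are available. The Python sketch below works from a hypothetical dictionary of precomputed costs (with float('inf') marking a compiled problem without solution) and uniform priors; it illustrates Eqs. 4.5 and 4.6 and is not part of any recognition system.

```python
import math

def goal_posteriors(costs, beta=1.0, priors=None):
    """costs[G] = (c_P(G + Obs), c_P(G + notObs)); returns P(G|Obs) over the goals G.
    Infinite costs are handled by taking limits of the sigmoid (Eq. 4.6)."""
    goals = list(costs)
    if priors is None:
        priors = {G: 1.0 / len(goals) for G in goals}
    likelihood = {}
    for G, (c_with, c_against) in costs.items():
        if math.isinf(c_with):       # no plan for G complies with Obs: Delta = -inf
            likelihood[G] = 0.0
        elif math.isinf(c_against):  # every plan for G complies with Obs: Delta = +inf
            likelihood[G] = 1.0
        else:
            delta = c_against - c_with                  # Delta(G, Obs)
            likelihood[G] = 1.0 / (1.0 + math.exp(-beta * delta))
    unnorm = {G: likelihood[G] * priors[G] for G in goals}
    Z = sum(unnorm.values())                            # plays the role of P(Obs)
    return {G: (unnorm[G] / Z if Z > 0 else 0.0) for G in goals}

# Example with the costs discussed for Figure 4.2 (goals B, A, J, H):
print(goal_posteriors({"B": (4, 6), "A": (8, 8), "J": (8, 4), "H": (12, 8)}))
```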
The open question is how to compute the cost measures c_P(G + Obs) and c_P(G + ¬Obs).
Ramírez and Geffner [2010] show that these costs correspond to the costs of two classical planning
problems, that we will call P(G + Obs) and P(G + ¬Obs), defined from the plan recognition prob-
lem P = ⟨F, I, O, 𝒢⟩, where 𝒢 stands for the set of possible agent goals, and the observed action
sequence Obs. If we assume that no action occurs twice in the observed sequence Obs, the problems
P(G + Obs) and P(G + ¬Obs) are like P but with extra atoms p_a for each a ∈ Obs, all initially false,
such that p_a is made into an effect of the action a when a is the first action in Obs, while p_b → p_a
is made into a conditional effect of a when b is the action that immediately precedes a in the sequence
Obs. The cost c_P(G + Obs) is then the cost of this classical problem for the goal G' = G ∪ {p_a},
where a is the last action in the sequence Obs, and the cost c_P(G + ¬Obs) is the cost of the same
classical problem but with goal G'' = G ∪ {¬p_a}, where ¬p_a is the negation of p_a. In other words,
the constraint of achieving a possible goal G in a way that is compatible or incompatible with an ob-
served action sequence Obs is mapped into the problem of achieving G and a suitable dummy goal
associated with Obs in a transformed classical problem.
Figure 4.3 shows a slightly different example, where the path followed by the agent is shown
on the left as time progresses. The curves on the right show the resulting goal posterior probabilities
over each one of the possible targets as a function of time. The account presented is not tied to agents
navigating in grids but is completely domain-independent. For computing the posterior probabilities,
2·|𝒢| classical planning problems need to be solved. These probabilities will be exact if the prob-
lems are solved optimally, and will be approximate if they are solved with more scalable non-optimal
planners. The use of model-based approaches to behavior generation for the inverse task of behav-
ior recognition has been considered recently for other models such as MDPs [Baker et al., 2009] and
POMDPs [Ramírez and Geffner, 2011]. Moreover, the approaches can also be used to recognize both
goals and agent beliefs, by just replacing the set of possible goals by a set of possible goal and initial
belief pairs.
Figure 4.3: Left: Red path shows the noisy walk of the agent Obs_t as time t progresses. Right: Curves show the goal posterior probabilities P(G|Obs_t) for each possible target as a function of time.
Figure 4.4: (a) A partially observable problem where an agent initially in one of the two leftmost positions has to go to the cell marked B and then back to the cell marked A. These two marks are observable. (b) A 2-state controller that solves this problem and many variations of it. The circles are the controller states, and an edge q → q' labeled o/a means to perform action a when the observation is o in state q, switching then to state q'. The initial controller state is q0.
Figure 4.5: Left: Problem where a visual marker (mark on the lower left cell) must be placed on top of a green block whose location is not known, by moving the mark one cell at a time, and by observing what's in the marked cell. Right: Finite-state controller obtained with a classical planner from a suitable translation. The controller solves the problem and any variation resulting from changes in either the number or configuration of blocks.
this cell is at the level of the table (T) or not (–). The finite-state controller shown on the right has been
computed by running a classical planner over a translation obtained following the two steps above:
one, from the original partially observable problem into a conformant problem; the second, from the
conformant problem into a classical one. The solution to the classical problem represents the finite-
state controller that is shown on the right. Interestingly, this controller not only solves the problem
shown on the left, but the reader can verify that it also solves any modification of the problem resulting
from changes in either the dimensions of the grid, or the number or configuration of blocks [Bonet et al.,
2009]. This is quite remarkable and illustrates that the combined use of transformations and classical
planners can be very powerful indeed.
4.5 TEMPORALLY EXTENDED GOALS
Classical planning is about acting on a system to drive it into a final state where a goal holds. Such tasks
are sometimes called "reachability" problems. In the last few years, temporally extended goals expressed
in temporal logics have been increasingly used to capture a richer class of plans where restrictions over
the whole sequence of states must be satisfied as well [Bertoli et al., 2003, de Giacomo and Vardi,
1999, Gerevini et al., 2009]. A temporally extended goal may state, for example, that any borrowed
tool should be kept clean until it is returned, defining a constraint that does not apply to a single state
but to a whole state sequence. A plan achieves a goal while satisfying a state-trajectory constraint when
the plan achieves the goal in the standard sense, and in addition, the state sequence that it generates
satisfies the constraint.
A standard language for expressing trajectory constraints is Linear Temporal Logic or LTL, a
logic originally proposed as a specification language for concurrent programs [Pnueli, 1977]. Formulas
of LTL are built from a set F of atoms and are closed under the boolean operators, the unary temporal
operators ○, ◇, and □, and the binary temporal operator U. Intuitively, ○φ says that φ holds at the
next instant, ◇φ says that φ will eventually hold at some future instant, □φ says that from the current
instant on φ will always hold, and φ U φ' says that at some future instant φ' will hold and until that
point φ holds. As an example, the formula □(p → ○q) says that if p is true at any time point, then q
must be true at the following time point.
The semantics of LTL is given in terms of infinite state sequences σ = s_0, s_1, ..., s_i, ... where
the indices i stand for time points, and the state s_i represents a truth valuation over F at time i. If we
let σ(i) stand for the state s_i in the sequence σ, the conditions under which a state sequence σ satisfies
an arbitrary LTL formula φ at time i, written σ, i ⊨ φ, can be given inductively as follows:
• σ, i ⊨ p, for p ∈ F, iff p ∈ σ(i).
• σ, i ⊨ ¬φ iff not σ, i ⊨ φ.
• σ, i ⊨ φ ∧ φ' iff σ, i ⊨ φ and σ, i ⊨ φ'.
• σ, i ⊨ ○φ iff σ, i+1 ⊨ φ.
• σ, i ⊨ ◇φ iff for some j ≥ i, we have that σ, j ⊨ φ.
• σ, i ⊨ □φ iff for all j ≥ i, we have that σ, j ⊨ φ.
• σ, i ⊨ φ U φ' iff for some j ≥ i, we have that σ, j ⊨ φ' and for all k, i ≤ k < j, we have that
σ, k ⊨ φ.
A formula φ is true or satisfied in σ, written σ ⊨ φ, if σ, 0 ⊨ φ. For determining whether a
given plan π = a_0, ..., a_{n−1} for a classical planning problem P = ⟨F, I, O, G⟩ satisfies a temporally
extended goal expressed as an LTL formula over F, it is normally assumed that the finite state sequence
s_0, ..., s_n generated by the plan represents the infinite state sequence s_0, ..., s_n, s_n, s_n, ... where the
last state s_n in the sequence is repeated forever [Bacchus and Kabanza, 2000]. This is an assumption
that can be used for many LTL formulas but not for all, as some formulas may be satisfiable but not by
sequences of this type. A formula like □(◇at(p1) ∧ ◇at(p2)), expressing that from any time point
on the robot has to be eventually at position 1 and eventually at position 2, is one such example. These
formulas require the consideration of more general infinite state sequences where one finite sequence
s_0, ..., s_n is followed by another finite state sequence s_n, s'_1, ..., s'_m, s_n that forms a loop and is repeated
infinitely often, and where the states s'_i are different from s_n. We'll focus now on the fragment of
temporally extended goals expressed in LTL where "completed" state sequences s_0, ..., s_n, s_n, s_n, ...
suffice. Following Bauer and Haslum [2010], we refer to this as the infinite-extension semantics for
LTL, or simply the IE-semantics.
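Under the IE-semantics, checking whether a finite plan satisfies an LTL goal amounts to evaluating the formula over the completed sequence s_0, ..., s_n, s_n, s_n, ...; since the last state repeats forever, evaluation beyond position n coincides with evaluation at n. The recursive evaluator below is a small Python sketch of this idea (the tuple encoding of formulas is an assumption of the sketch).

```python
def holds(phi, seq, i):
    """Evaluate LTL formula phi over the completed sequence seq[0..n], seq[n], seq[n], ...
    at position i. States are sets of atoms; formulas are nested tuples such as
    ('atom','p'), ('not',f), ('and',f,g), ('next',f), ('eventually',f), ('always',f), ('until',f,g)."""
    n = len(seq) - 1
    i = min(i, n)                       # beyond n the last state repeats
    op = phi[0]
    if op == "atom":
        return phi[1] in seq[i]
    if op == "not":
        return not holds(phi[1], seq, i)
    if op == "and":
        return holds(phi[1], seq, i) and holds(phi[2], seq, i)
    if op == "next":
        return holds(phi[1], seq, i + 1)
    if op == "eventually":
        return any(holds(phi[1], seq, j) for j in range(i, n + 1))
    if op == "always":
        return all(holds(phi[1], seq, j) for j in range(i, n + 1))
    if op == "until":
        return any(holds(phi[2], seq, j) and all(holds(phi[1], seq, k) for k in range(i, j))
                   for j in range(i, n + 1))
    raise ValueError(op)

# Example: the goal from the text, box(p -> next q), over the sequence {p}, {q}, {}
phi = ("always", ("not", ("and", ("atom", "p"), ("not", ("next", ("atom", "q"))))))
print(holds(phi, [{"p"}, {"q"}, set()], 0))   # True
```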
We turn thus to the problem of computing a finite plan π = a_0, ..., a_{n−1} for a classical planning
problem P = ⟨F, I, O, G⟩ such that the completed infinite state sequence s_0, ..., s_n, s_n, s_n, ... that
results from the plan satisfies an LTL formula φ. It turns out that this problem can be solved by
mapping the classical problem P and the formula φ into a new classical planning problem P_φ whose
solutions represent plans for P that satisfy φ [Baier et al., 2009, Cresswell and Coddington, 2004,
Edelkamp, 2006]. Rather than focusing on the syntactic details of the translation, we describe the
main idea semantically.
We know by now that the planning problem P = ⟨F, I, O, G⟩ represents a state model
S(P) = (S, s_0, S_G, A, f), that can also be understood as a deterministic finite automaton A_P =
(Σ_P, Q_P, q_0^P, δ_P, F_P) where the input alphabet is Σ_P = O, the states are Q_P = S, the initial state
is q_0^P = s_0, the transition function δ_P is such that s' ∈ δ_P(a, s) iff s' = f(a, s), and the accepting
states are F_P = S_G. The LTL formula φ defines in turn a non-deterministic Büchi automaton
A_φ = (Σ_φ, Q_φ, q_0^φ, δ_φ, F_φ) where the input alphabet is Σ_φ = S, and the accepted inputs are the in-
finite state sequences that satisfy φ, defined as the inputs that generate state sequences over Q_φ that
pass through accepting states in F_φ infinitely often [Gerth et al., 1995, Vardi and Wolper, 1994]. Un-
der the IE-semantics, however, it is enough to reach an accepting state once, and hence the automaton
A_φ can be regarded as a standard non-deterministic finite automaton, which can be determinized using
standard methods [Hopcroft and Ullman, 1979, Sipser, 2006].
Therefore, under the IE-semantics, the valid plans that satisfy an LTL formula φ are the action
sequences π = a_0, ..., a_{n−1} that generate state sequences σ = s_0, ..., s_n such that π is accepted by
the first automaton A_P and σ is accepted by the second automaton A_φ. Thus, the classical planning
problem P_φ whose solutions encode the plans for P that satisfy the LTL formula φ can be expressed as
the compact representation of the product of two deterministic automata: the deterministic automaton
A_P associated with the problem P, and the deterministic version of the automaton A_φ associated with
the LTL goal φ. The states over the problem P_φ, which represent the truth-valuations over the atoms
in P_φ, stand for pairs (s, q) where s captures the state of the first automaton and q captures the state
of the second automaton. This construction requires the addition of atoms p_q in P_φ for such states q,
in addition to the atoms in P. The actions in P_φ are the actions in P but with effects on the atoms p_q
in correspondence with the second automaton. Likewise, the initial state of P_φ is the initial state of
P extended with the atom p_{q_0}, and the goal in P_φ is the goal of P conjoined with the disjunction of
atoms p_q for accepting states q. Approaches and transformations for dealing with arbitrary LTL goals,
that may require plans with loops and "lasso" state sequences, have also been developed [Albarghouthi
et al., 2009, Kabanza and Thiébaux, 2005, Patrizi et al., 2011, 2013].
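Semantically, the construction is just a synchronized product: the plan drives the problem automaton through states s while the deterministic automaton for φ reads those states. The Python sketch below builds this product explicitly over small automata given as dictionaries (this flat representation, and the choice of having the automaton for φ read the initial state first, are assumptions of the sketch; the planning problem P_φ is the compact encoding of this product).

```python
from itertools import product as cartesian

def product_model(states, s0, goal_states, actions, f,
                  dfa_states, q0, dfa_delta, dfa_accepting):
    """Explicit product of the problem automaton A_P (f[(a, s)] -> s') with a
    deterministic automaton for the LTL goal under the IE-semantics
    (dfa_delta[(q, s)] -> q', reading planning states). Returns the initial
    product state, the product transition function, and the product goals."""
    prod_init = (s0, dfa_delta[(q0, s0)])          # the goal automaton reads the initial state first
    prod_f = {}
    for (s, q), a in cartesian(cartesian(states, dfa_states), actions):
        if (a, s) in f:
            s2 = f[(a, s)]
            prod_f[(a, (s, q))] = (s2, dfa_delta[(q, s2)])
    prod_goals = {(s, q) for s in goal_states for q in dfa_accepting}
    return prod_init, prod_f, prod_goals
```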
CHAPTER 5
• F is a non-deterministic state-transition function such that F(a, s) denotes the non-empty set
of possible successor states that follow action a in s, for a ∈ A(s), and
For the resulting paths to encode conformant plans, an action a must be regarded as applicable in the
belief state b, written a ∈ A(b), when a is applicable in each state s in b, or equivalently, when the
preconditions of a are true (in all the states) in b.
The model for planning with sensing S = ⟨S, S_0, S_G, A, F, O⟩ is the model for conformant
planning extended with a sensing model O: a function O(s, a) that maps state-action pairs into non-
empty sets of observation tokens. The expression o ∈ O(s, a) means that o is a possible observation
token when s is the true state of the system and a is the last action done. That is, every time that the
agent executes the action a resulting in the state s, the agent gets an observation token from O(s, a).
This observation o provides partial information about the true but possibly hidden state s, since it rules
out states for which the observation token o is not possible, i.e., the states s' for which o ∉ O(s', a).
If two different observations belong to O(s, a), then either one can be observed in s when a is the last
action. We say that the sensing is deterministic or noiseless when O(s, a) is a singleton for every pair
(s, a), else it is non-deterministic or noisy.
If the belief state for the agent is b and the observation o is obtained after applying the action
a in b, the new belief state, denoted as b_a^o, is given by the states in b_a that are compatible with o:

b_a^o = {s | s ∈ b_a and o ∈ O(s, a)}.

An observation o is possible in a belief state b_a if o ∈ O(s, a) for some state s in b_a. Alternatively, the
observation o is possible in b_a if and only if the resulting belief state b_a^o is not empty.
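Belief progression under actions and observations then takes a simple set-theoretic form. The Python sketch below is a direct transcription of these definitions, with F, A, and O passed as functions and beliefs represented as explicit sets of states (an assumption that is only workable for small problems).

```python
def applicable(b, a, A):
    """a is applicable in belief b iff it is applicable in every state in b."""
    return all(a in A(s) for s in b)

def progress(b, a, F):
    """b_a: states that can follow some state of b under the (non-deterministic) action a."""
    return frozenset(s2 for s in b for s2 in F(a, s))

def filter_obs(ba, a, o, O):
    """b_a^o: states of b_a compatible with observation token o after doing a."""
    return frozenset(s for s in ba if o in O(s, a))

def possible_obs(ba, a, O):
    """Observations that are possible in b_a (equivalently, those with b_a^o nonempty)."""
    return {o for s in ba for o in O(s, a)}
```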
LANGUAGE
Conformant models can be expressed in compact form through a set of state variables. For convenience,
in the partially observable setting, we assume that these variables are not necessarily boolean. More
precisely, a conformant planning problem is a tuple P = ⟨V, I, A, G⟩ where V stands for the problem
variables X, each one with a finite and discrete domain D_X, I is a set of clauses over the V-literals
defining the initial situation, A is a set of actions, and G is a set of V-literals defining the goal. Every
action a has a precondition Pre(a) given by a set of V-literals, and a set of conditional effects a:
C → E_1 | ... | E_n, where C and each E_i is a set (conjunction) of V-literals. The conditional effect is
non-deterministic if n > 1. A non-deterministic action is an action with one or more non-deterministic
effects.
The conformant problem P = ⟨V, I, A, G⟩ defines the conformant model S(P) =
⟨S, S_0, S_G, A, F⟩, where S is the set of possible valuations over the variables in V, S_0 and S_G
are the sets of valuations that satisfy I and G respectively, A(s) is the set of actions whose precondi-
tions are true in s, and F(a, s) is a non-deterministic state transition function where s' ∈ F(a, s) is a
possible successor state of action a in state s for a ∈ A(s). The set F(a, s) of such possible successors s'
is defined by the conditional effects a: C → E_1 | ... | E_n, n ≥ 1, whose body C is true in s. Basically,
any logically consistent choice of heads E_i, one for each conditional effect whose body is true in
s, defines a deterministic transition function f(a, s). F(a, s) is the non-deterministic transition
function that results from collecting all the successor states s' that are possible given any of these
deterministic functions.
A partially observable problem P is a tuple P = ⟨V, I, A, G, V', W⟩ that extends the description
⟨V, I, A, G⟩ of a conformant model with a compact encoding of a sensor model. This sensor model is
defined syntactically by means of a set V' of variables Y with a finite domain D_Y that are assumed
to be observable, and a set W of formulas W_a(Y = y) over the state variables V of the problem that
determine the states s over which the atom Y = y may be observed. More precisely, the sensor model
O(s, a) defined by W is such that o ∈ O(s, a) iff o is a valuation over the observable variables Y ∈ V'
such that Y = y is true in o only if the formula W_a(Y = y) is true in s, for y ∈ D_Y. In other words,
an observation o represents a maximal consistent set of partial observations Y = y where Y is an
observable variable and y is a possible value of Y. Such an observation o is possible in the state s after
doing action a if the formulas W_a(Y = y) for the atoms Y = y in o are all true in s.
Two last remarks. First, some of the state variables X may be observable and hence belong to
both V and V'. In such a case, the formula W_a(X = x) for the different actions and possible values
of X is given by X = x. Second, the formulas W_a(Y = y) for the different values y in D_Y must be
logically exhaustive, as every state-action pair must give rise to some observation over each observable
variable Y. If, in addition, the formulas W_a(Y = y) for the different values y are logically exclusive,
every state-action pair gives rise to a single observation Y = y and the sensing over Y is deterministic.
As an example, if X encodes the location of an agent, and Y encodes the location of an object
that can be seen by the agent when X = Y, we can have an observable variable Z ∈ {Yes, No} en-
coding whether the object is seen by the agent or not, with observation model W_a(Z = Yes) given by
⋁_{l∈D} (X = l ∧ Y = l), where D is the set of possible locations and a is any action, and W_a(Z = No)
given by the negation of this formula. The resulting sensor is deterministic. A non-deterministic sensor
could be used if, for example, the agent cannot detect with certainty the presence of the object at some
other locations l ∈ D'. For this, W_a(Z = Yes) and W_a(Z = No) can be set to the disjunction of their
previous expression and the formula ⋁_{l∈D'} (X = l). The result is that the two observations Z = Yes
and Z = No will be possible in the states where the agent is at some location l ∈ D', whether the
object is in the same location or not.
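In code, the sensor model O(s, a) defined by the formulas W_a(Y = y) can be recovered by collecting, for each observable variable, the values whose formula holds in the state, and combining one value per variable. The Python sketch below uses assumed representations (states as dictionaries, W as a boolean function); the deterministic example mirrors the Z variable above.

```python
from itertools import product

def observations(s, a, observable_vars, domains, W):
    """All observation tokens o possible in state s after action a.
    An observation assigns to each observable variable Y a value y with W_a(Y = y) true in s.
    W(a, Y, y, s) -> bool encodes the formula W_a(Y = y); domains[Y] is the domain D_Y."""
    choices = []
    for Y in observable_vars:
        vals = [y for y in domains[Y] if W(a, Y, y, s)]
        choices.append([(Y, y) for y in vals])   # logical exhaustiveness guarantees vals is nonempty
    return [dict(combo) for combo in product(*choices)]

# Deterministic sensing from the example: Z = Yes iff the agent and the object are co-located.
def W_example(a, Y, y, s):
    assert Y == "Z"
    seen = s["X"] == s["Y"]      # X: agent location, Y: object location (state variables)
    return seen if y == "Yes" else not seen

print(observations({"X": 3, "Y": 3}, "look", ["Z"], {"Z": ["Yes", "No"]}, W_example))
```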
No other executions are possible given this policy. Initially the belief state contains two states differ-
ing only in the truth of the atoms in(toy, box1) and in(toy, box2). After the action inspect(box1)
and the resulting observation Y = yes or Y = no, the belief state reduces to a single state where
in(toy, box1) and in(toy, box2) are true respectively. This is because for a = inspect(box1), the
sensor model O(s, a) is such that o = (Y = yes) can be observed only when the state s satisfies
the formula W_a(Y = yes) = in(toy, box1) ∧ visible(box1), while o = (Y = no) can be observed
only when s satisfies the formula W_a(Y = no) = ¬in(toy, box1) ∧ visible(box1). Thus, after the his-
tory h_2, in(toy, box1) must be true, while after the history h'_2, in(toy, box1) must be false. Since
in(toy, box1) is false only in the state where in(toy, box2) is true, it follows that in(toy, box2) must
be true after h'_2.
where a = π(b), o ranges over the observations that are possible in b_a, and c(a, b) is the cost of doing
action a in the belief state b, which in the worst case is:

c(a, b) = max_{s∈b} c(a, s).

A policy π is optimal if it minimizes the costs V_π(b) over all beliefs b. The cost function V_π for an
optimal policy π = π* is the optimal cost function V*, that is the solution of Bellman's optimality
equation

V(b) = 0 if b is a goal belief, and
V(b) = min_{a∈A(b)} [c(a, b) + max_o V(b_a^o)] otherwise.   (5.5)
In the absence of dead-ends, i.e., belief states b from which the goal cannot be reached, Equation 5.5
can be solved by a simple dynamic programming method called Value Iteration [Bellman, 1957], where
the equation is used to update a value vector V over all beliefs b until a fixed point is reached. More
precisely, in the version of Value Iteration (VI) known as Gauss-Seidel VI [Bertsekas, 1995], one
starts with a value vector V(b), initially set to 0 over all entries, and then iteratively updates each of
the entries V(b) over non-goal beliefs b as:

V(b) := min_{a∈A(b)} [c(a, b) + max_o V(b_a^o)].
In this setting, Value Iteration converges in a finite number of steps to the single solution V = V*.
The optimal policy π* for solving the problem is then obtained from the greedy policy π_V
using the value function V = V*. We will see in Chapter 7 that the equations for solving POMDPs
are similar, but with the subexpression max_o V(b_a^o), representing cost in the worst case, replaced by
the expected cost Σ_o b_a(o) V(b_a^o). Likewise, the beliefs b, b_a, and b_a^o will then become probability
distributions, and b_a(o) will stand for the probability of observing o after doing the action a in b. A
key difference when moving to POMDPs is that the set of possible beliefs b, representing probability
distributions over the set of states, will no longer be finite.
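A minimal Python sketch of Gauss-Seidel VI over an explicitly enumerated belief space is given below; the functional interface (actions, cost, progress, obs, filter_obs, in the style of the helpers sketched earlier but with the sensor model already bound in) and the explicit enumeration of beliefs are assumptions of the sketch.

```python
def belief_value_iteration(beliefs, is_goal, actions, cost, progress, obs, filter_obs):
    """Gauss-Seidel VI solving Eq. 5.5 over a finite set of (logical) beliefs.
    Returns the optimal worst-case cost function V and the greedy policy."""
    V = {b: 0.0 for b in beliefs}
    changed = True
    while changed:
        changed = False
        for b in beliefs:
            if is_goal(b):
                continue
            best = float("inf")
            for a in actions(b):
                ba = progress(b, a)
                worst = max(V[filter_obs(ba, a, o)] for o in obs(ba, a))
                best = min(best, cost(a, b) + worst)
            if best != V[b]:
                V[b] = best
                changed = True
    policy = {}
    for b in beliefs:
        if not is_goal(b):
            policy[b] = min(actions(b),
                            key=lambda a: cost(a, b) +
                            max(V[filter_obs(progress(b, a), a, o)]
                                for o in obs(progress(b, a), a)))
    return V, policy
```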
AO*
% G and Ĝ are the explicit and best partial graphs, initially empty; V_0 is the heuristic function.
Initialization
Insert node h_0 in G where h_0 is the empty history.
Initialize V(h_0) := V_0(b_0) where V_0 is an admissible heuristic and b_0 the initial belief.
Initialize the best partial graph Ĝ to G.
Loop
Select a non-terminal tip h from the best partial graph Ĝ. If there is no such node, Exit.
Expand h in G: for each a ∈ A(b), where b is the belief associated with h, add node
(h, a) as a child of h, and for each observation o possible in b_a, add node (h, a, o)
as a child of (h, a). Initialize the values V(h, a, o) to the heuristic values V_0(b_a^o).
Update h and its ancestor AND and OR nodes in G, bottom-up, as
V(h, a) := c(a, b) + max_o V(h, a, o) and V(h) := min_{a∈A(b)} V(h, a).
Mark the best action in the ancestor OR-nodes h to an action a with V(h) = V(h, a),
maintaining the marked action if it is still best.
Recompute the best partial graph Ĝ by following the marked actions in G.
Figure 5.1: AO* for Computing History-based Policies for Partially Observable Problems.
2002]. More recent approaches have appealed to the translations developed for mapping conformant
into classical planning problems (Section 4.2). Indeed, the delete-relaxation of a partially observable
problem has a conformant solution once the preconditions of actions are pushed in as additional con-
ditions of the actions’ conditional effects [Hoffmann and Brafman, 2005]. Such relaxations have the
advantage that they do not need to assume that the information is complete [Albore et al., 2009, 2011].
We have considered an exhaustive dynamic programming method (Value Iteration) for finding
policies over a potentially cyclic belief space, and a heuristic search method (AO*) for finding policies
over the acyclic space of histories. The two types of methods are not incompatible, however. More
recent algorithms like LAO* [Hansen and Zilberstein, 2001] and RTDP [Barto et al., 1995] manage
to get the best of both worlds, and variations of these algorithms can be used to compute optimal and
non-optimal belief-based policies in an incremental fashion using heuristics. We will consider such
algorithms in the next chapter.
As an illustration, consider the DET-Ring domain [Cimatti et al., 2004] depicted in Figure 5.2,
where an agent can move forward or backward along a ring with n rooms. Each room has a window
that can be opened, closed, or locked when closed. Initially, the status of the windows is not known,
the agent does not know his initial location, and the goal is to have all windows locked. A plan for
this deterministic conformant problem is to repeat n times the actions (close, lock, fwd), skipping
the last fwd action (alternatively, fwd can be replaced by the action bwd throughout). The state
variables for the problem encode the agent location Loc ∈ {1, ..., n}, and the status of each window,
W(i) ∈ {open, closed, locked}, i = 1, ..., n. The location variable Loc is (causally) relevant to each
window variable W(i), but no window variable W(i) is relevant to Loc or to W(k) for k ≠ i. The
largest contexts are thus those for the window variables, which have size 2. As a result, the width of the
domain is 2, which is independent of the number of variables for the problem that grows linearly with
Figure 5.2: Ring problem with n windows that must be closed and locked. Initially, the agent does not know its location or the status of the windows. In NON-DET-Ring, each time the agent moves, the unlocked windows open or close non-deterministically. In another variation of the problem, the agent needs a key, whose initial position is not known, to lock the windows.
n. This means that belief tracking for this problem can be done in quadratic time, since there are n
contexts that need to be tracked, each of size O(n), as the domain size for the window variables is
constant. NON-DET-Ring is a variation of the domain where any movement of the agent, fwd or
bwd, has a non-deterministic effect on the status of all windows that are not locked, capturing the
possibility of external events that can open or close unlocked windows. This non-determinism has no
effect on the relevance relation among the variables, as Loc was already relevant to each variable W(i).
As a result, the change has no effect on the contexts or on the domain width, which remains bounded
and equal to 2. A further variation involves a key, whose initial position is unknown, that is now needed
to lock the windows. The agent may then perform a pick action that grabs the key when the key and the agent
are in the same room. In all these variations, the problem width remains bounded and small. As a
result, the belief tracking task for planning can be accomplished in low-order polynomial time even if
the number and size of the beliefs are exponential in the number of rooms.
APPROXIMATIONS
Factored belief tracking is complete for planning and exponential in the problem width, yet this is still
not good enough when problems have a large width. For such cases, however, it has been shown that it
is possible to obtain meaningful approximations that are sound, polynomial, and powerful, even if not
necessarily complete [Bonet and Geffner, 2013]. e idea is the consideration of a larger collection of
projected subproblems PX , each one involving a smaller set of variables. e algorithm being sound
means that when a literal X D x is reported as true or false after an execution, it is really true or false;
while the algorithm being incomplete means that the literal X D x may fail to be reported as true or
false after an execution when X D x is really true or false in the true belief. In the approximation, the
variables X range not only over preconditions and goal variables, but also over observable variables,
while the state variables that make it into the projected problem PX are only those that are causally
relevant to X . e result of this alternative decomposition is that there are more projected problems
PX but of smaller size whose local beliefs bX can be tracked more efficiently. In this scheme, however,
a state variable Y may be involved in two subproblems PX and PX 0 , such that Y D y is known to
be true in bX but not known to be true in bX 0 . e second step in the approximation is to enforce
a local form of consistency among the local beliefs, an operation that can be achieved in polynomial
time. The resulting approximation algorithm has been used successfully, in combination with simple
heuristics, for solving large instances of domains like Minesweeper, Battleship, and Wumpus, where
belief tracking is key [Bonet and Geffner, 2013].
That is, V_min^π(s) measures the cost from s to the goal under the assumption that it is the agent rather
than nature who chooses the successor state s' ∈ F(a, s) that follows an action a = π(s) in
each state s. In particular, the cost V_min^π(s) is finite when the agent can get from s to the goal following π
if "lucky" enough, while V_min^π(s) is infinite when no amount of luck would help the agent, as there
are no state trajectories linking s to the goal while following the policy π.
It turns out that π is a strong cyclic policy for a problem P with initial state s_0 iff the policy π
is such that over all the states s that are reachable from s_0 following π, V_min^π(s) is finite. The set of
states reachable from s_0 and π is the minimal set of states S' that includes s_0 and any state s' such that
s' ∈ F(a, s) for s ∈ S' and a = π(s). In other words, π is strong cyclic when it drives the agent to
states s all of which are separated from the goal by a finite trajectory s_1, s_2, ..., s_n such that s_1 = s, the
state s_n is a goal state, and s_{i+1} ∈ F(a, s_i) for a = π(s_i), i = 1, ..., n−1. It is easy to show indeed
that infinite executions that feature a state s_i in the trajectory an infinite number of times, but do not
feature the successor state s_{i+1} an infinite number of times, cannot be fair.
The simplification of the problem that underlies the optimistic cost function V_min^π has been
used as a source of heuristics for non-deterministic and MDP problems, where it is called the min-min
relaxation [Bonet and Geffner, 2000, 2005]. It is also closely related to a different relaxation used in FF-
Replan for solving MDPs, called the deterministic relaxation [Yoon et al., 2007]. Ignoring probabilities
for the moment and focusing on semantics rather than on syntax, the min-min relaxation replaces each
non-deterministic action a by deterministic actions a_1, ..., a_m, each one of which picks one of the possible
outcomes of a, so that for any states s and s', s' ∈ F(a, s) is true in the non-deterministic problem
iff s' = f(a_k, s) for some action a_k in the deterministic problem. When this relaxation is done at the
syntactic level, it produces a classical planning problem where the uncertainty about non-deterministic
transitions is now controlled by the agent. Indeed, it is easy to see that the cost function V_min(s) is
finite when such a classical planning problem has a solution from the state s.
This all suggests two methods for computing strong cyclic policies for non-deterministic but
fully observable problems P. A purely semantic and exhaustive method is to compute first, via Value It-
eration, the optimal cost function V_min(s) for the min-min relaxation, where V_min(s) = min_π V_min^π(s).
Then the states s for which V_min(s) = ∞ are removed from the problem, and the actions a that can
possibly lead to such states from states s' are removed from the sets A(s'). This process of computing
the value function V_min and pruning the action sets is iterated until the set of states s and the sets A(s)
of applicable actions do not change further.¹ If the initial state s_0 is removed in the process, the prob-
lem has no strong cyclic solution; otherwise, the policy that is greedy in the value function V_min computed
last is one such solution [Daniele et al., 1999].
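The semantic method can be written compactly: compute V_min over the relaxation in which the agent also picks the outcome, prune states with infinite value and the actions that may reach them, and repeat. The Python sketch below assumes an explicit-state interface (S, A(s), F(a, s), a goal test, unit action costs) and is an illustration of the iteration described in the text, not of any particular planner.

```python
def strong_cyclic_policy(S, A, F, is_goal, s0):
    """Strong cyclic policy via iterated min-min Value Iteration and pruning,
    as in the semantic method described above (unit action costs assumed).
    Returns a dict mapping non-goal states to actions, or None if no policy exists."""
    S = set(S)
    A = {s: set(A(s)) for s in S}
    V = {}
    while True:
        # V_min: the agent chooses both the action and the outcome in F(a, s)
        V = {s: 0.0 if is_goal(s) else float("inf") for s in S}
        changed = True
        while changed:
            changed = False
            for s in S:
                if is_goal(s):
                    continue
                v = min((1.0 + min((V[t] for t in F(a, s) if t in S), default=float("inf"))
                         for a in A[s]), default=float("inf"))
                if v < V[s]:
                    V[s] = v
                    changed = True
        dead = {s for s in S if V[s] == float("inf")}
        if not dead:
            break
        S -= dead
        if s0 not in S:
            return None            # no strong cyclic solution
        for s in S:                # prune actions that may lead to removed states
            A[s] = {a for a in A[s] if all(t in S for t in F(a, s))}
    # greedy policy in V_min: pick the action whose best possible outcome is best
    return {s: min(A[s], key=lambda a: min(V[t] for t in F(a, s)))
            for s in S if not is_goal(s)}
```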
Alternatively, one can compute strong cyclic plans using classical planners over the deterministic
relaxation P' of the problem P. Let P'(s) be the classical problem obtained from P' by setting the
initial situation to s. Define then a complete state-plan (SP) pair ⟨S', Σ⟩ as a set of states S', including
¹ An optimization is to select the states to prune from those which are reachable from the initial state s_0 with the policy that
is greedy in V_min. The iteration can be terminated when there are no such states.
the initial problem state s_0, along with a set Σ of classical plans, one for the problem P'(s) for
each state s ∈ S'. The SP pair ⟨S', Σ⟩ is consistent when the plans in Σ that pass through a state
s' ∈ S' all apply the same action from P in s'. The expression Σ(s') is used to denote this action. In
particular, if that action is one of the determinizations a_k of an action a in P, Σ(s') denotes a. The consistent SP pair
⟨S', Σ⟩ is closed when a state s' is in S' if there is a state s in S' such that s' ∈ F(a, s) for a = Σ(s). It
can then be shown that the partial policy π defined as π(s) = Σ(s) for complete SP pairs ⟨S', Σ⟩ that
are consistent and closed is a strong cyclic policy for P. This means that a strong cyclic policy for P can be
computed using classical planners incrementally, starting with the initial incomplete SP pair ⟨S', Σ⟩
where S' = {s_0} and Σ = ∅. For this, classical plans are added to Σ to make the pair complete, and
states are added to S' to make the pair closed. The state-plan pair is kept consistent by forcing the
classical planner to respect the partial policy encoded by the pair. This can be achieved by adjusting
the deterministic relaxations incrementally, or by modifying the classical planner used. An algorithm
of this form will compute strong cyclic policies backtrack-free in problems with no dead-ends, but
may have to backtrack otherwise. The first use of classical planners for computing strong cyclic plans
in this way is due to Kuter et al. [2008], with recent refinements due to Fu et al. [2011] and Muise et al.
[2012]. These are all offline algorithms. The planner FF-Replan, mentioned above and to be discussed
again in the next chapter, can be regarded as an online version of these algorithms.
CHAPTER 6
• an initial state s_0 ∈ S,
• transition probabilities P_a(s'|s) for s' being the next state after doing the action a ∈ A(s) in the
state s ∈ S, and
• positive action costs c(a, s) for applying action a ∈ A(s) in the state s ∈ S.
The planning task over Goal MDPs is to come up with an action strategy for reaching the goal with
certainty given the uncertain effects of the actions and the observations gathered. The solution form for
Goal MDPs cannot thus be a fixed action sequence as in classical planning; it must take observations
into account. This is simple to do in MDPs, however, where observations are over full states and the
dynamics and costs are Markovian, meaning that future states and costs depend on the current state
but not on the previous history. The result is that the choice of the action to do next in MDPs just
needs to take into account the last observation, and the solution form for MDPs is a function mapping
(the observed) states into actions. These functions are called closed-loop control policies or simply policies,
denoted by the symbol π. We will assume for now that a policy π maps every non-goal state s into
an action a ∈ A(s). Policies of this type are said to be deterministic and stationary. A stochastic policy
π, on the other hand, is a function that maps states into probability distributions over actions, and a
non-stationary policy is a function of both state and time. Stochastic and non-stationary policies can
be used for controlling MDPs, but they are not strictly needed except in the setting of finite-horizon
MDPs, where optimal policies can be non-stationary.
A (deterministic and stationary) MDP policy π and state s define a probability for every state
trajectory ⟨s_0, s_1, ..., s_{n+1}⟩ given by the product

P(s_0|s) · P_{a_0}(s_1|s_0) · P_{a_1}(s_2|s_1) · · · P_{a_n}(s_{n+1}|s_n)   (6.1)

where a_i = π(s_i) is the action dictated by the policy π in the state s_i, P_{a_i}(s_{i+1}|s_i) is the state transition
probability, and P(s_0|s) is 1 if s_0 = s and else is 0.
The accumulated cost of a state trajectory ⟨s_0, s_1, ..., s_{n+1}⟩ given a policy π, a_i = π(s_i), is given
in turn by the sum

c(a_0, s_0) + c(a_1, s_1) + ... + c(a_n, s_n).   (6.2)
The expected cost to reach the goal from state s using the policy π, denoted as V_π(s), stands for the sum
of the accumulated costs of the different state trajectories that are possible given π, weighted by their
probabilities. The expected cost function V_π can also be characterized as the solution to a set of linear
equations. For this, it is convenient to assume that goal states are absorbing and cost-free, meaning that
some action a is applicable in each goal state s, and that such applicable actions a in a goal state s have
zero costs and null effects; i.e., c(a, s) = 0 and P_a(s|s) = 1. Under the assumption that every policy
selects one of these "dummy" actions in each goal state, the expected cost of policy π from the state
s, V_π(s), can be defined by the expression

V_π(s) = E_s [ Σ_{i≥0} c(π(X_i), X_i) ]   (6.3)

where X_i is a random variable that represents the state at time i, and E_s[·] is the expectation with
respect to the probability distribution on state trajectories that start in the state s given by (6.1). Moving
the first term of the sum out of the expectation, the following fixed point equation is obtained

V_π(s) = c(π(s), s) + Σ_{s'∈S} P_{π(s)}(s'|s) V_π(s')   (6.4)

that defines the function V_π as the solution of a system of |S| linear equations with the border condi-
tion V_π(s) = 0 for all goal states s.
It is possible to show that a policy π for a Goal MDP has a finite expected cost V_π(s) if and
only if, starting in the state s, the application of the policy π leads to a goal state with probability 1.
A policy π that leads to the goal with certainty for any possible initial state is called a proper policy.
A necessary and sufficient condition for a policy π to be proper is that for any state s, there is a finite
state trajectory ⟨s_0, s_1, ..., s_{n+1}⟩, starting in the state s_0 = s and ending in a goal state s_{n+1}, such that
all the state transitions in the trajectory are possible given π; i.e., P_{π(s_i)}(s_{i+1}|s_i) > 0 for i = 0, ..., n.
Notice that the exact value of these probabilities does not matter as long as they are different from
zero. This explains the correspondence between the proper policies in the probabilistic setting, and
the strong cyclic policies in the non-deterministic setting analyzed in Section 5.6 that do not involve
probabilities at all.
We will consider Goal MDPs where there are no dead-ends, i.e., states from which the goal
cannot be reached. Formally, dead-ends are states s such that there is no state trajectory ⟨s_0, ..., s_{n+1}⟩
with s_0 = s, goal state s_{n+1}, and actions a_0, ..., a_n such that the transition probabilities P_{a_i}(s_{i+1}|s_i)
are all positive for i = 0, ..., n. Clearly, if s is a dead-end, V_π(s) is infinite for any policy π, and
alternatively, if there are no dead-ends, there must be a policy that is proper. We will relax the no
dead-ends assumption for Goal MDPs when considering methods that compute partial policies.
A policy π is optimal for state s if V_π(s) is minimum among all policies; i.e., V_π(s) =
min_{π'} V_{π'}(s). While the optimal policies for Goal MDPs are the policies that are optimal for the
given initial state s_0, we will follow the standard notion that identifies the optimal policies as the policies
that are optimal over all states. The cost function V_π for an optimal policy π is the optimal cost func-
tion V*, which can be characterized as the unique solution of Bellman's optimality equation [Bellman,
1957]:

V(s) = min_{a∈A(s)} [ c(a, s) + Σ_{s'∈S} P_a(s'|s) V(s') ]   (6.5)

for all non-goal states s, and V(s) = 0 for goal states. A deterministic, stationary optimal policy can
be obtained from the optimal cost function V*, from the greedy policy π_V:

π_V(s) = argmin_{a∈A(s)} [ c(a, s) + Σ_{s'∈S} P_a(s'|s) V(s') ]   (6.6)

with the value function V set to V*. The ties in (6.6) can be broken arbitrarily.
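For flat (explicit) Goal MDPs, Eqs. 6.5 and 6.6 translate directly into a Value Iteration routine. The Python sketch below assumes an explicit representation (a dictionary of transition distributions and a cost function) and stops at a fixed residual threshold; it illustrates the equations rather than the heuristic search algorithms discussed later.

```python
def value_iteration_mdp(S, goals, A, P, c, eps=1e-8):
    """Solve Bellman's optimality equation (6.5) for a Goal MDP without dead-ends.
    A(s): applicable actions; P[(a, s)]: dict s' -> probability; c(a, s): positive cost.
    Returns the (approximate) optimal values V and the greedy policy of Eq. 6.6."""
    V = {s: 0.0 for s in S}

    def q(a, s):
        return c(a, s) + sum(p * V[t] for t, p in P[(a, s)].items())

    while True:
        residual = 0.0
        for s in S:
            if s in goals:
                continue
            new_v = min(q(a, s) for a in A(s))
            residual = max(residual, abs(new_v - V[s]))
            V[s] = new_v
        if residual < eps:
            break
    policy = {s: min(A(s), key=lambda a: q(a, s)) for s in S if s not in goals}
    return V, policy
```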
Figure 6.1(a) depicts a simple example of a Goal MDP in which there is an agent that has to
navigate in a grid with obstacles from the cell marked A to the cell marked G. The agent can move one
cell at a time in each of the four directions as long as there are no obstacles, and the intended moves
succeed with high probability while leaving the agent in nearby cells with non-zero probability. The
panel (b) in Figure 6.1 shows a proper policy for the problem as the action π(s) to do at each of the
cells s, except at the cell G representing the goal state.
Figure 6.1: A Goal MDP in which an agent, initially at A, must reach the cell marked with G with certainty.
The grey cells are obstacles that cannot be crossed. Each action moves the agent in the intended direction
with high probability but can also leave the agent in a nearby cell with non-zero probability. Panel (b)
shows a proper policy for the problem, depicted as the action to be done in each non-goal state.
where a_i = π(s_i). Since the cost of any such trajectory is bounded from below and above by c/(1−γ)
and c̄/(1−γ) respectively, where c and c̄ are lower and upper bounds on the action costs c(a, s), it
follows that the expected costs V^π(s) for all policies π and states s are finite. The Bellman equation
characterizing this cost function is:

    V^\pi(s) = c(\pi(s), s) + \gamma \sum_{s' \in S} P_{\pi(s)}(s'|s) \, V^\pi(s'),    (6.8)

and similarly, the optimal cost function V* for Discounted Cost-based MDPs is given by the unique
solution of the optimality equation [Bertsekas, 1995, Puterman, 1994]:

    V^*(s) = \min_{a \in A(s)} \big[ c(a,s) + \gamma \sum_{s' \in S} P_a(s'|s) \, V^*(s') \big].    (6.9)
Discounted Reward-based MDPs are like Discounted Cost-based MDPs but with costs c(a, s)
replaced by rewards r(a, s), and the minimization of expected costs replaced by the maximization of
expected rewards. An example of a reward-based MDP is one where an agent gets a positive reward
of R every time it reaches a piece of food, which, once consumed, appears randomly at a different
location. A discount factor γ < 1 ensures that the maximum discounted reward accumulated never
exceeds R/(1−γ).
The results and algorithms for Goal MDPs and Stochastic Shortest-Path models apply with small
modifications to Discounted MDPs. Moreover, Discounted MDPs can be easily compiled into equivalent
Goal MDPs through a simple and efficient transformation [Bertsekas, 1995]. Thus, while certain problems,
like the one above, can be more naturally expressed as Discounted MDPs, Discounted MDPs are not
more expressive than Goal MDPs; indeed, the opposite seems to be true, as there is no known
method for transforming general Goal MDPs into Discounted MDPs.¹

¹ Some Goal MDPs can be transformed into equivalent Discounted MDPs, but the transformation is not general. For example,
a Goal MDP with action costs c(a, s) that are all uniform and equal to 1 can be transformed into an equivalent Discounted
Reward-based MDP with discount factor γ, 0 < γ < 1, and rewards r(a, s) = 0 over non-goal states and r(a, s) = 1 over
goal states. The goal states remain absorbing but not cost-free in this Discounted MDP. This transformation, however, does
not ensure equivalence when the action costs c(a, s) are not uniform.
In order to make precise the notion of equivalence among different types of MDPs [Bonet and
Geffner, 2009], let us say that two MDPs M and M', possibly of different types, are equivalent iff
they have the same set of non-goal states and actions (and hence the same space of policies), and for
any policy π, the value functions V^π_M and V^π_{M'} over M and M' are related by two constants α and β
through the linear equation

    V^\pi_M(s) = \alpha \, V^\pi_{M'}(s) + \beta    (6.10)

over all non-goal states s. The equation ensures that policies have the same relative ranking in M and
M'; i.e., V^π_M(s) < V^{π'}_M(s) iff V^π_{M'}(s) < V^{π'}_{M'}(s). The constant α cannot be zero, and is negative only when
M and M' have different signs: one being cost-based and the other reward-based.
For showing that a Discounted Reward-based MDP M can be transformed into an equivalent
Goal MDP M', one can show 1) that M is equivalent to a Discounted Reward-based MDP M_1 that
is like M but with a negative constant R added to all rewards to make them all negative, 2) that M_1
is equivalent to a Discounted Cost-based MDP M_2 where these negative rewards are transformed
into positive costs, and 3) that M_2 is equivalent to a Goal MDP M' that is like M_2 but with a new
(absorbing, cost-free) goal state t added, such that the transition probabilities P' in M' are expressed
in terms of the transition probabilities P of M, M_1, and M_2 as:

    P'_a(s'|s) = \begin{cases} \gamma \, P_a(s'|s) & \text{if } s' \neq t \\ 1 - \gamma & \text{if } s' = t. \end{cases}    (6.11)
In this expression, s ranges over the states in the original discounted model M, and a is an action
applicable in s. Notice that in the resulting Goal MDP, every policy is proper, as every applicable
action a in each non-goal state s maps s into the goal state t with the non-zero probability 1 − γ. The
equivalence between the Discounted Reward-based MDP M and the Goal MDP M' then follows from
the relation between the value functions V_M and V_{M'}, which satisfies (6.10) for α = −1 and β = −R/(1−γ).
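A minimal sketch of this compilation is shown below, assuming the discounted model is given as hypothetical Python dictionaries; the names and data layout are illustrative, not from the book.

```python
def discounted_to_goal_mdp(states, actions, P, r, gamma, R):
    """Compile a Discounted Reward-based MDP into a Goal MDP (cf. Eq. 6.11).
    P[a][s] is {s2: prob}; r[(a, s)] are rewards; R is a negative constant
    chosen so that r(a, s) + R < 0 for all a, s (assumption of the sketch)."""
    t = "GOAL"                                   # new absorbing, cost-free goal state
    P2, c2 = {}, {}
    for a in actions:
        P2[a] = {}
        for s in states:
            c2[(a, s)] = -(r[(a, s)] + R)        # shifted rewards turned into positive costs
            # scale original transitions by gamma, send the remaining mass to the goal
            P2[a][s] = {s2: gamma * p for s2, p in P[a][s].items()}
            P2[a][s][t] = 1.0 - gamma
    return states + [t], P2, c2, t
```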
FINITE-HORIZON MDPS
The MDPs above are said to be infinite-horizon, as the costs and rewards accumulate over a horizon that
is not bounded a priori. Finite-horizon MDPs, on the other hand, are concerned with the accumulation
of costs or rewards over a fixed number H of stages, called the problem horizon. Finite-horizon
MDPs can be converted into infinite-horizon MDPs by simply augmenting the problem states s with
the horizon left d. Thus, if s_0 is the initial state of the finite-horizon MDP M with horizon H, then
the equivalent infinite-horizon MDP M' will have the pair ⟨s_0, H⟩ as the initial state, the pairs ⟨s, 0⟩
as the goal states, and transition probabilities P'_a(⟨s', d−1⟩ | ⟨s, d⟩) equal to the transition probabilities
P_a(s'|s) in M. The same transformation is used for costs. The resulting Goal MDP M' has an
important characteristic; namely, it is acyclic, meaning that the probability of any state trajectory starting
and ending in the same state is zero. This is because time moves forward: states ⟨s, d⟩ can
only transition to states ⟨s', d−1⟩, and this only when d ≠ 0. Dynamic programming procedures
like (Asynchronous) Value Iteration, to be considered next, can be used to solve finite-horizon MDPs,
and more generally acyclic MDPs, very efficiently, in a single pass over all the states, by considering the
states in order: first the states ⟨s, d⟩ with d = 1, then the states with d = 2, and so on, until reaching
the states with d = H. Still, even a single pass over all the states may not be computationally feasible.
We will thus also consider incremental, heuristic search algorithms for solving finite-horizon MDPs
and their use in online planning over general infinite-horizon MDPs (Section 6.4).
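The single-pass computation over the augmented states ⟨s, d⟩ can be sketched as follows; a minimal example assuming the finite-horizon model is given as hypothetical Python dictionaries.

```python
def backward_induction(states, actions, P, c, H):
    """Solve a finite-horizon MDP in one pass over the augmented states <s, d>,
    processing d = 1, ..., H. P[a][s] is {s2: prob}; c[(a, s)] are costs."""
    V = {(s, 0): 0.0 for s in states}            # horizon 0: the goal states <s, 0>
    pi = {}
    for d in range(1, H + 1):                    # <s, d> only depends on states <s', d-1>
        for s in states:
            q = {a: c[(a, s)] + sum(p * V[(s2, d - 1)]
                                    for s2, p in P[a][s].items())
                 for a in actions[s]}
            a_best = min(q, key=q.get)
            V[(s, d)] = q[a_best]
            pi[(s, d)] = a_best
    return V, pi
```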
VALUE ITERATION
Value Iteration (VI) is a method for computing the optimal cost function V*, which once plugged
into the greedy policy π_V in place of V, yields the optimal policy π*. The optimal cost function V* is
the unique solution to the optimality equation

    V(s) = \begin{cases} 0 & \text{if } s \text{ is a goal state} \\ \min_{a \in A(s)} \big[ c(a,s) + \sum_{s' \in S} P_a(s'|s) V(s') \big] & \text{otherwise.} \end{cases}    (6.12)

Value Iteration solves this equation by setting V(s) = 0 for goal states s, initializing the value of
non-goal states arbitrarily, and then using Eq. 6.12 as an update

    V(s) := \min_{a \in A(s)} \big[ c(a,s) + \sum_{s' \in S} P_a(s'|s) V(s') \big]    (6.13)
which is performed in parallel over all non-goal states. This operation, which is implemented by means
of two value vectors V and V', is called a full or parallel DP update. Value Iteration performs these parallel
updates repeatedly. In the limit, the value vector V converges to the solution of Bellman's optimality
equation (6.12), and hence to the optimal cost function V*. Since the convergence is asymptotic, Value
Iteration is stopped when the value vector V is such that the maximum difference between the expressions
on the left- and right-hand sides of (6.12) is small enough. If this difference, called the residual
and defined as

    Res_V \stackrel{def}{=} \max_{s \in S} \big| V(s) - \min_{a \in A(s)} \big[ c(a,s) + \sum_{s' \in S} P_a(s'|s) V(s') \big] \big|,    (6.14)

is sufficiently small, the policy π_V greedy with respect to V is optimal. More generally, the value of
the residual can be used to bound the loss V^{π_V}(s_0) − V^*(s_0) that results from following the greedy
policy π_V from the initial state s_0 rather than an optimal policy. For Discounted MDPs, this loss is no
greater than 2γε/(1−γ) if Res_V < ε. While the loss can also be bounded in SSPs and Goal MDPs,
the bound cannot be expressed in such a closed form [Bertsekas, 1995].
In a variation of VI, known as Asynchronous Value Iteration, the update in Eq. 6.13 is not
performed over all states simultaneously but over some selected states. Provided that every state is
updated infinitely often, Asynchronous VI also converges asymptotically to the optimal cost function
V* [Bertsekas and Tsitsiklis, 1989].
VALUE ITERATION
Starts with a value function stored in a vector V with V(s) = 0 for goal states s
repeat
    flag := true
    for each non-goal state s do
        new-value := min_{a∈A(s)} [c(a,s) + Σ_{s'∈S} P_a(s'|s) V(s')]
        if |V(s) − new-value| > ε then
            V(s) := new-value
            flag := false
        end if
    end for
until flag = true

Figure 6.2: Simple version of Asynchronous Value Iteration implemented using a single vector, that outputs a
value function V with residual Res_V smaller than the parameter ε, ε > 0.
This implies, among other things, that the simple variant of Value
Iteration, implemented using a single value vector that is updated one state at a time, is a form of
Asynchronous VI that also converges to V*. This version of VI is known as Gauss-Seidel VI. Code
for a version of VI that delivers a value function with residual less than a given ε > 0 is shown in Figure 6.2.
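A minimal Python sketch of the single-vector variant of Figure 6.2 follows, assuming the same hypothetical dictionary-based model used in the earlier sketches; it is an illustration, not the book's code.

```python
def asynchronous_vi(states, goals, actions, P, c, eps=1e-6):
    """Single-vector Value Iteration: sweep the states until all residuals drop below eps.
    P[a][s] is {s2: prob}, c[(a, s)] are positive costs, goal states have value 0."""
    V = {s: 0.0 for s in states}
    while True:
        converged = True
        for s in states:
            if s in goals:
                continue
            new_value = min(c[(a, s)] + sum(p * V[s2] for s2, p in P[a][s].items())
                            for a in actions[s])
            if abs(V[s] - new_value) > eps:
                V[s] = new_value
                converged = False
        if converged:
            return V
```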
A suitable version of Asynchronous Value Iteration can be used to solve acyclic MDPs, including
finite-horizon MDPs, in one pass. Recall that an MDP is acyclic when all state trajectories that start
and end in the same state have zero probability, and that finite-horizon MDPs can be cast as infinite-
horizon, acyclic MDPs where the state is extended with the information of the horizon-to-go.
All that is needed for solving acyclic MDPs in a single pass is to order the updates so that a state s
is updated before a state s' when there is a state trajectory from s' to s in the MDP with positive
probability. In the case of finite-horizon MDPs, it suffices to update states ⟨s, d−1⟩ before states
⟨s', d⟩. It is simple to prove by induction that the values computed in one such pass are optimal. The
procedure is also known as backward induction [Bertsekas, 1995].
POLICY ITERATION
The other standard dynamic programming algorithm for solving MDPs is Policy Iteration [Bertsekas,
1995, Howard, 1971, Puterman, 1994]. Whereas VI iterates over value functions in order to compute
the optimal value function V*, Policy Iteration (PI) iterates over policies, each one strictly better than
the one before. Since the total number of (deterministic) policies is finite, PI converges to an optimal
policy in a bounded number of iterations.
Policy Iteration applies two operations in sequence, starting with a proper policy π = π_0. First,
it computes the value V^π(s) of the policy π over all states. It then finds a new policy π' that is proper
and improves π if π is not optimal. The first step, called Policy Evaluation, is done by solving the set
of |S| linear equations given by (6.4) that characterize the value function V^π. The second step, called
Policy Improvement, uses the value function V^π computed in the Policy Evaluation step to see if there
are states s where the actions dictated by the policy π are not best, under the assumption that after
this first step, the policy π will be followed. For this, Q-factors of the form

    Q^\pi(a, s) = c(a, s) + \sum_{s' \in S} P_a(s'|s) \, V^\pi(s')    (6.15)
are computed for each non-goal state s and action a ∈ A(s). The policy π' that is like π except in the
states s for which the following strict inequality holds

    \min_{a \in A(s)} Q^\pi(a, s) < V^\pi(s),    (6.16)

where π'(s) = argmin_{a∈A(s)} Q^π(a, s), is a policy that strictly improves π; i.e., V^{π'}(s) ≤ V^π(s) with
the inequality being strict over some states. If there are no such states, π must be optimal, and Policy
Iteration terminates.
Policy Iteration produces a sequence of policies π_0, ..., π_n, starting with an initial proper policy
π_0, such that each policy π_{i+1} is proper and strictly improves the previous one. The last policy π_n is
a policy that cannot be improved further and is optimal. The length of the sequence is bounded by
the total number of deterministic and stationary policies. A proper policy is needed initially in Policy
Iteration, as the Policy Improvement step may not work when the expected costs are not bounded. For
example, consider a Goal MDP with a single non-goal state s and two actions a and b such that a maps
s into itself with probability 1, and b maps s into itself with probability 1/2 and into the goal with
probability 1/2. If the initial policy π is such that π(s) = a, then V^π(s) is infinite, and therefore both
Q^π(a, s) and Q^π(b, s) are infinite as well, so that the policy π'(s) = b does not appear to improve π
even though π' is optimal and π is not. This cannot happen, however, when π is a proper policy. Methods for
computing proper or strong cyclic policies are discussed in Section 5.6. Alternatively, the stochastic policy that
assigns to each state s an action in A(s) with probability 1/|A(s)| is always proper in problems with
no dead-ends. Code for a simple implementation of Policy Iteration is shown in Figure 6.3.
POLICY ITERATION
Starts with a proper policy π and terminates with π being optimal
repeat
    Compute V^π and store it in vector V
    Let π' := π be the new policy initialized to π
    Let change := false
    for each state s ∈ S do
        for each action a ∈ A(s) do
            Q(a, s) := c(a, s) + Σ_{s'∈S} P_a(s'|s) V(s')
        end for
        Let Q := min_{a∈A(s)} Q(a, s)
        if Q < V(s) then
            π'(s) := argmin_{a∈A(s)} Q(a, s)
            change := true
        end if
    end for
    if change = true then π := π'
until change = false
Figure 6.3: Policy Iteration for Goal MDPs. The algorithm terminates when π cannot be improved further and
hence is optimal.
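The loop of Figure 6.3 can be sketched in a few lines of Python, reusing a policy-evaluation routine like the one shown earlier for Eq. 6.4; the dictionary-based data layout is the same illustrative one, not the book's.

```python
def policy_iteration(states, goals, actions, P, c, pi):
    """Policy Iteration for Goal MDPs (sketch of Figure 6.3). `pi` must be an
    initial proper policy; evaluate_policy solves the linear system (6.4)."""
    while True:
        V = evaluate_policy(states, goals, pi, P, c)     # Policy Evaluation
        changed = False
        for s in states:
            if s in goals:
                continue
            Q = {a: c[(a, s)] + sum(p * V[s2] for s2, p in P[a][s].items())
                 for a in actions[s]}
            a_best = min(Q, key=Q.get)
            if Q[a_best] < V[s] - 1e-12:                 # Policy Improvement
                pi[s] = a_best
                changed = True
        if not changed:
            return pi, V
```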
total, it suffices for π to be defined over the initial state of the MDP, s_0, and over the states s that can
be reached from s_0 while following π. This set of states, denoted as S(π, s_0), is the smallest set that
includes s_0 and is closed in the following sense: if s is in S(π, s_0), and P_a(s'|s) > 0 for a = π(s), then
s' must be in S(π, s_0) as well. We say that such partial policies are closed with respect to s_0, or simply,
that they are closed, leaving the initial state s_0 implicit. A state s is reachable from s_0 with policy π iff
s ∈ S(π, s_0).
The main idea underlying heuristic search algorithms for MDPs can be expressed as follows.
Let Res_V(s) be the residual of the value function V over a non-goal state s, defined as the difference
between the left- and right-hand sides of Bellman's optimality equation at the state s:

    Res_V(s) = \big| V(s) - \min_{a \in A(s)} \big[ c(a,s) + \sum_{s' \in S} P_a(s'|s) V(s') \big] \big|.    (6.17)

Bellman's result, stating that V* is the unique solution of the set of optimality equations and that
the policy π_{V*} greedy with respect to V* is optimal, can be rephrased as saying that if V is a value
function such that the residuals Res_V(s) are zero over all the states, then V is optimal. Heuristic
search algorithms for MDPs exploit a variation of this result that takes into account 1) the possibility
of the value function V being an admissible heuristic function, i.e., a lower bound V ≤ V*, and 2) the
fact that many states will not be reachable from the initial state s_0 and the greedy policy π_V.
FIND-AND-REVISE
Starts with the value function V given by an admissible heuristic function h, h ≤ V*
repeat
    Find one or more states s reachable from s_0 with π_V such that Res_V(s) > ε
    Update V at those states s and possibly at other states
until no such states s are found

Figure 6.4: Find-and-Revise: computes a value function V with residuals bounded by ε over all the states reachable
with the greedy policy π_V from the initial state s_0. For sufficiently small ε, the resulting greedy policy π_V is
optimal for s_0.
In fact, provided that V is an admissible heuristic, it can be shown that the greedy policy π_V will
be optimal for the initial state s_0 if the residuals are zero over the states that can be reached with π_V from
s_0 [Bonet and Geffner, 2003a]; i.e.,²

    V \leq V^* \text{ and } Res_V(s) = 0 \text{ for all } s \in S(\pi_V, s_0) \ \text{ imply that } \pi_V \text{ is optimal for } s_0.    (6.18)

This is a simple but important result that says that there is no need for the value function V to converge
over all states for the greedy policy π_V to be optimal with respect to the initial state s_0; convergence over
the states that are reachable from s_0 by following the policy π_V suffices. For this, the value function
V must be admissible, else, as in A*, optimal solutions may be missed. As an example of this,
consider a deterministic Goal MDP with two non-goal states s_0 and s_1 such that action a maps s_0
into the goal at cost 5, while b maps s_0 into s_1, and s_1 into the goal, in both cases at cost 1. In this
problem, b is a better action than a in the state s_0. Yet, if the value function V is not admissible and
overestimates the value of the state s_1, e.g., by making V(s_1) = 10, the policy π_V greedy in V will
pick the action a rather than b in s_0, while the residuals Res_V(s) will be zero over all the non-goal
states s that are reachable from s_0 with the policy π_V. Actually, in this deterministic MDP, the same
suboptimal solution would result from A* with the same non-admissible heuristic function. Thus, the
admissibility of V in (6.18) is a necessary condition for the optimality of the policy π_V with respect to
the initial state s_0.
The principle expressed in (6.18) suggests a simple generic method for computing admissible
value functions V such that the residuals Res_V(s) over the states s reachable from the initial state
s_0 and the greedy policy π_V do not exceed a certain threshold ε > 0. The generic method, shown in
Figure 6.4, is called Find-and-Revise [Bonet and Geffner, 2003a]. Find-and-Revise proceeds in two
stages: it first searches for one or more states s in S(π_V, s_0) with residuals Res_V(s) > ε, and then
updates such states as in Asynchronous Value Iteration. This process is repeated until there are no
more such states. If the initial value function V = h is monotonic, the process terminates in at most
ε^{-1} Σ_{s∈S} [V*(s) − h(s)] iterations.
The notion of monotonicity or consistency is well known in heuristic search over directed graphs,
where it refers to heuristics h that satisfy the triangular inequality h(n) ≤ c(n, n') + h(n') for every
pair of nodes n and n' connected through an edge with cost c(n, n').

² We assume that a value function V determines a unique greedy policy π_V. This is easy to enforce by assuming a static ordering
among the actions so that ties in the choice of the action π_V(s) are broken using this ordering, e.g., preferring action a to b
if a precedes b in the ordering.
LRTA*
% Initial value function V given by admissible heuristic h
% Changes to V stored in a hash table
Let s := s_0
while s is not a goal state do
    Evaluate each action a ∈ A(s) as: Q(a, s) := c(a, s) + V(s') where s' = f(a, s)
    Select best action a := argmin_{a∈A(s)} Q(a, s)
    Update value V(s) := Q(a, s)
    Set s := f(a, s)
end while

Figure 6.5: Trial of Learning Real Time A*: LRTA* can be seen as an instance of Find-and-Revise for deterministic
MDPs and ε = 0. LRTA* would still converge to an optimal solution as an instance of Find-and-Revise if
trials were interrupted right after the first update that changes the value function.
The use of a monotonic heuristic in an algorithm like A* ensures that the evaluation function f(n) of the
nodes selected for expansion never decreases. In the probabilistic setting, the monotonicity of V translates
into the inequality V(s) ≤ c(a, s) + Σ_{s'∈S} P_a(s'|s) V(s') for each state s ∈ S and action a ∈ A(s). Since
updates preserve monotonicity (and admissibility), an initial monotonic value function guarantees that
state values V(s) never decrease after updates. Also, since these values are bounded from above by
the finite optimal costs V*(s), they cannot change more than (V*(s) − h(s))/ε times, where h is the
initial admissible and monotonic value function.
In the implementation of Find-and-Revise procedures, the initial value function V is usually
given by a heuristic function h, and changes in the value function are stored in a hash table. The search
step in Find-and-Revise can be implemented in O(|S|) time with a standard depth-first traversal that
keeps track of visited nodes. A version of Find-and-Revise that combines such a depth-first search with
a labeling procedure that marks a state s as solved when the residuals over all the states reachable
from s with the greedy policy π_V are bounded by ε is known as HDP, for heuristic dynamic programming
[Bonet and Geffner, 2003a]. We consider other Find-and-Revise variants below, starting with the
LRTA* algorithm for deterministic problems presented in Section 2.4.
LRTA*
Learning Real Time A* (LRTA*) is a simple but powerful online search algorithm for finding paths
in graphs [Korf, 1990]. As explained in Section 2.4, LRTA* evaluates the actions a applicable in the
current state s, starting with the initial state s_0, by computing the factors Q(a, s) = c(a, s) + V(s'),
where V is a value function initialized to a given heuristic h, and s' is the successor state s' = f(a, s).
LRTA* then chooses the action a that minimizes these Q(a, s) values, revises the value V(s) to Q(a, s),
and iterates in this way, setting s to s', until a goal state is reached (Figure 6.5).
LRTA* can be seen as an Asynchronous Value Iteration algorithm over a deterministic MDP
model, where the states that are updated are obtained by running simulations from the initial state,
selecting in each state s the action π_V(s) that is greedy in V.
RTDP
% Heuristic h is the initial value function V
% Changes to V stored in a hash table
Let s := s_0
while s is not a goal state do
    Evaluate each action a ∈ A(s) as: Q(a, s) := c(a, s) + Σ_{s'∈S} P_a(s'|s) V(s')
    Select best action a := argmin_{a∈A(s)} Q(a, s)
    Update value V(s) := Q(a, s)
    Select next state s' with probability P_a(s'|s) and set s := s'
end while

Figure 6.6: Trial of RTDP: RTDP generalizes LRTA* to MDPs by evaluating actions using the Q-factors
corresponding to MDPs, and by sampling the successor states.
Yet, Asynchronous Value Iteration is an
exhaustive algorithm, while LRTA* is not. Indeed, LRTA* can converge to an optimal (partial) policy
π_V without visiting most of the states in the problem. This is because LRTA* is not only an instance
of Asynchronous VI, but also of the general Find-and-Revise schema, and it thus exploits property (6.18),
taking advantage of both a known initial state s_0 and an initial admissible value function. Actually,
LRTA* would still converge to an optimal solution if trials were interrupted right after the first update
that changes the value of a state. The resulting algorithm would be a version of Find-and-Revise with
ε = 0, where a single state is revised in each iteration, found by executing the greedy policy π_V from the
initial state. The extra work done in each single trial by LRTA* is thus not necessary for convergence.
Moreover, in the presence of dead-ends in the problem, LRTA* can be trapped into a dead-end, while
this variant cannot, provided the problem has a solution.
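A minimal Python sketch of a single RTDP trial (Figure 6.6) follows, assuming the same hypothetical dictionary-based model used in the earlier sketches, a heuristic function h that is 0 at goal states, and a dict V caching updated values.

```python
import random

def rtdp_trial(s0, goals, actions, P, c, h, V):
    """One RTDP trial: greedy action selection, Bellman update at the visited state,
    and sampling of the successor. h(s) is used for states not yet stored in V."""
    value = lambda s: V.get(s, h(s))
    s = s0
    while s not in goals:
        Q = {a: c[(a, s)] + sum(p * value(s2) for s2, p in P[a][s].items())
             for a in actions[s]}
        a = min(Q, key=Q.get)                               # greedy action
        V[s] = Q[a]                                         # Bellman update at s
        succs, probs = zip(*P[a][s].items())
        s = random.choices(succs, weights=probs)[0]         # sample next state
    return V
```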
LAO*
In Section 5.3 we considered the AO* algorithm for solving acyclic AND/OR graphs [Nilsson, 1980].
LAO* [Hansen and Zilberstein, 2001] is an extension of AO* for solving cyclic AND/OR graphs,
whose solutions may contain cycles as well. Since Goal MDPs can be cast as cyclic AND/OR graphs
whose (optimal) solutions encode the (optimal) solutions of the Goal MDPs, LAO* can be used for
solving Goal MDPs, and hence Discounted MDPs. In the AND/OR graph corresponding to a Goal
MDP, the OR nodes correspond to the non-goal states s, the AND nodes correspond to the state-
action pairs (s, a) where a is an action applicable in s, and the terminal nodes are the goal states and the
states from which no action can be applied. The children of an OR node s are the AND nodes (s, a), and
the children of an AND node (s, a) are the terminal or OR nodes s' for which P_a(s'|s) > 0.
AO* explicates the implicit AND/OR graph incrementally, starting with the explicit graph G that contains
just the root node, and maintains the subgraph G* of G that encodes the optimal solution of G
under the assumption that the nodes in G that have not yet been explicated (i.e., had their children added to
G) are terminal nodes n with values V(n) given by a heuristic h(n). AO* then proceeds to pick one
of the tip nodes of G* that is not a terminal node of the original problem and expands it in G,
updating both G and its best solution subgraph G*. AO* terminates when the tip nodes in G* are all
terminal nodes. The solution to the AND/OR graph is then given by G*. The solution is optimal if
the heuristic h is admissible.
The best solution subgraph G* is obtained from G incrementally by propagating the values of
the last children added to G to their parents and ancestors. This propagation is a single pass of Value
Iteration (backward induction) that takes advantage of the acyclic structure of the AND/OR graph. If
the acyclic AND/OR graph represents an acyclic MDP, the value Q(a, s) of an AND node (s, a) is the
function Q(a, s) = c(a, s) + Σ_{s'} P_a(s'|s) V(s') of its children s', while the value V(s) of an OR node
is the minimum value Q(a, s) over its children. The best solution subgraph G* is updated by picking the
best child (s, a) of each OR node s encountered during the bottom-up propagation of values.
In the presence of cycles, this method for maintaining the best subgraph G* of G is no longer
correct. Indeed, if the AND/OR graph represents a cyclic MDP, a single pass of Value Iteration over
the ancestors of the nodes last added to the explicit graph G does not necessarily produce optimal
values. Instead, Value Iteration must be run until the residuals over such nodes are smaller than a given
parameter ε. This is precisely what LAO* does [Hansen and Zilberstein, 2001]. Since running Value
Iteration until convergence in each expansion step of LAO* is too time consuming, an alternative to
LAO*, called Improved LAO* or simply ILAO*, is normally used instead. The two key changes from
LAO* to ILAO* are that ILAO* expands all non-terminal tip nodes of G* in each step, and that the
values are propagated up from the new nodes by updating the node values once. Conveniently, both
operations can be done with a single depth-first traversal of the subgraph G* in which the updates
are done during backtracks. As a result, the computed values are no longer optimal over G, and
hence G* is not necessarily a best solution subgraph of G (so ILAO* is not a best-first algorithm
like AO* or LAO*). Thus, when the termination condition of AO* and LAO* is reached in ILAO*,
and G* contains no tip nodes that can be expanded (because such tips are terminal nodes), ILAO* runs
Value Iteration until the residuals over the nodes in G* are smaller than ε. If the optimal subgraph
G* does not change as a result, ILAO* terminates; else ILAO* continues using the revised subgraph
G*. ILAO* is not an exact instance of Find-and-Revise, as the expansion of tip nodes of G* does not
necessarily imply that their residuals exceed ε (although this will often be the case). Still, at termination,
the states in G*, which are those reachable with the greedy policy π_V from s_0, all have residuals that
are bounded by ε. Further details on heuristic search algorithms for MDPs can be found in the book
by Mausam and Kolobov [2012].
UCT
UCT is a Monte-Carlo Tree Search (MCTS) algorithm for solving finite-horizon MDPs and, more
generally, AND/OR trees [Chaslot et al., 2008, Kocsis and Szepesvári, 2006]. UCT has been
successfully used in a number of settings, including the game of Go [Gelly and Silver, 2007], real-time
strategy games [Balla and Fern, 2009], and general game playing [Finnsson and Björnsson, 2008].
Like the standard heuristic search algorithm AO* for acyclic AND/OR graphs, UCT builds an explicit
graph G incrementally. There are, however, four main differences between UCT and AO*. First,
UCT selects the tip node to expand in G by running a simulation from the root node, which may add
at most one new node to G. Second, UCT evaluates tip nodes by simulating a given base policy from
the node. Third, values are propagated up the tree by means of Monte-Carlo updates. Fourth, UCT has
no termination condition, and its optimality over finite-horizon MDPs is only asymptotic. We explain
these aspects of UCT in more detail below. Code for UCT is shown in Figure 6.7.
UCT consists of a sequence of stochastic simulations that start at the root node of the AND/OR
tree for the finite-horizon MDP. When a simulation reaches a node that is not in the explicit graph,
the node is added to the graph, and the heuristic value of the node is obtained by executing a given
base policy π from the node. The processing done by UCT is aimed at improving the quality of this
base policy at the root node. While the simulation traverses internal nodes of the explicit graph, the
successor states are sampled stochastically, as in RTDP, but the choice of the actions is not greedy on
the Q-values, but on the sum of the Q-values plus a bonus term equal to

    C \sqrt{2 \log N(s, d) / N(a, s, d)}    (6.19)

that ensures that all the applicable actions are tried infinitely often at suitable rates. In Eq. 6.19, C
is an exploration constant, and N(s, d) and N(a, s, d) are counters that track the number of simulations
that have passed through the node (s, d) in the tree, and the number of times that action a has been
selected at that node. If a has never been tried at s, N(a, s, d) = 0 and the bonus term is ∞, forcing
a to be selected unless there are other unexplored actions. The bonus term is based on a similar term
used in UCB [Auer et al., 2002], a regret-optimal algorithm for the multi-armed bandit problem.
The counters N(s, d) and N(a, s, d) are maintained for the nodes in the explicit graph only.
When a node (s, d) is generated that is not in the explicit graph, the node is added to the explicit
graph, the counters N(s, d), N(a, s, d), and Q(a, s, d) are allocated and initialized to 0, and a cost
c(π, s, d) is sampled by simulating the base policy π for H − d steps starting at s, and propagating
this sampled cost upward along the nodes in the simulated path. These values are not propagated
using full Bellman backups as in AO*, RTDP, or VI, but through Monte-Carlo backups that update
the current average with a new sampled value. For a successful use of a UCT-like algorithm in domain-
independent MDP planning, see Keller and Eyerich [2012].
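The UCB-style selection of Eq. 6.19 and the Monte-Carlo backup can be sketched as follows; a minimal, illustrative Python fragment in which sample(s, a) -> (next_state, cost), base(s), A(s), and is_goal(s) are assumed, hypothetical interfaces and d denotes the remaining horizon.

```python
import math

def uct(s, d, G, A, sample, base, C, is_goal):
    """One recursive UCT simulation for a finite-horizon, cost-based MDP (cf. Figure 6.7).
    G maps (s, d) to [N, N_a, Q_a]. Returns the sampled cost-to-go from (s, d)."""
    if d == 0 or is_goal(s):
        return 0.0
    if (s, d) not in G:
        G[(s, d)] = [0, {a: 0 for a in A(s)}, {a: 0.0 for a in A(s)}]
        cost, state = 0.0, s                 # rollout of the base policy for the remaining steps
        for _ in range(d):
            if is_goal(state):
                break
            state, step_cost = sample(state, base(state))
            cost += step_cost
        return cost
    N, N_a, Q_a = G[(s, d)]
    def score(a):                            # Q minus the exploration bonus of Eq. 6.19
        if N_a[a] == 0:
            return float("-inf")             # untried actions are selected first
        return Q_a[a] - C * math.sqrt(2 * math.log(N) / N_a[a])
    a = min(A(s), key=score)
    s2, step_cost = sample(s, a)
    total = step_cost + uct(s2, d - 1, G, A, sample, base, C, is_goal)
    G[(s, d)][0] = N + 1                     # Monte-Carlo backup of the new sample
    N_a[a] += 1
    Q_a[a] += (total - Q_a[a]) / N_a[a]
    return total
```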
ANYTIME AO*
Anytime AO* is a simple variation of the AO* algorithm for AND/OR trees aimed at bridging the
gap between AO* and UCT [Bonet and Geffner, 2012a]. Like UCT and unlike AO*, Anytime AO*
is anytime optimal even in the presence of non-admissible or random heuristics. A random heuristic is
a heuristic that corresponds to a random variable that can be sampled, such as the cost of a base policy.
On the other hand, Anytime AO*, like AO* and unlike UCT, has a clear termination condition that is
achieved when there are no more nodes to add to the explicit graph.
Anytime AO* is the result of two small changes in AO*. The first, designed to handle non-
admissible heuristics, is that rather than always selecting non-terminal tip nodes from the explicit
graph G that are part of the best partial solution graph G*, Anytime AO* selects non-terminal tip
nodes from G \ G* with some positive probability. The second change, designed for dealing with
random heuristics h, is that when the value V(s, d) of a tip node (s, d) is set to a heuristic h(s, d) that
is a random variable, such as the cost obtained by following a base policy for d steps from s, Anytime
AO* uses samples of h(s, d) until the node (s, d) is expanded. Until then, each time the value V(s, d)
is required, which occurs each time that a parent node of (s, d) is updated, a new sample of h(s, d)
is obtained and averaged with the previous samples. This is implemented in the standard fashion by
incrementally updating the value V(s, d) using a counter N(s, d) and the new sample.
Anytime AO* has been tried as an online planning algorithm over different types of MDPs,
including problems like the Canadian Traveller Problem, where it appears to do as well as UCT, which
represents the state of the art [Eyerich et al., 2010]. An advantage of Anytime AO* is that it can
potentially benefit from a number of techniques developed for speeding up A* and AO*, like the use
of weights W > 1 on the heuristic term [Chakrabarti et al., 1988]. Two advantages of UCT, on the
other hand, are that it is a model-free method that can work perfectly well with a simulator rather than
a model, and that it is less affected by the large branching factors that obtain when the number of states s'
that may result from doing an action a in a state s is large.
UCT(s, d)
if d = 0 or s is a goal state then return 0
if (s, d) is not in the explicit graph G then
    Add node (s, d) to the explicit graph G
    Initialize N(s, d) := 0 and N(a, s, d) := 0 for all a ∈ A(s)
    Obtain sampled accumulated cost c(π, s, d) by simulating the base policy π for
        H − d steps starting at s
    return c(π, s, d)
if node (s, d) is in the explicit graph G then
    Bonus(a) := C √(2 log N(s, d)/N(a, s, d)) if N(a, s, d) > 0, else ∞, for a ∈ A(s)
    a := argmin_{a∈A(s)} [Q(a, s, d) − Bonus(a)]
    Sample state s' with probability P_a(s'|s)
    Let Cost := c(a, s) + UCT(s', d − 1)
    Increment N(s, d) and N(a, s, d)
    Set Q(a, s, d) := Q(a, s, d) + [Cost − Q(a, s, d)]/N(a, s, d)
    return Cost

Figure 6.7: UCT for finite-horizon cost-based MDPs: H is the horizon, G is the explicit graph (initially
empty), π is the base policy, and C is the exploration constant. The procedure is called over the node (s, H) where s
is the current state. When time runs out, UCT selects the action applicable at s that minimizes Q(a, s, H).
where γ stands for the discount factor and the α_i parameters represent the learning rate. The term
used in the update is a sampled version of the expression that would be used to update the Q(a_i, s_i)
factor if the probability and reward parameters were known.³ The reason for learning Q(a, s) values as
opposed to V(s) values is that the latter cannot be used for selecting actions without knowing the
probabilities and rewards. The key result for this type of stochastic update is that it converges to
the optimal Q-values in the limit when all actions are tried in all states sufficiently often, provided
that the constants α_i comply with basic requirements [Bertsekas and Tsitsiklis, 1996, Sutton and Barto,
1998, Szepesvári, 2010]. In Q-learning, this convergence is achieved by choosing a random action in
each state s with a small but non-zero probability ε, and choosing otherwise the greedy action a, i.e.,
the action that appears to be best according to the current Q-values, namely, a = argmax_{a∈A(s)} Q(a, s).
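The ε-greedy Q-learning update just described can be sketched as follows; a minimal, illustrative fragment assuming a simulator step(s, a) -> (next_state, reward) and tabular Q-values, not code from the book.

```python
import random
from collections import defaultdict

def q_learning_episode(s0, A, step, is_terminal, Q, gamma=0.95, alpha=0.1, eps=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (reward maximization).
    Q maps (s, a) to a value, e.g. Q = defaultdict(float); step(s, a) -> (s2, r)."""
    s = s0
    while not is_terminal(s):
        if random.random() < eps:
            a = random.choice(A(s))                          # explore
        else:
            a = max(A(s), key=lambda a: Q[(s, a)])           # greedy action
        s2, r = step(s, a)
        target = r if is_terminal(s2) else r + gamma * max(Q[(s2, a2)] for a2 in A(s2))
        Q[(s, a)] += alpha * (target - Q[(s, a)])            # stochastic update
        s = s2
    return Q
```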
Q-learning is a model-free algorithm for solving MDPs, as it learns the behavior but not the
model (parameters). Model-based reinforcement learning algorithms, on the other hand, are aimed at
learning both the model and the behavior, which they derive from the model like any planning-based
method. Interestingly, some of the best known model-based RL algorithms, like R-MAX [Brafman
and Tennenholtz, 2003], actually map the learning problem into a planning problem. Indeed, R-MAX
plans in an optimistic model where rewards r(s, a, s') that are not yet known are replaced by known,
optimistic rewards R, r(s, a, s') ≤ R, and similarly, transition probabilities P_a(s'|s) that are not yet
known are replaced by known transition probabilities P_a(s_R|s) = 1, where s_R is a new and absorbing
"nirvana" state where the rewards r(s_R, a, s_R) are all equal to the upper bound R. Planning in this
"optimistic" model directs the learning agent to states s and actions a that lead to the "nirvana" state
s_R. By repeating this process over and over, a sufficient number of samples is obtained for the unknown
parameters P_a(s'|s) and r(s, a, s') until they become known with sufficient confidence. When
this happens, the optimistic values for these parameters are replaced by the learned values, and the
optimistic MDP model becomes less optimistic and more accurate. By iterating this planning and
learning process, R-MAX ends up producing a nearly optimal policy in polynomial time.
³ In Reinforcement Learning, it is common to consider rewards of the form r(s, a, s') that are a function of the action applied,
the current state s, and the following state s'. Such rewards, unlike the rewards r(a, s) that we have considered so far, need
to be pushed inside the summation as shown in (6.22). When the model parameters are known, as in planning, these 3-place
rewards r(s, a, s') can be replaced by the 2-place rewards r(a, s) by setting r(a, s) = Σ_{s'∈S} P_a(s'|s) r(s, a, s').
CHAPTER 7
POMDP PLANNING: STOCHASTIC ACTIONS AND PARTIAL FEEDBACK
Then, if the observation token o is obtained, the belief that results from b after the action a and the
observation o, denoted as b_a^o, is:

    b_a^o(s) = P_a(o|s) \, b_a(s) / b_a(o)    (7.2)

where b_a(o) is the probability of observing o after doing the action a in the belief b:

    b_a(o) = \sum_{s \in S} P_a(o|s) \, b_a(s).    (7.3)

Equations 7.1 and 7.2 for POMDPs generalize Eqs. 5.1 and 5.2 for the non-deterministic,
partially observable models considered in Chapter 5, by taking probabilities into account, both in the
system dynamics and in the sensors.
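A minimal sketch of this belief progression, with beliefs as plain Python dictionaries mapping states to probabilities, and P_a(s'|s) and P_a(o|s) given as hypothetical nested dictionaries (illustrative layout, not from the book):

```python
def progress_belief(b, a, o, P, O):
    """Compute b_a (after action a) and then b_a^o (after observing o), cf. Eqs. 7.2-7.3.
    P[a][s] is {s2: prob}; O[a][s] is {obs: prob}."""
    # prediction step: b_a(s) = sum_{s'} P_a(s|s') b(s')
    b_a = {}
    for s, p in b.items():
        for s2, q in P[a][s].items():
            b_a[s2] = b_a.get(s2, 0.0) + q * p
    # b_a(o) = sum_s P_a(o|s) b_a(s)                      (Eq. 7.3)
    prob_o = sum(O[a][s].get(o, 0.0) * p for s, p in b_a.items())
    assert prob_o > 0, "observation o has zero probability under belief b and action a"
    # filtering step: b_a^o(s) = P_a(o|s) b_a(s) / b_a(o)  (Eq. 7.2)
    return {s: O[a][s].get(o, 0.0) * p / prob_o for s, p in b_a.items()}
```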
Provided that the set A(b) of actions that are applicable in a belief state b is defined as the set
of actions that are applicable in each of the states that are possible according to b, and that the cost
c(a, b) of applying an action a in a belief state b is defined as the expected cost

    c(a, b) = \sum_{s \in S} c(a, s) \, b(s),    (7.4)

it is simple to transform a Goal POMDP M into an equivalent fully observable Goal MDP over belief
states M' where:
• the states b in M' are the belief states over M,
• the initial state in M' is the belief state b_0,
• the goal states in M' are the goal beliefs b_G such that b_G(s) = 0 if s is not a goal state in M,
• the set of actions A(b) applicable in b is comprised of the actions a such that a ∈ A(s) for all s
  such that b(s) > 0,
• the transition probabilities P_a(b'|b) are equal to b_a(o) if b' = b_a^o, and otherwise equal to 0, and
• the costs c(a, b) are given by (7.4), and are positive (and bounded away from zero) except when
  b is a goal belief, in which case c(a, b) = 0.
The solution to this belief MDP is a policy π mapping belief states into actions that yields a solution
to the original POMDP. In particular, the equations determining the expected costs V^π(b) of a policy π
from the belief b are:

    V^\pi(b) = c(a, b) + \sum_{o \in O} b_a(o) \, V^\pi(b_a^o), \quad \text{with } a = \pi(b),    (7.5)

for non-goal beliefs b, and V^π(b) = 0 for goal beliefs, while the optimal cost function V*(b) is the
solution of the equation

    V^*(b) = \min_{a \in A(b)} \big[ c(a, b) + \sum_{o \in O} b_a(o) \, V^*(b_a^o) \big]    (7.6)

for non-goal beliefs b, and V*(b) = 0 for goal beliefs. The problem in solving the belief MDP, however,
is that unlike the MDPs considered in the last chapter, it has a continuous and infinite state space
given by the probability distributions over the states of the POMDP. The exact methods for solving
POMDPs must address this challenge. As for MDPs, we will assume that there are no dead-end beliefs
b from which goal beliefs cannot be reached, or alternatively, that there is a proper policy that ensures
that a goal belief will be reached from any belief with probability 1.
Figure 7.1: Example of a piecewise linear and concave (pwlc) function f(b) over beliefs b. The function f
is represented by a finite set Γ of |S|-dimensional vectors α such that f(b) = min_{α∈Γ} Σ_{s∈S} b(s)α(s). In the
example, S contains two states and Γ contains three planes (lines). Every belief state b on S corresponds to a
point in the interval [0, 1], and the value f(b) is determined by the projection of the point b on the concave
hull of Γ.
    V_{k+1}(b) = \min_{a \in A(b)} \big[ c(a, b) + \sum_{o \in O} b_a(o) \, V_k(b_a^o) \big].    (7.7)
Yet such an update cannot be implemented by iterating over all possible belief states. In 1973, however,
Sondik observed that if Value Iteration starts from a piecewise linear and concave (pwlc) function V_0,
then all functions V_k resulting from these updates remain pwlc. A pwlc function f on the continuous
belief space over S is a combination of linear functions given as

    f(b) = \min_{\alpha \in \Gamma} \sum_{s \in S} b(s) \alpha(s)    (7.8)

where Γ is a finite set of |S|-dimensional real vectors. The pwlc function f can be stored in finite
space as the set Γ of vectors. Figure 7.1 shows an example of a pwlc function. Clearly, any constant
function f(b) = v is a pwlc function that corresponds to the singleton Γ = {α} where the vector α is
such that α(s) = v for all states s.
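Evaluating a pwlc value function stored as a set of α-vectors is a one-liner; the sketch below uses plain Python lists as an illustrative representation.

```python
def pwlc_value(b, Gamma):
    """Evaluate f(b) = min over alpha in Gamma of sum_s b(s) alpha(s), with the belief b
    and the alpha-vectors given as lists indexed by the same state ordering."""
    return min(sum(b[s] * alpha[s] for s in range(len(b))) for alpha in Gamma)

# e.g., with two states and three vectors (lines), as in Figure 7.1:
# pwlc_value([0.4, 0.6], [[0.0, 2.0], [1.0, 1.2], [1.8, 0.3]])
```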
Sondik's result is fundamental as it provides a feasible way for implementing Value Iteration:
starting from a set Γ_0 of vectors that represent the pwlc value function V_0, a new set of vectors Γ_1 is
computed that represents the pwlc value function V_1 that results from a full DP update of V_0, and
so on, until a value function V_k is obtained with residuals that do not exceed a given ε. We will follow
Sondik in showing how these updates can be carried out. We use a formulation with a discount factor
γ, 0 < γ < 1, that ensures convergence to a given residual in a bounded number of iterations, and
assume that all actions a ∈ A are applicable in all states so that A(b) = A for any belief b. We also
assume a cost setting; for rewards, costs c(a, s) must be replaced by rewards r(a, s), and minimizations
by maximizations.
Given an initial pwlc value function V_0, and assuming inductively that the value function V_k
can be characterized by a set of vectors Γ_k as:

    V_k(b) = \min_{\alpha \in \Gamma_k} \sum_{s \in S} b(s) \alpha(s),    (7.9)
we need to show that the function V_{k+1}, given as the full DP update of the function V_k:

    V_{k+1}(b) = \min_{a \in A} \big[ c(a, b) + \gamma \sum_{o \in O} b_a(o) V_k(b_a^o) \big]    (7.10)
              = \min_{a \in A} \big[ c(a, b) + \gamma \sum_{o \in O} b_a(o) \min_{\alpha \in \Gamma_k} \sum_{s \in S} b_a^o(s) \alpha(s) \big],    (7.11)

is also a pwlc function, given by a finite set of vectors Γ_{k+1} defined in terms of the vectors in Γ_k and
the parameters of the POMDP.

Notice that for each observation o, the outer sum in the right-hand side of (7.11) "picks" a
vector α for the inner sum, and that such vectors can be summarized with a choice function ν: O → Γ_k.
Moreover, since the outer sum picks vectors that minimize the inner sum, the min inside can be pulled
out by converting it into a minimization over the collection \mathcal{V}_k =^{def} {ν | ν: O → Γ_k} of all the |Γ_k|^{|O|}
choice functions. If ν is one such choice function and ν(o) = α, the notation ν(o)(s) below stands for
α(s):

    V_{k+1}(b) = \min_{a \in A} \big[ c(a, b) + \gamma \min_{\nu \in \mathcal{V}_k} \sum_{o \in O} b_a(o) \sum_{s \in S} b_a^o(s) \, \nu(o)(s) \big]    (7.12)
              = \min_{a \in A} \big[ c(a, b) + \gamma \min_{\nu \in \mathcal{V}_k} \sum_{s \in S} \sum_{o \in O} \nu(o)(s) \, b_a(o) \, b_a^o(s) \big].    (7.13)
Making use of the definitions of the probabilities b_a^o(s) and b_a(o) in Eqs. 7.2 and 7.3 respectively, it
follows that:

    V_{k+1}(b) = \min_{a \in A} \big[ c(a, b) + \gamma \min_{\nu \in \mathcal{V}_k} \sum_{s, o} \nu(o)(s) \sum_{s'} b(s') P_a(s|s') P_a(o|s) \big]    (7.14)
              = \min_{a \in A} \min_{\nu \in \mathcal{V}_k} \big[ c(a, b) + \gamma \sum_{s, o} \nu(o)(s) \sum_{s'} b(s') P_a(s|s') P_a(o|s) \big].    (7.15)
The set Γ_{k+1} contains at most |A| |Γ_k|^{|O|} different vectors α_{a,ν}, one for each action a and choice function
ν: O → Γ_k. However, some of these vectors are dominated by others, in the sense that they do not
yield the minimum at any belief b. Such vectors can be identified by solving a linear program and
removed [Kaelbling et al., 1998]. Similarly, a linear program can be used to check if the residual of the
value function represented by Γ_k falls below ε for terminating Value Iteration. Figure 7.2 shows the
result of applying a full DP update over the function shown in Figure 7.1.

In spite of the exponential complexity of the full DP update, Sondik's representation is ubiquitous
in exact and approximate methods, including recent state-of-the-art algorithms that use it in a
more selective type of update, known as point-based updates.
Figure 7.2: The left panel shows the concave hull for a set Γ of vectors that define a pwlc value function (cf.
Figure 7.1). The right panel shows the result of applying a full update on Γ as described in the text. Only the
non-dominated vectors in both sets are shown.
7.3 APPROXIMATE AND ONLINE ALGORITHMS
The complexity of solving POMDPs has limited the applicability of exact algorithms to large problems,
where approximation methods are used instead. We review some of these methods below.
POINT-BASED BACKUP ALGORITHMS
The exponential blow-up in the number of vectors that results from a single DP update is a consequence
of updating the value function over all beliefs. An alternative is to update the value function at a
restricted subset of belief points, generating fewer vectors, and hence keeping the size of the value
function representation smaller. Some of the state-of-the-art algorithms for POMDPs are based on
this idea of point-based value updates.

If V is a pwlc function given by a set of vectors Γ, a point-based update or backup of V over a
set of belief points F refers to the pwlc value function V̂ given by a new set of vectors Γ̂ such that
V̂(b) = V_fb(b) for every b ∈ F, where V_fb is the full backup of V. If F is the whole belief space, V̂ is
equal to V_fb; otherwise the size of F can be used to control the complexity of the point-based backup.
Point-based POMDP algorithms [Pineau et al., 2006, Shani et al., 2012] compute the new set of
vectors Γ̂ from Γ by adding one vector backup(V, b) for each belief b in F. The vector backup(V, b)
is the one that assigns the value V_fb(b) to b:

    backup(V, b) = \operatorname{argmin}_{\alpha \in \Gamma_{fb}} \sum_{s \in S} b(s) \alpha(s)    (7.19)
where Γ_fb is the set of vectors for V_fb and \mathcal{V} is the set of choice functions for Γ. This expression, however,
requires the computation of the set Γ_fb, whose size O(|A| |Γ|^{|O|}) is exponential in the number of possible
observations. The method below computes the single vector backup(V, b) for an arbitrary belief b in
polynomial time. For this, observe first that the sum on the right-hand side of the Bellman equation for
updating V(b) can be expressed as:

    \sum_{o \in O} b_a(o) V(b_a^o) = \sum_{o \in O} b_a(o) \min_{\alpha \in \Gamma} \sum_{s \in S} b_a^o(s) \alpha(s)    (7.20)
                             = \sum_{o \in O} \min_{\alpha \in \Gamma} \sum_{s', s'' \in S} P_a(o|s'') P_a(s''|s') b(s') \alpha(s'')    (7.21)
                             = \sum_{o \in O} \min_{\alpha \in \Gamma} \sum_{s' \in S} g^\alpha_{a,o}(s') b(s')    (7.22)

where g^α_{a,o} is the vector defined as

    g^\alpha_{a,o}(s) \stackrel{def}{=} \sum_{s' \in S} \alpha(s') P_a(o|s') P_a(s'|s).    (7.23)
we obtain

    \sum_{o \in O} b_a(o) V(b_a^o) = \sum_{o \in O} \min_{\alpha \in \Gamma} g^\alpha_{a,o} \cdot b    (7.25)
                                 = \sum_{o \in O} \big( \operatorname{argmin} \{ g^\alpha_{a,o} \cdot b \mid \alpha \in \Gamma \} \big) \cdot b    (7.26)
                                 = \big( \sum_{o \in O} \operatorname{argmin} \{ g^\alpha_{a,o} \cdot b \mid \alpha \in \Gamma \} \big) \cdot b    (7.27)
where the sum in (7.27) is a (point-wise) sum of vectors. Finally, if g_{a,b} is defined as the vector with
entries

    g_{a,b}(s) \stackrel{def}{=} c(a, s) + \gamma \sum_{o \in O} g^b_{a,o}(s)    (7.28)

where the vector g^b_{a,o} is the one that minimizes the scalar product g^α_{a,o} · b for α ∈ Γ, then the value
of the full backup of V at the belief b becomes:

    V_{fb}(b) = \min_{a \in A} \big[ c(a, b) + \gamma \big( \sum_{o \in O} \operatorname{argmin} \{ g^\alpha_{a,o} \cdot b \mid \alpha \in \Gamma \} \big) \cdot b \big]    (7.29)
             = \min_{a \in A} g_{a,b} \cdot b,    (7.30)

so that

    backup(V, b) = \operatorname{argmin}_{\,g_{a,b} : \, a \in A} \; g_{a,b} \cdot b.    (7.31)
This derivation provides an efficient method for computing the vector that encodes the update
of the value function V at the belief point b. There are indeed |A| vectors g_{a,b}, each one of which
can be computed in O(|S||O||Γ|) time, provided that the vectors g^α_{a,o} are precomputed and stored,
an operation that requires O(|A||O||S|) time. Moreover, the vectors g^α_{a,o} do not depend on the belief
point b and thus, once computed, can be reused when computing the backup over other beliefs.
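The point-based backup just derived can be sketched as follows, using numpy arrays for beliefs and α-vectors; the function and array names are illustrative, and a discounted cost setting is assumed as in the text.

```python
import numpy as np

def point_based_backup(b, Gamma, P, Pobs, c, actions, observations, gamma):
    """Compute the vector backup(V, b) of Eqs. 7.28-7.31. Gamma is a list of alpha-vectors
    (numpy arrays over states); P[a][s2, s] = P_a(s2|s); Pobs[a][o, s] = P_a(o|s);
    c[a] is the cost vector c(a, .)."""
    best_vec, best_val = None, float("inf")
    for a in actions:
        g_ab = c[a].copy()                       # g_{a,b}(s) = c(a,s) + gamma * sum_o g^b_{a,o}(s)
        for o in observations:
            # g^alpha_{a,o}(s) = sum_s' alpha(s') P_a(o|s') P_a(s'|s)   (Eq. 7.23)
            g_alphas = [(alpha * Pobs[a][o]) @ P[a] for alpha in Gamma]
            g_best = min(g_alphas, key=lambda g: g @ b)      # minimizes the scalar product with b
            g_ab += gamma * g_best
        val = g_ab @ b                                        # Eq. 7.30
        if val < best_val:
            best_vec, best_val = g_ab, val
    return best_vec                                           # Eq. 7.31
```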
The different point-based POMDP algorithms differ mainly in the set of beliefs F selected for
update in each iteration, in the initial set of vectors, and in the termination condition [Shani et al.,
2012]. Like the RTDP algorithm for POMDPs below, recent point-based algorithms aim to exploit
the information about the initial belief state, use admissible value functions, and focus on the belief
states that are reachable from the initial belief state by following greedy policies.
RTDP-BEL
% Initial value function V given by heuristic h
% Changes to V stored in a hash table using discretization function d(·)
Let b := b_0 be the initial belief
Sample state s with probability b(s)
while b is not a goal belief do
    Evaluate each action a ∈ A(b) as: Q(a, b) := c(a, b) + Σ_{o∈O} b_a(o) V(b_a^o)
    Select best action a := argmin_{a∈A(b)} Q(a, b)
    Update value V(b) := Q(a, b)
    Sample next state s' with probability P_a(s'|s) and set s := s'
    Sample observation o with probability P_a(o|s)
    Update current belief b := b_a^o
end while

Figure 7.3: RTDP-Bel is RTDP over the belief MDP with an additional provision: for reading or writing the
value V(b) in the hash table, b is replaced by d(b), where d is a discretization function.
RTDP-BEL
RTDP-Bel [Bonet and Geffner, 2009, Geffner and Bonet, 1998] is a direct adaptation to Goal
POMDPs of the RTDP algorithm developed for Goal MDPs [Barto et al., 1995] and reviewed in Chapter
6, where states are replaced by belief states and the updates are done using the expression (7.7)
for POMDPs. The code for RTDP-Bel is shown in Figure 7.3. There is just one difference between
RTDP-Bel and RTDP: in order to bound the size of the hash table and make the updates more effective,
each time the hash table is accessed for reading or writing a value V(b), the belief b is discretized. The
discretization function d maps each entry b(s) into the entry d(b(s)) = ceil(D · b(s)), where D is a
positive integer (the discretization parameter), and ceil(x) is the least integer greater than or equal to x.
For example, if D = 10 and b is the vector (0.22, 0.44, 0.34) over the states s ∈ S, then d(b) is the
vector (3, 5, 4). The discretization is used in the operations for accessing the hash table and does not
affect the beliefs that are generated during a trial. Using a terminology that is common in Reinforcement
Learning, the discretization is a function approximation device [Bertsekas and Tsitsiklis, 1996, Sutton
and Barto, 1998], where a single parameter, the value stored at the cell d(b) in the hash table, is used
to represent the value of all beliefs b' that discretize into d(b') = d(b). This approximation relies on
the assumption that the values of beliefs that are close should be close as well. Moreover, the discretization
preserves supports (the states s with b(s) > 0) and never collapses the values of two beliefs if
there is a state that is excluded by one but not by the other.
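The discretization used by RTDP-Bel for hashing beliefs can be sketched in a couple of lines; a minimal illustration with beliefs as dictionaries, not code from the book.

```python
import math

def discretize(b, D=10):
    """Hash key for a belief b as used by RTDP-Bel: each entry b(s) is mapped to
    ceil(D * b(s)). Zero entries are dropped, so the support of b is preserved."""
    return tuple(sorted((s, math.ceil(D * p)) for s, p in b.items() if p > 0))

# e.g., discretize({"s1": 0.22, "s2": 0.44, "s3": 0.34}) == (("s1", 3), ("s2", 5), ("s3", 4))
```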
Belief discretization makes the value function representation finite at the cost of theoretical
properties that do not carry over automatically from RTDP to RTDP-Bel. First, the convergence of
RTDP-Bel is not guaranteed, and actually the value in a cell may oscillate. Second, the value function
approximated in this way does not necessarily remain a lower bound.

RTDP and RTDP-Bel can be used both for offline and online planning. For Goal MDPs,
RTDP trials are guaranteed to reach a goal state provided that there are no dead-ends in the problem.
The same is usually true for RTDP-Bel, but this cannot be guaranteed due to the approximation
introduced by the discretization. In any case, if the input problem is a Discounted POMDP, whether
reward- or cost-based, it must first be converted into an equivalent Goal POMDP before RTDP-Bel is
run. This transformation has been applied to the existing benchmarks for Discounted Reward-based
POMDPs in order to compare RTDP-Bel with point-based POMDP algorithms [Bonet and Geffner,
2009].
PO-UCT
PO-UCT is in turn a generalization of the UCT algorithm to POMDPs [Silver and Veness, 2010].
Adapting UCT to POMDPs is less direct than adapting RTDP because, as a model-free algorithm,
PO-UCT cannot keep track of exact beliefs. PO-UCT thus keeps track of executions or histories, i.e.,
sequences of actions and observations h = ⟨a_0, o_0, a_1, o_1, ..., a_k, o_k⟩, and for each history h, it
approximates the belief that would result from such an execution and a given initial belief state b_0 by a
set of state samples B(h). The nodes in the tree built by PO-UCT refer to executions h' that extend the
real execution h, starting with the empty execution h = ⟨⟩. If h is the real execution so far, PO-UCT
performs a number of simulations starting in states s sampled from B(h), applies the action a that
minimizes the cost V(ha), gets the observation o, and resumes the loop with the history hao. As in
UCT, the nodes h are associated with two fields in addition to the set of samples B(h): the value V(h)
associated with the execution, and a counter N(h) that tracks the number of simulations that have passed
through the node h.
The planning tree is expanded by performing simulations that start at states s sampled from the
belief B(h) associated with the real execution h so far. Actions are selected at a node h of the tree as in
UCT, using the current values V(ha) of the actions a in h, and the number of visits N(h) and N(ha).
When the action a is performed in a state s, the simulator returns an observation o, a next state s',
and a sampled cost c. If the resulting node hao is not in the tree, it is added to the tree, and a rollout
of the base policy is used to initialize the value of the node and to update the ancestor nodes using
Monte-Carlo updates. If the node hao is in the tree, the same process is applied from that node and
the associated state s'. In either case, the counter N(hao) is adjusted and the sample s' is added to
B(hao).
When the planning episode finishes and an action a is selected for execution following the
current execution h, the action is executed and an observation o is gathered. The next planning episode
starts from the resulting history hao. Nodes h' in the tree that are not extensions of the execution so
far can be pruned. Code for a cost-based version of the PO-UCT algorithm is shown in Figure 7.4. A
problem for PO-UCT arises from the way it approximates beliefs by samples, which does not prevent
reaching real executions h with very few samples in B(h), limiting the information that can be obtained
by planning from h. By using domain-specific methods for adding new samples in such cases, PO-
UCT has been shown to exhibit excellent performance over a collection of large POMDPs, including
games such as Battleship and a partially observable version of Pacman [Silver and Veness, 2010].
Search(h)
  repeat
    Sample s according to b_0 if h = <> or B(h) otherwise
    Simulate(s, h, 0)
  until time is up
  return argmin_a V(<ha>)

Simulate(s, h, depth)
  if gamma^depth < epsilon then return 0
  if h does not appear in tree T then
    for all actions a in A do
      Insert <ha> in tree as T(<ha>) := <0, 0, {}>
    end for
    return Rollout(s, h, depth)
  end if
  a := argmin_a V(<ha>) - C * sqrt(log N(h) / N(<ha>))
  Sample (s', o, c) using simulator with state s and action a
  Cost := c + gamma * Simulate(s', <hao>, 1 + depth)
  B(h) := B(h) U {s}
  Increment N(h) and N(<ha>)
  V(<ha>) := V(<ha>) + [Cost - V(<ha>)] / N(<ha>)
  return Cost

Rollout(s, h, depth)
  if gamma^depth < epsilon then return 0
  Let a := pi(h)
  Sample (s', o, c) using simulator with state s and action a
  return c + gamma * Rollout(s', <hao>, 1 + depth)
Figure 7.4: PO-UCT algorithm for Discounted Cost-based POMDPs. Each node in the tree corresponds to
an execution history h that is associated with a triplet <N(h), V(h), B(h)> made up of a counter N(h), a value V(h)
for the node, and a set of samples B(h). The policy pi is the base policy used in PO-UCT. The action selected
for execution after the history h is the one returned by Search(h). This action a is applied, the observation o is
obtained, and the process resumes with h := hao.
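For readers who prefer running code, the following Python sketch mirrors the loop of Figure 7.4 against a generic black-box simulator. It is an illustration under our own assumptions: the simulator interface (s, a) -> (s', o, c), the rollout_policy and b0_sampler callables, and the constants gamma, epsilon, and C are ours, not part of the published algorithm.

    import math
    import random

    class POUCT:
        # Sketch of cost-based PO-UCT (Figure 7.4). Histories are tuples that
        # alternate actions and observations; keys ending in an action stand
        # for the nodes <ha>.
        def __init__(self, actions, simulator, rollout_policy, b0_sampler,
                     gamma=0.95, epsilon=0.01, C=1.0):
            self.actions = actions          # finite list of actions
            self.sim = simulator            # (s, a) -> (s2, o, c), sampled
            self.pi = rollout_policy        # h -> a, the base policy
            self.b0 = b0_sampler            # () -> s, a sample from b0
            self.gamma, self.eps, self.C = gamma, epsilon, C
            self.tree = {}                  # node -> [N, V, samples B]

        def search(self, h, num_sims=1000):
            for _ in range(num_sims):
                node = self.tree.get(h)
                # Sample the start state from B(h); fall back to b0 when no
                # samples are available yet (a simplification of the figure).
                s = self.b0() if (h == () or not node or not node[2]) else random.choice(node[2])
                self.simulate(s, h, 0)
            return min(self.actions,
                       key=lambda a: self.tree.get(h + (a,), [0, float('inf'), []])[1])

        def simulate(self, s, h, depth):
            if self.gamma ** depth < self.eps:
                return 0.0
            if h not in self.tree:
                self.tree[h] = [0, 0.0, []]
                for a in self.actions:                  # initialize the nodes <ha>
                    self.tree[h + (a,)] = [0, 0.0, []]
                return self.rollout(s, h, depth)
            n_h = max(1, self.tree[h][0])
            def score(a):                               # UCB for costs: subtract the bonus
                n_ha, v_ha, _ = self.tree[h + (a,)]
                if n_ha == 0:
                    return -math.inf                    # try untried actions first
                return v_ha - self.C * math.sqrt(math.log(n_h) / n_ha)
            a = min(self.actions, key=score)
            s2, o, c = self.sim(s, a)
            cost = c + self.gamma * self.simulate(s2, h + (a, o), depth + 1)
            self.tree[h][2].append(s)                   # B(h) := B(h) U {s}
            self.tree[h][0] += 1
            node = self.tree[h + (a,)]
            node[0] += 1
            node[1] += (cost - node[1]) / node[0]       # Monte-Carlo update of V(<ha>)
            return cost

        def rollout(self, s, h, depth):
            if self.gamma ** depth < self.eps:
                return 0.0
            a = self.pi(h)
            s2, o, c = self.sim(s, a)
            return c + self.gamma * self.rollout(s2, h + (a, o), depth + 1)

At execution time, search(h) is called on the current history, the selected action is applied, the received observation is appended to h, and subtrees that are not extensions of the new history can be discarded, as described above.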
Exact belief tracking in POMDPs is exponential in the number of problem variables. The problem of computing the belief that results
at time k + 1 from a given execution h = <a_0, o_0, ..., a_k, o_k>, starting in a given initial belief b_0, can
be expressed as a probabilistic inference problem over a Dynamic Bayesian Network [Pearl, 1988,
Russell and Norvig, 2009]. Exact probabilistic inference over Bayesian Networks is exponential in a
parameter associated with the underlying directed graph, known as the treewidth, which is related to
the maximum number of variables in the network that have to be collapsed into a single variable so
that the result is a Bayesian Tree. Since the treewidth is often not bounded, approximation algorithms
are common. In the case of Dynamic Bayesian Networks (DBNs), a usual algorithm is particle filtering,
where beliefs are approximated by a set of states or particles [Doucet et al., 2000]. In its most basic
form, given a set of samples B_k providing an approximate representation of the belief b_k after an
execution h_k = <a_0, o_0, ..., a_k, o_k>, a new set of samples B_{k+1} can be obtained for approximating the
belief b_{k+1} that results from the action a_{k+1} and the observation o_{k+1} by the following three steps.
First, each sample state s_k in B_k is propagated into a state s_{k+1} sampled with the transition probability
P_{a_{k+1}}(s_{k+1} | s_k). Second, the new samples s_{k+1} are assigned a weight given by the observation probability
P_{a_{k+1}}(o_{k+1} | s_{k+1}). Third, the set of weighted samples is resampled to yield the set of unweighted samples
B_{k+1} [Russell and Norvig, 2009]. The initial set of samples B_0 is obtained by sampling the initial belief
b_0. The probability that a given formula is true at time k + 1 is obtained from the ratio of samples
in B_{k+1} where the formula is true. Particle filtering does best when there are few zero entries in the
transition and observation probabilities. The PO-UCT algorithm above approximates beliefs from
histories using a particle filter of this type.
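A minimal Python sketch of this propagate-weight-resample step follows, assuming access to a transition sampler and an observation-probability function; transition_sample and obs_prob are our names for these interfaces, not the book's.

    import random

    def particle_filter_step(B_k, a, o, transition_sample, obs_prob):
        # One step of the basic particle filter: propagate, weight, resample.
        # B_k: list of sample states approximating belief b_k.
        # a, o: the action applied and the observation received at step k+1.
        # transition_sample(s, a): returns s' sampled from P_a(s' | s).
        # obs_prob(o, s2, a): returns P_a(o | s').
        propagated = [transition_sample(s, a) for s in B_k]      # step 1: propagate
        weights = [obs_prob(o, s2, a) for s2 in propagated]      # step 2: weight
        total = sum(weights)
        if total == 0:
            # All particles are inconsistent with the observation; in practice
            # new samples must be injected (as PO-UCT does with domain-specific
            # methods). Returning the propagated set is a placeholder.
            return propagated
        weights = [w / total for w in weights]
        # step 3: resample with replacement to obtain unweighted samples B_{k+1}
        return random.choices(propagated, weights=weights, k=len(B_k))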
Discussion
The selection of the action to do next is one of the central problems faced by autonomous agents. As
discussed in Chapter 1, the problem is normally addressed in three different ways: in the hardwired
approach, the control is set by nature or by a programmer; in the learning-based approach, the control
is learned by trial-and-error; and in the model-based approach, the control is derived from a model of the
actions, sensors, and goals. Planning is the model-based approach to autonomous behavior, and in this
book we have considered the main planning models and methods. In this last chapter, we list some
challenges in current planning research, and discuss briefly how work on scalable computational
models of planning can contribute to the understanding of one of the most distinctive human traits,
namely, the ability to plan, often in the context of other agents that have goals and make plans too.
Learning to search. Learning can play several roles in model-based approaches, the first of which is
learning the model itself from experience and partial observations (see below). Learning, however, also
has a role to play in the search for solutions; a role that has been crucial in the context of SAT [Biere
et al., 2012], but that has not been fully exploited in planning except in the context of planning as SAT
[Kautz and Selman, 1996]. For example, consider an agent that has to deliver a large package to one
of two cells A or B in a grid, by going to the cell and dropping the package. Furthermore, assume that
A is closer to the agent than B, but that A cannot be entered while holding a large package. Most current
classical planning heuristics will drive the search toward A in a way resembling a fly that wants to get
past a closed window. Unlike flies, however, search algorithms avoid revisiting the same states, and
would eventually solve the problem after partially exhausting the space around A. A more intelligent
strategy would be to notice that the failed search around A is the result of an interaction ignored by the
heuristic, and that this should be fixed. This is precisely what SAT solvers do: they identify the causes of failure
and fix them while searching. Traditional heuristic methods cannot replicate this behavior because
they ignore the structure of the heuristic function, yet this structure is available to heuristic search
methods in planning, which should be able to exploit it. This same limitation applies to heuristic search
algorithms like LRTA* and those used for solving MDPs and POMDPs: the values of states and
belief states are learned very slowly because there is no analysis explaining what was wrong with
the updated estimates. It is an open question whether something akin to the conflict-directed learning
used in SAT could be applied cost-effectively in the setting of heuristic search.
Generalized Planning. The problems of model and feature learning are related to the problem of gen-
eralized planning reviewed in Section 1.5, where a policy is sought not just for one planning instance
but for many instances, e.g., all blocks world instances. Often general policies of this type can be ex-
pressed in a compact way provided the right features. The question is how to get simultaneously the
right features and the policies. One approach that has been pursued learns compact policies
from examples, using features obtained from the potentially infinite collection of predicates defined by
a domain-independent grammar and a given set of primitive domain predicates [Fern et al., 2003].
This is an inductive, learning-based approach. An open question is whether these types of compact,
generalized policies can also be synthesized from a model by suitable transformations of the problem.
The derivation of finite-state controllers using planners considered in Section 4.4 goes in this direc-
tion. Also, for example, the generalized planning problem of picking up a green block from a tower of
blocks of any size can be cast as a non-deterministic partially observable planning problem over integer
variables, which can be modeled and solved with the methods developed by Srivastava et al. [2011b].
Hierarchies. Hierarchies form a basic component of Hierarchical Task Networks (HTNs), an al-
ternative model for planning that is concerned with the encoding of strategies for solving problems
(Section 3.11). Hierarchies, however, play no role in state-of-the-art domain-independent planners,
which are completely flat. Yet it is clear that most real plans involve primitive actions that can be exe-
cuted along with high-level actions that are abstractions of those. For example, the action of picking
up a block involves displacements of the robot gripper, which must be opened and closed on the right
block. A basic question that has not been fully answered yet is how these abstractions can be formed
automatically, and how they are to be used to speed up the planning process. For instance, the standard
blocks world is an abstraction of a problem where blocks are at certain locations, and the gripper has to
move between locations. This abstraction, however, is not adequate when the table has no space for all
the blocks, or when the gripper cannot get past towers of a certain height. The open question is how to
automatically compile detailed planning descriptions into more abstract ones that can be used to solve
the problems more effectively. There is a large body of work on abstract problem solving that is rele-
vant to this question [Bacchus and Yang, 1994, Jonsson, 2007, Knoblock, 1990, Korf, 1987, Marthi
et al., 2007, McIlraith and Fadel, 2002, Sacerdoti, 1974], but none so far that solves this problem in
a general manner.
Model Learning. We have briefly discussed model-based reinforcement learning algorithms that ac-
tively learn model parameters such as probabilities and rewards, yet a harder problem is learning the
states themselves from partial observations. Several attempts to generalize reinforcement learning al-
gorithms to such settings have been made, some of which learn to identify useful features and feature
histories [Veness et al., 2011], but none so far can come up with the states and models themselves
in a robust and scalable manner from streams of observations and actions.
1 This section is taken from Geffner [2010, 2013b], where these issues are discussed in more detail.
Bibliography
P. E. Agre and D. Chapman. Pengi: An implementation of a theory of activity. In Proc. 6th Nat. Conf.
on Artificial Intelligence, pages 268–272, 1987. 12
A. Albarghouthi, J. A. Baier, and S. A. McIlraith. On the use of planning technology for verification.
In Proc. ICAPS’09 Workshop VV&PS, 2009. 64
A. Albore, M. Ramírez, and H. Geffner. Effective heuristics and belief tracking for planning with
incomplete information. In Proc. 21st Int. Conf. on Automated Planning and Scheduling, pages 2–9,
2011. 72
E. Amir and B. Engelhardt. Factored planning. In Proc. 18th Int. Joint Conf. on Artificial Intelligence,
2003. 35
K. Astrom. Optimal control of Markov Decision Processes with incomplete state estimation. Journal
of Mathematical Analysis and Applications, 10:174–205, 1965. 98
H. Attias. Planning by probabilistic inference. In Proc. 9th Int. Workshop on Artificial Intelligence and
Statistics, 2003. 108
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.
Machine Learning, 47(2):235–256, 2002. DOI: 10.1023/A:1013689704352 94
F. Bacchus and F. Kabanza. Using temporal logics to express search control knowledge for planning.
Artificial Intelligence, 116:123–191, 2000. DOI: 10.1016/S0004-3702(99)00071-5 63
F. Bacchus and Q. Yang. Downward refinement and the efficiency of hierarchical problem solving.
Artificial Intelligence, 71:43–100, 1994. DOI: 10.1016/0004-3702(94)90062-0 111
C. Bäckström and B. Nebel. Complexity results for SAS+ planning. Computational Intelligence,
11(4):625–655, 1995. DOI: 10.1111/j.1467-8640.1995.tb00052.x 25
R. K. Balla and A. Fern. UCT for tactical assault planning in real-time strategy games. In Proc. 21st
Int. Joint Conf. on Artificial Intelligence, pages 40–45, 2009. 93
D. Ballard, M. Hayhoe, P. Pook, and R. Rao. Deictic codes for the embodiment of cognition. Be-
havioral and Brain Sciences, 20(4):723–742, 1997. DOI: 10.1017/S0140525X97001611 10, 61
A. Barto, S. Bradtke, and S. Singh. Learning to act using real-time dynamic programming. Artificial
Intelligence, 72:81–138, 1995. DOI: 10.1016/0004-3702(94)00011-O 72, 86, 90, 104
A. Bauer and P. Haslum. LTL goal specifications revisited. In Proc. 19th European Conf. on Artificial
Intelligence, pages 881–886, 2010. DOI: 10.3233/978-1-60750-606-5-881 63
R. Bellman. Dynamic Programming. Princeton University Press, 1957. 70, 81
P. Bertoli and A. Cimatti. Improving heuristics for planning as search in belief space. In Proc. 6th Int.
Conf. on Artificial Intelligence Planning Systems, pages 143–152, 2002. 71
P. Bertoli, A. Cimatti, M. Roveri, and P. Traverso. Planning in nondeterministic domains under partial
observability via symbolic model checking. In Proc. 17th Int. Joint Conf. on Artificial Intelligence,
pages 473–478, 2001. 73
P. Bertoli, A. Cimatti, M. Pistore, and P. Traverso. A framework for planning with extended goals
under partial observability. In Proc. 13th Int. Conf. on Automated Planning and Scheduling, pages
215–225, 2003. 62
D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice
Hall, 1989. 85
D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996. 96, 104
D. P. Bertsekas. Dynamic Programming and Optimal Control, Vols 1 and 2. Athena Scientific, 1995.
6, 31, 70, 76, 79, 81, 82, 84, 85
A. Biere, M. Heule, H. Van Maaren, and T. Walsh, editors. Handbook of Satisfiability: Frontiers in
Artificial Intelligence and Applications. IOS Press, 2012. 45, 110
A. Blum and M. Furst. Fast planning through planning graph analysis. In Proc. 14th Int. Joint Conf.
on Artificial Intelligence, pages 1636–1642, 1995. DOI: 10.1016/S0004-3702(96)00047-1 13, 31,
45, 49
B. Bonet and H. Geffner. Planning as heuristic search: New results. In Proc. 5th European Conf. on
Planning, pages 359–371, 1999. DOI: 10.1007/10720246_28 44, 45
B. Bonet and H. Geffner. Planning with incomplete information as heuristic search in belief space.
In Proc. 5th Int. Conf. on Artificial Intelligence Planning Systems, pages 52–61, 2000. 55, 71, 77
B. Bonet and H. Geffner. Planning as heuristic search. Artificial Intelligence, 129(1–2):5–33, 2001.
DOI: 10.1016/S0004-3702(01)00108-4 27, 30, 31, 41
B. Bonet and H. Geffner. Faster heuristic search algorithms for planning with uncertainty and full
feedback. In Proc. 18th Int. Joint Conf. on Artificial Intelligence, pages 1233–1238, 2003. 88, 89, 91
B. Bonet and H. Geffner. Labeled RTDP: Improving the convergence of real-time dynamic program-
ming. In Proc. 13th Int. Conf. on Automated Planning and Scheduling, pages 12–31, 2003. 91
B. Bonet and H. Geffner. mGPT: A probabilistic planner based on heuristic search. Journal of Artificial
Intelligence Research, 24:933–944, 2005. DOI: 10.1613/jair.1688 77
B. Bonet and H. Geffner. Solving POMDPs: RTDP-Bel vs. point-based algorithms. In Proc. 21st
Int. Joint Conf. on Artificial Intelligence, pages 1641–1646, 2009. 83, 99, 104, 105
B. Bonet and H. Geffner. Planning under partial observability by classical replanning: eory and
experiments. In Proc. 22nd Int. Joint Conf. on Artificial Intelligence, pages 1936–1941, 2011. DOI:
10.5591/978-1-57735-516-8/IJCAI11-324 57, 73
B. Bonet and H. Geffner. Action selection for MDPs: Anytime AO* versus UCT. In Proc. 26th Conf.
on Artificial Intelligence, pages 1749–1755, 2012. 73, 94
B. Bonet and H. Geffner. Width and complexity of belief tracking in non-deterministic conformant
and contingent planning. In Proc. 26th Conf. on Artificial Intelligence, pages 1756–1762, 2012. 74
B. Bonet and H. Geffner. Causal belief decomposition for planning with sensing: Completeness and
practical approximation. In Proc. 23rd Int. Joint Conf. on Artificial Intelligence, 2013. 75, 76
B. Bonet and M. Helmert. Strengthening landmark heuristics via hitting sets. In Proc. 19th European
Conf. on Artificial Intelligence, pages 329–334, 2010. 42
B. Bonet, G. Loerincs, and H. Geffner. A robust and fast action selection mechanism for planning.
In Proc. 14th Nat. Conf. on Artificial Intelligence, pages 714–719, 1997. 13, 24, 30
B. Bonet, H. Palacios, and H. Geffner. Automatic derivation of memoryless policies and finite-state
controllers using classical planners. In Proc. 19th Int. Conf. on Automated Planning and Scheduling,
pages 34–41, 2009. 10, 60, 61, 62
B. Bonet. Conformant plans and beyond: Principles and complexity. Artificial Intelligence, 174:245–
269, 2010. DOI: 10.1016/j.artint.2009.11.001 54
M. Botvinick and J. An. Goal-directed decision making in the prefrontal cortex: a computational
framework. In Proc. 22nd Annual Conf. on Advances in Neural Information Processing Systems, pages
169–176, 2008. 108
C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and com-
putational leverage. Journal of Artificial Intelligence Research, 1:1–93, 1999. DOI: 10.1613/jair.575
79
C. Boutilier, R. Reiter, and B. Price. Symbolic dynamic programming for first-order MDPs. In Proc.
17th Int. Joint Conf. on Artificial Intelligence, pages 690–700, 2001. 108
M. Bowling, R. Jensen, and M. Veloso. A formalization of equilibria for multiagent planning. In
Proc. 18th Int. Joint Conf. on Artificial Intelligence, pages 1460–1462, 2003. 109
R. I. Brafman and C. Domshlak. Factored planning: How, when, and when not. In Proc. 21st Nat.
Conf. on Artificial Intelligence, pages 809–814, 2006. 35
R. I. Brafman and G. Shani. A multi-path compilation approach to contingent planning. In Proc.
26th Conf. on Artificial Intelligence, pages 1868–1874, 2012. 57
R. I. Brafman and G. Shani. Replanning in domains with partial information and sensing actions.
Journal of Artificial Intelligence Research, 1(45):565–600, 2012. DOI: 10.1613/jair.3711 55, 57, 73
R. I. Brafman and M. Tennenholtz. R-max-a general polynomial time algorithm for near-
optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2003. DOI:
10.1162/153244303765208377 96
R. I. Brafman, C. Domshlak, Y. Engel, and M. Tennenholtz. Planning games. In Proc. 21st Int. Joint
Conf. on Artificial Intelligence, pages 73–78, 2009. 109
D. Bryce, S. Kambhampati, and D. E. Smith. Planning graph heuristics for belief space search. Journal
of Artificial Intelligence Research, 26:35–99, 2006. DOI: 10.1613/jair.1869 55, 71, 73
M. Buckland. Programming Game AI by Example. Wordware Publishing, Inc., 2004. 10
T. Bylander. The computational complexity of propositional STRIPS planning. Artificial Intelligence,
69:165–204, 1994. DOI: 10.1016/0004-3702(94)90081-7 8, 16, 34, 35
P. P. Chakrabarti, S. Ghose, and S. C. De Sarkar. Best first search in AND/OR graphs. In Proc. 16th
Annual ACM Conf. on Computer Science, pages 256–261, 1988. DOI: 10.1145/322609.322650 94
D. Chapman. Penguins can make cake. AI Magazine, 10(4):45–50, 1989. 10, 12, 61
G. M. J. Chaslot, M. H. M. Winands, H. Herik, J. Uiterwijk, and B. Bouzy. Progressive strategies for
Monte-Carlo tree search. New Mathematics and Natural Computation, 4(3):343–357, 2008. DOI:
10.1142/S1793005708001094 93
H. Chen and O. Giménez. Act local, think global: Width notions for tractable planning. In Proc.
17th Int. Conf. on Automated Planning and Scheduling, pages 73–80, 2007. 35
A. Cimatti, M. Pistore, M. Roveri, and P. Traverso. Weak, strong, and strong cyclic planning
via symbolic model checking. Artificial Intelligence, 147(1):35–84, 2003. DOI: 10.1016/S0004-
3702(02)00374-0 76, 108
A. Cimatti, M. Roveri, and P. Bertoli. Conformant planning via symbolic model checking and heuris-
tic search. Artificial Intelligence, 159:127–206, 2004. DOI: 10.1016/j.artint.2004.05.003 55, 74
E. M. Clarke, O. Grumberg, and D. A. Peled. Model Checking. MIT Press, 2000. 108
A. J. Coles, A. Coles, M. Fox, and D. Long. Temporal planning in domains with linear processes. In
Proc. 21st Int. Joint Conf. on Artificial Intelligence, pages 1671–1676, 2009. 48
A. J. Coles, A. Coles, A. García Olaya, S. Jiménez, C. Linares López, S. Sanner, and S. Yoon. A
survey of the seventh international planning competition. AI Magazine, 33(1):83–88, 2012. 34, 39
T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 3rd edition,
2009. 6, 16, 17, 31
S. Cresswell and A. M. Coddington. Compilation of LTL goal formulas into PDDL. In Proc. 16th
European Conf. on Artificial Intelligence, pages 985–986, 2004. 63
J. Culberson and J. Schaeffer. Pattern databases. Computational Intelligence, 14(3):318–334, 1998.
DOI: 10.1111/0824-7935.00065 41
W. Cushing, S. Kambhampati, Mausam, and D. S. Weld. When is temporal planning really temporal?
In Proc. 20th Int. Joint Conf. on Artificial Intelligence, pages 1852–1859, 2007. 49
M. Daniele, P. Traverso, and M. Y. Vardi. Strong cyclic planning revisited. In Proc. 5th European
Conf. on Planning, pages 35–48, 1999. DOI: 10.1007/10720246_3 65, 76, 77
G. de Giacomo and M. Y. Vardi. Automata-theoretic approach to planning for temporally extended
goals. In Proc. 5th European Conf. on Planning, pages 226–238, 1999. DOI: 10.1007/10720246_18
62
T. Dean, L. P. Kaelbling, J. Kirman, and A. Nicholson. Planning with deadlines in stochastic domains.
In Proc. 11th Nat. Conf. on Artificial Intelligence, pages 574–579, 1993. 13
R. Dechter, I. Meiri, and J. Pearl. Temporal constraint networks. Artificial Intelligence, 49:61–95,
1991. DOI: 10.1016/0004-3702(91)90006-6 47
R. Dechter. Constraint Processing. Morgan Kaufmann, 2003. 35, 46
D. C. Dennett. Kinds of minds. Basic Books New York, 1996. 2
E. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271,
1959. DOI: 10.1007/BF01386390 6
M. B. Do and S. Kambhampati. Solving the planning-graph by compiling it into CSP. In Proc. 5th
Int. Conf. on Artificial Intelligence Planning Systems, pages 82–91, 2000. 46
M. B. Do and S. Kambhampati. Sapa: A domain-independent heuristic metric temporal planner. In
Proc. 6th European Conf. on Planning, pages 82–91, 2001. 30, 48
A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-blackwellised particle filtering for dynamic
bayesian networks. In Proc. 16th Conf. on Uncertainty on Artificial Intelligence, pages 176–183, 2000.
107
S. Edelkamp and S. Schrödl. Heuristic Search – Theory and Applications. Academic Press, 2012. 17,
19, 23, 108
S. Edelkamp. Planning with pattern databases. In Proc. 6th European Conf. on Planning, 2001. 41
S. Edelkamp. On the compilation of plan constraints and preferences. In Proc. 16th Int. Conf. on
Automated Planning and Scheduling, pages 374–377, 2006. 48, 63
K. Erol, J. Hendler, and D. S. Nau. HTN planning: Complexity and expressivity. In Proc. 12th Nat.
Conf. on Artificial Intelligence, pages 1123–1123, 1994. 11, 49
P. Eyerich, T. Keller, and M. Helmert. High-quality policies for the canadian traveler’s problem. In
Proc. 24th Conf. on Artificial Intelligence, pages 51–58, 2010. 94
Z. Feng and E. A. Hansen. Symbolic heuristic search for factored Markov decision processes. In Proc.
16th Nat. Conf. on Artificial Intelligence, pages 455–460, 1999. 108
Z. Feng, E. A. Hansen, and S. Zilberstein. Symbolic generalization for on-line planning. In Proc.
18th Conf. on Uncertainty on Artificial Intelligence, pages 209–216, 2002. 108
A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy language bias. In Proc. 17th
Annual Conf. on Advances in Neural Information Processing Systems, 2003. DOI: 10.1613/jair.1700
12, 110
R. Fikes and N. Nilsson. STRIPS: A new approach to the application of theorem proving to problem
solving. Artificial Intelligence, 1:27–120, 1971. DOI: 10.1016/0004-3702(71)90010-5 12, 24
H. Finnsson and Y. Björnsson. Simulation-based approach to general game playing. In Proc. 23rd
Conf. on Artificial Intelligence, pages 259–264, 2008. 93
M. Fox and D. Long. PDDL 2.1: An extension to PDDL for expressing temporal planning domains.
Journal of Artificial Intelligence Research, 20:61–124, 2003. 49
E. Freuder. A sufficient condition for backtrack-free search. Journal of the ACM, 29(1):24–32, 1982.
DOI: 10.1145/322290.322292 35
J. Fu, V. Ng, F. Bastani, and I. Yen. Simple and fast strong cyclic planning for fully-observable
nondeterministic planning problems. In Proc. 22nd Int. Joint Conf. on Artificial Intelligence, pages
1949–1954, 2011. DOI: 10.5591/978-1-57735-516-8/IJCAI11-326 78
B. Gazen and C. Knoblock. Combining the expressiveness of UCPOP with the efficiency of Graph-
plan. In Proc. 4th European Conf. on Planning, pages 221–233, 1997. 26, 51
H. Geffner and B. Bonet. Solving large POMDPs using real time dynamic programming, 1998.
AAAI Fall Symposium on POMDPs. 104
H. Geffner. Heuristics, planning, cognition. In R. Dechter, H. Geffner, and J. Y. Halpern, editors,
Heuristics, Probability and Causality. A Tribute to Judea Pearl. College Publications, 2010. 112
H. Geffner. Artificial Intelligence: From Programs to Solvers. AI Communications, 2013. DOI:
10.3233/978-1-58603-925-7-4 8
H. Geffner. Computational models of planning. Wiley Interdisciplinary Reviews: Cognitive Science, 2,
2013. DOI: 10.1002/wcs.1233 112
C. W. Geib and R. P. Goldman. A probabilistic plan recognition algorithm based on plan tree gram-
mars. Artificial Intelligence, 173(11):1101–1132, 2009. DOI: 10.1016/j.artint.2009.01.003 57
S. Gelly and D. Silver. Combining online and offline knowledge in UCT. In Proc. 24th Int. Conf. on
Machine Learning, pages 273–280, 2007. DOI: 10.1145/1273496.1273531 93
A. E. Gerevini, P. Haslum, D. Long, A. Saetti, and Y. Dimopoulos. Deterministic planning in the fifth
international planning competition: PDDL3 and experimental evaluation of the planners. Artificial
Intelligence, 173(5–6):619–668, 2009. DOI: 10.1016/j.artint.2008.10.012 25, 62
R. Gerth, D. A. Peled, M. Y. Vardi, and P. Wolper. Simple on-the-fly automatic verification of linear
temporal logic. In Proc. Int. Symposium on Protocol Specification, Testing and Verification, pages 3–18,
1995. 64
M. Ghallab, D. Nau, and P. Traverso. Automated Planning: theory and practice. Morgan Kaufmann,
2004. xi, 13, 46, 49
G. Gigerenzer. Gut feelings: The intelligence of the unconscious. Viking Books, 2007. 111
R. P. Goldman and M. S. Boddy. Expressive planning and explicit knowledge. In Proc. 3rd Int. Conf.
on Artificial Intelligence Planning Systems, pages 110–117, 1996. 54
E. A. Hansen and R. Zhou. Anytime heuristic search. Journal of Artificial Intelligence Research, 28:267–
297, 2007. DOI: 10.1613/jair.2096 19
E. A. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops.
Artificial Intelligence, 129:35–62, 2001. DOI: 10.1016/S0004-3702(01)00106-0 72, 91
E. A. Hansen. Solving POMDPs by searching in policy space. In Proc. 14th Conf. on Uncertainty on
Artificial Intelligence, pages 211–219, 1998. 107
P. Hart, N. Nilsson, and B. Raphael. A formal basis for the heuristic determination of min-
imum cost paths. IEEE Trans. on Systems Science and Cybernetics, 4:100–107, 1968. DOI:
10.1109/TSSC.1968.300136 17
P. Haslum and H. Geffner. Admissible heuristics for optimal planning. In Proc. 5th Int. Conf. on
Artificial Intelligence Planning Systems, pages 70–82, 2000. 13, 33, 41
P. Haslum and P. Jonsson. Some results on the complexity of planning with incomplete information.
In Proc. 5th European Conf. on Planning, pages 308–318, 1999. DOI: 10.1007/10720246_24 54
R. Hassin, J. Uleman, and J. Bargh. The New Unconscious. Oxford University Press, 2005. 111
M. Helmert and C. Domshlak. Landmarks, critical paths and abstractions: What’s the difference
anyway? In Proc. 19th Int. Conf. on Automated Planning and Scheduling, pages 162–169, 2009. 41,
42
M. Helmert, P. Haslum, and J. Hoffmann. Flexible abstraction heuristics for optimal sequential
planning. In Proc. 17th Int. Conf. on Automated Planning and Scheduling, pages 176–183, 2007. 42
M. Helmert, M. B. Do, and I. Refanidis. 2008 IPC Deterministic planning competition. In 6th Int.
Planning Competition Booklet (ICAPS 2008), 2008. 39, 51
M. Helmert. The Fast Downward planning system. Journal of Artificial Intelligence Research, 26:191–
246, 2006. DOI: 10.1613/jair.1705 33, 38, 39
M. Helmert. Concise finite-domain representations for PDDL planning tasks. Artificial Intelligence,
173(5):503–535, 2009. DOI: 10.1016/j.artint.2008.10.013 26
J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. SPUDD: Stochastic planning using decision diagrams.
In Proc. 15th Conf. on Uncertainty on Artificial Intelligence, pages 279–288, 1999. 108
J. Hoffmann and R. I. Brafman. Contingent planning via heuristic forward search with implicit belief
states. In Proc. 15th Int. Conf. on Automated Planning and Scheduling, pages 71–80, 2005. 72
J. Hoffmann and R. I. Brafman. Conformant planning via heuristic forward search: A new approach.
Artificial Intelligence, 170:507–541, 2006. DOI: 10.1016/j.artint.2006.01.003 55, 73
J. Hoffmann and B. Nebel. The FF planning system: Fast plan generation through heuristic search.
Journal of Artificial Intelligence Research, 14:253–302, 2001. DOI: 10.1613/jair.855 13, 31, 37, 39
J. Hoffmann, J. Porteous, and L. Sebastia. Ordered landmarks in planning. Journal of Artificial Intel-
ligence Research, 22:215–278, 2004. DOI: 10.1613/jair.1492 13, 38, 39
J. Hoffmann, C. Gomes, B. Selman, and H. A. Kautz. SAT encodings of state-space reachability
problems in numeric domains. In Proc. 20th Int. Joint Conf. on Artificial Intelligence, pages 1918–
1923, 2007. 34, 46
J. Hoffmann. The Metric-FF planning system: Translating "ignoring delete lists" to numeric state
variables. Journal of Artificial Intelligence Research, 20:291–341, 2003. DOI: 10.1613/jair.1144 48
J. Hoffmann. Where ‘ignoring delete lists’ works: Local search topology in planning benchmarks.
Journal of Artificial Intelligence Research, 24:685–758, 2005. DOI: 10.1613/jair.1747 35
J. Hoffmann. Analyzing search topology without running any search: On the connection be-
tween causal graphs and h+. Journal of Artificial Intelligence Research, 41:155–229, 2011. DOI:
10.1613/jair.3276 35
J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-
Wesley, 1979. 64
Y. Hu and G. de Giacomo. Generalized planning: Synthesizing plans that work for multiple en-
vironments. In Proc. 22nd Int. Joint Conf. on Artificial Intelligence, pages 918–923, 2011. DOI:
10.5591/978-1-57735-516-8/IJCAI11-159 11
P. Jonsson and C. Bäckström. Tractable planning with state variables by exploiting structural restric-
tions. In Proc. 12th Nat. Conf. on Artificial Intelligence, pages 998–1003, 1994. 35
A. Jonsson. The role of macros in tractable planning over causal graphs. In Proc. 20th Int. Joint Conf.
on Artificial Intelligence, pages 1936–1941, 2007. 111
A. Junghanns and J. Schaeffer. Sokoban: Enhancing general single-agent search methods us-
ing domain knowledge. Artificial Intelligence, 129(1):219–251, 2001. DOI: 10.1016/S0004-
3702(01)00109-6 23
F. Kabanza and S. Thiébaux. Search control in planning for temporally extended goals. In Proc. 15th
Int. Conf. on Automated Planning and Scheduling, pages 130–139, 2005. 64
L. P. Kaelbling, M. L. Littman, and A. Cassandra. Planning and acting in partially observable stochas-
tic domains. Artificial Intelligence, 101(1–2):99–134, 1998. DOI: 10.1016/S0004-3702(98)00023-
X 6, 13, 101, 107
D. Kahneman. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011. 111
S. Kambhampati, C. Knoblock, and Q. Yang. Planning as refinement search: A unified framework for
evaluating design tradeoffs in partial-order planning. Artificial Intelligence, 76(1–2):167–238, 1995.
DOI: 10.1016/0004-3702(94)00076-D 47
E. Karpas and C. Domshlak. Cost-optimal planning with landmarks. In Proc. 21st Int. Joint Conf. on
Artificial Intelligence, pages 1728–1733, 2009. 41
M. Katz and C. Domshlak. Structural patterns heuristics via fork decomposition. In Proc. 18th Int.
Conf. on Automated Planning and Scheduling, pages 182–189, 2008. 42
H. A. Kautz and J. F. Allen. Generalized plan recognition. In Proc. 5th Nat. Conf. on Artificial
Intelligence, pages 32–37, 1986. 57
H. A. Kautz and B. Selman. Planning as satisfiability. In Proc. 10th European Conf. on Artificial
Intelligence, pages 359–363, 1992. 45
H. A. Kautz and B. Selman. Pushing the envelope: Planning, propositional logic, and stochastic
search. In Proc. 13th Nat. Conf. on Artificial Intelligence, pages 1194–1201, 1996. 13, 45, 110
H. A. Kautz and B. Selman. Unifying SAT-based and graph-based planning. In Proc. 16th Int. Joint
Conf. on Artificial Intelligence, pages 318–327, 1999. 45, 46
T. Keller and P. Eyerich. PROST: Probabilistic planning based on UCT. In Proc. 22nd Int. Conf. on
Automated Planning and Scheduling, pages 119–127, 2012. 94
E. Keyder and H. Geffner. Heuristics for planning with action costs revisited. In Proc. 18th European
Conf. on Artificial Intelligence, pages 588–592, 2008. DOI: 10.3233/978-1-58603-891-5-588 32
E. Keyder and H. Geffner. The HMDPP planner for planning with probabilities. In 6th Int. Planning
Competition Booklet (ICAPS 2008), 2008. 93
E. Keyder and H. Geffner. Soft goals can be compiled away. Journal of Artificial Intelligence Research,
36:547–556, 2009. DOI: 10.1613/jair.2857 52, 53
E. Keyder, S. Richter, and M. Helmert. Sound and complete landmarks for And/Or graphs. In Proc.
19th European Conf. on Artificial Intelligence, pages 335–340, 2010. 39
C. A. Knoblock. Learning abstraction hierarchies for problem solving. In Proc. 8th Nat. Conf. on
Artificial Intelligence, pages 923–928, 1990. 111
L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Proc. 17th European Conf. on
Machine Learning, pages 282–293, 2006. DOI: 10.1007/11871842_29 93
S. Koenig and X. Sun. Comparing real-time and incremental heuristic search for real-time situ-
ated agents. Journal of Autonomous Agents and Multi-Agent Systems, 18(3):313–341, 2009. DOI:
10.1007/s10458-008-9061-x 23
A. Kolobov, P. Dai, Mausam, and D. S. Weld. Reverse iterative deepening for finite-horizon MDPs
with large branching factors. In Proc. 22nd Int. Conf. on Automated Planning and Scheduling, pages
146–154, 2012. 93
A. Kolobov, Mausam, and D. S. Weld. LRTDP versus UCT for online probabilistic planning. In
Proc. 26th Conf. on Artificial Intelligence, pages 1786–1792, 2012. 93
R. E. Korf. Depth-first iterative-deepening: An optimal admissible tree search. Artificial Intelligence,
27(1):97–109, 1985. DOI: 10.1016/0004-3702(85)90084-0 19
R. E. Korf. Real-time heuristic search. Artificial Intelligence, 42:189–211, 1990. DOI: 10.1016/0004-
3702(90)90054-4 21, 89
U. Kuter, D. S. Nau, E. Reisner, and R. P. Goldman. Using classical planners to solve nondeterministic
planning problems. In Proc. 18th Int. Conf. on Automated Planning and Scheduling, pages 190–197,
2008. 78
N. Lipovetzky and H. Geffner. Searching for plans with carefully designed probes. In Proc. 21st Int.
Conf. on Automated Planning and Scheduling, pages 154–161, 2011. 39
N. Lipovetzky and H. Geffner. Width and serialization of classical planning problems. In Proc. 20th
European Conf. on Artificial Intelligence, pages 540–545, 2012. DOI: 10.3233/978-1-61499-098-7-
540 35, 39
M. L. Littman. Memoryless policies: Theoretical limitations and practical results. In D. Cliff, editor,
From Animals to Animats 3. MIT Press, 1994. 60
Y. Liu, S. Koenig, and D. Furcy. Speeding up the calculation of heuristics for heuristic search-based
planning. In Proc. 18th Nat. Conf. on Artificial Intelligence, pages 484–491, 2002. 31
B. Marthi, S. J. Russell, and J. Wolfe. Angelic semantics for high-level actions. In Proc. 17th Int. Conf.
on Automated Planning and Scheduling, pages 232–239, 2007. 111
M. Martin and H. Geffner. Learning generalized policies in planning using concept languages. In
Proc. 7th Int. Conf. on Principles of Knowledge Representation and Reasoning, pages 667–677, 2000.
12
Mausam and A. Kolobov. Planning with Markov Decision Processes: An AI Perspective. Morgan &
Claypool, 2012. DOI: 10.2200/S00426ED1V01Y201206AIM017 92, 93
D. McAllester and D. Rosenblitt. Systematic nonlinear planning. In Proc. 9th Nat. Conf. on Artificial
Intelligence, pages 634–639, 1991. 13, 47
D. V. McDermott. A heuristic estimator for means-ends analysis in planning. In Proc. 3rd Int. Conf.
on Artificial Intelligence Planning Systems, pages 142–149, 1996. 13, 24
G. Shani, J. Pineau, and R. Kaplow. A survey of point-based POMDP solvers. Journal of Autonomous
Agents and Multi-Agent Systems, pages 1–51, 2012. Online-First Article. DOI: 10.1007/s10458-
012-9200-2 102, 103
D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Proc. 24th Annual Conf. on
Advances in Neural Information Processing Systems, pages 2164–2172, 2010. 105
H. A. Simon. A behavioral model of rational choice. The Quarterly Journal of Economics, 69(1):99–118,
1955. DOI: 10.2307/1884852 23
M. Sipser. Introduction to the Theory of Computation. Thomson Course Technology, Boston, MA, 2nd
edition, 2006. 8, 45, 64
R. Smallwood and E. Sondik. The optimal control of partially observable Markov processes over a
finite horizon. Operations Research, 21:1071–1088, 1973. DOI: 10.1287/opre.21.5.1071 98
D. E. Smith and D. S. Weld. Conformant graphplan. In Proc. 15th Nat. Conf. on Artificial Intelligence,
pages 889–896, 1998. 54
D. E. Smith and D. S. Weld. Temporal planning with mutual exclusion reasoning. In Proc. 16th Int.
Joint Conf. on Artificial Intelligence, pages 326–337, 1999. 49
D. E. Smith, J. Frank, and A. K. Jonsson. Bridging the gap between planning and scheduling. The
Knowledge Engineering Review, 15(1):47–83, 2000. DOI: 10.1017/S0269888900001089 47, 48
D. E. Smith. Choosing objectives in over-subscription planning. In Proc. 14th Int. Conf. on Automated
Planning and Scheduling, pages 393–401, 2004. 30, 51
S. Srivastava, N. Immerman, and S. Zilberstein. A new representation and associated algorithms for
generalized planning. Artificial Intelligence, 175(2):615–647, 2011.
DOI: 10.1016/j.artint.2010.10.006 11