SPUDD: Stochastic Planning using Decision Diagrams
Jesse Hoey
Robert St-Aubin
Alan Hu
Craig Boutilier
Department of Computer Science
University of British Columbia
Vancouver, BC, V6T 1Z4, CANADA
{jhoey, staubin, ajh, cebly}@cs.ubc.ca
Abstract
Structured methods for solving factored
Markov decision processes (MDPs) with large
state spaces have recently been proposed to allow dynamic programming to be applied without the need for complete state enumeration. We
propose and examine a new value iteration algorithm for MDPs that uses algebraic decision diagrams (ADDs) to represent value functions and
policies, assuming an ADD input representation
of the MDP. Dynamic programming is implemented via ADD manipulation. We demonstrate
our method on a class of large MDPs (up to 63
million states) and show that significant gains can
be had when compared to tree-structured representations (with up to a thirty-fold reduction in
the number of nodes required to represent optimal
value functions).
1 Introduction
Markov decision processes (MDPs) have become the semantic model of choice for decision theoretic planning
(DTP) in the AI planning community. While classical computational methods for solving MDPs, such as value iteration and policy iteration [19], are often effective for small
problems, typical AI planning problems fall prey to Bellman’s curse of dimensionality: the size of the state space
grows exponentially with the number of domain features.
Thus, classical dynamic programming, which requires explicit enumeration of the state space, is typically infeasible
for feature-based planning problems.
Considerable effort has been devoted to developing representational and computational methods for MDPs that obviate the need to enumerate the state space [5]. Aggregation
methods do this by aggregating a set of states and treating
the states within any aggregate state as if they were identical [3]. Within AI, abstraction techniques have been widely
studied as a form of aggregation, where states are (implicitly) grouped by ignoring certain problem variables [14, 7,
12]. These methods automatically generate abstract MDPs
by exploiting structured representations, such as probabilistic STRIPS rules [16] or dynamic Bayesian network (DBN)
representations of actions [13, 7].
In this paper, we describe a dynamic abstraction method for
solving MDPs using algebraic decision diagrams (ADDs)
[1] to represent value functions and policies. ADDs
are generalizations of ordered binary decision diagrams
(BDDs) [10] that allow non-boolean labels at terminal
nodes. This representational technique allows one to describe a value function (or policy) as a function of the variables describing the domain rather than in the classical “tabular” way. The decision graph used to represent this function is often extremely compact, implicitly grouping together states that agree on value at different points in the dynamic programming computation. As such, the number of
expected value computations and maximizations required
by dynamic programming are greatly reduced.
The algorithm described here derives from the structured
policy iteration (SPI) algorithm of [7, 6, 4], where decision trees are used to represent value functions and policies. Given a DBN action representation (with decision
trees used to represent conditional probability tables) and
a decision tree representation of the reward function, SPI
constructs value functions that preserve much of the DBN
structure. Unfortunately, decision trees cannot compactly
represent certain types of value functions, especially those
that involve disjunctive value assessments. For instance, if a disjunctive proposition over three variables (e.g., one of the form $p \lor q \lor r$) describes a group of states that have
a specific value, a decision tree must duplicate that value
three times (and in SPI the value is computed three times).
Furthermore, if the proposition describes not a single value,
but rather identical subtrees involving other variables, the
entire subtrees must be duplicated. Decision graphs offer
the advantage that identical subtrees can be merged into
one. As we demonstrate in this paper, this offers considerable computational advantages in certain natural classes
of problems. In addition, highly optimized ADD manipulation software can be used in the implementation of value
iteration.
The remainder of the paper is organized as follows. We provide a cursory review of MDPs and value iteration in Section 2. In Section 3, we review ADDs and describe our
ADD representation of MDPs. In Section 4, we describe
a conceptually straightforward version of SPUDD, a value
iteration algorithm that uses an ADD value function representation, and describe the key differences with the SPI algorithm. We also describe several optimizations that reduce
both the time and memory requirements of SPUDD. Empir-
ical results on a class of process planning examples are described in Section 5. We are able to solve some very large
MDPs exactly (up to 63 million states) and we show that the
ADD value function representation is considerably smaller
than the corresponding decision tree in most instances. This
illustrates that natural problems often have the type of disjunctive structure that can be exploited by decision graph
representations. We conclude in Section 6 with a discussion
of future work in using ADDs for DTP.
2 Markov Decision Processes

We assume that the domain of interest can be modeled as a fully-observable MDP [2, 19] with a finite set of states $S$ and actions $A$. Actions induce stochastic state transitions, with $\Pr(s, a, t)$ denoting the probability with which state $t$ is reached when action $a$ is executed at state $s$. We also assume a real-valued reward function $R$, associating with each state $s$ its immediate utility $R(s)$. (We ignore action costs for ease of exposition; these impose no serious complications.)

A stationary policy $\pi : S \rightarrow A$ describes a particular course of action to be adopted by an agent, with $\pi(s)$ denoting the action to be taken in state $s$. We assume that the agent acts indefinitely (an infinite horizon). We compare different policies by adopting an expected total discounted reward as our optimality criterion, wherein future rewards are discounted at a rate $0 \leq \beta < 1$, and the value of a policy is given by the expected total discounted reward accrued. The expected value $V_{\pi}(s)$ of a policy $\pi$ at a given state $s$ satisfies [19]:

$$V_{\pi}(s) = R(s) + \beta \sum_{t \in S} \Pr(s, \pi(s), t) \cdot V_{\pi}(t) \qquad (1)$$

A policy $\pi$ is optimal if $V_{\pi}(s) \geq V_{\pi'}(s)$ for all $s \in S$ and policies $\pi'$. The optimal value function $V^{*}$ is the value of any optimal policy.

Value iteration [2] is a simple iterative approximation algorithm for constructing optimal policies. It proceeds by constructing a series of $n$-stage-to-go value functions $V^{n}$. Setting $V^{0} = R$, we define

$$V^{n+1}(s) = R(s) + \max_{a \in A} \left\{ \beta \sum_{t \in S} \Pr(s, a, t) \cdot V^{n}(t) \right\} \qquad (2)$$

The sequence of value functions $V^{n}$ produced by value iteration converges linearly to the optimal value function $V^{*}$. For some finite $n$, the actions that maximize Equation 2 form an optimal policy, and $V^{n}$ approximates its value. A commonly used stopping criterion specifies termination of the iteration procedure when

$$\| V^{n+1} - V^{n} \| \leq \frac{\varepsilon (1 - \beta)}{2 \beta} \qquad (3)$$

(where $\| X \| = \max \{ |x| : x \in X \}$ denotes the supremum norm). This ensures that the resulting value function $V^{n+1}$ is within $\varepsilon/2$ of the optimal function $V^{*}$ at any state, and that the resulting policy is $\varepsilon$-optimal [19].
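To make Equations 1-3 concrete, the following sketch implements flat (table-based) value iteration over an explicitly enumerated state space; it is only a minimal illustration of the Bellman update in Equation 2 and the stopping test of Equation 3, not the ADD-based method developed in the rest of the paper. The data-structure choices and names (P, R, beta, epsilon) are ours and purely illustrative.

```python
# Minimal sketch of flat value iteration (Equations 2 and 3).
# P[a][s][t] is Pr(s, a, t); R[s] is the immediate reward; beta is the discount rate.
# These names are illustrative only and not part of SPUDD itself.

def value_iteration(P, R, beta, epsilon):
    n_states = len(R)
    actions = list(P.keys())
    V = list(R)                                   # V^0 = R
    while True:
        # Bellman backup: V^{n+1}(s) = R(s) + max_a beta * sum_t Pr(s,a,t) * V^n(t)
        V_next = [
            R[s] + max(beta * sum(P[a][s][t] * V[t] for t in range(n_states))
                       for a in actions)
            for s in range(n_states)
        ]
        # Stopping criterion (Equation 3): sup-norm of successive differences.
        if max(abs(V_next[s] - V[s]) for s in range(n_states)) <= epsilon * (1 - beta) / (2 * beta):
            return V_next
        V = V_next
```

The explicit enumeration over states in this sketch is exactly the cost that the ADD-based approach described next is designed to avoid.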
3 ADDs and MDPs

Algebraic decision diagrams (ADDs) [1] are a generalization of BDDs [10], a compact, efficiently manipulable data structure for representing boolean functions. These data structures have been used extensively in the VLSI CAD field and have enabled the solution of much larger problems than previously possible. In this section, we describe these data structures and basic operations on them, and show how they can be used for MDP representation.

3.1 Algebraic Decision Diagrams

A BDD represents a function $\{0,1\}^{n} \rightarrow \{0,1\}$ from $n$ boolean variables to a boolean result. Bryant [10] introduced the BDD in its current form, although the general ideas have been around for quite some time (e.g., as branching programs in the theoretical computer science literature). (We are describing the most common variety of BDD; numerous variations exist in the literature.) Conceptually, we can construct the BDD for a boolean function as follows. First, build a decision tree for the desired function, obeying the restrictions that along any path from root to leaf, no variable appears more than once, and that along every path from root to leaf, the variables always appear in the same order. Next, apply the following two reduction rules as much as possible: (1) merge any duplicate (same label and same children) nodes; and (2) if both child pointers of a node point to the same child, delete the node because it is redundant (with the parents of the node now pointing directly to the child of the node). The resulting directed, acyclic graph is the BDD for the function. In practice, BDDs are generated and manipulated in the fully-reduced form, without ever building the decision tree.

ADDs generalize BDDs to represent real-valued functions $\{0,1\}^{n} \rightarrow \mathbb{R}$; thus, in an ADD, we have multiple terminal nodes labeled with numeric values. More formally, an ADD denotes a function as follows:

1. The function of a terminal node is the constant function $f(X_1, \ldots, X_n) = c$, where $c$ is the number labelling the terminal node.

2. The function of a nonterminal node labeled with boolean variable $X_i$ is given by

$$f(X_1, \ldots, X_n) = X_i \cdot f_{then}(X_1, \ldots, X_n) + \overline{X_i} \cdot f_{else}(X_1, \ldots, X_n)$$

where boolean values are viewed as 0 and 1, and $f_{then}$ and $f_{else}$ are the functions of the ADDs rooted at the then and else children of the node.

BDDs and ADDs have several useful properties. First, for a given variable ordering, each distinct function has a unique reduced representation. In addition, many common functions can be represented compactly because of isomorphic-subgraph sharing. Furthermore, efficient algorithms (e.g., depth-first search with a hash table to reuse previously computed results) exist for most common operations, such as addition, multiplication, and maximization. For example, Figure 1 shows a computation of the maximum of two ADDs. Finally, because BDDs and ADDs have been used extensively in other domains, very efficient implementations are readily available. As we will see, these properties make ADDs an ideal candidate to represent structured value functions in MDP solution algorithms.

Figure 1: Simple ADD maximization example, $h(z,x,y) = \mathrm{MAX}(f(z,x), g(z,y))$.
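As a rough illustration of the data structure (not of the CUDD package [20] used in our implementation), the sketch below represents reduced ADDs with shared nodes and computes the pointwise maximum of two ADDs by the usual recursive apply operation with memoization, as in Figure 1. The class and function names, and the unique-table scheme, are ours.

```python
# Sketch of a reduced ADD with an apply-style MAX operation (cf. Figure 1).
# Illustration only; our implementation uses the CUDD package.

class ADD:
    _cache = {}                                   # unique table: ensures canonicity

    def __init__(self, var=None, then=None, els=None, value=None):
        self.var, self.then, self.els, self.value = var, then, els, value

    @classmethod
    def terminal(cls, value):
        return cls._cache.setdefault(('t', value), cls(value=value))

    @classmethod
    def node(cls, var, then, els):
        if then is els:                           # reduction rule: redundant test
            return then
        return cls._cache.setdefault((var, id(then), id(els)), cls(var, then, els))

def apply_max(f, g, order, memo=None):
    """Pointwise maximum of two ADDs sharing the variable ordering 'order'."""
    memo = {} if memo is None else memo
    key = (id(f), id(g))
    if key in memo:
        return memo[key]
    if f.var is None and g.var is None:           # both terminals
        result = ADD.terminal(max(f.value, g.value))
    else:
        # split on the earlier variable in the ordering
        top = min((x for x in (f.var, g.var) if x is not None), key=order.index)
        f_then, f_els = (f.then, f.els) if f.var == top else (f, f)
        g_then, g_els = (g.then, g.els) if g.var == top else (g, g)
        result = ADD.node(top,
                          apply_max(f_then, g_then, order, memo),
                          apply_max(f_els, g_els, order, memo))
    memo[key] = result
    return result
```

For instance, with order = ['Z', 'X', 'Y'], the two diagrams of Figure 1 can be built from ADD.node and ADD.terminal and then combined with apply_max; addition and multiplication follow the same recursive pattern with a different terminal operation.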
3.2 ADD Representation of MDPs
We assume that the MDP state space is characterized by a set of variables $X = \{X_1, \ldots, X_n\}$. Values of variable $X_i$ will be denoted in lowercase (e.g., $x_i$). We assume each $X_i$ is boolean, as required by the ADD formalism, though we discuss multi-valued variables in Section 5. Actions are often most naturally described as having an effect on specific variables under certain conditions, implicitly inducing state transitions. DBN action representations [13, 7] exploit this fact, specifying a local distribution over each variable describing the (probabilistic) impact an action has on that variable.

A DBN for action $a$ requires two sets of variables, one set $X = \{X_1, \ldots, X_n\}$ referring to the state of the system before action $a$ has been executed, and $X' = \{X'_1, \ldots, X'_n\}$ denoting the state after $a$ has been executed. Directed arcs from variables in $X$ to variables in $X'$ indicate direct causal influence and have the usual semantics [17, 13]. The conditional probability table (CPT) for each post-action variable $X'_i$ defines a conditional distribution $P^a_{X'_i}$ over $X'_i$, i.e., $a$'s effect on $X_i$, for each instantiation of its parents. This can be viewed as a function $P^a_{X'_i}(X_1, \ldots, X_n)$, but where the function value (distribution) depends only on those $X_j$ that are parents of $X'_i$. No quantification is provided for pre-action variables $X$: since the process is fully observable, we need only use the DBN to predict state transitions. We require one DBN for each action $a \in A$. (We ignore the possibility of arcs among post-action variables, disallowing correlations in action effects. See [4] for a treatment of dynamic programming when such correlations exist.)

In order to illustrate our representation and algorithm, we introduce a simple adaptation of a process planning problem taken from [14]. The example involves a factory agent which has the task of connecting two objects A and B. Figure 2(a) illustrates our representation for the action bolt, where the two parts are bolted together. We see that whether the parts are successfully connected, C, depends on a number of factors, but is independent of the state of variable P (painted). In contrast, whether part A is punched, APU, after bolting depends only on whether it was punched before bolting.

Rather than the standard, locally exponential, tabular representation of CPTs, we use ADDs to capture regularities in the CPTs (i.e., to represent the functions $P^a_{X'_i}(X_1, \ldots, X_n)$). This type of representation exploits context-specific independence in the distributions [9], and is related to the use of tree representations [7] and rule representations [18] of CPTs in DBNs. Figure 2(b) illustrates the ADD representation of the CPT for two variables, C' and APU'. While the distribution over C' is a function of its seven parent variables, this function exhibits considerable regularity, readily apparent by inspection of the table, which is exploited by the ADD. Specifically, the distribution over C' is a single boolean condition on its parents multiplied by 0.9: the parts become connected with probability 0.9 whenever that condition holds, and with probability 0 otherwise (we ignore the zero entries). Similarly, the ADD for APU' corresponds to:

$$\Pr(APU' \mid APU) = APU \cdot 1.0$$

Reward functions can be represented similarly. Figure 2(c) shows the ADD representation of the reward function for this simple example: the agent is rewarded with 10 if the two objects are connected and painted, with a smaller reward of 5 when the two objects are connected but not painted, and is given no reward when the parts are not connected. The reward function, $R(X_1, \ldots, X_n)$, is simply

$$R = C \cdot P \cdot 10.0 + C \cdot \overline{P} \cdot 5.0$$
This example action illustrates the type of structure that can
be exploited by an ADD representation. Specifically, the
CPT for C' clearly exhibits disjunctive structure, where a
variety of distinct conditions each give rise to a specific
probability of successfully connecting two parts. While this
ADD has seven internal nodes and two leaves, a tree representation for the same CPT requires 11 internal nodes and
12 leaves. As we will see, this additional structure can be
exploited in value iteration. Note also that the standard matrix representation of the CPT requires 128 parameters.
ADDs are often much more compact than trees when representing functions, but this is not always the case. The
ordering requirement on ADDs means that certain functions can require an exponentially larger ADD representation than a well-chosen tree; similarly, ADDs can be exponentially smaller than decision trees. Our initial results suggest that such pathological examples are unlikely to arise in
most problem domains (see Section 5), and that ADDs offer
an advantage over decision trees.
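As a small, self-contained illustration of how such factored quantities can be written down (independently of any particular decision-diagram library), the reward function of Figure 2(c) and the APU' CPT of the bolt action can be expressed as functions that test only the variables they actually depend on, in contrast to a full tabular CPT over all seven parents. The dictionary-based encoding is ours, not part of SPUDD.

```python
# Factored reward and CPT for the small FACTORY example (cf. Figure 2).
# A state is a dict mapping variable names to booleans; these encodings are
# illustrative only -- SPUDD stores the same functions as ADDs.

def reward(state):
    # R = C * P * 10 + C * (not P) * 5: reward only when the parts are connected.
    if state['C']:
        return 10.0 if state['P'] else 5.0
    return 0.0

def pr_apu_prime_true(state):
    # Pr(APU' | APU) = APU * 1.0: bolting leaves "A punched" unchanged.
    return 1.0 if state['APU'] else 0.0

example = {'C': True, 'P': False, 'APU': True}
print(reward(example))               # 5.0
print(pr_apu_prime_true(example))    # 1.0
```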
4 Value Iteration using ADDs

In this section, we present an algorithm for optimal policy construction that avoids the explicit enumeration of the state space. SPUDD (stochastic planning using decision diagrams) implements classical value iteration, but uses ADDs to represent value functions and CPTs.
Figure 2: Small FACTORY example: (a) action network for action bolt; (b) ADD representation of CPTs (action diagrams), shown alongside the matrix representation; and (c) immediate reward network and ADD representation of the reward table.
It exploits the regularities in the action and reward networks, made explicit by the ADD representation described in the previous section, to discover regularities in the value functions it constructs. This often yields substantial savings in both space and computational time. We first introduce the algorithm in a conceptually clear way, and then describe certain optimizations.
OBDDs have been explored in previous work in AI planning [11], where universal plans (much like policies) are
generated for nondeterministic domains. The motivation in
that work, avoiding the combinatorial explosion associated
with state space enumeration, is similar to ours; but the details of the algorithms, and how the representation is used
to represent planning domains, is quite different.
4.1 The Basic SPUDD Algorithm
The SPUDD algorithm, shown in Figure 3, implements a form of value iteration, producing a sequence of value functions $V^0, V^1, \ldots$ until the termination condition is met. Each $i$ stage-to-go value function is represented as an ADD denoted $V^i(X_1, \ldots, X_n)$. Since $V^0 = R$, the first value function has an obvious ADD representation. The key insight underlying SPUDD is to exploit the ADD structure of $V^i$ and the MDP representation itself to discover the appropriate ADD structure for $V^{i+1}$. Expected value calculations and maximizations are then performed at each terminal node of the new ADD rather than at each state. Given an ADD for $V^i$, Step 3 of SPUDD produces $V^{i+1}$.

When computing $V^{i+1}$, the function $V^i$ is viewed as representing values at future states, after a suitable action has been performed with $i+1$ stages remaining. So variables in $V^i$ are first replaced by their primed, or post-action, counterparts (Step 3(a)), referring to the state with $i$ stages-to-go; this prevents them from being confused with unprimed variables that refer to the state with $i+1$ stages-to-go. Figure 4(a) shows the zero stage-to-go primed value diagram, $V'^0$, for our simple example.

For each action $a$, we then compute an ADD representation of the function $V_a^{i+1}$, denoting the expected value of performing action $a$ with $i+1$ stages to go given that $V^i$ dictates the $i$ stage-to-go value. This requires several steps, described below. First, we note that the ADD-represented functions $P^a_{X'_j}$, taken from the action network for $a$, give the (conditional) probabilities that variables $X'_j$ are made true by action $a$. To fit within the ADD framework, we introduce the negative action diagrams

$$\overline{P}^{a}_{X'_j}(X_1, \ldots, X_n) = 1 - P^{a}_{X'_j}(X_1, \ldots, X_n)$$

which give the probability that $a$ will make $X_j$ false. We then define the dual action diagrams $D^{a}_{X'_j}$ as the ADD rooted at $X'_j$, whose true branch is the action diagram $P^{a}_{X'_j}$ and whose false branch is the negative action diagram $\overline{P}^{a}_{X'_j}$:

$$D^{a}_{X'_j}(X'_j, X_1, \ldots, X_n) = X'_j \cdot P^{a}_{X'_j}(X_1, \ldots, X_n) + \overline{X'_j} \cdot \overline{P}^{a}_{X'_j}(X_1, \ldots, X_n) \qquad (4)$$

Intuitively, $D^{a}_{X'_j}(x'_j, x_1, \ldots, x_n)$ denotes $\Pr(X'_j = x'_j \mid X_1 = x_1, \ldots, X_n = x_n)$ under action $a$. Figure 4(a) shows the dual action diagram for the variable C' from the example in Figure 2(b).
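The construction in Equation 4 can be phrased, at the level of plain functions rather than ADDs, as in the sketch below, which builds the dual "diagram" for one post-action variable from its CPT. It is only meant to make the bookkeeping explicit; the function and variable names are ours.

```python
# Dual action diagram (Equation 4) at the level of explicit functions.
# cpt(state) returns Pr(X'_j = true | X_1 ... X_n) under a fixed action a.

def dual(cpt):
    """Return D(xj_prime, state) = Pr(X'_j = xj_prime | state) under action a."""
    def D(xj_prime, state):
        p_true = cpt(state)
        return p_true if xj_prime else 1.0 - p_true   # negative action diagram
    return D

# Example with the APU' CPT of the bolt action: Pr(APU' | APU) = APU * 1.0.
D_apu = dual(lambda state: 1.0 if state['APU'] else 0.0)
print(D_apu(True,  {'APU': True}))   # 1.0
print(D_apu(False, {'APU': True}))   # 0.0
```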
1. Set $V^0 := R$, where $R$ is the immediate reward diagram; set $i := 0$.

2. Create the dual action diagrams $D^{a}_{X'_j}(X'_j, X_1, \ldots, X_n)$ for each action $a \in A$ and each variable $X'_j$.

3. Repeat until $\|V^{i+1} - V^{i}\|$ satisfies the stopping criterion of Equation 3:
   (a) Swap all variables in $V^{i}$ with their primed versions to create $V'^{i}$.
   (b) For all $a \in A$:
          Set $temp := V'^{i}$.
          For all primed variables $X'_j$ in $V'^{i}$:
              $temp := temp \times D^{a}_{X'_j}$
              Set $temp :=$ the sum of the sub-diagrams of $temp$ over the primed variable $X'_j$.
          End For
          Multiply the result by the discounting factor $\beta$ and add $R$ to obtain $V_a^{i+1}$.
       End For
   (c) Maximize over all the $V_a^{i+1}$ to create $V^{i+1}$.
   (d) Increment $i$.
   End Repeat

4. Perform one more iteration and assign to each terminal node the actions $a$ which contributed the value in the value ADD at that node; this yields the $\varepsilon$-optimal policy ADD, $\pi$. Note that terminal nodes which have the same values for multiple actions are assigned all possible actions in $\pi$.

5. Return the value diagram $V^{i+1}$ and the optimal policy $\pi$.
Figure 3: SPUDD algorithm
g
and then eliminating
by summing over its values in the
Z
resultant ADD. More precisely, by multiplying µ ã I by -a ,
we obtain a function s6
k
k
g
k
0CC0C
k
g
O
g
g
C0CC O where
s6 0 CC0 C O 0CC0C O 1
k
k
k jk
k
- 0 CC0C ´<
0u uu O
O
(assuming transitions induced by action ). This intermedi-
ate calculation is illustrated in Figure 4(b), where the dual
diagram for variable M is the first to be multiplied by -a³Q .
Note that P lies at the root of this ADD. Once this function s is obtained, we can eliminate
dependence of future
g
value on the specific value of
by taking an expectation
over both of its truth values. This is done by summing the
left and right subgraphs of the ADD for s , leaving us with
the function
è g C0CC g é g 0 CC0C g g g 21
k u0uj u g O
g
S6k
gO
g
8Òê - 0C CC 0 CC0C O ´ <
u u0u O
ãI
This is illustrated in Figure 4(c), where the variable P is
eliminated. This ADD denotes the expected future value (or
$ stage-to-go value) as a function of the parents of P with
+ stage-to-go and all post-action variables except with $
stages-to-go.
g
This process is repeated for each
post-action variable
Z
that occurs in the ADD for -a : we first multiply µ ã I into
the intermediate value ADD, then eliminate that variable by
taking an expectation over its values. Once all primed variables have been eliminated, we are left with a function
g
ë g
0 CCCe O 1
k
k
k jg
g
ê 8 ê - 0 CCC O ´< uuu O CCC
Iìeíwîwîwîwí Iï
k jg
g
< O u u0u O
By the independence assumptions embodied in the action
network, this is precisely the expected future value of performing action . By adding the reward ADD to this
function, we obtain an ADD representation of - Z S6 . Figure 5 shows the result for our simple example. The remaining primed variable P in Figure 4(c) has been removed,
producing -?ð@ ñò³ó using a discount factor of $ . Finally,
we
to produceuw¨ the - S6 diatake the maximum over all actions
that
gram. Given ADDs for each - Z S6 , this requires
:[ simply
one construct the ADD representing VW_Y Z
- Z S6 .
The stopping criterion in Equation 3 is implemented
by
comparing each pair of successive ADDs, - S6 and - .
Once the value function has converged, the -optimal pold one further
icy, or policy ADD, is extracted by performing
dynamic programming backup, and assigning to each terminal node the actions which produced the maximimizing
value. Since each terminal node represents some state set of
states ô , the set of actions thus determined are each optimal
for any PJô .
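The following sketch carries out one backup of the kind just described, but over explicit dictionaries of state values rather than ADDs, so that the algebra (multiply by each dual diagram, sum out the primed variables, discount, add the reward, maximize over actions) is visible without any decision-diagram machinery. Because it enumerates states, it is only an illustration of the computation SPUDD performs symbolically on terminal nodes; all helper names are hypothetical.

```python
from itertools import product

# One SPUDD-style backup, written over enumerated states for clarity.
# SPUDD performs the same sums and maximizations on ADD terminal nodes instead.

def backup(variables, V, duals, reward, beta):
    """variables: list of names; V: dict state-tuple -> i stage-to-go value;
    duals: {action: {var: D(x_prime, state_dict) -> prob}}; reward: dict state-tuple -> value."""
    states = [dict(zip(variables, bits))
              for bits in product([True, False], repeat=len(variables))]
    V_next = {}
    for s in states:
        key = tuple(s[v] for v in variables)
        q_values = []
        for a, D in duals.items():
            # Expected future value: sum over primed assignments t of
            # V^i(t) * prod_j Pr(X'_j = t_j | s), using independence of action effects.
            expected = 0.0
            for t in states:
                prob = 1.0
                for v in variables:
                    prob *= D[v](t[v], s)
                expected += prob * V[tuple(t[v] for v in variables)]
            q_values.append(beta * expected)
        V_next[key] = reward[key] + max(q_values)
    return V_next
```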
4.2 Optimizations
The algorithm as described in the last section, and as shown in Figure 3, suffers from certain practical difficulties which make it necessary to introduce various optimizations in order to improve efficiency with respect to both space and time. The problems arise in Step 3(b) when $V'^{i}$ is multiplied by the dual action diagrams $D^{a}_{X'_j}$. Since there are potentially $n$ primed variables in the ADD for $V'^{i}$ and $n$ unprimed variables in the ADD for $D^{a}_{X'_j}$, there is an intermediate step in which a diagram is created with (potentially) up to $2n$ variables. Although this will not be the case in general, it was deemed necessary to modify the method in order to deal with the possibility of this problem arising. Furthermore, a large computational overhead is introduced by re-calculating the joint probability distributions over the primed variables at each iteration. In this section, we first discuss optimizations for dealing with space, followed by a method for optimizing computation time.
Figure 4: First Bellman backup of the value iteration using ADDs algorithm. (a) 0 stage-to-go primed value diagram, and dual action diagram for variable C', $D^{bolt}_{C'}$; (b) intermediate result after multiplying $V'^0$ with $D^{bolt}_{C'}$; (c) intermediate result after quantifying over C'.
Figure 5: Resulting 1 stage-to-go value diagram for action bolt, $V^{1}_{bolt}$.
e
the corresponding dual diagram. This process will only remove the dependency of the -a on a primed variable for
a given branch, and will therefore only introduce a single
diagram of N unprimed variables at a leaf node of -a . By
out this procedure using the structure
recursively carrying
of the ADD for -P , the intermediate stages never grow too
large. Essentially, the additional unprimed variables are introduced only at specific points in the ADD and the corresponding primed variable immediately eliminated—this is
much like the tree-structured dynamic programming algorithm of [7].
Unfortunately, this method requires a great deal of unnecessary, repeated computation. Since the action diagrams
3 of the algorithm, and then primed variables eliminated.
Although this may lead to a substantial savings in compuf
tation time, it will again generate diagrams with up to N
variables.
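In the same plain-function style as the earlier sketches, Equation 5 amounts to pre-computing, once per action, the product of the dual diagrams so that the joint transition probability need not be rebuilt at every iteration. The sketch below is a minimal illustration of that idea; the names are ours, and a real implementation would form the product as an ADD rather than as a closure.

```python
# Complete action diagram (Equation 5) as a pre-computed joint transition function.

def complete_action_diagram(variables, duals_for_a):
    """duals_for_a: {var: D(x_prime, state) -> prob} for one action a.
    Returns P(next_state, state) = prod_j Pr(X'_j = next_state[j] | state)."""
    def P(next_state, state):
        prob = 1.0
        for v in variables:
            prob *= duals_for_a[v](next_state[v], state)
        return prob
    return P

# Pre-compute P once per action, then reuse it across all value iteration steps.
```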
As a compromise, we implemented a method where the space-time trade-off can be addressed explicitly. A "tuning knob" enables the user to find a middle ground between the two methods mentioned above. We accomplish this by pre-computing only subsets of the complete action diagram. That is, we break the large diagram up into a few smaller pieces. The set of variables $X'_1, \ldots, X'_n$ is divided into $k$ subsets, preserving the total ordering (e.g., $\{X'_1, \ldots, X'_{j_1}\}, \{X'_{j_1+1}, \ldots, X'_{j_2}\}, \ldots, \{X'_{j_{k-1}+1}, \ldots, X'_n\}$), and the complete action diagrams are pre-computed for each subset. Step 3(b) of the algorithm must be modified as shown in Figure 6.
1. Set BIGADD := user-specified limit for the size of graphs.
   Set $temp :=$ the constant ADD 1; set $j := 1$; $k := 1$; $size := 0$.

2. While $j \leq$ number of variables:
       While $size <$ BIGADD and $j \leq$ number of variables:
           Set $temp := temp \times D^{a}_{X'_j}$.
           $size :=$ number of internal nodes in $temp$.
           $j := j + 1$.
       End While
       Record $temp$ as the complete action diagram for the $k$-th subset;
       reset $temp$ and $size$; $k := k + 1$.
   End While

3. Repeat until $\|V^{i+1} - V^{i}\|$ satisfies the stopping criterion of Equation 3:
       ...
       (c) For all $a \in A$: set $V_a^{i+1} :=$ pRew($V'^{i}$, $P^{a}$, first subset).
       ...
   End Repeat

procedure pRew(node, $P^{a}$, subset):
    If node is a terminal node, or no variable of the current subset occurs below node:
        $temp :=$ node $\times$ the complete action diagram for the current subset
        result := sum of all sub-diagrams of $temp$ over the primed variables $X'_j$ in the subset
    Else:
        $temp_T :=$ pRew(then(node), $P^{a}$, subset)
        $temp_F :=$ pRew(else(node), $P^{a}$, subset)
        result := the diagram rooted at var(node) with then, else branches $temp_T$, $temp_F$, respectively
    Return result.
Figure 6: Modified SPUDD algorithm
The primed value diagram $V'^{i}$ is traversed down to the level of the first variable of the second subset, and the procedure is carried out recursively on each sub-diagram rooted there. When a level is reached with no variables of the current subset below it, the sub-diagram of $V'^{i}$ rooted at that level is multiplied with the corresponding subset of the complete action diagram and summed over the primed variables of that subset. In this way, the diagrams are kept small by making sure that enough elimination occurs to balance the effects of multiplying by complete action diagrams. The space and time requirements can then be controlled by the number of subsets the complete action diagrams are broken into. In theory, the more subsets, the smaller the space requirements and the larger the time requirements. Although we have been able to produce substantial changes in the space and time requirements of the algorithm using this tuning knob, its effects are still unclear. At present, we choose the $k$ subsets of variables by simply building the complete action diagrams according to some variable ordering until they reach a user-defined size limit, at which point we start on the next subset. We note that this space-time tradeoff bears some resemblance to the space-time tradeoffs that arise in probabilistic inference algorithms like variable elimination [15].
Although we have not implemented heuristics for variable
ordering, there are some simple ordering methods that could
improve space efficiency. For instance, if we order variables so that primed variables with many shared parents are
eliminated together, the number of unprimed variables introduced will be kept relatively small relative to the number
of primed variables eliminated. More importantly, we must
develop more refined heuristics that keep the ADDs small
rather than minimizing the number of variables introduced.
This revised procedure (Figure 6) has a small inefficiency,
as our results in the next section will show. Since we are
pre-computing subsets of the complete action diagrams, any
variables which are included in the domain, but are not relevant to its solution, will be included in these pre-computed
diagrams. This will increase the size of the intermediate
representations and will add overhead in computation time.
It is important to be able to discard them, and to only compute the policy over variables that are relevant to the value
function and policy [7]. A possible way to deal with these
types of variables in our algorithm would be to progressively build the complete action diagrams during the iterative procedure. In this way, only the variables relevant to
the domain would be added.
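The way we currently choose the $k$ subsets (grow each complete action diagram until it reaches a user-defined size limit, then start a new one) can be sketched as a simple greedy chunking of the ordered primed variables. Here the "size" of a partial product is supplied by a caller-provided function, standing in for the internal node count of the growing ADD; names such as BIGADD are hypothetical and the example numbers are invented for illustration.

```python
import math

# Greedy grouping of primed variables into subsets for pre-computation.
# size_of(vars_so_far) stands in for the node count of the growing ADD.

def choose_subsets(ordered_vars, size_of, BIGADD):
    subsets, current = [], []
    for v in ordered_vars:
        current.append(v)
        if size_of(current) >= BIGADD:        # limit reached: start the next subset
            subsets.append(current)
            current = []
    if current:
        subsets.append(current)
    return subsets

# Toy size estimate: product of per-variable diagram sizes (illustrative numbers only).
sizes = {'C': 9, 'P': 2, 'APU': 2, 'BPU': 2, 'ADR': 2, 'BDR': 2, 'BO': 2}
print(choose_subsets(list(sizes), lambda vs: math.prod(sizes[v] for v in vs), BIGADD=50))
# [['C', 'P', 'APU', 'BPU'], ['ADR', 'BDR', 'BO']]
```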
5 Data and Results

The procedure described above was implemented using the CUDD package [20], a library of C routines which provides support for manipulation of ADDs. Experimental results described in this section were all obtained using a dual-processor Sun SPARC Ultra 60 running at 300 MHz with 1 Gb of RAM, with only a single processor being used. The SPUDD algorithm was tested on three different types of examples, each type having MDP instances with different numbers of variables, hence a wide variety of state space sizes. The first example class consists of various adaptations of a process planning problem taken from [14]. The second and third example classes consist of synthetic problems taken from [7, 8]. These are designed to test best- and worst-case behavior of SPUDD. (Data for these problems can be found at the Web page: www.cs.ubc.ca/spider/staubin/Spudd/index.html.)
The first example class consists of process planning problems taken from [14], involving a factory agent which must
paint two objects and connect them. The objects must be
smoothed, shaped and polished and possibly drilled before
painting, each of these actions requiring a number of tools which may or may not be available. Various painting and connection methods are represented, each having an effect on the quality of the job, and each requiring tools. The final product is rewarded according to what kind of quality is needed. Rewards range from 0 to 10, and a discount factor of 0.9 was used throughout.
The examples used here, unlike the one described in Section 3, were not designed with any structure in mind which could be taken advantage of by an ADD representation. In the original problem specification, three ternary variables were used to represent the painting quality of each object (good, poor or false) and the connection quality (good, bad or false). However, as discussed above, ADDs can only represent binary variables, so each ternary variable was expanded into two binary ones. For example, the variable connected, describing the type of connection between the two objects, was represented by the boolean variables connected and connected well. This expansion enlarges the state space by a factor of 4/3 for each ternary variable so expanded (by introducing unreachable states). A number of FACTORY examples were devised, with state space sizes ranging from 55 thousand to 268 million.
Optimal policies were generated using SPUDD and a structured policy iteration (SPI) implementation for comparison
purposes [7]. Results, displayed in Table 1, are presented
for SPUDD running on six FACTORY examples, and for
SPI running on five. SPI was not run on the factory4 example, because its estimated time and space requirements
exceeded available capacity. SPI implements modified policy iteration using trees to represent CPTs and intermediate value and policy functions. SPI, however, does allow
multi-valued variables, so versions of each example were tested in SPI using both ternary variables and their binary
expansion. Table 1 shows the number of ternary variables
in each example, along with the total number of variables.
The state space sizes of each FACTORY example are shown
for both the original and the binary-expansion formulations.
SPUDD was only run on the binary-expanded versions.
The examples labelled factory1 and factory2 differ only by
a single binary variable, which is not affected by any action
in the domain, and which does not itself affect any other
variables. Hence, the numbers of internal nodes reported in Table 1 are identical for the two examples. This variable was added in order to show how structured representations like SPUDD and SPI can effectively discard variables which do not affect the problem at hand, as discussed
in Section 4.2. Since SPUDD pre-computes the complete
action diagrams, as shown in Figure 6, the running time for
SPUDD almost doubles when this new variable is added,
since it creates overhead for the iterative procedure. This
problem could be circumvented using the method described
at the end of Section 4.2.
Running times are shown for SPUDD and SPI. However,
the algorithms do not lend themselves easily to comparisons
of running times, since implementation details cloud the results; so running times will not be discussed further here.
The SPI results are shown in order to compare the sizes
of the final value function representations, which give an
indication of complexity for policy generation algorithms.
However, a question arises when comparing such numbers
about the variable orderings, as mentioned in Section 3. The
variable ordering for SPUDD is chosen prior to runtime and
remains the same during the entire process. No special techniques were used to choose the ordering, although it may be
argued that good orderings could be gleaned from the MDP
specification. Variable orderings within the branches of the
tree structure in the SPI algorithm are determined primarily
by the choice of ordering in the reward function and action
descriptions [7]. Again, no special techniques were used
to choose the variable ordering in SPI. Finding the optimal
variable orderings in either case is a difficult problem, and
we assume here that neither algorithm has an advantage in
this regard. Dynamic reordering algorithms are available in
CUDD, and have been implemented but not yet fully tested
in SPUDD (see below).
In order to compare representation sizes, we compare the
number of internal nodes in the value function representations only. This is most important when doing dynamic
programming back-up steps and is a large factor in determining both running time and space requirements. Furthermore, we compare numbers from SPUDD using binary representations with numbers from SPI using binary/ternary
representations in order not to disadvantage SPI, which can
make use of ternary variables. We also compare both implementations using only binary variables. The equivalent tree
leaves column in Table 1 gives the number of leaves of the totally ordered binary tree (and hence the number of internal nodes) that results from expanding the value ADD generated by SPUDD. These numbers give the size of the tree that would be generated if a total ordering were imposed. Comparing these numbers with the numbers generated by SPI gives an indication of the savings that occur due to the relaxation of the total ordering constraint. The rightmost column in Table 1 shows the ratio of the number of internal
nodes in the tree representation to the number in the ADD
representation. We see that reductions of up to 30 times
are possible, when comparing only binary representations
to binary/ternary representations, and reductions of over
40 times when comparing the same binary representations.
These space savings also showed up in the amount of memory used. For example, the factory3 example took 691Mb
of memory using SPI, and only 148Mb using SPUDD. The
factory4 example took 378Mb of space using SPUDD.
The BIGADD limit (see Figure 6) was set to 10000 for the
factory, factory0, factory1 and factory2 examples and to
20000 in the factory3 and factory4 examples. These limits broke up the complete action diagrams into $k = 2$ or 3 pieces, with typically 6000-10000 nodes in the first and second and under 1000 nodes in the third if it existed. In the large examples (factory2, 3 and 4), it was not possible (with 1 Gb of RAM) to generate the full complete action diagram ($k = 1$), and running times became too large when BIG-
was not fully investigated, but, along with studies of different heuristics for variable grouping, is an interesting avenue
for future exploration.
For comparison purposes, flat (unstructured) value iteration
was run on both the factory and factory0 examples. The
times taken for these problems were 895 and 4579 seconds,
respectively. For the larger problems, memory limitations
precluded completion of the flat algorithm.
In order to examine the worst-case behaviour, we tested SPUDD on a series of examples, drawn from [7, 8], in which every state has a unique value; hence, the ADD representing the value function will have a number of terminal nodes exponential in the number of state variables. The problem EXPON involves $n$ ordered propositions and $n$ actions, one for each proposition. Each action makes its corresponding proposition true, but causes all propositions lower in the order to become false. A reward is given only if all $n$ variables are true. The problem is representable in $O(n)$ space using ADDs; but the optimal policy winds through the entire state space like a binary counter.
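For concreteness, the EXPON dynamics just described can be written down as factored (deterministic) CPTs in a few lines: action $a_k$ makes proposition $k$ true, makes every lower-numbered proposition false, and leaves the rest unchanged, with a reward only when all propositions hold. The encoding below is a hypothetical sketch of that specification (with a unit reward for simplicity), not the input format used by SPUDD.

```python
# Factored specification of the EXPON problem: n propositions, n actions.
# A state is a list of n booleans; CPTs return Pr(X'_j = True | state).

def expon_mdp(n):
    """Return ({action: {var: cpt(state) -> prob}}, reward(state) -> float)."""
    cpts = {}
    for k in range(n):                       # action a_k targets proposition k
        def cpt_for(j, k=k):
            if j == k:
                return lambda state: 1.0     # made true
            if j < k:
                return lambda state: 0.0     # all lower propositions become false
            return lambda state, j=j: 1.0 if state[j] else 0.0   # unchanged
        cpts[k] = {j: cpt_for(j) for j in range(n)}
    reward = lambda state: 1.0 if all(state) else 0.0
    return cpts, reward
```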
| Example  | Ternary vars (orig / binary) | Total vars (orig / binary) | States (orig) | States (binary) | SPUDD time (s) | SPUDD internal nodes | SPUDD leaves | Equiv. tree leaves | SPI time (s) (tern / bin) | SPI internal nodes (tern / bin) | SPI leaves (tern / bin) | Ratio of tree nodes : ADD nodes (tern / bin) |
|----------|------|-------|----------|-----------|---------|------|-----|--------|---------------------|------------------|------------------|---------------|
| factory  | 3 / 0 | 14 / 17 | 55296    | 131072    | 78.0    | 828  | 147 | 8937   | 2210.6 / 2188.23    | 6721 / 9513      | 7879 / 9514      | 8.12 / 11.48  |
| factory0 | 3 / 0 | 16 / 19 | 221184   | 524288    | 111.4   | 1137 | 147 | 14888  | 5763.1 / 6238.4     | 15794 / 22611    | 18451 / 22612    | 13.89 / 19.89 |
| factory1 | 3 / 0 | 18 / 21 | 884736   | 2097132   | 279.0   | 2169 | 178 | 49558  | 14731.9 / 15430.6   | 31676 / 44304    | 37315 / 44305    | 14.60 / 20.43 |
| factory2 | 3 / 0 | 19 / 22 | 1769472  | 4194304   | 462.1   | 2169 | 178 | 49558  | 14742.4 / 15465.0   | 31676 / 44304    | 37315 / 44305    | 14.60 / 20.43 |
| factory3 | 4 / 0 | 21 / 25 | 10616832 | 33554432  | 3609.4  | 4711 | 208 | 242840 | 98340.0 / 112760.1  | 138056 / 193318  | 168207 / 193319  | 29.31 / 41.04 |
| factory4 | 4 / 0 | 24 / 28 | 63700992 | 268435456 | 14651.5 | 7431 | 238 | 707890 | -                   | -                | -                | -             |
Table 1: Results for FACTORY examples.
This problem causes worst-case behaviour for SPUDD because all $2^n$ states have different values. SPUDD was tested on the EXPON example with 6 to 12 variables, leading to state spaces with sizes ranging from 64 to 4096 states. The initial reward and the discounting factor in these examples must be scaled to accommodate the $2^n$-step lookahead for the largest problem (12 variables), and were set to a very large value and 0.99, respectively. (Since the value obtained at the state furthest from the goal is the goal reward discounted by the number of system states, since each must be visited along the way, the goal reward must be set very high to ensure that the value at this state is not practically zero.) Figure 7 compares the running times of SPUDD and (flat) value iteration plotted (in log scale) as a function of the number of variables. Running times for both algorithms exhibit exponential growth with the number of variables, as expected. (The running times are especially large due to the nature of the problem, which requires a large number of iterations of value iteration to converge.) It is not surprising that flat value iteration performs better in this type of problem since there is absolutely no structure that can be exploited by SPUDD. However, the overhead involved with creating ADDs is not overly severe, and tends to diminish as the problems grow larger. With $n = 12$, SPUDD takes less than 10 times longer than value iteration.

Figure 7: Worst-case behavior for SPUDD (computation time in seconds, log scale, versus the number of variables, for SPUDD and flat value iteration).

One can similarly construct a "best-case" series of examples, where the value function grows linearly in the number of problem variables. Specifically, the problem LINEAR involves $n$ variables and has $n+1$ distinct values. The MDP can be represented in $O(n)$ space using ADDs and the optimal value function can be represented in $O(n)$ space with an ADD (see [8] for further details). (Of course, best-case behavior for SPUDD involves a problem in which all variables are irrelevant to the value function; this problem represents a "best case" in which all variables are required in the prediction of state value.) Hence, the inherent structure of such a problem can easily be exploited. As seen in Figure 8, SPUDD clearly takes advantage of the structure in the problem, as its running time increases linearly with the number of variables, compared to an exponential
increase in running time associated with flat value iteration.
6 Concluding Remarks
In this paper, we described SPUDD, an implementation of
value iteration, for solving MDPs using ADDs. The ADD
representation captures some regularities in system dynamics, reward and value, thus yielding a simple and efficient
representation of the planning problem. By using such a
compact representation, we are able to solve certain types
of problems that cannot be dealt with using current techniques, including explicit matrix and decision tree methods.
Though the technique described in this paper has not yet been tested extensively on realistic domains, our preliminary results are encouraging.
Figure 8: Best-case behavior for SPUDD (computation time in seconds, log scale, versus the number of variables, for SPUDD and flat value iteration).
One drawback of using ADDs is the requirement that variables be boolean. Any (finite-valued) non-boolean variable can be split into a number of boolean variables, generally in a way that preserves at least some of the structure of the original problem (see above), though it often makes the new state space larger than the original. Conceptually, there is no difficulty in allowing ADDs to deal
with multi-valued variables (all algorithms and canonicity
results carry over easily). However, for domains with relatively few multi-valued variables, SPUDD does not appear
to be handicapped by the requirement of variable splitting.
At present, SPUDD uses a static user-defined variable ordering in order not to cloud the initial results with the effects of dynamic variable reordering. However, dynamic
reordering of the variables at runtime can make significant
improvements in both the space required, by finding a more
compact representation, and in the running time, by choosing more appropriate subsets of variables as discussed in
Section 4.2. The CUDD package provides a rich set of
dynamic reordering algorithms [20]. Typically, when the
ADD grows too large, variable reorderings are attempted
by following one of these algorithms, and a new ordering
is chosen which minimizes the space needed. Some of the
available techniques are slight variations of existing techniques while some others were specifically developed for
the package. It may be necessary, however, to implement a
new heuristic which takes into account the variable subsets
which influence the running time. Future work will include
more complete experimentation with automatic dynamic
reordering in SPUDD. Another extension of SPUDD would
be the implementation of other dynamic programming algorithms, such as modified policy iteration, which are generally considered to converge more quickly than value iteration in practice. Finally, we hope to explore approximation
methods within the ADD framework, such as have previously been researched in the context of decision trees [6].
Acknowledgements
Thanks to Richard Dearden for helpful comments and for
providing both his SPI code and example descriptions for
comparison purposes. St-Aubin was supported by NSERC.
Hu was supported by NSERC. Boutilier was supported by
NSERC Research Grant OGP0121843 and IRIS-III Project
“Dealing with Actions.”
References

[1] R. Iris Bahar, E. A. Frohm, C. M. Gaona, G. D. Hachtel, E.
Macii, A. Pardo, and F. Somenzi. Algebraic decision diagrams and their applications. Intl. Conf. Computer-Aided
Design, 188–191, IEEE, 1993.
[2] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, 1957.
[3] D. P. Bertsekas and D. A. Castanon. Adaptive aggregation
for infinite horizon dynamic programming. IEEE Trans. Aut.
Cont., 34:589–598, 1989.
[4] C. Boutilier. Correlated action effects in decision theoretic
regression. Proc. UAI-97, pp.30–37, Providence, RI, 1997.
[5] C. Boutilier, T. Dean, and S. Hanks. Decision theoretic planning: Structural assumptions and computational leverage. J.
Artif. Intel. Research, 1999. To appear.
[6] C. Boutilier and R. Dearden. Approximating value trees in
structured dynamic programming. Proc. Intl. Conf. Machine
Learning, pp.54–62, Bari, Italy, 1996.
[7] C. Boutilier, R. Dearden, and M. Goldszmidt. Exploiting
structure in policy construction. Proc. IJCAI-95, pp.1104–
1111, Montreal, 1995.
[8] C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic dynamic programming with factored representations.
manuscript, 1999.
[9] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller.
Context-specific independence in Bayesian networks. Proc.
UAI-96, pp.115–123, Portland, OR, 1996.
[10] R. E. Bryant. Graph-based algorithms for boolean function
manipulation. IEEE Trans. Comp., C-35(8):677–691, 1986.
[11] A. Cimatti, M. Roveri, and P. Traverso. Automatic OBDD-based generation of universal plans in non-deterministic domains. Proc. AAAI-98, pp.875–881, 1998.
[12] T. Dean and R. Givan. Model minimization in Markov decision processes. Proc. AAAI-97, pp.106–111, Providence,
1997.
[13] T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Comp. Intel., 5(3):142–150, 1989.
[14] R. Dearden and C. Boutilier. Abstraction and approximate
decision theoretic planning. Artif. Intel., 89:219–283, 1997.
[15] R. Dechter. Topological parameters for time-space tradeoff.
Proc. UAI-96, pp.220–227, Portland, OR, 1996.
[16] S. Hanks and D. V. McDermott. Modeling a dynamic and uncertain world I: Symbolic and probabilistic reasoning about
change. Artif. Intel., 1994.
[17] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, 1988.
[18] D. Poole. Exploiting the rule structure for decision making
within the independent choice logic. Proc. UAI-95, pp.454–
463, Montreal, 1995.
[19] M. L. Puterman. Markov Decision Processes: Discrete
Stochastic Dynamic Programming. Wiley, New York, NY.,
1994.
[20] F. Somenzi. CUDD: CU decision diagram package. Available from ftp://vlsi.colorado.edu/pub/, 1998.