
SPUDD: Stochastic Planning using Decision Diagrams

1999, Proceedings of the Fifteenth …


Jesse Hoey, Robert St-Aubin, Alan Hu, Craig Boutilier
Department of Computer Science, University of British Columbia, Vancouver, BC, V6T 1Z4, CANADA
{jhoey, staubin, ajh, cebly}@cs.ubc.ca

Abstract

Structured methods for solving factored Markov decision processes (MDPs) with large state spaces have recently been proposed to allow dynamic programming to be applied without the need for complete state enumeration. We propose and examine a new value iteration algorithm for MDPs that uses algebraic decision diagrams (ADDs) to represent value functions and policies, assuming an ADD input representation of the MDP. Dynamic programming is implemented via ADD manipulation. We demonstrate our method on a class of large MDPs (up to 63 million states) and show that significant gains can be had when compared to tree-structured representations (with up to a thirty-fold reduction in the number of nodes required to represent optimal value functions).

1 Introduction

Markov decision processes (MDPs) have become the semantic model of choice for decision-theoretic planning (DTP) in the AI planning community. While classical computational methods for solving MDPs, such as value iteration and policy iteration [19], are often effective for small problems, typical AI planning problems fall prey to Bellman's curse of dimensionality: the size of the state space grows exponentially with the number of domain features. Thus, classical dynamic programming, which requires explicit enumeration of the state space, is typically infeasible for feature-based planning problems.

Considerable effort has been devoted to developing representational and computational methods for MDPs that obviate the need to enumerate the state space [5]. Aggregation methods do this by aggregating sets of states and treating the states within any aggregate as if they were identical [3]. Within AI, abstraction techniques have been widely studied as a form of aggregation, where states are (implicitly) grouped by ignoring certain problem variables [14, 7, 12]. These methods automatically generate abstract MDPs by exploiting structured representations, such as probabilistic STRIPS rules [16] or dynamic Bayesian network (DBN) representations of actions [13, 7].

In this paper, we describe a dynamic abstraction method for solving MDPs using algebraic decision diagrams (ADDs) [1] to represent value functions and policies. ADDs are generalizations of ordered binary decision diagrams (BDDs) [10] that allow non-boolean labels at terminal nodes. This representational technique allows one to describe a value function (or policy) as a function of the variables describing the domain rather than in the classical "tabular" way. The decision graph used to represent this function is often extremely compact, implicitly grouping together states that agree on value at different points in the dynamic programming computation. As such, the number of expected value computations and maximizations required by dynamic programming is greatly reduced.

The algorithm described here derives from the structured policy iteration (SPI) algorithm of [7, 6, 4], where decision trees are used to represent value functions and policies. Given a DBN action representation (with decision trees used to represent conditional probability tables) and a decision tree representation of the reward function, SPI constructs value functions that preserve much of the DBN structure.
Unfortunately, decision trees cannot compactly represent certain types of value functions, especially those that involve disjunctive value assessments. For instance, if a disjunctive proposition over three variables describes a group of states that have a specific value, a decision tree must duplicate that value three times (and in SPI the value is computed three times). Furthermore, if the proposition describes not a single value, but rather identical subtrees involving other variables, the entire subtrees must be duplicated. Decision graphs offer the advantage that identical subtrees can be merged into one. As we demonstrate in this paper, this offers considerable computational advantages in certain natural classes of problems. In addition, highly optimized ADD manipulation software can be used in the implementation of value iteration.

The remainder of the paper is organized as follows. We provide a cursory review of MDPs and value iteration in Section 2. In Section 3, we review ADDs and describe our ADD representation of MDPs. In Section 4, we describe a conceptually straightforward version of SPUDD, a value iteration algorithm that uses an ADD value function representation, and describe the key differences with the SPI algorithm. We also describe several optimizations that reduce both the time and memory requirements of SPUDD. Empirical results on a class of process planning examples are described in Section 5. We are able to solve some very large MDPs exactly (up to 63 million states) and we show that the ADD value function representation is considerably smaller than the corresponding decision tree in most instances. This illustrates that natural problems often have the type of disjunctive structure that can be exploited by decision graph representations. We conclude in Section 6 with a discussion of future work in using ADDs for DTP.

2 Markov Decision Processes

We assume that the domain of interest can be modeled as a fully-observable MDP [2, 19] with a finite set of states S and actions A. Actions induce stochastic state transitions, with Pr(s, a, t) denoting the probability with which state t is reached when action a is executed at state s. We also assume a real-valued reward function R, associating with each state s its immediate utility R(s). (We ignore action costs for ease of exposition; they impose no serious complications.) A stationary policy \pi : S \to A describes a particular course of action to be adopted by an agent, with \pi(s) denoting the action to be taken in state s. We assume that the agent acts indefinitely (an infinite horizon).

We compare different policies by adopting an expected total discounted reward as our optimality criterion, wherein future rewards are discounted at a rate 0 \le \beta < 1, and the value of a policy is given by the expected total discounted reward accrued. The expected value V_\pi(s) of a policy \pi at a given state s satisfies [19]:

    V_\pi(s) = R(s) + \beta \sum_{t \in S} \Pr(s, \pi(s), t) V_\pi(t)    (1)

A policy \pi is optimal if V_\pi(s) \ge V_{\pi'}(s) for all s \in S and policies \pi'. The optimal value function V^* is the value of any optimal policy.

Value iteration [2] is a simple iterative approximation algorithm for constructing optimal policies. It proceeds by constructing a series of n-stage-to-go value functions V^n. Setting V^0 = R, we define

    V^{n+1}(s) = R(s) + \max_{a \in A} \Big\{ \beta \sum_{t \in S} \Pr(s, a, t) V^n(t) \Big\}    (2)

The sequence of value functions V^n produced by value iteration converges linearly to the optimal value function V^*. For some finite n, the actions that maximize Equation 2 form an optimal policy, and V^n approximates its value. A commonly used stopping criterion specifies termination of the iteration procedure when

    \| V^{n+1} - V^n \| \le \frac{\varepsilon (1 - \beta)}{2 \beta}    (3)

where \|X\| = \max_{s \in S} |X(s)| denotes the supremum norm. This ensures that the resulting value function V^{n+1} is within \varepsilon/2 of the optimal function V^* at any state, and that the resulting policy is \varepsilon-optimal [19].
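The short sketch below (our own illustration, not the implementation described later in this paper) shows flat value iteration with the stopping criterion of Equation 3 on an explicitly enumerated MDP; the function and variable names are hypothetical.

    import numpy as np

    def value_iteration(R, P, beta=0.9, eps=1e-3):
        """Flat value iteration over an enumerated state space.

        R : array of shape (|S|,)          -- immediate rewards R(s)
        P : array of shape (|A|, |S|, |S|) -- P[a, s, t] = Pr(s, a, t)
        Returns an approximately optimal value function and a greedy policy.
        """
        V = R.copy()                                   # V^0 = R
        while True:
            Q = R[None, :] + beta * (P @ V)            # Q[a, s] = R(s) + beta * sum_t Pr(s,a,t) V(t)
            V_new = Q.max(axis=0)                      # Equation 2
            if np.max(np.abs(V_new - V)) <= eps * (1 - beta) / (2 * beta):   # Equation 3
                return V_new, Q.argmax(axis=0)
            V = V_new

The rest of the paper is concerned with performing exactly this computation without ever enumerating the state space explicitly.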
3 ADDs and MDPs

Algebraic decision diagrams (ADDs) [1] are a generalization of BDDs [10], a compact, efficiently manipulable data structure for representing boolean functions. These data structures have been used extensively in the VLSI CAD field and have enabled the solution of much larger problems than was previously possible. In this section, we describe these data structures and the basic operations on them, and show how they can be used for MDP representation.

3.1 Algebraic Decision Diagrams

A BDD represents a function B^n \to B from n boolean variables to a boolean result. Bryant [10] introduced the BDD in its current form, although the general ideas have been around for quite some time (e.g., as branching programs in the theoretical computer science literature). (We are describing the most common variety of BDD; numerous variations exist in the literature.) Conceptually, we can construct the BDD for a boolean function as follows. First, build a decision tree for the desired function, obeying the restrictions that along any path from root to leaf no variable appears more than once, and that along every path from root to leaf the variables always appear in the same order. Next, apply the following two reduction rules as much as possible: (1) merge any duplicate (same label and same children) nodes; and (2) if both child pointers of a node point to the same child, delete the node because it is redundant (with the parents of the node now pointing directly to the child of the node). The resulting directed acyclic graph is the BDD for the function. In practice, BDDs are generated and manipulated in this fully-reduced form, without ever building the decision tree.

ADDs generalize BDDs to represent real-valued functions B^n \to R; thus, in an ADD, we have multiple terminal nodes labeled with numeric values. More formally, an ADD denotes a function as follows:

1. The function of a terminal node is the constant function f \equiv c, where c is the number labelling the terminal node.

2. The function of a nonterminal node labeled with boolean variable X_i is given by

    f(x_1, \ldots, x_n) = x_i \cdot f_{then}(x_1, \ldots, x_n) + (1 - x_i) \cdot f_{else}(x_1, \ldots, x_n)

where boolean values are viewed as 0 and 1, and f_{then} and f_{else} are the functions of the ADDs rooted at the then and else children of the node.

BDDs and ADDs have several useful properties. First, for a given variable ordering, each distinct function has a unique reduced representation. In addition, many common functions can be represented compactly because of isomorphic-subgraph sharing. Furthermore, efficient algorithms (e.g., depth-first search with a hash table to reuse previously computed results) exist for most common operations, such as addition, multiplication, and maximization. For example, Figure 1 shows a computation of the maximum of two ADDs. Finally, because BDDs and ADDs have been used extensively in other domains, very efficient implementations are readily available. As we will see, these properties make ADDs an ideal candidate to represent structured value functions in MDP solution algorithms.

Figure 1: Simple ADD maximization example: h(z, x, y) = MAX(f(z, x), g(z, y)).
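As a concrete, if simplified, illustration of these ideas, the sketch below is a toy ADD with hash-consed nodes and a memoized pointwise combination operator; it is our own construction with illustrative names, not the CUDD package used in the implementation described later.

    class ADD:
        """A minimal reduced ADD over boolean variables 0..n-1 under a fixed ordering."""
        _table = {}   # hash-consing table: one shared node per distinct (var, then, else) triple

        def __init__(self, var, hi, lo, value=None):
            self.var, self.hi, self.lo, self.value = var, hi, lo, value

        @classmethod
        def terminal(cls, value):
            key = ('terminal', value)
            if key not in cls._table:
                cls._table[key] = cls(None, None, None, value)
            return cls._table[key]

        @classmethod
        def node(cls, var, hi, lo):
            if hi is lo:                       # reduction rule 2: the test is redundant
                return hi
            key = (var, id(hi), id(lo))        # reduction rule 1: share structurally identical nodes
            if key not in cls._table:
                cls._table[key] = cls(var, hi, lo)
            return cls._table[key]

    def apply(f, g, op, memo=None):
        """Combine two ADDs pointwise with `op` (e.g. addition, multiplication, max)."""
        if memo is None:
            memo = {}
        key = (id(f), id(g))
        if key in memo:
            return memo[key]
        if f.var is None and g.var is None:    # both terminals: apply op to the numeric labels
            result = ADD.terminal(op(f.value, g.value))
        else:
            # branch on the earliest variable appearing at the root of f or g
            v = min(x.var for x in (f, g) if x.var is not None)
            f_hi, f_lo = (f.hi, f.lo) if f.var == v else (f, f)
            g_hi, g_lo = (g.hi, g.lo) if g.var == v else (g, g)
            result = ADD.node(v, apply(f_hi, g_hi, op, memo), apply(f_lo, g_lo, op, memo))
        memo[key] = result
        return result

    # A pointwise maximum in the spirit of Figure 1 (illustrative values only):
    f = ADD.node(0, ADD.terminal(5.0), ADD.terminal(0.5))   # 5.0 if X0 else 0.5
    g = ADD.node(1, ADD.terminal(7.0), ADD.terminal(0.0))   # 7.0 if X1 else 0.0
    h = apply(f, g, max)

The hash-consing table is what gives canonicity and subgraph sharing; the memo table is the hash table mentioned above that keeps the combination operator efficient.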
3.2 ADD Representation of MDPs

We assume that the MDP state space is characterized by a set of variables X = {X_1, ..., X_n}. Values of variable X_i will be denoted in lowercase (e.g., x_i). We assume each X_i is boolean, as required by the ADD formalism, though we discuss multi-valued variables in Section 5. Actions are often most naturally described as having an effect on specific variables under certain conditions, implicitly inducing state transitions. DBN action representations [13, 7] exploit this fact, specifying a local distribution over each variable describing the (probabilistic) impact an action has on that variable. A DBN for action a requires two sets of variables, one set X = {X_1, ..., X_n} referring to the state of the system before action a has been executed, and X' = {X_1', ..., X_n'} denoting the state after a has been executed. Directed arcs from variables in X to variables in X' indicate direct causal influence and have the usual semantics [17, 13]. (We ignore the possibility of arcs among post-action variables, disallowing correlations in action effects; see [4] for a treatment of dynamic programming when such correlations exist.) The conditional probability table (CPT) for each post-action variable X_i' defines a conditional distribution Pr^a_{X_i'} over X_i' (i.e., a's effect on X_i) for each instantiation of its parents. This can be viewed as a function Pr^a_{X_i'}(X_1, ..., X_n), where the function value (distribution) depends only on those X_j that are parents of X_i'. No quantification is provided for the pre-action variables: since the process is fully observable, we need only use the DBN to predict state transitions. We require one DBN for each action a in A.

In order to illustrate our representation and algorithm, we introduce a simple adaptation of a process planning problem taken from [14]. The example involves a factory agent which has the task of connecting two objects, A and B. Figure 2(a) illustrates our representation for the action bolt, where the two parts are bolted together. We see that whether the parts are successfully connected, C', depends on a number of factors, but is independent of the state of variable P (painted). In contrast, whether part A is punched, APU', after bolting depends only on whether it was punched before bolting.

Figure 2: Small FACTORY example: (a) action network for action bolt; (b) ADD representation of CPTs (action diagrams); and (c) immediate reward network and ADD representation of the reward table.

Rather than the standard, locally exponential, tabular representation of CPTs, we use ADDs to capture the regularities in the CPTs (i.e., to represent the functions Pr^a_{X_i'}(X_1, ..., X_n)). This type of representation exploits context-specific independence in the distributions [9], and is related to the use of tree representations [7] and rule representations [18] of CPTs in DBNs. Figure 2(b) illustrates the ADD representation of the CPTs for two variables, C' and APU'. While the distribution over C' is a function of its seven parent variables, this function exhibits considerable regularity, readily apparent by inspection of the table, which is exploited by the ADD. Specifically, the distribution over C' is given by a simple sum-of-products expression over its parents, multiplied by 0.9: the parts become connected with probability 0.9 when they are already connected, or when bolts are available and the two parts have been suitably prepared (e.g., both punched or both drilled), and with probability 0 otherwise (we ignore the zero entries). Similarly, the ADD for APU' corresponds to

    Pr^{bolt}_{APU'}(APU) = APU \cdot 1.0

Reward functions can be represented similarly. Figure 2(c) shows the ADD representation of the reward function for this simple example: the agent is rewarded with 10 if the two objects are connected and painted, with a smaller reward of 5 when the two objects are connected but not painted, and is given no reward when the parts are not connected. The reward function is simply

    R(C, P) = C \cdot P \cdot 10.0 + C \cdot \bar{P} \cdot 5.0
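To make this kind of regularity concrete, the following small sketch (ours; the CPT expression used is a hypothetical stand-in, not the exact formula behind Figure 2) enumerates a CPT defined by a sum-of-products condition over seven boolean parents and counts its distinct probability values, which is what an ADD shares as terminal nodes.

    from itertools import product

    def cpt_row(C, PL, APU, BPU, ADR, BDR, BO):
        # Hypothetical stand-in for Pr(C'=true | parents): 0.9 if the parts are
        # already connected, or if bolts are available and both parts are punched
        # or both drilled; 0.0 otherwise.
        return 0.9 if (C or (BO and ((APU and BPU) or (ADR and BDR)))) else 0.0

    rows = [cpt_row(*bits) for bits in product([True, False], repeat=7)]
    print(len(rows), "table rows, but only", len(set(rows)), "distinct values")   # 128 rows, 2 values

A tabular CPT stores one entry per row; a decision graph only needs one terminal per distinct value plus the internal nodes required to discriminate among them.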
This example action illustrates the type of structure that can be exploited by an ADD representation. Specifically, the CPT for C' clearly exhibits disjunctive structure, where a variety of distinct conditions each give rise to a specific probability of successfully connecting the two parts. While this ADD has seven internal nodes and two leaves, a tree representation of the same CPT requires 11 internal nodes and 12 leaves. As we will see, this additional structure can be exploited in value iteration. Note also that the standard matrix representation of the CPT requires 128 parameters.

ADDs are often much more compact than trees when representing functions, but this is not always the case. The ordering requirement on ADDs means that certain functions can require an exponentially larger ADD representation than a well-chosen tree; similarly, ADDs can be exponentially smaller than decision trees. Our initial results suggest that such pathological examples are unlikely to arise in most problem domains (see Section 5), and that ADDs offer an advantage over decision trees.

4 Value Iteration using ADDs

In this section, we present an algorithm for optimal policy construction that avoids the explicit enumeration of the state space. SPUDD (stochastic planning using decision diagrams) implements classical value iteration, but uses ADDs to represent value functions and CPTs. It exploits the regularities in the action and reward networks, made explicit by the ADD representation described in the previous section, to discover regularities in the value functions it constructs. This often yields substantial savings in both space and computation time. We first introduce the algorithm in a conceptually clear way, and then describe certain optimizations.

OBDDs have been explored in previous work in AI planning [11], where universal plans (much like policies) are generated for nondeterministic domains. The motivation in that work, avoiding the combinatorial explosion associated with state space enumeration, is similar to ours; but the details of the algorithms, and of how the representation is used to represent planning domains, are quite different.
4.1 The Basic SPUDD Algorithm

The SPUDD algorithm, shown in Figure 3, implements a form of value iteration, producing a sequence of value functions V^0, V^1, ... until the termination condition is met. Each n-stage-to-go value function is represented as an ADD denoted V^n(X_1, ..., X_n). Since V^0 = R, the first value function has an obvious ADD representation. The key insight underlying SPUDD is to exploit the ADD structure of V^n and of the MDP representation itself to discover the appropriate ADD structure for V^{n+1}. Expected value calculations and maximizations are then performed at each terminal node of the new ADD rather than at each state.

1. Set V^0 = R, where R is the immediate reward diagram.
2. Create the dual action diagrams P^a_{X_i'}(X_i', X_1, ..., X_n) for each variable X_i' and each action a in A.
3. Repeat until \| V^{n+1} - V^n \| \le \varepsilon(1 - \beta)/2\beta:
   (a) Swap all variables in V^n with their primed versions to create V^{n'}.
   (b) For each action a in A:
       Set temp = V^{n'}.
       For each primed variable X_i' in temp:
           temp = temp \cdot P^a_{X_i'};
           set temp to the sum of the sub-diagrams of temp over the primed variable X_i'.
       Multiply the result by the discounting factor \beta and add R to obtain V_a^{n+1}.
   (c) Maximize over all V_a^{n+1} to create V^{n+1}.
   (d) Increment n.
4. Perform one more iteration and assign to each terminal node the actions which contributed the value in the value ADD at that node; this yields the \varepsilon-optimal policy ADD, \pi^*. Terminal nodes whose value is achieved by multiple actions are assigned all such actions in \pi^*.
5. Return the value diagram V^{n+1} and the optimal policy \pi^*.

Figure 3: SPUDD algorithm.

Given an ADD for V^n, Step 3 of SPUDD produces V^{n+1}. When computing V^{n+1}, the function V^n is viewed as representing values at future states, after a suitable action has been performed with n+1 stages remaining. So variables in V^n are first replaced by their primed, or post-action, counterparts (Step 3(a)), referring to the state with n stages-to-go; this prevents them from being confused with unprimed variables that refer to the state with n+1 stages-to-go. Figure 4(a) shows the zero-stage-to-go primed value diagram, V^{0'}, for our simple example.

For each action a, we then compute an ADD representation of the function V_a^{n+1}, denoting the expected value of performing action a with n+1 stages to go given that V^n dictates n-stage-to-go value. This requires several steps, described below. First, we note that the ADD-represented functions Pr^a_{X_i'}, taken from the action network for a, give the (conditional) probabilities that variables X_i' are made true by action a. To fit within the ADD framework, we introduce the negative action diagrams

    \overline{Pr}^a_{X_i'}(X_1, ..., X_n) = 1 - Pr^a_{X_i'}(X_1, ..., X_n)

which give the probability that a will make X_i' false. We then define the dual action diagram P^a_{X_i'} as the ADD rooted at X_i', whose true branch is the action diagram Pr^a_{X_i'} and whose false branch is the negative action diagram \overline{Pr}^a_{X_i'}:

    P^a_{X_i'}(X_i', X_1, ..., X_n) = X_i' \cdot Pr^a_{X_i'}(X_1, ..., X_n) + \bar{X_i'} \cdot \overline{Pr}^a_{X_i'}(X_1, ..., X_n)    (4)

Intuitively, P^a_{X_i'}(x_i', x_1, ..., x_n) denotes Pr(X_i' = x_i' | X_1 = x_1, ..., X_n = x_n) under action a. Figure 4(a) shows the dual action diagram for the variable C' from the example in Figure 2(b).

In order to generate V_a^{n+1}, we must, for each state s, combine the n-stage-to-go value of each state t with the probability of reaching t from s. We do this by multiplying, in turn, the dual action diagrams for each variable by V^{n'}, and then eliminating each primed variable by summing over its values in the resulting ADD.
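As an illustration of Equation 4 (our own table-based sketch with made-up numbers, using an array as a stand-in for the ADD), a dual action diagram can be viewed as a factor over X_i' and the state variables:

    import numpy as np

    # Hypothetical CPT for one post-action variable X': Pr(X'=true | X1, X2),
    # stored as an array indexed by (x1, x2) with 0 = false, 1 = true.
    pr_true = np.array([[0.0, 0.3],
                        [0.5, 0.9]])

    # Dual action diagram (Equation 4): a factor over (x', x1, x2) giving Pr(X'=x' | x1, x2).
    # Index 1 of the leading axis corresponds to x' = true.
    dual = np.stack([1.0 - pr_true, pr_true])      # shape (2, 2, 2)

    assert np.allclose(dual.sum(axis=0), 1.0)      # a proper distribution over X' for every parent setting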
More precisely, by multiplying P^a_{X_i'} by V^{n'}, we obtain a function

    g(x_1', ..., x_n', x_1, ..., x_n) = V^n(x_1', ..., x_n') \cdot \Pr(X_i' = x_i' \mid x_1, ..., x_n)

(assuming transitions induced by action a). This intermediate calculation is illustrated in Figure 4(b), where the dual diagram for variable C' is the first to be multiplied by V^{0'}. Note that C' lies at the root of this ADD. Once this function g is obtained, we can eliminate the dependence of future value on the specific value of X_i' by taking an expectation over both of its truth values. This is done by summing the left and right subgraphs of the ADD for g, leaving us with the function

    h(x_1', ..., x_{i-1}', x_{i+1}', ..., x_n', x_1, ..., x_n) = \sum_{x_i'} V^n(x_1', ..., x_n') \cdot \Pr(X_i' = x_i' \mid x_1, ..., x_n)

This is illustrated in Figure 4(c), where the variable C' is eliminated. The resulting ADD denotes the expected future value (or 0-stage-to-go value) as a function of the parents of C' with 1 stage-to-go, and of all post-action variables except C' with 0 stages-to-go.

Figure 4: First Bellman backup of value iteration using ADDs: (a) the 0-stage-to-go primed value diagram and the dual action diagram for variable C', P^{bolt}_{C'}; (b) intermediate result after multiplying V^{0'} with P^{bolt}_{C'}; (c) intermediate result after quantifying over C'.

This process is repeated for each post-action variable X_i' that occurs in the ADD for V^{n'}: we first multiply P^a_{X_i'} into the intermediate value ADD, then eliminate that variable by taking an expectation over its values. Once all primed variables have been eliminated, we are left with the function

    \sum_{x_1', ..., x_n'} V^n(x_1', ..., x_n') \cdot \Pr(X_1' = x_1' \mid x_1, ..., x_n) \cdots \Pr(X_n' = x_n' \mid x_1, ..., x_n)

of the unprimed variables alone. By the independence assumptions embodied in the action network, this is precisely the expected future value of performing action a. By multiplying this function by the discounting factor and adding the reward ADD R, we obtain an ADD representation of V_a^{n+1}. Figure 5 shows the result for our simple example: the remaining primed variable P' in Figure 4(c) has been removed, producing V^1_{bolt} using a discount factor of 0.9.

Figure 5: Resulting 1-stage-to-go value diagram for action bolt, V^1_{bolt}.

Finally, we take the maximum over all actions a to produce the V^{n+1} diagram. Given ADDs for each V_a^{n+1}, this simply requires constructing the ADD representing \max_{a \in A} V_a^{n+1}.

The stopping criterion in Equation 3 is implemented by comparing each pair of successive ADDs, V^{n+1} and V^n. Once the value function has converged, the \varepsilon-optimal policy, or policy ADD, is extracted by performing one further dynamic programming backup and assigning to each terminal node the actions which produced the maximizing value. Since each terminal node represents some set of states S', the actions thus determined are each optimal for any s in S'.
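The following sketch (ours, using plain numpy arrays rather than ADDs; the tiny two-variable domain and all numbers are invented for illustration) mirrors Step 3(b): the primed value function is multiplied by one dual action diagram at a time, the corresponding primed variable is summed out, and the result is discounted and added to the reward.

    import numpy as np

    # A tiny invented domain with two boolean state variables X1, X2.
    p1 = np.array([[0.1, 0.8], [0.2, 0.9]])        # Pr(X1'=true | X1, X2), made-up numbers
    p2 = np.array([[0.0, 0.0], [0.5, 1.0]])        # Pr(X2'=true | X1, X2)
    dual1 = np.stack([1 - p1, p1])                 # dual diagram for X1': axes (X1', X1, X2)
    dual2 = np.stack([1 - p2, p2])                 # dual diagram for X2': axes (X2', X1, X2)

    R = np.array([[0.0, 0.0], [5.0, 10.0]])        # reward over (X1, X2)
    beta = 0.9

    V = R.copy()                                   # V^0 = R
    # Step 3(a): "priming" is a relabelling; V's axes are now read as (X1', X2').

    # Step 3(b): multiply in one dual diagram at a time and sum out that primed variable.
    temp = np.einsum('pab,pq->qab', dual1, V)      # eliminate X1'; axes become (X2', X1, X2)
    temp = np.einsum('qab,qab->ab', dual2, temp)   # eliminate X2'; axes become (X1, X2)
    V_a = R + beta * temp                          # V_a^1 for this single action

    # Step 3(c) would take a pointwise maximum of the V_a arrays over all actions.

With ADDs in place of dense arrays, the same algebra touches only the distinct terminal values rather than every state.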
4.2 Optimizations

The algorithm as described in the last section, and as shown in Figure 3, suffers from certain practical difficulties which make it necessary to introduce various optimizations in order to improve efficiency with respect to both space and time. The problems arise in Step 3(b), when V^{n'} is multiplied by the dual action diagrams P^a_{X_i'}. Since there are potentially n primed variables in the ADD for V^{n'} and n unprimed variables in the ADD for P^a_{X_i'}, there is an intermediate step in which a diagram is created with (potentially) up to 2n variables. Although this will not be the case in general, it was deemed necessary to modify the method in order to deal with the possibility of this problem arising. Furthermore, a large computational overhead is introduced by re-calculating the joint probability distributions over the primed variables at each iteration. In this section, we first discuss optimizations for dealing with space, followed by a method for optimizing computation time.

The increase in diagram size during Step 3(b) of the algorithm can be countered by approaching the multiplications and sums slightly differently. Instead of blindly multiplying V^{n'} by the dual action diagram for the variable at its root, we can traverse the ADD for V^{n'} down to the level of the last variable in the ADD ordering, and then multiply and sum the sub-diagrams rooted at this variable by the corresponding dual diagram. This process only removes the dependency of V^{n'} on a primed variable for a given branch, and therefore only introduces a single diagram of n unprimed variables at a leaf node of V^{n'}. By recursively carrying out this procedure using the structure of the ADD for V^{n'}, the intermediate stages never grow too large. Essentially, the additional unprimed variables are introduced only at specific points in the ADD and the corresponding primed variable is immediately eliminated; this is much like the tree-structured dynamic programming algorithm of [7].

Unfortunately, this method requires a great deal of unnecessary, repeated computation. Since the action diagrams for a given problem do not change during the generation of a policy, the joint probability distribution Pr(s, a, t) from Equation 2 could be pre-computed. In our case, this means we could take the product of all dual action diagrams for a given action a, as shown in Equation 5 below, prior to a specific value iteration. We refer to this product diagram, P^a, as the complete action diagram for action a:

    P^a(X_1', ..., X_n', X_1, ..., X_n) = \prod_{i=1}^{n} P^a_{X_i'}(X_i', X_1, ..., X_n)    (5)

The resulting function P^a provides a representation of the state transition probabilities for action a. This explicit P^a function could then be multiplied by V^{n'} during Step 3 of the algorithm, and the primed variables then eliminated. Although this may lead to a substantial savings in computation time, it will again generate diagrams with up to 2n variables.

As a compromise, we implemented a method in which the space-time trade-off can be addressed explicitly. A "tuning knob" enables the user to find a middle ground between the two methods mentioned above. We accomplish this by pre-computing only subsets of the complete action diagram; that is, we break the large diagram up into a few smaller pieces. The set of variables {X_1', ..., X_n'} is divided into m subsets, preserving the total ordering (e.g., {X_1', ..., X_j'}, {X_{j+1}', ..., X_k'}, ..., {X_l', ..., X_n'}), and the complete action diagrams for each subset are pre-computed (e.g., P^a(X_{j+1}', ..., X_k', X_1, ..., X_n)). Step 3(b) of the algorithm must be modified as shown in Figure 6, and a sketch of the pre-computation appears below.
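The sketch below (ours) shows only the grouping step of this pre-computation, with the diagram type, multiplication, and size measure left abstract; the BIGADD name follows the size limit mentioned in Section 5, and the exact grouping policy is a simplification of what is described there.

    def precompute_partial_diagrams(duals, multiply, size, limit):
        """Group per-variable dual action diagrams (Equation 4) into partial
        complete action diagrams (Equation 5), starting a new group whenever the
        running product grows past `limit`.

        duals    : list of dual action diagrams, in variable order
        multiply : function combining two diagrams pointwise
        size     : function returning a diagram's size (e.g., internal node count)
        limit    : the user-specified BIGADD-style size limit
        """
        groups, current = [], None
        for dual in duals:
            candidate = dual if current is None else multiply(current, dual)
            if current is not None and size(candidate) > limit:
                groups.append(current)     # close the current group and start a new one
                current = dual
            else:
                current = candidate
        if current is not None:
            groups.append(current)
        return groups

    # e.g. groups = precompute_partial_diagrams(duals, my_add_multiply, my_node_count, 10000)
    # (my_add_multiply and my_node_count are hypothetical hooks into an ADD library)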
The primed value diagram V^{n'} is traversed down to the top of the second subset of variables, and the procedure is carried out recursively on each sub-diagram rooted there. When a level is reached below which no further variables of the current subset appear, the sub-diagram of V^{n'} rooted at that point is multiplied by the corresponding subset of the complete action diagram, P^a(X_1', ..., X_j', X_1, ..., X_n), and summed over the primed variables of that subset. In this way, the diagrams are kept small by making sure that enough elimination occurs to balance the effects of multiplying by complete action diagrams.

Figure 6: Modified SPUDD algorithm. Step 3(b) is replaced by a recursive procedure, pRew, that descends the primed value diagram and, at the appropriate level, multiplies each sub-diagram by the pre-computed partial complete action diagram for the corresponding subset of variables and sums out those primed variables.

The space and time requirements can then be controlled by the number of subsets the complete action diagrams are broken into. In theory, the more subsets, the smaller the space requirements and the larger the time requirements. Although we have been able to produce substantial changes in the space and time requirements of the algorithm using this tuning knob, its effects are still unclear. At present, we choose the m subsets of variables by simply building the complete action diagrams according to some variable ordering until they reach a user-defined size limit, at which point we start on the next subset. We note that this space-time tradeoff bears some resemblance to the space-time tradeoffs that arise in probabilistic inference algorithms like variable elimination [15].

Although we have not implemented heuristics for variable ordering, there are some simple ordering methods that could improve space efficiency. For instance, if we order variables so that primed variables with many shared parents are eliminated together, the number of unprimed variables introduced will be kept small relative to the number of primed variables eliminated. More importantly, we must develop more refined heuristics that keep the ADDs small, rather than simply minimizing the number of variables introduced.

This revised procedure (Figure 6) has a small inefficiency, as our results in the next section will show. Since we are pre-computing subsets of the complete action diagrams, any variables which are included in the domain but are not relevant to its solution will be included in these pre-computed diagrams. This will increase the size of the intermediate representations and will add overhead in computation time. It is important to be able to discard such variables, and to compute the policy only over variables that are relevant to the value function and policy [7].
A possible way to deal with these types of variables in our algorithm would be to build the complete action diagrams progressively during the iterative procedure. In this way, only the variables relevant to the domain would be added.

5 Data and Results

The procedure described above was implemented using the CUDD package [20], a library of C routines which provides support for the manipulation of ADDs. Experimental results described in this section were all obtained using a dual-processor Sun SPARC Ultra 60 running at 300 MHz with 1 GB of RAM, with only a single processor being used. The SPUDD algorithm was tested on three different types of examples, each type having MDP instances with different numbers of variables, and hence a wide variety of state space sizes. The first example class consists of various adaptations of a process planning problem taken from [14]. The second and third example classes consist of synthetic problems taken from [7, 8], designed to test best- and worst-case behavior of SPUDD. (Data for these problems can be found at the Web page www.cs.ubc.ca/spider/staubin/Spudd/index.html.)

The process planning problems from [14] involve a factory agent which must paint two objects and connect them. The objects must be smoothed, shaped, and polished, and possibly drilled, before painting, each of which actions requires a number of tools which are possibly available. Various painting and connection methods are represented, each having an effect on the quality of the job, and each requiring tools. The final product is rewarded according to the kind of quality that is needed. Rewards range from 0 to 10, and a discounting factor of 0.9 was used throughout. The examples used here, unlike the one described in Section 3, were not designed with any structure in mind that could be taken advantage of by an ADD representation.

In the original problem specification, three ternary variables were used to represent the painting quality of each object (good, poor, or false) and the connection quality (good, bad, or false). However, as discussed above, ADDs can only represent binary variables, so each ternary variable was expanded into two binary ones. For example, the variable connected, describing the type of connection between the two objects, was represented by the boolean variables connected and connected-well. This expansion enlarges the state space by a factor of 4/3 for each ternary variable so expanded (by introducing unreachable states).

A number of FACTORY examples were devised, with state space sizes ranging from 55 thousand to 268 million. Optimal policies were generated using SPUDD and a structured policy iteration (SPI) implementation for comparison purposes [7]. Results, displayed in Table 1, are presented for SPUDD running on six FACTORY examples, and for SPI running on five. SPI was not run on the factory4 example, because its estimated time and space requirements exceeded available capacity. SPI implements modified policy iteration using trees to represent CPTs and intermediate value and policy functions. SPI, however, does allow multi-valued variables, so versions of each example were tested in SPI using both ternary variables and their binary expansion. Table 1 shows the number of ternary variables in each example, along with the total number of variables. The state space sizes of each FACTORY example are shown for both the original and the binary-expansion formulations. SPUDD was run only on the binary-expanded versions.
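A minimal sketch of the binary expansion of one ternary variable described above (our own illustration; the variable and value names follow the example in the text):

    # A ternary variable connected in {false, bad, good} is encoded with two booleans,
    # (connected, connected_well); the fourth combination (False, True) is unreachable.
    ENCODING = {
        "false": (False, False),
        "bad":   (True,  False),
        "good":  (True,  True),
    }

    # Two booleans give 4 joint settings for 3 reachable values: a 4/3 blow-up per ternary variable.
    assert len(ENCODING) == 3 and len(set(ENCODING.values())) == 3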
The examples labelled factory1 and factory2 differ only by a single binary variable, which is not affected by any action in the domain and which does not itself affect any other variables. Hence, the numbers of internal nodes reported in Table 1 are identical for the two examples. This variable was added in order to show how structured representations like SPUDD and SPI can effectively discard variables which do not affect the problem at hand, as discussed in Section 4.2. Since SPUDD pre-computes the complete action diagrams, as shown in Figure 6, the running time for SPUDD almost doubles when this new variable is added, since it creates overhead for the iterative procedure. This problem could be circumvented using the method described at the end of Section 4.2.

Running times are shown for SPUDD and SPI. However, the algorithms do not lend themselves easily to comparisons of running times, since implementation details cloud the results; running times will therefore not be discussed further here. The SPI results are shown in order to compare the sizes of the final value function representations, which give an indication of complexity for policy generation algorithms. However, a question arises when comparing such numbers about the variable orderings, as mentioned in Section 3. The variable ordering for SPUDD is chosen prior to runtime and remains the same during the entire process. No special techniques were used to choose the ordering, although it may be argued that good orderings could be gleaned from the MDP specification. Variable orderings within the branches of the tree structure in the SPI algorithm are determined primarily by the choice of ordering in the reward function and action descriptions [7]. Again, no special techniques were used to choose the variable ordering in SPI. Finding the optimal variable ordering in either case is a difficult problem, and we assume here that neither algorithm has an advantage in this regard. Dynamic reordering algorithms are available in CUDD, and have been implemented but not yet fully tested in SPUDD (see below).

In order to compare representation sizes, we compare the number of internal nodes in the value function representations only. This is most important when doing dynamic programming backup steps and is a large factor in determining both running time and space requirements. Furthermore, we compare numbers from SPUDD using binary representations with numbers from SPI using binary/ternary representations in order not to disadvantage SPI, which can make use of ternary variables. We also compare both implementations using only binary variables. The equivalent tree leaves column in Table 1 gives the number of leaves of the totally ordered binary tree (and hence the number of internal nodes) that results from expanding the value ADD generated by SPUDD. These numbers give the size of the tree that would be generated if a total ordering were imposed. Comparing these numbers with the numbers generated by SPI gives an indication of the savings that occur due to the relaxation of the total ordering constraint. The rightmost column in Table 1 shows the ratio of the number of internal nodes in the tree representation to the number in the ADD representation. We see that reductions of up to 30 times are possible when comparing binary representations to binary/ternary representations, and reductions of over 40 times when comparing the same binary representations. These space savings also showed up in the amount of memory used.
For example, the factory3 example took 691 MB of memory using SPI, and only 148 MB using SPUDD. The factory4 example took 378 MB of space using SPUDD.

Table 1: Results for FACTORY examples. Variable counts and state-space sizes are given for the original (ternary) formulation and its binary expansion; SPI times, node counts, and tree-to-ADD node ratios are given for both the ternary and the binary formulation.

    Example  | Ternary | Vars (orig/bin) | States (orig/bin)    | SPUDD time (s) | SPUDD int. nodes | SPUDD leaves | Equiv. tree leaves | SPI time (s) tern/bin | SPI int. nodes tern/bin | SPI leaves tern/bin | Tree:ADD ratio tern/bin
    factory  | 3       | 14 / 17         | 55296 / 131072       | 78.0           | 828              | 147          | 8937               | 2210.6 / 2188.2       | 6721 / 9513             | 7879 / 9514         | 8.12 / 11.48
    factory0 | 3       | 16 / 19         | 221184 / 524288      | 111.4          | 1137             | 147          | 14888              | 5763.1 / 6238.4       | 15794 / 22611           | 18451 / 22612       | 13.89 / 19.89
    factory1 | 3       | 18 / 21         | 884736 / 2097152     | 279.0          | 2169             | 178          | 49558              | 14731.9 / 15430.6     | 31676 / 44304           | 37315 / 44305       | 14.60 / 20.43
    factory2 | 3       | 19 / 22         | 1769472 / 4194304    | 462.1          | 2169             | 178          | 49558              | 14742.4 / 15465.0     | 31676 / 44304           | 37315 / 44305       | 14.60 / 20.43
    factory3 | 4       | 21 / 25         | 10616832 / 33554432  | 3609.4         | 4711             | 208          | 242840             | 98340.0 / 112760.1    | 138056 / 193318         | 168207 / 193319     | 29.31 / 41.04
    factory4 | 4       | 24 / 28         | 63700992 / 268435456 | 14651.5        | 7431             | 238          | 707890             | -                     | -                       | -                   | -

The BIGADD limit (see Figure 6) was set to 10000 for the factory, factory0, factory1, and factory2 examples, and to 20000 for the factory3 and factory4 examples. These limits broke up the complete action diagrams into two or three pieces, with typically 6000-10000 nodes in the first and second pieces and under 1000 nodes in the third, if it existed. In the large examples (factory2, 3 and 4), it was not possible (with 1 GB of RAM) to generate the full complete action diagram as a single piece, and running times became too large when BIGADD was set to 1. The functionality of this "tuning knob" was not fully investigated, but, along with studies of different heuristics for variable grouping, it is an interesting avenue for future exploration.

For comparison purposes, flat (unstructured) value iteration was run on both the factory and factory0 examples. The times taken for these problems were 895 and 4579 seconds, respectively. For the larger problems, memory limitations precluded completion of the flat algorithm.

In order to examine worst-case behaviour, we tested SPUDD on a series of examples, drawn from [7, 8], in which every state has a unique value; hence, the ADD representing the value function has a number of terminal nodes that is exponential in the number of state variables. The problem EXPON involves n ordered propositions and n actions, one for each proposition. Each action makes its corresponding proposition true, but causes all propositions lower in the order to become false. A reward is given only if all variables are true. The problem is representable in O(n^2) space using ADDs, but the optimal policy winds through the entire state space like a binary counter. This problem causes worst-case behaviour for SPUDD because all 2^n states have different values.
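A sketch of the EXPON dynamics as just described (ours; the encoding of the deterministic effects as a successor function is simply one convenient rendering):

    def expon_successor(state, i):
        """Action a_i makes proposition i true and every lower-ordered proposition false;
        all other propositions persist.  `state` is a tuple of n booleans."""
        return tuple(True if j == i else (False if j < i else state[j])
                     for j in range(len(state)))

    def expon_reward(state):
        return 1.0 if all(state) else 0.0     # rewarded only when all propositions hold

    # The fastest route from the all-false state to the all-true state behaves like a
    # binary counter, passing through every other state exactly once (2**n - 1 steps):
    n = 3
    state, steps = (False,) * n, 0
    while not all(state):
        i = next(j for j in range(n) if not state[j] and all(state[:j]))  # lowest "incrementable" slot
        state = expon_successor(state, i)
        steps += 1
    print(steps)    # 2**n - 1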
SPUDD f K M/ L f+0$ andf + variables, leading the EXPON example with / G  °%€ K 0+$ G and @ G $ K , respecto state spaces with sizes KN| ¨ in these tively. The initial reward and the discounting factor f examples must be scaled to accommodate the O -step lookahead for the largest problem (12 variables), and were set to +$@PO and $ uw¨¨ , respectively.Q Figure 7 compares the running times of SPUDD and (flat) value iteration plotted (in log scale) as a function of the number of variables. Running times for both algorithms exhibit exponential growth with the number of variables, as expected.O It is not surprising that flat value iteration performs better in this type of problem since there is absolutely no structure that can be exploited by SPUDD. However, the overhead involved with creating ADDs is not overly severe, and tends to diminish f as the problems grow larger. With N²1 + , SPUDD takes less than 10 times longer than value iteration. One can similarly construct a “best-case” series of examples, where the value function grows linearly in the number of problem variables. Specifically, the problem LINEAR involves N variables and m N^5—+ distinct values. The MDP J has can be represented in N  space using ADDs and the opJ timal value function can be represented in N space with an ADD (see [8] for further details). R Hence, the inherent structure of such a problem can easily be exploited. As seen in Figure 8, SPUDD clearly takes advantage of the structure in the problem, as its running time increases linearly with the number of variables, compared to an exponential 100 V 10 1 0.1 6 7 8 9 Number of Variables 10 11 12 Figure 7: Worst-case behavior for SPUDD. increase in running time associated with flat value iteration. 6 Concluding Remarks In this paper, we described SPUDD, an implementation of value iteration, for solving MDPs using ADDs. The ADD representation captures some regularities in system dynamics, reward and value, thus yielding a simple and efficient representation of the planning problem. By using such a compact representation, we are able to solve certain types of problems that cannot be dealt with using current techniques, including explicit matrix and decision tree methods. Though the technique described in this paper has not yet been tested extensively on realistic domains, our preliminary results are encouraging. One drawback of using ADDs is the requirement that variables be boolean. Any (finite-valued) non-boolean variable can be split into a number of boolean variables, generally in a way that preserves at least some of the structure of the original problem (see above), though it often 10000 References SPUDD Flat VI Computation Time (sec) 1000 100 V 10 1 0.1 6 8 10 12 Number of Variables 14 16 18 Figure 8: Best-case behavior for SPUDD. makes the new state space larger than the original. Conceptually, there is no difficulty in allowing ADDs to deal with multi-valued variables (all algorithms and canonicity results carry over easily). However, for domains with relatively few multi-valued variables, SPUDD does not appear to be handicapped by the requirement of variable splitting. At present, SPUDD uses a static user-defined variable ordering in order not to cloud the initial results with the effects of dynamic variable reordering. 
However, dynamic reordering of the variables at runtime can make significant improvements both in the space required, by finding a more compact representation, and in the running time, by allowing more appropriate subsets of variables to be chosen, as discussed in Section 4.2. The CUDD package provides a rich set of dynamic reordering algorithms [20]. Typically, when the ADD grows too large, variable reorderings are attempted by following one of these algorithms, and a new ordering is chosen which minimizes the space needed. Some of the available techniques are slight variations of existing techniques, while others were developed specifically for the package. It may be necessary, however, to implement a new heuristic which takes into account the variable subsets that influence the running time. Future work will include more complete experimentation with automatic dynamic reordering in SPUDD.

Another extension of SPUDD would be the implementation of other dynamic programming algorithms, such as modified policy iteration, which are generally considered to converge more quickly than value iteration in practice. Finally, we hope to explore approximation methods within the ADD framework, such as those previously researched in the context of decision trees [6].

Acknowledgements

Thanks to Richard Dearden for helpful comments and for providing both his SPI code and example descriptions for comparison purposes. St-Aubin was supported by NSERC. Hu was supported by NSERC. Boutilier was supported by NSERC Research Grant OGP0121843 and IRIS-III Project "Dealing with Actions."

References

[1] R. Iris Bahar, E. A. Frohm, C. M. Gaona, G. D. Hachtel, E. Macii, A. Pardo, and F. Somenzi. Algebraic decision diagrams and their applications. Proc. Intl. Conf. on Computer-Aided Design, pp. 188–191, IEEE, 1993.
[2] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, 1957.
[3] D. P. Bertsekas and D. A. Castanon. Adaptive aggregation for infinite horizon dynamic programming. IEEE Trans. Aut. Cont., 34:589–598, 1989.
[4] C. Boutilier. Correlated action effects in decision theoretic regression. Proc. UAI-97, pp. 30–37, Providence, RI, 1997.
[5] C. Boutilier, T. Dean, and S. Hanks. Decision theoretic planning: Structural assumptions and computational leverage. J. Artif. Intel. Research, 1999. To appear.
[6] C. Boutilier and R. Dearden. Approximating value trees in structured dynamic programming. Proc. Intl. Conf. on Machine Learning, pp. 54–62, Bari, Italy, 1996.
[7] C. Boutilier, R. Dearden, and M. Goldszmidt. Exploiting structure in policy construction. Proc. IJCAI-95, pp. 1104–1111, Montreal, 1995.
[8] C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic dynamic programming with factored representations. Manuscript, 1999.
[9] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. Proc. UAI-96, pp. 115–123, Portland, OR, 1996.
[10] R. E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Trans. Comp., C-35(8):677–691, 1986.
[11] A. Cimatti, M. Roveri, and P. Traverso. Automatic OBDD-based generation of universal plans in non-deterministic domains. Proc. AAAI-98, pp. 875–881, 1998.
[12] T. Dean and R. Givan. Model minimization in Markov decision processes. Proc. AAAI-97, pp. 106–111, Providence, 1997.
[13] T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Comp. Intel., 5(3):142–150, 1989.
[14] R. Dearden and C. Boutilier. Abstraction and approximate decision theoretic planning. Artif. Intel., 89:219–283, 1997.
[15] R. Dechter. Topological parameters for time-space tradeoff. Proc. UAI-96, pp. 220–227, Portland, OR, 1996.
[16] S. Hanks and D. V. McDermott. Modeling a dynamic and uncertain world I: Symbolic and probabilistic reasoning about change. Artif. Intel., 1994.
[17] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, 1988.
[18] D. Poole. Exploiting the rule structure for decision making within the independent choice logic. Proc. UAI-95, pp. 454–463, Montreal, 1995.
[19] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, NY, 1994.
[20] F. Somenzi. CUDD: CU Decision Diagram package. Available from ftp://vlsi.colorado.edu/pub/, 1998.