Abstracting Influences for Efficient Multiagent Coordination Under Uncertainty

by

Stefan J. Witwicki

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in the University of Michigan, 2011

Doctoral Committee:
    Professor Edmund H. Durfee, Chair
    Professor Satinder Singh Baveja
    Professor Michael P. Wellman
    Assistant Professor Amy Ellen Mainville Cohn

© Stefan J. Witwicki, All Rights Reserved, 2011

To my late grandfather, Stephen F. Witwicki.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my advisor, Ed Durfee. Not only has he provided me with countless bits of advice, which have no doubt shaped me as a researcher, but he has also given me a tremendous amount of freedom to explore and to discover for myself the research that most excites me. It is the resulting enthusiasm and confidence that has propelled me to complete this dissertation. On that note, without Ed's thoughtful suggestions and critical feedback, my work surely would not have achieved the level of quality and depth that it has. I owe him a great debt of gratitude for all of the guidance, patience, and tireless assistance he has given me when I needed each the most.

My research has also benefited greatly from my interactions with my other doctoral committee members, namely Amy Cohn, Satinder Baveja, and Michael Wellman, each of whom has shared with me their own unique perspectives and insightful critiques. The feedback that I received during my proposal defense was instrumental in redirecting the focus of my dissertation in the two years leading up to my final defense. In particular, Michael Wellman's rigorous reviews of drafts of my dissertation at various stages of its development have been extremely useful.

I would be remiss if I did not extend thanks to the administrators and support staff of the CSE department. During my time as a grad student, they have helped me to navigate program policies and Ph.D. requirements, to make sense of my finances, and to arrange conference travel, aside from performing countless other supportive acts behind the scenes. I am especially grateful for the help of Dawn Freysinger, Rita Rendell, Kelly Cormier, and Cindy Watts.

I'd also like to thank my colleagues. Fellow grad students and alumni, especially Jim, Erik, Dmitri, Jonathon, Lian, Jason S., Anna, Quang, Andrew R., Gargi, and Chris P., have made the lab an engaging and productive environment, and have entertained and inspired me. Outside of Michigan, I've been fortunate to publish within an extremely supportive research community. Conversations and e-mail exchanges with Frans Oliehoek, Janusz Marecki, Hala Mostafa, Shlomo Zilberstein, Prashant Doshi, Milind Tambe, Chris Amato, and Victor Lesser have provided me with invaluable perspective, perpetual encouragement, and a strong motivation to continue to share my research and to collaborate with others.

My parents, my sister, and my extended family are deserving of more gratitude than I could possibly convey with this document. Thank you all so much for your unwavering support through all of the ups and downs of my Ph.D. journey. I couldn't have made it this far without you.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF APPENDICES
ABSTRACT

CHAPTER

1. Introduction
   1.1 Multiagent Coordination Under Uncertainty
       1.1.1 Motivating Example
       1.1.2 Core Problem Properties
   1.2 Problem Statement
   1.3 Solution Approach
   1.4 Contributions

2. Background
   2.1 Overview
   2.2 Single-Agent Sequential Decision Making
       2.2.1 Markov Decision Processes
       2.2.2 Partially-Observable MDPs
       2.2.3 Complexity of Single-Agent Planning
       2.2.4 Decomposition and Abstraction
   2.3 Multiagent Coordination
       2.3.1 Decentralized POMDPs
       2.3.2 Structural Restrictions and Subclasses
       2.3.3 Decoupled Joint Policy Formulation
       2.3.4 Coordinating Abstract Behavior
   2.4 Summary

3. Exploiting Transition-Dependent Interaction Structure
   3.1 Overview
   3.2 TD-POMDP Formalism
       3.2.1 Factored Decomposability
       3.2.2 Nonconcurrently-Controlled Nonlocal Features
       3.2.3 Temporal Synchronization
       3.2.4 Decoupled Representation
   3.3 Optimality and Tractability
       3.3.1 Solution Concept
       3.3.2 General Complexity
       3.3.3 Significance of Structure
   3.4 Expressiveness of the Representation
       3.4.1 Comparison with Existing Models
       3.4.2 Communication
       3.4.3 Overcoming Representational Limitations
   3.5 Weak Coupling
       3.5.1 Locality of Interaction
       3.5.2 Degree of Influence
       3.5.3 Summary of Weak Coupling Characterization
       3.5.4 Related Work on Characterizing Weak Coupling
       3.5.5 Contribution Outside the Scope of the TD-POMDP
   3.6 Summary

4. Influence-Based Policy Abstraction
   4.1 Overview
   4.2 Belief State and Influence
       4.2.1 General Best-Response Belief State
       4.2.2 Condensed Belief State for TD-POMDP Agents
       4.2.3 TD-POMDP Belief State Sufficiency
       4.2.4 Complexity of Best Response Computation
       4.2.5 Influence Information
   4.3 Characterization of Transition Influences
       4.3.1 Transition Influences
       4.3.2 State-Dependent Influences
       4.3.3 History-Dependent Influences
       4.3.4 Influence-Dependent Influences
       4.3.5 Comprehensive Influence DBN
   4.4 A Special Case: Influences on Event-Driven Features
   4.5 Influence Space
   4.6 Empirical Analysis of Influence Space Size
       4.6.1 Experimental Setup
       4.6.2 Results
       4.6.3 Summary of Findings
   4.7 Summary

5. Constrained Local Policy Formulation
   5.1 Overview
   5.2 Applying the Dual LP Formulation
       5.2.1 Constraining the LP to Return a Deterministic Policy
       5.2.2 Evaluating Deterministic Policies
       5.2.3 Handling Partial Observability
   5.3 Probabilistic Goal Achievement
   5.4 State-Dependent Influence Achievement
       5.4.1 History-Dependent Influence Achievement
   5.5 Alternative Approaches to Constraining Influence
   5.6 Exploring the Space of Feasible Influences
   5.7 Summary

6. Optimal Influence-Space Search
   6.1 Overview
   6.2 Correctness of Optimal Influence-Space Search
   6.3 Depth-First Search
       6.3.1 Structure of Search Tree
       6.3.2 Enumerating Feasible Influences
       6.3.3 Incorporating Ancestors' Influences
   6.4 Interaction Digraph Cycles
   6.5 Empirical Results
       6.5.1 Experimental Setup
       6.5.2 Comparison with Policy-Space Search
       6.5.3 Comparison with the Centralized MILP Approach
       6.5.4 Comparison with SPIDER
       6.5.5 Comparison with SBP
       6.5.6 Scaling Beyond Two Agents
       6.5.7 Summary and Discussion
   6.6 Scaling Beyond a Handful of Agents
       6.6.1 Independent Ancestors
       6.6.2 Conditionally Independent Descendants
       6.6.3 Bucket Elimination for Optimal Influence Search
       6.6.4 Complexity of Bucket Elimination OIS
       6.6.5 Empirical Results

7. Flexible Approximation Techniques
   7.1 Approximation of Influence Probabilities
   7.2 Time Commitment Abstraction
       7.2.1 Service Coordination
       7.2.2 Time Commitment Formalism
       7.2.3 Modeling, Incompleteness, and Inconsistency
       7.2.4 Space of Time Commitments
   7.3 Greedy Service Negotiation
       7.3.1 Negotiation Protocol
       7.3.2 Service Provider Reasoning
       7.3.3 Service Requester Reasoning
       7.3.4 Negotiation-Driven Commitment Convergence
       7.3.5 Empirical Results

8. Conclusions
   8.1 Summary of Contributions
       8.1.1 Identifying Structure
       8.1.2 Abstracting Influences
       8.1.3 Proposing and Evaluating Influences
       8.1.4 Coordinating Influences
   8.2 Open Questions
       8.2.1 Quality-Bounded Influence Space Search
       8.2.2 Influence Encoding Compaction
       8.2.3 Other Applications of Influence Abstraction
   8.3 Closing Remarks

APPENDICES
BIBLIOGRAPHY

LIST OF FIGURES

Figure
1.1   Planetary Exploration example domain.
1.2   Components of Influence-Based Abstraction methodology.
1.3   Overview of dissertation contributions, indexed by chapter.
2.1   MDP state, action, transition, and reward dynamics.
2.2   A simple example of a planning problem faced by a Mars rover.
2.3   The MDP for the rover in Example 2.1.
2.4   POMDP state, action, transition, observation, and reward dynamics.
2.5   DBN describing relationships among Dec-POMDP variables.
2.6   Decoupled joint policy search.
3.1   Example of structured interaction among TD-POMDP agents.
3.2   A simple satellite-rover example problem.
3.3   Example of local state representations and local observations.
3.4   Example of the dependencies among feature transitions.
3.5   DBN illustrating the TD-POMDP's structured transition dependence.
3.6   An example of a constraint graph.
3.7   The interaction graph for Example 3.1.
3.8   An example of exploitable interaction digraph structure.
3.9   Examples of COP constraint graphs derived from interaction digraphs.
3.10  The TD-POMDP description for Example 3.31.
3.11  Example of equivalence classes.
4.1   Example of limited influence.
4.2   Usage of belief state for POMDP agent reasoning.
4.3   One possible belief state trajectory of rover 2 from Example 4.2.
4.4   A DBN expressing CI relationships among TD-POMDP variables.
4.5   Abstracting influences from policies.
4.6   The influence DBN for each previously-presented example.
4.7   An agent's local policy space and resultant influence space.
4.8   A digraph vertex representing an influencing agent.
4.9   Two variations of agent i's influences.
4.10  State, policy, and influence space sizes as a function of time horizon T.
4.11  Degree of influence as a function of time horizon T.
4.12  Branching factor, policy space size, and influence space size.
4.13  Degree of influence vs. tasks per agent and local window size.
4.14  State, policy, and influence space sizes as a function of uncertainty.
4.15  Degree of influence as a function of uncertainty.
4.16  Varying the number of nonlocally-affecting tasks and influence type.
4.17  Distribution of influence space sizes for 100 problems (per setting).
4.18  Scatter plot of policy space sizes vs. respective influence space sizes.
4.19  Varying the size of the nonlocal feature manipulation window.
4.20  Varying the nonlocally-affecting task's start time and window size.
4.21  Increasing the size of the local decision problem.
5.1   Functional diagram of constrained local policy formulation.
5.2   A simple, concrete example of influences modeled by 2 agents.
6.1   One path through the influence search tree.
6.2   Example of marginalization of unneeded DBN parameters by Agent 7.
6.3   Example of Influence-Space Search on a cyclic interaction digraph.
6.4   OIS vs. Policy Space Search: growing problem size.
6.5   OIS vs. Policy Space Search: window of interaction.
6.6   OIS vs. Centralized MILP: scaling.
6.7   OIS vs. Centralized MILP: NLAT window size.
6.8   OIS vs. SPIDER: scaling local problem size.
6.9   OIS vs. SPIDER: NLAT window size.
6.10  OIS vs. SBP: NLAT window size.
6.11  Scalability of OIS and Centralized MILP to more than two agents.
6.12  An interaction digraph wherein parents are independent.
6.13  An interaction digraph wherein children are conditionally independent.
6.14  Interaction digraph (left) and processing of buckets by BE-OIS (right).
6.15  "chain" and "zigzag" interaction digraph topologies.
6.16  Scalability of DF-OIS and BE-OIS on "zigzag" topology.
7.1   Empirical evaluation of ε-approximate OIS.
7.2   Service Coordination example.
7.3   A conservative model of a time commitment.
7.4   The space of feasible time commitments.
7.5   Negotiation Protocol.
7.6   An example of counterproposal.
7.7   Scalability: problem time horizon.
7.8   Scalability: local complexity.
7.9   Scalability: number of agents.
7.10  Average solution quality on 25 random problems.
7.11  Scalability: OIS vs. Greedy Service Negotiation on "chain" topology.
7.12  Solution quality of Greedy Service Negotiation on "chain" topology.

LIST OF TABLES

Table
2.1   A sample execution trace for the rover in Example 2.2.
3.1   Comparison of Dec-POMDP subclasses.
4.1   Testbed parameterization.

LIST OF APPENDICES

Appendix
A.  Comparison of EDI-DEC-MDP and TD-POMDP
B.  Random Service Problem Generation

ABSTRACT

When planning optimal decisions for teams of agents acting in uncertain domains, conventional methods explicitly coordinate all joint policy decisions and, in doing so, are inherently susceptible to the curse of dimensionality, as state, action, and observation spaces grow exponentially with the number of agents. With the goal of extending the scalability of optimal team coordination, the research presented in this dissertation examines how agents can reduce the amount of information they need to coordinate. Intuitively, to the extent that agents are weakly coupled, they can avoid the complexity of coordinating all decisions; they need instead only coordinate abstractions of their policies that convey their essential influences on each other.

In formalizing this intuition, I consider several complementary aspects of weakly-coupled problem structure, including agent scope size, corresponding to the number of an agent's peers whose decisions influence the agent's decisions, and degree of influence, corresponding to the proportion of unique influences that peers can feasibly exert. To exploit this structure, I introduce a transition-dependent decentralized POMDP model that efficiently decomposes into local decision models with shared state features. This context yields a novel characterization of influences as transition probabilities (compactly encoded using a dynamic Bayesian network). Not only is this influence representation provably sufficient for optimal coordination, but it also allows me to frame the subproblems of (1) proposing influences, (2) evaluating influences, and (3) computing optimal policies around influences as mixed-integer linear programs.

The primary advantage of working in the influence space is that there are potentially significantly fewer feasible influences than there are policies. Blending prior work on decoupled joint policy search and constraint optimization, I develop influence-space search algorithms that, for problems with a low degree of influence, compute optimal solutions orders of magnitude faster than policy-space search. When agents' influences are constrained, influence-space search also outperforms other state-of-the-art optimal solution algorithms. Moreover, by exploiting both degree of influence and agent scope size, I demonstrate scalability, substantially beyond the reach of prior optimal methods, to teams of 50 weakly-coupled transition-dependent agents.

CHAPTER 1

Introduction

A fundamental characteristic of any intelligent system, natural or artificial, is its ability to make a rational decision when faced with a set of choices. As people, our daily lives are filled with decisions, each of which involves reasoning about the consequences of potential choices. Automating the decision-making process is the topic of an extremely active area of research. The motivation is that, by outfitting the automated decision-maker (or agent) with computational resources and developing efficient and effective reasoning algorithms for it to make its decisions, we can realize tremendously beneficial systems.
For instance, decision-support software agents can help doctors and nurses in a hospital's intensive-care unit to make quick decisions about treatment options; agents controlling nodes of a power grid can conserve energy by forecasting consumption and deciding where to route power and when; and unmanned autonomous vehicles can deliver relief supplies and search for survivors in the wake of a natural disaster. In each of these domains, there is uncertainty such that an agent cannot predict the consequences of its decisions deterministically, but can instead reason over a space of probabilistic outcomes. Further, each domain involves multiple interacting agents whose individual choices may affect each others' decision consequences. Hence, achieving the most desirable outcomes requires coordination.

Conceptually, one might view the multi-agent coordination problem as planning joint actions for the system of agents. That is, a centralized planner formulates a joint decision rule that dictates, for any given state of the overall system, a harmonious composition of individual agent actions. In this sense, the multi-agent problem is much like controlling a single agent's multiple arms. This centralized approach, taken by much of the literature on multi-agent sequential decision-making (reviewed in Section 2.3.1.2), implicitly achieves coordination since all agents' actions are planned together. However, there are limitations to solving the coordination problem in this manner. From a computational standpoint, the number of composite action choices grows exponentially with the number of agents, leading to poor scalability of methods that plan every decision jointly. From a logistical standpoint, this method requires a centralization of all problem information during the planning process, a requirement that is not reasonable for systems maintaining a separation of information (either because the infrastructure does not support transmission of all information to a central entity, or because there are portions of information that need be kept private).

Building on prior work reviewed in Section 2.3.3, the research presented in this dissertation studies an alternative approach to multi-agent coordination that decentralizes the planning process. In particular, it applies the following insight. If some individual agent decisions do not affect other agents in the system, these decisions need not be jointly reasoned about. Instead of coordinating all decisions, the agents only need to coordinate the individual choices that affect one another. This is a standard method that people use to perform joint reasoning. For instance, when scheduling a meeting, instead of describing their individual activities in full, it is common for a group to communicate windows of availability. In doing so, they create a layer of abstraction that separates the influence each individual has on the group from the underlying joint decisions. The decisions around the meeting can then be planned individually, thereby avoiding much of the complexity of fully-centralized reasoning (and maintaining privacy of local decision information), yet still achieving harmonious joint behavior. The focus of this dissertation is on the computational aspects of influence abstraction: the development and evaluation of representations and algorithms for coordinating agents' abstract influences.
1.1 Multiagent Coordination Under Uncertainty

The label multiagent coordination under uncertainty could be used to describe a vast array of problems with different assumptions studied under disparate circumstances. This thesis focuses on one particular class of coordination problems with specific properties. I begin by describing a motivating domain (in Section 1.1.1) wherein problems from this class arise, and outlining the properties that define the class (in Section 1.1.2).

1.1.1 Motivating Example

Consider a team of robots sent to explore the surface of Mars to gather scientific data autonomously for a period of months with little or no human intervention. As in past NASA missions, this team would likely include rovers situated on the planet's surface that move about and take various sensor measurements. The International Mars Exploration Workgroup is presently exploring the use of an array of additional robots and spacecraft equipped with other technologies to be deployed in future missions (Beaty et al., 2008). For instance, the rover team could benefit from the inclusion of orbiting satellites capable of collecting real-time imaging data for map-building and rover localization. Imagine also a data processing center that safely houses a database of scientific information (including data from past missions and newly-collected data) as well as specialized hardware and software for image processing, path planning, and forecasting of environmental conditions.

Figure 1.1: Planetary Exploration example domain. (Images courtesy of NASA.)

Together these components make up a system of agents with diverse capabilities that act and interact in a shared environment. The example pictured in Figure 1.1 contains four such agents that gather scientific data by performing various high-level activities. A satellite orbits the planet taking pictures and relaying information between Earth and Mars. On the surface sits a base station that houses the data center, whose activities include analyzing imaging data and compiling the data from multiple sources into detailed surface maps. Two rovers, situated at the base at the start of the mission, move about the surface and visit different sites of interest. Rover 1 is equipped with tools and sensors that it can use to dig into the ground and analyze the soil. Rover 2 is designed to travel more quickly and to position itself in a series of locations so as to compile 3-dimensional sensory data. Because of its speed, it also has the capability of rapidly exploring unknown areas.

As the agents complete their activities, they fulfill various science-gathering objectives. Thus, associated with each successfully completed activity is some value, and the overall productivity of the science-gathering mission may be quantified as the summation of completed activity values. As new information is collected, new objectives may present themselves. Imagine that with each new day comes a new mission and associated objectives, which may differ depending on environmental conditions and on analyses of past missions. In each such mission, the collective goal of the agents is to maximize their expected accumulated values.
The agents can each gather and process data on their own, but benefit from interacting with one another. Interactions occur through the pursuit of interdependent activities (denoted by the lines in Figure 1.1). For instance, Rover 1 can visit site A more efficiently if the data center agent first plans a path for it. By interacting in this manner, the combination of their coordinated activities allows them to achieve complex mission objectives like locating areas with unusual geographical features, navigating rovers to those areas, analyzing soil samples, and storing the results in a geological database. It is through such a composition of individual activities that the team of agents achieves the greatest collective value, thereby making the most of its time exploring the planet.

Successful coordination in this domain requires surmounting several difficult challenges. The agents' objectives are temporally constrained with strict deadlines. In the example from Figure 1.1, the satellite is constrained in when it can take pictures of sites and areas of interest because it is orbiting about the planet. The rovers' only source of power is the sun, so each is constrained to complete activities by a deadline related to the amount of energy it has stored and the time the sun sets. There are also behavioral constraints that dictate that each agent can perform only one activity at a time. For instance, the satellite imaging agent cannot simultaneously point its camera at two different locations. In order to optimally coordinate the team's behavior and avoid wasting resources, agents' activities should be carefully planned in advance. Furthermore, there is uncertainty in the durations of various activities. For instance, depending on the path planned for Rover 1 and on the obstacles encountered along this path, it may take a variable amount of time to reach site A. In order to maximize productivity in expectation, the agents' plans should account for the uncertainty in their actions and interactions.

1.1.2 Core Problem Properties

The Mars exploration example, as with all problems considered in this dissertation, may be characterized using the properties outlined here and in Section 1.2. This description serves to clarify the context of this thesis and to preempt any broad misconceptions. I begin by stating the fundamental properties of the class of coordination problems addressed herein.

Property 1: Cooperative Multiagent System. The problem is to formulate intelligent behavior for a team of coexisting agents that share a common goal: to maximize the group's joint value (of which there is some well-defined measure). In the example from Figure 1.1, joint value is measured as the expectation of the sum of qualities accrued from the successful completions of activities. For each agent, there is no notion of personal gain, and consequently, issues of fairness, truth, and incentivization do not arise in the development of solution methods.

Property 2: Model-Based Planning. Agents' activities involve a significant investment of resources over time, so decisions about which activities to perform, for what purposes, and when should be carefully planned in advance so as not to waste time and resources. For this purpose, there exists a (generative) model, known to the team in advance, that the agents may use to make (probabilistic) predictions and plan activities that maximize expected outcome utilities.

Property 3: Sequential Decision Making Under Uncertainty.
The agents interact with their environment and with one another by performing actions and receiving observations. The model describing their behavior is a Markov decision process (described more formally in Section 2.3.1) that associates an underlying system state with any situation that the agents may encounter. As the agents take actions, the model can be used to make (probabilistic) predictions about transitions of the state and resulting observations. The model also describes the value of activities by associating a reward with every state and combination of agents' actions. Uncertainty in an agent's activities (such as the analysis of soil by Rover 1 in Figure 1.1) translates to uncertainty in the system state and observation outcomes, accounted for in the model as transition probabilities and observation probabilities. The model also accounts for constraints on when agents can successfully execute their activities as well as for interdependencies between activities. The problem of optimal coordination then becomes deciding how each agent should act given its observations, such that the sequence of joint decisions maximizes the agents' expected accumulation of rewards over a finite time horizon. The formulation of joint decisions is referred to as a joint policy.
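To make the planning criterion concrete, the finite-horizon objective described here can be written as follows (an illustrative rendering of the standard criterion; the formal Dec-POMDP definition appears in Section 2.3.1):

    \[
      \pi^* \;=\; \arg\max_{\pi = \langle \pi_1, \ldots, \pi_n \rangle} \;
      \mathbb{E}\!\left[ \sum_{t=0}^{T-1} R(s_t, \vec{a}_t) \;\middle|\; \pi \right],
    \]

where each agent i's action at time t is selected by its local policy from its own observation history, T is the finite time horizon, and R is the joint reward function.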
Transition-independent problems are believed to be fundamentally less complex (Allen, 2009; Goldman & Zilberstein, 2004) than general Dec-POMDPs. Although empirical results demonstrate scalability of quality-bounded1 solutions to teams of more than a handful of transition-independent 1 I use the term quality-bounded to refer to solutions whose values are guaranteed to be within 6 agents (Marecki et al., 2008), the drawback of these models is that they place a fairly strong restriction on the way that agents may interact. The inability of the agents to alter the consequences of each others’ actions means that many useful interactions simply cannot be represented. For instance, we would expect that the act of the data center agent from Figure 1.1 planning a path should reduce the outcome duration of rover 1’s “visit site” activity. But this constitutes a transition-dependent interaction and is outside of the scope of transition-independent models. The research presented in this dissertation endeavors to concretely define a form of transition-dependent interaction structure that can be exploited, and to develop efficient2 , scalable solution algorithms capable of exploiting it. While others have taken steps in identifying transition-dependent structure, their models either (a) have not been shown to compute solutions with guaranteed bounds on quality (Varakantham et al., 2009), (b) have have not been shown to scale three beyond agents (Becker et al., 2004a; Oliehoek et al., 2008b), or (c) impose limitations on agents’ individual (noninteracting) behavior by restricting the transition or observation function (Beynier & Mouaddib, 2005; Guestrin & Gordon, 2002; Marecki & Tambe, 2009, 2007). Alternatively, this dissertation focuses on identifying useful structure in agents’ interactions without restricting agents’ individual behavior. To this end, I now introduce (and formalize in Chapter 3) several additional problem properties. Property 5: Factored Decomposability. The system model (described in Properties 2–3) conveys a complete description of the problem, containing information regarding all agents’ action consequences, but the information is explicitly decomposed into subproblem descriptions, each conveying the dynamics of a single agent’s behavior. In particular, the world state is factored into (overlapping) local state partitions, each composed of features relevant to an individual agent. In the example from Section 1.1.1, images that a satellite is taking of one side of the planet do not factor into the immediate decisions of a rover analyzing soil on the other side of the planet. As such, independent local transition functions dictate that each agent’s actions may exert immediate effects on the values of features within its local state but not those outside of its local state. Observations are similarly factored such that an agent cannot observe immediate changes to features outside of its local state. Further, the reward function is composed of local reward functions that convey the immediate benefits of each agent’s actions. Moreover, the overall value of a joint policy can be computed some nontrivial, expressible bound of the optimal value. 2 Given the daunting complexity of problems that I address, I use the term efficient here and throughout this dissertation to mean relatively efficient (in comparison to the computation required by other algorithms), but not necessarily polynomial. 7 efficiently by evaluating some well-defined function of agents’ local policy values. 
Property 6: Nonconcurrently-Controlled Nonlocal Features. Agents affect outcomes of each others' activities in the following manner. With the overlap of agents' local states (described in Property 5), there are some features that are directly affected by one agent but that also appear in another agent's local state. For instance, in Figure 1.1, whether or not a rover visits a site depends upon whether or not the data center agent has planned a path for it. From the rover's perspective, we refer to "path-planned" as a nonlocal feature because its value is altered by the actions of another agent. In turn, the change in value will allow the rover to visit the site more quickly and reliably. Through changes to nonlocal features, one agent may affect the choices and consequences of another's subsequent (but not concurrent) actions. For instance, in Figure 1.1, the consequences of a rover's actions taken after the data center plans a path for it may be altered, but the actions taken by the rover while the data center agent is planning the path are unaffected. The non-concurrence of interaction effects may limit the space of representable interactions, but it vastly simplifies our decomposition of the joint planning problem (as elaborated in Section 4.2.3).
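To make the notion of a nonlocal feature concrete, here is a small illustrative sketch (the feature names and structure are my own choices for this summary, not the TD-POMDP definition from Chapter 3):

    # Illustrative sketch: a rover's local state contains locally-controlled
    # features plus one nonlocal feature ("path_planned") whose value is set
    # only by the data center agent's actions, never by the rover's own.
    from dataclasses import dataclass

    @dataclass
    class RoverLocalState:
        time: int              # synchronized clock shared by all agents (Property 7)
        location: str          # locally-controlled feature
        soil_analyzed: bool    # locally-controlled feature
        path_planned: bool     # nonlocal feature, controlled by the data center

    s = RoverLocalState(time=3, location="base", soil_analyzed=False, path_planned=False)
    # The data center's "plan path" action may later flip path_planned to True,
    # changing the dynamics of the rover's subsequent "visit site" activity.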
The expectation is that as the degree of agent coupling increases, a greater degree of coordination is required, and the coordination problem becomes harder to solve. It is on the weakly-coupled side of the spectrum that agents should be able to compute solutions more efficiently and scale their solution algorithms to larger problems (assuming they remain weakly-coupled). The main problem that I address in this dissertation is how to solve the class of cooperative, model-based, sequential, stochastic, decentralized, decomposable, structured, temporal coordination problems outlined by Properties 1–7 in such a way that will exploit interaction structure to solve weakly-coupled problems more efficiently than strongly-coupled problems. The objective is a practical computational methodology whose usefulness over existing approaches lies in its satisfaction of the following desiderata: ❼ Exploitation of weak coupling for improved performance The methodology should, in principle, compute optimal solutions to the entire spectrum of coordination problems defined by Properties 1–7. Since not every problem will be computationally tractable, the methodology should exploit structure in problems that are weakly-coupled3 , so as to require less computational overhead (measured by memory requirements and computation time) for weakly-coupled problems than for strongly-coupled problems. Moreover, gradual variations in agent coupling should lead to a gradual shift in the computational overhead required to formulate solutions. ❼ Scalability in the number of agents Much of the literature on multi-agent coordination under uncertainty restricts consideration of models or experiments to just two agents. By scaling to more agents, solution methods become more widely applicable (to domains with multiple interacting decision-makers) as well as more effective in domains where 3 Note that weakly-coupled is not a binary classifier. Throughout this dissertation, whenever I qualify problems or agents as “weakly-coupled”, I am referring to the fact that the relative degree of coupling falls towards the weak end of the coupling spectrum. 9 more agents means more diverse capabilities (such as in the planetary exploration domain described in Section 1.1.1). In particular, the methodology should produce optimal solutions to problems with dozens of transition-dependent agents (under the assumption that the agents are sufficiently weakly-coupled, though not transition-independent).4 ❼ Flexibility of Approximation Given the complexity of coordination under uncertainty, there exist many problems for which it is impractical to compute optimal solutions. Thus, it is important that the methodology produce approximate solutions according to computational restrictions. Moreover, the methodology should be amenable to different degrees of approximation, thereby providing knobs that the practitioner can turn so as to strike a desired balance between computational overhead and solution quality. 1.3 Solution Approach To provide satisfactory solutions that fulfill the desiderata in Section 1.2, I have developed a principled framework for influence-based policy abstraction. My framework formalizes the following simple intuition. Weakly-coupled agents, who have little influence on each others’ decisions, can plan more efficiently by decomposing the joint policy computation problem into (partially) decoupled subproblems: formulation of individual agent policies and coordination of abstract influences. 
This is in stark contrast to fully-centralized planning, which is the paradigm adopted by much of the literature for solving Dec-POMDPs (reviewed in Section 2.3.1.2). Instead of a central entity reasoning about all agents' policy decisions together, my framework gives each agent its own planning perspective through which to compute its own local policy. However, the local policy formulation problems are not completely independent since the optimal policy of one agent can depend on the decisions made by other agents. Agents account for these dependencies by conveying, and coordinating over, only their essential influences on each other.

Figure 1.2 provides a diagrammatic overview of influence-based policy abstraction, wherein the blocks represent different components of the solution formulation process and the arrows and lines depict information flow between the components. I introduce each component below, describing how it fits into the overarching scheme as well as identifying the high-level research questions that this dissertation is devoted to answering.

Figure 1.2: Components of Influence-Based Abstraction methodology.

Influence Modeling. The fundamental question that ignited this body of work was: How should agents model each other? The system model (introduced in Properties 2–3) provides a representation of agents' joint behavior, but not in a manner that will allow an agent to efficiently infer how its decisions are impacted by those of its peers. This portion of my work develops efficient local models that account for peers' expected behaviors as they relate to the agent's own decisions.

Intrinsic to the modeling problem is the question of: What does an agent need to know about peers' planned behavior in order to plan its own optimal local behavior in response? Knowing all the policy decisions of all other agents would certainly suffice. However, weakly-coupled agents that interact only in the context of certain activities, and with only a subset of peers, need not model all decisions of all peers. There may be many peer decisions whose consequences do not affect an agent i. With this insight, I develop an abstraction of peer policies that I call an influence, referred to in Figure 1.2 as Γ≠i, which is a subset of peer policy information that conveys only the effects of peers' decisions as they relate to agent i's own decision problem. For instance, whereas the complete policies of the team of agents in Figure 1.1 would include information about all of the activities each agent plans to pursue in every foreseeable situation, Rover 1 only cares about whether and when paths will be planned.

Although influences could be represented in any number of ways, the model that I adopt in this dissertation takes the form of a probability distribution over interaction effects. For instance, the influence Γ_DataCenter of the Data Center on Rover 1 would include the probability Pr(Path-to-Site-A | time) of the Data Center sending the Rover a planned path at various times over the course of the mission.
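For concreteness, the following sketch (not taken from the dissertation; the names, numbers, and structure are illustrative assumptions) shows how such an influence might be encoded as a discrete probability table over the times at which a nonlocal feature becomes true:

    # Illustrative sketch: an influence as a probability distribution over the
    # earliest time at which a nonlocal feature (e.g., "path to site A planned")
    # becomes true, plus the leftover probability that it is never set.
    influence_data_center = {
        "feature": "path_to_site_A_planned",
        "prob_set_at_time": {2: 0.3, 3: 0.5},  # Pr(feature first becomes true at t)
        "prob_never_set": 0.2,
    }

    def prob_set_by(influence, t):
        """Probability that the nonlocal feature has become true by time t."""
        return sum(p for step, p in influence["prob_set_at_time"].items() if step <= t)

    # Sanity check: the encoded probabilities form a distribution.
    assert abs(sum(influence_data_center["prob_set_at_time"].values())
               + influence_data_center["prob_never_set"] - 1.0) < 1e-9
    print(prob_set_by(influence_data_center, 3))  # 0.8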
Agents propose such influences to one another (as the labels of the incoming and outgoing arrows in Figure 1.2 indicate). Upon receiving the proposed influences Γ≠i from its peers, agent i folds Γ≠i into a local decision model (portrayed in Figure 1.2 as "Influence Modeling"), which I refer to as a best-response model, from which it can compute, among other things, optimal local policies in response to peers' proposed influences.

Constrained Local Policy Formulation. Agent i can also use the local best-response model to reason about its own influence Γi on its peers (given peer influences Γ≠i). Influence Γi provides an abstraction that conveys expectations about the effects of agent i's policy on i's peers. Hence, using the abstraction involves translating back and forth between its policy and influence representations. For instance, the agent may propose an influence by starting with a completely-specified local policy and computing the influence that the policy exerts on its peers. More importantly, given a proposed influence, the agent must implement a local policy that delivers on the expectations conveyed by that influence. This evokes the question: How can an agent enforce that its policy exerts a committed influence?

Prior approaches encourage the exertion of various forms of influence through reward shaping (Mataric, 1997; Musliner et al., 2006; Varakantham et al., 2009), which injects artificial rewards (or penalties) into the agent's local decision model to encourage (or discourage) desired behavioral outcomes. Although this method could be employed to bias the agent to fulfill its committed influences, fulfillment is not guaranteed, nor is the optimality of the agent's local policy (as I prove in Chapter 5). These results have motivated me to develop a new method for influence enforcement that uses a fundamentally different strategy. Instead of biasing individual decisions to push the agent into situations where it will interact as desired, the idea is to constrain the policy directly to enforce the committed influence.

When associating influences with policy constraints, issues of overconstrainedness arise. That is, a particular influence (or combination of influences) may not be feasible for the agent to exert by any local policy. Thus, it is vital that an agent be able to efficiently identify the feasible influences that it could exert. In addition to influence enforcement, the constrained policy formulation methodology that I develop in Chapter 5 also addresses the problems of checking feasibility and identifying feasible influences (without explicitly enumerating all individual policies).
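To illustrate the flavor of this constrained formulation (a minimal sketch in my own notation, assuming full observability and a single committed probability; Chapter 5's actual programs additionally handle deterministic policies and partial observability), the dual LP over occupancy measures x_t(s,a) can be augmented with a linear influence constraint:

    \begin{align*}
      \max_{x \ge 0} \quad & \sum_{t=0}^{T-1} \sum_{s,a} x_t(s,a)\, R_i(s,a) \\
      \text{s.t.} \quad & \sum_{a'} x_{t+1}(s',a') \;=\; \sum_{s,a} x_t(s,a)\, P(s' \mid s,a) \quad \forall s', t, \\
      & \sum_{a} x_0(s,a) \;=\; b_0(s) \quad \forall s, \\
      & \sum_{s \,:\, s \text{ exhibits the promised interaction effect at } t} \;\; \sum_{a} x_t(s,a) \;=\; \gamma_t .
    \end{align*}

The last constraint pins the probability of the committed interaction effect to the promised value γ_t; infeasibility of the program signals that no local policy can exert that influence.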
Influence Coordination. The previous two components involve decentralized computation by the individual agents in the system, wherein each agent reasons about its own decisions and its own influences (on peers) as a function of potential peer influences. Using these components, the agents can avoid explicit joint reasoning about detailed policy decisions. Instead, they need only jointly consider their influences on one another. The "Influence Coordination" block at the center of the diagram in Figure 1.2 forms the intersection of the individual agents' decision-making problems, addressing the following question: How can an agent team converge on a set of influences that yield an optimal joint policy?

At a high level, I have recast the problem of policy-space search as one of influence-space search. The motivation is that although a weakly-coupled agent may have a very large number of policies, depending on the relative portion of its policy decisions which do not affect its peers, the agent will have a proportionally smaller number of unique influences that it can exert on its peers. In contrast to prior distributed planning approaches that work directly with policies (Marecki et al., 2008; Nair et al., 2003; Varakantham et al., 2009), approaching the problem in this manner offers its own interesting challenges. For instance, the influence space (as defined more formally in Chapter 4) is a continuous space of probability vectors. Further, since one agent's influence value changes the feasible influences of another agent, the order in which the team reasons about different influences can have significant effects on the completeness and efficiency of the search algorithm.
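The following toy sketch conveys the shape of such a search (illustrative only: the candidate influence sets, the stub value function, and the agent names are stand-ins for the feasible-influence enumeration and constrained best-response computations developed in Chapters 5 and 6):

    # Illustrative depth-first search over a discretized joint influence space.
    from typing import Dict, List, Tuple

    CANDIDATE_INFLUENCES: Dict[str, List[float]] = {
        "satellite": [0.0, 0.5, 1.0],     # e.g., Pr(image delivered by its deadline)
        "data_center": [0.0, 0.25, 0.75],
    }

    def local_best_response_value(agent: str, influence: float,
                                  ancestors: Dict[str, float]) -> float:
        # Stand-in for the constrained local policy formulation: the optimal local
        # value attainable while exerting `influence`, given ancestors' influences.
        return influence * (1.0 + sum(ancestors.values())) - influence ** 2

    def dfs(agents: List[str], committed: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
        # Depth-first enumeration of joint influence settings, one agent at a time.
        if not agents:
            return 0.0, {}
        agent, rest = agents[0], agents[1:]
        best_value, best_assignment = float("-inf"), {}
        for gamma in CANDIDATE_INFLUENCES[agent]:
            local_value = local_best_response_value(agent, gamma, committed)
            committed[agent] = gamma
            sub_value, sub_assignment = dfs(rest, committed)
            del committed[agent]
            if local_value + sub_value > best_value:
                best_value = local_value + sub_value
                best_assignment = {agent: gamma, **sub_assignment}
        return best_value, best_assignment

    value, influences = dfs(list(CANDIDATE_INFLUENCES), {})
    print(value, influences)  # best joint value and the influence settings achieving it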
Moreover, results indicate 14 that for weakly-coupled agents, exploiting the structure I have identified can yield exponential speedups over solution methods for general flavors of transitiondependent Dec-POMDPs, not to mention scalability to teams of many more agents. These traits make my structural model a useful candidate for researchers to extend and for practitioners to adopt. ❼ Principled Framework for Nonlocal Abstraction This dissertation develops a general, principled framework for abstracting agents’ transition-dependent influences from their policies. The practical contribution of the framework is a novel influence model that compactly incorporates nonlocal information (abstracted from peer agents’ committed policies) into a local POMDP. The conceptual contribution is the idea that, by formally characterizing this nonlocal information, an influence space emerges that is often more efficient to search than the policy space, yet is still amenable to optimal solutions. Furthermore, in developing and evaluating influence-based solution algorithms, this dissertation sheds light on the impact of nonlocal abstraction, particularly as it relates to the degree of agent coupling and the efficiency of solution computation. Knowledge of the circumstances under which influence-based policy abstraction provides the greatest computational gains can inform researchers seeking to apply such techniques. ❼ Constrained Policy Formulation Methodology This work extends the research of others (D’Epenoux, 1963; Dolgov & Durfee, 2005; Kallenberg, 1983) in applying linear optimization to sequential decision making. The insight is that, since the probabilistic effects that influences encode are intrinsically represented in the MDP dual linear program, agents can compute policies that directly account for the influences that they exert on their peers. In Chapter 5, I develop several flavors of (mixed-integer) linear programs for constraining agents’ policies and exploring the space of possible influences. In contrast with prior approaches geared towards enforcing interacting behavior, this novel methodology enables an agent to (a) determine whether a desired influence is feasible, if so (b) compute the optimal local policy that is constrained to exert the influence, and (c) completely avoid any tuning of parameters associated with influence enforcement. More generally, the contribution to the agentbased optimization community is an arsenal of constrained policy formulation techniques that may be adapted and extended to solve other decision-making problems involving behavioral constraints. 15 ❼ Optimal Influence-Space Search Algorithms In Chapter 6, I develop and evaluate efficient algorithms that employ the abstraction models of influence from Chapter 4 and constrained policy formulation techniques from Chapter 5. The novelty of my algorithms is their use of influencebased policy abstraction to compute optimal solutions for a general class of transition-dependent problems. In addition to a depth-first search algorithm, I extend and apply a method from constraint optimization to exploit graphical structure in influences, which allows scaling of optimal solution computation to teams of more than a handful of weakly-coupled agents. These algorithms, by themselves, constitute a meaningful contribution to the Dec-POMDP community because they demonstrably advance the state of the art in efficiency and agent scalability for classes of commonly-studied transition-dependent problems. 
Additionally, this dissertation contributes an empirical evaluation of benefits and limitations of optimal influence-space search that may serve as a guide for researchers and developers so that they may make informed decisions about the suitability of influence-space search to the problems that they address.

❼ Flexible Influence Approximation
Lastly, this dissertation outlines several extensions of the influence-based abstraction methodology for coping with larger problems whose optimal solutions are intractable to compute. In Chapter 7, I develop three different techniques that agents may employ to trade solution quality for computational efficiency. The first technique approximates the space of influence probabilities, ignoring influences whose settings are close to those already considered. The second technique approximates the structure of agents’ influence encodings, thereby applying an extra layer of abstraction to reduce the number of parameters with which influences are conveyed. In particular, I develop an abstraction wherein agents represent their influences as single-parameter time commitments. The third technique searches the space of time commitments greedily rather than exhaustively for significantly faster convergence on approximate solutions. Although less systematic than some of my earlier analyses, my empirical evaluations of these techniques contribute evidence of the effectiveness of these variations of influence-based abstraction at reducing computation while still achieving near-optimal solutions (on average).

CHAPTER 2
Background

In order to gauge the scope and magnitude of this dissertation’s contributions, we must first consider the prior work that addresses related problems. Research in the field of automated decision making has grown far too vast to recount in its entirety. Instead, I review the most closely related paradigms, problem formalisms, and solution approaches. This chapter provides the reader with a brief survey of background work to better understand the foundation on which my models and methodologies are built. It motivates the remainder of this dissertation by noting the inadequacies of these prior approaches in relation to the desiderata that I outlined in Section 1.2.

2.1 Overview

I have divided the background material into single-agent and multiagent decision making research, reviewed in Sections 2.2 and 2.3 respectively. In Section 2.2, I begin by describing the Markov Decision Process (MDP) and its partially-observable extension (the POMDP). In addition to forming the basis for the multiagent models described in Section 2.3, single-agent (PO)MDPs are employed by my solution approach for the local portions of agents’ planning (referred to in Figure 1.2 as “constrained local policy formulation”). As such, I also give an overview of MDP and POMDP planning algorithms, digging into the details of those that I extend in later chapters, and review the computational complexities of optimal MDP and POMDP planning. I also give a brief survey of foundational single-agent work in decomposition and abstraction and its relationship to my methodology. In Section 2.3, I review the Decentralized POMDP (Dec-POMDP), a general extension of the POMDP for teams of cooperative agents.
After presenting the Dec-POMDP formalism, the computational complexity of optimal Dec-POMDP planning, and an overview of general-purpose planning algorithms, I focus the remainder of this chapter on work that exploits particular problem structure to improve efficiency and scalability. In this vein, I give an overview of structural restrictions that researchers have imposed, each of which has yielded planning algorithms with significant computational advantages over general-purpose algorithms, but whose scopes are limited to problems in specialized Dec-POMDP subclasses. In particular, I characterize those algorithms that exploit weakly-coupled¹ problem structure, wherein agents’ limited interactions engender an efficient decoupling of the centralized joint policy formulation into largely-decentralized local planning problems.

¹ I develop a more concrete definition of weak coupling later, in Section 3.5.

My review of Dec-POMDP algorithms and subclasses in Section 2.3 exposes a division of research thrusts in the field of Dec-POMDP planning, which I elaborate in Section 2.4. On one side, there is work that remains general, assuming no particular structure, so as to develop the most broadly-applicable Dec-POMDP theory and algorithms. On the other side, there are those approaches that intentionally avoid generality in favor of exploitability, each restricting consideration to a subclass that exhibits particular structure amenable to efficient and scalable algorithms. Few approaches are both generally applicable and efficient and scalable (subject to the degree to which exploitable structure is present). In particular, there are no quality-bounded algorithms for transition-dependent problems that have achieved scalability beyond three agents.

2.2 Single-Agent Sequential Decision Making

Let us begin by considering a single agent inhabiting an environment, which we call the world. Over the course of the agent’s lifetime, it encounters situations, which we call states, wherein it must decide what action to take. The outcome of each action an agent takes is a transition into a new state (according to the dynamics of the world), where the agent faces another decision. With each transition is associated an intrinsic value, called a reward, that depends on the agent’s state and action. As it makes decision after decision, the agent’s objective is to maximize some function of its rewards received. Such is the basic premise of sequential decision making, which encompasses a wide variety of problems in Artificial Intelligence (Littman, 1996).

In this dissertation, I adopt the Markov Decision Process (Bellman, 1957), along with its later-described extensions, as a general model for sequential decision making. MDP planning can be thought of as a generalization of classical planning (Fikes & Nilsson, 1971) (where the problem is finding a sequence of actions with deterministic transitions that lead to a goal state), adding transition uncertainty as well as reward. It can also be thought of as a generalization of (discrete-time) scheduling (Pinedo, 2008) (where the problem is timing an agent’s decisions). In the subsections that follow, I review the MDP formalism, an extension for partially-observable worlds, algorithms, complexity results, and advanced solution strategies rooted in the single-agent MDP literature that I use in my multiagent methodology.

[Figure 2.1: MDP state, action, transition, and reward dynamics.]
2.2.1 Markov Decision Processes

I begin with a very brief introduction to MDPs, citing just a few results from other authors’ more detailed treatments (Bellman, 1957; Kallenberg, 1983; Papadimitriou & Tsitsiklis, 1987; Puterman, 1994; Sutton & Barto, 1998). A single-agent MDP may be described by a 4-tuple ⟨S, A, P, R⟩ whose contents are as follows:

❼ S is a finite set of world states, called the state space, over which there is a probability distribution α that specifies the probability that the agent will start in any given state s^0 ∈ S.

❼ A is a finite set of actions, called the action space, such that for each state s, a specified subset A_s ⊆ A of actions are available for the agent to perform.

❼ The transition function P : S × A × S ↦ [0, 1] specifies the probability, denoted P(s^{t+1} | s^t, a^t), of the agent transitioning into state s^{t+1} given that it takes action a^t ∈ A in state s^t ∈ S.

❼ The reward function² R : S × A ↦ R defines a local reward, denoted r^t = R(s^t, a^t), ascribed to the action a^t taken in state s^t.

² In some other work, the reward function r^t depends upon the previous state s^{t−1}, action a^{t−1}, and resulting state s^t. This is simply a different convention than the one I present here, and both are equally expressive (one not any more or less general than the other).

In the above treatment, superscripts t and t + 1 denote any two successive decision steps, which are depicted graphically in Figure 2.1 as a two-stage Dynamic Bayesian Network (DBN) (Guestrin et al., 2003; Koller & Friedman, 2009), showing the dependencies among the components of the MDP model. We will assume that all of the functional components of the model are stationary (returning the same value regardless of the particular value of t). As depicted in the DBN, the agent’s next state s^{t+1} depends only upon its latest state s^t and latest action a^t. The actions the agent took or states that the agent encountered prior to arriving in state s^t cannot affect s^{t+1}. This conditional independence of the future from the past, conditioned on the present, is called the Markov property, and is what makes the MDP Markovian.

Example 2.1. Consider that a rover, shown in Figure 2.2, has three activities that it can pursue as it explores a newly-visited portion of the Martian surface. It can construct a map of the area using a lower-level mapping algorithm which, due to the unknown complexity of the terrain, it estimates will take 1, 2, or 3 hours with equal probability. It can also excavate, which involves drilling into the surface for approximately 1 hour and collecting various samples of rock and dirt. However, with a small probability (0.1), the excavation will be unsuccessful due to equipment malfunction or if the surface is too rocky. If successful, the rover can bring the excavated samples back to base for analysis, which will take approximately 2 hours. Borrowing from the TÆMS modeling language (Decker, 1996), Figure 2.2 describes these activities as tasks with probability distributions (“Pr”) over duration (“D”), and also over quality (“Q”), which measures the relative value of completing each activity. For instance, the map area task is valued twice as highly as the excavate task but may take the rover longer to complete. As described, the durations represent hours of execution. Additionally, each task has a window of feasibility.
For instance, the rover can successfully map the area for the next 4 hours, during which time the sunlight is ideal, and after which time the rover will automatically stop mapping whether or not the map is complete. The possible failure of a task is indicated with outcomes of quality 0. The rover’s completion of the excavate task with a positive quality (indicating that the rover has collected samples) enables the rover to perform its analyze samples task.

[Figure 2.2: A simple example of a planning problem faced by a Mars rover. Map Area: outcomes D ∈ {1, 2, 3} hours, each with Q = 2 and Pr = 1/3; window [0, 4]. Excavate: outcomes (D = 1, Q = 1, Pr = 0.9) and (D = 1, Q = 0, Pr = 0.1); window [0, 6]; enables Analyze Samples. Analyze Samples: outcome (D = 2, Q = 3, Pr = 1.0); window [0, 6].]

Once the rover starts one of its tasks, it cannot stop until the task finishes (either completing with positive quality or failing), and the rover cannot restart a task that has previously failed. The rover’s objective is to maximize its accumulation of qualities of the tasks it completes over the next 6 hours.

We can represent the rover’s decision problem using the MDP shown in Figure 2.3, whose states model the time and task status information. At the start, the agent has not started any of the three tasks and the time is 0. Each hour, the rover acts by either beginning one of its tasks, continuing a task, or idling. Figure 2.3 shows two steps of actions and transitions. In each state, the available actions correspond to those that are allowed given task statuses and window constraints. For instance, the rover cannot analyze samples until reaching a state in which the “Excavate” task has completed. Figure 2.3 also shows the terminal states, whose outgoing transitions are assigned positive rewards equal to the sum of all completed task qualities.

[Figure 2.3: The MDP for the rover in Example 2.1. States are factored as (Mapping, Excavation, Analysis, time), where each task-status feature takes a value in {N (not started), 0, 1, ... (started at the indicated time), C (completed with positive quality), F (failed)} and time ∈ {0, 1, ..., 6}. The action space is {I (idle: don’t execute any tasks for one time step), M (begin or continue to map area), E (excavate), A (begin or continue to analyze excavated samples)}. The start state distribution is α(NNN0) = 1.0; all unlabeled transition probabilities are 1.0; the reward r(s, a) is 0 for nonterminal states with time(s) < 6, and otherwise the sum of completed task outcome qualities.]

2.2.1.1 Values and Policies

In addition to the reward function R(), which specifies the immediate reward assigned to a particular state s^t and action a^t, we can also consider the long-term value, or expected utility, of taking a^t in s^t.
Expected utility in an MDP is typically defined, with function U*(s^t, a^t), as the maximal expected discounted reward, written recursively as:

U*(s^t, a^t) = R(s^t, a^t) + γ · Σ_{s^{t+1} ∈ S} [ P(s^{t+1} | s^t, a^t) · max_{a^{t+1} ∈ A} U*(s^{t+1}, a^{t+1}) ]    (2.1)

where the first term represents immediate reward, γ ∈ [0, 1] denotes the discount factor, and the summation computes the expected future reward given that the best action is chosen in every subsequent state. Equation 2.1, which is commonly referred to as the Bellman equation (Russell et al., 1996), presents the most general notion of expected utility, where future rewards are discounted over a potentially infinite-length sequence of decisions. In this dissertation, I focus on finite-horizon problems, wherein the agent’s objective is to maximize its accumulation of rewards within finite mission deadlines. In this case, rewards are undiscounted (γ = 1).

An agent’s behavior is prescribed by an MDP policy π, which encodes how the agent should behave in each world state. In general, an MDP agent may use a randomized policy, which maps each state to a probability distribution over randomly-selected actions (π : S × A ↦ [0, 1]). However, in this dissertation, I assume that each agent adopts a deterministic policy π : S ↦ A, selecting a single unique action, denoted a^t = π(s^t), for each state. For any MDP problem, there exists at least one deterministic policy that is just as good as any stochastic policy (with respect to the value function V() I define below), so there is no loss in solution quality associated with restricting attention to deterministic policies (Puterman, 1994).

When following a (deterministic) policy π, an agent’s expected utility of entering a given state s^t is defined (using the notation from Equation 2.1) as:

U_π(s^t) = R(s^t, π(s^t)) + γ · Σ_{s^{t+1} ∈ S} P(s^{t+1} | s^t, π(s^t)) · U_π(s^{t+1})    (2.2)

Overall, the value V(π) of a policy π is the expected utility of following π from the probabilistic distribution over initial states α specified in the MDP description:

V(π) = Σ_{s^0 ∈ S} α(s^0) · U_π(s^0)    (2.3)

where α(s^0) is the probability of starting in state s^0.
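To make the finite-horizon, undiscounted (γ = 1) case concrete, the following is a minimal sketch (not drawn from this dissertation’s implementation) of backward induction on Equation 2.1. It assumes the model is given as plain dictionaries — A[s] lists the actions available in s, P[s][a] maps next states to probabilities, and R[s][a] gives the immediate reward — and these names are purely illustrative.

```python
# Minimal sketch (not from the dissertation): finite-horizon backward induction
# on Equation 2.1 with gamma = 1.  P[s][a] is a dict {s_next: prob}, R[s][a] a float,
# and A[s] the actions available in s; all names here are illustrative assumptions.

def solve_finite_horizon_mdp(S, A, P, R, T):
    """Return U[t][s] (optimal value-to-go) and pi[t][s] (greedy action) for t = 0..T-1."""
    U = {T: {s: 0.0 for s in S}}          # no reward accrues after the horizon
    pi = {}
    for t in reversed(range(T)):          # backward induction over decision steps
        U[t], pi[t] = {}, {}
        for s in S:
            best_a, best_val = None, float("-inf")
            for a in A[s]:
                # Bellman backup: immediate reward plus expected value-to-go
                val = R[s][a] + sum(p * U[t + 1][s2] for s2, p in P[s][a].items())
                if val > best_val:
                    best_a, best_val = a, val
            U[t][s], pi[t][s] = best_val, best_a
    return U, pi
```

The result is a time-indexed policy; equivalently, the decision step can be folded into the state itself, as the rover MDP of Figure 2.3 does with its time feature.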
MDP planning is the problem of, given a complete description of the MDP, finding the optimal policy π*, whose value is greatest:

π* = argmax_{π ∈ Π} V(π)    (2.4)

In Equation 2.4, Π denotes the agent’s policy space. The optimal policy π* is also referred to as the solution to the MDP. Similarly, solving an MDP refers to the process of computing π*.

2.2.1.2 Solution Algorithms

Aside from simple enumeration of the policy space (as implied by the argmax in Equation 2.4), a variety of more efficient solution methods are commonly used. For instance, policy iteration and value iteration apply variations of the Bellman equation (Eq. 2.1) to iteratively converge on an optimal policy and an accurate optimal value function, respectively (Russell et al., 1996). In this dissertation, I make use of a Linear Programming (LP) approach (D’Epenoux, 1963; Kallenberg, 1983), which frames an MDP planning problem as the following linear optimization problem:

max_x  Σ_{s ∈ S} Σ_{a ∈ A} x(s, a) · R(s, a)
subject to  ∀ s^{t+1} ∈ S:  Σ_{a^{t+1} ∈ A} x(s^{t+1}, a^{t+1}) − γ · Σ_{s^t ∈ S} Σ_{a^t ∈ A} x(s^t, a^t) · P(s^{t+1} | s^t, a^t) = α(s^{t+1})
∀ s ∈ S, ∀ a ∈ A:  x(s, a) ≥ 0    (2.5)

where the vector x of variables {x(s, a), ∀s ∈ S, ∀a ∈ A}, often called the occupation measures, denotes the total expected discounted number of times action a is performed in state s. Upon solving this LP, we can straightforwardly compute the optimal policy π* from the computed optimal occupation measures:

π*(s, a) = x*(s, a) / Σ_{a′ ∈ A} x*(s, a′)    (2.6)

If the LP algorithm simplex is used to solve the LP in Equation 2.5, it is guaranteed to return a solution x* corresponding to a deterministic policy as long as the components of α are nonzero (Dolgov & Durfee, 2006). For MDPs wherein α contains zeros, and for use of the MDP solution methodology with other LP algorithms, additional constraints can be introduced into Equation 2.5 to guarantee that a deterministic policy is returned (as I develop in Chapter 5). The value V(π*) of the optimal policy found by solving the LP from Equation 2.5 is simply the value of the objective function, which is equal to the dot product of the occupation measures x* and the vector of rewards specified by the MDP reward function R():

V(π*) = Σ_{s ∈ S} Σ_{a ∈ A} x*(s, a) · R(s, a)    (2.7)
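As a companion to Equations 2.5–2.7, here is a minimal sketch of the LP formulation for a discounted MDP using NumPy arrays and SciPy’s linprog solver (an assumed external dependency); it is illustrative only and is not the constrained formulation developed in Chapter 5.

```python
# Minimal sketch (assumed SciPy dependency; illustrative, not the dissertation's code):
# solve the dual LP of Equation 2.5 and recover a policy via Equation 2.6.
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, alpha, gamma=0.95):
    """P: |S|x|A|x|S| transition array, R: |S|x|A| rewards, alpha: |S| start distribution."""
    nS, nA, _ = P.shape
    c = -R.reshape(nS * nA)                       # linprog minimizes, so negate the objective
    A_eq = np.zeros((nS, nS * nA))                # one flow-conservation constraint per state
    for s_next in range(nS):
        for s in range(nS):
            for a in range(nA):
                col = s * nA + a
                A_eq[s_next, col] -= gamma * P[s, a, s_next]
                if s == s_next:
                    A_eq[s_next, col] += 1.0      # the sum_a x(s', a) term of Eq. 2.5
    res = linprog(c, A_eq=A_eq, b_eq=alpha, bounds=(0, None))
    x = res.x.reshape(nS, nA)                     # occupation measures
    policy = x.argmax(axis=1)                     # Eq. 2.6: a vertex solution concentrates x(s, .) on one action
    value = float(x.flatten() @ R.flatten())      # Eq. 2.7
    return policy, x, value
```

The argmax over occupation measures mirrors Equation 2.6 when the solver returns a basic (vertex) solution, which is the deterministic-policy case discussed above.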
2.2.2 Partially-Observable MDPs

The MDP model reviewed above assumes the agent is able to sense its true state of the world upon taking an action. For instance, the rover from Example 2.1 knows whether or not its excavation has succeeded or failed. The partially-observable Markov Decision Process (POMDP) relaxes this assumption, instead dictating that the agent receives observations that are perhaps distinct from the true world state.

Example 2.2. Consider that the rover from Example 2.1 models its excavation activity using a lower-level decision process, in order to decide how many holes to drill and what samples to keep. In particular, the rover is looking for white-colored rocks (which indicate a desirable chemical composition). There may or may not be any white rocks in the ground at the dig site. Let us use a boolean feature, white-rocks-exist, to represent the existence of white rocks at the dig site. In the rover’s excavation, white-rocks-exist is an important, but partially-observable, state feature. That is, the rover cannot see white rocks buried below the surface until after digging them out. Moreover, if the rover drills a hole and does not see any white rocks, that does not mean that white-rocks-exist=false, since there may be white rocks buried just a few inches away. Instead, the rover receives an observation, no-white-rocks-in-hole, that is correlated with the state white-rocks-exist=false, but not necessarily equal to the true state value. Partial observability makes the rover’s decision of whether or not to drill a second hole, and then a third hole, and then a fourth hole, nontrivial. For instance, the rover may be better off spending its time collecting other-colored samples from the first hole than digging additional holes, depending on the relative values of the various rock samples.

Formally, a single-agent finite-horizon POMDP may be described by the tuple ⟨S, A, P, R, Ω, O, T⟩, whose contents are as follows:

❼ S is the state space, A is the action space, P() is the transition function, and R() is the reward function, exactly as they were defined in the fully-observable MDP (Sec 2.2.1).

❼ Ω is a finite set of observations, such that the agent receives an observation o ∈ Ω with every transition that it makes.

❼ O : A × S × Ω ↦ R is the observation function, specifying the probability, denoted O(o^{t+1} | a^t, s^{t+1}), that the agent receives observation o^{t+1} ∈ Ω after taking action a^t ∈ A and arriving in state s^{t+1} ∈ S.

❼ T ∈ N is the finite time horizon of execution, specifying that the agent faces decisions at (discrete) time steps ⟨0, 1, . . . , T⟩.

Figure 2.4 shows a DBN depicting the graphical relationships among the POMDP variables, indicating that observations depend solely upon current state and latest action. Notice that the POMDP specification above has also extended the MDP specification with a time horizon T component (in addition to the observation-related components). Without T, we would have an infinite-horizon decision problem. Although the infinite-horizon POMDP is well defined, researchers have found little use for it due to its undecidability (as I describe more formally in Section 2.2.3). Similarly, in this dissertation, I restrict consideration to finite-horizon problems.

[Figure 2.4: POMDP state, action, transition, observation, and reward dynamics.]

2.2.2.1 Histories of Observations

Although the POMDP state transition dynamics are Markovian, as clearly depicted in Figure 2.4, a POMDP agent does not necessarily know its true state at any given decision step, and so it must rely on present and past observations to make the most informed decisions. It can no longer safely forget the past (as it could in the case of an MDP). If it did, it would be throwing away potentially useful information about the present (with which to disambiguate the present state).

Example 2.2 (continued). Assume that the rover drills one hole and observes no white rocks, then drills another, again observing no white rocks, then a third, still observing no white rocks. Should the rover dig again? If it bases its decision on only the latest observation, forgetting all of its past failures to find white rocks, it is likely to keep on trying and keep on failing. However, by considering all observed evidence, ⟨no-white-rocks-in-hole, no-white-rocks-in-hole, no-white-rocks-in-hole⟩, it can make a better-informed decision about whether or not to try again or to pursue a different-colored rock.

The information that a POMDP agent collects about the world from time steps 0 to t is captured by its history of observations ō^t = ⟨o^1, ..., o^t⟩ ∈ (Ω)^t.
Since, in general, an agent’s optimal decisions may depend on past observations, a (deterministic) POMDP policy π : (Ω)^T ↦ A is a mapping of complete observation history³ to action, prescribing an action a^t = π(ō^t) for every possible sequence of observations ō^t. The difficulty of such a representation is that it grows with every additional decision step, making a POMDP agent’s policy space exponential in the time horizon T.

³ In the case that an agent is following a randomized policy, it must also base its decisions on its history of actions. However, in this dissertation, I assume that the agent’s policy is deterministic, implying that the agent can always recover its action history from its observation history.

2.2.2.2 Belief State

Researchers have developed a useful strategy for combating the exponentially-growing policy representation, which I briefly review now. It turns out that an agent can forget past observations as long as it maintains a belief state that encodes sufficient information about past observations and actions to make the best possible predictions about future transitions. In a completely-observable single-agent MDP, the current world state constitutes a sufficient belief state. In a POMDP, the probability distribution over all possible current world states is sufficient (Smallwood & Sondik, 1973). Thus, the POMDP belief state is a vector b, containing a component for each world state:

b^t(s^t) = Pr(s^t | ā^{t−1}, ō^t),  ∀ s^t ∈ S    (2.8)

At the start of execution, before an agent takes a single action or receives a single observation, the initial belief state is equal to the probability distribution over initial world states, b^0 = α (where α is the probability distribution over start states specified in the POMDP description from Section 2.2.1). As the agent takes actions and receives observations, it updates each component of its belief state using the following belief-state estimator:

b^{t+1}(s^{t+1}) = [ Pr(o^{t+1} | a^t, s^{t+1}) · Σ_{s^t} Pr(s^{t+1} | s^t, a^t) · b^t(s^t) ] / (a normalizing factor)    (2.9)

where b^t(s^t) is a component from the latest belief state, as derived by Smallwood & Sondik (1973). Although the POMDP belief state representation b is still larger than the MDP state in that it requires a probability value for every world state, it takes only constant space to maintain as the agent makes more and more decisions. In contrast, the history of observations grows with each new decision.

Example 2.2 (continued). Smallwood & Sondik’s theory dictates that the rover can forget past observations as long as it maintains a value of b^t = Pr(white-rocks-exist^t = true). Let O(no-white-rocks-in-hole | white-rocks-exist = true) = 0.5, indicating that if there exist white rocks at the dig site, the rover is just as likely not to find them in a given hole as it is to find them. Let the initial state distribution (i.e., the prior probability of white rocks existing) be α = 0.5.
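The following is a minimal sketch of the belief-state estimator (Equation 2.9) for this two-valued feature, under the added assumptions that drilling does not change white-rocks-exist and that a hole at a site without white rocks never reveals white rocks (i.e., O(no-white-rocks-in-hole | white-rocks-exist = false) = 1.0); it reproduces the belief values listed in Table 2.1 below.

```python
# Minimal sketch of the belief update (Eq. 2.9) for the two-valued feature
# white-rocks-exist, assuming drilling does not change the feature and that
# O(no-white-rocks-in-hole | white-rocks-exist=false) = 1.0.  Illustrative only.

P_NO_WHITE_GIVEN_EXIST = 0.5      # O(no-white-rocks-in-hole | white-rocks-exist=true)
P_NO_WHITE_GIVEN_NONE  = 1.0      # O(no-white-rocks-in-hole | white-rocks-exist=false)

def update_belief(b, saw_white):
    """b = Pr(white-rocks-exist = true); returns the posterior after one drilled hole."""
    if saw_white:
        return 1.0                              # white rocks observed: they certainly exist
    numerator = P_NO_WHITE_GIVEN_EXIST * b      # Pr(obs | s') * Pr(s') for s' = true
    normalizer = numerator + P_NO_WHITE_GIVEN_NONE * (1.0 - b)
    return numerator / normalizer

b = 0.5                                         # prior alpha from Example 2.2
for step in range(3):
    b = update_belief(b, saw_white=False)
    print(step + 1, round(b, 4))                # prints 1/3, then 0.2, then 1/9
```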
Table 2.1 below shows the world state, actions, and observations of the previously-described execution trace, along with the rover’s estimated belief state.

step t | true state s^t            | observation o^t | belief state b^t = Pr(white-rocks-exist = true) | action a^t
0      | white-rocks-exist = false | —               | 0.5                                              | drill-hole
1      | white-rocks-exist = false | no-white-rocks  | 1/3                                              | drill-hole
2      | white-rocks-exist = false | no-white-rocks  | 0.2                                              | drill-hole
3      | white-rocks-exist = false | no-white-rocks  | 1/9                                              | drill-hole

Table 2.1: A sample execution trace for the rover in Example 2.2.

Another benefit of the POMDP belief state is that its dynamics are Markovian. Moreover, the space of all reachable POMDP belief states and their transition dynamics (which are simply derived from Equation 2.9) together define a belief-state MDP, whose solution is equivalent to that of the POMDP. In effect, the belief state representation reduces the POMDP to a complicated but normal MDP. As such, a common approach for solving a POMDP is to work entirely in the belief-state space, thereby solving the equivalent belief-state MDP (Cassandra et al., 1996; Kaelbling et al., 1998; Littman et al., 1995a). In a later chapter (Section 4.2), I will derive a more complicated belief state representation that incorporates information about other agents in a multiagent system, but that is based upon the same principles of sufficiency and MDP reducibility.

2.2.3 Complexity of Single-Agent Planning

I now briefly review some complexity results and their implications for the computation required by MDP and POMDP solution algorithms, foregoing the foundational background of complexity theory such as Turing machines and complexity classes (but I refer the reader to a textbook on complexity theory (e.g., Papadimitriou, 1994) for a deeper understanding of the results in this section).

Completely-observable single-agent MDPs have been proven to be polynomial, implying that there exist algorithms for which, in the worst case, the time and space taken to compute an optimal MDP policy is a polynomial function of the size of the MDP problem description (Littman et al., 1995b; Papadimitriou & Tsitsiklis, 1987). Formally, the size of the problem description is the amount of space required to store the complete model specification (i.e., the tuple ⟨S, A, P, R⟩ in the case of the MDP, and ⟨S, A, P, R, Ω, O, T⟩ in the case of the POMDP). If the action space is significantly smaller than the state space (‖A‖ ≪ ‖S‖), then the MDP can be solved in time and space polynomial in ‖S‖. In particular, the LP methodology that I reviewed in Section 2.2.1.2 admits polynomial-time solutions.⁴

⁴ Despite this result, polynomial-time LP algorithms are rarely used to solve MDPs. Rather, the worst-case-exponential simplex algorithm has been shown empirically to yield better average-case performance, as detailed by Littman et al. (1995b).

Not surprisingly, theoretical results suggest that, in general, POMDPs are harder to solve than MDPs. Papadimitriou & Tsitsiklis (1987) have proven the finite-horizon POMDP to be in a higher complexity class, PSPACE, for which there are believed to be at best exponential-time (and polynomial-space) solution algorithms (Allen, 2009; Papadimitriou, 1994). Consequently, under the assumption that ‖A‖ ≪ ‖S‖, ‖Ω‖ ≪ ‖S‖, and T ≪ ‖S‖, it is believed that the worst-case time of computing an optimal (finite-horizon) POMDP solution is exponential, denoted EXP(‖S‖), in the size of the state space ‖S‖.
Lusena et al. (2001) have proven an even stronger result, indicating that the time required to compute ε-approximately optimal (finite-horizon) POMDP policies (whose values are within ε of the optimal value) is also exponential. In the case of infinite-horizon POMDPs (wherein the time horizon T is unbounded), the problem of determining whether or not a given policy is optimal is undecidable (Madani et al., 1999). The implication is that no general technique exists for computing an optimal policy to an infinite-horizon POMDP.

2.2.4 Decomposition and Abstraction

Although this thesis is concerned with coordination in systems of multiple agents, some of its central themes are rooted in single-agent research. In particular, decomposition and abstraction techniques have proven to be effective for improving the efficiency of single-agent planning and reasoning. I now briefly review these concepts and provide citations to pioneering work in decomposing and abstracting single-agent sequential decision making.

Decomposition breaks one large problem into smaller, more manageable problems. For example, Singh & Cohn (1998) study MDP models composed of concurrent subprocesses with interdependent actions that can be solved in parallel and merged to construct optimal global solutions. Meuleau et al. (1998) develop a method for decomposing very large MDPs into independent subprocesses coupled by resource constraints. Both of these works compute solutions efficiently by exploiting factored structure. That is, they isolate subsets of actions and portions of the world state that may be treated independently of one another. There has since been a lot of work in developing efficient solution algorithms for factored MDPs (Boutilier et al., 1999a; Guestrin et al., 2003; Kearns & Koller, 1999; Poupart et al., 2002). In the multiagent methodology that I present in this dissertation, I too take advantage of factored structure so as to decouple each agent’s local decision model from the joint decision model.

Researchers have also reduced single-agent MDP complexity by solving smaller (often approximate) models with knowledge or action representations abstracted from the original models. For example, Dean (along with others) explores reduction of large state and action spaces through heuristic prioritization (Boutilier et al., 1997; Dean & Lin, 1995) and aggregation (Dean & Givan, 1997; Dean et al., 1998; Dearden & Boutilier, 1997). There is also a large body of literature on hierarchical representations that treat individual actions and states as abstractions of sequences of primitive actions and state transitions (Barto & Mahadevan, 2003; Jonsson & Barto, 2005; Osentoski & Mahadevan, 2007; Sutton et al., 1999). I take a similar approach, abstracting expected (nonlocal) transition sequences of an agent’s peers as local transitions in its local model.

2.3 Multiagent Coordination

Propelled by the momentum gained and results achieved in single-agent sequential decision making, researchers have developed a variety of multiagent extensions to the MDP and POMDP models. Some extensions, such as the Partially-Observable Stochastic Game (POSG) studied by Hansen et al. (2004), and Gmytrasiewicz & Doshi’s Interactive POMDP (I-POMDP), represent agents as having their own objectives and intentions, making these models appropriate for systems of self-interested (non-cooperative) agents. In this dissertation, I restrict consideration to teams of cooperative agents who share a common objective.
As I cite in Section 2.3.1.1, the problem of computing optimal behavior for cooperative agents under transition and observation uncertainty is extremely challenging in and of itself. Here I review the Decentralized POMDP (Dec-POMDP)⁵ as studied by Bernstein et al. (2000), which has emerged as the most popular and the most general POMDP extension for cooperative agents. Before delving into the details of the Dec-POMDP, I now give brief mention of some other general models and their relationships to the Dec-POMDP.

⁵ Here and throughout this dissertation, I maintain the convention of abbreviating “Decentralized–” with “Dec-” (using a lowercase e and c). Note that I am referring to the same models that appear in some other work abbreviated as “DEC–” (e.g. “DEC-POMDP”).

Another name for the Dec-POMDP is the Partially Observable Identical Payoff Stochastic Game (POIPSG), which was introduced by Peshkin et al. (2000) in the same year as the Dec-POMDP’s inception. Additionally, the Multiagent Team Decision Problem (MTDP) (Pynadath & Tambe, 2002) has been proven to be equivalent to the Dec-POMDP (Seuken & Zilberstein, 2008), and differs only in its representation of policies⁶. Researchers have also developed extensions to the Dec-POMDP and the MTDP models, called the Dec-POMDP-Com (Goldman & Zilberstein, 2003) and MTDP-Com (Pynadath & Tambe, 2002), that represent communication among agents distinctly from agents’ actions and observations. Both of these extensions have been shown to be no more general (in representational power) than the Dec-POMDP (Seuken & Zilberstein, 2008).

⁶ The MTDP encodes policies as mappings from belief state to action rather than observation history to action.

After reviewing the general Dec-POMDP formalism, general Dec-POMDP algorithms, and Dec-POMDP complexity, I describe a variety of other more restrictive models that may be considered as Dec-POMDP subclasses. These include, among many others, the Multi-agent Markov Decision Process (MMDP) as described by Boutilier (1996), and the Dec-MDP as described by Bernstein et al. (2002), both of which impose restrictions on agents’ observations. I go on (in Sections 2.3.2–2.3.3) to characterize these models by their restrictions as well as the problem structure that their respective solution algorithms are designed to exploit. For other characterizations of Dec-POMDP models, subclasses, and algorithms, I refer the reader to the treatments of Seuken & Zilberstein (2008), Allen (2009), and Oliehoek (2010).

2.3.1 Decentralized POMDPs

The qualifying prefix “Dec-” in Dec-POMDP refers to a decentralization both of control and of observation of an underlying POMDP. Instead of one agent taking actions and receiving observations, we now have a team of agents, each of which independently takes its own action and receives its own observation at every time step. Figure 2.5 shows a DBN, whose graphical structure illustrates the conditional independencies among Dec-POMDP variables. The formal details of the Dec-POMDP model are as follows.

[Figure 2.5: DBN describing relationships among Dec-POMDP variables.]

Definition 2.3. A Dec-POMDP is specified by tuple ⟨N, S, A, P, R, Ω, O, T⟩, where

❼ N is a team of n agents,

❼ S is the state space, a finite set containing all world states, with distinguished initial state⁷ s^0,
❼ A = A_1 × · · · × A_i × · · · × A_n is the joint action space, wherein component A_i refers to the finite set of local actions available to agent i,

❼ P : S × A ↦ [0, 1] is the transition function, specifying the probability P(s^{t+1} | s^t, a) that the agents will transition into world state s^{t+1} ∈ S given that the agents performed joint action a = ⟨a_1, . . . , a_i, . . . , a_n⟩ ∈ A in state s^t ∈ S,

❼ Ω = ×_{i∈N} Ω_i is a finite set of joint observations, such that each agent i observes an observation o_i ∈ Ω_i with every transition that it makes,

❼ O : A × S × Ω ↦ R is the observation function, specifying the probability, denoted O(o^{t+1} | a^t, s^{t+1}), that the agents receive joint observation o^{t+1} = ⟨o_1^{t+1}, . . . , o_i^{t+1}, . . . , o_n^{t+1}⟩ ∈ Ω after taking actions a^t = ⟨a_1^t, . . . , a_i^t, . . . , a_n^t⟩ ∈ A and arriving in state s^{t+1} ∈ S,

❼ R : S × A ↦ R is the reward function, specifying the team reward, denoted r^t = R(s^t, a^t), ascribed to the joint action a^t ∈ A taken in state s^t ∈ S, and

❼ T ∈ N is the finite time horizon, specifying that the agents will face decisions at (discrete) time steps ⟨0, 1, ..., T⟩.

⁷ The conventional definition of the Dec-POMDP (Bernstein et al., 2002), for simplicity of exposition, specifies a unique start state instead of a distribution over start states (α). Note, however, that this does not restrict the representational power of the model.

Interactions among agents are manifested in the Dec-POMDP’s transition, observation, and reward functions. The world state transition depends upon combinations of agents’ actions. Similarly, an agent’s observation (which is separate from the observation given to other agents) may depend on its own action, other agents’ actions, and the new world state. A single team reward captures the immediate value of a joint action and resulting world state. Note that, just as in the single-agent MDP and POMDP, the reward is not explicitly observed, but simply provides a concrete specification by which to evaluate outcomes and policies.

Whereas in the single-agent POMDP the term partial observability referred to the agent’s observations as distinct from the world state, the term takes on a richer meaning in the Dec-POMDP. Here, an agent’s observation gives it a partial view of the world state as well as a partial view of the other agents’ actions. Further, it may be the case that one agent completely observes part of the world state (where by part I mean either some state features’ values or some regions of the state space) while another agent completely observes another part of the world state. In this sense, partial observability may also refer to the agents’ differing views of their shared world.

Just as in the single-agent POMDP case, each Dec-POMDP agent i bases its decision at time step t on its local observation history, denoted ō_i^t = ⟨o_i^1, ..., o_i^t⟩ ∈ (Ω_i)^t.

Definition 2.4. A local policy π_i : (Ω_i)^T ↦ A_i for agent i deterministically⁸ specifies an action a_i^t ∈ A_i that i will perform for each observation history ō_i^t.

⁸ As described in Section 2.2.1.1, I restrict consideration in this dissertation to deterministic policies. Here and throughout, I will use the term local policy to mean an individual agent’s deterministic local policy.

The objective of a set of Dec-POMDP agents is, as in the single-agent case, to maximize the value function. In this dissertation (and in most finite-horizon Dec-POMDP planning), the value refers to the expected cumulative reward E[Σ_{t=0}^{T} R(s^t, a^t)]. In this case, the value function is dependent upon all agents’ actions, as conveyed by the joint policy.

Definition 2.5. A joint policy π = ⟨π_1, ..., π_n⟩ is a vector of all agents’ local policies.
Definition 2.6. The value of a joint policy π is the expectation of the summation of rewards received by following π:

V(π) = E[ Σ_{t=0}^{T} R(s^t, a^t) | π ]    (2.10)
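Because Equation 2.10 is an expectation over joint executions, it can be estimated by simulation whenever generative access to the model is available. The sketch below is illustrative only: step(s, joint_a, rng) is an assumed sampler returning the next state, joint observation, and team reward, and policies[i] is an assumed function mapping agent i’s local observation history to an action.

```python
# Minimal sketch (illustrative assumptions): estimate V(pi) of Equation 2.10 by
# simulation, given generative access to the Dec-POMDP.  `step(s, joint_a, rng)` is an
# assumed sampler returning (next_state, joint_obs, reward); `policies[i](history_i)`
# returns agent i's action for its local observation history.
import random

def estimate_joint_policy_value(s0, step, policies, T, episodes=1000, rng=random.Random(0)):
    total = 0.0
    for _ in range(episodes):
        s = s0
        histories = [tuple() for _ in policies]          # one observation history per agent
        for t in range(T + 1):                           # decisions at steps 0, 1, ..., T
            joint_a = tuple(pi(h) for pi, h in zip(policies, histories))
            s, joint_o, r = step(s, joint_a, rng)        # sample one Dec-POMDP transition
            total += r
            histories = [h + (o,) for h, o in zip(histories, joint_o)]
    return total / episodes
```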
Although Dec-POMDP agents’ actions and observations are decentralized, the Dec-POMDP planning process need not be decentralized. In fact, at the present time, the vast majority of Dec-POMDP solution algorithms compute agents’ policies centrally, either with a single computational process or by allowing arbitrary exchange of information between agents during the planning process. It is not until agents go off and execute their planned policies that the problem becomes “decentralized”. During execution, agents do not explicitly share their observations nor communicate their actions.⁹ Thus, even if the agents’ decisions are all planned together, this does not guarantee that agents will necessarily execute these decisions in a synchronized, well-coordinated manner because they cannot be certain about the other agents’ views. The fact that the agents’ runtime awareness is disjoint but not independent makes the problem of optimal Dec-POMDP policy computation extremely challenging.

⁹ Implicit communication may, however, be manifested by the Dec-POMDP’s actions, observations, and transitions. Alternatively, there are extensions to the Dec-POMDP framework that augment the problem description with special communicative actions. I address these topics in a later chapter (Section 3.4.2).

2.3.1.1 Complexity

While it comes as little surprise that planning for teams of agents is harder than planning for individual agents, it turns out that Dec-POMDP planning is in a whole different complexity class from that of POMDP planning (reviewed in Section 2.2.3). Bernstein et al. (2002) have proven the finite-horizon Dec-POMDP to be in complexity class NEXP, which is believed (though not proven) to be strictly harder than NP, and is widely considered intractable (Papadimitriou, 1994). Further, the NEXP-completeness holds for Dec-POMDPs with as few as two agents (Bernstein et al., 2002). This suggests¹⁰ that the computation time of an optimal joint policy (by any algorithm) for a team of two Dec-POMDP agents is, in the worst case, doubly-exponential in the size of the Dec-POMDP problem description.

¹⁰ Formally, the double exponentiality of optimal Dec-POMDP planning is contingent upon the assertion that NEXP ≠ EXP, which has yet to be proven.

2.3.1.2 General Solution Methods

Despite the daunting complexity of these models, several optimal solution approaches have been developed for the general class of finite-horizon Dec-POMDPs. For instance, Bernstein et al. (2009) show that policy iteration using stochastic, correlated joint controllers converges on the optimal Dec-POMDP solution. Other optimal approaches include extensions of dynamic programming (Hansen et al., 2004) and A* heuristic search (Szer et al., 2005). Not surprisingly, none of the three optimal methods have been shown to scale beyond small 2-agent problems.

Approximate solution methods for the general class of Dec-POMDPs are more abundant. Oliehoek et al. (2008a) extend Szer et al.’s multiagent A* search to efficiently-computable approximate value functions. Seuken & Zilberstein (2007b) introduce heuristics into Hansen et al.’s optimal dynamic programming algorithm to reduce the number of joint policies considered. Nair et al. (2003) develop a policy-space search method, JESP (which I describe later on in Sec. 2.3.3), that converges upon a joint policy that is a Nash equilibrium but is not guaranteed to be optimal. Similarly, researchers have developed methods that search an approximate space by modeling policies with fixed-size local controllers. For instance, Bernstein et al. (2005) perform policy iteration on local stochastic finite-state controllers along with an additional shared controller that correlates the stochastic actions of the agents. Alternatively, Amato et al. (2007) optimize fixed-size local controllers using non-linear programming. Kumar & Zilberstein (2009) extend point-based methods (Pineau et al., 2006; Spaan & Vlassis, 2005) to approximate the Dec-POMDP belief-state space. Although these general-purpose approximate algorithms have enabled researchers to tractably solve problems with larger state and action spaces and longer time horizons than had the optimal algorithms, they have not been shown to scale to problems with more than two agents.

2.3.2 Structural Restrictions and Subclasses

With an eye towards avoiding the NEXP complexity of the general Dec-POMDP problem class, researchers have identified a variety of Dec-POMDP subclasses that are amenable to efficient, scalable solution methods, but that impose various restrictions on problem structure. Here, I survey the most common structural restrictions along with their associated Dec-POMDP subclasses.

2.3.2.1 Joint Observability

Intuitively, the difficulty of optimal coordination in Dec-POMDPs is due, in part, to agents’ differing observations of their shared environment. The Multiagent MDP (MMDP) (Boutilier, 1996) sidesteps this problem by assuming that all agents completely observe the world state, effectively reducing the problem to a single-agent MDP with a joint action, whose computational complexity is just polynomial in the size of the problem description. An analogous assumption that agents receive the same partial observations reduces the Multiagent POMDP to a single-agent POMDP with a joint action (Messias et al., 2010). A less restrictive assumption is that the agents’ observations together fully determine the world state. Bernstein et al. (2002) formalize this assumption as joint observability (whose definition I restate below), calling the resulting Dec-POMDP subclass the Dec-MDP.

Definition 2.7. A Dec-POMDP is jointly observable if there exists a mapping J : Ω ↦ S such that whenever O(⟨o_1^t, ..., o_n^t⟩ | a^{t−1}, s^t) > 0, then J(⟨o_1^t, ..., o_n^t⟩) = s^t.

Although joint observability dictates that agents’ observations jointly determine the world state, it does not imply that any one agent will ever be aware of the true world state (since the agents are not assumed to share their observations during execution). Moreover, an agent’s individual observation may not even be enough to establish awareness of a portion of the true world state. As such, only in combination with other structural restrictions (which I review in the subsections that follow) has joint observability led to computationally-efficient solution methods.
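For small, enumerable models, Definition 2.7 can be checked directly: joint observability holds exactly when no joint observation is generated (with positive probability) by two distinct world states. The sketch below assumes an illustrative dictionary encoding of the observation function and is not tied to any particular Dec-POMDP implementation.

```python
# Minimal sketch (illustrative data layout): check joint observability (Definition 2.7)
# by verifying that each joint observation with positive probability is generated by
# at most one world state, so a mapping J from joint observations to states exists.

def is_jointly_observable(O):
    """O: dict mapping (joint_action, next_state, joint_obs) -> probability."""
    generating_state = {}                     # candidate J(joint_obs)
    for (joint_a, s_next, joint_o), prob in O.items():
        if prob <= 0.0:
            continue
        if generating_state.setdefault(joint_o, s_next) != s_next:
            return False                      # same joint observation arises from two states
    return True
```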
2.3.2.2 Local Full Observability

Another branch of work assumes that each agent i observes (exactly or partially) a portion of the world state s referred to as its local state s_i, where s_i consists of a subset of the feature values that make up the world state s (Becker et al., 2004b; Goldman & Zilberstein, 2004; Nair et al., 2005; Varakantham et al., 2009). In all of these models, the state is factored such that every feature appears in at least one agent’s local state, and each agent’s observations depend only on the values of its local state features. Problems in which agents observe their local states exactly are commonly referred to as locally fully observable (Becker et al., 2004b; Goldman & Zilberstein, 2004), whose definition I review below.

Definition 2.8. A Dec-POMDP is locally fully observable if: ∀ i ∈ N, ∀ o_i ∈ Ω_i, ∃ s_i ∈ S_i such that Pr(s_i | o_i) = 1, where S_i is agent i’s local state space.

By Definition 2.8, an agent i’s current local state s_i is uniquely determined from i’s observation o_i, making local full observability a stronger assumption than joint observability. Whereas a Dec-POMDP that is locally fully observable is also jointly observable (Goldman & Zilberstein, 2004), the converse does not hold. In jointly-observable problems, an agent’s individual observation alone might not determine any portion of the world state (but only in combination with other agents’ observations provide awareness of the agent’s local state). The TI-Dec-MDP (Becker et al., 2004b), the EDI-Dec-MDP (Becker et al., 2004a), and the EDI-CR (Mostafa & Lesser, 2009), each of which I will describe in more detail (in Sec. 2.3.2.3–2.3.2.5) after reviewing more structural restrictions, are all subclasses that are locally fully observable.

2.3.2.3 Transition and Observation Independence

In addition to factoring the world state into local states, researchers have also imposed particular factored structure on the transition and observation functions. In particular, they have identified problems in which agents cannot affect the values of each others’ local states (Becker et al., 2004b; Nair et al., 2005). These problems are referred to as transition-independent (Becker et al., 2004b).

Definition 2.9. A Dec-POMDP is transition independent¹¹ if:

Pr(s_i^{t+1} | s^t, a^t) = Pr(s_i^{t+1} | s_i^t, a_i^t)

¹¹ Definition 2.9 is simplified slightly from that given by Becker et al. (2004b) in its omission of global features s_0 (also referred to as uncontrollable features (Goldman & Zilberstein, 2004), and unaffectable state (Nair et al., 2005)), which do not depend on any agent’s action. Here, I treat such features as jointly modeled in all agents’ local states.

By Definition 2.9, a transition-independent agent i’s next local state value s_i^{t+1} depends only on its previous local state s_i^t and latest individual action a_i^t. Along with transition independence, researchers have imposed observation independence (Becker et al., 2004b; Nair et al., 2005).

Definition 2.10. A Dec-POMDP is observation independent if:

O(o^{t+1} | a^t, s^{t+1}) = Π_{i∈N} O_i(o_i^{t+1} | a_i^t, s_i^{t+1})

Here, the Dec-POMDP observation function O() has been decomposed into a set of local observation functions {O_i()}, one for each agent i, dictating probabilities of individual observations which are assumed to be independent of peers’ observations. In problems that are both transition and observation independent, an agent i cannot affect the transitional outcomes of other agents’ actions, nor can it affect other agents’ observations, through any actions that i takes. Thus, the only form of interaction occurs through the reward function: one combination of agents’ actions may be valued differently than another. Consider, for instance, a problem in which the successful delivery (team reward +1) of a package is contingent upon both a dockworker (agent 1) loading the package into a truck (action a_1) and a driver (agent 2) transporting the package to its destination (action a_2).
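Definition 2.9 can likewise be tested on an enumerable model by marginalizing the joint transition function onto agent i’s next local state and checking that the marginal depends only on (s_i, a_i). The sketch below assumes that states and joint actions are tuples whose i-th components are agent i’s local state and action; the data layout is illustrative, not prescribed by any of the cited models.

```python
# Minimal sketch (illustrative data layout): test whether agent i is transition
# independent (Definition 2.9) by checking that the marginal distribution over its
# next local state depends only on (s_i, a_i).  States and joint actions are tuples
# whose i-th components are the local state and local action.
from collections import defaultdict

def is_transition_independent(P, i, tol=1e-9):
    """P: dict {(joint_s, joint_a): {next_joint_s: prob}}."""
    marginals = {}                                   # (s_i, a_i) -> {next_s_i: prob}
    for (s, a), dist in P.items():
        local = defaultdict(float)
        for s_next, prob in dist.items():
            local[s_next[i]] += prob                 # marginalize out peers' next states
        key = (s[i], a[i])
        if key in marginals:
            ref = marginals[key]
            if set(ref) != set(local) or any(abs(ref[x] - local[x]) > tol for x in ref):
                return False                         # marginal varies with peers' state/action
        else:
            marginals[key] = dict(local)
    return True
```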
Becker et al. (2004b) have identified the Transition-Independent Dec-MDP (TI-Dec-MDP) class, which restricts problems to be locally fully observable (Def. 2.8), transition independent (Def. 2.9), and observation independent (Def. 2.10). They have also proven that the TI-Dec-MDP is NP-complete, putting it in a complexity class (widely believed to be) easier than that of the general Dec-POMDP (Becker et al., 2004b). Subsequently, other researchers have developed algorithms for solving TI-Dec-MDPs approximately using mixed-integer linear programming (Wu & Durfee, 2006), and optimally using separable bilinear programming (Petrik & Zilberstein, 2009), which exploit the transition- and observation-independent structure.

Nair et al. (2005) have identified another class, the Network-Distributed POMDP (ND-POMDP), that is transition and observation independent but not locally fully observable, which has led to the development of a suite of exploitative algorithms (Kim et al., 2006; Kumar & Zilberstein, 2009; Marecki et al., 2008; Nair et al., 2005; Varakantham et al., 2007), and demonstrations of quality-bounded solution computation for problems with up to 10 agents, thereby making a significant leap in Dec-POMDP scaling. However, these algorithms remain limited in their applicability to transition and observation independent problems, examples of which include the control of distributed sensor networks wherein at each decision step agents choose only where to point their sensors and cannot affect each others’ observations or local states.

2.3.2.4 Reward Independence

As an alternative to transition and observation independent problems, Becker et al. (2004a) have developed a different class of problems that imposes a factoring of the team reward into local rewards.

Definition 2.11. A Dec-POMDP is reward independent if there are functions f and R_1 through R_n such that

R(s, a) = f(R_1(s_1, a_1), R_2(s_2, a_2), ..., R_n(s_n, a_n))

and

R_i(s_i, a_i) ≤ R_i(s_i, a_i′) ⇔ f(R_1, ..., R_i(s_i, a_i), ..., R_n) ≤ f(R_1, ..., R_i(s_i, a_i′), ..., R_n)

In Definition 2.11, R_i is agent i’s local reward function, valuing i’s local state and individual action independently of the other agents’ local states and individual actions. The reward composition function f, which is restricted to be monotonic, defines the resulting team reward. Reward independence, in the context of the models described in Section 2.3.2.5, has enabled researchers to decompose the Dec-POMDP value function into local value functions, and to exploit the resulting factored structure.

2.3.2.5 Event-Driven Interactions

Becker et al. (2004a) define a subclass called the Dec-MDP with Event-Driven Interactions (EDI-Dec-MDP), which combines reward independence and local full observability with another property that restricts agents’ interactions to take a particular form.
Due to its similarity to the model that I develop in Chapter 3, I describe the formal details of the EDI-Dec-MDP in Appendix A, which I briefly summarize here. Each interaction takes the form of a special transition dependency. For an agent i whose actions affect agent j, the EDI-Dec-MDP models a dependency that relates the occurrence of an event, which is a transition of agent i’s local state, to the probability of a subsequent transition of agent j’s local state. EDI-Dec-MDP agents are always reward independent, and they are transition independent in all world states except those explicitly represented with a dependency. Using this insight, Becker et al. (2004a) develop a solution approach for EDI-Dec-MDPs that iteratively solves nearly-independent local models augmented with nonlocal event information, and demonstrate their algorithm to be much more efficient than exhaustive joint policy search. However, no EDI-Dec-MDP algorithms to date have been shown to scale beyond two agents.

2.3.2.6 Hierarchy of Methods With Fixed Execution Ordering

Another branch of work, in addition to requiring structured interactions, imposes restrictions on agents’ local behaviors. The Opportunity-Cost Dec-MDP (OC-Dec-MDP), introduced by Beynier & Mouaddib (2005) and studied by Marecki & Tambe (2007), models a team of agents whose objective is to coordinate the execution times of methods with stochastic durations. Unlike the other models reviewed thus far, the OC-Dec-MDP specifies a fixed ordering over each agent’s method executions, restricting the problem to one of determining only when to start each method and not which order to execute the methods in. OC-Dec-MDP interactions take the form of precedence constraints, each dictating that a method executed by one agent will only complete successfully if a particular method of some other agent has already completed successfully. In combination with local full observability and the fixed ordering over method executions, this restricted form of transition dependence makes the OC-Dec-MDP more practical for scaling to problems with many methods (or many agents). Researchers have exploited this specialized structure to compute approximate solutions to problems containing over a hundred methods (Marecki & Tambe, 2007).

2.3.2.7 Other Subclasses

In addition to those described in the previous subsections, researchers have identified several other specialized Dec-POMDP subclasses whose structure has allowed for efficient computation of optimal solutions. For instance, Goldman & Zilberstein (2004) define a Goal-Oriented Dec-MDP (GO-Dec-MDP) wherein the objective is to minimize the cost of agents’ actions en route to one of a subset of goal states, and prove that, when combined with transition and observation independence, the GO-Dec-MDP is polynomial. Guo & Lesser (2005) define a Partially-Observable Stochastic Game with state-dependent action sets, wherein each agent controls a separate MDP that is independent from the others except that the agent’s set of available actions depends upon other agents’ MDP states, and demonstrate that for such problems the joint policy space can be reduced significantly by iteratively removing dominated local policies. Dolgov & Durfee (2006) define a flavor of Dec-MDP for multiagent resource allocation, wherein agents’ individual MDPs are completely independent with the exception of constraints on joint actions that depend upon an initial allocation of resources.
Wu & Durfee (2010) extend Dolgov & Durfee’s formulation to the case of sequential resource re-allocation, defining the multiagent resource-driven mission phasing problem (M-RMP) and proving the computational complexity of this subclass to be NP-complete.

Meanwhile, others have defined subclasses with fewer structural restrictions, but that have only been shown to accommodate efficient approximate solution methods. For instance, Varakantham et al. (2009) define the Distributed POMDP with Coordination Locales (DPCL), which requires only observation independence (Def. 2.10) and a decomposition of the team reward into local rewards. The authors demonstrate that by explicitly distinguishing all of those world states (or locales) in which agents can interact, an efficient distributed algorithm can exploit the underlying structure to compute solutions efficiently, but without guaranteeing optimality or near optimality. Guestrin et al. (2001) define a hierarchical multiagent factored MDP whose structure can be exploited by an efficient approximate linear programming algorithm. Lastly, Oliehoek et al. (2008b) describe a generalization of Guestrin et al.’s model called the factored Dec-POMDP, which explicitly represents factored value functions whose components depend on subsets of state variables and subsets of agents’ actions. While the structure in these more general subclasses has allowed efficient and scalable computation of approximate solutions, no generally applicable algorithms have yet been developed that can compute quality-bounded solutions for problems with more than three agents.

2.3.3 Decoupled Joint Policy Formulation

By and large, the successes (some of which are cited in Section 2.3.2) in scaling planning to Dec-POMDPs with more than two or three agents have come through the use of decoupled solution methods. In contrast to centralized planning algorithms that formulate joint behavior by reasoning about all agents’ decisions in combination, a decoupled algorithm breaks the computation up into coordinated local policy formulations. I now review the infrastructural groundwork (on which my own solution approach also rests), the results that others have attained, and the limitations of this past work.

2.3.3.1 Best Response

Central to the decoupled solution approach is the use of local models to separately compute each agent’s individual policy. As derived by Nair et al. (2003), any Dec-POMDP can be transformed into a single-agent POMDP for agent i given that the policies of its peers have been fixed. Agent i uses the single-agent POMDP to compute its best response policy, which I will denote π_i*(π_{≠i}), that is optimal with respect to the policies of its peers (π_{≠i}). The idea is to compute best responses to a series of candidate policies of i’s peers, as illustrated in Figure 2.6. Nair et al. (2003) provide a dynamic programming algorithm for computing best responses to the general class of Dec-POMDPs. Though computationally less expensive than computing a complete joint policy, Nair et al.’s best-response computation requires that the agent reason about the space of possible observation histories of its peers, which increases exponentially with the time horizon (T) and with the number of peers. Consequently, Nair’s general best response computation has failed to scale to problems with more than two agents (Varakantham et al., 2009).
In a more restrictive context, researchers have devised best-response models that provide substantial leverage (Becker et al., 2004b; Nair et al., 2005) in reducing computational cost. They take advantage of the locality of agents' interactions (Nair et al., 2005), such that the agent reasons about only the observation histories of a subset of its peers and only a subset of state features. However, these specialized models are only applicable to transition and observation independent Dec-POMDPs (Becker et al., 2004b; Nair et al., 2005).

[Figure 2.6: Decoupled joint policy search. Agent i computes a best response policy $\pi_i^*$ to candidate peer policies $\pi_{\neq i}$ drawn from the joint policy space.]

2.3.3.2 Policy-Space Search

Given the decoupling scheme that the best response model provides, planning the joint policy becomes a search through the space of combinations of optimal local policies (each found by solving a local best-response model). Nair et al. (2003) develop a general algorithm, Joint Equilibrium-based Search for Policies (JESP), for searching the joint policy space in this manner. Using JESP, agents iteratively revise their policies by computing best responses to each others' best responses, ultimately converging on a (Nash) equilibrium that is a local optimum but not necessarily a (Pareto-efficient) global optimum. A subsequent extension to JESP, the Global Optimal Algorithm (GOA), ensures that agents compute best responses to all interacting peers' policies, and thus returns the optimal joint policy (Nair et al., 2005). However, GOA is also limited in its scalability due to the intractable growth of the joint policy space. Researchers have also employed joint policy search for solving problems in the ND-POMDP class (Nair et al., 2005; Marecki et al., 2008; Kim et al., 2006; Varakantham et al., 2007). By exploiting locality of interaction (Nair et al., 2005; Kim et al., 2006), using smart pruning techniques (Varakantham et al., 2007), and replacing policies with fixed-size controllers (Marecki et al., 2008), they have been successful in scaling up the computation of joint policies to transition-independent problems with 10 agents. With one such algorithm, SPIDER (Varakantham et al., 2007), they have additionally been able to bound the quality of the solutions returned. However, none of these scalable algorithms are directly applicable to transition-dependent problems.

2.3.3.3 Adaptations

Researchers have developed other decoupled joint policy formulation methods by maintaining the same paradigm as I have described, but by adapting the mechanics of either the search process or the best response calculation. For instance, Varakantham et al. (2009) have designed an algorithm, Team's REshaping of MOdels for Rapid execution (TREMOR), for computing approximate solutions to DPCLs (as described in Section 2.3.2.7). Like JESP (described in Section 2.3.3.2), TREMOR employs local models, iteratively computing individual policies in response to candidate peer behavior, and greedily converges on a local optimum. The difference is that instead of computing an optimal best response, TREMOR uses social model shaping to compute approximate best responses. Agents construct local models whose transitions have been shaped to account for expected peer effects (upon entering into predetermined coordination locales) and whose rewards have been shaped to encourage the agent to select harmonious actions (in coordination locales).
By foregoing optimality, TREMOR’s local response calculation has been shown to scale to problems with 10 agents. However, it provides no guarantees of near optimality, nor any bounds on the quality loss due to the approximate best response. Another adaptation, the Coverage Set Algorithm (CSA) (Becker et al., 2004a,b), which was originally designed to solve TI-Dec-MDP problems, is built on the same best-response concept as JESP, GOA, and SPIDER. Becker et al. (2004b) defines an optimal coverage set as the set containing each local policy that is a best response to some combination of peer policies. By considering all policies in the coverage sets of all agents, CSA ensures that the optimal joint policy will not be overlooked, which is the same insight behind GOA. The novelty of CSA lies in its exploration of the coverage set. Policies are abstracted using a collection of parameters over which the best-response value function is piecewise-linear and convex. CSA then searches the parameter space by evaluating policies that correspond to hyperplane intersections along the surface of the optimal joint value function. Such a parameterization was first defined for TI-Dec-MDPs based on the joint reward structure (Becker et al., 2003). Another parameterization was later defined for EDI-Dec-MDPs (Becker et al., 2004a), but has since only been shown to be tractable on small two-agent event-driven problem instances. Petrik & Zilberstein (2009) have since reformulated CSA as an optimization problem called a separable bilinear program, and developed a centralized solution approach, which I will refer to as SBP, that has been shown to significantly outperform the basic CSA implementation. SBP works by repeatedly solving an optimization problem according to an approximation bound, successively refining the bound from iteration to iteration. Aside from converging more quickly than CSA in practice, it also has the advantage of allowing the agents to compute anytime solutions with bounds on approximate solution quality. However, it has only been developed for 43 solving problems with two agents. Extension to more than two agents is nontrivial by nature, since the consequent mathematical formulation would no longer constitute a “bilinear” program. 2.3.4 Coordinating Abstract Behavior Another paradigm central to this dissertation is the coordination of abstract interactions. Intuitively, agents don not always require detailed models of peers’ individual behavior in order to coordinate their decisions. Instead, they only need to consider the portions of peers’ behavior relating to their interactions. In the context of a decoupled solution approach (Sec. 2.3.3), agents can formulate coordinated joint behavior by negotiating abstract commitments to interactions and planning local behavior around those commitments. Historically, this has been a dominant approach in multiagent planning. As early as 1980, the Contract Net protocol (Smith, 1980) provided a convention for agents to commit to executing necessary subtasks of a larger problem. Cohen & Levesque (1990, 1991) use the commitment paradigm to develop a theory of Joint Intentions by which agents commit to performing actions in states that will allow achievement of persistent goals. Grosz & Kraus (1996) formalize commitments into a model of agents’ simultaneous completions of plan components with their SharedPlan framework. 
Durfee & Lesser (1991) develop a Partial Global Planning methodology (subsequently generalized by Decker & Lesser, 1992), wherein agents coordinate their interactions by exchanging group goals and integrating agents’ commitments in the form of partial plans that can be used to complete those goals. Meanwhile, local plans are formed around the promised partial plans and revised (when the need arises) to adapt to dynamic environmental factors and unexpected circumstances. More recently, researchers have used the concept of partial global planning to develop algorithms wherein agents identify coordination points and coordinate using abstract characterizations of their interactions (Clement et al., 2007; Cox & Durfee, 2003; Xuan & Lesser, 1999). Other examples wherein agents coordinate abstract interactions include Tambe’s Shell for Teamwork (STEAM ) (Tambe, 1997) based on Cohen’s theory of Joint Intentions, Jennings’ Generic Rules and Agent model Testbed Environment (GRATE* ) (Jennings, 1995) which defines an extension to Joint Intentions called Joint Responsibility, and Rich & Sidner’s Collaborative Agent toolkit (COLLAGEN ) (Rich & Sidner, 1997), based on Grosz’s SharedPlans theory. The principle of coordinating abstract interactions has received considerably less attention in Dec-POMDP settings. Most of the decoupled policy formulation techniques 44 described in Section 2.3.3 involve coordination through the exchange of complete policies, that represent both local and interacting behavior. Alternatively, CSA (Becker et al., 2004b) employs abstraction by parameterizing one agent’s policies using expectations about nonlocally-affecting events. However, this particular parametrization is limited to the TI-Dec-MDP (Becker et al., 2004b) and the EDI-Dec-MDP (Becker et al., 2004a). Musliner et al. (2006) develops a distributed planning algorithm wherein agents communicate commitments about the timings of their interdependent task executions, which are in turn modeled using local MDPs. However, this particular form of commitment does not incorporate uncertainty in agents’ interactions, thereby providing only an approximate model of interaction. Similarly, TREMOR (Varakantham et al., 2009) employs approximate local models (using transition and reward shaping) that abstract agents’ interactions. However, TREMOR’s search process dictates that agents communicate complete policies without regard to the abstract interactions that they entail, leading to convergence on local optima and no guarantees that agents will consider any breadth of committed interactions. 2.4 Summary In summary, I have reviewed work in single-agent sequential decision making (Section 2.2) and multiagent sequential decision making (Section 2.3) that forms the foundations of the work that I develop in this dissertation. I have also surveyed related approaches to coordination under uncertainty, and drawn attention to the limitations of past work when it comes to scalability, bounded solution quality, and applicability. In particular, I find that researchers have developed several general algorithms, for computing quality-bounded solutions to transition-dependent flavors of Dec-POMDPs, that are limited to problems with just two or three agents (due to their computational overhead). 
Alternatively, researchers have developed algorithms that demonstrably scale to teams of 5 or 10 agents and guarantee bounds on solution quality, but that are limited in their applicability to specialized subclasses with restrictive assumptions (described in Section 2.3.2). There is no prior work that both solves a general flavor of transition-dependent problems and scales to more than three agents whilst guaranteeing bounds on quality. The latter group of algorithms (those that scale quality-bounded solution computation) have achieved their scalability by decoupling the joint policy formulation problem (Section 2.3.3) and by identifying and exploiting specialized structure in agents' interactions. It is important to identify instances of structure that lend themselves to efficient and scalable solution methods. However, in this pursuit, I find that there is a tendency in past work for each instance of exploitable structure to be studied separately. This is evident from the large number of Dec-POMDP subclasses reviewed in Section 2.3.2 (e.g., TI-Dec-MDPs, OC-Dec-MDPs, GO-Dec-MDPs, ND-POMDPs, EDI-Dec-MDPs, EDI-CRs, DPCLs), each of which comes with its own set of restrictions, and each of which is accompanied by its own specialized solution algorithms. The field of multiagent sequential decision making lacks models that are both general (and hence of interest to a broad group of researchers) and exploitable (and hence enable solution methods that are efficient and scalable to the extent that exploitable structure is present).12

12 A notable exception is the factored Dec-POMDP (Oliehoek et al., 2008b), whose exploitable structure I describe in the next chapter (Section 3.4.1).

CHAPTER 3
Exploiting Transition-Dependent Interaction Structure

In spite of the general intractability of Dec-POMDP planning, a large body of work surveyed in the last chapter has shown us that there exist subclasses of Dec-POMDP problems, even some involving more than two or three agents, that are tractable. Moreover, this past work has provided us with tools to compute solutions efficiently by exploiting restricted instances of problem structure. Inspired by the successful solving and scaling of transition and observation independent problems (Nair et al., 2005; Varakantham et al., 2007), and with the ambition of reproducing these results under less restrictive conditions, I now introduce a new Dec-POMDP subclass that is more general than other subclasses, along with a corresponding model description that articulates exploitable problem structure. My class of Transition-Decoupled POMDP (TD-POMDP) problems serves as the context for the remaining chapters of this dissertation. The TD-POMDP is named for its structure: it consists of a set of transition-dependent local POMDP models, one for each agent, that can be decoupled by fixing peer agents' policies and abstracting their transition influences. Before developing the mechanics of TD-POMDP decoupling and influence abstraction (in Chapter 4), here I provide a formal description of the TD-POMDP's exploitable interaction structure as well as a theoretical motivation for exploiting this structure. Intuitively, the computational leverage gained through decoupling and abstraction depends upon the extent to which conditional independencies exist among agents' decisions that render the agents weakly coupled.
Extending past work, I characterize three complementary aspects of weaklycoupled interaction structure, relate each to the TD-POMDP problem description, and derive bounds on the TD-POMDP’s computational complexity that depend upon the degree of agent coupling. 47 3.1 Overview The contents of this chapter are structured as follows. I begin, in Section 3.2, by presenting the formal details of the TD-POMDP model, expressed as properties that constrain the Dec-POMDP formalism. The structure that these additional properties induce leads me, in Section 3.2.4, to specify the TD-POMDP problem description as a collection of interdependent local models whose mutually-modeled features (through which agents interact) are treated as first-class entities. In Section 3.3, I formally describe what it means to solve the TD-POMDP and how difficult this problem is. Although just as complex as the general Dec-POMDP in the worst case, the benefit of the TD-POMDP lies in its emphasis of exploitable structure. In Section 3.4, I contrast the TD-POMDP with other models, comparing the structure that each articulates as well as the problem restrictions that each imposes. As a step towards exploiting TD-POMDP interaction structure, in Section 3.5 I develop theory for characterizing what it means for a problem to be weakly coupled, and for measuring the degree to which it is coupled. My characterization includes three different aspects of weak coupling that, when considered in concert, lead me to develop tighter bounds on the complexity of computing optimal TD-POMDP solutions. I conclude, in Section 3.6, with a summary of the formalisms I have introduced and theoretical results I have derived, and a discussion of their respective contributions. Throughout this chapter, I refer to example problems of the form shown in Figure 3.1 and described in Example 3.1 below. Like Example 2.1 (shown in Figure 2.2), Example 3.1 is depicted in Figure 3.1 as a network of interdependent tasks whose relationships, indicated by connecting lines, may be described using a variation of the TÆMS modeling language (Decker, 1996). Figure 3.1 highlights one such task relationship, which constitutes a structured transition-dependent interaction between two TD-POMDP agents. Example 3.1. Figure 3.1 presents a concrete example of a structured interaction from the planetary exploration domain described in Section 1.1.1. Here, a satellite agent (1) interacts with a rover agent (7) by building a path for the rover to travel from its present location to site A. The path-building task (representative of a lower-level path-planning routine) has two possible outcomes whose durations (“D”), qualities (“Q”) and probabilities (“Pr”) are given. As shown, in the case 48 that the satellite completes task “Build-Rover-Path-A” successfully (with outcome quality > 0), this influences the outcome of a rover’s task, “Visit-Site-A”, allowing the rover to visit site A more quickly (in 3 time units instead of 6) with high probability (0.9 instead of 0.1, as denoted by the arrow in the last column of the “Visit Site A” task). This task relationship is just one example of an interaction that may exist among the team of satellites and rovers shown in Figure 3.1. 
[Figure 3.1: Example of structured interaction among TD-POMDP agents. A TÆMS-style network of tasks distributed among satellite and rover agents (e.g., "Build Rover Path A" for satellite agent 1 and "Visit Site A" for rover agent 7), where each task's outcomes are characterized by duration (D), quality (Q), and probability (Pr) and constrained by an execution window, and an arrow highlights the interaction described in Example 3.1.]

3.2 TD-POMDP Formalism

Decentralized Partially-Observable Markov Decision Processes (Dec-POMDPs), as reviewed in Section 2.3.1, provide a powerful, well-studied framework for multiagent planning under uncertainty. Here I present a formal specification of the subclass of Dec-POMDP problems that this thesis addresses: the Transition-Decoupled POMDP. In Subsections 3.2.1-3.2.3, I specify precisely how the key problem characteristics outlined in Section 1.2 translate into the formal properties that define the Transition-Decoupled POMDP. Then, in Subsection 3.2.4, I bring all of these properties together into a concise representation of TD-POMDP problem information.

3.2.1 Factored Decomposability

The world state s is factored into state features (denoted as $s = \langle b \in B, c \in C, d \in D, \ldots \rangle$, for instance), each of which represents a different aspect of the environment. Equivalently, the world state space S may be represented as the cross product of individual feature domains: $S = (B \times C \times D \times \ldots)$. Factoring of the world state allows us to express the conditional independence relationships that exist among the variables (i.e., state features, observation features, actions, and rewards) of the decision model (Boutilier, 1996; Guestrin et al., 2003). As reviewed in Chapter 2, factorization in multi-agent sequential decision making is not original to this dissertation. However, the way in which the TD-POMDP model is factored sets it apart from related models.1 The definitions that follow serve to formalize the factorization particular to the TD-POMDP.

By factoring the world state, we can impose a distribution of environment information among the agents. Different state features are relevant to different agents as they make decisions about which activities to pursue. Moreover, some features may not be available to an agent. Limited sensory capabilities may restrict the agent's awareness to only a small subset of features. By Definition 3.2 below, not all features need be observable to all agents.

Definition 3.2. A state feature f, with domain $F = \{f, f', f'', \ldots\}$, is observable to agent i if and only if i's observation $o_i^{t+1}$ depends upon the concurrent value of f (for some combination of state, action, and observation)2:

$\exists o_i^{t+1} \in \Omega_i,\ a^t \in A,\ \langle b, c, \ldots, f, \ldots \rangle^{t+1} \in S,\ \langle b, c, \ldots, f', \ldots \rangle^{t+1} \in S$ such that
$\Pr\left(o_i^{t+1} \mid a^t, \langle b, c, \ldots, f, \ldots \rangle^{t+1}\right) \neq \Pr\left(o_i^{t+1} \mid a^t, \langle b, c, \ldots, f', \ldots \rangle^{t+1}\right)$

It is important to distinguish observability (Def. 3.2) from the (previously stated) concept of full observability (as in Def. 2.8, and sometimes referred to as "direct observability"). Whereas these previous terms refer to an agent's awareness of the exact value of a state feature f, Definition 3.2 makes no distinction between exact observations and partial observations. As long as the agent's observation $o_i$ is not conditionally independent of the concurrent value of f (conditioned on other state features' values), f is observable to the agent. Throughout this dissertation, the usage of "observes", "observable", and "observability" in relation to state features refers to Definition 3.2.

1 A detailed comparison with closely-related models is deferred to Section 3.4.1, after the TD-POMDP has been formally specified.
2 Just as in Chapter 2, here and throughout, the superscripts t and t + 1 simply refer to two successive decision steps. Similarly, all components of the model are stationary, such that they return the same value regardless of the particular value of t. These superscripts should not be confused with time, which is a feature of state, as described in Section 3.2.3.
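To illustrate Definition 3.2 computationally, the sketch below checks whether a feature is observable to an agent by comparing the agent's observation distributions across pairs of world states that differ only in that feature's value. The data structures (states as feature-to-value dictionaries) and the agent_obs_fn interface are hypothetical conveniences for illustration, not part of the TD-POMDP specification.

```python
def is_observable(feature, agent_obs_fn, states, actions, obs_space, tol=1e-9):
    """Def. 3.2 (sketch): feature f is observable to agent i iff there exist an action
    and two states differing only in f under which some observation's probability differs.

    Assumed (hypothetical) interface:
      agent_obs_fn(o, a, s) -> probability of observation o given joint action a and
                               the concurrent world state s (a dict of feature values)
    """
    def differ_only_in(s1, s2, f):
        return s1[f] != s2[f] and all(s1[k] == s2[k] for k in s1 if k != f)

    for a in actions:
        for s1 in states:
            for s2 in states:
                if not differ_only_in(s1, s2, feature):
                    continue
                for o in obs_space:
                    if abs(agent_obs_fn(o, a, s1) - agent_obs_fn(o, a, s2)) > tol:
                        return True  # found a witness: f influences i's observation
    return False
```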
[Figure 3.2: A simple satellite-rover example problem. The satellite's activities are "Explore Map Region" and "Build Path A"; the rover's activities are "Visit Site A", "Analyze Soil at Site A", and "Return to Base".]

Example 3.3. Figure 3.2 depicts a simplified version of the running example. There are just two agents, a satellite and a rover, with a small number of activities. The dotted lines between activities indicate the existence of local constraints. For instance, the satellite cannot build a path for the rover until it has mapped the region, and the rover cannot analyze the soil at site A after it has returned to base. Moreover, neither agent can execute multiple tasks simultaneously. Here, there may be a number of different state features that the agents can model. Given that the rover performs all of its activities on the surface, and the satellite is located far above the surface looking down, the two agents will have vastly different perspectives of the world. Whereas the rover agent may be concerned with the composition of the soil sample it has just analyzed, which I will denote "soil-composition-at-site-A" (SCA), this is not relevant to the satellite agent's activities, nor is the satellite equipped with sensors for analyzing soil. "path-A-built" (PAB), on the other hand, is a feature that is relevant to both agents. The satellite should model PAB so that it does not perform redundant computations. The rover should model PAB because this feature impacts the rover's ability to visit Site A (as indicated by the arrow in Figure 3.2). PAB is observable to both agents during execution because the satellite broadcasts the completed path to the rover.

[Figure 3.3: Example of local state representations and local observations. The rover's local state contains rover-location (RLoc), soil-composition-at-site-A (SCA), the synchronized clock (Time), and path-A-built (PAB); the satellite's local state contains Time, PAB, region-mapped (RM), and satellite-location (SLOC); each agent's local observation depends on its own local state features.]

Ultimately, it is the task of the problem designer to specify how the awareness of various state information is distributed among the agents. By making a feature observable to an agent, through manipulation of sensor infrastructure or communication infrastructure (as discussed in Section 3.4.2), the designer can alter which information is used for decision making by which agents as they execute their activities. For this purpose, as in other related models (e.g., those discussed in Section 3.4.1), the TD-POMDP world state s is an aggregation3 of agents' local states:

$s = \langle s_1, \ldots, s_n \rangle$   (3.1)

The designer of a TD-POMDP problem indicates which features are relevant to each agent i by specifying the agent's local state $s_i$ according to the constraints given in Definition 3.4.

3 Unlike related models (e.g., Becker et al., 2004a; Nair et al., 2005), TD-POMDP agents' local states may share features.
Figure 3.3 portrays the local state representations for the satellite and rover (from Example 3.3) in the form of a Dynamic Bayesian Network (DBN), where the arrows represent dependencies between state feature variables and observation variables.

Definition 3.4. The local state for TD-POMDP agent i, denoted $s_i = \langle f_{i1}, f_{i2}, \ldots \rangle$, represents a subset of world state features, such that the following properties hold:

1. For every world state feature f, if f is not contained in $s_i$, f must be contained within some other agent's local state.
2. If a world state feature f is observable to agent i, f must be contained within i's local state representation.

Property 1 of Definition 3.4 requires that every world state feature be represented in at least one agent's local state. Property 2 implies that an agent may observe (partially or fully) only those features that make up its local state. As such, agent i's local observation $o_i$ satisfies the following equality:

$\Pr\left(o_i^t \mid a^t, s^t = \langle \ldots, s_i^t, \ldots \rangle\right) = \Pr\left(o_i^t \mid a^t, s_i^t\right)$   (3.2)

Local state thereby allows for a separation of features observable to one agent from features observable to another agent. It is important to note that the separation need not be strict. For instance, certain state features (such as "PAB" in Figure 3.3) may be observed by more than one agent, and thus shared by more than one agent's local state representation. Equation 3.2 describes one aspect of the local observation function fully specified in the following definition.

Definition 3.5. The local observation function $O_i : A_i \times S_i \times \Omega_i \rightarrow \mathbb{R}$, functionally denoted $O_i\left(o_i^{t+1} \mid a_i^t, s_i^{t+1}\right)$, dictates the probability with which agent i will receive observation $o_i^{t+1} \in \Omega_i$ after taking action $a_i^t \in A_i$ and transitioning into local state $s_i^{t+1} \in S_i$. All agents' local observation functions together define the probabilities of joint observations:

$\Pr\left(o_1^{t+1}, \ldots, o_n^{t+1} \mid a^t, s^{t+1}\right) = \prod_{1 \leq i \leq n} O_i\left(o_i^{t+1} \mid a_i^t, s_i^{t+1}\right)$   (3.3)

The TD-POMDP's local observation functions allow agents possible observability of their local state features but not of features outside their local states. Although factored, the agents' observations are not necessarily independent (a requirement of other related models (Becker et al., 2004b; Nair et al., 2005)). They may be dependent on features shared across local states (as well as joint action choices that are correlated because of features shared across local states). However, by Equation 3.3, the observation probabilities are conditionally independent given values of the shared state features. For instance, if two rover agents are traveling the same path looking for a particular landmark, and if there is a probability that each agent will fail to detect the landmark as it passes by, the probability with which rover 1 finds the landmark and the probability with which rover 2 finds the landmark are assumed to be conditionally independent of one another (conditioned on the current world state and the agents' latest actions).4

4 Without loss of generality, dependencies among TD-POMDP agents' observations occur through their shared state features. That is, arbitrarily-complex observational dependencies are representable by sharing additional features among agents' local states.
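As a concrete reading of Equation 3.3, the sketch below computes a joint observation probability as the product of the agents' local observation functions. The per-agent callables and the tuple-based representation of the joint quantities are hypothetical conveniences chosen for illustration.

```python
from math import prod

def joint_observation_prob(local_obs_fns, joint_obs, joint_action, next_local_states):
    """Equation 3.3 (sketch): Pr(o_1,...,o_n | a, s') = prod_i O_i(o_i | a_i, s_i').

    Assumed (hypothetical) interface:
      local_obs_fns[i](o_i, a_i, s_i_next) -> probability in [0, 1]
    The i-th entries of joint_obs, joint_action, and next_local_states belong to agent i.
    """
    return prod(
        O_i(o_i, a_i, s_i_next)
        for O_i, o_i, a_i, s_i_next in zip(local_obs_fns, joint_obs,
                                           joint_action, next_local_states)
    )
```

Note that the product form does not make the observations unconditionally independent; the conditional independence holds only given the next local states, which may share mutually-modeled features.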
The agents’ local reward functions combine by summation to yield the team reward (represented in the general Dec-POMDP model): n   X R st , at = Ri sti , ati . (3.4) i=1 In the example problem, an agent’s local rewards are the qualities attained from the tasks that the agents execute. Not only is the team reward decomposable, but additionally, the value of any given joint policy can be expressed as the composition of local values. Definition 3.7. The local value Vi (π) is the expectation of the non-discounted summation of local rewards (Def. 3.6) for agent i given that the team of agents adopts joint policy π: " T # X Vi (π) = E Ri (st , at )|π (3.5) t=0 Theorem 3.8. The (joint) value V of a joint policy π is the summation of local values: n X V (π) = Vi (π) (3.6) i=1 Proof. This follows directly from the definition of Dec-POMDP value function (Def. 2.6) and the definition of the local reward function (Def. 3.6): 4 Without loss of generality, dependencies among TD-POMDP agents’ observations occur through their shared state features. That is, arbitrarily-complex observational dependencies are representable by sharing additional features among agents’ local states. 54 V (π) =E =E = " " n X i=1 = n X T X R(st , at )|π t=0 T n X X # Ri (st , at )|π # by Definition 3.6 Ri (st , at )|π # by the linearity of expectation i=1 t=0 E " T X t=0 Vi (π) by Definition 2.6 by Definition 3.7 i=1 Notice that local value is defined as a function of joint policy (and not simply local policy). This is due to the combination of the following properties: (1) the TD-POMDP local rewards are dependent on local state feature values, and (2) given that local state features may be shared, local state values may be affected by other agents’ actions (as described in detail in Section 3.2.2). As such, TD-POMDP agents are not strictly reward independent (Def. 2.11). 3.2.2 Nonconcurrently-Controlled Nonlocal Features The transition dynamics of the TD-POMDP are also factored in such a way that enable agents to reason individually about the values of the features of their local states. Before formally developing the TD-POMDP transition function, let me begin by defining the concepts of controllability and affectability. Definition 3.9. A state feature fix is controllable5 by agent i if and only if: ∃ha1 , ..., ai , ..., an i ∈ A, a′i ∈ Ai , st ∈ S such that   t+1 t t+1 t P r fix |s , ha1 , ..., ai , ..., an i 6= P r fix |s , ha1 , ..., a′i , ..., an i Definition 3.9 states that agent i can control a feature fix if the value of fix may depend upon i’s latest action. However, it does not say anything about i’s actions in previous time steps. By relaxing the condition from Definition 3.9, I define a slightly more general concept that I refer to as affectability. 5 My definition of controllability is an extension of Goldman & Zilberstein’s (2004) definition of uncontrollable features. It is a departure from the concept of controllability developed in the control theory literature (e.g., Ogata, 1997). Here, regardless of whether or not an agent can manipulate a feature fix deterministically, and whether or not it can set the value of fix at its will, fix is controllable as long as the agent can alter the probability of fix taking on some value in the next time step. 55 Definition 3.10. 
3.2.2 Nonconcurrently-Controlled Nonlocal Features

The transition dynamics of the TD-POMDP are also factored in such a way that enables agents to reason individually about the values of the features of their local states. Before formally developing the TD-POMDP transition function, let me begin by defining the concepts of controllability and affectability.

Definition 3.9. A state feature $f_{ix}$ is controllable5 by agent i if and only if:

$\exists \langle a_1, \ldots, a_i, \ldots, a_n \rangle \in A,\ a'_i \in A_i,\ s^t \in S$ such that
$\Pr\left(f_{ix}^{t+1} \mid s^t, \langle a_1, \ldots, a_i, \ldots, a_n \rangle\right) \neq \Pr\left(f_{ix}^{t+1} \mid s^t, \langle a_1, \ldots, a'_i, \ldots, a_n \rangle\right)$

Definition 3.9 states that agent i can control a feature $f_{ix}$ if the value of $f_{ix}$ may depend upon i's latest action. However, it does not say anything about i's actions in previous time steps. By relaxing the condition from Definition 3.9, I define a slightly more general concept that I refer to as affectability.

5 My definition of controllability is an extension of Goldman & Zilberstein's (2004) definition of uncontrollable features. It is a departure from the concept of controllability developed in the control theory literature (e.g., Ogata, 1997). Here, regardless of whether or not an agent can manipulate a feature $f_{ix}$ deterministically, and whether or not it can set the value of $f_{ix}$ at its will, $f_{ix}$ is controllable as long as the agent can alter the probability of $f_{ix}$ taking on some value in the next time step.

Definition 3.10. A state feature $f_{ix}$ is affectable by agent i if and only if:

$\exists \vec{a}_i^{\,t} \in (A_i)^t,\ \vec{a}_i^{\,t\prime} \in (A_i)^t$ such that
$\Pr\left(f_{ix}^{t+1} \mid \vec{a}_i^{\,t}\right) \neq \Pr\left(f_{ix}^{t+1} \mid \vec{a}_i^{\,t\prime}\right)$

A feature $f_{ix}$ is affectable by an agent i as long as its value is dependent on past actions taken by i. If a feature is controllable by i, it must also be affectable by i. However, a feature that is affectable need not be controllable. For instance, feature RLoc in Figure 3.4 is affectable by the satellite (since the satellite controls PAB and the value of RLoc depends upon PAB) but not controllable by the satellite. Any feature that is not affectable by agent i is unaffectable.

Definition 3.11. A state feature $f_{ix}$ is unaffectable by agent i if and only if:

$\forall \vec{a}_i^{\,t} \in (A_i)^t,\ \vec{a}_i^{\,t\prime} \in (A_i)^t,\quad \Pr\left(f_{ix}^{t+1} \mid \vec{a}_i^{\,t}\right) = \Pr\left(f_{ix}^{t+1} \mid \vec{a}_i^{\,t\prime}\right)$

In the TD-POMDP model, each state feature can be controlled by at most one agent (though it may be affected by more than one agent). Furthermore, if agent i can control a feature $f_{ix}$, then that feature must be represented in i's local state (Definition 3.4). These properties induce a further decomposition of the state features within each agent's local state:

Definition 3.12 (Local State Constituents). Agent i's local state $s_i$ is comprised of three disjoint feature sets, $s_i = \langle \bar{u}_i, \bar{l}_i, \bar{n}_i \rangle$, where:

• i's unaffectable features $\bar{u}_i = \langle u_{i1}, u_{i2}, \ldots \rangle$ are those features that are not affectable by any agent, but may be observable by multiple agents. Examples include time-of-day or temperature.

• i's locally-controlled features $\bar{l}_i = \langle l_{i1}, l_{i2}, \ldots \rangle$ are those features that are controllable by agent i. Features $\bar{l}_i$ are not controllable by any other agent. For example, a rover's position is a locally-controlled feature. Additionally, $\bar{l}_i$ may contain features that are not controllable by any agent, but that are affectable by agent i and are not contained within any other agent's locally-controlled feature set.6

• i's nonlocal(ly-controlled) features $\bar{n}_i = \langle n_{i1}, \ldots \rangle$ are those remaining features, each of which is in the locally-controlled feature set of exactly one other agent, and whose values may affect the transitions of i's locally-controlled features (formalized in Equation 3.9).

6 Special care must be taken in treating features that are not controllable by any agent but affectable by multiple agents. For instance, an agent pushes the first domino, and 10 time steps later the last domino falls. In this case, the "last-domino-down" feature could be included in agent i's locally-controlled feature set. However, if there are any other agents that can affect this feature, these other agents must model it as a nonlocal feature (see below).

According to the composition of agents' local states given in Definition 3.12, there may exist world state features that are not modeled exclusively by a single agent. First, a feature may appear as an unaffectable feature in more than one agent's local state. Second, each nonlocal feature in agent i's local state appears as a locally-controlled feature in the local state of exactly one other agent. In the example from Figure 3.1, the rover models whether or not the satellite agent has planned a path for it, so path-A-planned would be a nonlocal feature in the rover's local state as well as a locally-controlled feature in the satellite's local state. I refer to such features as mutually-modeled features. Conceptually, mutually-modeled features are aspects of the environment that are relevant to more than one agent as they plan their decisions.
Agent i’s mutually-modeled features, m̄i , are those state features that appear in i’s local state representation si , as well as one or more other agents’ local state representations: m̄i ≡ hf ∈ si |∃j 6= i, f ∈ sj i (3.7) Mutually-modeled features make the TD-POMDP model transition dependent. Referring back to Definition 2.9, transition independence is violated when the change in value of nonlocal feature njx ∈ n̄j in agent j’s local state depends upon agent i’s action:   t+1 t ′ t (3.8) P r nt+1 jx |s , ai , aj 6= P r njx |s , ai , aj The transition dependencies that exist between TD-POMDP agents are structured as follows. Within agent j’s local state sj = hūj , ¯lj , n̄j i = hfj1 , ..., fjk i, the transition t+1 probability of any feature fjx ∈ sj at time t + 1, denoted fjx , given that the agents t t t t have just performed joint action a = a1 , ..., aj , ...an ∈ A in world state s ∈ S at time t is: 57 t srover m usat t a rover (rover-location) RLoct Rloct+1 (soil-composition-at-site-A) SCAt SCAt+1 (synchronized clock) Timet Timet+1 (path-A-built) PABt PABt+1 (regionmapped) RMt RMt+1 (satellitelocation) SLOCt l sat (nsat = ∅ ) t +1 srover urover nrover SLOCt+1 a t s sat l rover t sat t +1 s sat Figure 3.4: Example of the dependencies among feature transitions.  t+1 t t P r fjx |s , a   t+1 t  for unaffectable feature fjx ∈ ūj P r fjx |ūj     P r f t+1 |st = ūt , ¯lt , n̄t , at  for locally-controlled feature f ∈ ¯l jx j j j jx  j j j = t+1 t t  for nonlocal feature fjx ∈ n̄j P r fjx |si , ai     (locally-controlled by agent i) (3.9) Equation 3.9 indicates that the transitions of all unaffectable features and locallycontrollable features depend on only the local state and local action. However, the transitions of each nonlocal feature depend on world features outside of the local state and on the actions of exactly one other agent (i, for instance), respectively. In Figure 3.4, these dependence relationships are represented graphically with a 2-stage DBN for the running example problem (Ex. 3.3). Here, the features are grouped by agent as well as by feature type, with the mutually-modeled features labeled m̄. The semantics of the DBN are such that a particular feature is conditionally independent of all non-parent features conditioned on the feature’s parents. Although this particular DBN is specific to the example problem, notice that its conditional independencies (denoted by the absence of arrows) conform to Equation 3.9. Additionally, as I formalize in Equation 3.10, the values for the three groups of  features ūj , ¯lj , n̄j are conditionally independent of one another given the previous state and joint action. Bringing the terms from Equation 3.9 together and generalizing to multiple nonlocal features (dependent on one or more other agents) leads to a 58 formal definition of the TD-POMDP’s factored transition function. Definition 3.14. Agent j’s factored local transition function is the probability distribution over j’s next local state conditioned on world state and joint action: t t P r st+1 j |s , a      t ¯t+1 ¯t t t t P r n̄t+1 |st , at = P r ūt+1 6=j j j |ūj P r lj |lj , n̄j , ūj , aj | {z } | {z } locally-dependent component nonlocally-dependent component = PjU t ūt+1 j |ūj  PjL ¯lt+1 |st , at j j j  Y ¯t+1 t t PiL n̄t+1 jx ≡ lix |si , ai ∀i|∃l̄ix ⊂l̄i ∧l̄ix ⊆n̄j  (3.10) denoted as the product of j’s unaffectable feature transition function j’s L locally-controlled transition function Pj (), and other agents’ locally-controlled feature transition probabilities. 
3.2.3 Temporal Synchronization

Included as a key feature of TD-POMDP world state is time, which is an unaffectable feature with deterministic transitions, and which is mutually-modeled by all agents. This feature serves to synchronize agents' executions and to practically facilitate coordination, particularly in domains where frequent communication is not possible. Typically, time = 0 in the TD-POMDP start state. Similarly, successor states always have a larger time value than their predecessors. A side effect is that the state space is non-recurrent: no world state may be visited more than once over the course of a single execution.7

7 My approach is also applicable to problems with recurrent state spaces, but I do not consider those problems in this thesis.

3.2.4 Decoupled Representation

The preceding subsections (3.2.1–3.2.3) described the TD-POMDP in the context of the conventional Dec-POMDP specification, formalizing the structural properties that delineate the TD-POMDP as a proper Dec-POMDP subclass, and along the way introducing the essential structural components. In Section 3.4, I provide a detailed discussion of the expressiveness of the TD-POMDP along with the representational limitations that come with this added structure. Here I summarize the compilation of components that specifies a TD-POMDP problem.
Definition 3.15. A TD-POMDP M is specified by the following tuple: $\mathcal{M} = \langle \mathcal{N}, \{S_j\}, \{A_j\}, \{\Omega_j\}, \{O_j\}, \{R_j\}, \{\bar{m}_j\}, \{P_j^U\}, \{P_j^L\}, T \rangle$, where

• $\mathcal{N}$ is a team of n TD-POMDP agents, indexed by j;
• $S_j \subseteq U_j \times L_j \times N_j$ is agent j's local state space (Def. 3.4), which is (possibly a subset of) the cross product of unaffectable, locally-controlled, and nonlocally-controlled feature spaces (Def. 3.12), with explicitly distinguished initial state $s_j^0$;
• $A_j$ is j's local action space (as in Def. 2.3);
• $\Omega_j$ is j's local observation space (as in Def. 2.3);
• $O_j : A_j \times S_j \times \Omega_j \mapsto [0, 1]$ is j's local observation function (Def. 3.5);
• $R_j : S_j \times A_j \mapsto \mathbb{R}$ is j's local reward function (Def. 3.6);
• $\bar{m}_j$ is the set of j's mutually-modeled features (Def. 3.13), where each feature is explicitly associated with at least one other agent;
• $P_j^U : U_j \times U_j \mapsto [0, 1]$ is the unaffectable feature transition function (Def. 3.14);
• $P_j^L : S_j \times A_j \times L_j \mapsto [0, 1]$ is the locally-controlled feature transition function (Def. 3.14); and
• $T \in \mathbb{N}$ is the finite time horizon of execution (as in Def. 2.3).

Unlike the conventional Dec-POMDP specification, most of the TD-POMDP model information is inherently distributed. For instance, instead of representing the world state space with a set S, the TD-POMDP explicitly specifies individual local state spaces. S may be recovered by aggregating all of the local state spaces (though with mutually-modeled features, S is not simply a cross product of local state spaces $\times_{1 \leq j \leq n} \{S_j\}$). Transition, observation, and reward information is similarly distributed into local components. The TD-POMDP model also distinguishes those local state features that are mutually modeled, explicitly characterizing the type of feature as well as the controlling agent (unless unaffectable). In fact, explicit in the TD-POMDP specification is the notion of a local model that represents the dynamics of the world as they relate to an individual agent j.

Definition 3.16. Agent j's local model $M_j$ for TD-POMDP M is specified by tuple $M_j = \langle S_j, A_j, \Omega_j, O_j, R_j, \bar{m}_j, P_j^U, P_j^L, T \rangle$, where each component is taken from the joint model M.

Representing the joint decision model as a collection of local models has several advantages. First, for problems involving agents that each deal (primarily) with a different realm of their shared environment, it is natural to decompose the model information as such. Specifying joint versions of the components (i.e., joint state, joint action, joint transition, joint observation) would involve unnecessary aggregation of largely-independent agent dynamics. Further, the space required to store the aggregated information (when naïvely represented such as in the conventional Dec-POMDP specification) could be unnecessarily large. Instead, the TD-POMDP factors the model information so as to break up an otherwise very large joint state space, joint transition matrix, and joint observation matrix into potentially more compact local components.
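As one possible concrete rendering of Definitions 3.15 and 3.16, the sketch below encodes the distributed specification as per-agent local model objects plus a thin joint wrapper. The field names and the use of Python callables for $O_j$, $R_j$, $P_j^U$, and $P_j^L$ are illustrative choices, not a prescribed implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List

@dataclass
class LocalModel:
    """Agent j's local model M_j (Def. 3.16), sketched with hypothetical field types."""
    agent: str
    local_states: List[dict]                         # S_j, each a feature-value assignment
    actions: List[str]                               # A_j
    observations: List[str]                          # Omega_j
    obs_fn: Callable[[str, str, dict], float]        # O_j(o | a_j, s_j')
    reward_fn: Callable[[dict, str], float]          # R_j(s_j, a_j)
    mutually_modeled: FrozenSet[str]                 # \bar{m}_j, shared feature names
    p_unaffectable: Callable[[dict, dict], float]    # P_j^U(u_j' | u_j)
    p_local: Callable[[dict, str, dict], float]      # P_j^L(l_j' | s_j, a_j)
    horizon: int                                     # T

@dataclass
class TDPOMDP:
    """A TD-POMDP M (Def. 3.15) as a collection of local models, with a map from
    each nonlocal feature to the single agent that controls it."""
    local_models: Dict[str, LocalModel] = field(default_factory=dict)
    controlling_agent: Dict[str, str] = field(default_factory=dict)

    def world_state(self, local_states: Dict[str, dict]) -> dict:
        # Aggregate local states (Eq. 3.1); mutually-modeled features must agree.
        s = {}
        for s_j in local_states.values():
            for f, v in s_j.items():
                assert f not in s or s[f] == v, f"inconsistent value for feature {f}"
                s[f] = v
        return s
```

The world_state helper reflects the remark above that S is recovered by aggregating local state spaces rather than by a simple cross product, since mutually-modeled features must take consistent values across the agents that share them.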
However, as the name “(T)ransition (D)ecoupled POMDP” implies, the local models can be decoupled from one another and be made into independently-evaluable decision models. As developed formally in Chapter 4, decoupling is accomplished by holding peer agents’ policies fixed and abstracting the transition influences. Once decoupled, the local models can be used to plan individually in the context of a distributed solution methodology (developed in Chapter 6). 3.3 Optimality and Tractability Next, I discuss what it means to solve the TD-POMDP (in Section 3.3.1) and express the worst-case time complexity of solving it (in Section 3.3.2). The result is forbidding: just like the general Dec-POMDP, in the worst case, computing an optimal policy for a problem in the TD-POMDP subclass is intractable. However, 62 this complexity result does not mean that the TD-POMDP should be condemned as a model that is less general than the Dec-POMDP and just as impractical. The merit of the TD-POMDP lies in its emphasis and explicit representation of problem structure, which I summarize in Section 3.3.3. By exploiting TD-POMDP structure, we will see (theoretically in Section 3.5 and empirically in Chapters 4 and 6) that portions of the TD-POMDP space yield efficiently-computable solutions. 3.3.1 Solution Concept The solution concept that I adopt in this dissertation is that of maximizing expected value. Thus, (optimally) solving a TD-POMDP problem involves computing a set of agent policies that maximizes the team’s expected cumulative reward (Def. 2.6). This set is referred to by the Dec-POMDP community as the optimal joint policy, whose definition I now restate. ∗ Definition 3.17.D An optimal joint E policy π of a TD-POMDP M is a combination ~ i 7→ Ai , each of which assigns an agent’s local action to of local policies ∀i, πi : O each of its local observation histories (as per Definition 2.4), that maximizes the joint value function: π ∗ ∈ arg max V (π). π Henceforth, I use the term solution and optimal joint policy interchangeably, and solving a TD-POMDP problem to mean optimally solving it by computing an optimal joint policy. For certain problems, optimality may not be tractable. I refer to an approximate solution as a joint policy that is returned by some planning algorithm, but that is not guaranteed to be optimal. When optimal solutions are intractable, the next best thing is to compute quality-bounded approximate solutions, which are those that attain a solution quality (i.e., value) that is provably close (by some standard of proximity) to the optimal solution quality. However, computing qualitybounded approximate solutions—where the proximity of approximate solution quality to optimal solution quality is bounded—may also be intractable, and provably as hard as computing optimal solutions (as is the case for the general finite-horizon Dec-POMDP problem class (Rabinovich et al., 2003)). As such, I will use the term approximate method to refer to any solution method not guaranteed to return optimal solutions, regardless of whether or not it produces quality-bounded solutions. Note that optimality is defined in relation to, and is sensitive to, the problem specification M. That is, a policy is optimal if it maximizes the expected team utility given each agent’s representation of features in its local state, constraints implied by its local observation function, and so on. 
If two problems, specified by Ma and Mb 63 respectively, differ only in one agent’s observability of one feature, a joint policy π ∗ that is optimal for Ma may be suboptimal for Mb . Moreover, for two problems with slightly different specifications, computing the solution to one may take seconds, but computing the solution to another may take hours. I refer the reader to a classical example comparing the Dec-POMDP model with the Multiagent MDP (MMDP) model (Boutilier et al., 1999b), which is a Dec-POMDP having the property that each agent observes the complete world state. Whereas the worst-case time complexity of the MMDP class is polynomial in the size of the state space (Goldman & Zilberstein, 2004), the worst-case time complexity of the Dec-POMDP class is NEXP-complete, even if we restrict consideration to Dec-POMDP problems where agents’ observations collectively (though not individually) determine the world state (Bernstein et al., 2002). The intuition behind why collectively-observable problems are so much harder to solve (than completely-observable problems) is that, when agents receive different (though not independent) observations, optimal joint behavior requires each agent to reason about (the exponentially-growing space of) what other agents’ may have observed in addition to what the agent itself observes. 3.3.2 General Complexity I now turn to the worst-case time complexity of computing optimal solutions for the class of TD-POMDP problems. The class of TD-POMDP problems is a subclass of the finite-horizon Dec-POMDP class (as emphasized in Section 3.2), so its complexity can be no greater than that of the Dec-POMDP. Further, the TD-POMDP imposes several structural restrictions on top of the Dec-POMDP model, so one might expect it to have a worst-case complexity strictly lower than that of the Dec-POMDP. Building on the work of others (Becker et al., 2004a; Allen, 2009), I have derived that this is unfortunately not the case. Instead, the TD-POMDP’s worst-case complexity has the same asymptotic lower-bound and upper-bound as does that of the more general Dec-POMDP. Theorem 3.18. The TD-POMDP is NEXP-complete. Proof. The NEXP-hard lower bound follows directly from the reduction of the EDIDec-MDP (Becker et al., 2004a), proved to be NEXP-complete (Allen, 2009), to the TD-POMDP. I present the reduction in Appendix A. The NEXP upper bound is proven given that the TD-POMDP is subclass of the Dec-POMDP (and was formally specified as such in Section 3.2). Therefore, the TD-POMDP is NEXP-complete. 64 The fact that the TD-POMDP is in the same complexity class as the Dec-POMDP suggests that, although the TD-POMDP requires additional problem structure beyond that of the Dec-POMDP, it does not strongly constrain the problems that may be represented (an issue discussed in detail in Section 3.4). The TD-POMDP can represent problems that are (asymptotically) just as hard. To be precise, the complexity result given in Theorem 3.18 is a statement relating problem description size to the worst-case computation required to verify that a solution is optimal. For NP-complete problems, such verification requires a number of computations polynomial in the problem size. For NEXP-complete problems like the TD-POMDP, verification requires a number of computations that is irreducibly exponential in the problem size. 
Thus, computation time of optimal solutions to NEXP-complete problem requires at least exponential time in the problem size, though this lower bound relies on the successful reduction of NEXP to EXP. It is widely believed (Papadimitriou, 1994) that NEXP and EXP are distinct complexity classes, and that solving NEXP-problems requires time doubly exponential in the problem size. For more information on this issue, I refer the reader to the theses of Daniel Bernstein (2005) and Martin Allen (2009). Observation 3.19. The worst-case computation time of optimal solutions for TDPOMDP problems is believed to be doubly exponential in the size of the problem description. Theorem 3.18 and Observation 3.19, by themselves, do not say anything about how the computation time of TD-POMDP solutions relates to the number of agents in the system. Discussion of Dec-POMDP complexity in the literature is centered around the complexity of 2-agent problems, wherein the size of a problem’s specification is commonly interpreted (Allen, 2009; Bernstein, 2005) as the size of its state space kSk (under the assumptions that the size of the local action spaces, size of the local observation spaces, and time horizon are each no larger than the size of the state space). The focus of this dissertation is on scaling solution computation to problems with more than two agents. In this context, let us be more concrete about the ramifications of the NEXP complexity result. Observation 3.20. If the size of the TD-POMDP world state space kSk ∝ n, the TD-POMDP’s worst-case computation time is believed to be doubly exponential in the number of agents n. The assumption that the state space grows with the number of agents is not terribly restrictive. By placing more agents in a shared environment, the number of combi65 nations of agent circumstances (each specified by a state in the state space) should increase linearly if not exponentially. Linear growth in the state space translates to doubly-exponential worst-case solve times (by Observation 3.20). All of these complexity results should be taken with a grain of salt. For instance, from Observation 3.20, it is tempting to conclude that the class of problems that I consider in this dissertation, for which I endeavor to scale solution computation to many agents, do not admit tractable scalable solution methods, and that such a pursuit is doomed to fail. However, recall that these are worst-case bounds. The TD-POMDP’s ability to represent problems with doubly exponential solve times does not imply that all TD-POMDP problems that we face will have this property. I prove in Section 3.5 and demonstrate in Chapter 6 the existence of sets of problems that avoid such menacing complexity and scale efficiently to large teams of agents. At the meta-level, I have introduced the TD-POMDP model in order to provide the formal specification of an umbrella class and the language with which to characterize its underlying subspace. 3.3.3 Significance of Structure The significance of the TD-POMDP lies in its emphasis of structure in problems that can be exploited to yield tractable solutions. I now briefly summarize the key elements of exploitable structure, referencing the details of their exploitation that appear in subsequent sections of this dissertation. Decoupled Representation. The TD-POMDP’s wholly factored representation decouples the joint model M into a set of local models {Mi } that are tied together by the transition dependencies of agents’ nonlocal features. 
Local model Mi compactly specifies the subset of problem dynamics relevant to agent i’s own behavior. When augmented with information relating to other agents’ choices (as I detail in Section 4.2.2), Mi suffices as a complete local planning model that i can use to compute its own policy without having to reason about superfluous details of other agents’ behavior. Hence, the benefit of the TD-POMDP’s decoupled representation is that it enables efficient individual reasoning in the context of a distributed joint policy search.8 8 The degree to which individual reasoning is efficient depends upon a structural metric state factor scope that I describe in Section 3.5. 66 Locality Of Interaction. Through its explicit representation of mutually-modeled features {m̄i } (Def. 3.13), the TD-POMDP specification emphasizes the team of agents’ locality of interaction. Each agent has a subset of other agents that it interacts with and a subset of features through which it interacts. In Section 3.5.1, by accounting for locality of interaction in the TD-POMDP, I derive a tighter complexity bound than that last derived in Section 3.3.2. Additionally, in Chapter 6, I present an algorithm that achieves the subsequent reduced complexity by exploiting this structure. Distinguished Interaction Features. Within agents’ mutually-modeled feature sets, the TD-POMDP distinguishes agents’ nonlocal features {n̄i ⊆ m̄i } (Def. 3.12). It is through these nonlocal features that agents can influence one another. Moreover, these features’ transitions constitute the information that TD-POMDP agents must jointly address when coordinating their behaviors. Instead of coordinating over whole policies, agents can coordinate over compact abstractions, which I term influences, that encode nonlocal feature transition information. Before developing a methodology for influence-based policy abstraction in Chapter 4, I derive in Section 3.5.2 that, utilizing such an abstraction can yield a potentially substantial reduction in complexity (on top of the reduction achieved by exploiting locality of interaction). By leveraging each of the above structural characterizations, I demonstrate in this dissertation that, despite its intractable general complexity, the TD-POMDP’s landscape of representable problems contains rich regions whose solutions are efficiently computable. In Section 3.5.1, I explore and characterize the landscape. 3.4 Expressiveness of the Representation Having presented the formal details of the TD-POMDP model along with its worstcase computational complexity, I now discuss the expressiveness of the TD-POMDP specification. In particular, I address two issues relating to expressiveness: (1) the space of problems that can be represented in the model, and (2) the problem structure that the model’s specification makes explicit. Section 3.2 began with the well-established, general Dec-POMDP model and introduced several additional structural properties, each one potentially carving away portions of the overarching Dec-POMDP class. What we are left with is a Dec-POMDP subclass, the TD-POMDP, that is the focus of this thesis. With the progression from Dec-POMDP to TD-POMDP, there is the danger that, even though each additional 67 restriction individually appears reasonable, the combination of additional properties narrows the representation to a space of problems that is no longer interesting. 
In response to these concerns, I contend that the TD-POMDP class—although it comes with some inherent restrictions—remains an interesting space, and that the TD-POMDP model specification conveys a rich array of useful problem structure. I defend this claim in Section 3.4.1 by contrasting my class with a variety of problem classes studied by others, arguing (and in some cases proving) that the TD-POMDP is more general than many of them, and arguing that, for the others, the TD-POMDP specification expresses just as much (if not more) exploitable problem structure. I extend my comparison in Section 3.4.2 to models explicitly representing communication, incorporating a discussion of how to model agent communication in the TD-POMDP. Examining the relationship of the TD-POMDP with existing models exposes its limitations. In Section 3.4.3, I summarize the key representational restrictions and provide suggestions of how they might be overcome in future work so as to expand the TD-POMDP (and the methodologies presented in this dissertation) to a broader sphere of problems.

3.4.1 Comparison with Existing Models

Prior to the work presented in this dissertation, researchers have introduced a variety of other Dec-POMDP subclasses (reviewed in Section 2.3.2). At a high level, the motivation for defining these subclasses has been the same as mine: to identify problem structure that can be exploited by solution methods to yield solutions tractably. In some cases, the structure has allowed for impressive scaling of optimal solution methods to systems of several agents (Beynier & Mouaddib, 2005; Marecki & Tambe, 2007; Nair et al., 2005; Varakantham et al., 2007). However, in such cases, not only was problem structure identified, but the problem representation was also severely restricted from that of the Dec-POMDP. Moreover, in such cases, generality was lost to the extent that some Dec-POMDP problems with closely-related structure could no longer be represented given the constraints of their specifications. Other subclasses have sought to identify structure with little or no loss of generality of the representation (Oliehoek et al., 2008b; Varakantham et al., 2009), but as of yet have not achieved both scalability (to more than three agents) and optimality. This is due, in part, to the unavoidable complexity that comes with considering a general space of Dec-POMDPs. However, I posit that another reason is that the specifications of these latter subclasses do not articulate enough identifiable structure whose exploitation would allow scalability of optimal solution methods.

Inspired by the successful scaling achieved by the former group of (exploitable, yet restrictive) subclasses, I have collected what I see as the most useful elements of problem structure (amenable to tractable scalability and optimality) and have incorporated those into the TD-POMDP's specification. However, I have done so in such a way as to impose as few restrictions as possible on the problem representation. In this sense, the TD-POMDP strives for a balance between its articulation of exploitable structure and its loss of generality. In the subsections that follow, I enumerate the most closely related Dec-POMDP subclasses.
For each, I contrast its representational restrictions, as well as the elements of exploitable structure that it articulates (in the context of the problems that it can represent), with those of the TD-POMDP.9 The results are summarized in Table 3.1, whose rows are ordered roughly from the least general model to the most general model. The second and third columns indicate the "representational restrictions" and "exploitable structure" of each of the eight models. So as to provide a very rough overview of the scalability of solution methods of each class, I have included a "scalability results to date" column, populated with data taken from published empirical demonstrations of solution computations to problems of more than two agents.10 For cases in which only approximate solutions were computed, I prefix the scalability results accordingly (giving priority to results involving quality-bounded solutions if available).

9 I have attempted to remain objective in my comparison of subclasses to the extent possible. However, given that I am characterizing a diverse collection of models in like terms according to my own dimensions (which were not necessarily those used by the authors of the respective works I am characterizing), a degree of subjectivity is bound to have crept in. For a more complete description of each model, along with the biases of its creators, see the respective papers I reference.

10 The "scalability results to date" are not meant to be compared to each other at face value. Each result represents solution computation on a different set of problems. Some results were published years before others. Although most classes of problems are defined around a set of agents, the OC-Dec-MDP is defined around a network of methods (which could perhaps be distributed among agents), as are the corresponding scalability results. Furthermore, developers of TI-Dec-MDP, EDI-Dec-MDP, and EDI-CR algorithms emphasized aspects of performance other than scalability, and did not extend their algorithmic implementations to more than two agents (Becker et al., 2004a,b; Mostafa & Lesser, 2009; Petrik & Zilberstein, 2009). Nevertheless, this column provides a coarse overview of published achievements that motivates the development of scalable TD-POMDP solution methods.

Table 3.1 shows that, based on my analysis, the only problem classes that are more general than the TD-POMDP have not been shown to scale optimally beyond three agents. Additionally, many of the classes that are less general than the TD-POMDP have not been shown to scale beyond two agents. We also see that, with the exception of precedence relationships among methods, the TD-POMDP specification expresses the same types of exploitable structure as do any of the other subclasses.
Table 3.1: Comparison of Dec-POMDP subclasses

Model | Representational Restrictions | Exploitable Structure | Scalability Results To Date
OC-Dec-MDP | hierarchy of methods with fixed execution ordering, LFO | precedence relationships among methods | (approximate) 100+ methods
TI-Dec-MDP | TOI, LFO, structured utility decomposition | decoupled representation, locality of interaction | 2 agents
ND-POMDP | TOI, structured utility decomposition | decoupled representation, locality of interaction | (quality-bounded approximate) 7 agents
EDI-Dec-MDP | LFO, RI, event-driven NIE, structured utility decomposition | decoupled representation, explicit interaction features, locality of interaction | 2 agents
EDI-CR | LFO, event-driven NIE, structured utility decomposition | decoupled representation, explicit interaction features, locality of interaction | 2 agents
TD-POMDP | NIE, structured utility decomposition | decoupled representation, explicit interaction features, locality of interaction | *See Chapter 6*
DPCL | OI, structured utility decomposition | decoupled representation, explicit interaction features, locality of interaction | (approximate) 10 agents
Factored Dec-POMDP | – | locality of interaction | 3 agents
(LFO = local full observability, TOI = transition and observation independence, RI = reward independence, OI = observation independence, NIE = nonconcurrent interaction effects)

This result suggests that the TD-POMDP model defines a sweet spot where other models were too restrictive, not optimally scalable, or lacking in explicit representation of useful problem structure.

3.4.1.1 Opportunity Cost Dec-MDP (OC-Dec-MDP)

The OC-Dec-MDP (Beynier & Mouaddib, 2005; Marecki & Tambe, 2007) is a model that exploits specialized structure for planning the execution times of agents' activities, called methods. Strictly less general than the TD-POMDP, its representation is restricted in the following ways. First, the OC-Dec-MDP specifies a fixed ordering over agents' method executions, restricting the problem to one of determining only when to start each method. More general models such as the TD-POMDP can also represent the problem of determining which method to execute. Second, the observation function of the OC-Dec-MDP takes a special form such that each agent observes exactly the status of its own method executions. This particular observational restriction is a special case of local full observability (LFO) (Def. 2.8). The TD-POMDP observation function is less restrictive, allowing for partial observations that may depend upon the values of any features (locally-controlled or nonlocally-controlled) in their local states. The only kind of interaction that can be represented by an OC-Dec-MDP is a precedence constraint dictating that a method executed by one agent will only complete successfully if a particular method of some other agent has already completed. This is strictly more restrictive than the TD-POMDP, which can represent more complex method dependencies through the specification of nonlocal features. Nevertheless, the OC-Dec-MDP's method precedence relationships constitute powerful, though specialized, problem structure that researchers have exploited to compute approximate solutions to problems containing over a hundred methods (Marecki & Tambe, 2007).

3.4.1.2 Transition-Independent Dec-MDP (TI-Dec-MDP)

The TI-Dec-MDP (Becker et al., 2004b), like the TD-POMDP, represents factored state dynamics that are exploited so as to decouple the joint model into local decision models with local states, local transitions, and local rewards.
Unlike the TD-POMDP, the TI-Dec-MDP requires local full observability (Def. 2.8), restricting an agent to observe its local state features exactly at every time step. Furthermore, the TI-Dec-MDP requires transition and observation independence (Defs. 2.9-2.10), restricting an agent's local state transitions to be independent of other agents' local states and local actions. This property puts the TI-Dec-MDP in a different complexity class (NP instead of NEXP) than more general flavors of Dec-POMDPs (Allen, 2009). With the TD-POMDP model, local state transitions (as well as local observations) can depend on other agents' states and actions by way of structured nonlocal feature dependencies (Sec. 3.2.2). Alternatively, the TI-Dec-MDP models dependencies in the reward function, whereby particular combinations of agents' actions can result in additional reward or penalty.11 Like the TD-POMDP's nonlocal feature dependencies, the TI-Dec-MDP's reward dependencies emphasize agents' locality of interaction (a concept that I develop formally in Section 3.5.1), leading to a decomposition of the team's value function that can be exploited to yield efficiently-computable optimal solutions for problems with two agents (Becker et al., 2004b; Petrik & Zilberstein, 2009) and likely more, but scaling of TI-Dec-MDPs has never been demonstrated empirically. In contrast to the TD-POMDP, whose reward is composed of a summation of local rewards, the TI-Dec-MDP's reward consists of a summation of local rewards and of special reward dependency terms that account for agents' joint actions.

11 The TD-POMDP too can represent reward dependency, where agent i's actions affect agent j's rewards, by (1) modeling two versions of each reward-dependent local state in j's local state space, (2) assigning separate local rewards to transitions into these states, and (3) modeling a special nonlocal feature controlled by i that drives j's transitions into these states.

3.4.1.3 Network-Distributed POMDP (ND-POMDP)

The ND-POMDP (Nair et al., 2005) is also less general than the TD-POMDP in that it requires transition and observation independence like the TI-Dec-MDP, but unlike the TI-Dec-MDP it does not require local full observability. Moreover, instead of modeling individual reward dependencies, the ND-POMDP specifies a particular decomposition of the team utility into local neighborhood utilities. This utility structure, as with the TD-POMDP's transition dependency structure, enables an exploitation of locality of interaction by explicitly representing interactions among groupings of agents. Exploiting the ND-POMDP's locality of interaction along with its factored, decoupled representation has led to efficient optimal solution methods (Nair et al., 2005) and impressive scalability of quality-bounded solutions to teams of 7 agents (Varakantham et al., 2007; Marecki et al., 2008), not to mention efficiently-computed (unbounded) approximate solutions to even larger agent teams (Kumar & Zilberstein, 2009; Marecki et al., 2008).

3.4.1.4 Dec-MDP with Event-Driven Interactions (EDI-Dec-MDP)

The EDI-Dec-MDP (Becker et al., 2004a), whose description I elaborate in Appendix A, is perhaps the most closely-related model to the TD-POMDP. Like the TD-POMDP, the EDI-Dec-MDP decouples the joint model into (largely independent) local models tied together by structured transition dependencies that are made explicit by the problem specification.
Where the EDI-Dec-MDP differs is in its representation of the features involved in the transition dependencies. For an agent i whose transitions affect agent j, each EDI-Dec-MDP dependency relates the occurrence of an event, which is a transition of agent i's local state, to a dependent transition of agent j's local state. The dependency is modeled by agent j using an unobservable boolean variable denoting whether or not the event has occurred, which can be viewed as a special type of nonlocal feature (Def. 3.12) whose dynamics and observability are more restricted than those of the TD-POMDP. Like the TD-POMDP, this dependency involves nonconcurrent interaction effects (a concept introduced in Section 3.2.2 and formalized later in Section 3.4.3.1). The EDI-Dec-MDP also requires local full observability (such that all locally-controlled features are exactly observed), and reward independence (such that an agent's rewards are independent of all other agents' actions conditioned on the values of its locally-controlled features). Consequently, the EDI-Dec-MDP is less general than the TD-POMDP (as I derive in Appendix A) and also, for some problems that it can represent, more awkward to specify. In particular, for problems with temporally-uncertain interactions, the TD-POMDP can be specified with a single boolean nonlocal feature per interaction whereas the EDI-Dec-MDP requires several boolean dependency features (one for each time at which the interaction could occur) per interaction. Though its specification of dependencies captures some degree of locality of interaction, EDI-Dec-MDP solution methods have not been shown to scale beyond two agents (Becker, 2006).

3.4.1.5 Event-Driven Interactions with Complex Rewards (EDI-CR)

The EDI-CR model (Mostafa & Lesser, 2009), described by its authors as a hybrid of the TI-Dec-MDP and the EDI-Dec-MDP, represents both reward dependencies and event-driven transition dependencies. I argue that it is no more general than the EDI-Dec-MDP because reward dependencies could be implemented as transition dependencies by folding the additional rewards or penalties into transition-dependent local state outcomes.12 Further, it is more restrictive in its requirement of local full observability, and suffers from the same representational disadvantages as the EDI-Dec-MDP with respect to event-driven nonconcurrent interaction effects. State-of-the-art solution methods for the EDI-CR model support scaling to teams of more than two agents (Mostafa & Lesser, 2009) in theory; however, this has not yet been demonstrated empirically.13

12 Generality aside, there may be computational advantages to modeling certain types of reward dependencies with EDI-CR joint reward structures as opposed to transition dependencies, though this issue has never been explored in the literature, and is beyond the scope of my analysis.

13 As I describe when presenting empirical comparisons in Section 6.5, the only available implementation of an EDI-CR solution method is restricted to two-agent problems.

3.4.1.6 Distributed POMDP with Coordination Locales (DPCL)

Like the TD-POMDP, the DPCL (Varakantham et al., 2009) is geared towards exploiting structure and locality of interaction. It is less general than the TD-POMDP in that it requires observation independence (Def. 2.10), whereas the TD-POMDP allows an agent's observations to depend on another's actions. However, the DPCL is more general than the TD-POMDP in that it represents transition dependencies with concurrent effects using structures that the authors refer to as "same-time coordination locales". The presence of same-time coordination locales appears to complicate optimal planning. As of yet, researchers have not found a way to compute optimal best responses efficiently for these problems. Instead, the DPCL has only been shown to afford efficient heuristically-guided approximate solutions with no bounds on solution quality.
As I show in later chapters, by excluding problems with same-time coordination locales, the TD-POMDP model can be decoupled into efficiently-solvable, provably-optimal local best response models, thereby achieving both scalability and optimality. Additionally, I suggest in Section 3.4.3.1 how the TD-POMDP might be extended in future work so as to overcome the nonconcurrency restriction.

3.4.1.7 Factored Dec-POMDP

The factored Dec-POMDP, as studied by Oliehoek et al. (2008b), is fully general, capable of representing any Dec-POMDP problem, but additionally allowing exploitation of factored state, transition functions, and value functions. Oliehoek et al. demonstrate how reductions in the scope of each value function component (to a small subset of feature values and a small subset of agents) may be used to decompose and tractably approximate the (otherwise intractable) overall value function. As such, the factored Dec-POMDP specification emphasizes a locality of interaction inherent in the structure of the problem. By exploiting this structure, Oliehoek (2010) has demonstrated efficient computation of optimal solutions for 3-agent problems, not to mention computation of (unbounded) approximate solutions to problems with many more agents. The TD-POMDP is a special case of a factored Dec-POMDP whose factorization has the properties described in Sections 3.2.1-3.2.2. In effect, the factorization of world state into local states and the subsequent factorization of local state into locally-dependent and nonlocally-dependent components controls the scope of the TD-POMDP's factored value function. This particular factorization has the benefit of accommodating decoupled, efficiently-solvable local best response models (described formally in Section 4.2). However, the factorization requires that the TD-POMDP be restricted to nonconcurrent, pairwise agent interactions and a particular decomposition of the joint value function into a summation of local value functions. In Section 3.4.3, I suggest how these restrictions might be relaxed.

3.4.2 Communication

In addition to the models I enumerated in Section 3.4.1, researchers have developed special Dec-POMDP classes that explicitly represent communication actions: the Dec-POMDP-Com (Goldman & Zilberstein, 2003) and Com-MTDP (Pynadath & Tambe, 2002) are notable extensions to the general Dec-POMDP, but there are also communication models that extend some of the subclasses discussed above, such as the ND-POMDP-Comm (Tasaki et al., 2010) and the TI-Dec-MDP-Com (Goldman & Zilberstein, 2004). Using these models, agents plan what or when to communicate in addition to planning their usual actions. The purpose of such communication is to exchange information at runtime so as to synchronize agents' views. However, there is typically a cost associated with communicating, such that the simple communication policy that always communicates everything to everyone is not necessarily the optimal communication policy.
For the Dec-POMDP-Com and related communicative models, agents are said to employ direct communication14 (Goldman & Zilberstein, 2004) because they are explicitly selecting communication actions that broadcast information over a special communication channel. This is not the only means of exchanging information, however. When agents perform actions that affect each other's observations, this is referred to as indirect communication. For instance, one agent applies the brakes of a car, illuminating a brake light observed by another agent, implicitly communicating that there is a speed trap ahead. It turns out that, from the standpoint of representational power, these two forms of communication are equivalent; even though the Dec-POMDP-Com specifies communication actions and communication observations in addition to the usual Dec-POMDP actions and observations, it is no more general than the Dec-POMDP (Seuken & Zilberstein, 2008).15

14 The concept of direct communication is often referred to as "explicit communication", and indirect communication as "implicit communication", in other work (Seuken & Zilberstein, 2008).

15 Although no more general, models with direct communication include additional structure that may improve efficiency of planning communicative actions.

Likewise, the TD-POMDP implicitly includes agent communication, and is fully capable of representing decisions about what and when to communicate. Here, the channel over which information is communicated between agents is the set of nonlocal features (Def. 3.12). Each nonlocal feature is controllable by one agent, the speaker, and (partially) observable to another agent (the listener). Further, the TD-POMDP can model a noisy communications channel by specifying that agents receive only partial observations of these communication-specific nonlocal features. Similarly, the TD-POMDP can model communication cost (or lack thereof) through the specification of rewards associated with actions that set the nonlocal features. By instantiating nonlocal features as such, the problem designer may outfit a group of TD-POMDP agents with the desired communication capabilities.16

16 One could pose the question of what communication capabilities agents should be given. This is a question of modeling, and not one that I address directly in this dissertation. Instead, I assume that the TD-POMDP model's state features, observations, and rewards have been determined exogenously. However, the results that I present later on do shed some light on the issue of modeling. In particular, my theoretical analysis presented in Section 3.5 and my empirical results presented in subsequent chapters suggest the following: the more features that agents jointly observe and, by extension, the more communication capabilities that agents are given, the harder the problem of planning becomes. Intuitively, the more information agents share, the more strongly-coupled they are, and the more computationally expensive it becomes to perform optimal decoupled planning and reasoning. On the other hand, reducing agents' communication capabilities may lead to problems whose optimal solutions are of lower value.

3.4.3 Overcoming Representational Limitations

The comparison of the TD-POMDP with related models (Section 3.4.1) exposed restrictions that make it a less general representation than the subsuming Dec-POMDP class. Specifically, the TD-POMDP requires nonconcurrency of interaction effects and structured local utility decomposition. In the subsections that follow, I reintroduce each restriction and discuss the degree to which it is an inherent limitation of this work (on which the results of this thesis tightly hinge) rather than a detail imposed for the sake of convenience. I also suggest, for each, how it might be overcome in future work and speculate on the implications of doing so.
3.4.3.1 Concurrent Interaction Effects

In its present form, the TD-POMDP model disallows concurrent interaction effects by requiring that an agent's locally-controlled feature values be independent of the concurrent values of its nonlocal features (conditioned on its latest local action and local state). Depicted graphically in Figure 3.5, and restated mathematically in Equation 3.11 below, this nonconcurrency property is a consequence of the TD-POMDP's factored local transition function (Definition 3.14).

Pr(l̄_j^{t+1} | n̄_j^{t+1}, s_j^t, a_j^t) = Pr(l̄_j^{t+1} | s_j^t, a_j^t)    (3.11)

This means that the TD-POMDP cannot represent a problem in which a single feature's value at time t + 1 is dependent on the actions taken at time t by more than one agent. Each of agent i's locally-controlled features may depend upon only i's latest action (and no other agent's latest action), and each of i's nonlocal features, by definition, depends on at most one other agent's action. Thus, only a single agent initiates each TD-POMDP interaction. The reason that the TD-POMDP imposes this nonconcurrency property is that it facilitates the decoupling of the joint decision model into optimal local decision models.17

Example 3.21. A common toy example to illustrate Dec-POMDP dynamics, the cooperative box pushing problem (Kube, 1997; Oliehoek, 2010; Seuken & Zilberstein, 2007a), is centered around a concurrent interaction effect. Here, two agents navigate a two-dimensional grid and receive rewards for pushing objects from starting locations to goal locations. Among these objects are large boxes, whose movement requires both agents to push simultaneously from adjacent grid locations. The location of a box thereby depends upon the latest joint action (and not just a single agent's latest action).

17 As I develop in Chapter 4, optimal decoupling is achieved by augmenting each agent's local model with compact influence information. If agents are allowed to concurrently control a given feature, it is unclear how to distill one agent's individual influence on that feature and incorporate the respective nonlocal policy information into the other agent's optimal local decision model without exploding the computational complexity of the local model.

Strictly speaking, the TD-POMDP is incapable of modeling concurrent interaction effects. However, this does not rule out the possibility of transforming a problem with concurrent interaction effects into a TD-POMDP problem. For instance, we can model the box pushing problem from Example 3.21 as a TD-POMDP using two nonlocal features {agent-1-pushing, box-pushed-by-both-agents}. The trick is to introduce a 1 time step delay. Consider that agent 1 executes its push action at time step t, in turn setting agent-1-pushing to true at time t + 1, and agent 2 executes its push action at time step t + 1, in turn setting box-pushed-by-both-agents at time step t + 2, and causing the box to move.
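To make this delayed-interaction encoding more concrete, the following is a minimal sketch in Python. The two feature names come from the transformation described above, but the dictionary-based state encoding, the `step` function, and the `push`/`wait` action labels are assumptions introduced solely for this illustration; they are not part of the formal TD-POMDP specification.

```python
# Minimal sketch (illustrative assumptions, not the formal TD-POMDP spec):
# encoding the concurrent "both agents push" effect with a one-step delay
# via two boolean nonlocal features.

def step(state, action_1, action_2):
    """Advance the two nonlocal features by one (delayed) time step."""
    next_state = dict(state)
    # agent 1's push at time t sets 'agent-1-pushing' at time t+1
    next_state['agent-1-pushing'] = (action_1 == 'push')
    # agent 2's push at time t+1, together with 'agent-1-pushing' already
    # being true, sets 'box-pushed-by-both-agents' at time t+2
    next_state['box-pushed-by-both-agents'] = (
        state['agent-1-pushing'] and action_2 == 'push')
    return next_state

state = {'agent-1-pushing': False, 'box-pushed-by-both-agents': False}
state = step(state, 'push', 'wait')   # t:   agent 1 pushes
state = step(state, 'wait', 'push')   # t+1: agent 2 pushes
assert state['box-pushed-by-both-agents']   # the box moves at t+2
```

Under this encoding, each feature's transition depends on at most one agent's latest action, in keeping with Equation 3.11, at the cost of interpreting time step t + 1 as an intermediate decision step.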
Nonlocal feature box-pushed-by-both-agents does not depend on the agents' concurrent actions, but instead depends on the 2-step sequence of agent 1's action followed by agent 2's action. Adding a delay does change the problem, but it need not change the problem in any significant way if we consider time step t + 1 as an intermediate decision step whose real-world time value is arbitrarily close to that of time step t. Further investigation is needed to determine whether or not such a transformation is applicable to any general class of concurrent interaction effects, and to ensure that the solution returned by the equivalent TD-POMDP is realistically implementable. As future work, formalizing the transformation from a Dec-POMDP with (structured) concurrent interaction effects into a TD-POMDP would ensure that my theoretical results and influence-based policy formulation algorithms apply to a larger space of problems than I have carved out here, thereby expanding the scope of this dissertation's contributions.

3.4.3.2 Generalized Utility Structure

As indicated by Theorem 3.8, the TD-POMDP's joint value function is the sum of local values, each dependent on agent i's local rewards:

V(π) = Σ_{i=1}^{n} V_i(π)    (3.12)

Another way of stating this property is that the agents' quality aggregation function is summation, which is a common assumption shared by much of the cooperative planning literature (Becker et al., 2004a; Beynier & Mouaddib, 2005; Guestrin et al., 2001; Marecki & Tambe, 2009) and distributed constraint optimization literature (Atlas, 2009; Modi et al., 2005; Petcu & Faltings, 2005). Despite its frequent usage, this property makes the TD-POMDP more restrictive than the general Dec-POMDP model, which allows for arbitrarily complex aggregation of utility values.

Generalizing to a broader class of quality aggregation functions is the subject of future work. Intuitively, the crucial requirement of the TD-POMDP is that the value decomposes, such that the planning problem can be solved in a distributed, decentralized manner. The applicability and usefulness of the model should be less dependent on the actual aggregation function. For instance, I speculate that generalizing my solution framework to a broader space of monotonic quality aggregation functions is possible. However, this hypothesis has yet to be substantiated.

3.5 Weak Coupling

The TD-POMDP specification is quite general (as I argued in Section 3.4), but not all problems in its expressible repertoire are efficiently solvable (as I proved in Section 3.3). In this section, I endeavor to illuminate the tractable subspace consisting of problems that I call "weakly coupled", driven by the realization that the TD-POMDP actually emphasizes exploitable structure (summarized in Section 3.3.3) present in these tractable problems. The structure is significant in that it decouples the joint planning model into a set of local models that are bound together with the transition dependencies of agents' nonlocal features. By explicitly distinguishing those nonlocal features, and acknowledging the resulting factoring of the agents' transition and observation functions, a specification emerges that, given sufficient factored structure, is substantially more compact than the general Dec-POMDP specification. Likewise, the exploitation of TD-POMDP structure during planning is shown in later chapters to enable exponential gains in computational efficiency over prior methods.
Neither representational compactness nor computational efficiency is particularly surprising, as these advantages are well-documented consequences of factored models in single-agent decision-theoretic planning (Boutilier et al., 1999b; Guestrin et al., 2003). (The interaction structure made explicit by the TD-POMDP is a special case of factorization.) Less obvious, however, are the structural conditions under which the TD-POMDP representation is advantageous. At one extreme, a multiagent problem might involve no agent interaction (and hence no nonlocal features), represented in the TD-POMDP framework as a collection of completely independent single-agent POMDP models, translating to an exponential reduction in (worst-case) time and space complexity from the general Dec-POMDP representation. For problems with relatively sparse interactions, there may be something to gain by explicitly representing each nonlocal feature (as in the TD-POMDP) and exploiting the resultant conditional independencies. However, for problems with dense agent interactions, where agents are heavily dependent on one another, and where the number of nonlocal features is of the same order of magnitude as the number of world state features, it is not clear that we could gain anything from representing the problem as a TD-POMDP.

Intuitively, the amount of advantage (over the naïve Dec-POMDP representation) afforded by the TD-POMDP representation on any given planning problem depends on the weakness of agent coupling. Outside of the TD-POMDP model, researchers have previously used the term "weakly-coupled" (and alternatively, "loosely-coupled") to refer to multiagent sequential decision making problems under various structural properties that impose conditional independencies among agents' subproblems (Bernstein et al., 2001; Cavallo et al., 2006; Dolgov & Durfee, 2004b; Guo & Lesser, 2005; Meuleau et al., 1998; Mostafa & Lesser, 2009). However, in all of these works, the "weakly-coupled" qualifier specified either a binary classification (weakly-coupled versus not weakly-coupled) or a qualitative assessment. Instead, I pose the question "exactly how weakly coupled is a given problem?" in order to develop a quantitative scale that can be used to determine the degree to which advantageous structure is present and, ultimately, to predict the amount of computation needed to solve the problem. If accurate, a quantitative measure of weak coupling could be extremely useful for the meta-level control of systems with a variable amount of computational resources, where predicting the relative complexity of a problem could determine an appropriate allocation of resources for solving it. The measure could also be useful, when facing a set of problems with diverse problem sizes, in deciding whether a problem will be tractable to solve optimally or the system would be better off employing an approximate solution method.

In the subsections that follow, I consider several aspects that contribute to my formulation of a measure of weakness of coupling. For each aspect (in Sections 3.5.1-3.5.2), I describe its respective structural assumptions (referencing prior work where appropriate), relate it to the TD-POMDP specification, and formalize how it affects computational complexity, successively refining a bound on the worst-case time required to compute optimal solutions to the TD-POMDP.
In Section 3.5.3, I summarize the results and their ramifications, and in Section 3.5.4, I contrast my analysis with other work that relates problem structure to problem hardness.

3.5.1 Locality of Interaction

The first two aspects of weakness of coupling are both instances of a broader concept referred to in the literature as locality of interaction (Kim et al., 2006; Melo, 2008; Nair et al., 2005; Oliehoek et al., 2008b; Tasaki et al., 2010). Interaction (in its most general form) in Dec-POMDP problems consists of arbitrary dependence of one agent's rewards, state-action outcomes, and observations on other agents' actions. An interaction may be very broad in the sense that it involves all the agents' actions and all the features of the world state. Alternatively, an interaction may be relatively local, involving only a small subset of agents that interact through a small subset of features. Because the TD-POMDP explicitly distinguishes interaction features, each of which inherently links pairs of interacting agents together, it serves as a natural setting for formal analysis of locality of interaction. In the subsections that follow, I decompose locality of interaction into agent scope, which I refer to as the portion of the agent population on which an interaction depends, and state factor scope, which I refer to as the portion of state features on which an interaction depends. These two aspects affect the complexity of joint planning along orthogonal dimensions. Before delving into the details of how agent scope (Sec. 3.5.1.2) and state factor scope (Sec. 3.5.1.3) affect complexity, I establish the preliminary background context in Section 3.5.1.1, where I relate Dec-POMDP planning to constraint optimization.

3.5.1.1 Relationship to Constraint Optimization

The TD-POMDP complexity results that I present throughout this section rely on a reduction of the planning problem to a constraint optimization problem (COP). Here I briefly review the classical COP model as it was defined by Dechter (2003) and show how any Dec-POMDP can be mapped into an equivalent COP.

Definition 3.22. A constraint optimization problem (COP) is specified as a tuple C = ⟨X, D, C⟩, whose components and auxiliary notations I explicate:

• X = {x_1, ..., x_n} is a set of n variables,
• D = {D_1, ..., D_n} is comprised of the domain D_i of values assignable to each variable x_i,
• an assignment ā = ⟨a_1, ..., a_n⟩ specifies a value a_i ∈ D_i for each variable x_i,
• C = {C_1, ..., C_‖C‖} is a set of constraints, each taking the form of a cost function18 C_k with scope Q_k ⊆ {1, ..., n} such that C_k : ∏_{k∈Q_k} D_k ↦ {ℝ, ∞}, and the application of C_k to the restricted scope of an assignment ā is denoted C_k(ā) ≡ C_k(a_i, ∀i ∈ Q_k),
• the global cost of an assignment ā is C(ā) = Σ_{k=1}^{‖C‖} C_k(ā), and
• a solution to C is an assignment ā* that minimizes the global cost.

18 In my treatment of constraint optimization, I do not distinguish "hard" and "soft" constraints since both flavors can be represented without loss of generality using cost functions that map assignments to {real numbers, infinity}.

Constraint optimization (Def. 3.22) is a useful formulation in that it emphasizes particular problem structure. Each cost function is defined over a subset of variables, and all costs are aggregated via summation to yield the global cost. This structure is often depicted graphically using a constraint graph (Definition 3.23). An example of a constraint graph is shown in Figure 3.6.
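As a concrete companion to Definition 3.22, here is a minimal sketch of a COP in Python. The class name, the dictionary-based domains, and the lambda cost functions are illustrative assumptions rather than Dechter's formulation; the sketch only makes the tuple ⟨X, D, C⟩ and the summed global cost tangible.

```python
# Minimal COP sketch (illustrative assumptions, not Dechter's implementation):
# the tuple <X, D, C> and the summed global cost of Definition 3.22.

class COP:
    def __init__(self, domains, constraints):
        # domains: {variable: list of assignable values}       (X and D)
        # constraints: list of (scope tuple, cost function)     (C)
        self.domains = domains
        self.constraints = constraints

    def global_cost(self, assignment):
        # Global cost = sum of each constraint applied to the values of the
        # variables in its (restricted) scope.
        return sum(cost(*(assignment[v] for v in scope))
                   for scope, cost in self.constraints)

# Two binary variables, one soft constraint and one "hard" constraint encoded
# with infinite cost (footnote 18's convention).
cop = COP(domains={'x1': [0, 1], 'x2': [0, 1]},
          constraints=[(('x1', 'x2'), lambda a, b: 0 if a != b else 2),
                       (('x2',),      lambda b: float('inf') if b == 0 else 1)])
print(cop.global_cost({'x1': 0, 'x2': 1}))   # 1, the minimal global cost
print(cop.global_cost({'x1': 1, 'x2': 0}))   # inf: violates the hard constraint
```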
Definition 3.23. The constraint graph G_C for a COP C is an undirected hypergraph whose vertices V = {v_1, ..., v_n} contain a vertex v_i for each variable x_i ∈ X and whose edges E = {e_1, ..., e_‖C‖} contain a hyperedge e_k ⊆ {1, ..., n} for each constraint C_k ∈ C that connects the vertices {v_i, ∀i ∈ Q_k} indexed by the corresponding scope Q_k.

Figure 3.6: An example of a constraint graph (four variables x_1, ..., x_4 linked by constraints C_1, C_2, C_3; induced width w* = 2).

In turn, solution methods for COPs exploit the graphical structure. One such method, bucket elimination (Dechter, 1999), performs dynamic programming on an ordered constraint graph (Def. 3.24), traversing edges from the bottom of the graph to the top of the graph in order to eliminate individual variable assignments in an efficient manner (thereby avoiding consideration of superfluous combinations of variables' values). Bucket elimination is most efficient when the connectivity of the graph is sparse. For the purpose of evaluating bucket elimination, Dechter (2003) quantifies the sparsity of the constraint graph using a measure of induced width, the definition of which I review below.

Definition 3.24. An ordered graph ⟨G, d⟩ prescribes an ordering d = ⟨v_1, ..., v_n⟩ of the vertices V in G. In an ordered constraint graph, the parents of a vertex v_j denote those vertices preceding v_j in the ordering that are connected to v_j: parents(v_j) = {v_i ∈ V | (i < j) ∧ (∃e_k ∈ E) ∧ ({i, j} ⊆ e_k)}. The width w(G, d) of the ordered graph ⟨G, d⟩ denotes the maximum number of parents of any vertex: w(G, d) = max_{v_j ∈ V} ‖parents(v_j)‖.

Definition 3.25. The induced width w*(G, d) of an ordered graph is the width obtained by processing nodes from last to first, such that a node is processed by connecting its parents. The induced width w*(G) is the minimum induced width of any ordering of G_C: w*(G_C) = min_d [w*(G_C, d)].

The induced width of the constraint graph in Figure 3.6 is 2, which is the width of the ordered graph that orders the vertices ⟨x_1, x_2, x_3, x_4⟩. The details of Dechter's analysis (Dechter, 1999) are beyond the scope of this dissertation, but her results (which we will extend) are as follows. The worst-case time and space complexity of bucket elimination for constraint optimization is O(‖C‖ · ‖D_i^max‖^{w*+1}), when applied to a problem with a maximum variable domain size of ‖D_i^max‖ = max_{D_i ∈ D} ‖D_i‖, ‖C‖ constraints, and induced width w* (Dechter, 2003). Notice that the complexity is not a function of the number of variables. The asymptotic complexity is the same for problems with many variables as it is for problems with few variables as long as the induced width is equal, the maximum variable domain sizes are equal, and the number of constraints is bounded. Since complexity is exponential in the induced width, Dechter's analysis suggests that problems with a small induced width should be easier to solve. This general trend is intuitive: fewer agents connected by a single constraint means fewer combinations of behavior to consider in minimizing the constraint's cost. In my analysis of TD-POMDP complexity that follows, induced width will play a crucial role in characterizing a measure of weak coupling.
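To make Definitions 3.24-3.25 concrete, the sketch below computes the induced width of a given ordering and then minimizes over all orderings of a tiny hypergraph. The hyperedge-list encoding and function names are assumptions made for this illustration; practical solvers use ordering heuristics rather than exhaustive enumeration, and the example graph is not meant to reproduce Figure 3.6's exact topology.

```python
# Sketch: induced width of a constraint graph (illustrative only).
from itertools import permutations

def induced_width_of_ordering(n_vars, hyperedges, order):
    """Induced width of one ordering: process vertices last to first,
    connecting each vertex's earlier neighbors (its parents)."""
    pos = {v: k for k, v in enumerate(order)}
    neighbors = {v: set() for v in range(n_vars)}
    for edge in hyperedges:                 # neighbor sets induced by edges
        for v in edge:
            neighbors[v] |= set(edge) - {v}
    width = 0
    for v in reversed(order):
        parents = {u for u in neighbors[v] if pos[u] < pos[v]}
        width = max(width, len(parents))
        for u in parents:                   # connect the parents to each other
            neighbors[u] |= parents - {u}
    return width

def induced_width(n_vars, hyperedges):
    """Minimum induced width over all orderings (exponential; tiny graphs only)."""
    return min(induced_width_of_ordering(n_vars, hyperedges, list(p))
               for p in permutations(range(n_vars)))

# A small hypergraph with one binary and one ternary constraint.
edges = [(0, 1), (1, 2, 3)]
print(induced_width(4, edges))   # 2: the three-variable constraint forces width 2
```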
In a constraint optimization problem, the objective is to minimize the summation of local cost values of the variable assignments. For Dec-POMDP problems, the objective is to maximize the expected utility value of the joint policy. Additionally, for certain Dec-POMDP problems, and notably any TD-POMDP problem, the value function can be decomposed into component value functions. The inherent similarity between the two problems leads me to the following mapping of a Dec-POMDP to an equivalent COP:

Observation 3.26. A Dec-POMDP M, whose value function V(π = ⟨π_1, ..., π_n⟩) is decomposable into component value functions {V_1(π), ..., V_k(π)} such that V(π) = Σ_k V_k(π), may be reduced to a COP C_M = ⟨X, D, C⟩ with the following specification:

• X = {x_1, ..., x_n} contains exactly one variable x_i for each agent,
• the domain D_i of each variable x_i is the set of agent i's possible (deterministic) local policies Π_i,
• an assignment ā = ⟨a_1 ≡ π_1, ..., a_n ≡ π_n⟩ specifies a joint policy (i.e., a policy π_i ∈ Π_i for each agent),
• C = {C_1, ..., C_k} consists of a single cost function C_i for each component V_i of M's value function, each taking the form C_i(ā = π) = −V_i(π).

Using the mapping in Observation 3.26, a solution assignment for C_M equates to a joint policy that maximizes the Dec-POMDP's value function. In the case that the Dec-POMDP's value function does not decompose into component value functions, the COP will contain a single constraint (and the constraint graph a single edge) that connects all vertices. However, the benefit of the COP representation, its potentially-sparse graphical structure, is only present when the value function is decomposable into local component value functions with limited scopes. For the problems that I consider in this thesis, the TD-POMDP specification explicitly factors the joint value function into agents' local value functions (Def. 3.7). Each V_i() corresponds to a single agent's expectation of the summation of its local rewards. Thus, for a TD-POMDP, the equivalent COP will contain a single constraint for each agent. The scope of each constraint, equivalently the scope of each local value function, is not immediately obvious. In Section 3.5.1.2, I describe how the local value scopes can be extracted from the TD-POMDP specification.
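The following sketch illustrates Observation 3.26 under simplifying assumptions: local policies are enumerated explicitly, and each value component is supplied as a callable over the policies in its agent scope. The function names, the toy policies, and the brute-force solver are illustrative conveniences, not part of the formal reduction.

```python
# Sketch of Observation 3.26 (illustrative): one COP variable per agent, whose
# domain is that agent's local policy set, and one cost function per value
# component, C_i(assignment) = -V_i(joint policy).
from itertools import product

def decpomdp_to_cop(local_policy_sets, value_components):
    """local_policy_sets: {agent: list of local policies}
    value_components: list of (agent-scope tuple, V_i over those policies)."""
    domains = {agent: list(pols) for agent, pols in local_policy_sets.items()}
    constraints = [(scope, (lambda *pols, V=V: -V(*pols)))   # value -> cost
                   for scope, V in value_components]
    return domains, constraints

def solve_cop(domains, constraints):
    """Brute-force COP solver: minimize the summed cost (tiny problems only)."""
    agents = list(domains)
    best = None
    for values in product(*(domains[a] for a in agents)):
        assignment = dict(zip(agents, values))
        cost = sum(c(*(assignment[a] for a in scope)) for scope, c in constraints)
        if best is None or cost < best[1]:
            best = (assignment, cost)
    return best

# Toy instance: two agents with two candidate local policies each, and two
# value components whose agent scopes are {1} and {1, 2}.
policies = {1: ['stay', 'act'], 2: ['stay', 'act']}
components = [((1,),   lambda p1: 1.0 if p1 == 'act' else 0.0),
              ((1, 2), lambda p1, p2: 2.0 if (p1, p2) == ('act', 'act') else 0.5)]
print(solve_cop(*decpomdp_to_cop(policies, components)))
# -> ({1: 'act', 2: 'act'}, -3.0): the minimal-cost assignment is the joint
#    policy that maximizes V = V_1 + V_2.
```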
I am not the first to observe an equivalence between multiagent sequential decision making and constraint optimization. In the context of the TI-Dec-MDP (reviewed in Section 2.3.2), Becker et al. (2004b) propose a reformulation of their coverage set algorithm as one of distributed19 constraint optimization (DCOP) (Yokoo et al., 1998) with a mapping of variable domains to local policy sets identical to that given in Observation 3.26. The authors describe special constraints that correspond to the TI-Dec-MDP's structured reward dependencies between pairs of agents. Nair et al. (2005) map a related class of transition-independent problems that they call network-distributed POMDPs (ND-POMDPs) into DCOPs containing local neighborhood utility constraints. Like the mapping I suggest, their local neighborhood constraints each involve a restricted scope of agents that contribute to a given component of the joint value function. Nair and colleagues exploit the resulting locality of interaction in their development of locally-optimal and globally-optimal solution algorithms (Kim et al., 2006; Nair et al., 2005).

19 In distributed constraint optimization, the prefix "distributed" conveys a distribution of the COP specification among multiple agents as well as distribution of computational resources for solving the problem. In the case of the DCOP proposed by Becker et al. (2004b), each agent is charged with computing its own policy and communicating that policy to a subset of other agents that are linked to it via constraints.

Whereas these previous works both focus on classes involving agents that interact through the reward function but that are transition-independent, TD-POMDP agents interact through their transitions, complicating the analysis of how individual agents affect the joint value. Thus, the study of value-based COP constraints that I present here is intended to supplement these past works with more general analysis. More recently, Oliehoek et al. (2008b) describe an extension of local neighborhood constraints to the more general class of factored Dec-POMDPs (reviewed in Section 2.3.2). Instead of mapping the overall planning problem to a static COP, they analyze how the scope of local value functions changes over the course of policy execution, thereby uncovering a dynamic constraint structure (that is dependent not only on the factored structure of the value function, but also on the particular decision stage). The complexity analysis that I present in the following subsection is complementary in that it too allows for general constraints, but assumes a static COP (wherein variables represent complete policies) that could be extended to account for Oliehoek et al.'s dynamics.

3.5.1.2 Agent Scope

The COP reformulation from Observation 3.26 touched upon the significance of structure in the value function of multiagent planning problems. In particular, worst-case complexity is heavily dependent on the agent scope, or the number of agents whose decisions can affect a given component value function. To make this relationship more concrete, we now turn to the TD-POMDP, where a careful examination of the problem specification allows for a formalization of agent scope and further analysis of complexity. As detailed in Section 3.2.2, the effects that TD-POMDP agents have on one another are represented as nonlocal features shared by the agents' local states. It is through changes to these features that agents interact. We can depict their potential interactions graphically using an agent interaction digraph (as shown in Figure 3.7).

Definition 3.27. An agent interaction digraph20 D_M for TD-POMDP M is a directed graph with a vertex v_i representing each agent i. For any two agents i and j, there is an edge from v_i to v_j for each of agent i's locally-controlled features that is modeled as a nonlocal feature in agent j's local state representation.

Definition 3.28. An agent i's ancestors, denoted Λ(i), is the set of agents that correspond to ancestor vertices of vertex i in the interaction digraph. Formally, for any agent j ≠ i, if there is a directed path from v_j to v_i in the interaction digraph, then j ∈ Λ(i).

Definition 3.29. An agent i's descendants, denoted Ψ(i), is the set of agents that correspond to descendant vertices of vertex i in the interaction digraph. Formally, for any agent j ≠ i, if there is a directed path from v_i to v_j in the interaction digraph, then j ∈ Ψ(i).

Note that here and throughout, I use the word peer to mean "some other agent" without assuming or implying any graphical relationship between the agents. Alternatively, to indicate a graphical relationship, I will instead use the term ancestor or descendant. For a team of agents, the interaction digraph summarizes the interdependencies that exist between agents' activities. Each outgoing edge represents one attribute through which an agent can affect another agent. And each incoming edge represents one attribute through which an agent can be affected by its ancestors.
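As a small illustration of Definitions 3.27-3.29, the sketch below encodes an interaction digraph as a list of nonlocal-feature edges and computes ancestor and descendant sets by graph reachability. The edge-list encoding, the generic feature names, and the function names are assumptions introduced only for this example.

```python
# Sketch of Definitions 3.27-3.29 (illustrative encoding): the interaction
# digraph as (controlling agent, affected agent, nonlocal feature) edges,
# with ancestors/descendants computed by reachability.
from collections import defaultdict

def ancestors(edges, agent):
    """Lambda(agent): all agents with a directed path into `agent`."""
    incoming = defaultdict(set)
    for src, dst, _feature in edges:
        incoming[dst].add(src)
    found, frontier = set(), {agent}
    while frontier:
        preds = set().union(*(incoming[a] for a in frontier)) - found - {agent}
        found |= preds
        frontier = preds
    return found

def descendants(edges, agent):
    """Psi(agent): all agents reachable from `agent` (ancestors of the
    reversed digraph)."""
    return ancestors([(dst, src, f) for src, dst, f in edges], agent)

# Digraph in the spirit of Example 3.31: agent 1 affects agents 2 and 3
# through one nonlocal feature each (feature names are hypothetical).
edges = [(1, 2, 'n2'), (1, 3, 'n3')]
print(ancestors(edges, 3))    # {1}
print(descendants(edges, 1))  # {2, 3}
print(ancestors(edges, 2))    # {1}: agent 2 is unaffected by agent 3
```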
For any two agents i and j, there may be more than one edge leading from i to j, one for each nonlocal feature controlled by i and affecting j. Note also that there can be no self-loops (leading both out of and into any given vertex) in the interaction digraph because, by definition, a nonlocal feature n_i^x modeled by agent i is controlled only by another agent j (and not by i itself). The interaction digraph can, however, contain directed cycles containing two or more nodes.

20 This definition is adapted from that of Brafman and Domshlak's agent interaction digraph (Brafman & Domshlak, 2008) used for multi-agent classical planning and that of Guestrin's coordination graph (Guestrin et al., 2002) for multiagent MDP coordination. It can be viewed as a generalization of the former (Brafman's digraph), where each edge represents a special kind of activity dependence: the precondition of one agent's planning operator is fulfilled by another agent's operator. The edges in the digraph presented here encode more general state feature dependencies. It extends the latter (Guestrin's coordination graph) in that it explicitly labels edges with the dependent features, and contains one such edge for each dependent feature (whereas the coordination graph represents at most one edge from i to j without specifying the particulars of the dependence).

Figure 3.7: The interaction graph for Example 3.1.

As is the case with other graphical models (Dechter, 2003; Jordan, 1999; Koller & Friedman, 2009), the motivation for representing the TD-POMDP problem using an interaction digraph is to exploit structure in the connectivity of the graph. In the literature on multiagent planning under uncertainty, there has been a variety of work that exploits graphical interaction structure in specialized contexts (Beynier & Mouaddib, 2005; Brafman & Domshlak, 2008; Dolgov & Durfee, 2004a; Guestrin et al., 2002; Marecki & Tambe, 2009; Nair et al., 2005; Oliehoek et al., 2008b). Although these prior methods are not directly applicable here, we can make use of common concepts that have emerged.

The degree of agent coupling is directly related to the density of edges in the interaction digraph. In the extreme case of weak coupling, the agents are uncoupled, such that there are no nonlocal features and we are left with a graph of n unconnected vertices. An isolated vertex refers to an agent whose policy formulation problem is independent of all other agents' problems, since the agent cannot be influenced by others' choices, nor can its own choices influence its peers. Thus, in the uncoupled case, the optimal joint policy can be computed by simply combining the n optimal local policies, each computed independently of the others. In the extreme case of strong coupling, every locally-controlled feature in each agent's local state corresponds to a nonlocal feature in every other agent's state. Let there be n agents and k such local features per agent. In this case, all k · n features of the world state are contained in every agent's local state. The interaction digraph is not only fully-connected (with edges running in both directions between every pair of vertices), but each directed edge is duplicated k times, yielding a total of n · (n − 1) · k edges. Because each agent in the system is interacting with every other agent, no single agent's decision problem can be solved independently of any of its peers' decision problems (as in the uncoupled case).
Moreover, no single local feature can be reasoned about independently because its value affects peer agents' decisions. Both the unconnected and fully-connected cases are degenerate in the sense that the former conveys complete independence and the latter does not convey any conditional independence relationships whatsoever. The more interesting cases are those falling in between the two extremes, where there are some conditional independence relationships that can be taken advantage of, but the agents' activities are not completely independent. A simple example of one such case appears in Example 3.31 and Figure 3.8.

Definition 3.30. The agent scope, denoted Q_i, of a value function V_i() is the subset of agents on whose policies its value depends. Equivalently, V_i : ∏_{j∈Q_i} Π_j ↦ ℝ.

Example 3.31. Consider the interaction digraph in Figure 3.8, where two edges connect three agents. The first edge indicates that agent 1's activities can affect the activities of agent 2 and the second edge indicates that agent 1's activities can affect the activities of agent 3. Notice that there is no directed path leading from node 2 to node 3, implying that the outcomes of agent 2's activities cannot affect the outcomes of agent 3's activities. As such, agent 3's local value function V_3(π) is independent of agent 2's policy decisions and can be rewritten V_3(π_1, π_3). Likewise, V_2(π) ≡ V_2(π_1, π_2).

Figure 3.8: An example of exploitable interaction digraph structure.

Theorem 3.32. The only agents that can affect the values of the features in agent i's local state are i's ancestors Λ(i) and i itself.

Proof. Mathematically, Theorem 3.32 states: ∀t, Pr(s_i^{t+1} | a⃗^t) = Pr(s_i^{t+1} | a⃗_i^t, a⃗_Λ(i)^t), where a⃗^t is the joint action history, a⃗_i^t is agent i's local action history, and a⃗_Λ(i)^t is the history of actions performed by i's ancestors. This is a statement about conditional independence and, in particular, one that can be inferred from the DBN shown in Figure 3.5, which captures the conditional independencies among the TD-POMDP's factored state and action variables. It suffices to prove that, for any agent j ∉ ({i} ∪ Λ(i)), there cannot be a path of any length n ≥ 1 in the DBN leading from variable a_j^{t−n+1} to s_i^{t+1}. I prove this by induction on n.

Base Case (n = 1): By Definition 3.14, the only action variables whose edges lead into s_i^{t+1} are actions a_i and those actions of other agents who control i's nonlocal features (who are thus necessarily digraph ancestors of i by Definition 3.27). Since j ∉ ({i} ∪ Λ(i)), there is no path leading from j to i.

Inductive Hypothesis (IH): For any agent j ∉ ({i} ∪ Λ(i)), there cannot be a path of length n in the DBN leading from variable a_j^{t−n+1} to s_i^{t+1}.

Inductive Step: Here, we assume IH and deduce that, for any agent k ∉ ({i} ∪ Λ(i)), there cannot be a path of length n + 1 in the DBN leading from variable a_k^{t−n} to s_i^{t+1}.
Let us assume that this deduction is false, or in other words, that there exists an agent k ∉ ({i} ∪ Λ(i)) for which a path of length n + 1 in the DBN leads from variable a_k^{t−n} to s_i^{t+1}. If this is the case, a_k^{t−n} connects to another variable from which there is a path of length n to s_i^{t+1}, which, according to the DBN from Figure 3.5, must be a state feature variable f^{t−n+1}. There are two possibilities for f:

1. f ∈ s_i: In this case, there is a path of length n from s_i^{t−n+1} to s_i^{t+1}, and a path of length 1 leading from a_k^{t−n} to s_i^{t−n+1}. Since we were assuming that k ∉ ({i} ∪ Λ(i)), this directly contradicts the result derived in our base case.

2. f ∉ s_i: If there is such a path, we can deduce that f must be a nonlocal feature, because the only DBN connections between two separate agents' state features are by way of nonlocal features. Further, we can deduce that f must be controlled by k and is modeled by some other agent j. Using the same logic, we can also deduce that j is in turn controlling a nonlocal feature, and that there must therefore also be a path of length n from a_j^{t−n+1} to s_i^{t+1}. (Note that j ∉ ({i} ∪ Λ(i)): since f is modeled by j but f ∉ s_i, j ≠ i; and if j were an ancestor of i, then k, which controls one of j's nonlocal features and is thus an ancestor of j, would also be an ancestor of i, contradicting k ∉ Λ(i).) This contradicts our inductive hypothesis.

Having derived a contradiction in both cases, the inductive step must be correct. Therefore, the only agents that can affect the values of the features in agent i's local state are i's ancestors Λ(i) and i itself.

Theorem 3.33. For TD-POMDP agent i, the agent scope Q_i of i's local value function V_i() does not include agents outside of i and i's ancestors: Q_i ⊆ ({i} ∪ Λ(i)).

Proof. By Definition 3.7, V_i(π) is an expectation of the summation of agent i's local rewards, each of the form R_i(s_i^t, a_i^t, s_i^{t+1}), and is thus a function of i's local state s_i and local action a_i. It suffices to prove that the only agents that can affect i's local state and local action are ({i} ∪ Λ(i)):

1. By Theorem 3.32, the only agents that affect i's local state are ({i} ∪ Λ(i)).

2. Agent i's local actions are dictated by agent i's policy π_i, which is a mapping of agent i's local observation history o⃗_i^t to local action a_i^t. By Definition 3.5, o⃗_i^t can only depend on past values of agent i's local state. By Theorem 3.32, the only agents that affect i's local state are ({i} ∪ Λ(i)). Thus, the only agents that can affect agent i's actions are ({i} ∪ Λ(i)).

Therefore, V_i(π) = V_i(π_i, π̄_Λ(i)), and equivalently Q_i ⊆ ({i} ∪ Λ(i)).

When the agent scope is reduced (from the set of all agents), it implies a conditional independence that can be exploited during individual agent reasoning.

Example 3.31 (continued). A practical consequence of the reduced agent scope in the example problem (Figure 3.8) is that, given agent 1's planned decisions, agent 3's decisions may be planned independently of agent 2's decisions (without sacrificing optimality of the agents' planned policies). In other words, if agent 3 knows the policy decisions of agent 1, agent 3 does not need to reason about agent 2's decisions in order to plan its local policy component of the optimal joint policy. Nor does agent 2 need to reason about agent 3's decisions. In other words, agent 2's policy decisions are conditionally independent of agent 3's policy decisions given agent 1's policy.

The definitions and theorems that follow formalize the conditional independence relationships contained in the TD-POMDP agent interaction digraph and their implications on the multi-agent planning problem.
Definition 3.34. An agent i is conditionally decision-independent of peer agent j conditioned on the decisions of $c \in \{0, 1, \dots, n-2\}$ other agents $K = \{k_1, \dots, k_c\}$ if, in maximizing^21 the joint value of the team, i's optimal decisions do not differ depending on j's decisions (given any fixed settings of K's decisions): $\forall \{\pi_j^x, \pi_j^y\} \subset \Pi_j,\ \forall \pi_{k_1}, \dots, \pi_{k_c}$,

$$\mathbf{arg\,max}_{\pi_i \in \Pi_i} V(\pi_i, \pi_j^x, \pi_{k_1}, \dots, \pi_{k_c}) = \mathbf{arg\,max}_{\pi_i \in \Pi_i} V(\pi_i, \pi_j^y, \pi_{k_1}, \dots, \pi_{k_c})$$

^21 Here, I notate maximization with $\mathbf{arg\,max}_x f(x)$ (in bold), which returns the set of all values of argument x that achieve the maximal value of the expression f(x) (to which arg max is applied). Elsewhere, I use the notation arg max (in plain, not bold, text) to refer to the maximization that returns a single maximizing argument instead of a set.

The equation in Definition 3.34 contains terms of the following form, with which agent i can compute local policies that maximize the team's joint value given candidate policies of i's peers:

$$\pi_i^*(\pi_j, \dots) = \arg\max_{\pi_i \in \Pi_i} V(\pi_i, \pi_j, \dots) \qquad (3.13)$$

One such maximizing argument setting, $\pi_i^*(\pi_j, \dots)$, is commonly referred to as agent i's best response to peer policies $\pi_{\neq i} = \{\pi_j, \dots\}$. This is the same paradigm that was reviewed in the discussion of decoupled solution methodologies (Sec. 2.3.3). As we will see, conditional decision-independence may be exploited within a decoupled solution method to substantially reduce the computational complexity of optimal planning.

Theorem 3.35. If (1) there is no directed path connecting (distinct) nodes i and j in the interaction digraph and (2) nodes i and j share no common descendants, then i is decision-independent of agent j conditioned on i's ancestors' policies $\bar{\pi}_{\Lambda(i)}$.

Proof. By definition, the theorem states that $\forall \pi_j^x, \pi_j^y, \forall \bar{\pi}_{\Lambda(i)},\ \mathbf{arg\,max}_{\pi_i} V(\pi_i, \pi_j^x, \bar{\pi}_{\Lambda(i)}) = \mathbf{arg\,max}_{\pi_i} V(\pi_i, \pi_j^y, \bar{\pi}_{\Lambda(i)})$. Recall, from Theorem 3.8, that the value function is composed of local value functions: $V(\pi_1, \dots, \pi_n) = \sum_i V_i(\pi_1, \dots, \pi_n)$. By the monotonicity of summation, it suffices to prove that for each local value function $V_k()$, $\forall \pi_j^x, \pi_j^y, \forall \bar{\pi}_{\Lambda(i)},\ \mathbf{arg\,max}_{\pi_i} V_k(\pi_i, \pi_j^x, \bar{\pi}_{\Lambda(i)}) = \mathbf{arg\,max}_{\pi_i} V_k(\pi_i, \pi_j^y, \bar{\pi}_{\Lambda(i)})$.

• Case A (k = i): By Theorem 3.33, $V_i : \Pi_i \times \Pi_{\Lambda(i)} \mapsto \mathbb{R}$. Clause 1 of Theorem 3.35 states that there is no path connecting i and j. Thus, $j \notin \Lambda(i)$, and consequently, $V_{k=i}()$ is independent of $\pi_j$. Trivially, $\forall \pi_j^x, \pi_j^y, \forall \bar{\pi}_{\Lambda(k)},\ \mathbf{arg\,max}_{\pi_i} V_k(\pi_i, \pi_j^x, \bar{\pi}_{\Lambda(i)}) = \mathbf{arg\,max}_{\pi_i} V_k(\pi_i, \bar{\pi}_{\Lambda(i)}) = \mathbf{arg\,max}_{\pi_i} V_k(\pi_i, \pi_j^y, \bar{\pi}_{\Lambda(i)})$.

• Case B ($k \in \Psi(i)$): Here, conversely, agent i is an ancestor of k ($i \in \Lambda(k)$), and hence $V_k()$ may depend upon $\pi_i$. By clause 2 of Theorem 3.35, i and j cannot share any descendants, so $j \notin \Lambda(k)$. Thus, $V_k()$ is independent of $\pi_j$, and $\forall \pi_j^x, \pi_j^y, \forall \bar{\pi}_{\Lambda(i)},\ \mathbf{arg\,max}_{\pi_i} V_k(\pi_i, \pi_j^x, \bar{\pi}_{\Lambda(i)}) = \mathbf{arg\,max}_{\pi_i} V_k(\pi_i, \bar{\pi}_{\Lambda(k)-i}) = \mathbf{arg\,max}_{\pi_i} V_k(\pi_i, \pi_j^y, \bar{\pi}_{\Lambda(i)})$.

• Case C (otherwise): Given that neither Case A nor Case B applied, it must be that $i \neq k$ and $i \notin \Lambda(k)$. By Theorem 3.33, $V_k()$ must be independent of $\pi_i$. Thus, $\forall \pi_j^x, \pi_j^y, \forall \bar{\pi}_{\Lambda(i)},\ \mathbf{arg\,max}_{\pi_i} V_k(\pi_i, \pi_j^x, \bar{\pi}_{\Lambda(i)}) = \mathbf{arg\,max}_{\pi_i} V_k(\pi_i, \pi_j^y, \bar{\pi}_{\Lambda(i)})$.

Having derived the necessary arg max equality for all of the local value components $\{V_k\}$ of the joint value function $V()$, it must in turn hold for $V()$ because the joint value composition function, summation, preserves the order (including the arg max) of input values for each parameter. Therefore, given clauses 1 and 2 of the theorem, i is decision-independent of agent j conditioned on the decisions of the agents indexed by $\Lambda(i)$.
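Both clauses of Theorem 3.35 are purely graph-theoretic, so candidate pairs of decision-independent agents can be screened for directly from the interaction digraph. A minimal sketch of such a test, under the same dictionary-based digraph encoding as above:

```python
from typing import Dict, Set

def descendants(digraph: Dict[int, Set[int]], start: int) -> Set[int]:
    """All agents reachable from `start` along directed interaction edges."""
    seen, frontier = set(), list(digraph[start])
    while frontier:
        node = frontier.pop()
        if node not in seen:
            seen.add(node)
            frontier.extend(digraph[node] - seen)
    return seen

def satisfies_theorem_3_35(digraph: Dict[int, Set[int]], i: int, j: int) -> bool:
    """True when (1) no directed path connects i and j and (2) i and j share
    no common descendants, so i is decision-independent of j conditioned on
    the policies of i's ancestors."""
    desc_i, desc_j = descendants(digraph, i), descendants(digraph, j)
    no_path = (j not in desc_i) and (i not in desc_j)
    return no_path and desc_i.isdisjoint(desc_j)

digraph = {1: {2, 3}, 2: set(), 3: set()}
assert satisfies_theorem_3_35(digraph, 2, 3)       # Example 3.31: 2 and 3 decouple
assert not satisfies_theorem_3_35(digraph, 1, 2)   # 1 directly affects 2
```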
Given the preceding characterization of TD-POMDP agent scope and the accompanying conditional independencies, we can now relate it back to the COP mapping developed in Section 3.5.1.1. Recall that maximizing the joint value of the TD-POMDP is equivalent to minimizing the sum of costs of particular constraints pertaining to component value functions. Converting the TD-POMDP into a COP thus involves creating a constraint $C_i$ for each local value function $V_i()$. Each constraint $C_i$ constrains those variables that correspond to the policies of the agents in the respective agent scope $Q_i$. Similarly, the constraint graph for the mapped COP includes, for each local value function, a hyperedge linking together those vertices in the respective agent scope.

There are similarities between the constraint graph and the interaction digraph, but also notable differences. Like the TD-POMDP interaction digraph, the constraint graph contains a single vertex $x_i$ for each agent i. But whereas the interaction digraph contains a directed edge for each nonlocal feature connecting a pair of vertices, the constraint graph contains an undirected hyperedge $C_i$ (connecting 1, 2, or perhaps more vertices) for each local value function $V_i$. As dissimilar as the connections in the two representations may appear, the translation from interaction digraph to equivalent constraint graph is straightforward. By Theorem 3.33, in general, the agent scope of a local value function $V_i()$ includes i and $\Lambda(i)$. Thus, for each agent i, there is a hyperedge $C_i$ in the constraint graph connecting i and i's (interaction digraph) ancestors. To illustrate this translation, the interaction digraphs for Examples 3.1 and 3.31 are shown alongside their corresponding constraint graphs in Figure 3.9.

Figure 3.9: Examples of COP constraint graphs derived from interaction digraphs. [The figure pairs the digraph of Example 3.31 with its constraint graph (induced width $\omega = 1$, complexity $O(\|\Pi_i\|^2)$) and the digraph of Example 3.1 with its constraint graph (induced width $\omega = 2$, complexity $O(\|\Pi_i\|^3)$).]

Definition 3.36. The induced width $\omega$ of a TD-POMDP M denotes the induced width (Def. 3.25) of its (unordered) equivalent constraint graph $G_M$ (converted from the interaction digraph $D_M$).

There are also useful relationships to be drawn between the induced width and agents' scope sizes. In general, $\omega \geq (\max_k \|Q_k\| - 1)$ (which follows from the definitions of $\omega$ and $Q_k$). I have observed $\omega \approx (\max_k \|Q_k\| - 1)$ to be a robust estimate of induced width for a wide variety of interaction digraph topologies (some of which are shown in Figure 3.9). Although there do exist instances for which $\omega > (\max_k \|Q_k\| - 1)$, as we will see later in Observation 3.37, approximating induced width from agent scope enables the subsequent approximation of TD-POMDP complexity without the need to first convert the TD-POMDP interaction digraph to a constraint graph and evaluate its induced width.

Ultimately, gaining the advantages of reduced agent scope size (and reduced width in the COP reformulation) requires the use of a decoupled solution methodology (such as is reviewed in Section 2.3.3) wherein each agent computes its local component of the joint policy through a series of best response calculations in response to candidate peer policies.
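The translation just described, together with the $\max_k \|Q_k\| - 1$ approximation, is easy to operationalize. The sketch below (reusing the dictionary-based digraph encoding) builds the hyperedges $\{i\} \cup \Lambda(i)$ and reports the scope-based width estimate rather than an exact induced-width computation.

```python
from typing import Dict, List, Set

def ancestors(digraph: Dict[int, Set[int]], i: int) -> Set[int]:
    parents = {j: {k for k, kids in digraph.items() if j in kids} for j in digraph}
    found, frontier = set(), list(parents[i])
    while frontier:
        j = frontier.pop()
        if j not in found:
            found.add(j)
            frontier.extend(parents[j] - found)
    return found

def constraint_hyperedges(digraph: Dict[int, Set[int]]) -> List[Set[int]]:
    """One undirected hyperedge C_i per local value function V_i, linking
    agent i with its interaction-digraph ancestors."""
    return [{i} | ancestors(digraph, i) for i in digraph]

def estimated_width(digraph: Dict[int, Set[int]]) -> int:
    """Scope-based estimate omega ~= max_k ||Q_k|| - 1; the true induced
    width of the constraint graph can only be at least this large."""
    return max(len(edge) for edge in constraint_hyperedges(digraph)) - 1

print(estimated_width({1: {2, 3}, 2: set(), 3: set()}))   # Example 3.31 -> 1
```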
While searching through the space of policies, decision independence allows an agent to avoid reasoning about a peer's policy decisions, effectively pruning an entire cross-section of the joint policy space at no cost, as is the case for the problem described in Example 3.31 and Figure 3.8.

Example 3.31 (continued). Returning to the problem shown in Figure 3.8, we can deduce from the interaction digraph (by Theorem 3.35) that agents 2 and 3 are decision-independent of one another conditioned on agent 1's policy. By definition, this means that given a policy $\pi_1$ selected by agent 1, the optimal policy of agent 2 does not depend upon the policy chosen by agent 3 (and vice versa):

$$\forall \pi_3^x, \pi_3^y, \forall \pi_1,\ \arg\max_{\pi_2} V(\pi_1, \pi_2, \pi_3^x) = \arg\max_{\pi_2} V(\pi_1, \pi_2, \pi_3^y) = \arg\max_{\pi_2} V(\pi_1, \pi_2) = \arg\max_{\pi_2} \left[ V_1(\pi_1) + V_2(\pi_1, \pi_2) \right] = \arg\max_{\pi_2} V_2(\pi_1, \pi_2)$$

$$\forall \pi_2^x, \pi_2^y, \forall \pi_1,\ \arg\max_{\pi_3} V(\pi_1, \pi_3, \pi_2^x) = \arg\max_{\pi_3} V(\pi_1, \pi_3, \pi_2^y) = \arg\max_{\pi_3} V(\pi_1, \pi_3) = \arg\max_{\pi_3} \left[ V_1(\pi_1) + V_3(\pi_1, \pi_3) \right] = \arg\max_{\pi_3} V_3(\pi_1, \pi_3)$$

Notice that each equation has been expanded to show the local value functions with their reduced agent scopes. The two equalities describe a simple brute-force method for computing the optimal joint policy. For each policy $\pi_1$ of agent 1, agent 2 can compute its part of the optimal joint policy by evaluating each of its policies in conjunction with $\pi_1$ and creating a vector of best responses $\langle \pi_2^*(\pi_1) = \arg\max(\dots) \rangle$ of length $\|\Pi_1\|$. Here, agent 2 need only maintain a single best response per policy $\pi_1$ of agent 1.^22 This computation involves $\|\Pi_1\| \cdot \|\Pi_2\|$ policy evaluations. Similarly, agent 3 can compute its part of the optimal joint policy with another $\|\Pi_1\| \cdot \|\Pi_3\|$ policy evaluations, producing a vector of best responses $\langle \pi_3^*(\pi_1) \rangle$ of length $\|\Pi_1\|$. Finally, the optimal joint policy can be determined by iterating through the two vectors and selecting the $\pi_1^*$ (and associated $\pi_2^*(\pi_1^*)$ and $\pi_3^*(\pi_1^*)$) that maximize the joint utility value $[V_1(\pi_1) + V_2(\pi_1, \pi_2) + V_3(\pi_1, \pi_3)]$. When all is said and done, the agents will have computed the optimal joint policy using just $O(\|\Pi_i\|^2)$ policy evaluations. In contrast, the worst-case complexity of a 3-agent problem (in the absence of exploitable structure) is $O(\|\Pi_i\|^3)$ (as per Observation 3.20).

^22 Recall that arg max (without the bold font) returns a single maximizing argument. Here, the overarching problem is the computation of a single optimal joint policy, not all optimal joint policies. Moreover, the reason that agent 2 can get away with maintaining a single best response per $\pi_1$ is that $\pi_2$ does not appear at all outside of the term $V_2(\pi_1, \pi_2)$, indicating that maximizing $V_2(\pi_1, \pi_2)$ once and for all, with any arbitrary best response, will maximize the joint value function $V()$. Given two choices of best response $\pi_2^{*x}(\pi_1)$ and $\pi_2^{*y}(\pi_1)$ for which $V_2(\pi_1, \pi_2^{*x}) = V_2(\pi_1, \pi_2^{*y})$, there is no valuation that would differentiate $\pi_2^{*x}(\pi_1)$ and $\pi_2^{*y}(\pi_1)$.
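A minimal sketch of this two-vector scheme, assuming the local value functions are available as lookup tables over illustrative (hypothetical) policy labels:

```python
# Hypothetical toy value tables standing in for V1(pi1), V2(pi1, pi2), V3(pi1, pi3).
V1 = {"a": 1.0, "b": 0.0}
V2 = {("a", "x"): 4.0, ("a", "y"): 2.0, ("b", "x"): 1.0, ("b", "y"): 5.0}
V3 = {("a", "p"): 3.0, ("a", "q"): 1.0, ("b", "p"): 0.0, ("b", "q"): 2.0}
Pi1, Pi2, Pi3 = ["a", "b"], ["x", "y"], ["p", "q"]

# Best-response vectors: |Pi1|*|Pi2| + |Pi1|*|Pi3| policy evaluations.
br2 = {p1: max(Pi2, key=lambda p2: V2[(p1, p2)]) for p1 in Pi1}
br3 = {p1: max(Pi3, key=lambda p3: V3[(p1, p3)]) for p1 in Pi1}

# Final pass over agent 1's policies selects the jointly optimal combination.
def joint_value(p1):
    return V1[p1] + V2[(p1, br2[p1])] + V3[(p1, br3[p1])]

p1_star = max(Pi1, key=joint_value)
print(p1_star, br2[p1_star], br3[p1_star], joint_value(p1_star))
```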
The solution process described in Example 3.31 is a simplified execution trace of the bucket elimination algorithm (introduced in Section 3.5.1.1) for solving constraint optimization problems (Dechter, 2003). A more detailed algorithmic description of Bucket Elimination appears later on in Chapter 6 (where I develop an extension for solving TD-POMDP problems).

The complexity reduction in Example 3.31 afforded by Bucket Elimination relies on a very simple 3-agent graph structure and will certainly not hold for all 3-agent TD-POMDPs. However, we can generalize our analysis by making use of the complexity result cited in Section 3.5.1.1. Recall that the worst-case time and space complexity of Bucket Elimination for constraint optimization is $O(c \cdot \|X_i^{max}\|^{\omega^*+1})$ when applied to a problem with a maximum variable domain size of $\|X_i^{max}\|$, c (hard or soft) constraints, and induced width $\omega^*$ (Dechter, 2003). Instantiating each COP metric with the corresponding details from the TD-POMDP specification results in the following observation.

Observation 3.37. The worst-case time and space complexity of solving a TD-POMDP problem is bounded by $O(n \cdot \|\Pi_i^{max}\|^{\omega+1})$, where n is the number of agents, $\Pi_i^{max}$ is the largest policy space of any agent, and $\omega$ is the induced width of the interaction digraph (Def. 3.36).

As indicated in Figure 3.9, the computed complexity $\|\Pi_i\|^2$ of solving Example 3.31 is just as predicted by Observation 3.37. Even more substantial is the reduction in complexity of Example 3.1 (whose interaction digraph is shown in Figure 3.9), which has been tightened to $O(\|\Pi_i\|^3)$, down from $O(\|\Pi_i\|^7)$ when ignoring the problem's locality of interaction.

3.5.1.3 State Factor Scope

Locality of interaction manifests itself in a reduction of the scope of dependence of individual agent subproblems. As my theoretical results suggest thus far, the fewer the number of dependent agents (i.e., the smaller the agent scope), the simpler planning joint behavior becomes. An analogous statement can be made about the number of dependent world state features. The state factor scope (Guestrin et al., 2001; Oliehoek et al., 2008b) refers to the subset of world state features on which an agent's local value depends.^23 Regardless of the number of agents involved, the state factor scope controls the complexity of each individual agent's reasoning.

^23 Here I refer to state factor scope in the particular context of agents' local value functions. Note, however, that it is a more general mathematical concept that can also be used to characterize other functions (Guestrin et al., 2001; Oliehoek et al., 2008b).

To better understand how state factor scope affects the complexity of planning, consider the derivation of the previous complexity result in Section 3.5.1.2, which was based upon a reduction to a constraint optimization problem. Solving the equivalent COP involved evaluations of the form $\arg\max_{\pi_j} V(\pi_i, \pi_j, \dots)$ (as in Equation 3.13) that constitute agent j's best-response calculation to potential policy $\pi_i$ of peer agent i. The complexity result in Observation 3.37 assumes a naïve algorithm for performing this calculation: enumeration of all of agent j's policies and, for each, an explicit evaluation of its value in combination with $\pi_i$. In a classical COP, enumeration of variable domains would be the only way to compute a best response. However, the COP that we are solving involves structured variable domains containing TD-POMDP policies. Instead of using simple enumeration, an agent can calculate its best response by solving a special POMDP model seeded with peers' policy information (like the one that I develop later on in Section 4.2).

For a weakly-coupled TD-POMDP agent, its local best-response model does not necessarily need to represent all world state features. Intuitively, there may be world features that have no bearing on the value ascribed to the agent's own behavior.

Example 3.38. For instance, in Example 3.31, whose TD-POMDP transition structure is shown in Figure 3.10, whether or not Task F is enabled (encoded by feature "Task-F-enabled", appearing in agent 1's local state) has no bearing on agent 2's computation of best response $\pi_2^*(\pi_1)$.
Using Definition 3.39 below, we say that feature "Task-F-enabled" is not in agent 2's state factor scope.

Definition 3.39. An agent i's state factor scope $X_i$ is a minimal set of features that are sufficient for i to represent and reason about when computing a best response.

Definition 3.39 refers to a minimal set of features because, as I describe later on in Section 4.2, there are various flavors of the best-response model that could be used to compute the same best response but that represent different feature sets. Here, I am most interested in those that exploit weakly-coupled problem structure by reducing their modeled set of features as much as possible.

Figure 3.10: The TD-POMDP description for Example 3.31. [The figure shows each agent's actions, observations, and local state features (task statuses, enablement features, and the synchronized clock), along with the nonlocal and mutually-modeled features that connect the agents' transitions from time t to t + 1.]

In addition to the set of features in the state factor scope, we can also discuss their domains. The state factor scope magnitude is particularly useful because it serves as a measure of the size of the state space of the best-response POMDP model.

Definition 3.40. An agent's state factor scope magnitude $\mathcal{X}_i$ is the product of the sizes of the domains of the features in $X_i$.

Given this additional structure, we can refine our bound on the computational complexity of the TD-POMDP.

Theorem 3.41. The worst-case time and space complexity of solving a TD-POMDP problem is bounded by $O(n \cdot \mathrm{EXP}(\mathcal{X}_i^{max}) \cdot \|\Pi_i^{max}\|^{\omega})$, where n is the number of agents, $\mathcal{X}_i^{max}$ is the largest state factor scope magnitude of any agent, $\Pi_i^{max}$ is the largest policy space of any agent, and $\omega$ is the induced width of the interaction digraph (Def. 3.36).

Proof. The derivation of the complexity result from Observation 3.37 entails every best response computation requiring an $\arg\max_{\pi_i}$ to be taken, enumerating the local policy space bounded by $\|\Pi_i^{max}\|$, for all combinations of policies of $\omega$ peers, yielding $\|\Pi_i^{max}\| \cdot \|\Pi_i^{max}\|^{\omega}$ for each of the n agents. By replacing each best response with one POMDP solution, we can substitute the first term $\|\Pi_i^{max}\|$ of the complexity computation with the complexity of solving a finite-horizon best-response POMDP, which is known to be $O(\mathrm{EXP}(\|S\|)) = O(\mathrm{EXP}(\mathcal{X}_i^{max}))$ (Bernstein et al., 2002), given that the state space is bounded by $\mathcal{X}_i^{max}$ (by Definitions 3.39–3.40).

For the TD-POMDP, it turns out that there is a concrete measure of state factor scope encoded in the problem specification. As I derive later on in Section 4.2, there exists a best-response model for any given TD-POMDP that requires agent i to consider only (1) the values of features from its local state (Definition 3.12) and (2) the history of values of features from its mutually-modeled feature set (Definition 3.13). As such, for TD-POMDP problems, the scope magnitude can be replaced in Theorem 3.41 with $\mathcal{X} = \|S_j\| \, \|M_j\|^{T-1}$.
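As a concrete illustration of how these domain sizes multiply, the following sketch reproduces the accounting that Example 3.42 (below) performs for agent 2 of the running example; the feature domains are those given there.

```python
from math import prod

# Agent 2's local-state feature domains (Example 3.42, horizon T = 6):
time_vals        = 7    # clock values 0..6
task_status_vals = 7    # not-started, started-at-time-0..5, completed
enabled_vals     = 2    # task-D-enabled is boolean

S2 = prod([time_vals, task_status_vals, task_status_vals, enabled_vals])
print(S2)               # 686: bound on ||S_2||, agent 2's local state space

# Mutually-modeled features are time and task-D-enabled; the best-response
# model tracks the boolean feature's history across the 7 clock values.
M2 = time_vals * 2 ** 7
print(M2)               # 896: bound on the mutually-modeled feature domain

print(S2 * M2)          # 614656: bound on agent 2's scope magnitude
```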
For TD-POMDP problems that are locally fully observable (Definition 2.8), a slightly stronger result holds: complexity is polynomial in kSj kkMj kT −1 (as I derive in Section 4.2.4). Example 3.42. Let us examine the TD-POMDP specification for the problem depicted in Figure 3.8. Agent 2’s local state consists of features {time, taskD-execution-status, task-E-execution-status, task-D-enabled }, of which time and task-D-enabled are mutually-modeled. The domain of time is the set of time steps {0, .., 6} until the global horizon (T = 6). The domain of each task’s execution status is {not-started,started-at-time-0,...,started-at-time-5,completed }, containing a total of 7 values. The last feature, task-D-enabled can be either true or false. As such, the size of agent 2’s local state space is bounded by kS2 k ≤ (7 · 7 · 7 · 2 = 686). The domain of agent 2’s mutually modeled features is bounded by kM2 k ≤ (7 · 27 = 896). We can bound agent 2’s scope magnitude by the product of these two figures X2 ≤ 614, 656. The relationship between complexity and scope magnitude supports an intuitive characterization: the smaller the portion of the world state that an agent observes and interacts with, the easier its local planning and reasoning becomes. Moreover, the fewer the interaction features that it shares with other agents, the easier the problem becomes. 99 3.5.2 Degree of Influence Locality of agent interaction is an important aspect of TD-POMDP structure whose exploitation can lead to dramatic reductions in the complexity of optimal planning, but it is not the only important aspect. So far we have looked at which agents in the team can impact each others’ decisions as well the features through which they interact. Next, I introduce a aspect of structure that characterizes the degree to which agents impact each others’ decisions. With the addition of this metric, I extend the theory from Section 3.5.1 so as to refine the bound on worst-case TD-POMDP complexity. Making use of Definition 3.34, for any two agents i and j, i is either decisiondependent on j or decision-independent of j (conditioned on some other agents’ policies). Considering the rich space of dependencies that may exist between the two agents, a binary relation such as decision-independent lacks the precision to characterize weakly-coupled problems satisfactorily. For instance, i may be able to reason independently of some of j’s decisions but not of others. Moreover, there may be circumstances under which j’s decisions do not affect i’ decisions. Example 3.43. Returning to the problem depicted in Figure 3.8, Agent 2 is decision-dependent on agent 1, but only dependent on those decisions relating to the execution of “Task A”. For instance, whether or not agent 1 idles for one time step or executes “Task B” (which will necessarily take 1 time step) is of no consequence to agent 2 as it plans its own decisions. Furthermore, after agent 1 has completed “Task A”, any decisions that it makes cannot impact agent 2’s decisions in any way. In other words, any two of agent 1’s possible policies, π1x and π1y , that differ only in the decisions made after completing “Task A” will induce the same best response from agent 2. Definition 3.44. 
Two policies, $\pi_i^a$ and $\pi_i^b$, of agent i are impact-equivalent, denoted $\pi_i^a \equiv^I_{\bar{\pi}_K} \pi_i^b$, conditioned on some other agents' policies $\bar{\pi}_K$, if adopting $\pi_i^b$ instead of $\pi_i^a$ will not cause any other agent $j \notin K$ to change its best response decisions:

$$\pi_i^a \equiv^I_{\bar{\pi}_K} \pi_i^b \iff \forall j \notin K,\ \mathbf{arg\,max}_{\pi_j \in \Pi_j} V(\pi_i^a, \pi_j, \bar{\pi}_K) = \mathbf{arg\,max}_{\pi_j \in \Pi_j} V(\pi_i^b, \pi_j, \bar{\pi}_K)$$

Definition 3.45. An impact equivalence class $E^{\bar{\pi}_K}_{i,x}$ (subscripted with the agent index i and class index x and superscripted by policies $\bar{\pi}_K$ of other agents K) is a set of impact-equivalent policies (conditioned on $\bar{\pi}_K$): $\forall \{\pi_i^a, \pi_i^b\} \subseteq E^{\bar{\pi}_K}_{i,x},\ \pi_i^a \equiv^I_{\bar{\pi}_K} \pi_i^b$.

In principle, an agent i's local policy space could be partitioned into disjoint equivalence classes, each of which can be thought to impact other agents in the system in a different way, and thereby each inducing a different best response. Figure 3.11 illustrates the equivalence class partitions in a simple two-agent problem.

Figure 3.11: Example of equivalence classes. [Agent 1's policy space is partitioned into classes $E_{1,1}$, $E_{1,2}$, and $E_{1,3}$, each inducing a distinct best response $\pi_2^*(\pi_1 \in E_{1,x})$ from agent 2.]

Notice that the influence classes in Figure 3.11 are each labeled using notation $E_{1,x}$ without any superscript. Here, $E_{1,x}$ specifies an unconditional equivalence class containing a subset of agent 1's policies, all of which induce an identical best response from agent 2. There are no other agents in the system on which to condition the equivalence. For problems with more than two agents, policies $\pi_1^a \in \Pi_1$ and $\pi_1^b \in \Pi_1$, for instance, may induce the same best response from agent 2 only under the condition that agent 3 adopts $\pi_3^c$. In this case, we would write $\pi_1^a \equiv^I_{\pi_3^c} \pi_1^b$. For example, whether or not a colleague (agent 1) expresses interest in collaborating may only cause a researcher (agent 2) to change her plans under the condition that a program funding manager (agent 3) allocates the necessary funds. Alternatively, agent 1 might influence two other agents, such that $\pi_1^a \in \Pi_1$ and $\pi_1^b \in \Pi_1$ evoke the same best response in agent 2, but different best responses in agent 3. Policies $\pi_1^a$ and $\pi_1^b$ are impact-equivalent only in the case that they evoke identical best responses in each and every other agent.

Definitions 3.44–3.45 enable the discussion of a spectrum of varying degrees of dependence. At one end of the spectrum, agent j is decision-independent (Def. 3.34) of agent i conditioned on agents K, meaning that all policies that i could adopt induce the same best response from agent j. This implies that, by Definition 3.44, any two policies of agent i are impact-equivalent conditioned on any policies $\bar{\pi}_K$ of agents K. Moreover, all of agent i's policies may be grouped into a single impact equivalence class (Def. 3.45). At the opposite end of the spectrum, agent j is decision-dependent on i in such a way that j's best response is highly sensitive to the policy that i adopts. At this extreme, no two policies of agent i are impact-equivalent and the minimum number of i's impact equivalence classes is equal to the size of its policy space $\|\Pi_i\|$.

In comparing the policy space to the impact equivalence class space, I am highlighting one of the primary intuitions of this work. When agents impact each other with some of their decisions but not all of their decisions, they do not need to jointly consider each and every joint policy. They really only need to coordinate the policy decisions that matter, which are those that separate the different impact equivalence classes from one another.
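Conceptually, such a partitioning can be produced by grouping policies according to the best responses they induce. The sketch below does exactly that for a toy, hypothetical policy set, using a single representative best response per peer as a stand-in for the set-valued arg max of Definition 3.44 (which a faithful implementation would compare in full).

```python
from collections import defaultdict

def impact_equivalence_classes(Pi_i, induced_best_responses):
    """Group agent i's policies by the tuple of best responses they induce in
    the other (non-conditioned) agents; policies inducing identical responses
    fall into the same class, in the spirit of Definitions 3.44-3.45.

    `induced_best_responses(pi_i)` is assumed to return one best-response
    policy per affected peer, holding the conditioning policies fixed."""
    classes = defaultdict(list)
    for pi_i in Pi_i:
        classes[induced_best_responses(pi_i)].append(pi_i)
    return list(classes.values())

# Toy illustration: two of agent 1's three policies induce the same best
# response from agent 2, so they are grouped together.
toy_br = {"start-A-at-0": ("wait-then-D",),
          "start-A-at-2": ("start-E-at-0",),
          "start-A-at-3": ("start-E-at-0",)}
print(impact_equivalence_classes(toy_br, lambda p: toy_br[p]))
# -> [['start-A-at-0'], ['start-A-at-2', 'start-A-at-3']]
```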
Example 3.43 (continued). Consider the problem shown in Figure 3.8. Agent 1 and agent 2 interact when agent 1 completes Task A, thereby enabling agent 2 to achieve a positive outcome quality for subsequent execution of Task D. Examination of the possible policies of agent 1 and the corresponding best responses of agent 2 reveals the following impact equivalence classes.

• $E_{1,1}$: For any policy in which agent 1 begins Task A at time 0, agent 2's best response is to wait for one time step and then, at time 1, if Task D is enabled^24, begin Task D, but otherwise begin Task E. The probability (0.5) that Task D is enabled at time 1 and the quality (12.0) of completing it before its deadline (4) are such that the potential benefit of waiting outweighs the potential loss associated with starting Task E a time step late. If D is not enabled at time 1, then agent 2 can infer that Task A has not yet completed and will not complete until time 3, which does not allow enough time to complete Task D before its deadline. In this case, it should not wait any longer but instead begin Task E at time 1.

• $E_{1,2}$: For any policy in which agent 1 begins Task A at time 1, agent 2's best response is to wait for two time steps and then, at time 2, if Task D is enabled, begin Task D, but otherwise begin Task E. The rationale is the same as before. In this case, Task A will complete at either time 2 or time 4 with equal probability. Completion at time 2 gives agent 2 enough time to complete Task D successfully. Agent 2's expected local utility using this best response is $(\frac{1}{2})(12.0) + (\frac{1}{2})(\frac{1}{3})(6.0) = 7.0$, which is better than it would do if it began Task E at time 0.

• $E_{1,3}$: For any policy in which agent 1 begins Task A any later than time 1, agent 2's best response is to begin Task E at time 0. In this case, there is no chance that D will be enabled early enough for agent 2 to complete it before its deadline. Thus, agent 2 should begin its only other task, Task E, as early as possible to maximize the probability of completing it successfully.

The policies contained within each partition are trivially impact-equivalent with respect to agent 3's best response. (Since agent 3 only has a single task to execute, its best response to any of agent 1's policies is simply to begin Task F as soon as F becomes enabled.)

^24 For this TD-POMDP problem, "task-D-enabled" is a nonlocal feature that is completely observable to agent 2. As such, in any given state, agent 2 knows whether or not Task D has become enabled, and can infer the corresponding outcome distribution of its Task D.

By relating the magnitude of the equivalence class set to that of the local policy space, I can now extend the theory developed in Section 3.5.1. Recall the COP reformulation from Section 3.5.1.1, where the problem of optimal joint policy computation was reduced to selecting the best combination of values for variables $\{x_j\}$, each pertaining to an agent j's local policy. As suggested in Section 3.5.1, agents can solve the problem using bucket elimination by each iterating through the domains $D_{\neq j}$ of decision-dependent peers' local policy variables and computing a best response to each. The resulting set of best-response policies is referred to in other work as agent j's coverage set (Becker et al., 2004b).

Definition 3.46. An agent j's coverage set, denoted $C^i_{\bar{\pi}_K}(\Pi_j)$, with respect to agent i and a setting $\bar{\pi}_K$ of other peers' policies, is the set of policies that meets the following condition: $\forall \pi_i,\ C^i_{\bar{\pi}_K}(\Pi_j) \cap \left[ \mathbf{arg\,max}_{\pi_j \in \Pi_j} V(\pi_i, \pi_j, \bar{\pi}_K) \right] \neq \emptyset$.
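For intuition, a coverage set satisfying Definition 3.46 can be assembled by computing one best response per candidate policy of agent i and discarding duplicates; the sketch below does so for a toy, hypothetical value table.

```python
def coverage_set(Pi_i, Pi_j, value, pi_K=()):
    """Brute-force coverage set for agent j w.r.t. agent i (Definition 3.46):
    one best response to every candidate policy of i, duplicates removed.

    `value(pi_i, pi_j, pi_K)` is assumed to return the joint value V."""
    best_responses = set()
    for pi_i in Pi_i:
        best_responses.add(max(Pi_j, key=lambda pi_j: value(pi_i, pi_j, pi_K)))
    return best_responses

# Toy value table: two of agent i's policies induce the same best response,
# so the coverage set ends up smaller than Pi_i.
V = {("a", "x"): 3, ("a", "y"): 1, ("b", "x"): 0,
     ("b", "y"): 2, ("c", "x"): 0, ("c", "y"): 5}
print(coverage_set(["a", "b", "c"], ["x", "y"], lambda i, j, K: V[(i, j)]))
# -> {'x', 'y'}
```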
The complexity results developed in Section 3.5.1.1 assume a brute-force method for computing agent j's coverage set: enumeration of all possible combinations of peer agents' policies and computation of a best response to each. If several of a peer i's policies are impact-equivalent (conditioned on a particular setting of other peers' policies $\bar{\pi}_K$), then they will result in the same best response from agent j. Moreover, all policies $\pi_i \in E^{\bar{\pi}_K}_{i,x}$ in an equivalence class will induce the same best response, suggesting the possibility of redundant best response calculations that could be avoided by taking into account equivalence class structure.

Lemma 3.47. In order to compute an agent j's coverage set $C^i_{\bar{\pi}_K}(\Pi_j)$, it suffices to compute a best response to a single (arbitrarily selected) policy $\pi_i^x$ from each of i's impact equivalence classes $E^{\bar{\pi}_K}_{i,x}$.

Lemma 3.47, which follows directly from Definitions 3.45 and 3.46, implies that the fewer the number of agent i's equivalence classes, the fewer the number of j's necessary best responses. We can use this result to refine the bound on TD-POMDP complexity, but before doing so, one other consideration must be addressed. Unlike the locality of agent interaction, which can be directly and trivially assessed from a TD-POMDP problem's interaction digraph, the problem's equivalence class structure is not known a priori. As such, the solution process must itself perform a partitioning of agents' local policy spaces into equivalence classes in order to take advantage of the underlying equivalence class structure.

Definition 3.48. An impact equivalence partitioning scheme $\mathcal{P}$ is a method that takes as input an agent i and a setting of some other agents' policies $\bar{\pi}_K$, and partitions agent i's local policy space into a set of disjoint impact equivalence classes $\mathcal{P}(i, \bar{\pi}_K) = \{E^{\bar{\pi}_K}_{i,x}\}$.

I denote the complexity of the partitioning scheme with a term $C_{\mathcal{P}}$, which refers to the worst-case computational complexity required for $\mathcal{P}$ to partition any agent's local policy space into equivalence classes conditioned on any other agents' policies. There are a variety of different partitioning schemes that could be used to partition agents' local policy spaces, each involving a different amount of computational overhead and thus a different value of $C_{\mathcal{P}}$. In Chapter 5, I present a specific partitioning scheme grounded in my influence-based policy abstraction methodology (Chapter 4) and characterize its complexity. Another scheme, requiring only constant time, would be to do nothing, thereby leaving each individual policy $\pi_i^x \in \Pi_i$ in its own partition. This scheme is valid in the sense that it creates classes of i's policies $\{E_{i,x}\}$ that are equivalent, but it is not useful for reducing the number of best response calculations that j must perform.

Definition 3.49. For a given problem, the degree of influence $d_{\mathcal{P}}$ afforded by a partitioning scheme $\mathcal{P}$ (Def. 3.48) is the maximal ratio of the number of impact equivalence classes to the number of local policies:

$$d_{\mathcal{P}} = \max_{i=1,\dots,n}\ \max_{\forall K, \forall \bar{\pi}_K} \frac{\left\|\mathcal{P}(i, \bar{\pi}_K)\right\|}{\|\Pi_i\|} \qquad (3.14)$$
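Given the partition counts produced by a scheme $\mathcal{P}$, Equation 3.14 is a one-line computation; the sketch below evaluates it and plugs in the figures derived for the running example in Example 3.43 (continued) below.

```python
def degree_of_influence(num_policies, partition_counts):
    """Equation 3.14: the worst-case ratio of equivalence classes to local
    policies, taken over agents and over conditioning contexts.

    `partition_counts[i]` is assumed to list |P(i, pi_K)| for agent i under
    each conditioning context pi_K; `num_policies[i]` is ||Pi_i||."""
    return max(
        max(counts) / num_policies[i]
        for i, counts in partition_counts.items()
    )

# Running example: 3 impact equivalence classes out of 483,729,408 policies.
print(degree_of_influence({1: 483_729_408}, {1: [3]}))   # ~6.2e-09
```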
Example 3.43 (continued). For the example problem shown in Figure 3.8, we derived earlier that a mere 3 partitions $\{E_{1,1}, E_{1,2}, E_{1,3}\}$ can be used to classify all of agent 1's possible policies. Even in this very simple problem, given the time horizon of 6 and agent 1's three different tasks in addition to a "wait" action, agent 1 has a total of 483,729,408 possible policies.^25 The degree of influence (given this partitioning) is thus $3/483{,}729{,}408 \approx 6.2 \times 10^{-9}$.

^25 The number of possible policies was calculated by multiplying together the number of available local actions in every reachable local state of the corresponding TD-POMDP.

The degree of influence quantifies the (worst-case) reduction in best-response calculations achievable with a particular partitioning scheme. For any given problem, there is a minimal number of equivalence classes that, if found by a partitioning scheme, would minimize the degree of influence. However, the computation required to execute the partitioning scheme with the lowest degree of influence may be prohibitive, possibly canceling out the associated benefit of the reduced number of best response calculations. Thus, in selecting a partitioning scheme, it is desirable to achieve a balance between the degree of influence and the computational overhead of partitioning (across all problems that the partitioning scheme is expected to face). Theorem 3.50 sheds some light on the problem of finding such a balance.

Theorem 3.50. The worst-case time and space complexity of solving a TD-POMDP problem is bounded by:

$$O\!\left(n \cdot \mathrm{EXP}(\mathcal{X}_i^{max}) \cdot (d_{\mathcal{P}} \|\Pi_i^{max}\|)^{\omega} + n \cdot C_{\mathcal{P}} \cdot (d_{\mathcal{P}} \|\Pi_i^{max}\|)^{\omega-1}\right) \qquad (3.15)$$

where n is the number of agents, $\mathcal{X}_i^{max}$ is the largest scope magnitude (Def. 3.40) of any agent, $d_{\mathcal{P}}$ is the degree of influence (Def. 3.49) given partitioning scheme $\mathcal{P}$, $\Pi_i^{max}$ is the largest policy space of any agent, $C_{\mathcal{P}}$ is the worst-case complexity of $\mathcal{P}$, and $\omega$ is the induced width of the interaction digraph (Def. 3.36).

Proof. The complexity bound presented in Theorem 3.50 follows from the analysis of my BE-OIS algorithm presented in Section 6.6.3.

Theorem 3.50 is an extension of the complexity results developed in Section 3.5.1. The differences between Equation 3.15 and the bound in Theorem 3.41 are (1) a reduction in the base of the exponent by a factor of $d_{\mathcal{P}}$ and (2) the addition of a term $C_{\mathcal{P}}$ accounting for the computational overhead of $\mathcal{P}$ (in the context of the bucket elimination algorithm). The new bound suggests that, all else being equal, problems involving agents with a low degree of influence (whose local policy spaces can be partitioned into a relatively small number of influence equivalence classes) should be easier to solve than problems with a high degree of influence, contingent upon the efficiency of impact equivalence partitioning. If the partitioning complexity $C_{\mathcal{P}}$ is bounded by $O(d_{\mathcal{P}} \cdot \mathrm{EXP}(\mathcal{X}_i^{max}) \cdot \|\Pi_i^{max}\|)$, the second term of Equation 3.15 vanishes. However, if it is of significantly larger magnitude, it overwhelms the first term, indicating that the overhead of partitioning potentially outweighs any computational benefit of the smaller local policy space search sizes.

Note that the bound given in Theorem 3.50 is no longer purely a statement about the problem. That is, it includes information, $C_{\mathcal{P}}$ and $d_{\mathcal{P}}$, specific to the algorithm that is used to solve the problem. As such, it is actually a bound on the complexity of exploiting influence structure, where the exploitation is inexorably tied to the solution algorithm.

3.5.3 Summary of Weak Coupling Characterization

Over the course of this section, I have developed theory relating TD-POMDP complexity to weakly-coupled problem structure.
Subsections 3.5.1.2, 3.5.1.3, and 3.5.2 synthesize three key aspects of problem structure {agent scope, state factor scope, and degree of influence} into an integrated characterization of weak coupling. The end result is a refined bound on the worst-case time and space complexity of optimal planning, presented in Equation 3.15, that accounts for the three weak coupling aspects. As I summarize in the list below, each aspect manifests itself in a different set of problem parameters, and each affects the overall complexity in a different manner.

• Agent scope refers, conceptually, to which agents in the system are affecting each others' decisions (and hence, which peers' influences need be reasoned about in order for the affected agent to plan its own behavior). In general, the fewer the agents that affect one another, the smaller the worst-case computation time. With respect to agent scope, the strength of coupling of a problem can be quantified as the induced width $\omega$ of the interaction digraph, which bounds the number of peers that can affect any given agent. All else being equal, a smaller value of $\omega$ indicates a more weakly-coupled problem. In the context of a decoupled joint policy search method, worst-case computation time is exponential in $\omega$ (regardless of the total number of agents).

• State factor scope refers to the portion of world state features that must be reasoned about by an individual agent when planning its local policy (in the context of a decoupled search method). With respect to state factor scope, I quantify strength of coupling as the largest scope magnitude $\mathcal{X}_i^{max}$ (which is the largest number of combinations of values that may be taken by the features in an agent's state factor scope). In general, worst-case computation time is exponential in $\mathcal{X}_i^{max}$. For TD-POMDP problems, $\mathcal{X}_i^{max}$ is bounded by $\max_j \left[ \|S_j\| \, \|M_j\|^{T-1} \right]$. Thus, weak coupling is directly related to the density of mutually-modeled state features $\bar{m}_j \subseteq s_j$ and the sizes of their joint domain $\|M_j\|$. For TD-POMDP problems in which agents directly observe their local state, worst-case computation time is polynomial in $\max_j \left[ \|S_j\| \, \|M_j\|^{T-1} \right]$.

• For agents that do affect each others' decisions, the degree of influence relates to the proportion of unique ways that they can affect each other's decisions. I have derived, in Section 3.5.2, that all of agent i's policies that have the same impact on agent j's decisions can theoretically be grouped together, thereby partitioning agent i's local policy space, so as to reduce the number of policy combinations that need be jointly considered by the group of agents. Given a partitioning scheme $\mathcal{P}$, parameter $d_{\mathcal{P}}$ quantifies a problem's degree of influence as the worst-case ratio of partitions to local policies. In regard to the worst-case time complexity bound, which is exponential in the induced width, $d_{\mathcal{P}}$ is situated in the base of the exponent and thus has a potentially-significant effect on computation time. More weakly-coupled problems, whose values of $d_{\mathcal{P}}$ are smaller, are likely to be easier to solve than problems with larger values of $d_{\mathcal{P}}$ (all else being equal). However, this statement is contingent on the efficiency of the partitioning scheme $\mathcal{P}$, whose complexity $C_{\mathcal{P}}$ affects the overall worst-case computation time polynomially.

The three problem parameters $\{\omega, \mathcal{X}_i^{max}, d_{\mathcal{P}}\}$ described above can be thought of as orthogonal dimensions whose combination provides a concrete measure of the degree of coupling of a TD-POMDP problem.
A problem’s worst-case complexity depends on where it lies along the spectrum of agent scope, along the spectrum of state factor scope, and along the spectrum of degree of influence. For any two problems, we can now compare their worst case complexities by evaluating (or estimating) the values of the three parameters and positioning each in the 3-dimensional space. Evaluating the first two parameters, ω and Xmax , is straightforward given the probi lem specification. The induced width ω may be obtained by converting the interaction digraph (whose connectivity is made explicit by the TD-POMDP description M) into a constraint graph (as described in Section 3.5.1.2) and computing its induced width using one of the algorithms reviewed by Dechter (2003). The state factor scope   magnitude Xmax may be computed by evaluating max kSj kkMj kT −1 , whose terms i j are also explicitly described in M. A problem’s degree of influence dP is not readily assessable from the TD-POMDP description, but it can be estimated heuristically. In Section 4.6, I supplement this theoretical analysis by proposing and evaluating several heuristics for estimating dP that are specific to my influence-based abstraction scheme of partitioning. Aside from the ability to compare problems’ worst-case computation times, the theory that I presented in this section has broader consequences. In Section 3.3.2, where I proved the intractable general worst-case complexity of the TD-POMDP class, I argued that, given its explicit description of exploitable structure, the class contains regions wherein problems can be solved efficiently. Within my three-dimensional characterization of weakly-coupled problem structure lies a map for navigating the TDPOMDP class and uncovering those efficient regions. For each of the three dimensions, my analysis allows for determination of the worst-case complexity with respect to that dimension, and for borders to be drawn between regions that fall into one complexity class and those that fall into another. Worst-case complexity aside, my analysis provides guiding signs that suggest the pitch and slope of problem difficulty, and that can be used to better understand why some problems take hours to solve and others seconds. Finally, with respect to the degree of influence, the theory presented here justifies the exploration into influence-based abstraction that is the focus of this dissertation. 3.5.4 Related Work on Characterizing Weak Coupling In contrast to related studies of weakly-coupled problems in sequential decision making, the primary distinction of my analysis is that it synthesizes several different aspects of weak-coupling into a single unified characterization. Each of these aspects 108 have appeared individually in some shape or form in past work. For instance, the work of Guestrin et al. (2001, 2003) on exploiting restricted scope in factored value functions, though limited in context to approximate solution computation, plays a foundational role in my analysis. Dolgov & Durfee (2004a) analyze structure in graphical multiagent MDPs, relating agent scope to the complexity of optimizing local value functions. Another branch of work characterizes agent scope for systems of transition-independent agents, developing exploitative distributed joint policy search algorithms on which my analysis is based (Nair et al., 2005; Kim et al., 2006), and relating complexity of optimal planning to induced width (Kumar & Zilberstein, 2009). Oliehoek et al. 
(2008b) measures locality of interaction as a function of both agent scope and state factor scope, focusing on the complexity of joint planning stage-by-stage as a series of collaborative graphical Bayesian games. To my knowledge, the last aspect of weak coupling that I consider, degree of influence, has received no attention in the literature. However, the theory I have developed might explain the performance of the Coverage Set Algorithm Becker et al. (2004b), which effectively partitions each agent’s local policy space (by way of a parametrization that encodes the effects of its policy on other agents’ rewards). Aside from the three aspects {agent scope, state factor scope, and degree of influence} on which my analysis concentrates, researchers have performed similar analyses relating other forms of Dec-POMDP problem structure to problem complexity. For instance, Goldman & Zilberstein (2004) characterize the complexity of various Dec-POMDP subclasses by classifying agents’ communication capabilities, whether or not agents share any information during execution, and the agents’ objectives (e.g., whether they are maximizing rewards or striving to reach a set of goal states). Shen et al. (2006) characterize complexity of optimal Dec-MDP planning according to the complexity of the minimal encoding of agents’ local policies. Allen (2009) takes a different approach to characterizing problem structure. With the motivation of predicting the amount of computation required by optimal solution methods and the value achieved by approximate solution methods, he develops an information-theoretic metric, influence gap, that quantifies roughly the difference in the degree to which each of two agents can affect world state transitions, joint observations, and joint rewards in a Dec-POMDP. The meaning of influence in Allen’s work is slightly different from that of my degree of influence, which refers to the degree to which agents can impact each other. However, his results express the same general sentiment that varying levels of impact result in varying problem complexity. Lastly, there is a strong connection between my analysis and that of Brafman & 109 Domshlak (2008). Whereas my analysis quantitatively characterizes weakly-coupled sequential decision making problems (specifically TD-POMDPs), Brafman & Domshlak (2008) quantitatively characterize “loosely-coupled” multiagent classical planning problems. Note that, although the Dec-POMDP is an optimization problem, the multiagent classical planning problem is one of satisfaction, whose solution is a joint plan that satisfies a set of goal conditions. In their analysis of complexity, Brafman & Domshlak (2008) take advantage of this fact to transform the planning problem into a constraint satisfaction problem (CSP). Much like in my analysis of joint policy computation as constraint optimization, they incorporate a parameter ω corresponding to the induced width of the constraint graph. They describe the level of coupling of a problem with one other variable δ that measures the number of potential coordination points (wherein an agent can affect others by adding a “public” action to its plan). Conceptually, this is similar to degree of influence, which dictates the number of unique impacts that an agent can manifest on another. 
3.5.5 Contribution Outside the Scope of the TD-POMDP Researchers have developed a number of different algorithms for exploiting the kinds weakly-coupled problem structure included in my characterization (Becker et al., 2004b; Kim et al., 2006; Kumar & Zilberstein, 2009; Mostafa & Lesser, 2009; Nair et al., 2003, 2005; Oliehoek et al., 2008b; Witwicki & Durfee, 2010). A broader contribution of my characterization is that it can explain some of the trends observed in the performance of these algorithms that are not easily explained without considering combinations of weak coupling dimensions. For instance, the successes of a family of ND-POMDP algorithms (Kim et al., 2006; Nair et al., 2005) in scaling to many agents has been attributed to the reduced agent scope associated with ND-POMDP agents’ local neighborhoods (Kim et al., 2006; Kumar & Zilberstein, 2009; Nair et al., 2005). That is, as long as the agent scope remains small, these algorithms are expected to be practical. However, a generalized version of one of these algorithms (JESP (Nair et al., 2003)) has recently been reported as intractable for a test set of Distributed POMDPs with Coordination Locales (DPCLs) containing just two agents, even when generating an approximate solution (Varakantham et al., 2009). A likely explanation for this phenomenon is contained within Equation 3.15, which suggests that it was not the agent scope of the problems that foiled JESP but instead the cost of JESP’s best response calculation. Whereas ND-POMDP problems have an inherently restricted state factor scope due to the strict separation of agents’ local states and transition and observation independence, DPCL 110 problems involve transition-dependent agents that need to reason about each others’ state variables in order to compute optimal best responses (which JESP employs in computing approximate solutions), making the DPCL more strongly coupled even in its two-agent incarnation. 3.6 Summary The main contribution of this chapter is a model for multiagent coordination that emphasizes exploitable problem structure. While past work has defined a variety of other structured models, the TD-POMDP expresses structure without imposing overly-restrictive assumptions. In particular, the TD-POMDP accommodates rich transition-dependent agent interactions and partial observability, yet it is able to articulate aspects of structure previously exploited only in transition and observation independent problems. The TD-POMDP’s structure is significant because it decouples the joint model into a set of interdependent local POMDP models that are tied to one another by their transition influences. As a consequence, the TD-POMDP is a natural candidate for the application of decoupled solution algorithms that decompose the computation of joint behavior into a series of simpler computations about local behavior. Despite the TD-POMDP model’s inherently-decoupled representation, the generallyintractable computational complexity of the class of TD-POMDP problems brings into question the efficiency of decoupled solution formulation. Certainly, some TDPOMDP problems are intractable to solve, while others can be decomposed and solved efficiently. I call this latter group of problems weakly-couped. Fortunately, the problem structure expressed in the TD-POMDP’s description provides clues as to the efficiency of solving any given problem. 
This insight has driven me to develop a characterization of weakly-coupled problems, and to derive refined bounds on worst-case computational complexity that accounts for three different aspects of weakly-coupled problem structure: agent scope, state factor scope, and degree of influence. Whereas both agent scope and state factor scope have been analyzed in some shape or form in prior work (though in more restricted problem contexts), I am the first to formalize the degree of influence in any context. Unlike the other two aspects of weak coupling, the degree of influence is not immediately discernible from the problem description. When exploited, however, my weak coupling theory suggests that a low degree of influence translates to significant improvements in computational efficiency. The promise of efficient solutions, along with the elusiveness of evaluating the degree of 111 influences, motivates the development and evaluation of a methodology for exploiting influence structure. Such is the focus of the remainder of this dissertation. 112 CHAPTER 4 Influence-Based Policy Abstraction In the last chapter, I claimed that the TD-POMDP’s explicit representation of problem structure makes it a natural candidate for modeling and solving weaklycoupled problems efficiently. Guided by my characterization of degree of influence as well as state factor scope, I now begin to address these claims with the development of a methodology for exploiting the TD-POMDP’s weakly-coupled problem structure. The primary insight of this chapter is that, when most agent decisions are independent of peers’ decisions, the agents can avoid the complexity of coordinating their full policies. They can optimize their joint behavior by instead coordinating policy abstractions that convey only the essential influences. In connection with the theory presented in Section 3.5.2, influences summarize classes of impact-equivalent policies (Def. 3.45). Here, I examine what these influences are, how they can be represented compactly without loss of optimality, and why their coordination has potentially significant computational benefits over conventional policy search. In answering these questions, this chapter contributes a formal characterization of transition-dependent influence. We find that, in the context of the TD-POMDP model, agent interaction can be conveniently modeled using probability distributions over present and past values of shared feature values. Further, this representation suffices for formulating optimal joint policies. The influence derivation and proof of sufficiency presented herein serve as the theoretical foundation for the influence-based solution methods developed in subsequent chapters. Practically, and in connection with the remainder of the dissertation, this chapter also develops an important piece of the influence-based solution methodology: the model that allows each agent to efficiently compute its optimal local policy in response to promised influences of its peers. Despite the fact that the influence formalism I develop here has no direct application outside of the TD-POMDP model, I believe that there is potential for farther-reaching 113 impact. For instance, the formalism could be extended to represent concurrent transition dependencies in the more general class of Dec-POMDPs. Moreover, the idea of reducing nonlocal policies to probability distributions over local effects without loss of information (sufficient for optimal reasoning) is itself a conceptual contribution. 
This insight could inspire the development of similar approaches to reasoning about uncertainty in multi-agent contexts other than Dec-POMDP planning. 4.1 Overview In contrast to the general Dec-POMDP where each agent’s behavior may be arbitrarily intertwined with all others’, TD-POMDP agents are coupled to one another through structured feature dependencies between select individuals of the team. In particular, for weakly-coupled agents whose decisions are largely independent of one another (as in the interaction graph shown in Figure 3.7), the TD-POMDP provides a succinct representation for their interactions. We can, in turn, exploit this representation using a decoupled solution methodology (reviewed in Section 2.3.3) that decomposes the joint policy formulation into a series of local policy formulations. While substantial computational leverage has been obtained in using the decoupled approach to solve problems where agents are transition and observation independent (Becker et al., 2004b; Nair et al., 2005; Varakantham et al., 2007), less progress has been made in applying the same techniques to problems where agents interact through the transition model. Much of the difficulty in decomposing transition-dependent agents’ policy formulations is due to the complexity of formulating and solving best-response models. As discussed in Section 4.2, computing a best response entails converting the joint problem into a single-agent POMDP, wherein the agent uses a belief-state representation to keep track of its knowledge about the system’s trajectory as it takes actions and receives observations. In contrast to the single-agent POMDP belief state (Smallwood & Sondik, 1973), a Dec-POMDP agent’s belief-state needs to include information that it gains about other agents’ possible beliefs in addition to the information that it gains about the Dec-POMDP world state. For a general Dec-POMDP agent whose interactions are unrestricted, this entails maintaining a probability distribution over the possible observations of interacting peer agents (as derived by Nair et al., 2003) which is necessarily exponential in the number of peers. However, thanks to the structure introduced in Chapter 3, TD-POMDP agents can make use of an alternative belief-state representation, as I describe in Section 4.2.2 and derive in Section 4.2.3. 114 This novel belief-state representation consists of a vector whose size depends not on the number of peer agents, but instead on the state factor scope (described more precisely in Section 3.5.1.3), making it advantageous for weakly-coupled agents with sparse peer interactions. What we find from the derivation of a TD-POMDP agent’s belief state is that information about peers’ behavior can be represented quite compactly in the form of a probability distribution over nonlocal feature values. Since the agent is influenced by peers only through nonlocal features, the transition dynamics of all nonlocal features constitutes a model of influence. In order to make optimal decisions, the agent does not need to know the details of peers’ planned behavior as long as it knows the resulting influences. Before formally characterizing interagent influence in Section 4.3, I illustrate the high-level concepts with an example. 
Figure 4.1: Example of limited influence. [Rover 5's tasks (Visit Sites A–C and Prepare Site C) and rover 6's tasks (Visit Sites C and D) with their outcome distributions and time windows; rover 5's "Prepare Site C" (window [3,4]) enables rover 6's "Visit Site C" (window [5,8]).]

Example 4.1. Figure 4.1 portrays a simple, concrete example problem involving two rover agents. The rovers are each equipped with different hardware, so it is necessary for rover 5, upon visiting site C, to prepare the site in order for rover 6 to gain any value from visiting the site. Apart from this interaction, the two agents' problems are completely independent. Neither of them interacts with any other agents, nor do they share any observations except for the occurrence of site C's preparation and the current time. In a TD-POMDP, this simple interaction corresponds to the assignment of a single boolean nonlocal feature site-C-prepared that is locally-controlled by rover 5, but that influences (and is nonlocal to) rover 6. Thus, in planning its own actions, rover 6 needs to be able to make predictions about site-C-prepared's value (influenced by rover 5) over the course of execution.

Due to the window constraints present in Example 4.1, the only information about rover 5's behavior that is relevant to rover 6 is the probability with which site-C-prepared will become true at time = 4. At the start of execution, site-C-prepared will take on value false and remain false until rover 5 completes its "Prepare Site C" task (constrained to finish only at time 4, if at all, given the task window in Figure 4.1). After the site is prepared, the feature will remain true thereafter until the end of execution. With these constraints, there is no uncertainty about when site-C-prepared will become true, but only whether it will become true. Hence, the influence of rover 5's policy can be summarized with just a single probability value, $\Pr(\text{site-C-prepared} = true \mid time = 4)$.

Aside from providing an elegant, compact^1 representation of nonlocal policy information, the influence abstraction (exemplified by $\Pr(\text{site-C-prepared} = true \mid time = 4)$) engenders a potentially-significant reduction in the size of the search space for optimal joint policies. As described more formally in Section 4.5, the influence space clusters together those individual policies of each agent that exert identical influences on the agent's peers. In Example 4.1, notice that any two policies that differ only in the decisions made after time 3 will yield the same value for $\Pr(\text{site-C-prepared} = true \mid time = 4)$. By considering only those influence values achievable by some feasible policy, agents avoid jointly reasoning about the multitude of local policies with equivalent influences. In Section 4.6, I corroborate this claim with an empirical evaluation of influence space size over a systematically-explored space of random problems.

^1 Throughout this chapter, I use the word compact to refer to the fact that the encoding exploits weakly-coupled problem structure to express the necessary information in fewer parameters (than conventional representations).
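The single-parameter summary discussed above can be made concrete with a small sketch, assuming an ad hoc record type rather than the influence formalism developed later in this chapter; the key point is that every policy of rover 5 collapses to one probability value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InfluencePoint:
    """Summary of rover 5's effect on rover 6 in Example 4.1."""
    feature: str
    time: int
    probability: float

def summarize_rover5_policy(prepares_site_c: bool, completion_prob: float = 1.0):
    # Any policy of rover 5 that schedules "Prepare Site C" within its [3,4]
    # window maps to the same point; decisions made after time 3 are
    # irrelevant to rover 6 and collapse into the same summary.
    p = completion_prob if prepares_site_c else 0.0
    return InfluencePoint("site-C-prepared", time=4, probability=p)

# Two full policies that differ only after time 3 yield identical influences,
# so rover 6 only needs to plan against one representative per influence value.
print(summarize_rover5_policy(True))
print(summarize_rover5_policy(False))
```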
From the perspective of this decision-making agent, once its peers fix their policies, they cease to be decision makers and instead become processes of the stochastic environment. As such, the best-response model is really a local POMDP whose construction (as discussed in Section 4.2.1) is derived from the Dec-POMDP model, but whose observation signal is local (instead of joint) and whose action selection is local (instead of joint). Due to the partial observability of POMDPs, the agent cannot track its current system state precisely. Instead, as is common practice when building and solving local POMDP models, the agent maintains a belief state that summarizes the knowledge it gains as it acts and observes the environment (Smallwood & Sondik, 1973). The belief state encodes information sufficient for the agent to make predictions and to select choices that are just as good as those it could make by remembering its complete action-observation history.

[Figure: agent $j$ interacts with the world, feeding each action $a_j^t$ and observation $o_j^t$ into a belief state estimator that replaces the action-observation history with a belief state $b_j^t$.] Figure 4.2: Usage of belief state for POMDP agent reasoning.

Figure 4.2 depicts the use of a belief state in place of action-observation history. For the purposes of a best-response POMDP conditioned on fixed peer behavior, representation and maintenance of belief state are both nontrivial to operationalize and computationally complex. With the goal of reducing the complexity of best-response reasoning for TD-POMDP agents, this section develops a more efficient best-response model that exploits weakly-coupled problem structure. I begin in Section 4.2.1 by examining prior work (Nair et al., 2003) on belief state representations for agents whose peers' policies have been fixed. I refer to this general representation henceforth as the General Best-Response Belief-State. Next, in Sections 4.2.2–4.2.3, I derive a condensed version that takes advantage of the structure articulated by the TD-POMDP model to improve computational efficiency of local (best-response) reasoning. Through this derivation, I reveal nonlocal policy information in Section 4.2.5 that forms the basis for my influence-based abstraction methodology (to which the remainder of this chapter is devoted).

4.2.1 General Best-Response Belief State

Let us begin by considering an existing formulation of best-response belief state that Nair derived for the general class of Dec-POMDPs (which, at the time, he referred to as MTDPs). Given the fixed, deterministic policies of an agent's peers, the problem of finding an optimal local best-response policy may be represented using a complex but normal single-agent POMDP (Nair et al., 2003). Recall, from my review in Section 2.2.2.2, that the single-agent POMDP belief state (which I will denote $b$) summarizes the agent's action-observation history with a probability distribution over possible world states: $b_j^t(s^t) = \Pr\left(s^t \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}\right), \forall s^t \in S$, where $j$ is the agent, $s^t$ is a possible current world state, and $\langle \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t} \rangle$ is the action-observation history. This particular belief state vector is a sufficient statistic for predicting future action-observation consequences in single-agent problems (Smallwood & Sondik, 1973).
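To ground the classical recursion just recalled, the following Python sketch implements the standard single-agent POMDP belief update. The dictionary-based transition model T, observation model O, and state set are illustrative placeholders rather than structures defined anywhere in this dissertation.

def pomdp_belief_update(b, a, o, states, T, O):
    """Classical single-agent POMDP belief update (a Bayes filter).

    b           : dict mapping state -> probability (current belief)
    T[s][a][s2] : transition probability Pr(s2 | s, a)
    O[s2][a]    : dict mapping observation -> Pr(o | a, s2)
    """
    new_b = {}
    for s2 in states:
        # Pr(o | a, s2) * sum_s Pr(s2 | s, a) * b(s)
        new_b[s2] = O[s2][a].get(o, 0.0) * sum(T[s][a].get(s2, 0.0) * b[s] for s in states)
    norm = sum(new_b.values())
    if norm == 0.0:
        raise ValueError("observation %r has zero probability under this belief" % (o,))
    return {s2: p / norm for s2, p in new_b.items()}

Each call consumes one action-observation pair and returns a new distribution over world states, which is exactly the sense in which the belief state can replace the full history in the single-agent setting.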
However, in the context of a best-response calculation, where peer agents are assumed to be executing fixed policies conditioned on their own partial observations, a distribution over (Dec-POMDP) world states is insufficient. Given that the agent's local observations (and actions) may be correlated with peers' observations, information needed to infer peers' beliefs may be lost in translating the local observation history into a world state distribution.

Example 4.2. Consider two rovers (1 and 2) that receive a joint observation of wind at their base station, which serves as partial information about the weather conditions at the various sites that they might choose to visit. To rover 1, wind-felt-at-base (the observation feature) serves as an indication of the possibility of rain-at-site-A (the world state feature). Assume that rover 1 has planned a policy that is particularly sensitive to this observation, dictating that if it ever observes any wind, it will not travel to visit site A for the rest of the day. More precisely, it will not perform action begin-trip-to-site-A for any observation history containing observation wind-felt-at-base. In planning its behavior in response to rover 1's policy, rover 2 considers the scenario where it begins its day observing wind at base at 9:00 and then travels to site A, observing directly that rain-at-site-A=false at 12:00. For this scenario, rover 1's policy dictates that, due to the existence of wind over base in the morning, it will not make a trip out to site A at noon. But rover 2 will not be able to perform this reasoning based solely on world state (rain-at-site-A=false). Rover 2 will need to reason instead that, given its observations of wind three hours ago, rover 1 also observed wind and therefore should not be expected to visit (regardless of whether or not it is raining at site A). This is because rover 1's beliefs about the world state differ from rover 2's beliefs.

Nair ensures that information about peer beliefs is not lost by augmenting the classical belief state with a probability distribution over peer observation histories (Nair et al., 2003):

$b_j^t\left(s^t, \vec{o}^{\,t}_{\neq j}\right) = \Pr\left(s^t, \vec{o}^{\,t}_{\neq j} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}\right), \quad \forall s^t \in S,\ \forall \vec{o}^{\,t}_{\neq j} \in \Omega_1 \times \ldots \times \Omega_{j-1} \times \Omega_{j+1} \times \ldots \times \Omega_n \quad (4.1)$

Equation 4.1 shows the multiagent belief state vector $b_j^t$ that agent $j$ associates with a given action-observation history ending at time $t$, where each component represents the probability of a unique world state ($s^t$) and combination of unique peer observation histories ($\vec{o}^{\,t}_{\neq j} = \{\vec{o}_i^{\,t}, \forall i \neq j\}$). By maintaining a joint distribution over world state and peer observation histories, agent $j$ is able to keep track of its belief about the world state as well as the likelihoods of other agents' possible beliefs. Use of any belief state representation requires the agent's ability to compute and update its belief state as it performs actions and receives observations. Using Nair's belief state update function (Nair et al., 2003), $BSU()$, agent $j$ can compute the individual components of its belief state as shown in Equation 4.2. The initial belief state $b_j^0$, conditioned on agent $j$'s as-of-yet empty observation history, is simply the probability distribution of world start states dictated by the Dec-POMDP problem description. Subsequent belief states $b_j^{t+1}$ are a function of previous belief state, local action, and local observation.
$b_j^0 = BSU(\emptyset) = \langle \Pr(s^0) \rangle$

$b_j^{t+1} = BSU\left(b_j^t, a_j^t, o_j^{t+1}\right) = \Pr\left(s^{t+1}, \vec{o}^{\,t+1}_{\neq j} \mid b_j^t, a_j^t, o_j^{t+1}\right)$
$= \dfrac{\sum_{s^t \in S} \left[ b_j^t\left(s^t, \vec{o}^{\,t}_{\neq j}\right) \cdot \Pr\left(s^{t+1} \mid s^t, \langle a_j^t, \pi_{\neq j}(\vec{o}^{\,t}_{\neq j}) \rangle\right) \cdot \Pr\left(o_j^{t+1}, o^{t+1}_{\neq j} \mid s^t, \langle a_j^t, \pi_{\neq j}(\vec{o}^{\,t}_{\neq j}) \rangle, s^{t+1}\right) \right]}{\text{a normalization factor}} \quad (4.2)$

With every new action agent $j$ takes and observation it receives, it can calculate each belief state component as in Equation 4.2 by looping over all possible last world states $s^t$, for each adding the product of the three terms in the numerator, and normalizing (since the component probabilities should all sum to 1).

[Figure: three successive belief states $b_2^0$, $b_2^1$, $b_2^2$ of rover 2, each a distribution over world state ($RA$: rain-at-site-A versus $\overline{RA}$: not-rain-at-site-A) paired with rover 1's possible observation histories over the individual observation space $\Omega_i = \{N: \textit{no-clouds-seen-at-base},\ C: \textit{clouds-seen-at-base}\}$, updated after action idle / observation C and then action begin-trip-to-site-A / observation N.] Figure 4.3: One possible belief state trajectory of rover 2 from Example 4.2.

Calculation of these three terms is straightforward using the Dec-POMDP model and the fixed peer policies. The first term is the probability that the world was in state $s^t$ at the previous time step and that the other agents had observed the subsequence of observations $\vec{o}^{\,t}_{\neq j}$ from times 0 to $t$ equal to those in the respective belief state component (as dictated by the previous belief state). The second term is the probability that the current world state $s^{t+1}$ is equal to that of the respective belief state component, conditioned on previous state and previous joint action composed of the action taken by $j$ and the set of actions dictated by the other agents' fixed policies applied to their respective observation histories² (as computed using the Dec-POMDP transition function $P$). And the third term is the joint probability of agent $j$'s new observation $o_j^{t+1}$ and the new set of observations for the other agents associated with the respective belief state component, conditioned on the previous world state and joint action (as in the second term) and new world state. Through successive applications of its belief state update function (Equation 4.2), agent $j$ can compute a belief state $b_j^t$ given any sequence of actions and observations $\langle a_j^0, \ldots, a_j^{t-1}, o_j^0, \ldots, o_j^t \rangle$.

² This dissertation, as with Nair's work, considers and computes policies that are deterministic. (For any finite-horizon Dec-POMDP, there exists a deterministic joint policy that achieves the same value as the optimal randomized joint policy.) Under the assumption that each peer agent's policy is deterministic, it need not be conditioned on action-observation history, but only on observation history. This is because action history is uniquely determined from each observation history by stepping through the observations and selecting the deterministic action choice. Furthermore, computation of belief state relies on the property that peer policies map their observation histories to actions ($\pi_i : \vec{O} \mapsto A$), as opposed to mapping belief states to actions. Even though agents are planning their policies using the belief state representation, the dynamic programming algorithm Nair uses to compute best response policies enumerates all reachable observation sequences, recording those sequences which resulted in each reachable belief state, and thereby computing a policy that is a function of observation histories. Were peer policies defined over belief states and not observation histories, computation of this term would become much more complicated, ultimately requiring recursive invocation of peers' belief state update functions, which in turn would require invocation of their peers' belief state update functions.
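As a concrete, deliberately naive rendering of the bookkeeping in Equation 4.2, the Python sketch below enumerates belief components indexed by a world state paired with one observation history per peer. The callables T, O_joint, and peer_policies are assumed stand-ins for the Dec-POMDP transition function, the joint observation function, and the fixed (deterministic) peer policies.

from itertools import product

def general_bsu(b, a_j, o_j, states, peer_obs_sets, T, O_joint, peer_policies):
    """Sketch of the general best-response belief update (Equation 4.2).

    b : dict mapping (s, peer_histories) -> probability, where peer_histories
        is a tuple containing one observation-history tuple per peer.
    T(s, joint_a, s2)                        -> Pr(s2 | s, joint_a)
    O_joint(s, joint_a, s2, o_j, peer_obs)   -> Pr(o_j, peer_obs | s, joint_a, s2)
    peer_policies[i](history_i)              -> peer i's action for that history
    """
    new_b = {}
    for (s_prev, hists_prev), p_prev in b.items():
        if p_prev == 0.0:
            continue
        # actions the fixed peer policies dictate for their observation histories
        peer_acts = tuple(peer_policies[i](hists_prev[i]) for i in range(len(hists_prev)))
        joint_a = (a_j,) + peer_acts
        for s_next in states:
            p_trans = T(s_prev, joint_a, s_next)
            if p_trans == 0.0:
                continue
            # branch on every combination of new peer observations
            for peer_obs in product(*peer_obs_sets):
                p_obs = O_joint(s_prev, joint_a, s_next, o_j, peer_obs)
                if p_obs == 0.0:
                    continue
                hists_next = tuple(hists_prev[i] + (peer_obs[i],) for i in range(len(hists_prev)))
                key = (s_next, hists_next)
                new_b[key] = new_b.get(key, 0.0) + p_prev * p_trans * p_obs
    norm = sum(new_b.values())
    if norm == 0.0:
        raise ValueError("zero-probability observation under the given belief")
    return {k: v / norm for k, v in new_b.items()}

Even in this toy form, the nested loops make visible why the update becomes expensive: the components multiply with every peer and every time step.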
A simplified example of one such belief state trajectory is pictured in Figure 4.3. Computing the best-response policy boils down to solving the fully-observable MDP defined over the space of belief states. Figure 4.3 shows just one path through this MDP. In general, there will be a branch for each combination of action that agent $j$ can take and observation that agent $j$ might receive. The reward signal $R'$ of the belief-state MDP is equal to the expected immediate rewards that the team would receive (given uncertainty of the true system state).

$R'\left(b_j^t, a_j^t\right) = \sum_{\langle s^t, \vec{o}^{\,t}_{\neq j} \rangle} b_j^t\left(s^t, \vec{o}^{\,t}_{\neq j}\right) \left[ \sum_{s^{t+1} \in S} \Pr\left(s^{t+1} \mid s^t, \langle a_j^t, \pi_{\neq j}(\vec{o}^{\,t}_{\neq j}) \rangle\right) R\left(s^t, \langle a_j^t, \pi_{\neq j}(\vec{o}^{\,t}_{\neq j}) \rangle, s^{t+1}\right) \right]$

$U^*\left(b_j^t, a_j^t\right) = R'\left(b_j^t, a_j^t\right) + \left[ \sum_{o_j^{t+1} \in \Omega_j} \Pr\left(o_j^{t+1} \mid b_j^t, a_j^t\right) \cdot \max_{a_j^{t+1} \in A_j} U^*\left(BSU\left(b_j^t, a_j^t, o_j^{t+1}\right), a_j^{t+1}\right) \right] \quad (4.3)$

Equation 4.3, as derived by Nair et al. (2003), shows the calculation of immediate reward $R'()$, which agent $j$ can calculate by invoking the Dec-POMDP model's reward function $R()$. The value $U^*$ associated with each belief state and action pair is defined recursively as the sum of the immediate rewards and future rewards obtained by taking the optimal action in every subsequent belief state. The belief state space is continuous, containing an infinite number of possible distributions over world states. However, given that the Dec-POMDP has a finite horizon and finite state, action, and observation spaces (as is the case in the class of problems this thesis considers), there are a finite number of possible state transitions and observations, and hence a finite number of reachable belief states. Nair's best-response solution algorithm takes advantage of this fact by expanding only those belief states that are reachable.

Although there are a finite number of belief states, the belief state vector itself becomes computationally expensive to maintain as the problem size increases. A belief state encountered by agent $j$ at time $t$ contains $|S| \cdot \prod_{i \neq j} |\Omega_i|^t$ components. Under the assumption that all agents' individual observation spaces are bounded by $|\Omega_i|$, the worst-case space complexity is $O\left(|S| \cdot \left(|\Omega_i|^t\right)^{n-1}\right)$: exponential in the number of agents (as well as the problem time horizon). This exponential dependence carries over to the time complexity of any solution algorithm that performs component-wise belief updates (as in Equation 4.2, and depicted in Figure 4.3). Given that this computation is all directed towards computing a single agent's (best-response) policy, and the joint policy space is exponential in the number of agents, the cost of finding optimal joint policies in this manner is potentially doubly-exponential in the number of agents (in the worst case).
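To make this growth concrete, the following back-of-the-envelope Python sketch counts the components of the general best-response belief vector; the problem sizes plugged in are purely illustrative, not drawn from any experiment in this dissertation.

def general_belief_vector_size(num_states, peer_obs_sizes, t):
    """Number of components |S| * prod_i |Omega_i|^t in the general best-response
    belief state at time t (one entry per world state paired with a combination
    of peer observation histories)."""
    size = num_states
    for omega_i in peer_obs_sizes:
        size *= omega_i ** t
    return size

# e.g., 10 world states, three peers with 4 observations each, at time step t = 5:
# 10 * (4**5)**3 = 10,737,418,240 components
print(general_belief_vector_size(10, [4, 4, 4], 5))

Even modest assumed sizes already push the vector past ten billion entries, which motivates the condensed representation developed next.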
4.2.2 Condensed Belief State for TD-POMDP Agents

The general best-response belief state representation discussed in the previous section, while tractable for small problems involving two agents (as was demonstrated by Nair et al. (2003)), does not scale well to teams of three or more agents (as was shown empirically by Varakantham et al. (2009)). Though complete, its representation contains a significant amount of belief information that may be irrelevant for an agent in a weakly-coupled system. The intuition is that if most peer decisions have no bearing on the agent's local decision problem, then the agent need not distinguish most peer observation histories, nor distinguish most state information relating to peers' activities. The structured agent coupling of the TD-POMDP model leads us to define a representation of belief state that is more compact for such weakly-coupled cases, and whose compactness depends upon the scope of interaction. In the extreme case of independent agents, the new belief state representation is equivalent to the traditional single-agent POMDP belief state (Smallwood & Sondik, 1973). This is accomplished by representing only that information which is necessary to make optimal decisions.

First, consider the observational information in Equation 4.1: $\Pr\left(\vec{o}^{\,t}_{\neq j} \mid \vec{o}_j^{\,t}, \vec{a}_j^{\,t-1}\right)$. For general Dec-POMDP agents, the belief state includes a distribution over peer observation histories because of potential correlation between local observations and peer observations (as was the case for the rover agents in Example 4.2). Although the general Dec-POMDP allows for arbitrary correlation of agent and peer observations, the TD-POMDP makes explicit the structure with which observations can be correlated. By Definition 3.5, each individual TD-POMDP agent $j$'s observation is a function of its local state variables and local action. By Equation 3.3 in Definition 3.5, the only way that $j$'s observation may be correlated with a peer $i$'s observation is if there are state features common to both $i$'s and $j$'s local states. Moreover, the values of these mutually-modeled features are the only information that links the two agents' observations $o_i$ and $o_j$. Instead of maintaining a distribution over all peer observation histories, a TD-POMDP agent can instead maintain a distribution over just those mutually-modeled state feature values. This is all that a TD-POMDP agent needs in order to make distinctions between different peer observations (based on its own observations).

The other information represented by the general best-response belief state (Equation 4.1) is the world state distribution: $\Pr\left(s^t \mid \vec{o}_j^{\,t}, \vec{a}_j^{\,t-1}\right)$. This conveys the information necessary to predict how the system will progress from one time step to the next. For a TD-POMDP agent $j$, relevant state features are contained within its local state $s_j$. The consequences of its action are both determined solely by local state feature values and applied solely to (changes in) local state feature values. In planning its actions, maintaining a distribution over the subset of world state $s_j \subseteq s$ (in conjunction with the distribution of mutually-observed state feature histories) is enough to optimize its behavior given the fixed policies of its peers. This leads us to the following representation of belief state, which we prove in Section 4.2.3 sufficiently summarizes a TD-POMDP agent's action-observation history.

Definition 4.3. The TD-POMDP belief state for agent $j$, denoted $b_j$, represents a joint probability distribution over current local state $s_j$ and histories of mutually-modeled (Def. 3.13) features $\bar{m}_j$:

$b_j^t\left(s_j^t, \vec{m}_j^{\,t-1}\right) = \Pr\left(s_j^t, \vec{m}_j^{\,t-1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}\right) \quad (4.4)$

The mutually-modeled features $\bar{m}_j$ are the only world state features that may be mutually observable (Def. 3.2)
because each is modeled in some other agent's local state, and the other agents' observations do not depend on state features outside of their local states, respectively. However, one should note that these features may be only partially observable, or equivalently, indirectly observable (from observations of dependent locally-controllable state features), or (in the degenerate case) completely unobservable. Each feature $f$ in tuple $\bar{m}_j$ fits into one of the following categories:
1. $f$ is locally-controlled by agent $j$ and thus modeled as a nonlocally-controllable feature by some other agent $i$.
2. From $j$'s perspective $f$ is nonlocally-controlled, and thus modeled by exactly one other agent $i$ as a locally-controllable feature.
3. $f$ is an unaffectable feature that impacts both $i$'s and $j$'s local state transitions.

The novelty of the TD-POMDP belief state representation is its exploitation of weakly-coupled problem structure. Unlike Nair's belief state representation, which is exponential in the number of agents, the length of the vector in Equation 4.4 is exponential in the number of mutually-modeled state features irrespective of the number of agents. For weakly-coupled problems where several agents interact through a (proportionally) small number of world features, this new belief state representation will be much more manageable than the general belief state. The TD-POMDP belief state is updated in the same fashion as was the general belief state (described in Equation 4.2). The new belief state update function is as follows:

$b_j^{t+1} = BSU\left(b_j^t, a_j^t, o_j^{t+1}\right) = \Pr\left(s_j^{t+1}, \vec{m}_j^{\,t} \mid b_j^t, a_j^t, o_j^{t+1}\right)$
$= \dfrac{O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \sum_{s_j^t - \bar{m}_j^t} P_j^L\left(\bar{l}_j^{t+1} \mid s_j^t, a_j^t\right) P_j^U\left(\bar{u}_j^{t+1} \mid s_j^t\right) \Pr\left(\bar{n}_j^{t+1} \mid \vec{m}_j^{\,t}\right) b_j^t\left(s_j^t, \vec{m}_j^{\,t-1}\right)}{\Pr\left(o_j^{t+1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) \text{ : a normalization factor}} \quad (4.5)$

I present a detailed derivation of Equation 4.5 in Section 4.2.3, and describe here the individual terms, contrasting them with those of the general belief state update function (Equation 4.2). The first term, $O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right)$, which is the probability of the new observation given the action taken and the world state encoded in the respective component of the belief state vector, roughly corresponds to the third term in Equation 4.2. Due to the factored TD-POMDP observations, the local observation is not correlated with peer observations except in the values of the latest shared state features, and so it need not depend on other agents' actions or observations, nor on world features outside of the local state.

The second, third, and fourth terms, $P_j^L\left(\bar{l}_j^{t+1} \mid s_j^t, a_j^t\right) P_j^U\left(\bar{u}_j^{t+1} \mid s_j^t\right) \Pr\left(\bar{n}_j^{t+1} \mid \vec{m}_j^{\,t}\right)$, appear inside of a summation over possible last values of unshared features ($s_j^t - \bar{m}_j^t$). The product of these terms constitutes the probability of new local state given last local state, mutually-modeled history, and action, $\Pr\left(s_j^{t+1} \mid s_j^t - \bar{m}_j^t, \vec{m}_j^{\,t}, a_j^t\right)$, but has been factored into individual transition probability components according to Equation 3.10. The product roughly corresponds to the second term in Equation 4.2, but here need only represent the probability of new local state, and is consequently conditioned on a different set of past state and action information.

The last term, $b_j^t\left(s_j^t, \vec{m}_j^{\,t-1}\right)$, which too appears inside the summation, represents prior probability information encoded in the previous belief state, serving the same purpose as the first term in Equation 4.2, but invoking a different set of belief state information.
This prior gets multiplied by the product of the previous three terms to compute $\Pr\left(s_j^{t+1} \mid b_j^t, a_j^t\right)$. Just as in Equation 4.2, the denominator of Equation 4.5 can be treated as a constant factor for normalization since the variables that it is conditioned on take on the same value for all nonzero components of $b_j^{t+1}$.

Most of the new $BSU()$ terms are straightforward to compute from the TD-POMDP problem description (via application of $O_j()$, $P_j^L()$, and $P_j^U()$). The only exception is $\Pr\left(\bar{n}_j^{t+1} \mid \vec{m}_j^{\,t}\right)$. This is also the only term that depends upon peers' fixed-policy behavior. I will discuss this term further in Section 4.2.5 and Chapter 6. For the moment, assume this term is efficiently computable given the TD-POMDP model and the other agents' fixed policies.

Using the new belief state update function, agents maintain a different belief state representation, but the transition structure of the underlying MDP is congruent to that of Nair's belief state MDP. That is, each possible action-observation pair maps to a transition in the belief state MDP. The difference between the two representations is that Nair's belief state update function might group different action-observation sequences (together in one belief state) than would the TD-POMDP belief state update function. Let us rewrite the belief state MDP's reward function $R'()$ (from Equation 4.3) using our new belief state representation:

$R_j'\left(b_j^t, a_j^t\right) = \sum_{\langle s_j^t, \vec{m}_j^{\,t-1} \rangle} b_j^t\left(s_j^t, \vec{m}_j^{\,t-1}\right) \left[ \sum_{s_j^{t+1} \in S_j} P_j^L\left(\bar{l}_j^{t+1} \mid s_j^t, a_j^t\right) P_j^U\left(\bar{u}_j^{t+1} \mid s_j^t\right) \Pr\left(\bar{n}_j^{t+1} \mid \vec{m}_j^{\,t}\right) \cdot R_j\left(a_j^t, s_j^{t+1} = \langle \bar{l}_j^{t+1}, \bar{u}_j^{t+1}, \bar{n}_j^{t+1} \rangle\right) \right] \quad (4.6)$

Although the new belief state MDP reward function $R_j'()$ has roughly the same form as that described in Equation 4.3, its output differs in one important dimension. Instead of associating joint rewards with the belief states and actions, the new reward function takes advantage of the TD-POMDP's reward decomposition to assign immediate local values that are independent of the other agents' behavior (as given by $R_j()$). The repercussion of using this local reward valuation instead of a global reward valuation is that the best-response policy of agent $j$ maximizes its expected local utility instead of the expected joint utility. As proven in Chapter 6, this is satisfactory given that the encompassing search process entails all agents computing local best responses and combining their local utilities to evaluate each viably-optimal point in the joint policy space.

While it was straightforward to reason about peers' behavior and to see how their fixed policies were used in the general best-response belief state update, transition, and valuation from Section 4.2.1, it is less evident using the new belief state representation. However, we have isolated a single term common to the reward function (Equation 4.6) and belief state update function (Equation 4.5) that is dependent on nonlocal behavior: $\Pr\left(\bar{n}_j^{t+1} \mid \vec{m}_j^{\,t}\right)$. As developed in the next section, this term expresses the influence that is exerted on agent $j$ by the other agents as they execute their policies. All other pieces of agent $j$'s best-response model are independent of its peers' policies. Moreover, all other terms can be computed using only the local portions of the TD-POMDP model, thereby maintaining a separation of the individual agents' (potentially-private) information. The primary benefit of this TD-POMDP belief state representation is its compactness and scalability.
By taking advantage of the TD-POMDP's useful properties, such as the factorization of state, observations, and rewards that expresses agents' independence, and the structured transitions that express their weakly-coupled dependence on their peers, we are able to derive a best-response model that is more efficient to maintain. Whereas the general best-response belief state representation grows exponentially with the number of agents, the worst-case space complexity of this new representation is $O\left(|S_j| \cdot |M_j|^t\right)$ (where $M_j$ represents the domain of agent $j$'s shared feature values) irrespective of the number of agents. Although a distribution of histories of shared features is maintained, this is expected to be much more compact than representing a joint distribution of observation histories for several other agents (which was maintained by the general best-response model). Together with the reduction in space complexity of the TD-POMDP best-response belief state, the same reduction in time complexity ensues for any policy formulation method that performs component-wise belief state updates. Furthermore, if the problem exhibits local full observability (Definition 2.8), the belief state representation need not represent a probability distribution. That is, if each agent's current observation fully dictates the current local state, all that must be maintained is a unique history and not a probability distribution over all possible histories (since only one history will have positive probability).

4.2.3 TD-POMDP Belief State Sufficiency

Here, I prove the claim that the TD-POMDP agent belief state representation presented in Definition 4.3 sufficiently summarizes a TD-POMDP agent $j$'s action-observation history $a^0 \ldots o^t$. Before descending into the proof, I begin with a supporting definition and lemma.

Definition 4.4. A belief state vector $b_j^t$ is a sufficient statistic (for making predictions) if it encodes all of the information gained by agent $j$ as it executes from time 0 to time $t$ required for making predictions about future information that will be gained after time $t$.

Lemma 4.5. If, by maintaining a belief state vector $b_j^t$ and forgetting its past actions and observations $\langle \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t} \rangle$, agent $j$ can accurately evaluate all future action-observation probabilities, then $b_j^t$ is a sufficient statistic:

$b_j^t \text{ sufficient} \iff \forall\, t \leq T,\ k \leq (T - t + 1),\ \langle \vec{a}_j^{\,t+k-1}, \vec{o}_j^{\,t+k} \rangle:$
$\Pr\left(o_j^{t+k+1} \mid \vec{a}_j^{\,t+k-1}, \vec{o}_j^{\,t+k}, a_j^{t+k}\right) = \Pr\left(o_j^{t+k+1} \mid b_j^t, a_j^t, o_j^{t+1}, \ldots, a_j^{t+k-1}, o_j^{t+k}, a_j^{t+k}\right)$

Proof. I prove this lemma by analyzing how information is gained by agent $j$. Prior to execution, $j$ has its decision model (the TD-POMDP, in this case) and promised peer policies. The information that $j$ obtains during execution from times 0 to $t$ is that it performed a series of actions $\vec{a}_j^{\,t-1}$ and received a series of observations $\vec{o}_j^{\,t}$. Subsequently, from time $t$ to time $t + 1$, the only additional information gained is that action $a_j^t$ resulted in observation $o_j^{t+1}$. A complete information state would therefore be a record of all actions taken and observations received. If, as the premise of the lemma states, $j$ can accurately evaluate future action-observation probabilities, then $j$ can accurately evaluate the probabilities of all future information states. Any prediction that $j$ might want to make must depend only on information state (and the prior information contained in the decision model and peer policies).
Therefore $j$ can make every prediction about future information (as accurately as it could have by recording its information state $\langle \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t} \rangle$ exactly). By definition, $b_j^t$ is a sufficient statistic.

The result of Lemma 4.5 can be stated simply as follows. Because the agent interacts with the system only by performing actions and receiving observations, (probabilistically) predicting future action-observations allows prediction of anything else that the agent could dream of predicting. In other words, agent $j$'s belief state MDP constitutes a generative model of future action-observation consequences. As such, this model suffices for the agent to plan optimal decisions given fixed policies of its peers. In proving sufficiency of the TD-POMDP belief state, we will also have proven that the TD-POMDP belief-state methodology enables computation of optimal local best-response policies.

Theorem 4.6. The TD-POMDP belief state (Def. 4.3), $b_j^t = \Pr\left(s_j^t, \vec{m}_j^{\,t-1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}\right)$, is a sufficient statistic.

Proof. By Lemma 4.5, to prove that the belief state representation $b_j^t$ is a sufficient statistic, it suffices to prove that for any action-observation history $\langle \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t} \rangle$, the probabilities of all future observations obtained by taking any future actions (given action-observation history) can be determined directly (and exactly) from the belief state vector. I prove this by reverse induction over history length $t$:

Base Case ($t = T$): At the problem horizon (time $T$), agent $j$ has taken all of the actions and received all of the observations already. Hence, there are no future predictions to be made. Trivially, $b_j^T$ is a sufficient statistic for predicting the empty set of probabilities of future action-observation consequences.

Inductive Step: Next, we derive that if $b_j^{t+1}$ is sufficient for computing all future action-observation probabilities given $\langle \vec{a}_j^{\,t}, \vec{o}_j^{\,t+1} \rangle$, this implies that $b_j^t$ must also be sufficient (given $\langle \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t} \rangle$). The following equation expresses the belief state vector at time $t + 1$.

$b_j^{t+1}\left(s_j^{t+1}, \vec{m}_j^{\,t}\right) = \Pr\left(s_j^{t+1}, \vec{m}_j^{\,t} \mid \vec{a}_j^{\,t}, \vec{o}_j^{\,t+1}\right) = \Pr\left(s_j^{t+1}, \vec{m}_j^{\,t} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t, o_j^{t+1}\right)$
by Definition 4.3, and expansion of the action-observation history vectors

$= \dfrac{\Pr\left(o_j^{t+1}, s_j^{t+1}, \vec{m}_j^{\,t}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)}{\Pr\left(o_j^{t+1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)}$
by definition of conditional probability

$= \dfrac{\Pr\left(o_j^{t+1} \mid s_j^{t+1}, \vec{m}_j^{\,t}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) \Pr\left(s_j^{t+1}, \vec{m}_j^{\,t} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) \Pr\left(\vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)}{\Pr\left(o_j^{t+1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) \Pr\left(\vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)}$
by two applications of the definition of conditional probability

$= \dfrac{O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \Pr\left(s_j^{t+1}, \vec{m}_j^{\,t} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) \Pr\left(\vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)}{\Pr\left(o_j^{t+1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) \Pr\left(\vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)}$
by definition of the TD-POMDP local observation function $O_j$ (Def. 3.5)
$= \dfrac{O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \sum_{s_j^t \in S_j} \Pr\left(s_j^{t+1}, s_j^t, \vec{m}_j^{\,t-1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) \Pr\left(\vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)}{\Pr\left(o_j^{t+1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) \Pr\left(\vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)}$
by the law of total probability

$= \dfrac{O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \sum_{s_j^t} \Pr\left(s_j^{t+1} \mid s_j^t, \vec{m}_j^{\,t-1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) \Pr\left(s_j^t, \vec{m}_j^{\,t-1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)}{\Pr\left(o_j^{t+1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)}$
by applications of the definition of conditional probability, and cancellation

$= \dfrac{O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \sum_{s_j^t} \Pr\left(s_j^{t+1} \mid s_j^t, \vec{m}_j^{\,t-1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) \Pr\left(s_j^t, \vec{m}_j^{\,t-1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}\right)}{\Pr\left(o_j^{t+1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)}$
because current state is independent of future action

$= \dfrac{O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \sum_{s_j^t} \Pr\left(s_j^{t+1} \mid s_j^t, \vec{m}_j^{\,t-1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) b_j^t\left(s_j^t, \vec{m}_j^{\,t-1}\right)}{\Pr\left(o_j^{t+1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)} \quad (4.7)$
by substitution of the appropriate belief state component (Definition 4.3)

We can further simplify Equation 4.7 by targeting the second term in the numerator, which specifies the conditional probability of the local state at time $t + 1$ dependent on actions, observations, and various feature values at previous time steps. This term may be expanded as follows by taking into account the TD-POMDP's factorization of local state and local transition developed in Section 3.2.2. Recall that local state $s_j$ is composed of locally-controllable features $\bar{l}_j$, unaffectable features $\bar{u}_j$, and nonlocally-controllable features $\bar{n}_j$.

$\Pr\left(s_j^{t+1} \mid s_j^t, \vec{m}_j^{\,t-1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) = \Pr\left(\bar{l}_j^{t+1}, \bar{u}_j^{t+1}, \bar{n}_j^{t+1} \mid s_j^t, \vec{m}_j^{\,t-1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)$

$= \Pr\left(\bar{l}_j^{t+1} \mid \bar{u}_j^{t+1}, \bar{n}_j^{t+1}, s_j^t, \vec{m}_j^{\,t-1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) \Pr\left(\bar{u}_j^{t+1}, \bar{n}_j^{t+1} \mid s_j^t, \vec{m}_j^{\,t-1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)$
by application of Bayes' rule

$= P_j^L\left(\bar{l}_j^{t+1} \mid s_j^t, a_j^t\right) \Pr\left(\bar{u}_j^{t+1}, \bar{n}_j^{t+1} \mid s_j^t, \vec{m}_j^{\,t-1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)$
by substitution of the factored local feature transition function $P_j^L$ (Eq. 3.10)

$= P_j^L\left(\bar{l}_j^{t+1} \mid s_j^t, a_j^t\right) \Pr\left(\bar{u}_j^{t+1} \mid \bar{n}_j^{t+1}, s_j^t, \vec{m}_j^{\,t-1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) \Pr\left(\bar{n}_j^{t+1} \mid s_j^t, \vec{m}_j^{\,t-1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)$
by application of Bayes' rule

$= P_j^L\left(\bar{l}_j^{t+1} \mid s_j^t, a_j^t\right) P_j^U\left(\bar{u}_j^{t+1} \mid s_j^t\right) \Pr\left(\bar{n}_j^{t+1} \mid s_j^t, \vec{m}_j^{\,t-1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)$
by substitution of the unaffectable feature transition function $P_j^U$ (Eq. 3.10)

$= P_j^L\left(\bar{l}_j^{t+1} \mid s_j^t, a_j^t\right) P_j^U\left(\bar{u}_j^{t+1} \mid s_j^t\right) \Pr\left(\bar{n}_j^{t+1} \mid \vec{m}_j^{\,t}\right) \quad (4.8)$

The last step in the derivation of Equation 4.8 relies on the property that, given a history over mutually-modeled feature values $\vec{m}_j^{\,t}$, new nonlocally-controlled feature values $\bar{n}_j^{t+1}$ are conditionally independent of remaining local state feature values, local observation history, and local action history. It is straightforward to reason that $\bar{n}_j^{t+1}$ is independent of the latest local action. Since each feature in $\bar{n}_j^{t+1}$ is controlled by some other agent $i$, its value would depend only on $i$'s latest action $a_i^t$ (dictated by a fixed policy over $i$'s past observations $\vec{o}_i^{\,t}$) and $i$'s latest values of state features $s_i^t$, none of which can be affected by $a_j^t$ until the next time step $t+1$ at the earliest. Justification of the other conditional independencies requires an intimate look at the factored structure of TD-POMDP feature transitions (described formally in Section 3.2.2). Relationships among TD-POMDP variables may be represented graphically by the 2-stage DBN in Figure 4.4, which divides all of the world state features into five distinct sets.
The lower-most state variable, $s_{\subseteq j}$, represents those (unshared) features from agent $j$'s local state which do not appear in any other agent's local state. Working our way upwards, agent $j$'s mutually-modeled features $\bar{m}_j$ appear within the grey box, and are further decomposed into shared locally-controlled features $\bar{l}_j \subseteq \bar{m}_j$, shared unaffectable features $\bar{u}_j \subseteq \bar{m}_j$, and shared nonlocally-controlled features $\bar{n}_j \subseteq \bar{m}_j$. The state features that remain are features that appear in other agents' local states but not $j$'s local state and are represented above the grey box as variable $s_{\neq j}$. The observations are also divided into agent $j$'s observations $o_j$ and those of the other agents $o_{\neq j}$. Actions are similarly divided, with all other agents' policies assumed to be fixed (because agent $j$ is the one computing a best response). The connecting arrows follow from the definitions of the TD-POMDP state transitions and local observation function developed in Sections 3.2.1–3.2.2.

Captured within the DBN in Figure 4.4 are a number of different conditional independence relationships (Russell et al., 1996). The relationship that we will take advantage of is one of direction-dependent separation (Pearl, 1988), or d-separation for short. In review, a directed path from a node $x$ to a node $y$ is blocked given a set of evidence nodes $E$ if the path contains a node in $E$. A set of evidence nodes $E$ d-separates a node $x$ from another node $y$ if all paths between $x$ and $y$ are blocked. If $E$ d-separates $x$ from $y$, $y$ is conditionally independent of $x$ given $E$. From the shading of Figure 4.4, it is plain to see that every path leading from any node in $\{s_{\subseteq j}^t, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\}$ to $\bar{n}_j^{t+1}$ passes through evidence set $\vec{m}_j^{\,t}$ (highlighted in grey). Hence, every node in the set $\{s_{\subseteq j}^t, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\}$ is d-separated from $\bar{n}_j^{t+1}$ by evidence set $\vec{m}_j^{\,t}$ (highlighted in grey). This implies the conditional independence relationship $\Pr\left(\bar{n}_j^{t+1} \mid s_j^t, \vec{m}_j^{\,t-1}, \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) = \Pr\left(\bar{n}_j^{t+1} \mid \vec{m}_j^{\,t}\right)$. And therefore, the last step in the derivation of Equation 4.8 holds.

[Figure: a two-stage DBN over agent $j$'s unshared local features $s_{\subseteq j} = s_j - \bar{m}_j$, the mutually-modeled evidence variables $\bar{m}_j$ (decomposed into $\bar{n}_j$, $\bar{u}_j$, $\bar{l}_j$), the remaining features $s_{\neq j} = s - s_j$, the agents' observations $o_j$ and $o_{\neq j}$, and their actions, with the peer policies $\pi_{\neq j}$ fixed.] Figure 4.4: A DBN expressing CI relationships among TD-POMDP variables.

Intuitively, this conditional independence relationship is made possible by the nonconcurrency of agent influence. There is no path leading from action $a_j^t$ to $\bar{n}_j^{t+1}$ because agent $j$ cannot influence the transition probabilities of other agents' locally-controlled features until the next time step. This is reflected in the path leading from $a_j^{t-2}$ to $\bar{l}_j^{t-1}$, and continuing on to $s_{\neq j}^t$. Similarly, agent $j$ cannot influence other agents' actions until first effecting a change in its local feature values. Plugging Equation 4.8 back into Equation 4.7 results in the following simplified expression for belief state $b_j^{t+1}$:

$b_j^{t+1}\left(s_j^{t+1}, \vec{m}_j^{\,t}\right) = \Pr\left(s_j^{t+1}, \vec{m}_j^{\,t} \mid \vec{a}_j^{\,t}, \vec{o}_j^{\,t+1}\right)$
$= \dfrac{O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \sum_{s_j^t - \bar{m}_j^t} P_j^L\left(\bar{l}_j^{t+1} \mid s_j^t, a_j^t\right) P_j^U\left(\bar{u}_j^{t+1} \mid s_j^t\right) \Pr\left(\bar{n}_j^{t+1} \mid \vec{m}_j^{\,t}\right) b_j^t\left(s_j^t, \vec{m}_j^{\,t-1}\right)}{\Pr\left(o_j^{t+1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right)} \quad (4.9)$

The first thing to note is that the denominator in Equation 4.9 is equal for all components of the vector (because all are conditioned on the same action-observation history).
Thus, it can be treated as a normalizing constant which is equal to the sum, over all components, of their respective numerators:

$\Pr\left(o_j^{t+1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t\right) = \sum_{\langle s_j^{t+1}, \vec{m}_j^{\,t} \rangle} \left[ O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \sum_{s_j^t - \bar{m}_j^t} P_j^L\left(\bar{l}_j^{t+1} \mid s_j^t, a_j^t\right) P_j^U\left(\bar{u}_j^{t+1} \mid s_j^t\right) \Pr\left(\bar{n}_j^{t+1} \mid \vec{m}_j^{\,t}\right) b_j^t\left(s_j^t, \vec{m}_j^{\,t-1}\right) \right] \quad (4.10)$

Turning back to Equation 4.9, determining the belief state at time $t + 1$ given action and observation history only involves computations of the five terms in the numerator. The first three terms are simply applications of the TD-POMDP agents' local observation function and local transition functions (contained in the TD-POMDP model). The fourth term is not so straightforward to calculate, but it does not depend on knowledge of the action-observation history. Nor does the fifth term. In fact, the numerator can be computed using only knowledge of the previous belief state $b_j^t$ and without keeping track of the action-observation history. That is, the next belief state is a function of the current belief state and next action-observation pair. Equation 4.9 thereby serves as a belief state update function for TD-POMDP agents. Further, the derivation of Equation 4.9 implies that the process as defined over belief states is Markovian.

Recall that, for the purposes of this proof, we assumed that $b_j^{t+1}$ was sufficient for determining probabilities of action-observation pairs from time $t+1$ onward. Under this assumption, $b_j^t$ must be sufficient for determining probabilities of action-observation pairs from time $t$ onward. Such predictions at time $t$ (whereby an action $a_j^t$ induces an observation $o_j^{t+1}$) are performed by Equation 4.10, which is based solely on belief state $b_j^t$ without the need to remember past actions and observations. Predictions at time $t + 1$ and beyond may be made by applying Equation 4.9 to determine the next belief state, which in turn (given our inductive assumption) can be used to determine all action-observation probabilities from times $t + 1$ to the end of horizon $T$. Thus, our inductive step holds. This completes the proof that for all values of $t$, $b_j^t$ is a sufficient statistic for the action-observation history $\langle \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t} \rangle$.

The implication is that agent $j$ can make optimal decisions by basing its action choices solely on the probability distribution expressed by Equation 4.4. As it takes actions and receives observations, it can safely forget past actions and observations as long as it updates its new belief state from its previous belief state with every new observation (using Equation 4.9). Although every action-observation history maps to a single belief state, not every representable belief state (of which there are infinitely many) corresponds to an action-observation history. Just as with the general best-response belief state representation (Sec. 4.2.1), several histories may map to the same TD-POMDP belief state. In other words, the reachable belief state space is potentially significantly smaller than the number of possible action-observation histories.

4.2.4 Complexity of Best Response Computation

In Section 4.2.2, my comparison against the general best-response belief state focused on the efficiency of updating the condensed TD-POMDP belief state representation, a result based solely on the relationship between the number of components in the belief state vector and the number of shared state features.
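The companion Python sketch to the earlier component count makes the comparison tangible; as before, the local state size and shared-feature domain size are assumed, illustrative numbers rather than measured quantities.

def tdpomdp_belief_vector_size(local_state_size, shared_feature_domain_size, t):
    """Worst-case number of components |S_j| * |M_j|^t in the condensed TD-POMDP
    belief state: a joint distribution over current local state and histories of
    mutually-modeled feature values, independent of the number of agents."""
    return local_state_size * shared_feature_domain_size ** t

# Same assumed problem scale as before (10 local states, two boolean shared
# features so |M_j| = 4, time step t = 5): 10 * 4**5 = 10,240 components,
# versus roughly 10.7 billion for the general best-response vector.
print(tdpomdp_belief_vector_size(10, 4, 5))

Under these assumptions the condensed vector is smaller by about six orders of magnitude, and, unlike the general vector, its size does not change as more peers are added.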
There is yet another distinct, and arguably more significant, computational advantage to using the TD-POMDP belief state, relating to the overall complexity of planning best-response policies. Recall that the purpose of instantiating the belief state formalism is to facilitate the solving of a single-agent POMDP model (as portrayed in Figure 4.2). That is, once an agent's peers' policies have been fixed, the peers become anonymous facets of a single-agent environment, and $b_j^t$ encodes the agent's belief about the state of the single-agent POMDP that represents that environment. Using the condensed representation I developed, the best-response POMDP need only model a subset of those features from the original TD-POMDP world state. To see this, let us rewrite the TD-POMDP belief state update equation (Eq. 4.5), replacing the belief state component index with a variable $x_j^t \doteq \langle s_j^t, \vec{m}_j^{\,t-1} \rangle$.

$b_j^{t+1}\left(x_j^{t+1}\right) = BSU\left(b_j^t, a_j^t, o_j^{t+1}\right)$
$= \dfrac{O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \sum_{s_j^t - \bar{m}_j^t} P_j^L\left(\bar{l}_j^{t+1} \mid s_j^t, a_j^t\right) P_j^U\left(\bar{u}_j^{t+1} \mid s_j^t\right) \Pr\left(\bar{n}_j^{t+1} \mid \vec{m}_j^{\,t}\right) b_j^t\left(x_j^t\right)}{\text{a normalizing factor}}$
by substitution of $x_j^t = \langle s_j^t, \vec{m}_j^{\,t-1} \rangle$ into Eq. 4.5

$= \dfrac{O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \sum_{s_j^t - \bar{m}_j^t} \Pr\left(s_j^{t+1} \mid s_j^t, a_j^t, \vec{m}_j^{\,t-1}\right) b_j^t\left(x_j^t\right)}{\text{a normalizing factor}}$
by collection of transition terms, given Eq. 3.10

$= \dfrac{O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \sum_{s_j^t - \bar{m}_j^t} \Pr\left(s_j^{t+1}, \vec{m}_j^{\,t} \mid s_j^t, a_j^t, \vec{m}_j^{\,t-1}\right) b_j^t\left(x_j^t\right)}{\text{a normalizing factor}} \quad (4.11)$
because $\vec{m}_j^{\,t}$ is included in the conditional information $\{s_j^t, \vec{m}_j^{\,t-1}\}$

$= \dfrac{O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \sum_{s_j^t - \bar{m}_j^t} \Pr\left(x_j^{t+1} \mid s_j^t, a_j^t, \vec{m}_j^{\,t-1}\right) b_j^t\left(x_j^t\right)}{\text{a normalizing factor}}$
by substitution of $x_j^{t+1} = \langle s_j^{t+1}, \vec{m}_j^{\,t} \rangle$

$= \dfrac{O_j\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}\right) \sum_{\langle s_j^t, \vec{m}_j^{\,t-1} \rangle} \Pr\left(x_j^{t+1} \mid s_j^t, a_j^t, \vec{m}_j^{\,t-1}\right) b_j^t\left(x_j^t\right)}{\text{a normalizing factor}}$
because, for the additional combinations of values of $\langle s_j^t, \vec{m}_j^{\,t-1} \rangle$ considered by the summation, $\Pr\left(x_j^{t+1} \mid s_j^t, a_j^t, \vec{m}_j^{\,t-1}\right) = 0$

$= \dfrac{\Pr\left(o_j^{t+1} \mid a_j^t, s_j^{t+1}, \vec{m}_j^{\,t}\right) \sum_{\langle s_j^t, \vec{m}_j^{\,t-1} \rangle} \Pr\left(x_j^{t+1} \mid s_j^t, a_j^t, \vec{m}_j^{\,t-1}\right) b_j^t\left(x_j^t\right)}{\text{a normalizing factor}}$
by Def. 3.5 and the conditional independence of observation $o_j^{t+1}$ on past state information given $\langle a_j^t, s_j^{t+1} \rangle$

$= \dfrac{\Pr\left(o_j^{t+1} \mid a_j^t, x_j^{t+1}\right) \sum_{x_j^t} \Pr\left(x_j^{t+1} \mid x_j^t, a_j^t\right) b_j^t\left(x_j^t\right)}{\text{a normalizing factor}} \quad (4.12)$
by substitutions of $x_j^t$

Comparing the simplification in Equation 4.11 with the single-agent POMDP belief state (reviewed in Section 2.2.2.2), we see that the TD-POMDP best-response belief state is identical to that of a single-agent POMDP with state $x_j^t = \langle s_j^t, \vec{m}_j^{\,t-1} \rangle$. Moreover, the TD-POMDP best-response model is itself a POMDP with state $\langle s_j^t, \vec{m}_j^{\,t-1} \rangle$, a subset of the world state representation $s^t$ of the joint decision model (Def. 3.15).

Observation 4.7. The state of the single-agent POMDP used for the TD-POMDP best response need only represent features $\langle s_j^t, \vec{m}_j^{\,t-1} \rangle$.

Combining POMDP complexity theory from Section 2.2.3 with Observation 4.7, I deduce the following result.

Observation 4.8. In the worst case, planning a best response for a TD-POMDP agent $j$ requires time exponential in $\|S_j\| \cdot \|M_j\|^{T-1}$, denoted $EXP\left(\|S_j\| \cdot \|M_j\|^{T-1}\right)$.

Simply put, the size of the state space of the best-response POMDP is at most $\|S_j\| \cdot \|M_j\|^{T-1}$, and all known (general POMDP) solution algorithms have a worst-case time complexity exponential in the size of the state space. A stronger result holds for TD-POMDP problems with local full observability (Definition 2.8), wherein the belief state encodes the exact value of $\langle s_j^t, \vec{m}_j^{\,t-1} \rangle$ (instead of a probabilistic distribution over values).
For such problems, the best-response 134 model is an MDP, for which the complexity is known to be polynomial in the size of the state space. Observation 4.9. For a locally-fully observable TD-POMDP, the worst-case time to plan agent j’s best response is polynomial in kSj k · kMj kT −1 . Observation 4.9 follows directly from the MDP’s polynomial complexity (Papadimitriou & Tsitsiklis, 1987), a result reviewed in Section 2.2.3. Just like the size of a TD-POMDP best-response belief state, the complexity of a TD-POMDP agent’s best-response computation depends upon the number of shared state features km̄j k and the time horizon T but not necessarily on the amount of state information for the entire team of agents. For problems in which the world state space grows exponentially with the number of agents, I expect the computational savings afforded by the TD-POMDP best response model to be substantial, enabling scaling of the best response computation to larger problems with more agents than was possible with the general best response representation. 4.2.5 Influence Information Aside from the complexity of local planning reasoning, another benefit of the TD-POMDP belief state is that it distinguishes nonlocal information dependent on other agents’ policies from local information independent of other agents’ policies. The nonlocal information serves as an abstraction of peers’ policies and, as I develop in Section 4.5, facilitates an efficient partitioning of the nonlocal policy space into impact equivalence classes (Def. 3.45). The only component of a TD-POMDP agent’s best-response model (Section 4.2.2)  that is dependent on the policies of the agent’s peers is one term, P r n̄t+1 ~ tj , found j |m in both the transition probabilities (Eq. 4.5) and the reward function (Eq. 4.6) of the  best-response POMDP. As such, the probability distribution P r n̄t+1 ~ tj represents j |m exactly the information that j needs (prior to execution) in order to model (and compute best response policies to) planned behavior of its peers. All of the other information required for best-response decision making is contained within j’s local model (Def. 3.16) and is independent of other agents’ decisions. The consequences of a peer agent i’s decisions with respect to j manifest themselves   exclusively in the values of P r n̄t+1 ~ tj . As such, we call P r n̄t+1 ~ tj the influence j |m j |m of agent j’s peers on agent j. By altering its plans, a peer i might change its influence, in turn changing j’s best response. 135 4.3 Characterization of Transition Influences In the last section, I derived a condensed representation of belief state for TDPOMDP agents, deriving a local best response model and contrasting the size of its representation and the computational complexity of its employment (for formulating best responses) with those of the general best-response belief state representation. Another important distinction is that the TD-POMDP best response model is not seeded with fixed peer policies, but instead with fixed influences. That is, in order to compute a best response to agent i’s policy πi , agent j may not need to know all the details of πi (which were necessary when using the general belief-state representation in Section 4.2.1). Instead it only needs to know the influence of πi . Definition 4.10. 
The influence of agent $i$'s policy $\pi_i$ on agent $j$, denoted $\Gamma^j_{\pi_i}$, is information summarizing $\pi_i$ that is sufficient for agent $j$ to plan a best response $\pi_j^*(\pi_i, \bar{\pi}_K)$ to $\pi_i$ (and the policies $\bar{\pi}_K$ of $i$'s other peers $K$):

$\forall j,\ \forall \bar{\pi}_K \in \times_{k \in (N - \{i,j\})} \Pi_k:\quad \pi_j^*\left(\Gamma^j_{\pi_i}, \bar{\pi}_K\right) = \pi_j^*\left(\pi_i, \bar{\pi}_K\right).$

The objective of influence-based abstraction is to reduce the amount of information that agents need to exchange and coordinate over (during planning). By abstracting away inessential details of agent $i$'s policy $\pi_i$, $\Gamma^j_{\pi_i}$ should compactly encode the consequences of $i$'s behavior as it relates to agent $j$'s decisions. With such an abstraction, agent $i$ need not broadcast its full policy containing a multitude of decisions (exponential in the number of possible sequences of observations), nor disclose intimate details of plans that have no bearing on $j$'s decisions. Figure 4.5 depicts the usage of the influence-based abstraction, wherein agent $j$'s best-response calculation takes as input the influence $\Gamma^j_{\pi_i}$ abstracted from peer policy $\pi_i$ and returns agent $j$'s consequent optimal local policy $\pi_j^*\left(\Gamma^j_{\pi_i}\right)$.

[Figure: pipeline in which peer policy $\pi_i$ passes through an "influence abstraction" block producing influence $\Gamma^j_{\pi_i}$, which feeds a "best response computation" block that outputs $\pi_j^*\left(\Gamma^j_{\pi_i}\right)$.] Figure 4.5: Abstracting influences from policies.

In the context of TD-POMDP agent coordination, the "best response" block refers to the solving of the POMDP model described in Sections 4.2.2–4.2.4, whereas details of the "influence abstraction" block are presented later on in Chapter 5. The focus of this section is on the identification of the content and structure of influence information.

The concept of influence-based abstraction is quite general, deriving inspiration from work in multiagent classical planning (Durfee & Lesser, 1991; Smith, 1980; Tambe, 1997; Xuan & Lesser, 1999) as well as methodologies for solving other specialized classes of Dec-POMDPs (Becker et al., 2004a; Musliner et al., 2006). And though the discussion here will remain centered around TD-POMDP agents' influences, the development of characteristics and formulations of influence models that follow may be more broadly applicable. This dissertation constitutes the first endeavor at a general characterization of influence in the context of sequential decision making, which subsumes several related influence models (noted and cited where appropriate). In the subsections that follow, I systematically categorize influences, revealing a language through which agents can convey the policy information essential to coordination. My development of influence terminology culminates, in Section 4.3.5, with a formal characterization of a complete influence model for TD-POMDP agents.

4.3.1 Transition Influences

In the TD-POMDP, the only way that agent $i$ can impact $j$ is through the manipulation of nonlocal features. As such, information about the expected transitions of nonlocal features sufficiently summarizes $\pi_i$. Let us call this particular type of influence a transition influence.

Definition 4.11. The transition influence of TD-POMDP agent $i$'s policy $\pi_i$ on TD-POMDP agent $j$'s nonlocal feature $n_{jx}$ is a probability distribution $\Gamma^j_{\pi_i}(n_{jx}) = \Pr\left(n_{jx}^{t+1} \mid \ldots\right)$ that serves as a sufficient summary of $\pi_i$ for $j$ to predict the (probability distribution over) values of $n_{jx}^{t+1}$ for any action-observation history $\langle \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}, a_j^t \rangle$ that $j$ may encounter.

By representing influences of peers' policies using probability distributions, agents can straightforwardly construct transition models for each of their nonlocal features. In general, modeling the transitions of a Dec-POMDP state feature would require a transition probability for every value of the feature conditioned on every feature of the world state and every joint action. However, given the factorization of TD-POMDP state, modeling the transitions of a TD-POMDP agent's nonlocal feature often requires substantially less information.
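One plausible way to operationalize Definition 4.11 in code is as a conditional probability table keyed by whatever conditioning information the influence requires. The Python class below is a hypothetical sketch (not an interface defined by this dissertation), populated here for the constrained rover interaction of Example 4.1; the value p = 0.7 is arbitrary and purely illustrative.

class TransitionInfluence:
    """Hypothetical container for a transition influence Gamma(n_jx): a CPT over
    the next value of one nonlocal feature, keyed by conditioning values such as
    mutually-modeled features (e.g., the current time and the feature itself)."""

    def __init__(self):
        self._cpt = {}

    def set(self, condition, distribution):
        assert abs(sum(distribution.values()) - 1.0) < 1e-9, "each row must sum to 1"
        self._cpt[condition] = dict(distribution)

    def prob(self, next_value, condition):
        return self._cpt[condition].get(next_value, 0.0)

# Example 4.1's constrained influence: site-C-prepared can only turn true going
# into time 4, so a single probability p pins down the entire table.
p = 0.7
site_c = TransitionInfluence()
for t in range(8):
    site_c.set((t, True), {True: 1.0, False: 0.0})          # once prepared, stays prepared
    flip = p if t == 3 else 0.0                              # can only become true at time 4
    site_c.set((t, False), {True: flip, False: 1.0 - flip})
print(site_c.prob(True, (3, False)))                         # -> 0.7

The running example that follows walks through exactly this structure, first in its single-parameter form and then in progressively richer variants.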
Example 4.1 (continued). Turning back to the 2-agent problem shown in Figure 4.1, consider the influence of rover 5's policy on rover 6, which may be modeled using a transition influence $\Gamma^j_{\pi_i}(\textit{site-C-prepared}) = \Pr(\textit{site-C-prepared}^{t+1} \mid \ldots)$. In this particular problem, rover 6 does not need a complete probability distribution that is conditioned on all features. In fact, the only features that rover 6 can use to predict the value of site-C-prepared are time and site-C-prepared itself. Although site-C-prepared is dependent on other features from rover 5's local state, rover 6 cannot observe any evidence of these features except through its observations of site-C-prepared and time. Thus, all other features can be marginalized out of the distribution $\Pr(\textit{site-C-prepared} \mid \ldots)$. Furthermore, the only influence information that is relevant to rover 6 is the probability with which site-C-prepared will become true conditioned on time = 4. At the start of execution, site-C-prepared will take on value false and remain false until rover 5 completes its "Prepare Site C" task (constrained to finish only at time 4, if at all, given the task window in Figure 4.1). After the site is prepared, the feature will remain true thereafter until the end of execution. With these constraints, there is no uncertainty about when site-C-prepared will become true, but only whether it will become true (at time = 4). Hence, the influence of rover 5's policy can be summarized with just a single probability value, $\Pr(\textit{site-C-prepared} = true \mid time = 4)$, from which rover 6 can infer all transition probabilities of site-C-prepared.

4.3.2 State-Dependent Influences

The influence in Example 4.1 (Figure 4.1) has a very simple structure due to the highly-constrained transitions of the nonlocal feature. By removing constraints, we can more generally categorize the influence between rover 5 and rover 6.

Example 4.1 (continued). Let the window of execution of "Prepare Site C" be unconstrained: [0, 8]. With this change, there is the possibility of rover 5 preparing site C at any time during execution. The consequence is that a single probability is no longer sufficient to characterize rover 5's influence. Instead of representing a single probability value, rover 6 needs to represent a probability for each time at which site-C-prepared could be set to true. In this case, a set of probabilities $\Pr(\textit{site-C-prepared}^{t+1} = true \mid \textit{site-C-prepared}^t = false, time^t = t)$, $\forall t$, is required, each of which is conditioned on features site-C-prepared and time.

Definition 4.12. A transition influence $\Gamma_{\pi_i}(n_{jx})$ is state-dependent with respect to a subset of features $\bar{f} \subseteq s$ if its summarizing distribution need be conditioned only on $\bar{f}$'s latest value: $\Gamma_{\pi_i}(n_{jx}) = \Pr\left(n_{jx}^{t+1} \mid \bar{f}^{\,t}\right)$.

The set of probabilities $\Pr(\textit{site-C-prepared}^{t+1} \mid \textit{site-C-prepared}^t, time^t)$ in Example 4.1 is an abstraction of rover 5's policy that conveys both the probability of the interaction taking place and its potential timing.
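In code, the relaxed-window influence just described amounts to one probability per time step at which "Prepare Site C" might complete. The sketch below makes that set of probabilities explicit; the numeric values are illustrative placeholders, not derived from any particular policy of rover 5.

# One assumed completion probability per time step for the relaxed window [0,8].
completion_prob = {0: 0.0, 1: 0.1, 2: 0.2, 3: 0.3, 4: 0.1, 5: 0.05, 6: 0.05, 7: 0.0}

def pr_site_c_prepared_next(prepared_now, t):
    """Pr(site-C-prepared^{t+1} = True | site-C-prepared^t, time^t = t)."""
    if prepared_now:
        return 1.0                      # once prepared, the site stays prepared
    return completion_prob.get(t, 0.0)  # chance rover 5 finishes preparation now

print(pr_site_c_prepared_next(False, 3))   # -> 0.3

The summary has grown from one number to one number per time step, which is precisely the state-dependence that the following discussion formalizes and extends.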
Definition 4.12 extends past development of more restrictive forms of state-dependent influences called commitments, which accounted for time but not probability (Musliner et al., 2006) or probability but not time (Witwicki & Durfee, 2007). More generally, state-dependent influences may be conditioned on features other than time. For instance, as I describe in Example 4.13, agents' influence might need to be conditioned on other jointly-observable features such as weather.

4.3.3 History-Dependent Influences

Generalizing further, the probability of an interaction may differ based on both present and past values of state features.

Example 4.13. Consider the satellite and rover from Figure 3.1, and suppose that they jointly observe a feature weather that is unaffectable, but that may affect their interaction. For instance, if it is cloudy in the morning, this prohibits the satellite from taking pictures, and consequently lowers the probability that it builds a path for the rover in the afternoon. Thus, by monitoring the history of the weather, the rover could anticipate the lower likelihood of help from the satellite, and might change some decisions accordingly. Using Definition 4.14, we say that influence $\Gamma_{\pi_i}(\textit{path-A-built})$ is history-dependent with respect to feature weather.

Definition 4.14. A transition influence $\Gamma_{\pi_i}(n_{jx})$ is history-dependent w.r.t. features $\bar{f} \subseteq s$ if its summarizing distribution need be conditioned on the history of values of $\bar{f}$: $\Gamma_{\pi_i}(n_{jx}) = \Pr\left(n_{jx}^{t+1} \mid \vec{f}^{\,t}\right)$.

Becker et al. (2004a) employ a special case of history-dependent influences in their Event-driven Dec-MDP solution algorithm, wherein agents augment their local decision models with event histories, thereby representing the probability of future nonlocal events conditioned on the histories of past events.

4.3.4 Influence-Dependent Influences

Transition influences may also be interdependent.

Example 4.15. For instance, in the interaction digraph in Figure 3.7, agent R4 has two arcs (labeled $n_{4b}$ and $n_{4c}$) coming in from agent SAT3, indicating that agent 3 is exerting two influences, such as if agent SAT3 could build two different paths for agent R4. In the case that agent 3's time spent building one path leaves too little time to plan the other path, the nonlocal features $n_{4b}$ and $n_{4c}$ are highly correlated, requiring that their joint distribution be represented.

Definition 4.16. Two transition influences $\Gamma_{\pi_i}(n_{jx})$ and $\Gamma_{\pi_i}(n_{jy})$ are influence-dependent if $\Gamma_{\pi_i}(n_{jx}) = \Pr\left(n_{jx}^{t+1} \mid \ldots\right)$ need be conditioned on concurrent values of $n_{jy}$, or vice versa, thereby necessitating a joint distribution $\Gamma_{\pi_i}(n_{jx}, n_{jy}) = \Pr\left(n_{jx}^{t+1}, n_{jy}^{t+1} \mid \ldots\right)$.

4.3.5 Comprehensive Influence DBN

With the preceding terminology, I have systematically introduced an increasingly comprehensive characterization of transition influences. A given TD-POMDP influence might be history-dependent with respect to one feature and state-dependent with respect to another. There may also exist chains of influence-dependent influences.

Example 4.17. In Figure 3.7, agent R7 models two nonlocal features, one ($n_{7a}$) influenced by agent SAT1 and the other ($n_{7b}$) influenced by agent R6. The additional arc between agents SAT1 and R6 forms an undirected cycle that implies a possible dependence between $n_{7a}$ and $n_{7b}$ by way of $n_{6b}$. The only way to ensure a complete influence model is to incorporate all three influences into a joint distribution.
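Before turning to the general construction, a small sketch suggests what an influence-dependent encoding could look like in practice for the two correlated path features of Example 4.15; the probability values are invented purely for illustration.

# Hypothetical joint encoding for two influence-dependent features (Example 4.15):
# because n_4b and n_4c are correlated, their next values are drawn from a single
# joint conditional distribution rather than two independent ones.
joint_influence = {
    # (n_4b^t, n_4c^t) -> distribution over (n_4b^{t+1}, n_4c^{t+1})
    (False, False): {(True, False): 0.4, (False, True): 0.4, (False, False): 0.2},
    (True, False):  {(True, False): 1.0},   # a built path stays built; no time remains for the other
    (False, True):  {(False, True): 1.0},
    (True, True):   {(True, True): 1.0},
}

def joint_prob(next_vals, current_vals):
    """Pr(n_4b^{t+1}, n_4c^{t+1} | n_4b^t, n_4c^t) under the illustrative table above."""
    return joint_influence[current_vals].get(next_vals, 0.0)

print(joint_prob((True, False), (False, False)))   # -> 0.4

Note that the table assigns zero probability to both paths being built, capturing the mutual exclusion that makes the two features correlated in the first place.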
In general, for any team of TD-POMDP agents, their influences altogether constitute a Dynamic Bayesian Network (DBN) whose variables consist of the nonlocal features as well as their respective dependent state features and dependent history features. Figure 4.6 illustrates the influence DBNs for the four examples presented in this chapter along with their implied conditional probability tables (CPTs). Once all of the influences associated with an agent i's nonlocal features have been decided, i can extract the corresponding conditional probabilities and inject them into its local best-response model (replacing the term $\Pr\left(\bar{n}_i^{\,t+1} \mid \ldots\right)$ identified in Section 4.2.5).

The connections between the variables in the influence DBN are dictated by my characterization of state-dependence, history-dependence, and influence-dependence. That is, the scope of a variable $n_{ix}^{t+1}$ in the influence DBN (which refers to the subset of variables $\bar{f}$ for which an arrow is drawn from $f$ to $n_{ix}^{t+1}$) is such that the DBN encodes sufficient information for agent i to model the probabilities of $n_{ix}$'s transitions given that i's peers hold their policies still.

Note that the influence DBN is very different from the DBN that I presented in Section 4.2.3 (Figure 4.4) to describe the conditional independencies in the joint model. I will refer to the previously-described DBN as the TD-POMDP DBN because it represents all variables in the TD-POMDP model. In contrast, the influence DBN has a smaller width than the TD-POMDP DBN, modeling only the transitions of nonlocal features. However, in the case of history-dependent influences, the influence DBN has a greater depth of connectivity than does the TD-POMDP DBN, connecting variables indexed with time step t + 1 to those indexed with t, with t − 1, with t − 2, and so on.3

3 Essentially, the influence DBN is the result of variable elimination performed on the TD-POMDP DBN. In particular, a variable $x_i$ (which may refer to a feature in agent i's local state, agent i's action, or agent i's observation) that is eliminated is marginalized out due to the fact that it is unobservable to all other agents in the system except through the observations of mutually-modeled features. The inclusion of history features in the influence DBN is the direct result of such elimination.

Given history- and influence-dependence, the influence DBN could potentially grow to be more complex than the TD-POMDP DBN, requiring a larger number of probability parameters than there are elements in the TD-POMDP transition matrix. Indeed, as TD-POMDP agents' interactions become more complicated, more and more parameters involving more and more variables are needed to encode their effects. However, due to the TD-POMDP's decomposable transition structure, the DBN need contain only those critical variables that link the agents' POMDPs together.
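The injection step described above can be pictured as a simple lookup against the influence DBN's CPTs. The sketch below (hypothetical structure, not the dissertation's code) stores each nonlocal feature's declared scope and CPT, and supplies the missing transition term to a local best-response model; it assumes the event-driven behavior of Example 4.1, so a feature that is already true stays true.

```python
# A minimal sketch of an influence DBN: each nonlocal feature carries a
# declared scope (the parent variables its CPT is conditioned on) and a CPT
# keyed by the parents' values.
influence_dbn = {
    "site-C-prepared": {
        "scope": ("time", "site-C-prepared"),
        "cpt": {  # (time, current value) -> Pr(feature becomes true)
            (3, False): 0.2,
            (4, False): 0.5,
        },
    },
}

def nonlocal_transition_prob(dbn, feature, parent_values, default=0.0):
    """Look up Pr(feature^{t+1} = true | scope values); once true it stays true
    (an event-driven assumption carried over from the running example)."""
    entry = dbn[feature]
    if dict(zip(entry["scope"], parent_values)).get(feature) is True:
        return 1.0
    return entry["cpt"].get(parent_values, default)

# When agent i builds its local best-response model, a lookup like this one
# stands in for the term Pr(n-bar_i^{t+1} | ...) that the model is missing.
print(nonlocal_transition_prob(influence_dbn, "site-C-prepared", (4, False)))
```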
(Figure 4.6 presents, for each previously-presented example, the influence DBN and its implied CPT: Ex. 4.1, a state-dependent influence; Ex. 4.13, a history-dependent influence; Ex. 4.15, influence-dependent influences requiring a joint CPT; and Ex. 4.17, a combination of all three.)
Figure 4.6: The influence DBN for each previously-presented example.

Theorem 4.18. For any given TD-POMDP, the influence $\Gamma^{j}_{\pi_i}(n_{jx})$ of agent i's policy $\pi_i$ on agent j's nonlocal feature $n_{jx}$ need only be conditioned on histories of mutually-modeled features $\vec{m}_j$ (Def. 3.13).

Proof. This follows directly from the proof of belief state sufficiency presented in Section 4.2.3. In review, the belief state $b_j^t = \Pr\left(s_j^t, \vec{m}_j^{\,t-1} \mid \vec{a}_j^{\,t-1}, \vec{o}_j^{\,t}\right), \forall s_j^t, \vec{m}_j^{\,t-1}$, was proven to be sufficient for computing agent j's best response using the update rule:
$$b_j^{t+1}\left(s_j^{t+1}, \vec{m}_j^{\,t}\right) = \frac{O_j\left(o_j^{t+1} \mid s_j^{t+1}\right) \sum_{s_j^t - \bar{m}_j^t} P_j^L\left(\bar{l}_j^{\,t+1} \mid s_j^t, a_j^t\right) P_j^U\left(\bar{u}_j^{\,t+1} \mid \bar{u}_j^{\,t}\right) \Pr\left(\bar{n}_j^{\,t+1} \mid \vec{m}_j^{\,t}\right) b_j^t\left(s_j^t, \vec{m}_j^{\,t-1}\right)}{\text{a normalization factor}} .$$
Here, the first term is j's local observation function (Def. 3.5), the second and third terms are locally-dependent components of j's local transition function (Def. 3.14), and the last term is j's previous belief state. The remaining term, $\Pr\left(\bar{n}_j^{\,t+1} \mid \vec{m}_j^{\,t}\right)$, is the only one that depends upon peers' policies. Thus, $\Pr\left(\bar{n}_j^{\,t+1} \mid \vec{m}_j^{\,t}\right)$ serves as a sufficient summary of i's policy for computing j's best response.

Corollary 4.19. The influence DBN grows with the number of mutually-modeled state features irrespective of the number of local state features and irrespective of the number of agents.

By Theorem 4.18, agents' influences need encode only the histories of state features that are shared among agents. Moreover, the complexity with which an agent models its peers is controlled by the tightness of coupling with respect to state factor scope (Def. 3.39), and not by the complexity of the peer behavior, nor by the number of peer agents.

Despite this result, the influence DBN could still become too complex for TD-POMDP agents to use effectively. Example 4.20 describes an extreme case, wherein the conditional probability table grows unwieldy. The sizes of the conditional probability tables associated with the influences from Examples 4.15 and 4.17 are characterized in Figure 4.6.

Example 4.20. Consider an example problem for which 10 out of 11 state features in agent i's local state are nonlocal features in agent j's state, and hence mutually-modeled by agent j. In this case, in order to capture the possible effects of the 11th unobservable feature, agent j's sufficient encoding of i's influence would include the histories of all 10 mutually-modeled features. Given a time horizon of length T, and under the assumption that all features are boolean, there are $2^{10T}$ possible combinations of mutually-modeled histories and $2^{10}$ combinations of joint mutually-modeled feature values. In this case, the specification of the influence DBN would require on the order of $2^{10T+10}$ parameters.

Aside from the space complexity of storing the CPT, the ramifications of a large influence encoding are as follows. First, as per Observation 4.8, in the worst case, the computation required to compute a best response grows with the amount of information on which the influence is conditioned. Second, as I describe later in Chapter 5, using my mixed-integer linear programming (MILP) methodology, the number of MILPs that must be solved in order to enumerate feasible influence settings grows linearly with the number of parameters that encode the influence. Lastly, as my empirical results presented later in this chapter suggest, the number of feasible influence settings tends to grow with the size of the influence encoding (regardless of how each influence setting is found).
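The belief update in the proof of Theorem 4.18 can also be read operationally. The sketch below is a simplified, generic filter, not the dissertation's implementation: the factor functions passed in are hypothetical stand-ins for $O_j$, for the locally-dependent transition components $P^L_j$ and $P^U_j$ (collapsed into one callable), and for the influence term $\Pr(\bar{n}_j^{\,t+1} \mid \vec{m}_j^{\,t})$ read from the peers' fixed influence summary.

```python
def belief_update(states, belief, action, observation,
                  local_trans, obs_fn, influence_term):
    """One step of the (simplified) belief update from the proof of Theorem 4.18.

    belief[s]              -- current probability of local state s
    local_trans(s2, s, a)  -- product of the locally-dependent transition factors
    influence_term(s2, s)  -- probability of the nonlocal-feature transition
                              implied by s -> s2, read from the influence summary
    obs_fn(o, s2)          -- local observation probability Pr(o | s2)
    """
    new_belief = {}
    for s2 in states:
        # sum over predecessor local states, weighting by the factored transition
        p = sum(local_trans(s2, s, action) * influence_term(s2, s) * belief.get(s, 0.0)
                for s in states)
        new_belief[s2] = obs_fn(observation, s2) * p
    z = sum(new_belief.values())          # the normalization factor
    return {s: p / z for s, p in new_belief.items()} if z > 0 else new_belief

# Toy usage with two abstract local states and made-up factor functions.
states = ["s0", "s1"]
b0 = {"s0": 1.0, "s1": 0.0}
trans = lambda s2, s, a: 0.5                    # toy uniform local transition
infl = lambda s2, s: 1.0                        # toy deterministic nonlocal feature
obs = lambda o, s2: 0.9 if s2 == o else 0.1     # toy noisy observation of the state
print(belief_update(states, b0, "NOOP", "s1", trans, obs, infl))
```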
Given the various forms of growth in computational complexity associated with large influence encodings, it is important to identify conditions under which influence encodings remain compact. Along these lines, I describe one set of conditions under which we can avoid history dependence in the next section.

4.4 A Special Case: Influences on Event-Driven Features

The characterization of TD-POMDP transition influences that I developed in Section 4.3 was extremely general, encompassing all interactions that one TD-POMDP agent might have with another. I now focus on one particular class of interactions: those that can be represented with event-driven4 features. After defining this class of interactions, I derive conditions under which agents can encode their influences on event-driven features compactly. In particular, I prove that when the interaction digraphs contain no cycles (undirected or directed), influences on event-driven features are state-dependent and not history-dependent.

4 I adopt the term event-driven from the work of Becker et al. (2004a), who defined a Dec-POMDP subclass called the Dec-MDP with event-driven interactions (or the EDI-Dec-MDP). In my definitions, I present a slight generalization of Becker et al.'s semantics so as to accurately define event-driven interactions in my more general TD-POMDP problem class.

Definition 4.21. An event-driven feature f is a Boolean state feature that encodes the occurrence of an event, such that $\Pr(f^{t+1} = false \mid f^{t} = true) = 0$.

The condition in Definition 4.21 means that the transition of an event-driven feature f is restricted such that f can change from false to true (from one state to the next) but never from true to false. Intuitively, if the corresponding event has not occurred, f = false. Once the event occurs, f changes to true and can never thereafter return to false. Features of this type have appeared in several of the example problems that I have presented thus far (e.g., Examples 3.1, 3.31, and 4.1).

Definition 4.22. An event-driven interaction in a TD-POMDP refers to one or more event-driven nonlocal features through which one agent affects another.

Example 4.23. In the problem from Example 4.1 and Figure 4.1, rover 5 interacts with rover 6 by completing a task "Prepare Site C" and thereby altering future outcomes of rover 6's own tasks. This interaction is event-driven because it can be represented with a nonlocal feature site-C-prepared ∈ {true, false} that encodes rover 5's completion of site C preparations. Inherently, once task "Prepare Site C" is completed, the task (and underlying feature value change) cannot be undone.

Theorem 4.24. For a TD-POMDP problem whose nonlocal features are all event-driven and whose interaction digraph (Def. 3.27) contains no directed or undirected cycles, each influence $\Gamma(n_{jx})$ on a nonlocal feature $n_{jx}$ has the following properties:
1. For any nonlocal feature $n_y \not\equiv n_{jx}$, $\Gamma(n_{jx})$ need not be conditioned on $n_y$.
2. $\Gamma(n_{jx})$ is state-dependent (but not history-dependent) with respect to $n_{jx}$.

Proof. I address properties (1) and (2) separately.
1. To prove that property 1 holds, let us consider three disjunctive cases:
case a: $n_y \notin \bar{m}_j$. By Theorem 4.18, $\Gamma(n_{jx})$ need only be conditioned on features in agent j's mutually-modeled feature set $\bar{m}_j$ (Def. 3.13). Thus, $\Gamma(n_{jx})$ need not be conditioned on $n_y$.
case b: $n_y \in \bar{m}_j \wedge n_y \in \bar{n}_j$. By Definition 3.12, $n_y \in \bar{n}_j$ refers to the fact that $n_y$ is controlled by another agent, which we will call agent i, and affects agent j. Feature $n_{jx}$ is also controlled by another agent, which we will call agent k, and affects agent j. We can deduce that $i \neq k$ from the acyclicity of the interaction digraph. If i and k were the same agent, this would mean two edges leading from node i to node j, constituting an undirected cycle. Further, we can deduce that $i \notin \Lambda_k$ (agent i is not a digraph ancestor of agent k, using Definition 3.28), because this would indicate an undirected cycle containing nodes i, j, and k (i.e., $i \in \Lambda_k$, $i \in \Lambda_j$, and $k \in \Lambda_j$). Hence, by Theorem 3.32, agent i cannot affect the value of $n_{jx}$, through its control of $n_y$ or otherwise. Thus, $n_{jx}^{t+1}$ is independent of $\vec{n}_y^{\,t}$, and $\Pr(n_{jx}^{t+1} \mid \vec{n}_y^{\,t}, \ldots) = \Pr(n_{jx}^{t+1} \mid \ldots)$. Therefore, $\Gamma(n_{jx})$ need not be conditioned on $n_y$.
case c: $n_y \in \bar{m}_j \wedge n_y \notin \bar{n}_j$. By Definition 3.12, $n_y \notin \bar{n}_j$ refers to the fact that nonlocal feature $n_y$ is controlled by agent j. From the interaction digraph acyclicity, we can deduce that agent j is not an ancestor of agent k (who controls $n_{jx}$). Hence, using the same line of reasoning as in case b, by Theorem 3.32, $\Pr(n_{jx}^{t+1} \mid \vec{n}_y^{\,t}, \ldots) = \Pr(n_{jx}^{t+1} \mid \ldots)$, and therefore, $\Gamma(n_{jx})$ need not be conditioned on $n_y$.
2. To prove that property 2 holds, let us consider two cases:
case a: $n_{jx}^{t} = true$. By Definition 4.21, $\Pr(n_{jx}^{t+1} = true \mid n_{jx}^{t} = true) = 1$, yielding a deterministic transition that is independent of previous values $\vec{n}_{jx}^{\,t-1}$. Thus, $\Pr(n_{jx}^{t+1} \mid n_{jx}^{t} = true, \vec{n}_{jx}^{\,t-1}) = \Pr(n_{jx}^{t+1} \mid n_{jx}^{t} = true)$.
case b: $n_{jx}^{t} = false$. By Definition 4.21, $\vec{n}_{jx}^{\,t-1} = \langle false, false, false, \ldots, false \rangle$, indicating that when $n_{jx}^{t}$ takes on value false, its history $\vec{n}_{jx}^{\,t-1}$ is fully determined. Thus, $\Pr(n_{jx}^{t+1} \mid n_{jx}^{t} = false, \vec{n}_{jx}^{\,t-1}) = \Pr(n_{jx}^{t+1} \mid n_{jx}^{t} = false)$.
Combining case a and case b, $\Pr(n_{jx}^{t+1} \mid n_{jx}^{t}, \vec{n}_{jx}^{\,t-1}) = \Pr(n_{jx}^{t+1} \mid n_{jx}^{t})$, and hence $\Gamma(n_{jx})$ need not be conditioned on $\vec{n}_{jx}^{\,t-1}$. Therefore, by Definition 4.14, $\Gamma(n_{jx})$ is state-dependent but not history-dependent with respect to $n_{jx}$.

The significance of Theorem 4.24 is that, for a commonly-studied class of problems with event-driven interactions (Becker et al., 2004a; Marecki & Tambe, 2009; Mostafa & Lesser, 2009), when the interaction digraph topology contains no cycles, agents' influence encodings need not be conditioned on event histories. Avoiding history means that the size of agents' influence encodings will, at worst, grow linearly with the time horizon. Furthermore, by property 1 in Theorem 4.24, for problems with event-driven interactions and acyclic interaction digraphs, influences need not encode joint distributions over nonlocal feature transitions. In this case, the size of the influence DBN grows linearly with the number of event-based interactions (as long as no cycles are created). Later on in my empirical results, I show these traits to yield significant reduction (over problems with cyclic digraphs and history-dependent event-driven interactions) in the overall computation required by influence-based policy abstraction.

Example 4.23 (continued). Returning to the problem shown in Figure 4.1, since there is only a single interaction, the interaction digraph is degenerately acyclic. By Theorem 4.24, rover 5's influence on nonlocal feature site-C-prepared (modeled by rover 6), which we will abbreviate $\Gamma(c)$, may be encoded with probability distribution $\Gamma(c) = \Pr(c^{t+1} \mid c^{t}, t)$, making $\Gamma(c)$ state-dependent with respect to nonlocal feature c ≡ site-C-prepared and unaffectable feature t ≡ time.
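Whether Theorem 4.24 applies to a given problem can be checked mechanically: the interaction digraph must contain no directed or undirected cycles, which amounts to checking that its underlying undirected multigraph (parallel edges included) is a forest. The following sketch (a standard union-find check, assuming edges are given as pairs of agent indices) is one way to perform that test; it is not drawn from the dissertation's code.

```python
def interaction_digraph_is_acyclic(num_agents, edges):
    """True iff the interaction digraph has no directed or undirected cycles,
    i.e. its underlying undirected multigraph is a forest.
    edges: iterable of (influencing_agent, affected_agent) index pairs."""
    parent = list(range(num_agents))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:                  # already connected: this edge closes a cycle
            return False
        parent[ri] = rj
    return True

# Mapping rover 5 and rover 6 to indices 0 and 1: a single edge is
# degenerately acyclic, so the event-driven influence stays state-dependent.
print(interaction_digraph_is_acyclic(2, [(0, 1)]))          # True
# Two parallel edges between the same pair of agents (as in the variation
# discussed next) form an undirected cycle, so Theorem 4.24 no longer applies.
print(interaction_digraph_is_acyclic(2, [(0, 1), (0, 1)]))  # False
```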
However, the presence of just one undirected cycle violates the conditions of Theorem 4.24, and necessitates dependence on history.

Example 4.25. Consider a slight variation of Example 4.23 in which there is an additional interaction involving rover 5 preparing site D for rover 6. In this case, there are two event-based nonlocal features (site-C-prepared and site-D-prepared), and two corresponding edges in the interaction digraph, both of which lead out of vertex 5 and into vertex 6. Consequently, the digraph contains an undirected cycle, thereby violating the conditions of Theorem 4.24. With this variation, the influence $\Gamma(c)$ becomes history-dependent and influence-dependent. The influence DBN must model the joint distribution $\Pr(c^{t+1}, d^{t+1} \mid \vec{c}^{\,t}, \vec{d}^{\,t})$. Intuitively, since they are controlled by the same agent, the transitions of site-C-prepared and site-D-prepared are no longer independent. For instance, rover 5 cannot finish preparing both sites at the same time. Similarly, the probability with which rover 5 finishes preparing site D depends upon how long ago rover 5 finished preparing site C.

Fortunately, for cyclic cases, agents can encode the histories of event-driven features compactly. By Definition 4.21, the transitions of an event-driven feature f are structured such that f can only change from false to true but never from true to false. Thus, the complete history $\vec{f}^{\,t}$ may be captured by a single variable $f_{hist}$ with domain $\{0, 1, \ldots, t, false\}$, whose value is set to the time index at which f changed from false to true, or set to false if f has not yet changed from false to true.

4.5 Influence Space

Theorems 4.18 and 4.24 describe the size of the influence encoding, but they say nothing about the number of possible influence assignments. Each combination of local policies corresponds to a summarizing influence DBN whose probability values have been assigned accordingly. The influence space is the domain of feasible assignments to the probability values encoded by the influence DBN, where each feasible assignment is the result of at least one combination of local policies. I refer to a feasible assignment to the influence DBN as an influence point in the influence space.

An important hypothesis of this dissertation is that, in formulating optimal joint policies, agents can gain potentially significant computational advantages by searching through the influence space instead of searching through the joint policy space directly. The intuition is that, although every influence point maps to at least one joint policy, there may be many joint policies that all map to the same influence point.

Example 4.26. Returning to the example problem shown in Figure 4.1, rover 5 has several sites it can visit, each with uncertain durations. In general, different policies that it adopts may achieve different interaction probabilities. However, due to the constraints in Figure 4.1, many of rover 5's policies will map to the same influence point. For instance, any two policies that differ only in the decisions made after time 3 will yield the same assignment to $\Gamma^{6}_{\pi_5}(\textit{site-C-prepared}) = \Pr(\textit{site-C-prepared} = true \mid time = 4)$. For this example, the influence space is strictly smaller than the policy space. By considering only the feasible influence values, agents avoid redundant joint reasoning about local policies with identical influences.
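The mapping from policies to influence points can be made concrete with a small sketch. The helper names below are hypothetical: influence_of would compute the influence DBN setting induced by a policy (for instance, via the exhaustive procedure referenced later in Section 5.6), and the toy values are illustrative only.

```python
from collections import defaultdict

def partition_by_influence(policies, influence_of):
    """Group local policies into impact-equivalence classes: policies that
    induce the same influence point are interchangeable from a peer's
    perspective. influence_of(pi) must return a hashable influence summary."""
    classes = defaultdict(list)
    for pi in policies:
        classes[influence_of(pi)].append(pi)
    return classes

# Toy illustration in the spirit of Example 4.26: label rover 5's policies by
# the time it starts "Prepare Site C" (None = never). In this toy, only starts
# at time <= 3 can still yield completion by the deadline, so all later starts
# collapse onto the same influence point, Pr = 0.0.
def toy_influence(start_time):
    return 0.0 if (start_time is None or start_time > 3) else 0.7  # illustrative

policies = [None, 0, 1, 2, 3, 4, 5, 6, 7]
classes = partition_by_influence(policies, toy_influence)
print(len(classes), "influence points for", len(policies), "policies;",
      "ratio =", len(classes) / len(policies))
```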
Figure 4.7 illustrates the potential reduction from policy space to influence space.

Figure 4.7: An agent's local policy space and resultant influence space.

In relation to the weak coupling theory developed in Section 3.5.2, influence-based abstraction can be viewed as a framework for partitioning a TD-POMDP agent's policy space into impact-equivalence classes (Def. 3.45). For any two policies $\{\pi_{ix}, \pi_{iy}\}$ whose influences are identical ($\Gamma^{j}_{\pi_{ix}} = \Gamma^{j}_{\pi_{iy}}$), they necessarily provoke the same best response from agent j. By Definition 3.44, $\pi_{ix}$ and $\pi_{iy}$ are impact equivalent and thus may be grouped into the same impact-equivalence class.5

5 The converse is not true, however. Two local policies of agent i that result in the same best response from agent j may not map to the same influence point. As such, influence abstraction will not necessarily yield the most coarse-grained partitioning of the policy space.

With Definition 3.49 in Section 3.5.2, I related an agent's local policy space size to the number of partitions achieved by an impact-equivalent partitioning scheme, calling the maximal ratio of these two quantities the degree of influence. For influence-based policy abstraction, the number of partitions is equal to the size of the influence space. Hence, the degree of influence afforded by influence-based policy abstraction is the influence space size divided by the policy space size.

4.6 Empirical Analysis of Influence Space Size

In the preceding section, I have provided intuition and anecdotal evidence to support my claim that the influence-based policy abstraction framework can be employed to reduce the size of the search space wherein TD-POMDP agents seek to optimize their joint behavior. I have also contended that, beyond toy examples, there is a large space of TD-POMDP problems for which agents have far fewer unique influences than they do policies. I now defend this claim with a rigorous empirical analysis. Further, I investigate the circumstances under which influence-based abstraction yields the greatest reduction. The theoretical treatment of weak coupling presented in Section 3.5.2 has proven that the worst-case complexity of computing optimal TD-POMDP policies is dependent upon the degree of influence. However, the theoretical results do not say anything about what problems might have a low degree of influence, nor do they provide a method for determining a problem's degree of influence before solving it. By itself, the theory cannot be applied practically. The empirical results that I present here pick up where the theory has left off, striving to expose identifiable attributes that determine a problem's degree of influence (in the context of influence-based policy abstraction), and thereby illuminating a part of the TD-POMDP space that is truly weakly coupled (along the degree-of-influence dimension).

My high-level strategy for performing this empirical analysis is as follows. Using the testbed detailed below, I generate a large sampling of random problems. The space of random problems is parameterized by a variety of attributes (detailed in Section 4.6.1), each of which I connect to identifiable aspects of the TD-POMDP problem description. According to this parameterization, I sample the problem space evenly across all parameter settings.
For a given problem, I focus on an individual agent's influences in isolation from the rest of the team, directly measuring the size of the agent's influence space and the size of its local policy space. By varying each parameter and observing its effect on the influence space size and the policy space size, I am able to characterize the relationships between a problem's identifiable attributes and its (less discernible) degree of influence. The results in Section 4.6.2, which are the culmination of an iterative process of parameter definition and parameter testing, highlight those attributes that appear to be the strongest empirical predictors of a problem's influence-space size and its degree of influence. Later on in Chapter 6, I empirically evaluate the overall computation performed by an optimal TD-POMDP algorithm that employs influence-based abstraction.

4.6.1 Experimental Setup

I perform my analysis on a testbed of task-based problems specified using a simplified version of the TÆMS modeling language (Decker, 1996). Problems of this flavor frequently arise in the Dec-POMDP literature (Becker et al., 2004a; Marecki & Tambe, 2009; Mostafa & Lesser, 2009; Musliner et al., 2006). Moreover, the multiagent planning community in general has demonstrated an interest in problems formulated using TÆMS, due to its domain-independent, naturally-distributed specification of interdependent agent activities with uncertain outcomes, its quantitative representation of team goals, and its emphasis on structured agent interactions (Atlas, 2009; Horling et al., 2006; Lesser et al., 2004; Smith et al., 2007; Wagner et al., 2003; Wu & Durfee, 2007; Xuan & Lesser, 1999). Though others have performed experiments on various hand-coded TÆMS problems, no universally-accepted problem suite has emerged. As such, I have created my own TÆMS-based testbed so as to systematically generate problems with a desired set of controlled parameters. I describe these parameters along with the details of my problem generator in Section 4.6.1.3. But before then, I provide a description of my problem specification in Section 4.6.1.1 and the corresponding models of influence in Section 4.6.1.2. In Section 4.6.1.4, I describe my method of evaluating each problem's degree of influence.

4.6.1.1 Task-based Problem Specification

The problem representation I am about to describe is the same as that introduced in Example 3.1 and used in several other examples in Chapters 3 and 4. For the sake of reproducibility of my experiments, I now provide a more detailed description of the task-based problem specification.

Each problem contains n agents, where each agent i has a set of tasks, $T_i = \{task_{i1}, \ldots, task_{i\|T_i\|}\}$, that it may execute with outcomes $D_i = \{d_{i1}, \ldots, d_{i\|T_i\|}\}$ and window constraints $W_i = \{w_{i1}, \ldots, w_{i\|T_i\|}\}$.
Once an agent starts a task, it cannot interrupt the task, so its only available action is to continue the task until the task ends. All task executions must occur between time steps 0 and a finite horizon T . However, an agent cannot start a task before the task’s earliest start time or after its latest finish time. If a task does not achieve one of its prescribed outcomes by its latest finish time, the task instead achieves a failure outcome (with quality 0). There are also task interrelationships called effects, each of the form form eix,jy = taskix , taskjy , d′iy , indicating that the completion of taskix with positive outcome quality alters the subsequent outcome distribution of taskjy (as long as agent j performs taskjy after taskix finishes).6 Examples of effects appear in Figure 3.8 as arrows connecting tasks. Here, Task A effects a change in the outcome of Task D, enabling a nonzero quality outcome to be attained with positive probability. The same sort of effect links Task C and Task F, as well as Task B and Task C. Agent i may have local effects Li = {..., eix,iy , ...} that link its own tasks. Additionally, agent i may have incoming nonlocal effects Nin i = {..., ejx,iy , ...} and outgoing nonlocal effects Nout = {..., eix,jy , ...}, indicating interactions with other agents. Colleci [  out tively, the set of all of the team’s nonlocal effects is denoted N = . In Nin i ∪ Ni ∀i the presence of nonlocal effects, each agent’s tasks Ti are categorized into two disjoint local sets Ti = Tlocal ∪ Tnle and nonlocally-affecting tasks Tnle i i : local tasks Ti i , such that out nle taskix ∈ Ti if and only if taskix is referenced in Ni (which indicates that it affects another agent’s task). The TD-POMDP local state Si includes a task status features for each taskix ∈ Ti with domain {not-started, started-at-time-t, completed-with-positive-quality, failed }. The mutually-modeled feature set m̄i consists of time (the current time index) as out well as the statuses of agent i’s nonlocal effects Nin i ∪ Ni , each with domain {true, false} indicating whether or not the respective effecting task has completed.7 As such, 6 The definition of effect here deviates slightly from that of the TÆMS language (Decker, 1996), but suffices to model several common TÆMS effects such as task enablement and disablement and special cases of facilitation and hindering. 7 Additional information about an effect need not be encoded in the TD-POMDP state because the effect only depends upon whether or not the affecting task has completed with positive quality. 151 each nonlocal feature nix,jy is a boolean variable indicating the status of an incoming nonlocal effect eix,jy ∈ Nin i . The TD-POMDP is locally fully observable, such that agents observe the features in their local states directly (including the statuses of their incoming nonlocal effects). An agent’s local rewards are zero for all state-action pairs prior to the end of horizon, and otherwise equal to the sum of its completed task qualities. The objective is to plan coordinated policies for task execution that maximize the summation of qualities attained from all agents’ completed tasks. A detailed example of the transition and reward structure for these problems appears in Figure 2.3 (in Chapter 2), which shows a single-agent MDP that has been constructed as described here. 
Although my influence-based abstraction methodology (and the solution methods that I present in subsequent chapters) are fully general to the TD-POMDP problem class, my task-based problem specification just described has several limitations that restrict consideration in my empirical work to a subset of TD-POMDP problems. For instance, each task has just one contiguous window and agents receive full observations of their individual task statuses. These are both restrictions that are inherent to the TÆMS modeling language (Decker, 1996). Additionally, agents' interactions consist solely of event-driven nonlocal effects relating to task completion events, and the value of a joint policy is assumed to be the summation of completed task quality values. All of these restrictions together are common to other empirical studies of related Dec-POMDP subclasses (Becker et al., 2004a; Beynier & Mouaddib, 2005; Marecki & Tambe, 2009; Mostafa & Lesser, 2009).

4.6.1.2 Anatomy of an Influencing Agent

In this particular set of experiments, I study individual agents' outgoing influences, directly comparing local policy space size with outgoing influence space size. As such, each problem consists of a single agent i that is not affected by others8, but who models zero or more locally-controlled nonlocally-affecting features, each of which can be thought of as affecting some other phantom agent, and hence each of which agent i considers to be mutually-modeled (Def. 3.13). Figure 4.8 shows a snapshot of agent i as the root vertex in an interaction digraph, where each outgoing edge represents a nonlocally-affecting feature. Each such feature $n_x$ is a boolean feature that, when true, denotes the successful completion of agent i's nonlocally-affecting task $task_{ix} \in T^{nle}_i$.

8 For simplicity, I limit consideration in this analysis to an agent that may influence its peers but is not influenced by its peers. Given an acyclic interaction digraph, this limitation does not restrict the generality of my results: an uninfluenced agent's local decision model is equivalent to that of an agent whose incoming influence has been fixed. In the case of a cycle, where agent i's outgoing influence settings affect others that in turn influence i, I have not yet determined whether or not the size of the overall influence space will be affected.

Figure 4.8: A digraph vertex representing an influencing agent.

Aside from agent i's nonlocally-affecting task features, the only other feature that is mutually-modeled is time ∈ {0, 1, ..., T}. Agent i models the corresponding influence on each nonlocal feature as state-dependent with respect to feature time.9 In this set of experiments, we consider two different variations: (a) one modeling each influence $\Gamma_{\pi_i}(n_x)$ as state-dependent with respect to $n_x$, and (b) another modeling all of the influences with a single joint distribution $\Gamma_{\pi_i}(n_x, n_y, \ldots)$ that is history-dependent with respect to all nonlocal features $\{n_x, n_y, \ldots\}$. Figure 4.9 illustrates each variation.

9 The influence $\Gamma_{\pi_i}(n_x)$ is state-dependent (and not history-dependent) with respect to time because the feature time changes deterministically and predictably.

(A) state-dependent influences: $\Gamma_{\pi_i}(n_x) = \Pr(n_x^{t+1} \mid n_x^{t}, t)$ and $\Gamma_{\pi_i}(n_y) = \Pr(n_y^{t+1} \mid n_y^{t}, t)$; (B) history- and influence-dependent influences: $\Gamma_{\pi_i}(n_x, n_y, \ldots) = \Pr(n_x^{t+1}, n_y^{t+1}, \ldots \mid \vec{n}_x^{\,t}, \vec{n}_y^{\,t}, \ldots)$.
Figure 4.9: Two variations of agent i's influences.
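To get a feel for how differently the two variations scale, here is a rough counting sketch. It is a simplification under stated assumptions (boolean nonlocal features, a horizon of T decision steps, and the coarse bound in the spirit of Example 4.20), not an exact accounting of the testbed's encodings.

```python
def state_dependent_size(num_features, horizon):
    # roughly one Pr(false -> true) entry per nonlocal feature per decision
    # step: linear growth in the number of nonlocal features
    return num_features * horizon

def history_dependent_size(num_features, horizon):
    # rough upper bound in the spirit of Example 4.20: a joint distribution
    # over 2^k next values for every combination of boolean feature histories
    return (2 ** num_features) * (2 ** (num_features * horizon))

for k in (1, 2, 3):
    print(k, state_dependent_size(k, 5), history_dependent_size(k, 5))
```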
The analysis of these two variations was motivated by my theoretical treatment of event-driven interactions in Section 4.4. Case (a), in which the influences are represented as separate state-dependent distributions, corresponds to a problem whose interaction digraph contains no cycles. Here, the phantom agents that i is assumed to be influencing are necessarily unique, meaning each nonlocal feature $n_x$ affects a different agent. Case (b) corresponds to problems wherein all of the nonlocal features $\{n_x, n_y, \ldots\}$ correspond to edges situated in an undirected cycle in the interaction digraph. For instance, all of the nonlocal features might affect the same phantom agent.

4.6.1.3 Problem Attributes and Testbed Parameters

I have implemented a random problem generator to systematically create random problems of the form described in Section 4.6.1.1. For each problem, I generate a local TD-POMDP model $M_i$ (Def. 3.16) for an influencing agent i, who is described in Section 4.6.1.2. My testbed is comprised of sets of random problems, wherein each set is seeded with a particular setting of parameters. The parameters, detailed below, serve as control knobs to adjust various high-level attributes of the TD-POMDP problems that I generate. I now describe each high-level attribute and the associated testbed parameter(s).

Number of Decision Steps. A parameter T controls the global time horizon. All task executions must occur in the interval [0, T]. As such, the TD-POMDP is unrolled (referring to the process of creating states, actions, and transitions indexed with time ∈ {0, ..., T}) such that there are T decision steps {0, 1, ..., T − 1}. Likewise, observation histories are of maximal length T.

Branching Due to Decisions. From each state $s^t$ (with time = t), there is at least one branch10 for each available action by which agent i transitions into a state $s^{t+1}$ (with time = t + 1). For task-based TD-POMDP problems, the branching due to agents' decisions is controlled by a parameter tasks per agent = $\|T_i\|, \forall i$. Because there are only branches for available actions, the branching factor is also controlled by a parameter local window size ∈ (0, 1), by which each task's earliest start time and latest finish time are set. The length of every task's window is ⌊local window size · (T − 1)⌋ + 1. Using this quantity, each task's earliest start time is selected such that the task's window is placed uniformly randomly within the interval [0, T].

10 Note that for task-based problems, the TD-POMDP state space is a directed acyclic graph and not a tree.

Branching Due to Uncertainty. The branching factor is also dependent on the uncertainty inherent in the tasks that agents perform. A parameter uncertainty ∈ [0, 1] controls the number of outcomes of each task. All outcomes of a given task have equal quality11 (selected uniformly randomly ∈ {1, ..., 10}) but different durations. Each task's duration distribution is set to a randomly-assigned probability mass function, wherein ⌊uncertainty · (window length − 1)⌋ + 1 different durations (where window length = ⌊local window size · (T − 1)⌋ + 1, as above) are selected at random from the interval (1, window length), each of which is assigned a random probability such that the probabilities together sum to 1.

11 Since TD-POMDP agents' transition influences model feature transition probabilities and not rewards, task outcome qualities will not affect the size of the influence space nor the degree of influence.

I have generated 50 random problems12 for each setting of the above parameters, whose domains are given in Table 4.1.

12 I selected the number 50 based on the following observations. The random variations in example problems generated by my testbed differ most significantly in the earliest start time of each task, which can take on $T - \lfloor \textit{local window size} \cdot (T - 1)\rfloor$ different values (each uniformly at random), and in the positioning of random task durations, which can take on $\binom{\lfloor \textit{local window size} \cdot (T-1)\rfloor + 1}{\lfloor \textit{uncertainty} \cdot \lfloor \textit{local window size} \cdot (T-1)\rfloor\rfloor + 1}$ different values. The number of combinations of these two types of random variations, across all tasks, is thus $\left(\left(T - \lfloor \textit{local window size} \cdot (T - 1)\rfloor\right)\binom{\lfloor \textit{local window size} \cdot (T-1)\rfloor + 1}{\lfloor \textit{uncertainty} \cdot \lfloor \textit{local window size} \cdot (T-1)\rfloor\rfloor + 1}\right)^{\textit{tasks per agent}}$, a term which is maximized by the baseline parameter setting ⟨T = 5, localWindowSize = 1.0, uncertainty = 0.5, tasksPerAgent = 3⟩. For this particular parameter setting I generated an extra 100 random problems for each setting of the interaction parameters. Using this additional test set, I performed each comparison presented in Section 4.6.2 and compared the results with the same comparison using a set of 50 problems. In all cases, the trends observed with the 100-problem test set were qualitatively identical to those observed with the 50-problem test set. From this, I concluded that 50 random problems were adequate to sample the space for the other parameter settings, whose problem variance was theoretically smaller.
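The duration-distribution construction described above can be sketched as follows. This is a simplified reading of the generator, not its actual code; the treatment of the interval's endpoints is an assumption, and window_length is taken to be ⌊local window size · (T − 1)⌋ + 1 as defined above.

```python
import random

def random_duration_distribution(uncertainty, window_length, rng=random):
    """Pick floor(uncertainty * (window_length - 1)) + 1 distinct durations
    within the task window and give them random probabilities summing to 1."""
    num_outcomes = int(uncertainty * (window_length - 1)) + 1
    durations = rng.sample(range(1, window_length + 1), num_outcomes)
    weights = [rng.random() + 1e-9 for _ in durations]   # avoid an all-zero draw
    total = sum(weights)
    return {d: w / total for d, w in zip(durations, weights)}

# e.g. a task window spanning 5 time steps with maximal uncertainty
print(random_duration_distribution(1.0, 5))
```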
As specified thus far, these problems contain no agent interaction, and will be henceforth referred to as baseline problems, and the above parameters as baseline parameters. Next, I consider several additional parameters used to exact control over agent i's interactions. For this second set of parameters, which I refer to as interaction parameters, rather than generating problems at random for each parameter setting, I use the 50 problems per baseline parameter setting, each of which contains no agent interactions, as baseline problems. For each baseline problem, I generated modified versions by systematically varying each interaction parameter. In the end, I have a set of 50 test problems per setting of baseline parameters per setting of interaction parameters. The domains of each parameter are summarized in Table 4.1. I describe the interaction parameters below, indexed by the high-level TD-POMDP attributes they are intended to embody.

Number of Nonlocal Features. The number of nonlocal features is controlled by a parameter NLATs = $\|N^{out}_i\|$, which is the number of agent i's nonlocally-affecting tasks, constrained to be less than or equal to tasks per agent.

State-Dependent Influences vs. History- and Influence-Dependent Influences. As described in Section 4.6.1.2, I analyze two different variations of influence models. In the first variation, i's influence on each nonlocal feature is modeled using a separate state-dependent distribution. In the second variation, i models a single joint distribution that is history-dependent with respect to all nonlocal features. Parameter influence type ∈ {state, history} controls which variation is used.

Window of Nonlocal Feature Manipulation. For problems with a single nonlocal feature, I introduce two parameters that constrain when the nonlocal feature is allowed to be manipulated.
The size of the window of agent i's nonlocally-affecting task is set to NLATWindow, and the beginning of the window is set to NLAT est. In essence, NLATWindow and NLAT est control the timing of agent i's interactions. Empirical results that I present in the next section demonstrate that both of these parameters have a significant impact on the size of the influence space.

Problem Attributes                                      Testbed Parameter     Domain
— Baseline Parameters —
Number of Decision Steps                                T                     {1, 2, 3, 4, 5}
Branching Due To Decisions                              tasks per agent       {1, 2, 3}
                                                        local window size     {0.0, 0.5, 1.0}
Branching Due To Uncertainty                            uncertainty           {0.0, 0.5, 1.0}
— Interaction Parameters —
Number of Nonlocal Features                             NLATs                 {1, ..., tpa}
State-Dependent Influences vs. History-Dependent        influence type        {state, history}
  Influence-Dependent Influences
Window of Nonlocal Feature Manipulation13               NLATWindow            {0, 1, ..., T}
                                                        NLAT est              {0, ..., T − 1}
Table 4.1: Testbed parameterization.

13 The domains of NLATWindow and NLAT est are systematically explored only in the case of a single nonlocally-affecting task (NLATs = 1).

The parameters and their respective domain values shown in Table 4.1 have been chosen so as to generate spaces of problems that are sufficiently rich to include a variety of different scenarios, and to demonstrate general trends and relationships between a problem's high-level characteristics and its degree of influence, yet define a space that is small enough to be explored systematically and thoroughly. Strictly speaking, my results are only directly applicable to problems in my testbed. The degree to which other TD-POMDP problems exhibit the same quantitative values of influence space sizes and degree of influence will depend upon their similarity to those from my testbed. However, I have no reason to believe that the qualitative trends observed here will not generalize beyond the space considered.

4.6.1.4 Evaluation Scheme

To evaluate influence space size, I employ an algorithm presented in the next chapter (Section 5.6) that explores the influence space exhaustively, counting each unique setting of probabilities in the influence DBN that is achieved by a deterministic policy of agent i. For each problem, in addition to recording the number of influences in agent i's influence space, I also record the number of deterministic policies in agent i's policy space, computed as the product of the number of available actions in every state of agent i's best-response model (detailed in Section 4.2). For any given problem, I calculate the degree of influence by dividing the influence space size by the policy space size.

4.6.2 Results

Using the testbed described in Section 4.6.1, I have performed a series of experiments that illuminate the relationships between the degree of influence and the problem attributes described in Section 4.6.1.3. I now present the results, organized by problem attribute (each of which was introduced in Section 4.6.1.3). I provide a summary of all findings at the end of this section. Note that each of the comparisons described below has been performed across all combinations of parameter settings. For each result that I present, instead of overwhelming the reader with page after page of plots for each and every setting, I select just a few cases (which I label A, B, C, ..., followed by the corresponding parameter setting) whose qualitative trends are representative of the entire space of parameter settings tested.14

14 Rest assured that the plots for all other settings have been examined, and are omitted here simply because they do not provide any additional information about the high-level, qualitative trends that I describe.
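The bookkeeping behind the evaluation scheme is straightforward; the sketch below (hypothetical model interface, toy numbers) tallies the two space sizes and their ratio. The available actions per state and the influence points found by the exhaustive search are assumed to be supplied by the surrounding machinery.

```python
from math import prod

def policy_space_size(available_actions_per_state):
    """Number of deterministic policies: the product, over all decision states,
    of the number of available actions (as described in Section 4.6.1.4)."""
    return prod(len(actions) for actions in available_actions_per_state.values())

def degree_of_influence(influence_points, available_actions_per_state):
    """Ratio of unique feasible influence points to deterministic local policies."""
    return len(set(influence_points)) / policy_space_size(available_actions_per_state)

# toy numbers in the spirit of the experiments
actions = {"s0": ["Prepare Site C", "NOOP"], "s1": ["NOOP"], "s2": ["Task B", "NOOP"]}
points = [(0.0,), (0.7,), (0.7,)]   # influence DBN settings found by the search
print(policy_space_size(actions), degree_of_influence(points, actions))
```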
4.6.2.1 Number of Decision Steps

The number of agent i's decision steps, specified by the time horizon T, has a significant effect on the size of the problem in general. Since time is necessarily a feature of the TD-POMDP state, the state space grows (in general exponentially) with each additional decision step. In turn, each additional state with more than one available action constitutes an additional decision for agent i to make, causing an increase in the policy space (which is in general doubly exponential in the time horizon).

I posit that the influence space size should also increase with the time horizon. In general, a longer horizon entails more times during which agent i's interactions can take place. In particular, in the problems that I generate, the windows of agents' tasks are specified relative to the time horizon T, such that the number of different times that agent i is allowed to start a nonlocally-affecting task increases proportionally with T. The greater the number of possible start times, the greater the number of possible finish times, and hence the more unique influences.

Figure 4.10 supports my hypothesis. Here, local state space size, local policy space size, and influence space size are plotted as a function of T, each for three different settings of the baseline parameters15 (from Table 4.1). Out of all possible combinations of parameters, the three settings, labeled A, B, and C, were chosen as representative snapshots of three different gradations of problem difficulty (as measured by state space size and policy space size). Moreover, the trends portrayed by these plots are representative of the trends observed across all of the other possible settings. In each plot, T is varied from 1 to 5 along the x-axis, and the y-axes are given a logarithmic scale. As expected, a near-exponential increase in state space size is observed along with an exponential (or asymptotically greater) increase in policy space size for all parameter settings. Similarly, the influence space size increases exponentially with T. Although the steepness of the exponential increase depends upon the particular parameter setting, the steepness of increase of policy space size generally16 exceeds that of the influence space size. The result of this difference in growth is that the degree of influence decreases as the number of decision steps grows. The degree of influence for settings A, B, and C is plotted, again on a logarithmic scale, in Figure 4.11.

15 For each problem, one or two (prescribed by "NLATs=1" or "NLATs=2") of agent i's tasks have been selected at random and turned into nonlocally-affecting tasks, and modeled as state-dependent or history-dependent (as prescribed by "influenceType=state" or "influenceType=history") influences.

16 This trend was observed across all parameter settings with the exception of one degenerate case, involving a single task per agent with window size of 1, in which both the influence space size and policy space size remained constant. In this case, regardless of the time horizon or the placement of the task's window, agent i only ever has 2 possible policies and 2 possible influences.
(A) tasksPerAgent=1, localWindowSize=1.0, uncertainty=0.0, NLATs=1, influenceType=state; (B) tasksPerAgent=3, localWindowSize=0.5, uncertainty=0.5, NLATs=1, influenceType=state; (C) tasksPerAgent=3, localWindowSize=0.5, uncertainty=1.0, NLATs=2, influenceType=history.
Figure 4.10: State, policy, and influence space sizes as a function of time horizon T.

Figure 4.11: Degree of influence as a function of time horizon T.

4.6.2.2 Branching Due To Decisions

Next, I examine the impact of branching in the local state space caused by decisions (i.e., action choices) that agent i faces. For the task-based problems in my testbed, branching from decisions is controlled by two parameters: the number of tasks per agent and the local window size. Each of these two parameters, when increased, yields a consequent increase in the number of available actions (averaged across the entire state space). In any given state, an agent has at most tasks per agent + 1 available actions (where the additional action causes the agent to idle). However, not all of these actions will be available in every state. The local window size controls the proportion of decision steps during which each action can be taken.

For this experiment, I systematically vary each of these two parameters and examine the effect on policy space size and influence space size as before. I also measure the average branching factor (across all nonterminal states) of the agent's local decision model that results from each setting of the two parameters, so as to solidify the connection between testbed parameters and branching factor (which is a more general attribute that is easily computed for any TD-POMDP problem, task-based or otherwise). The results are shown in Figure 4.12, where local window size is varied along the x-axis and the three values of tasks per agent appear as lines superimposed on each plot. Again, out of all the combinations of parameter settings, I present three settings (A, B, and C) whose qualitative trends are representative of those observed in all of the remaining settings.

Figure 4.12 generally confirms that more tasks and wider task windows yield a larger branching factor. However, the increase in branching factor due to local window size is extremely small for setting A. This is because the uncertainty parameter is set such that each task has a single duration that is uniformly randomly selected from the interval (0, ⌊local window size · (T − 1)⌋ + 1).
Effectively, as the local window size increases, tasks tend to take deterministically longer to complete, causing a larger portion of states to have just a single available continue action, and thereby counteracting the rise in branching factor from increased local window size.

For all settings, we observe that the policy space size increases both with local window size and with tasks per agent, and that the influence space size increases with the local window size. In the majority of parameter settings, we observe the same qualitative relationship between influence space size and tasks per agent, though this trend is faint for setting B and indiscernible for setting A. The reason for this is that, due to the combination of small values of uncertainty and small values of local window size, the number of outcomes of each task is limited to just 1, for setting A and all of the points in setting B except for local window size = 1.0 (wherein the number of task outcomes is 2). With just a single duration per task, problems become entirely deterministic, as do agent i's influences. That is, each influence either conveys (1) that the nonlocally-affecting task will complete with certainty at a particular time or (2) that the nonlocally-affecting task will never complete. As such, the number of influences simply relates to the number of different times agent i is allowed to start the nonlocally-affecting task, regardless of the other tasks that the agent might execute.

(A) T=5, uncertainty=0.0, NLATs=1, influenceType=state; (B) T=3, uncertainty=0.5, NLATs=1, influenceType=state; (C) T=5, uncertainty=1.0, NLATs=1, influenceType=state.
Figure 4.12: Branching factor, policy space size, and influence space size.

In all three cases, we observe a decrease in the degree of influence as the branching factor increases (shown in Figure 4.13). The decrease in degree of influence is minimal when there is only a single task per agent because, in this case, all of the agent's decisions involve deciding whether or not to start the single nonlocally-affecting task. For this scenario, one might expect the influence space size to be equal to the policy space size. This expectation is valid when the local window size is 0 (meaning that the length of the window is 1 time unit) because there are just 2 influences and 2 policies. However, as the nonlocally-affecting task window grows, a growing number of policies dictate that agent i start executing the task too late for it to succeed before its latest end time, and these policies all map to the same influence, thereby effecting a decrease in the degree of influence.
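The branching-factor measurement reported above is simple to reproduce. The sketch below is a hypothetical helper, assuming the local decision model exposes each state's outgoing branches as (action, next state) pairs; it is not the testbed's actual instrumentation.

```python
def average_branching_factor(successors):
    """Mean number of outgoing branches over nonterminal states, where
    successors[s] is the set of (action, next_state) branches out of s."""
    nonterminal = [branches for branches in successors.values() if branches]
    return sum(len(b) for b in nonterminal) / len(nonterminal) if nonterminal else 0.0

# toy: one decision state with two actions, one of which has an uncertain outcome
print(average_branching_factor({
    "s0": {("Task A", "s1"), ("Task A", "s2"), ("NOOP", "s3")},
    "s1": set(), "s2": set(), "s3": set(),
}))
```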
Figure 4.13: Degree of influence vs. tasks per agent and local window size.

4.6.2.3 Branching Due To Uncertainty

Another component that affects the branching factor of agent i's state space is the uncertainty in the outcomes of its actions. Given more uncertainty, there is a wider array of future states reachable from each action, and hence a larger branching factor. Again, I hypothesize that the increase in branching factor will yield a general increase in the size of the policy space. Although there are no more actions available in each state, I expect that there will be a greater number of states, thereby yielding a greater number of policy decisions. Uncertainty should also have a significant effect on the size of the influence space. Since influences encode probabilities of nonlocal feature transition outcomes, and more uncertainty yields a greater number of probabilistic outcomes, increasing the uncertainty can only increase the number of feasible influence points.

I test these hypotheses by varying a testbed parameter, uncertainty, which controls the number of outcomes of each of agent i's tasks, such that the number of outcomes per task is set to ⌊uncertainty · (window length − 1)⌋ + 1 (as detailed in Section 4.6.1.3). The results are shown in Figures 4.14 and 4.15, which plot the branching factor, policy space size, influence space size, and degree of influence as a function of uncertainty for three different parameter settings. As before, the trends shown for parameter settings A, B, and C are qualitatively characteristic of all remaining combinations of parameter settings (from Table 4.1) not shown.

As predicted, all cases exhibit an increase in both branching factor (plotted on a linear scale) and influence space size (plotted on a logarithmic scale) as uncertainty is varied from 0.0 to 1.0. In almost all cases (including those not shown), we also observe an increase in the policy space size. Moreover, the increase in the policy space size usually overwhelms that of the influence space size.

(A) T=3, tasksPerAgent=3, localWindowSize=1.0, NLATs=1, influenceType=state; (B) T=5, tasksPerAgent=2, localWindowSize=0.5, NLATs=2, influenceType=history; (C) T=5, tasksPerAgent=1, localWindowSize=1.0, NLATs=1, influenceType=state.
Figure 4.14: State, policy, and influence space sizes as a function of uncertainty.
This trend is clearly illustrated by case A, where both the policy space and influence space appear to grow exponentially, but the policy space grows more steeply, yielding an overall decrease in the degree of influence (Figure 4.15A). In case B, we again observe an exponential increase in the influence space, but a less pronounced increase in policy space, and consequently only a very slight decrease in the degree of influence. This suggests that for smaller problems with fewer tasks (such as in case B), influence-based abstraction may have less computational benefit (yielding a smaller reduction in the search space even when there is a large amount of uncertainty). Case C illustrates an extreme wherein there is just a single task. Here, since the only actions agent i has are to begin the task or to idle, the only choice the agent faces is whether or not to begin its task. Varying the outcomes of the lone task has no effect on the policy space whatsoever, and hence we observe a flat policy space size curve (in Figure 4.14C). In contrast, the influence space grows as the uncertainty is varied from its minimum to maximum value, yielding a slight increase in the degree of influence (Figure 4.15C). Note that, out of all parameter settings, including those not shown, case C exhibited the largest growth in the degree of influence.

Figure 4.15: Degree of influence as a function of uncertainty.

The results in the past three sections are promising. In general, Figures 4.10-4.15 show an increase in influence space size but a decrease in the degree of influence. This suggests that, by and large, as agents' local problems become more complex, the benefits of abstracting influences will be magnified. However, this empirical trend is conditioned on the agents' nonlocal effects remaining constant as the local problem size grows. In the next set of experiments, I explore what happens when the nonlocal effects are varied.

4.6.2.4 Number of Nonlocal Features and Influence Type

The number of nonlocal features (controlled by agent i and affecting other agents) has a direct effect on the size of agent i's influence encoding, whose parameters are the probabilistic transitions of agent i's nonlocal features. As described in Section 4.6.1.2, if each nonlocal feature is encoded with a separate state-dependent influence (which is denoted by parameter setting influenceType=state), the size of the influence encoding grows linearly with the number of nonlocal features. If, on the other hand, agent i models its influences as history-dependent and influence-dependent with respect to each other (influenceType=history), the size of i's influence encoding grows exponentially with the number of nonlocal features. As the size of the encoding increases, the influence is capable of representing more details of i's policy. Further, by describing agent i's behavior with greater specificity, thereby allowing each influence point a refined scope of possible policies, one would expect that this richer encoding would also engender a larger number of influence points.
points. In other words, a larger encoding should result in a larger influence space.[17] As such, I offer the general hypotheses that (1) as the number of nonlocal features increases, the size of the influence space increases; and (2) when influences are history- and influence-dependent, the size of the influence space will be, on average, larger than when influences are state-dependent.

I test these hypotheses by systematically generating variations of the baseline problems used in the previous sections (and described in Table 4.1), where in each variation, some number (specified by the value of parameter NLATs) of agent i's tasks are turned into nonlocally-affecting tasks. Further, for each problem, I explore the feasible space of influences for both the state-dependent encoding (denoted influenceType=state) and the history-dependent influence-dependent encoding (denoted influenceType=history), both of which are described in Section 4.6.1.2.

Figure 4.16 shows the results of varying the number of nonlocally-affecting tasks for three different settings of baseline parameters (A, B, and C). Across all settings (including those not shown), we observe an increase in the average influence space size as the number of nonlocally-affecting tasks increases. We also observe an increase in the average degree of influence for all settings due to the fact that agent i's policy space remains constant as NLATs is varied (given that tasksPerAgent is fixed). For state-dependent influence encodings (which are indicated by the black line with circular markers in Figure 4.16), the average increase in influence space size was near linear in all cases. The history-dependent influence encodings exhibited similar average influence space growth in some cases, and super-linear growth in other cases.

The three cases A, B, and C that I present here illustrate a dominant trend that I noticed across the space of baseline parameter settings. For cases with low uncertainty (e.g., case A), the influence space size for state-dependent influence encodings was always equal to that of the history-dependent influence encoding. Intuitively, the two influence space sizes must be the same for this case because uncertainty = 0.0 classifies problems as entirely deterministic.[18] When uncertainty > 0, and as the number of nonlocal features increases, the growth of the influence space size for history-dependent encodings overtakes that of state-dependent encodings on average. Moreover, as illustrated by cases B and C, as the uncertainty becomes larger, the gap between average influence space sizes widens.

[17] Although this assertion is intuitive, there do exist counterexamples for which the influence space size is the same for a larger, more informative encoding than it is for a smaller, less informative encoding.
[18] When all effects are deterministic, any given policy will result in nonlocal features changing values at predictable times, and there is only one possible history. In this case, any influence point encoded with a history-dependent influence can be reduced to one encoded with a state-dependent influence, and thus cannot differentiate between any two policies that the state-dependent influence could not have differentiated between, and therefore cannot accommodate a larger space of feasible influences.
The last point in Figure 4.16C is missing because, for some of the problems with setting ⟨uncertainty = 1.0, NLATs = 3, influenceType = history⟩, the influence spaces were not able to be explored exhaustively within the 3 hours of computation time allotted to each problem. Based on those runs that did complete, the average influence space size for this missing data point is greater than 600.

Figure 4.16: Varying the number of nonlocally-affecting tasks and influence type. (A) T=5, tasksPerAgent=3, localWindowSize=0.5, uncertainty=0.0; (B) T=5, tasksPerAgent=3, localWindowSize=0.5, uncertainty=0.5; (C) T=5, tasksPerAgent=3, localWindowSize=0.5, uncertainty=1.0. Each panel plots mean influence space size and mean degree of influence against NLATs for influenceType=state and influenceType=history.

For all cases, notice that the variance in both the influence space size and the degree of influence is extremely large. Figure 4.17 presents a more detailed view of influence space sizes for case B above. Here, a histogram shows the distribution of influence space sizes for 100 randomly-generated problems per value of NLATs, and for both the state-dependent influence encoding (top) and the history-dependent encoding (bottom). With histograms for the three values of NLATs superimposed (in black, grey, and white), we observe that the distributions take on a similar shape. For all values of NLATs and influenceType, there is a large mass within the range of 1-50 influences and a tail leading outward. As NLATs increases, the mass tends to become more spread out, and the tail of the distribution heavier. For this particular case, there is not a large difference between the state-dependent influence encoding and the history-dependent encoding except for NLATs = 3, where we observe that the distribution is spread well beyond 200 for the latter encoding.

Figure 4.17: Distribution of influence space sizes for 100 problems (per setting). (Top: histogram of the number of influence points for T=5, tasksPerAgent=3, localWindowSize=0.5, uncertainty=0.5, influenceType=state; bottom: the same for influenceType=history; each histogram superimposes NLATs = 1, 2, 3.)

One cause of the variance in influence space size is due to the variance in the complexities of the local models created by the random problem generator (caused by random window placement and random duration selection). This is evident from the large range of policy space sizes, which ranged from 5,184 all the way to 382,205,952 for the problems plotted in Figure 4.17. With this wide range of policy space sizes, it is not surprising that the influence space sizes varied from 2 to 613.
Figure 4.18 shows a scatter plot of policy space sizes and corresponding influence space sizes for the same sets of problems, again classified by the number of nonlocally-affecting tasks.

Figure 4.18: Scatter plot of policy space sizes and respective influence space sizes. (Left: influence space size vs. policy space size for influenceType=state; right: the same for influenceType=history; points are classified by NLATs = 1, 2, 3.)

Here, we observe a slight correlation between policy space size and influence space size. The correlation appears to be strongest when the number of nonlocally-affecting tasks is greatest. This is somewhat intuitive since a greater value of NLATs corresponds to a larger percentage of agent i's tasks that are nonlocally affecting, and thus a larger portion of policy decisions that impact the transitions of nonlocal features. The correlation is weakest for NLATs = 1 (plotted with black diamonds). Clearly, there are other factors at play that are causing the large variance in the influence space size. In the next subsection, I explore some of these other factors.

4.6.2.5 Window of Nonlocal Feature Manipulation

Earlier (in Section 4.6.2.2), I analyzed the effects of manipulating tasks' execution windows via a parameter localWindowSize, which set the size of all of agent i's task windows. I now exert finer control, targeting only the nonlocally-affecting tasks' windows, using an analogous[19] parameter NLATWindow. By manipulating the windows of the nonlocally-affecting tasks, I am able to control the proportion of decision steps during which each nonlocal feature may be affected. I now test the initial hypothesis that, all else being equal, a larger window of nonlocal feature manipulation will cause an increase in the degree of influence.

[19] Whereas localWindowSize specifies the size relative to the time horizon T, NLATWindow specifies the size of the nonlocally-affecting tasks' windows in absolute terms.

Figure 4.19: Varying the size of the nonlocal feature manipulation window. (A) T=5, tasksPerAgent=3, localWindowSize=0.5, uncertainty=0.5, serviceStartTime=0, influenceType=state, NLAT_est=0; (B) T=3, tasksPerAgent=3, localWindowSize=1.0, uncertainty=0.5, serviceStartTime=0, influenceType=state, NLAT_est=0; (C) T=5, tasksPerAgent=3, localWindowSize=1.0, uncertainty=0.0, serviceStartTime=0, influenceType=state, NLAT_est=0. Each panel plots mean policy space size, mean influence space size, and mean degree of influence against NLATWindow.
Contrary to my hypothesis, empirical results showed an inverse relationship between the degree of influence and the size of the nonlocally-affecting task's window across all baseline parameter settings, 3 of which are shown in Figure 4.19. Due to the sensitivity of the policy space size, we observe an (often super-) exponential increase when even just a single task's window is expanded. In case A, the policy space growth overwhelms the growth of the influence space, resulting in a significant decrease in the degree of influence as the nonlocally-affecting task's window was increased from 1 time unit to the length of the problem horizon T (a trend which was most common across all parameter settings). For small problems, such as that shown in case B, the growth of the influence space nearly matches that of the policy space, yielding a degree of influence that is relatively flat. For case C, uncertainty = 0 indicates that all tasks have deterministic durations. In this case, determinism forces a linear increase in the number of influences (as was previously described in Section 4.6.2.2). The policy space growth is similarly stunted, which results in a degree of influence that is relatively unaffected by the size of the nonlocally-affecting task's window.

Although my initial hypothesis regarding the increasing degree of influence turned out to be false, we can distill from these results a strong trend in the increase of the influence space. For all nondeterministic cases (including those not shown), the influence space size grew exponentially with the increasing nonlocally-affecting task window. Just as we observed in Section 4.6.2.4, however, the increase in the number of influences is accompanied by an exponential increase in the variance.

Next, I examine one other factor contributing to this variance. I hypothesize that it is not only the size of the window of nonlocal feature manipulation that affects the influence space size, but also the window's temporal placement. Intuitively, the later the window of nonlocal feature manipulation, the more decisions that agent i has made before interacting with others, and hence the more that could have transpired locally before an interaction takes place, corresponding to an increasing number of possible trajectories of agent i before setting the nonlocal feature. I hypothesize that, as a consequence, the later the window of nonlocal feature manipulation, the more feasible influence points there will be in general.

I test this hypothesis by simultaneously varying the size of the nonlocally-affecting task's window and the location of its window (as controlled by the earliest start time NLAT_est). The results, plotted in Figure 4.20, confirm that this hypothesis holds true for the problems in my testbed. The only exceptions are settings for which there are no other tasks besides the single nonlocally-affecting task or for which the other tasks are all deterministic[20], as in case A. For these exceptions, the influence space size remains constant because for each time t that agent i can start its nonlocally-affecting task, regardless of the value of t, there is necessarily a single influence point due to the certainty that agent i will indeed start the task at time t.[21] In both cases B and C, the influence space size increases exponentially as the nonlocally-affecting task window is shifted forward in time, as predicted. However, the degree of influence exhibits more complicated behavior.
At a high level, as the local window size is increased, progressing from 0.0 in case A to 1.0 in case C, the degree goes from strictly decreasing (in case A) to wavering (in case B), to strictly increasing in case C.

[20] Determinism in case A is due to the setting localWindowSize = 0.0, dictating that each local task has just a single outcome with duration 1 and probability 1.
[21] Although there is uncertainty as to when the nonlocally-affecting task will finish, this uncertainty is encoded in a single influence.

Figure 4.20: Varying the nonlocally-affecting task's start time and window size. (A) T=5, tasksPerAgent=3, localWindowSize=0.0, uncertainty=1.0, influenceType=state; (B) T=5, tasksPerAgent=3, localWindowSize=0.5, uncertainty=1.0, influenceType=state; (C) T=5, tasksPerAgent=3, localWindowSize=1.0, uncertainty=1.0, influenceType=state. Each panel plots mean policy space size, mean influence space size, and mean degree of influence against NLAT_est for NLATWindow = 1, ..., 5.

Given the inverse relationship between policy space size and degree of influence, the trend that we observe in the degree of influence is a result of the opposite trend in the size of the policy space. The explanation behind this trend is as follows. As the nonlocally-affecting task window is shifted forward in time, there are two opposing forces being exerted on the policy space. First, there is policy space growth due to the fact that there are more states at later times than at earlier times, and hence more opportunities for increasing the number of policy decisions when the nonlocally-affecting task is allowed to be started later on. In essence, the longer agent i delays a decision about its interaction, the more that could have transpired, and the more circumstances that it will need to consider. This force is dominant in case A as well as in cases B and C given that NLATWindow is large. Second, the later agent i may begin its nonlocally-affecting task, the later the branching due to the uncertain task outcomes occurs, and the smaller the subsequent increase in states at later times and the smaller the corresponding rise in the number of decisions due to these additional states. In essence, the longer that agent i delays its interactions, the shorter-lasting the consequences of its interactions will be, and the less it will need to reason about thereafter. This second force is dominant when local tasks' windows are large and also when the nonlocally-affecting task window is small. This makes sense because this second force is magnified when the branching factor (due to additional actions) is increased, which occurs when local task windows are enlarged (as we observed in Section 4.6.2.2).
4.6.3 Summary of Findings

The empirical results that I have presented in Sections 4.6.2.1-4.6.2.5 provide the following insights:

• In general, as an agent's local decision problem becomes more complex (involving more states and actions, and more uncertainty), there is an increasing number of influences that the agent can exert on its peers. However, in almost all cases, the rate of policy space growth exceeds that of influence space growth. Consequently, agents with more complex local behavior tend to have a smaller degree of influence. To validate that this result was not restricted to the space of relatively small problems considered above, I ran two additional tests on a set of larger problems. The results, shown in Figure 4.21, anecdotally corroborate that the trends observed in my earlier experiments can be extrapolated to problems with time horizons of 10 or greater.[22] These trends suggest that the potential advantages of coordinating abstract influence (over coordinating full policies) are magnified as agents' local behavior becomes more complex.

[22] Figure 4.21A extends my analysis of growing number of decision steps (from Section 4.6.2.1) and Figure 4.21B performs the same comparison of varying numbers of tasks and task window sizes from Section 4.6.2.2, but for larger problems with a time horizon of 10.

Figure 4.21: Increasing the size of the local decision problem. (A) tasksPerAgent=3, localWindowSize=0.5, uncertainty=0.5, influenceType=history, plotted against T for NLATs = 1, 2; (B) T=10, uncertainty=0.5, NLATs=1, influenceType=state, plotted against localWindowSize for tasksPerAgent = 1, 2, 3.

• As the size of an agent's influence encoding increases, the number of points in its feasible influence space tends to increase. Specifically, we observed an increase in the average influence space size when the encoded distribution included more nonlocal features and also when its probabilities were conditioned on history instead of on state. We also observed that the growth in the influence space due to the more verbose history-dependent encoding was greater when the problem uncertainty increased. With respect to the degree of influence, agents whose interactions can be encoded more compactly tend to be more weakly-coupled. As a consequence, influence-based abstraction is (on average) more effective at reducing the size of the search space when the influence encoding is small.

• In general, along with increasing influence space size, we also observed a significant increase in the variance of the influence space size. This indicates that the computation required to search the influence space (exhaustively) is increasingly unpredictable for problems with more complex influence encodings. To combat this unpredictability, I have identified several characteristics that can be used to gauge a problem's influence space size and its degree of influence:

1. The size of the policy space appears to be weakly correlated with the size of the influence space.
2. Not surprisingly, increasing the size of the window during which agents are allowed to interact tends to increase the number of feasible influence points. However, the additional number of policies that results from the agents' greater interaction flexibility tends to overwhelm the growth of the influence space. This suggests that giving agents a broader array of choices about when to interact actually makes them more weakly coupled.[23]

[23] This result indicates that there is some discrepancy between the intuitive definition of weak coupling and the semantics that I have presented in Section 3.5.

3. For nondeterministic problems, moving the window of interaction forward in time tends to increase the number of feasible influence points. Intuitively, the later an agent's interaction may occur, the more that can transpire before the interaction, and so the greater the uncertainty in if and when the interaction will take place. This also tends to increase the degree of influence. However, for cases in which the interaction window is small, or in which there is little uncertainty, moving the window of interaction forward in time can cause a decrease in the degree of influence (and thus an increase in the effectiveness of influence-based abstraction at reducing the size of the policy space).

4.7 Summary

This chapter makes several key contributions. First, I developed a novel best-response model whose computational complexity is dependent on the number of shared state features and otherwise independent of the number of peer agents. As such, agents' usage of this best-response model constitutes inherent exploitation of reduced state factor scope (as formalized in Section 3.5.1.3). Although restricted in its application to TD-POMDP problems, this best-response model is the first to exploit such structure among transition-dependent agents.

Within the larger scope of this work, in this chapter I have developed a general framework for abstracting agents' transition influences. More significantly, I have proven that the influence abstractions suffice for optimal local reasoning about peers' behavior. To begin to evaluate the efficacy of my influence-based abstraction methodology, I have performed an empirical analysis, and presented evidence that influence-based abstraction enables a significant reduction in the overall search space. Further, by identifying problem characteristics that impact the size of the influence space and the size of the policy space, my analysis takes steps towards characterizing the circumstances under which influence-based policy abstraction is most advantageous. Although the results presented in this chapter capture only the number of influences and not the computation required to find each point in the feasible influence space, my rigorous analysis of influence-space size and degree of influence leads me, in Chapter 6, to characterize the overall computational advantages and disadvantages of influence-based abstraction (after developing the remaining components of my influence-based solution methodology).

CHAPTER 5

Constrained Local Policy Formulation

In the last chapter, I identified an alternative search space, the influence space, proposing that agents coordinate influences instead of policies. However, they cannot do away with policies altogether. Whereas influences convey expectations about select portions of joint behavior, agents' policies provide complete specifications of local behavior, without which a solution to the planning problem would be incomplete.
As such, there is an inherent duality of agent reasoning associated with influence-based policy abstraction: individually, agents reason about policies, and jointly, agents reason about influences. The constrained local policy formulation methodology that I present in this chapter provides agents with a mapping between the two representations.

Figure 5.1: Functional diagram of constrained local policy formulation. (The constrained local policy formulation component uses the best-response model to translate between a local policy π_i and the influence Γ_i exerted on peers.)

As indicated by Figure 5.1, which isolates the "constrained local policy formulation" component from the other contents of my approach shown in Figure 1.2, I address both directions of translation. Translating from policy to influence, my methodology allows each agent to extract from any one of its policies the implied influence (on the agent's peers). Further, given a proposed influence, it allows agents to compute a policy that adheres to the influence, thereby translating from influence to policy.

At the heart of my approach lies a conceptual connection between the occupation measures of the MDP dual linear program (LP) formulation (D'Epenoux, 1963; Kallenberg, 1983) and the probabilistic effects that influences encode. In formalizing this relationship, I derive the probability value of each influence component as a function of the occupation measures returned by the MDP LP solution. Further, I develop a novel extension to the MDP LP formulation that incorporates additional constraints so as to guarantee that the solution policy adheres to an agent's proposed influences (if such a policy exists). In contrast to existing alternative approaches, which encourage the enforcement of various forms of influence by biasing the MDP model, my approach strictly enforces agents' influences without the need for parameter tuning or model manipulation, by constraining the policy directly. Further, my approach is guaranteed to produce individual agent policies that are optimal with respect to the influence constraints.

5.1 Overview

The contents of this chapter are organized as follows. I begin, in Section 5.2, by formalizing the application of the dual LP to an agent's best-response model and incorporating additional mixed-integer constraints for computing and evaluating deterministic policies and for handling partial observations. In Section 5.3, I introduce the relationship between occupation measures and policy effects within the simple, yet restrictive, context of probabilistic goal achievement. I relax this restriction in Section 5.4, extending to state-dependent and history-dependent influences. Next, in Section 5.5, I contrast my constrained policy formulation methodology with alternative approaches. In Section 5.6, I develop an algorithm that iteratively enumerates all of an agent's outgoing influences and analyze its complexity. I conclude the chapter with a summary of its contributions in Section 5.7.

5.2 Applying the Dual LP Formulation

Among the various single-agent (PO)MDP solution methods reviewed in Section 2.2.1.2, an agent may employ the LP formulation from Equation 5.1 to solve its best-response model (developed in Section 4.2). In review, the best-response model incorporates all peers' influences into its transition function. During the planning process, agents' peers propose influences, at which point the agent can use its best-response model to reason about its local behavior as if it were alone in the world (since the influences of its peers have been fixed).
Throughout this section, I will treat the agent, reasoning with its best-response model, in isolation. In general, the TD-POMDP best-response model is partially observable; however, for the moment, let us assume that it is a completely-observable single-agent MDP. In Section 5.2.3, I will describe an extension for applying LP techniques to partially-observable models. In Sections 5.2.1-5.2.2, I describe extensions for computing and evaluating deterministic policies. But before then, let me formally re-introduce the basic form of the MDP dual linear program (D'Epenoux, 1963; Kallenberg, 1983).

The variables x = ⟨x(s, a), ∀s ∈ S, ∀a ∈ A⟩ of the LP, called occupation measures, model the expected (discounted) number of times that action a is taken in state s. For Dec-POMDPs, and consequently for agents' local best-response models, the time horizon is finite and the discount factor γ = 1.[1] Thus, for our purposes, the occupation measures specify the expected non-discounted number of times action a is taken in state s from time steps 0 to T (the finite horizon).

$$\max_{\mathbf{x}} \; \sum_{s \in S} \sum_{a \in A} x(s,a)\, R(s,a)$$
$$\text{s.t.} \quad \forall s^{t+1} \in S: \;\; \sum_{a^{t+1} \in A} x(s^{t+1}, a^{t+1}) \;-\; \sum_{s^t \in S} \sum_{a^t \in A} x(s^t, a^t)\, P\!\left(s^{t+1} \mid s^t, a^t\right) \;=\; \alpha(s^{t+1}) \qquad (5.1)$$
$$\forall s \in S,\; \forall a \in A: \;\; x(s,a) \ge 0$$

[1] Throughout this chapter, I assume that γ = 1. The extent to which my formalism can be extended to cases where γ < 1 is the subject of future work.

The first constraint of the LP in Equation 5.1 can be thought of as conserving the flow of probability through each state s^{t+1}, requiring that the expected number of times s^{t+1} is exited (the flow out) minus the expected number of times s^{t+1} is entered (the flow in) be equal to the probability of starting in s^{t+1}. The second constraint forces each occupation measure to be no less than zero. When an LP is solved, the LP solver returns a solution, which is a setting of the variables (in this case x) that maximizes the objective function, in addition to the resulting maximal objective value. In Equation 5.1, the objective function max_x Σ_{s∈S} Σ_{a∈A} x(s,a)R(s,a) ensures that the solution to the LP, which I will refer to as the optimal occupation vector x*, maximizes the expected accumulation of rewards.

Upon computing the optimal occupation vector x*, the agent can recover its corresponding best-response policy as follows:

$$\pi^*(s,a) = \frac{x^*(s,a)}{\sum_{a' \in A} x^*(s,a')} \qquad (5.2)$$

Similarly, the agent's (local) value V(π*) of the policy π* that was computed using Equation 5.2 is simply the value of the objective function:

$$V(\pi^*) = \sum_{s \in S} \sum_{a \in A} x^*(s,a)\, R(s,a) \qquad (5.3)$$

5.2.1 Constraining the LP to Return a Deterministic Policy

In general, the policy π* returned (in Equation 5.2) by the MDP LP is stochastic, prescribing a probability with which the agent shall take each action in each state (π : S × A → [0, 1]). There are an infinite number of such policies. In the interest of maintaining a finite search space, my overarching solution methodology for planning coordinated behavior (developed in Section 5.6 and in Chapter 6) restricts itself to deterministic policies of the form π : S → A. Fortunately, it is straightforward to constrain the LP from Equation 5.1 to return the optimal deterministic policy. However, it entails transforming the LP into a mixed-integer LP (MILP), making policy computation more costly in general.
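Before developing that extension, the basic dual LP of Equations 5.1-5.3 can be made concrete with a small sketch. The following is a minimal illustration (not the dissertation's implementation) that solves the dual LP with scipy.optimize.linprog on an assumed two-step MDP; the states, actions, rewards, and transition probabilities are illustrative placeholders.

# Minimal sketch of the dual LP (Eq. 5.1) on an assumed tiny time-indexed MDP.
import numpy as np
from scipy.optimize import linprog

states = ["s0", "s1a", "s1b"]                     # s0 at t=0; s1a, s1b at t=1
actions = ["a1", "a2"]
alpha = {"s0": 1.0, "s1a": 0.0, "s1b": 0.0}       # start-state distribution
P = {("s0", "a1"): {"s1a": 0.8, "s1b": 0.2},      # P[(s,a)][s'] = transition prob.
     ("s0", "a2"): {"s1a": 0.3, "s1b": 0.7}}
R = {("s0", "a1"): 1.0, ("s0", "a2"): 2.0, ("s1a", "a1"): 5.0,
     ("s1a", "a2"): 0.0, ("s1b", "a1"): 0.0, ("s1b", "a2"): 1.0}

idx = {(s, a): k for k, (s, a) in enumerate((s, a) for s in states for a in actions)}
n = len(idx)

# Objective: maximize sum x(s,a) R(s,a)  ==  minimize -R^T x
c = np.zeros(n)
for (s, a), k in idx.items():
    c[k] = -R.get((s, a), 0.0)

# Flow conservation: sum_a x(s',a) - sum_{s,a} x(s,a) P(s'|s,a) = alpha(s')
A_eq = np.zeros((len(states), n))
b_eq = np.array([alpha[sp] for sp in states])
for i, sp in enumerate(states):
    for a in actions:
        A_eq[i, idx[(sp, a)]] += 1.0
    for (s, a), succ in P.items():
        A_eq[i, idx[(s, a)]] -= succ.get(sp, 0.0)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n, method="highs")
x = res.x
value = -res.fun                                  # objective value, as in Eq. 5.3
# In this acyclic toy the optimum is deterministic, so argmax over x recovers it
# (Eq. 5.2 would normalize x(s,a) by sum_a' x(s,a') in general).
policy = {s: max(actions, key=lambda a: x[idx[(s, a)]]) for s in states}
print(value, policy)

The sketch prints the optimal expected reward and an action per state; unreachable states receive an arbitrary action, consistent with the discussion that follows.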
In adopting this mixed-integer extension, the solution methodology that I adopt in this dissertation inherently trades potential reductions in the computational complexity of local policy computation for finiteness of the joint policy space and ease of searching that space (as I describe in Section 5.6).

Equation 5.4 computes deterministic policies by extending the standard MDP dual LP (Equation 5.1) with additional variables and constraints. Here, I introduce a vector of Boolean variables z = ⟨z(s, a) ∈ {0, 1}, ∀s ∈ S, ∀a ∈ A⟩, whose values indicate whether or not the corresponding occupation measures are greater than zero. Each pair {z(s, a), x(s, a)} is thereby connected by a constraint −1 ≤ (x(s, a) − z(s, a)) ≤ 0, requiring that z(s, a) = 1 whenever x(s, a) > 0 (but not the converse). Subsequently, an additional constraint for each state s, of the form Σ_{a∈A} z(s, a) = 1, restricts that at most one of {z(s, a), z(s, a′), z(s, a′′), ...}, and thus at most one of {x(s, a), x(s, a′), x(s, a′′), ...}, be nonzero. The solution to our new LP takes the form of an optimal ⟨x*, z*⟩ pair.

$$\max_{\mathbf{x}} \; \sum_{s \in S} \sum_{a \in A} x(s,a)\, R(s,a)$$
$$\forall s^{t+1} \in S: \;\; \sum_{a^{t+1} \in A} x(s^{t+1}, a^{t+1}) - \sum_{s^t \in S} \sum_{a^t \in A} x(s^t, a^t)\, P\!\left(s^{t+1} \mid s^t, a^t\right) = \alpha(s^{t+1})$$
$$\forall s \in S,\; \forall a \in A: \;\; -1 \le \big(x(s,a) - z(s,a)\big) \le 0 \qquad (5.4)$$
$$\forall s \in S: \;\; \sum_{a \in A} z(s,a) = 1$$
$$\forall s \in S,\; \forall a \in A: \;\; x(s,a) \ge 0$$
$$\forall s \in S,\; \forall a \in A: \;\; z(s,a) \in \{0,1\}$$

Upon computing ⟨x*, z*⟩ subject to the above constraints, the agent recovers its deterministic policy as:

$$\pi^*(s) = \arg\max_{a} z^*(s,a), \qquad (5.5)$$

wherein a single action a is assigned to each state s. Note that z(s, a) ∈ {0, 1}, and z*(s, a) = 1 indicates that a is the only action (if any) with a positive occupation measure in state s. If state s is unreachable via π*, the deterministic action may be selected arbitrarily by the LP solver.

In general, enforcing deterministic policies in this manner results in a harder optimization problem. With the addition of integer variables z, Equation 5.4 defines a mixed-integer linear program (MILP), whose worst-case complexity is no longer polynomial; instead, the best known algorithms take exponential time (in the number of variables) in the worst case.

5.2.2 Evaluating Deterministic Policies

In addition to computing optimal (best-response) policies, an agent may use another variation of the basic MDP dual LP to evaluate any candidate deterministic policy π (computed by a linear program or otherwise). To do so, it is simply a matter of disallowing all actions other than those specified by the policy π:

$$\forall s \in S,\; \forall a \in A \text{ s.t. } a \neq \pi(s): \;\; x(s,a) = 0 \qquad (5.6)$$

The constraints given in Equation 5.6, when added to the original LP from Equation 5.1, derive from policy π its implied occupation measures x (consistent with the transition dynamics of the MDP). During the process of solving this LP, its objective function is evaluated, and π's corresponding value computed.
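As a sketch of the mixed-integer program of Equation 5.4, the following fragment (an illustration with assumed toy data, not the dissertation's code) builds and solves the MILP with the PuLP modeling library and recovers a deterministic policy via Equation 5.5.

# Sketch of the deterministic-policy MILP (Eq. 5.4) with PuLP on assumed data.
import pulp

states  = ["s0", "s1a", "s1b"]
actions = ["a1", "a2"]
alpha   = {"s0": 1.0, "s1a": 0.0, "s1b": 0.0}
P = {("s0", "a1"): {"s1a": 0.8, "s1b": 0.2},
     ("s0", "a2"): {"s1a": 0.3, "s1b": 0.7}}
R = {("s0", "a1"): 1.0, ("s0", "a2"): 2.0, ("s1a", "a1"): 5.0,
     ("s1a", "a2"): 0.0, ("s1b", "a1"): 0.0, ("s1b", "a2"): 1.0}
pairs = [(s, a) for s in states for a in actions]

prob = pulp.LpProblem("deterministic_policy_MILP", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", pairs, lowBound=0)          # occupation measures
z = pulp.LpVariable.dicts("z", pairs, cat=pulp.LpBinary)   # action indicators

# Objective: expected sum of rewards
prob += pulp.lpSum(x[(s, a)] * R.get((s, a), 0.0) for (s, a) in pairs)

# Flow conservation (first constraint of Eq. 5.4)
for sp in states:
    inflow = pulp.lpSum(x[(s, a)] * P.get((s, a), {}).get(sp, 0.0) for (s, a) in pairs)
    prob += pulp.lpSum(x[(sp, a)] for a in actions) - inflow == alpha[sp]

# Link x and z:  -1 <= x(s,a) - z(s,a) <= 0, so x(s,a) > 0 forces z(s,a) = 1
for (s, a) in pairs:
    prob += x[(s, a)] - z[(s, a)] <= 0
    prob += x[(s, a)] - z[(s, a)] >= -1

# Exactly one indicator per state => at most one nonzero x(s, .) per state
for s in states:
    prob += pulp.lpSum(z[(s, a)] for a in actions) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
policy = {s: max(actions, key=lambda a: pulp.value(z[(s, a)])) for s in states}  # Eq. 5.5
print(pulp.value(prob.objective), policy)

Evaluating a given deterministic policy (Equation 5.6) amounts to adding, for each state, constraints x[(s, a)] == 0 for every a other than the prescribed action, and leaving the rest of the model unchanged.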
5.2.3 Handling Partial Observability

Next, I describe an extension for computing optimal POMDP policies. The idea is to model the observation history together with the state in an MILP, such that occupation measures x are defined over both state and observation history: x(s^t, o⃗^t, a) refers to the probability that the agent observes o⃗^t from times (1, . . . , t), is in state s at time t, and takes action a. Inevitably, the size of the occupation measure vector will be larger than that required for fully-observable problems (growing at worst exponentially in the time horizon). However, the ensuing computational overhead is not unreasonable for problems wherein the observation history can be encoded compactly. Note that the original semantics of occupation measure x(s^t, a) can easily be recovered from the POMDP occupation measures x(s^t, o⃗^t, a):

$$x(s^t, a) = \sum_{\vec{o}^{\,t}} x(s^t, \vec{o}^{\,t}, a), \qquad (5.7)$$

which follows from the fact that two different observation histories cannot both occur in a single execution trajectory. Analogously, the POMDP LP objective function is simply

$$\max_{\mathbf{x}} \; \sum_{s} \sum_{\vec{o}} \sum_{a} x(s, \vec{o}, a)\, R(s,a).$$

The flow constraint in the POMDP LP (which is an extension of the first constraint in Equation 5.1) must account for the probability of encountering state-observation pair (s^{t+1}, o^{t+1}) given that action a^t was taken in state s^t upon observing history o⃗^t:

$$\forall s^{t+1},\; \forall \vec{o}^{\,t+1} = \langle \vec{o}^{\,t}, o^{t+1} \rangle: \;\; \sum_{a^{t+1}} x(s^{t+1}, \vec{o}^{\,t+1}, a^{t+1}) - \sum_{s^t} \sum_{a^t} x(s^t, \vec{o}^{\,t}, a^t)\, P\!\left(s^{t+1} \mid s^t, a^t\right) O\!\left(o^{t+1} \mid a^t, s^{t+1}\right) = 0, \qquad (5.8)$$

which includes both the POMDP state transition probability and the probability of the new observation (as prescribed by the observation function O introduced in Section 2.2.2). Equation 5.8 constrains the flow from one observation to the next. An additional constraint is required to account for the start state distribution α:

$$\forall s^0 \in S,\; \forall \vec{o}: \;\; \sum_{a^0} x(s^0, \vec{o}, a^0) = \begin{cases} \alpha(s^0), & \text{if } \vec{o} = \emptyset \\ 0, & \text{otherwise} \end{cases} \qquad (5.9)$$

Just as in Section 5.2.1, we will use the integer variables z to constrain the policy to be deterministic. For the POMDP LP, we need a z(s, o⃗, a) value for every element of x. The deterministic policy constraints (not shown here) are otherwise identical to those given in Section 5.2.1. Given that our occupation measures store both state and observation history information, one additional set of constraints is needed. For the POMDP, policies map observation histories (but not states) to actions. Thus, a valid policy must assign the same action to all state-observation-history pairs with identical observation history:

$$\forall \vec{o},\; \forall a: \;\; \sum_{s \in S} z(s, \vec{o}, a) = \lVert S \rVert \cdot z(s_0, \vec{o}, a) \qquad (5.10)$$

In combination with the deterministic policy constraints involving the variables z developed in Section 5.2.1, Equation 5.10 requires that, for every unique observation history and for every action, the summation of z values must be equal to the number of states multiplied by the z value of one (arbitrarily chosen) state s_0. The consequence is that, for a given observation history and action, all states must share the same z value. By the semantics of z, the same deterministic action must therefore be chosen for every observation history regardless of the underlying state. Just as in Section 5.2.1, the deterministic action is easily recovered by selecting, for each observation history, the single action with a positive z value. The policy π : F_obs → A, in this case mapping observable feature values to actions, is therefore:

$$\pi^*(\vec{o}\,) = \arg\max_{a} z^*(s_0, \vec{o}, a), \qquad (5.11)$$

where state s_0 is arbitrarily chosen. Note that the POMDP extension described above may be used in combination with any of the other extensions developed in the remainder of this chapter.

5.3 Probabilistic Goal Achievement

As I have described in the last section, the dual LP expresses occupation measures that suffice as an alternate representation of an agent's policy. Whereas the conventional representation of a deterministic policy maps states to actions, occupation measures instead articulate expected state-action statistics, thereby providing a richer encoding.
Subject to the assumption that the state space is acyclic, occupation measures encode an agent's probabilistic action effects in addition to its action choices. The fact that this probabilistic information is intrinsic to the MDP linear program means that we can manipulate the policy formulation process at its core, explicitly specifying constraints and objectives on desired probabilistic effects. In this section, I describe a simple application of this concept before explicitly connecting it to influence-based abstraction in the next section.

Consider that, in addition to maximizing utility, an agent has other aspirations that cannot easily be accounted for in the utility function. In particular, the agent would like to reach a set of goal states S_g ⊂ S such that, in any given trajectory, at most one state s_g ∈ S_g may be encountered. By adapting the basic MDP dual LP from Equation 5.1, the agent can compute a policy guaranteed to reach exactly one of its goal states with the addition of a single constraint:

$$\sum_{s \in S_g} \sum_{a \in A} x(s,a) = 1$$

In essence, the agent is directly constraining its policy to achieve its goals. Although the objective function remains the same—to maximize the expected summation of rewards—the agent will now compute the highest-valued policy that reaches a goal state (if such a policy exists). More generally, the agent can constrain its policy to achieve its goals with probability ≥ ρ by constraining the occupation measures as:

$$\sum_{s \in S_g} \sum_{a \in A} x(s,a) \ge \rho$$

Yet another alternative is to alter the objective function such that the agent achieves its goals with maximal probability:

$$\max_{\mathbf{x}} \; \sum_{s \in S_g} \sum_{a \in A} x(s,a)$$

Note that the LP formulations suggested above exploit the fact that an occupation measure must equal the probability of ever visiting the state and taking the action. The equivalence between occupation measure x(s, a) and probability ρ is contingent upon the assumption that states are time indexed and so cannot be visited more than once in any execution trajectory. Moreover, I assume that the goal states are mutually exclusive, such that not more than one can be visited in a single trajectory, and thus the probability of reaching any goal state is equal to the summation of the probabilities of reaching each goal state.

The strength of this approach, in contrast to other policy formulation techniques (e.g., policy iteration, value iteration, dynamic programming), is its ability to constrain policies precisely while still maintaining optimality (with respect to the constraints). If there does not exist a policy that will satisfy the agent's probabilistic goal constraints, the LP solver will return "no solution" and the agent will know that its goals are over-constraining.
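To illustrate, the fragment below is a sketch with assumed names (prob and x stand for a PuLP model and occupation-measure variables built as in the earlier MILP sketch); it is not the dissertation's code. It expresses the goal-probability constraint and the alternative goal-maximizing objective described above.

# Sketch of the goal-achievement constraints of Section 5.3 on a PuLP dual-LP model.
import pulp

def add_goal_probability_constraint(prob, x, goal_states, actions, rho):
    # Require that some goal state be reached with probability >= rho
    # (assumes time-indexed states and mutually exclusive goal states).
    prob += pulp.lpSum(x[(s, a)] for s in goal_states for a in actions) >= rho

def set_goal_probability_objective(prob, x, goal_states, actions):
    # Alternative: replace the objective so the goal-reaching probability is maximized.
    prob.setObjective(pulp.lpSum(x[(s, a)] for s in goal_states for a in actions))

Setting rho = 1 in the first helper recovers the "reach exactly one goal state" constraint shown above.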
5.4 State-Dependent Influence Achievement

Just like probabilities of fulfilling goals, agents' influences (as formalized in Section 4.3) are also directly related to the MDP LP's occupation measures. Let us begin by considering a state-dependent influence Γ_{π_i}(n̄), which consists of a set of parameters, each taking the form γ = Pr(n̄^{t+1} = n̂ | f̄^t = f̂), whose semantics are as follows: given that agents are in a state s^t ∈ D_γ^t ≡ {s^t ∈ S | f̄(s^t) = f̂} at time t, then the influencing agent's policy will cause the transition into a state s^{t+1} ∈ E_γ^{t+1} ≡ {s^{t+1} ∈ S | n̄(s^{t+1}) = n̂} at time t + 1, in which the prescribed effect has been achieved, with probability γ. Starting from this definition, we can rewrite the equation for γ as follows:

$$\gamma = \Pr\!\left(\bar{n}^{t+1} = \hat{n} \mid \bar{f}^{\,t} = \hat{f}\right) = \Pr\!\left(E_\gamma^{t+1} \mid D_\gamma^t\right) \quad \text{[given the semantics of influence]}$$
$$= \frac{\Pr\!\left(E_\gamma^{t+1}, D_\gamma^t\right)}{\Pr\!\left(D_\gamma^t\right)} \quad \text{[by definition of conditional probability]}$$
$$= \frac{\sum_{s^t \in D_\gamma^t} \sum_{s^{t+1} \in E_\gamma^{t+1}} \Pr\!\left(s^t, s^{t+1}\right)}{\sum_{s^t \in D_\gamma^t} \Pr\!\left(s^t\right)} \quad \text{[because agents cannot occupy multiple states simultaneously]}$$
$$= \frac{\sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} \sum_{s^{t+1} \in E_\gamma^{t+1}} \Pr\!\left(s^t, a^t, s^{t+1}\right)}{\sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} \Pr\!\left(s^t, a^t\right)} \quad \text{[by the law of total probability]}$$
$$= \frac{\sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} \sum_{s^{t+1} \in E_\gamma^{t+1}} \Pr\!\left(s^{t+1} \mid s^t, a^t\right) \Pr\!\left(s^t, a^t\right)}{\sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} \Pr\!\left(s^t, a^t\right)} \quad \text{[by definition of conditional probability]}$$
$$= \frac{\sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} \sum_{s^{t+1} \in E_\gamma^{t+1}} P\!\left(s^{t+1} \mid s^t, a^t\right) x\!\left(s^t, a^t\right)}{\sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} x\!\left(s^t, a^t\right)} \quad \text{[substituting agent } i\text{'s transition function } P(\cdot) \text{ and occupation measures } \mathbf{x}\text{]} \qquad (5.12)$$

Alternatively, a state-dependent influence may have no conditioning evidence, thereby expressing a prior probability Pr(n̄^0 = n̂ | ∅). In this case, γ depends only on the start state distribution α:

$$\gamma = \Pr\!\left(\bar{n}^0 = \hat{n}\right) = \Pr\!\left(E_\gamma^0\right) = \sum_{s^0 \in E_\gamma^0} \Pr\!\left(s^0\right) = \sum_{s^0 \in E_\gamma^0} \alpha\!\left(s^0\right) \quad \text{[substituting agent } i\text{'s start state probabilities } \alpha\text{]} \qquad (5.13)$$

By the above derivations, given any candidate policy π_i, agent i can compute its outgoing influence Γ_{π_i} by (1) evaluating policy π_i using the LP described in Section 5.2.2, thereby returning a vector of occupation measures x, and (2) evaluating the derived expressions in Equation 5.12 (or Eq. 5.13) for each γ ∈ Γ_{π_i}. The MDP LP thereby suffices as a method of state-dependent influence abstraction, translating from policies to influences.

The other direction of translation—from influence to policy—may be achieved by incorporating influence constraints, which I derive as follows, by turning Equation 5.12 on its head using algebraic manipulation:

$$\frac{\sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} \sum_{s^{t+1} \in E_\gamma^{t+1}} P\!\left(s^{t+1} \mid s^t, a^t\right) x\!\left(s^t, a^t\right)}{\sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} x\!\left(s^t, a^t\right)} = \gamma$$
$$\Leftrightarrow \;\; \sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} \sum_{s^{t+1} \in E_\gamma^{t+1}} P\!\left(s^{t+1} \mid s^t, a^t\right) x\!\left(s^t, a^t\right) \;-\; \gamma \sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} x\!\left(s^t, a^t\right) = 0$$
$$\Leftrightarrow \;\; \sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} x\!\left(s^t, a^t\right) \left[\left(\sum_{s^{t+1} \in E_\gamma^{t+1}} P\!\left(s^{t+1} \mid s^t, a^t\right)\right) - \gamma\right] = 0 \qquad (5.14)$$

Computing a policy that constrains agent i to fulfill an influence Γ_{π_i} is thereby achieved with the addition of constraints of the form derived in Equation 5.14, one for each γ ∈ Γ_{π_i}, to the standard MDP LP from Equation 5.1. Putting it all together:

$$\max_{\mathbf{x}} \; \sum_{s \in S} \sum_{a \in A} x(s,a)\, R(s,a)$$
$$\forall s^{t+1} \in S: \;\; \sum_{a^{t+1} \in A} x(s^{t+1}, a^{t+1}) - \sum_{s^t \in S} \sum_{a^t \in A} x(s^t, a^t)\, P\!\left(s^{t+1} \mid s^t, a^t\right) = \alpha(s^{t+1})$$
$$\forall \gamma \in \Gamma_i: \;\; \sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} x\!\left(s^t, a^t\right) \left[\left(\sum_{s^{t+1} \in E_\gamma^{t+1}} P\!\left(s^{t+1} \mid s^t, a^t\right)\right) - \gamma\right] = 0 \qquad (5.15)$$
$$\forall s \in S,\; \forall a \in A: \;\; x(s,a) \ge 0$$

The solution to the LP in Equation 5.15 corresponds to a policy π_i*(Γ_i) that maximizes agent i's local utility with respect to the candidate influence setting Γ_i. If the LP returns no solution, this means that there is no local policy that achieves influence setting Γ_i, in which case we say that Γ_i is infeasible.

5.4.1 History-Dependent Influence Achievement

Extending the state-dependent influence calculations and constraints from Equations 5.12-5.15 to the case of history-dependent influences is straightforward. The key is to incorporate the necessary history into the state of the best-response model. That is, for a history-dependent influence of the form γ = Pr(n̄^{t+1} = n̂ | f⃗^t = f̂⃗), simply model f⃗^t as a feature of the state at time t. From the influencing agent's standpoint, a history-dependent influence is effectively no different from a state-dependent influence.
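Both directions of translation can be sketched in code. The fragment below is illustrative only (assumed data structures, not the dissertation's implementation): influence_from_occupation evaluates Equation 5.12 from given occupation measures, and add_influence_constraint adds the linear constraint of Equations 5.14-5.15 to a PuLP model built as in the earlier sketches. D_gamma denotes the set of time-t states matching the influence's evidence and E_gamma the set of time-(t+1) states in which the nonlocal-feature effect holds.

# Sketch of Eq. 5.12 (policy -> influence) and Eq. 5.14/5.15 (influence -> policy).
import pulp

def influence_from_occupation(x_values, P, D_gamma, E_gamma, actions):
    # Evaluate Eq. 5.12 given numeric occupation measures x_values[(s, a)].
    numer = sum(x_values[(s, a)] * P.get((s, a), {}).get(sp, 0.0)
                for s in D_gamma for a in actions for sp in E_gamma)
    denom = sum(x_values[(s, a)] for s in D_gamma for a in actions)
    return numer / denom if denom > 0 else None   # undefined if D_gamma is unreachable

def add_influence_constraint(prob, x, P, D_gamma, E_gamma, actions, gamma):
    # Add the linearized constraint of Eq. 5.14 to the PuLP model `prob`.
    prob += pulp.lpSum(
        x[(s, a)] * (sum(P.get((s, a), {}).get(sp, 0.0) for sp in E_gamma) - gamma)
        for s in D_gamma for a in actions) == 0

If the resulting model is infeasible, the solver reports no solution, which is exactly the signal that the candidate influence setting is infeasible.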
5.5 Alternative Approaches to Constraining Influence

In the preceding sections, I have used linear programming to address the problem of computing an agent's policy that fulfills a desired influence (on its peers). In essence, my LP constraints enforce that the influencing agent achieve a requisite behavior. I now turn to an alternative approach, sometimes referred to as reward shaping, that others have used to enforce desired agent behaviors (e.g., Musliner et al., 2006; Varakantham et al., 2009; Williamson et al., 2009). Here, the (PO)MDP reward function is tweaked by adding rewards or penalties (i.e., negative rewards) to bias the agent towards desirable behavior or away from undesirable behavior. Upon setting these additional rewards, any standard (PO)MDP solver may be used to compute an optimal local policy, which in this case is optimal with respect to the manipulated reward function. Along these lines, reward shaping could be used to bias an agent to achieve a particular influence.

Figure 5.2: A simple, concrete example of influences modeled by 2 agents.

Example 5.1. The problem shown in Figure 5.2 depicts two agents with simplistic influences, each encoded as a single probability value (ρ). Assume that the objective is to compute a policy that enforces that agent 1 influence agent 2 by setting bit x with probability ρ_1 = 0.4. In this case, the reward shaping approach will then add extra reward value r_1 to states 0010 and 1010 (since these are the states that agent 1 enters upon setting bit x to 1). Analogously, a penalty p_1 ≤ 0 is added to the reward values of states 1100 and 1001, since these are the states at agent 1's time horizon for which arrival means that bit x has never been and will never be set. Notice that if r_1 > 10 or if p_1 < −10, action a_1 will be strictly preferred by agent 1. Running an MDP solver on this augmented MDP will invariably yield a_1 as the optimal action choice for agent 1 in state 0000. And so agent 1 will be able to satisfy its influence whilst maximizing its local utility.

The reward shaping methodology may be effective in some situations, but it is often difficult to set the reward and penalty values appropriately. Moreover, there may not be any values that correctly enforce the commitment.

Example 5.2. Returning to the same two-agent problem shown in Figure 5.2, consider the joint influence Γ = {ρ_1 = 0.4, ρ_2 = 0.4}, indicating that agent 1 will set x with probability 0.4 and y with probability 0.4. In this case, we will use a reward-penalty pair ⟨r_1, p_1⟩ to encourage the setting of bit x (and discourage transitions in which x is not set), and another reward-penalty pair ⟨r_2, p_2⟩ to encourage the setting of bit y. Towards selecting appropriate values for our rewards and penalties, let us express agent 1's local value of each of its three policies (which, in this case, correspond to actions a_1, a_2, and a_3), which is the expected sum of rewards received over the course of executing the policy, adding in the r's and p's where appropriate:

$$V_1(a_1) = r_1 + p_2 + 10$$
$$V_2(a_2) = 0.6(p_1 + p_2 + 5) + 0.4(r_1 + r_2 + 5) = 0.4 r_1 + 0.4 r_2 + 0.6 p_1 + 0.6 p_2 + 5 \qquad (5.16)$$
$$V_3(a_3) = r_2 + p_1 + 10$$

First, notice that a policy does exist which will satisfy the influence {ρ_1 = 0.4, ρ_2 = 0.4}. The deterministic local policy which does this is the one that prescribes action a_2 in state 0000. With probability 0.4, the agent will transition into 0010, satisfying ρ_1 = 0.4, and from there with certainty into 0011, satisfying ρ_2 = 0.4.
Furthermore, it turns out that the optimal joint policy dictates that agent 1 should select action a_2. However, as I prove below, we cannot compute this policy by adding extra rewards and penalties.

Theorem 5.3. There exists an MDP and a set of influences for which:
1. there exists a deterministic local policy achieving an influence Γ, and
2. using reward shaping along with standard deterministic MDP solution techniques, no tuple of the form ⟨r_1 ≥ 0, p_1 ≤ 0, r_2 ≥ 0, p_2 ≤ 0⟩ will yield a policy that adheres to Γ.

Proof. Consider the MDP in Figure 5.2 and the influence Γ = {ρ_1 = 0.4, ρ_2 = 0.4}. The only deterministic policy that satisfies this influence, and indeed the only one by which x is set with positive probability and y is set with positive probability, is the policy that selects action a_2. Thus, it suffices to prove that for no values of ⟨r_1 ≥ 0, p_1 ≤ 0, r_2 ≥ 0, p_2 ≤ 0⟩ is action a_2 preferred over action a_1 or action a_3. To begin with, let us derive preference relations by manipulating the value functions in Equation 5.16 (presented in Example 5.2):

$$a_1 \succ a_2 \iff r_1 + p_2 + 10 > 0.4 r_1 + 0.4 r_2 + 0.6 p_1 + 0.6 p_2 + 5 \quad \text{[by Equation 5.16]}$$
$$\iff 5 r_1 + 5 p_2 + 50 > 2 r_1 + 2 r_2 + 3 p_1 + 3 p_2 + 25 \quad \text{[by multiplying both sides by 5]}$$
$$\iff 3 r_1 - 3 p_1 + 25 > 2 r_2 - 2 p_2 \quad \text{[by subtracting } (2 r_1 + 3 p_1 + 5 p_2 + 25) \text{ from both sides]} \qquad (5.17)$$

$$a_3 \succ a_2 \iff r_2 + p_1 + 10 > 0.4 r_1 + 0.4 r_2 + 0.6 p_1 + 0.6 p_2 + 5 \quad \text{[by Equation 5.16]}$$
$$\iff 5 r_2 + 5 p_1 + 50 > 2 r_1 + 2 r_2 + 3 p_1 + 3 p_2 + 25 \quad \text{[by multiplying both sides by 5]}$$
$$\iff -2 r_1 + 2 p_1 + 25 > -3 r_2 + 3 p_2 \quad \text{[by subtracting } (2 r_1 + 3 p_1 + 5 r_2 + 25) \text{ from both sides]} \qquad (5.18)$$

Now, consider the following cases, which cover all possible combinations of values of ⟨r_1 ≥ 0, p_1 ≤ 0, r_2 ≥ 0, p_2 ≤ 0⟩:

Case 1: [−2 r_1 + 2 p_1 + 25 > −3 r_2 + 3 p_2].
⇒ a_3 ≻ a_2 [by Equation 5.18]

Case 2: [−2 r_1 + 2 p_1 + 25 ≤ −3 r_2 + 3 p_2].
⇒ 3 r_1 − 3 p_1 − 37.5 ≥ 4.5 r_2 − 4.5 p_2 [by multiplying both sides by −1.5]
⇒ 3 r_1 − 3 p_1 + 25 > 4.5 r_2 − 4.5 p_2 [because 25 > −37.5]
⇒ 3 r_1 − 3 p_1 + 25 > (4/9)(4.5 r_2 − 4.5 p_2) [by {r_2 ≥ 0, p_2 ≤ 0} ⇒ (4.5 r_2 − 4.5 p_2) ≥ 0]
⇒ 3 r_1 − 3 p_1 + 25 > 2 r_2 − 2 p_2
⇒ a_1 ≻ a_2 [by Equation 5.17]

Thus, for no combination of values of ⟨r_1 ≥ 0, p_1 ≤ 0, r_2 ≥ 0, p_2 ≤ 0⟩ is it the case that a_2 is preferred. Therefore, reward shaping cannot be used, in combination with deterministic policy formulation techniques, to enforce Γ. ∎

In Example 5.2, there are no perfect rewards and penalties that enable the computation of a policy for agent 1 that adheres to the optimal influence setting. For other problems, even if perfect values of r and p exist, it may be difficult to identify what they are. Semantically, the agent is forced to assign value to satisfying the probabilistic effect of the influence versus failing to satisfy it. This value is inherently tied to local policy values dictated by the MDP reward model. If r and p are too close to zero, a policy may be formulated that fails to achieve the desired nonlocal effect (or else achieves it with too small of a probability). But if r and p are too far from zero, then the agent may sacrifice some of its local quality so as to build a policy that achieves the nonlocal effect with a higher probability than desired.
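As a concrete, supplementary check of the case analysis above (my own illustration, not part of the dissertation), the short script below samples reward/penalty tuples satisfying r_1 ≥ 0, p_1 ≤ 0, r_2 ≥ 0, p_2 ≤ 0 and verifies that under the shaped values of Equation 5.16, action a_2 is never strictly best.

# Numeric spot-check of Theorem 5.3's case analysis on a sampled grid.
import itertools

def shaped_values(r1, p1, r2, p2):
    v_a1 = r1 + p2 + 10
    v_a2 = 0.4 * r1 + 0.4 * r2 + 0.6 * p1 + 0.6 * p2 + 5   # Eq. 5.16
    v_a3 = r2 + p1 + 10
    return v_a1, v_a2, v_a3

grid = [0, 1, 5, 9.9, 10, 25, 100, 1000]
for r1, r2 in itertools.product(grid, grid):
    for p1, p2 in itertools.product([-g for g in grid], [-g for g in grid]):
        v1, v2, v3 = shaped_values(r1, p1, r2, p2)
        assert v2 < max(v1, v3), (r1, p1, r2, p2)   # a2 never strictly preferred
print("a2 was never strictly preferred on the sampled grid")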
The primary advantage of the LP approach presented in Section 5.4 is its ability to construct policies that capture influence probabilities perfectly while still maintaining optimality. It is possible that a desired influence setting is infeasible for the influencing agent, meaning that no deterministic policy achieves the influence. In this case the LP solver will return "no solution" and the agent will know immediately that this influence point should not be considered. Otherwise, the returned policy is guaranteed to satisfy the influence. Reward shaping, on the other hand, will return a policy regardless of whether or not a desired influence setting is feasible. Post-processing of the policy (e.g., using Equation 5.12) is then needed to determine whether or not the achieved influence setting matches the desired influence setting.

Due to the difficulties of setting r and p and the lack of optimality guarantees, I do not consider reward shaping further in this dissertation. Instead, my solution approach utilizes constrained linear programming to enumerate influences and to compute optimal local policies around those influences. However, reward shaping has several distinct advantages in other algorithmic contexts. For instance, reward shaping inherently strives for a balance in the costs of nonlocal effects and their anticipated advantages to other agents, making it useful for rapidly converging on approximate joint policies (Musliner et al., 2006; Varakantham et al., 2009). Moreover, unlike constrained linear programming, reward shaping has the added flexibility of accommodating any (PO)MDP solver.

5.6 Exploring the Space of Feasible Influences

In addition to constraining policies to achieve desired influence settings, agents can employ the same methodology to generate the set of all feasible influence settings. That is, for a given influence parameter γ, the influencing agent can enumerate all feasible values {γ̂} of the parameter achievable by any deterministic policy. It can do so by solving a series of MILPs, each of which looks for a deterministic policy that constrains γ to take on a value that has not previously been considered. Let the influencing agent iteratively check for the existence of a feasible probability value within an input interval γ̂_min < γ̂ < γ̂_max by running an MILP solver on the following program, adapted from Equation 5.15:

$$\max_{\mathbf{x}} \; (\mathbf{x} \cdot \mathbf{0})$$
...the usual constraints for computing deterministic policies (Eq. 5.4)...
$$\forall s^{t+1} \in S: \;\; \sum_{a^{t+1} \in A} x(s^{t+1}, a^{t+1}) - \sum_{s^t \in S} \sum_{a^t \in A} x(s^t, a^t)\, P\!\left(s^{t+1} \mid s^t, a^t\right) = \alpha(s^{t+1})$$
$$\forall s \in S,\; \forall a \in A: \;\; -1 \le (x(s,a) - z(s,a)) \le 0$$
$$\forall s \in S: \;\; \sum_{a \in A} z(s,a) = 1$$
$$\forall s \in S,\; \forall a \in A: \;\; x(s,a) \ge 0$$
$$\forall s \in S,\; \forall a \in A: \;\; z(s,a) \in \{0,1\}$$
...influence parameter setting γ must be greater than γ̂_min...
$$\sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} x\!\left(s^t, a^t\right) \left[\left(\sum_{s^{t+1} \in E_\gamma^{t+1}} P\!\left(s^{t+1} \mid s^t, a^t\right)\right) - \hat{\gamma}_{min}\right] > 0 \qquad (5.19)$$
...influence parameter setting γ must be less than γ̂_max...
$$\sum_{s^t \in D_\gamma^t} \sum_{a^t \in A} x\!\left(s^t, a^t\right) \left[\left(\sum_{s^{t+1} \in E_\gamma^{t+1}} P\!\left(s^{t+1} \mid s^t, a^t\right)\right) - \hat{\gamma}_{max}\right] < 0$$
...all other influence parameters γ′ must be as prescribed...
$$\forall \gamma' \in \Gamma_i \,|\, \mathrm{prescribed}(\gamma'): \;\; \sum_{s^t \in D_{\gamma'}^t} \sum_{a^t \in A} x\!\left(s^t, a^t\right) \left[\left(\sum_{s^{t+1} \in E_{\gamma'}^{t+1}} P\!\left(s^{t+1} \mid s^t, a^t\right)\right) - \hat{\gamma}'\right] = 0$$

Equation 5.19 includes two constraints that enforce an upper and lower bound on the setting of parameter γ (in addition to the constraints from Eq. 5.15 that constrain any other prescribed influence γ′ = γ̂′, as well as the deterministic policy constraints from Equation 5.4). Deterministic policy constraints are required so that the agent does not cycle through an infinite set of nondeterministic policies, the elements of which may exert influences whose settings are arbitrarily close to one another. The agent begins by checking the interval (γ̂_min = −∞, γ̂_max = ∞).
If the LP from Equation 5.19 returns a solution, the agent has simultaneously found a new influence γ = γ̂_0 (which may be computed using Equation 5.12) and computed a policy that exerts that influence, subsequently uncovering two new intervals {(γ̂_min, γ̂_0), (γ̂_0, γ̂_max)} to explore. Alternatively, if the LP returns "no solution" for a particular interval, there is no feasible influence within that range. By divide and conquer, the agent can uncover each feasible setting of γ, stopping only after all subintervals have been explored.

Algorithm 5.1 Feasible Influence Enumeration: Single Parameter
EnumerateFeasibleSettingsForParam(γ, POMDP_i, Γ_i^prescribed)
...Initialize interval queue and settings list...
1: intervalQ ← ∅
2: Push(intervalQ, (−0.1, 1.1))
3: feasibleSettingsList ← ∅
...Explore sub-intervals...
4: while not IsEmpty(intervalQ) do
5:   (γ̂_min, γ̂_max) ← Pop(intervalQ)
6:   {x, feasible} ← SolveIntervalLP(POMDP_i, Γ_i^prescribed, (γ̂_min, γ̂_max))   ⊲ [Eq. 5.19]
7:   if feasible then
8:     γ̂_new ← ComputeInfluenceSetting(x, γ)   ⊲ [Eq. 5.12]
9:     Add(feasibleSettingsList, γ̂_new)
10:    Push(intervalQ, (γ̂_min, γ̂_new))
11:    Push(intervalQ, (γ̂_new, γ̂_max))
12:  end if
13: end while
14: return feasibleSettingsList

Operating as such, Algorithm 5.1 performs enumeration of all feasible settings of an individual influence parameter γ influenced by agent i (whose local model is denoted POMDP_i, and whose already-prescribed influences are denoted Γ_i^prescribed).

Moreover, an agent i can enumerate all the feasible combinations of its outgoing influence parameter settings. The agent does so by constructing a tree, wherein at each level of the tree, it enumerates the feasible settings of a different parameter γ. As long as the agent orders the parameters consistently with the partial order of the influence DBN, it can reason about one parameter after another, at each node branching for all of the feasible settings of a parameter given the constraint that it achieves the settings of the parameters of its tree ancestors. The leaves of the tree thereby correspond to all feasible combinations of agent i's outgoing influences. Algorithm 5.2 shows the pseudo-code for i's feasible influence generation, which invokes Algorithm 5.1 at each level of the tree (line 9).

Algorithm 5.2 Feasible Influence Enumeration: All Outgoing Influence Parameters
GenerateFeasibleInfluences(POMDP_i)
...Start with unprescribed influence parameters...
1: Γ_i ← InitializeOutgoingInfluenceParameters(i)
2: return EnumerateSettingsOfRemainingParams(POMDP_i, Γ_i)

EnumerateSettingsOfRemainingParams(POMDP_i, Γ_i)
...Initialize...
1: settings ← ∅
2: Γ_i^prescribed ← GetPrescribedSettings(Γ_i)
3: γ ← FirstRemainingParameter(Γ_i)
...If all outgoing influence parameters are set, evaluate and return...
4: if γ = NIL then
5:   ⟨localVal, π_i⟩ ← Evaluate(POMDP_i, Γ_i)
6:   Add(settings, ⟨localVal, Γ_i⟩)
7:   return settings
8: end if
...Enumerate settings of the first unprescribed parameter...
9: {γ̂} ← EnumerateFeasibleSettingsForParam(γ, POMDP_i, Γ_i^prescribed)
...Incorporate settings of remaining parameters...
10: for each γ̂ ∈ {γ̂} do
11:   Γ_i^copy ← CopyAndPrescribe(Γ_i, γ, γ̂)
12:   settings_γ̂ ← EnumerateSettingsOfRemainingParams(POMDP_i, Γ_i^copy)
13:   AddAll(settings, settings_γ̂)
14: end for
15: return settings
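The interval-splitting loop at the heart of Algorithm 5.1 can be sketched compactly in code. In the fragment below (an illustration, not the dissertation's implementation), the helpers solve_interval_milp and compute_setting are hypothetical stand-ins for the MILP of Equation 5.19 (e.g., built with PuLP as in the earlier sketches) and for the evaluation of Equation 5.12, respectively.

# Sketch of Algorithm 5.1's interval-splitting enumeration loop.
from collections import deque

def enumerate_feasible_settings(gamma, pomdp_i, prescribed,
                                solve_interval_milp, compute_setting):
    intervals = deque([(-0.1, 1.1)])          # open interval covering all probabilities
    feasible_settings = []
    while intervals:
        lo, hi = intervals.popleft()
        x, feasible = solve_interval_milp(pomdp_i, prescribed, gamma, lo, hi)  # Eq. 5.19
        if feasible:
            new_setting = compute_setting(x, gamma)                            # Eq. 5.12
            feasible_settings.append(new_setting)
            intervals.append((lo, new_setting))   # explore both resulting subintervals
            intervals.append((new_setting, hi))
    return feasible_settings

Because each solved MILP yields a setting strictly inside its interval, and there are finitely many deterministic policies, the loop terminates once every subinterval has been declared infeasible.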
Thus, agent 1 can enumerate all of its feasible outgoing influence settings as follows. First, agent 1 finds each feasible value of ρ1 using Algorithm 5.1, and for each setting ρ1 = ρ̂1 that it finds, agent 1 enumerates the settings of ρ2 that are feasible in combination with ρ̂1.

Consider that, as an alternative to my LP-guided exploration of the space of an agent's feasible influence points, the agent could instead iterate through all of its possible local policies and manually partition its local policy space into classes with equivalent influence. In this case, the greater the number of policies, the greater the computation required to enumerate the feasible influences (regardless of the number of influences). The advantage of the LP-guided enumeration is that it does not directly depend on the number of local policies. Given the structure and character of the feasible influence tree, the number of nodes can be no greater than the product of the number of feasible influence points (i.e., the number of leaves) and the number of parameters (i.e., the depth). Hence, the number of LPs required to compute the feasible influence tree for influence Γ is O(numberOfParameters(Γ) · numberOfFeasibleSettingsOf(Γ)), irrespective of the size of the policy space.

5.7 Summary

The primary contribution of this chapter is a principled approach for constraining agents' policies to adhere to proposed influences. This approach is made possible by drawing a conceptual connection between agents' influence parameters and the transition probabilities implied by occupation measures in the MDP LP. Through the formalization of this connection, I have derived (1) a mapping from the agent's policy to its implied influences on peers and (2) an extension of the MDP LP for computing an influence-constrained policy. In contrast to alternative approaches that use reward shaping, which encourage the enforcement of various forms of influence by biasing the MDP model, my approach strictly enforces agents' influences without the need for parameter tuning or model manipulation. Moreover, by constraining the policy directly, my approach is guaranteed to compute a local policy that is optimal with respect to the prescribed influences if such a policy exists; if not, the LP will determine that the influence is infeasible and return "no solution". The same cannot be said for reward shaping. While I am not the first to constrain agents' policies with additional MDP LP constraints (Dolgov & Durfee, 2006; Wu & Durfee, 2010), my approach is the first to formulate constraints pertaining to transition-dependent agents' interactions.

Practically, and in the broader scheme of my optimal solution methodology, I have also developed a novel extension to the influence-constrained linear programming methodology for generating feasible influences. By solving a series of such MILPs, an agent can enumerate its entire set of feasible influences. The significance of this algorithm is that it avoids explicit enumeration of all of the agent's local policies. Instead, its computational complexity is dictated by the size of the influence encoding and the size of the feasible influence space.

CHAPTER 6

Optimal Influence-space Search

In relation to the preceding chapters, which provided techniques for modeling individual and joint behavior (Ch. 3), abstracting influences (Ch. 4), and computing influence-constrained local policies (Ch.
5), I now integrate these components into a partially-decentralized algorithm for computing optimal solutions to TD-POMDP problems. My algorithm, Optimal Influence-space Search (OIS), is motivated by the intuition (Sec. 4.5) and empirical evidence (Sec. 4.6) that the space of feasible influence points is potentially significantly smaller than the space of joint policies. Using OIS, agents reason jointly about abstract influences and individually about their detailed local policies. OIS ultimately returns the optimal influence, which is the influence point corresponding to the optimal joint policy.

OIS gains traction and scalability over existing algorithms by leveraging weakly-coupled problem structure. Inherently, OIS takes advantage of agents' low degree of influence by searching over the space of influence points (a concept which was introduced in Section 4.5). By decoupling the optimal joint policy formulation into a well-ordered series of influence generations and evaluations, OIS is also able to exploit agents' locality of interaction. In particular, for weakly-coupled problems with a small fixed agent scope (Def. 3.30), OIS scales well beyond the state of the art,¹ a claim I defend with empirical results in Section 6.6.5.

Footnote 1: I refer to the state of the art as algorithms (whose results are published) for computing optimal solutions to commonly-studied flavors of transition-dependent Dec-POMDP problems.

6.1 Overview

Before developing the mechanics of influence-space search, I first prove that it yields optimal solutions in Section 6.2. Over the course of the remainder of the chapter, I gradually unveil my algorithm, OIS, for searching through the influence space. I begin, in Section 6.3, by presenting OIS in its simplest form—a depth-first search that follows the natural ordering of an acyclic interaction digraph. Next, in Section 6.4, I describe how to adapt the search process to accommodate graphs with directed cycles. Next, I present an empirical comparison with four other optimal solution algorithms in Section 6.5, focusing on 2-agent problems for which all algorithms are tractable. My analysis serves to assess the degree to which OIS gains a computational advantage through its exploitation of weakly-coupled structure, and also, continuing where my earlier experiments left off, to characterize the problems for which OIS is advantageous as well as those for which it is disadvantageous. I conclude this empirical analysis with a discussion in Section 6.5.7, where I relate my results back to my original claims regarding the efficacy of influence-based abstraction in computing optimal solutions efficiently by exploiting weakly-coupled problem structure. Next, in Section 6.6.3, I develop a substantial enhancement to optimal influence-space search for exploiting reduced agent scope, and provide empirical results illustrative of the significant advances in agent scalability that are made possible by the complementary exploitation of degree of influence along with agent scope size.

6.2 Correctness of Optimal Influence-space Search

Let us address the claim that agents can compute the optimal joint policy by exhaustively enumerating and evaluating the space of feasible influence points. Here, I denote an influence point as Γ, referring to agents' collective influences (manifested in the form of the influence DBN described in Section 4.3.5, whose conditional probabilities are fully specified).
Although discussions in previous chapters treated each agent as either influencing or influenced, this dichotomy does not generalize to problems with more than two agents. Consider an agent who is influenced by its peers but also influences its peers. Such an agent needs to reason about incoming influences exerted by its peers in addition to outgoing influences that it exerts. Consequently, both of these influence types are contained in the influence DBN. The following axioms follow directly from the treatment of influence presented in Section 4.3:

(A1) Every joint policy π maps to some influence point Γ whose conditional probabilities reflect all of the agents' nonlocal features' transition probabilities resulting from the agents adopting π. I will denote this mapping as π ↦ Γ, and denote the set of feasible influence points as {Γ | ∃π, π ↦ Γ}.

(A2) Given an influence point Γ and agent i's local policy π_i, agent i can compute its local value with respect to Γ, which I will denote V_i(π_i, Γ), by evaluating π_i in the context of a best-response model injected with Γ's encoded transition probabilities for i's nonlocal features, as long as π_i is consistent with Γ's encoded transition probabilities of i's locally-controlled features.²

(A3) Given an influence point Γ, each agent i can compute a best response to Γ, which I will denote π_i*(Γ), by taking the policy whose local value is greatest (arg max_{π_i} V_i(π_i, Γ)) subject to the constraint that π_i*(Γ) is consistent with Γ's encoded transition probabilities of i's locally-controlled features.

Footnote 2: The consistency of π_i with respect to Γ may be checked using, for instance, the MILP methodology that I have presented in Chapter 5. Specifically, π_i is consistent if the MILP from Equation 5.1, with additional constraints specified by Equations 5.6 and 5.12, returns a solution.

Note that, in contrast to the conventional notion of a best response, here a best response must account for the agent's outgoing influences (corresponding to transitions of its locally-controlled features) as well as its incoming influences (corresponding to the transitions of its nonlocal features). In the context of influence-space search, the agent's best response maximizes its utility conditioned on the influences exerted by its peers, subject to its promised influences on its peers. In this sense, the agent considers the downstream effects of its behavior on others in addition to its own local value. As I prove below, by iterating through these more sophisticated best responses, the team of agents can maximize their joint value.

Definition 6.1. The optimal joint policy with respect to influence point Γ, denoted³ π^{*|Γ} = ⟨π_1^{*|Γ}, ..., π_n^{*|Γ}⟩, is the highest-valued joint policy that maps to Γ:

    π^{*|Γ} = arg max_{π ∈ Π | π ↦ Γ} V(π)

Definition 6.2. The value of an influence point Γ, denoted V(Γ), is the value of the optimal joint policy with respect to Γ:

    V(Γ) = V(π^{*|Γ}) = max_{π | π ↦ Γ} V(π)

Footnote 3: Note the purposeful difference in notation between agent i's best response π_i*(Γ) and the ith component π_i^{*|Γ} of the optimal joint policy with respect to Γ, which are, in principle, two different policies. Later, in Theorem 6.4, I prove that the two must be equally valued.

Theorem 6.3. An optimal joint policy π* maps to an influence point Γ* whose value is the greatest of any feasible influence: π* ↦ Γ* = arg max_{Γ | ∃π, π↦Γ} V(Γ).

Proof. By Axiom A1 above, π* maps to some influence point Γ*. There can be no other feasible influence Γ′ ≠ Γ* such that V(Γ′) > V(Γ*).
If there were, this would imply the existence of another joint policy π′ such that π′ ↦ Γ′ and V(π′) > V(π*), contradicting the premise that π* is the optimal joint policy.

Theorem 6.4. The value of any influence point Γ is equal to the summation of the local values of all agents' best responses: V(Γ) = Σ_{i∈N} V_i(π_i*(Γ), Γ), where π_i*(Γ) is agent i's best response to Γ (using the notation of the above axioms).

Proof. Let us assume that this theorem is false: that is, V(Γ) ≠ Σ_{i∈N} V_i(π_i*(Γ), Γ).

Case 1: V(Γ) < Σ_{i∈N} V_i(π_i*(Γ), Γ)
  ⇒ V(π^{*|Γ}) < Σ_{i∈N} V_i(π_i*(Γ), Γ)      [by definition of influence value (Def. 6.2)]
  ⇒ V(π^{*|Γ}) < V(⟨π_1*(Γ), ..., π_n*(Γ)⟩)      [by Theorem 3.8]
  But π^{*|Γ} = arg max_{π ∈ Π | π ↦ Γ} V(π)      [by Definition 6.1]
  ⇒ Contradiction.

Case 2: V(Γ) > Σ_{i∈N} V_i(π_i*(Γ), Γ)
  ⇒ V(π^{*|Γ}) > Σ_{i∈N} V_i(π_i*(Γ), Γ)      [by Definition 6.2]
  ⇒ Σ_{i∈N} V_i(π^{*|Γ}) > Σ_{i∈N} V_i(π_i*(Γ), Γ)      [by Theorem 3.8]
  ⇒ Σ_{i∈N} V_i(π_i^{*|Γ}, Γ) > Σ_{i∈N} V_i(π_i*(Γ), Γ)      [by Axiom A2 above]
  ⇒ ∃i s.t. V_i(π_i^{*|Γ}, Γ) > V_i(π_i*(Γ), Γ)      [by arithmetic]
  But ∀i, π_i*(Γ) = arg max_{π_i} V_i(π_i, Γ)      [by Axiom A3 above]
  ⇒ Contradiction.

Since both cases yield contradictions, V(Γ) = Σ_{i∈N} V_i(π_i*(Γ), Γ).

Corollary 6.5. Agents can compute an optimal joint policy π* by:
1. exhaustively enumerating all feasible influence points {Γ | ∃π ∈ Π, π ↦ Γ},
2. evaluating each influence point by individually maximizing the agents' local utilities with respect to the influence point and summing V(Γ) = Σ_{i∈N} V_i(π_i*(Γ), Γ),
3. selecting the highest-valued influence point Γ* = arg max_Γ V(Γ), and
4. individually computing best responses π_i*(Γ*) = arg max_{π_i} V_i(π_i, Γ*) to Γ*.

Proof. By Axiom A1, the agents will generate a set of influences in step 1 that includes the optimal influence point Γ* mapped from the optimal joint policy π*. By Theorem 6.4, in step 2, the agents will correctly evaluate each feasible influence, including Γ*. By Theorem 6.3, the agents will correctly determine that Γ* is the optimal influence in step 3. Lastly, in step 4, the agents will recover the optimal joint policy by computing best responses to Γ* (by Axiom A3 and Definition 6.1).

At a high level, Corollary 6.5 precisely describes the structure of optimal influence-space search. It also proves that this exhaustive search methodology does indeed return optimal solutions, and validates the intuition that agents can still behave optimally if they jointly reason at the abstract influence level. The search process involves largely-decentralized computation. Each agent individually computes its own local value of each influence point, and each agent individually computes its local policy with respect to the optimal influence point. In the next section, we will see that the generation of influences is similarly decomposable into individual agent computation.

6.3 Depth-First Search

Influence-space search boils down to generating the feasible space of (combined) influence points, where each is a setting of all agents' influences and fully specifies the influence DBN, and selecting the one that maximizes the sum of agents' best-response utilities. From Section 5.6, we already have a methodology for generating a single agent's outgoing influence settings. The challenge is composing agents' individual generations and evaluations in such a way as to efficiently search the space of combined influence settings; a minimal sketch of the flat enumerate-evaluate-select procedure of Corollary 6.5 appears below, before I turn to a more structured composition.
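The following Python sketch spells out the flat skeleton of Corollary 6.5. The names optimal_influence_point, enumerate_feasible_influences, best_response_value, and the toy influence points and values are hypothetical placeholders for the machinery of Chapters 4 and 5, not part of the dissertation's implementation.

def optimal_influence_point(agents, enumerate_feasible_influences,
                            best_response_value):
    """Flat version of the procedure in Corollary 6.5 (illustrative only).

    `enumerate_feasible_influences()` is assumed to return the feasible
    (combined) influence points {Gamma}; `best_response_value(i, Gamma)`
    is assumed to return V_i(pi_i*(Gamma), Gamma), agent i's best-response
    value subject to consistency with Gamma."""
    best_value, best_point = float("-inf"), None
    for gamma in enumerate_feasible_influences():                    # step 1
        value = sum(best_response_value(i, gamma) for i in agents)   # step 2
        if value > best_value:                                       # step 3
            best_value, best_point = value, gamma
    return best_point, best_value
    # step 4 (not shown): each agent recomputes its best response to best_point

# Toy usage with two agents and three hypothetical influence points.
agents = ["agent1", "agent2"]
toy_points = [{"rho": 0.0}, {"rho": 0.5}, {"rho": 1.0}]
toy_values = {("agent1", 0.0): 4, ("agent1", 0.5): 3, ("agent1", 1.0): 1,
              ("agent2", 0.0): 0, ("agent2", 0.5): 5, ("agent2", 1.0): 6}

best, val = optimal_influence_point(
    agents,
    enumerate_feasible_influences=lambda: toy_points,
    best_response_value=lambda i, g: toy_values[(i, g["rho"])])
print(best, val)   # -> {'rho': 0.5} 8

The point of the sketch is only the shape of the computation; the tree-structured composition developed next replaces the flat loop with a recursion over agents.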
In this section, I develop one relatively simple composition that forms the basis for the more advanced search methods presented in Sections 6.4 and 6.6 (as well as the approximate search methods presented in Chapter 7). The simplest way to compose agents' individual feasible influence generations is to construct a search tree wherein each node represents a partially-specified setting of agents' influences. The root node of the tree represents a completely unspecified influence DBN. At the next level down, each node is assigned a particular setting for just one agent's outgoing influences. At the next level, each node is assigned a particular setting for each of two agents' outgoing influences, and so on all the way down to the leaf level, where leaf nodes are assigned a complete setting of all agents' influences. In essence, we have divided the parameters of the influence DBN according to which agent controls each, placing one agent's influence generation at each level of the tree.

[Figure 6.1 depicts, for a three-agent chain in which agent 1 influences agent 2 through n2 and agent 2 influences agent 3 through n3, the interaction digraph, the corresponding influence DBN, and one root-to-leaf path through the search tree: agent 1's outgoing influence setting for Pr(n2 | ...) at the root level, agent 2's setting for Pr(n3 | ...) at the next level, and agent 3 (with no outgoing influences) at the leaf level, with agents' optimal local utilities for the received influence settings passed back up.]
Figure 6.1: One path through the influence search tree.

Depth-first OIS searches the space of feasible influence settings one by one, for each traversing a path from root to leaf. As shown in Figure 6.1, a path consists of a combination of agents' outgoing influence settings, each of which I will refer to as an outgoing influence point. The optimal (combined) influence point corresponds to the path leading through the optimal combination of agents' outgoing influence points.

6.3.1 Structure of Search Tree

In the event that the interaction digraph contains no directed cycles,⁴ we can define a natural ordering over the agents' generation problems, thereby placing one agent at each level of the search tree. Figure 6.1 shows a simple 3-agent example problem with an acyclic interaction digraph, the corresponding encoding of the influence DBN, and the resulting structure of the search tree. At the root of the search tree, influences are considered that are independent of all other influences. And at lower depths, feasible influence settings are generated by incorporating any higher-up influence settings on which they depend. This property is guaranteed for any total ordering of agents that maintains the partial order of the acyclic interaction digraph. Given one such ordering, the agents can use the pseudo-code presented in Algorithm 6.1 to perform a depth-first search of the influence space.

Footnote 4: I relax this restriction later in Section 6.4.

Algorithm 6.1 Depth-First Optimal Influence-Space Search
DF-OIS(i, ordering, DBN)
  ...At leaves, simply compute best response...
   1: if i = LastAgent(ordering) then
   2:   POMDP_i ← BuildBestResponseModel(DBN)
   3:   ⟨localVal, π_i⟩ ← Evaluate(POMDP_i)
   4:   return ⟨localVal, DBN⟩
   5: end if
   6: nextAgent ← NextAgent(i, ordering)
  ...Enumerate feasible outgoing influence points...
   7: POMDP_i ← BuildBestResponseModel(DBN)
   8: I ← GenerateFeasibleInfluences(POMDP_i)
  ...Branch for each outgoing influence point...
   9: bestJointVal ← −∞
  10: bestDBN ← nil
  11: for each ⟨influence_i, localVal⟩ ∈ I do
  12:   DBN_i ← CopyAndAssign(DBN, influence_i)
  ...Pass influence settings down...
  13:   ⟨descendantsVals, DBN_child⟩ ← DF-OIS(nextAgent, ordering, DBN_i)
  14:   jointVal ← localVal + descendantsVals
  15:   if jointVal > bestJointVal then
  16:     bestJointVal ← jointVal
  17:     bestDBN ← DBN_child
  18:   end if
  19: end for
  ...Pass highest-valued influence up...
  20: return ⟨bestJointVal, bestDBN⟩

The search begins with the call DF-OIS(root, ordering, nil), prompting the root agent to build its (independent) local POMDP (line 7) and to generate all of the feasible settings of its outgoing influence parameters (line 8), each in the form of a partially-specified DBN. The root agent creates a branch for each feasible outgoing influence setting, as computed using Algorithm 5.2 from Section 5.6, passing down the setting (in line 13). Each branching operation is a recursive call to DF-OIS that prompts the next agent to construct a local POMDP in response to its ancestors' influence settings (using the best response model I presented in Chapter 4), compute its feasible influences, and pass those on to the next agent. At the root of the tree, the influence DBN starts out as completely unspecified and is gradually filled in as it travels down the tree, at each subsequent level accumulating another agent's influence settings. The agent at the leaf level of the tree does not influence others, so it simply computes a best response to all of the influence settings of its ancestors, passing up its best-response utility value for each (lines 2–4). At each intermediate node, the respective agent evaluates each of its outgoing influence settings that it passed down by taking the sum⁵ of the combined utility value passed up from its descendants (denoted descendantsVals) and its local utility value (line 14). In this manner, from the leaves to the root, the best outgoing influence setting is selected (lines 15–18) at each level of the tree, accounting for both local cost (or reward) as well as descendant reward (or cost). When the search completes, the result is an optimal influence-space point: a DBN that encodes the feasible influence settings that achieve the optimal team value. As a post-processing step (not shown in Algorithm 6.1), the agents compute their optimal joint policy by each computing a best response to the optimal influence DBN returned by the search.

6.3.2 Enumerating Feasible Influences

At each non-leaf node of the tree, the respective agent calls GenerateFeasibleInfluences() to generate its feasible outgoing influence points. One such influence generation scheme is presented in Section 5.6, comprising a series of mixed-integer linear programs that the agent uses to find each feasible combination of settings of its outgoing influence parameters. Recall that the agent's outgoing influence parameters specify the probabilistic transitions of the agent's locally-controlled features that affect other agents. Interestingly, the MILP-driven generation of an individual agent's influence settings takes the same form as OIS's generation of combinations of agents' feasible influences. Just like OIS's generation, the MILP-driven generation of outgoing influences (Algorithm 5.2) constructs a tree, in this case placing an individual influence parameter at each level of the tree. As such, the branches in OIS's DFS tree comprise the leaves of each agent's MILP-driven search tree. The operation of OIS does not depend upon my constrained linear programming methodology, however; the sketch below makes this pluggability concrete.
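The following is a condensed Python sketch of the recursion in Algorithm 6.1, with every model-specific operation hidden behind a hypothetical `models` object whose methods (build_best_response_model, generate_feasible_influences, evaluate, copy_and_assign) are illustrative stand-ins for the machinery of Chapters 4 and 5. The influence generator is just a callable with a fixed interface, which is exactly the point made here.

def df_ois(agent_index, ordering, dbn, models):
    """Hedged sketch of DF-OIS: returns (best joint value, completed DBN)."""
    agent = ordering[agent_index]
    pomdp = models.build_best_response_model(agent, dbn)
    if agent_index == len(ordering) - 1:            # leaf agent: best response only
        local_val, _policy = models.evaluate(pomdp)
        return local_val, dbn
    best_val, best_dbn = float("-inf"), None
    for influence, local_val in models.generate_feasible_influences(pomdp):
        child_dbn = models.copy_and_assign(dbn, agent, influence)
        descendants_val, complete_dbn = df_ois(agent_index + 1, ordering,
                                               child_dbn, models)
        if local_val + descendants_val > best_val:  # keep the highest-valued branch
            best_val = local_val + descendants_val
            best_dbn = complete_dbn
    return best_val, best_dbn

Because generate_feasible_influences is only assumed to return pairs of (outgoing influence setting, local value), any generation scheme that honors that interface could be swapped in without touching the rest of the recursion.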
Any alternative generation scheme would suffice. For instance, the agent could simply enumerate all of its local policies, for each computing the implied settings of conditional probabilities that the influence DBN requires, and then manually partition the agent's local policy space into impact-equivalence classes (Def. 3.45) whose influence parameter settings are equivalent.

Footnote 5: Algorithm 6.1, as presented, relies on the property that the joint utility is a summation of local utilities (Thm. 3.8). In recent work (Witwicki & Durfee, 2010), I have published a slightly more general version of DF-OIS that accommodates arbitrary monotonic value composition functions. It does so by passing values down the tree as well as up the tree, so that at every intermediate node, the full joint value is straightforwardly evaluable through invocation of the composition function.

6.3.3 Incorporating Ancestors' Influences

At each node below the root, with a call to BuildBestResponseModel(), agent i incorporates the outgoing influences of its ancestors that have been communicated with the DBN object passed into DF-OIS(). However, just as an ancestor agent's outgoing influences make up a subset (and not necessarily the whole) of the DBN parameters, agent i's incoming influence parameters sufficient for its best response reasoning also make up a subset (and not necessarily the whole) of the DBN parameters. In other words, some of the information contained within the communicated DBN is inessential (and unusable) to agent i. In this case, agent i may use marginalization to remove any unneeded variables (specifically, those that it cannot observe) from the conditional probabilities represented by the DBN parameters.

Example 6.6. For instance, consider the interaction digraph and corresponding influence DBN shown in Figure 6.2. Here, agent 7 models two nonlocal features, one (n7a) influenced by agent 1 and the other (n7b) influenced by agent 6. Additionally, agent 6 models a nonlocal feature n6 influenced by agent 1. The undirected digraph cycle between agents 1, 6, and 7 implies a conditional dependence relationship between n7a and n7b by way of n6. Consequently, agent 1 encodes the history-dependent distribution Γ^{π1}(n7a, n6) = Pr(n7a^{t+1}, n6^{t+1} | ⃗n7a^t, ⃗n6^t), generating settings for all of the respective DBN parameters as it enumerates its feasible influence points. Agent 6 encodes a history-dependent distribution Γ^{π6}(n7b) = Pr(n7b^{t+1} | ⃗n6^t, ⃗n7b^t) that is conditioned on the history of feature n6. Altogether, the two agents' influences make up an influence DBN that connects the variables associated with all three nonlocal features (as illustrated in Figure 6.2). The influence DBN that agent 1 passes down the tree to agent 6 contains parameters of the form Pr(n7a^{t+1}, n6^{t+1} | ⃗n7a^t, ⃗n6^t). However, note that according to the TD-POMDP local state for this problem, agent 6 does not model nor does it observe feature n7a. The only information that agent 6 needs (to reason optimally about its own behavior and its own outgoing influence) is Pr(n6^{t+1} | ⃗n6^t). By Theorem 4.18, in constructing its best response model, agent 6 can safely marginalize out ⃗n7a^{t+1} from the joint distribution. Similarly, agent 7 can marginalize out {n6^0, ..., n6^T} from the complete DBN, leaving only parameters of the form
Pr(n7a^{t+1}, n7b^{t+1} | ⃗n7a^t, ⃗n7b^t).

[Figure 6.2 shows the interaction digraph in which agent 1 influences agent 6 (through n6) and agent 7 (through n7a), and agent 6 influences agent 7 (through n7b), together with the corresponding influence DBN containing the parameters Pr(n7a^{t+1}, n6^{t+1} | n7a^t, n6^t) and Pr(n7b^{t+1} | n6^t, n7b^t).]
Figure 6.2: Example of marginalization of unneeded DBN parameters by Agent 7.

Upon extracting the necessary conditional probabilities from the influence DBN, agent i injects these into the transition model of its best-response POMDP (developed in Section 4.2). Agent i uses this model, denoted POMDP_i, to generate its feasible outgoing influence settings, thereby accounting for its ancestors' specified influence settings.

6.4 Interaction Digraph Cycles

As I have presented it thus far, Algorithm 6.1 requires an acyclic agent interaction digraph. The difficulty with cyclic graphs lies in the fact that there no longer exists a total ordering over the agents with the property that each agent can compute its best response and reason about its outgoing influence settings independently of agents that appear later on in the ordering. In this section, I describe one high-level strategy for searching the influence space when there are cycles. The basic idea is that, by examining the lower-level structure of agents' influence parameters, we can transform the cyclic interaction digraph into an equivalent form that does not contain cycles, so as to apply (essentially) the same depth-first search techniques without loss of generality. This strategy obviates having to face the added complexity of other common approaches to coping with graphical model cycles, such as cycle cutset (Dechter, 2003).

Example 6.7. Consider the interaction digraph and corresponding influence DBN shown in Figure 6.3. In this problem, agent 1 influences agent 2 through nonlocal feature n2 and agent 2 influences agent 1 through nonlocal feature n1. This means that agent 1 cannot reason about its feasible outgoing influence settings without accounting for agent 2's influence, about which agent 2 cannot reason without accounting for agent 1's outgoing influence settings. Clearly, neither agent is able to generate its feasible influences independently of the other. The decomposition of influence generation by agent (developed in Section 6.3) will not work for this problem.

Fortunately, by digging deeper into the structure of the TD-POMDP model, we find that there is an inherent acyclicity at the nonlocal feature value level. Regardless of whether or not a problem's interaction digraph contains cycles, the influence DBN cannot contain cycles. No DBN can. In the case of the TD-POMDP, the impossibility of cyclic dependence among individual influence variables stems from the non-concurrency of agents' interaction effects (described in Section 3.4.3.1). Intuitively, agent 1's actions can affect agent 2's at time step t, but agent 2's concurrent actions cannot interfere. Agent 2 cannot use the consequences of agent 1's influence to influence agent 1 back until time step t + 1. In other words, agent 2's outgoing influence may be dependent on past outgoing influences of agent 1, but is conditionally independent of concurrent outgoing influences of agent 1. This insight leads us to a reformulation of the search process, wherein at each level of the tree, an agent reasons about a subset of its influence parameters.
In essence, we can define an ordering over individual influence parameters with the necessary property that a parameter's values are conditionally independent of the values of parameters that appear later in the ordering, conditioned on the values of parameters appearing earlier in the ordering. Any ordering consistent with the ordering of the variables in the influence DBN suffices.

Example 6.7 (continued). As illustrated by the search tree in Figure 6.3, instead of ordering the influence parameters by agent, we can order the influence parameters by the time indices of their nonlocal feature variables, such that agents 1 and 2 consider influence probabilities pertaining to earlier times before those pertaining to later times. The result is a depth-first search that iterates back and forth between agent 1's generation of feasible parameter values and agent 2's generation of feasible influence values.

Just as before, feasible influence parameter settings are passed down the tree and values are passed up the tree, so as to select the optimal parameter settings at each level.

[Figure 6.3 shows a cyclic two-agent interaction digraph (agent 1 influences agent 2 through n2; agent 2 influences agent 1 through n1), the influence DBN unrolled over time steps 0 through T, and a search tree whose levels alternate between the agents' time-indexed influence parameters (agent 1: Pr(n2^0); agent 2: Pr(n1^0); agent 1: Pr(n2^1); ...), with optimal local utilities for the influence settings passed back up.]
Figure 6.3: Example of influence-space search on a cyclic interaction digraph.

However, in the cyclic case, agents do not evaluate their influences at every level of the tree. Instead, each agent computes an exact best-response value after all of its outgoing influence parameters have been assigned. In Example 6.7, evaluation occurs at the lowest two levels of the tree. Another difference is that agents build best response models given only partially-specified incoming influences. For instance, at the 3rd level, agent 1 formulates a best response to partial influence Pr(n1^0). This computation is sound because, in order to generate a feasible setting of its current outgoing influence parameter, Pr(n2^1), agent 1 requires only a partially-specified best response model and it only needs to consider the possible policy decisions at time 0. Pr(n2^1) is independent of all of the unspecified influences and future actions.

6.5 Empirical Results

Recall that the empirical results presented in Chapter 4 uncovered trends in the size of the influence space and the degree of influence as they relate to various problem characteristics. I now evaluate the extent to which these trends translate to computational advantages (and disadvantages) for OIS over other optimal solution algorithms that do not employ influence-based abstraction. In particular, I test the general hypothesis that OIS outperforms its competitors on problems that are weakly-coupled. After describing each of four other optimal algorithms and my experimental setup in Section 6.5.1, the subsections that follow (6.5.2–6.5.5) are each devoted to comparing OIS against one of the four other algorithms.
In each instance, I compare runtime systematically across a space of random problems so as to expose the strengths and weaknesses of OIS in relation to my characterization of influence-space size and degree of influence from Section 4.6. I conclude my analysis with a summary and discussion of the results in Section 6.5.7.

6.5.1 Experimental Setup

Here, I describe the other optimal algorithms and the space of problems considered in this analysis.

6.5.1.1 Other Algorithms

I compare the runtime performance of OIS with four other algorithms designed to compute optimal solutions to restricted flavors of transition-dependent Dec-POMDPs. The first two are straw men of my own design: decoupled joint policy search (without influence abstraction) and centralized joint policy formulation (using a mixed-integer linear program adapted using the single-agent MILP methodology that I presented in Chapter 5). The third and fourth are state-of-the-art algorithms whose development has been published by others within the last 3 years. Note that both of the state-of-the-art algorithms are designed for solving specialized flavors of transition-dependent Dec-POMDP problems (both of which are less general than the TD-POMDP). Moreover, the implementation of OIS has not been given any advantages for exploiting specialized structure beyond that which the other implementations exploit.

Policy-Space Search. The first algorithm is an implementation of OIS (described in Section 6.3) that has been stripped of its policy abstraction. Instead of searching for the optimal influence, agents search directly in the policy space. At each step, they exchange local policies instead of influence settings, but are able to exploit the interaction digraph structure in the same way that OIS does, and use the same best-response models used by OIS. In essence, comparing OIS against this method illustrates the advantages of policy abstraction.

Centralized MILP approach. I have also implemented a method that applies my MILP methodology (developed in Chapter 5) to a centralized decision model with a joint transition matrix, joint state space, and joint action space. The objective of the MILP is to maximize the joint utility subject to the constraints that each agent must base its decisions solely on its own observations (using a variation of the same technique I developed in Section 5.2.3). As such, comparing OIS against this method illustrates the advantages of decoupling the joint policy formulation (on TD-POMDP problems, for which such a decomposition is efficient).

SPIDER. SPIDER (Varakantham et al., 2007) is a decoupled best-response search method that performs a policy-space search but employs its own pruning to reduce the search space. It was originally developed for solving transition- and observation-independent problems, and was recently extended for application to a specialized class of two-agent transition-dependent problems involving interdependent tasks (Marecki & Tambe, 2009). As of yet, it has not been extended to solve any flavors of transition-dependent problems containing more than two agents. In the experiments that follow, I use the implementation of SPIDER graciously provided by its authors.

Separable Bilinear Programming. The last algorithm, which I will denote SBP (Mostafa & Lesser, 2009), was designed for solving EDI-CR problems (as contrasted with TD-POMDPs in Section 3.4.1.5). SBP frames the joint policy formulation problem as a separable bilinear program (Petrik & Zilberstein, 2009).
In the experiments that follow, I use the implementation of SBP graciously provided by its authors. Like the centralized MILP approach (described above), SBP is a centralized algorithm. However, unlike the centralized MILP, SBP exploits the factored structure of agents' subproblems.

6.5.1.2 Random Problem Generation

For the moment (and up until Section 6.5.6), I restrict the focus of my analysis to two-agent problems that are generated according to the same parameterization described in Section 4.6. For each of the problems from my original testbed, I add a second agent, agent 2, and for each of agent 1's nonlocally-affecting tasks (task1x), one of agent 2's tasks (task2y) is randomly selected (without replacement) as being nonlocally-affected. I require that every nonlocal effect e1x,2y takes the form of an enablement, such that agent 1's completion of task1x allows agent 2 to execute task2y without achieving an automatic failure outcome. Agent 2's tasks (including its positive-quality outcomes) are instantiated using the same parameters as agent 1's. In this analysis, I restrict consideration to problems with a single influencing agent and a single influenced agent. I impose both of these restrictions (enablement effects and acyclic interaction digraphs) for compatibility with the two state-of-the-art algorithms ("SPIDER" and "SBP") against which I am comparing OIS. Due to the fact that the implementations of "SPIDER" and "SBP" were tailored to problem domains with differing assumptions, I could not run them on exactly the same set of problems.⁶ Instead, the results presented in Sections 6.5.4 and 6.5.5 respectively compare OIS to SPIDER and OIS to SBP on separate sets of problems, each generated via slight alteration of my original problem generation scheme.⁷

Footnote 6: My implementation of OIS, however, is compatible with either set of assumptions.

Footnote 7: SPIDER required that agents not observe their nonlocal features and that tasks were constrained via a latest start time instead of a latest finish time. In my generation of problems for Section 6.5.4, I redefined a task's window-size parameterization accordingly and treated each enablement feature as a latent variable in the state of the affected agent. In addition to the same partial-observability assumption as SPIDER, SBP required that each task have only two durations, that task windows be specified with an earliest start time but not a latest finish time, and that agents not be allowed to "wait" between their executions of tasks.

6.5.2 Comparison with Policy-Space Search

I begin by comparing OIS with policy-space search. In a sense, this comparison is the purest evaluation of influence-based abstraction because both algorithms behave identically except that OIS abstracts each agent's local policy space and policy-space search does not. Based on my empirical findings regarding influence space sizes and policy space sizes (presented in Section 4.6), I offer the following hypotheses. First, I posit that policy-space search will be limited in its tractability to problems wherein each agent's local decision model is small. Second, I hypothesize that out of the problems where policy-space search is tractable, it will outperform OIS only when the degree of influence is high (indicating that there are almost as many feasible influence
points as there are policies). As such, for the set of problems used in Section 6.5.5 (see Footnote 7), I generated tasks whose latest finish times were constrained to be time T but whose earliest start times were selected according to parameters localWindowSize and NLATWindow (whose semantics were introduced in Section 4.6.1.3 and summarized in Table 4.1), and I removed the "wait" action from agents' decision models. In the high-degree-of-influence case, the overhead of finding each unique influence point, by abstracting it from a policy, should outweigh the benefits of the reduced search space size.

As in the empirical analysis from Chapter 4, I systematically generated problems over the entire space of parameter settings. Again, in the interest of space, here I present select results that illustrate the trade-offs that I observed across the entire space of parameter settings. As a metric for tractability, I allocated each method at most 10 minutes of computation time per problem. For any given setting of parameters, if any problem from that setting was not solved within 10 minutes, the point corresponding to that average runtime measurement is omitted from the plotted results.

For this experiment, the empirical evidence corroborates both of my hypotheses. In general, policy-space search was able to solve problems wherein the local policy space size of the influencing agent contained no more than 10,000 policies. Although this may sound impressive, note that the policy space grows exponentially with the state and action spaces and the time horizon, and recall from my earlier analysis (Sec. 4.6) that problems with local policy space sizes in excess of 10^8 were not uncommon. Figure 6.4 plots the policy space size, degree of influence, and runtime of both OIS and policy-space search as a function of increasing problem time horizon for three different settings of the remaining problem parameters. For each individual plot, from left to right, the increase in time horizon results in an increase in average policy space size and a decrease in the degree of influence. Further, from top to bottom in Figure 6.4, the three cases (A, B, and C) represent three gradations of increasingly-large local problem size. As illustrated, for very small problems, policy-space search is faster than influence-space search due to the overhead of OIS's feasible influence generation. However, except in very small problems such as in case A, as time horizons grow longer, the decreasing degree of influence is accompanied by an increase in the runtime of policy-space search such that it surpasses that of OIS. As problems become larger, the additional overhead of OIS is far outweighed by the growing policy space size. As the degree of influence falls lower and lower, the gap widens. For instance, in case C, when T=4, policy-space search takes two orders of magnitude longer than OIS, and for T=5, cannot compute solutions to each problem in 10 minutes, whereas the average computation time taken by OIS is just one second.
As a testament to my weak coupling theory from Section 3.5.2, the trends observed in the computational advantages of OIS (over policy-space search) are a direct translation of the trends observed in the degree of influence.

[Figure 6.4 plots, row by row, the mean policy space size, mean degree of influence, and mean runtime (seconds) of OIS and policy-space search as the time horizon T grows, for (A) tasksPerAgent=1, localWindowSize=0.5, uncertainty=0.0, NLATs=1, influenceType=state; (B) tasksPerAgent=2, localWindowSize=0.5, uncertainty=0.5, NLATs=1, influenceType=state; and (C) tasksPerAgent=2, localWindowSize=1.0, uncertainty=0.5, NLATs=1, influenceType=state.]
Figure 6.4: OIS vs. Policy Space Search: growing problem size

The same translation of trends can also be seen when the size of the agent's nonlocally-affecting task window (Figure 6.5E) and the earliest start time of the nonlocally-affecting task window (Figures 6.5F and 6.5G) are varied. These last three plots, though seemingly complex, are model instances of the empirical trends evidenced and discussed in Section 4.6.2.5.⁸

Footnote 8: Figure 6.5E confirms that, as the nonlocally-affecting task's window increases, although the policy space increases, the degree of influence decreases; as a result, we observe that the computation time of naïve policy-space search grows more steeply than that of influence-space search. Figures 6.5F and 6.5G confirm that the relationship of policy space size to the earliest start time (NLAT_est) of the nonlocally-affecting task depends heavily on the sizes of local task windows. When local task windows are small (Fig. 6.5F), the policy space size grows very slightly with the nonlocally-affecting task window, causing an increase in the computation time of policy-space search when compared with influence-space search. When local task windows are large (Fig. 6.5G), we see the opposite trend: the policy space size decreases significantly, causing a decrease in the computation time of policy-space search while the computation of influence-space search remains relatively flat.
[Figure 6.5 plots mean policy space size, mean degree of influence, and mean runtime (seconds) of OIS and policy-space search for (E) T=3, tasksPerAgent=2, localWindowSize=0.5, uncertainty=1.0, NLAT_est=0, influenceType=state, varying NLATWindow; (F) T=4, tasksPerAgent=2, localWindowSize=0.0, uncertainty=0.0, NLATWindow=2, influenceType=state, varying NLAT_est; and (G) T=4, tasksPerAgent=2, localWindowSize=1.0, uncertainty=1.0, NLATWindow=1, influenceType=state, varying NLAT_est.]
Figure 6.5: OIS vs. Policy Space Search: window of interaction

6.5.3 Comparison with the Centralized MILP Approach

Although the degree of influence appears to be strongly correlated with the computational requirements of OIS relative to policy-space search, it does not necessarily characterize OIS's computation relative to other solution algorithms. Intuitively, the centralized MILP approach does not search the policy space directly; instead, it searches through agents' joint occupation measures. Moreover, it is apparent from my empirical observations (not shown) that the size of the influencing agent's policy space is not a strong predictor of the computation time of the MILP approach. Instead, the computation of the MILP approach is dependent on the size of the program (i.e., the number of variables and constraints), which is the product of the number of world states and the number of joint actions. This alternative dependence is advantageous for problems with few agents, few tasks, and small state spaces.

As my empirical results confirm, the centralized approach computes optimal solutions faster than does OIS for a large portion of the problems in my testbed. Figure 6.6 plots MILP variables, influence-space size, and computation time of both methods for three gradations of increasingly-large problem size (labeled as A, B, and C).

[Figure 6.6 plots the mean number of MILP variables, the mean OIS influence-space size, and the mean runtime (seconds) of OIS and the centralized MILP as the time horizon T grows, for three cases with tasksPerAgent=3, uncertainty=0.5, NLATs=1, influenceType=state and localWindowSize = 0.0 (A), 0.5 (B), and 1.0 (C).]
Figure 6.6: OIS vs. Centralized MILP: scaling

Moving from top to bottom in Figure 6.6, case A has localWindowSize = 0.0, case B has localWindowSize = 0.5, and case C has localWindowSize = 1.0.
As shown, increasing the time horizon in each of these cases increases the size of the centralized MILP, but the rate of increase is heavily dependent on the local window size; as a consequence, so is the rate of increase of the centralized MILP's computation time. This brings us to the disadvantage of the centralized MILP's dependence on the joint state and joint actions. As weakly-coupled agents' local problem sizes increase, the joint state space increases significantly, ultimately yielding poor scalability of the MILP (as well as any other approach that works directly with the flat joint state and action representation). Notice that in all three cases, the centralized method's runtime grows more steeply than did that of OIS. This same trend was observed across the board and when varying other attributes relating to local problem size, such as number of tasks and uncertainty.

OIS's superior scalability is not due exclusively to the fact that it works in the policy space, or even that it decomposes the joint policy formulation. Both of these traits are present in the policy-space search algorithm (Sec. 6.5.2), whose scalability was inferior to that of the centralized MILP. OIS's scalability comes from its abstraction. For problems with large local model sizes but highly-constrained influences, OIS gains significant advantage over centralized methods such as the centralized MILP. This advantage is evident in Figure 6.7, which shows the effect of varying the window size of an influencing agent's nonlocally-affecting task.

[Figure 6.7 plots the mean number of MILP variables, the mean OIS influence-space size, and the mean runtime (seconds) of OIS and the centralized MILP against NLATWindow, for T=5, tasksPerAgent=3, localWindowSize=1.0, uncertainty=1.0, NLAT_est=0, influenceType=state.]
Figure 6.7: OIS vs. Centralized MILP: NLAT Window Size

As shown, when the influence is highly constrained such that the interaction can only occur during a restricted interval, OIS gains significant advantage over the centralized MILP by searching through a greatly-reduced search space. As the size of the nonlocally-affecting task window increases, so does the computation of OIS. Moreover, I observed that, for all settings involving problems hard enough that either approach took longer than a second on average, the same qualitative trend shown in Figure 6.7 occurred, and in none of these instances did the average computation time of OIS surpass that of the centralized MILP at the window's maximum value. Notice that, in contrast to the comparison of OIS and policy-space search in Section 6.5.2, degree of influence does not correlate with the difference in computation times between OIS and the centralized MILP. My empirical results from Section 4.6.2.5 suggest that as the nonlocally-affecting task window increases, the degree of influence decreases, and that influence-based policy abstraction should be more effective. Although OIS may be increasingly effective compared to policy-space search, here we see that the gap in computation time between OIS and the centralized MILP narrows as the nonlocally-affecting task window is increased, suggesting that OIS is losing its advantage even though the degree of influence is decreasing.
Intuitively, while degree of influence is computed from the policy space size, the centralized MILP does not search the policy space exhaustively, making this method insensitive to the degree of influence.

6.5.4 Comparison with SPIDER

Next, let us turn to an algorithm that searches through the policy space, though not as naïvely as the policy-space search method from Section 6.5.2. SPIDER does not exhaustively evaluate each agent's policies but instead performs pruning using heuristic evaluations of partially-specified policies (Varakantham et al., 2007). Its pruning makes it a much stronger competitor, scaling well beyond the reach of the naïve policy-space search method in Section 6.5.2. In certain instances in my testbed, its average computation time scaled more gracefully than that of OIS (e.g., Figure 6.8A), and in other cases (e.g., Figure 6.8B) not as well. Figure 6.8, which plots only computation times, scales the problem time horizon along the x-axis, varying uncertainty from left to right, and varying localWindowSize from top to bottom. Notice that although SPIDER outperforms OIS in some cases, it starts out with a higher computation time, presumably due to the computational overhead of SPIDER's pruning. As expected, both methods are affected by uncertainty as well as by local window size.⁹ However, the average runtime of OIS grows more steeply (on an exponential scale) than does SPIDER's when localWindowSize = 0.5 (case A) and less steeply than SPIDER's when localWindowSize = 1.0, affirming the hypothesis that influence-based abstraction is most effective in comparison with other approaches when agents' local decision models are more complex.

Footnote 9: Although I did not run naïve policy-space search on this set of problems, the trends in runtime growth of SPIDER due to uncertainty and local window size are consistent with those that I observed in the running times of the naïve policy-space search on the problem set from Section 6.5.2.

Evidently, SPIDER is able to significantly reduce the size of its search space under some circumstances. However, the pruning that SPIDER employs does not seem to exploit the same structure as influence-based abstraction. Figure 6.9 shows the effects of holding local problem size still and increasing the size of the influencing agent's nonlocally-affecting task window. Here, regardless of the local tasks' window sizes, SPIDER exhibits relatively little improvement for smaller nonlocally-affecting task window sizes. OIS, on the other hand, is exponentially faster for smaller nonlocally-affecting task window sizes (just as we observed in Figure 6.7). When nonlocally-affecting task windows are larger, and influences are less constrained, OIS may be at a disadvantage to algorithms such as SPIDER. The good news is that this disadvantage appears to dwindle as agents' local problem sizes become larger.

[Figure 6.8 plots mean runtime (seconds) of OIS and SPIDER against the time horizon T, with uncertainty = 0.0, 0.5, and 1.0 from left to right, for (A) tasksPerAgent=3, localWindowSize=0.5, NLATs=1, influenceType=state and (B) tasksPerAgent=3, localWindowSize=1.0, NLATs=1, influenceType=state.]
Figure 6.8: OIS vs.
SPIDER: scaling local problem size

[Figure 6.9 plots mean runtime (seconds) of OIS and SPIDER against NLATWindow, for T=6, tasksPerAgent=3, uncertainty=1.0, NLATs=1, influenceType=state, with localWindowSize = 0.0, 0.5, and 1.0 from left to right.]
Figure 6.9: OIS vs. SPIDER: NLAT Window Size

6.5.5 Comparison with SBP

The last algorithm that I consider in this analysis is separable bilinear programming (SBP), which computes the agents' joint policy using a centralized representation that models the agents' joint behavior in such a way that it can exploit their largely-independent factored transition structure (Mostafa & Lesser, 2009). The question then becomes whether or not the computational advantages of SBP's structural exploitation outweigh those of OIS's influence-based abstraction. For weakly-coupled problems wherein agents' influences are constrained, my empirical results suggest the opposite. Across all parameter settings, I observed a qualitatively identical trend: OIS was orders of magnitude faster than SBP when the nonlocally-affecting task's window was highly constrained, but approached SBP's runtime as the nonlocally-affecting task window was expanded to its maximum value. Figure 6.10 illustrates the trend for one particular setting of parameters.¹⁰ I could not find a single parameter setting for which OIS's computation time (statistically) significantly exceeded that of SBP when NLATWindow reached its maximum value.

Footnote 10: Note that, given the constraints of the developers' implementation of SBP (described in Section 6.5.1.2), my comparison of OIS and SBP required a variation of the test sets used in past experiments that turned out to be significantly different. In particular, for the problems in this experiment, agents could not perform a wait action in between task executions, which resulted in variations in the length of the time horizon having little effect on the size of the policy space or on the computation required by OIS. Further, the number of outcomes per task was necessarily 2. As such, in order to make problems challenging for OIS, I needed to instead increase the number of tasks per agent.

[Figure 6.10 plots mean runtime (seconds) of OIS and SBP against NLATWindow, for T=10, tasksPerAgent=5, localWindowSize=1.0, NLATs=1, influenceType=state.]
Figure 6.10: OIS vs. SBP: NLAT Window Size

6.5.6 Scaling Beyond Two Agents

The positive results in the preceding subsections indicate the potential of OIS to compute optimal solutions for problems with more weakly-coupled transition-dependent agents than is currently possible with any other solution algorithm. I now provide an initial demonstration of OIS's scalability. Using the same parameterization of agents' local decision problems as in earlier experiments (Table 4.1), I create random problems wherein n agents are connected in a chain of the same form depicted on the left-hand side of Figure 6.1. Here, agent 1 influences agent 2, who influences agent 3, who influences agent 4, and so on. Solving this problem using OIS entails constructing a search tree, as developed in Section 6.3, that generates and evaluates each agent's feasible influences in a depth-first manner.
Figure 6.11 shows the runtime of OIS on a set of 25 random problems per point, wherein the number of agents (n) was varied, and the problem generator parameter settings were fixed such that agents have moderately-sized local decision models that remain weakly coupled (tied together with a single nonlocally-affecting task per agent, with the exception of the agent at the end of the chain). As shown, OIS is able to compute optimal solutions to 5-agent problems in a reasonable amount of time (10 minutes on average). Although the computation time taken by OIS is exponential in the number of agents, its exponential curve is far less steep than that of the centralized MILP approach, which is able to solve 3-agent chains in 10 minutes on average, and uses up all of its allotted 2GB of memory in the process of solving any of the 4-agent problems.

[Figure 6.11: Scalability of OIS and Centralized MILP to more than two agents. Mean runtime (seconds) against the number of agents n for DF-OIS and the centralized MILP; T=5, tasksPerAgent=3, localWindowSize=0.5, uncertainty=0.5, NLATs=1, influenceType=state.]

This scalability result is significant in that it is the first demonstration of optimal solution computation for (relatively-unrestricted) transition-dependent problems containing more than 3 agents. Not only does it demonstrate the tractability of TD-POMDP problems with 5 weakly-coupled agents, but it also affirms that influence-based policy abstraction indeed enables scalability beyond the state-of-the-art, surpassing that of any other algorithms' available implementations.

The reader may be left wondering why such a result has not been possible with prior approaches. Towards answering this question, I offer the following intuitions. The four algorithms that I included in my empirical comparison, each of which could be considered contenders for scaling transition-dependent problems, face inherent obstacles that (as of yet) prevent their scalability to teams of more than 3 agents:

1. Policy-space search is obligated to perform a number of policy evaluations on the order of $\|\Pi_i\|^{n-1}$, a term which, when n > 3, is completely intractable for any sizable value of $\|\Pi_i\|$. Although depth-first OIS also performs a number of best-response calculations that is exponential in the number of agents, it exploits a significant reduction in the base of the exponent by abstracting agents' local policies.

2. The centralized MILP approach is weighed down by the exponentially-increasing size of the joint decision model. The addition of each agent corresponds to an exponential increase in the number of joint actions that the MILP considers, not to mention the exponential increase in the state space (resulting from the cross product of weakly-coupled agents' local state spaces). Without significant exploitation of factored structure to keep the joint model compact, any centralized solution method will be inherently limited in its scalability.

3. It is unclear how one would implement SBP on a problem with more than two agents, given that it formulates the problem as a bilinear program.

4. In principle, SPIDER could be scaled to solve problems with more than 2 agents. In fact, it has already been scaled to problems with more than 2 transition-independent agents (Varakantham et al., 2007). However, to handle transition-dependent agent problems, SPIDER needs to address additional issues that do not arise in the transition-independent case.
For instance, its generation and pruning of candidate policies must be pursued in a manner that is consistent with the directional topology of agents' influences. I faced these same issues in my design of OIS. Implementation issues aside, although initial results suggest that OIS's abstraction engenders a smaller search space than does SPIDER's pruning, it remains an open question whether or not future implementations of SPIDER could accomplish the scalability results achieved by OIS in Figure 6.11.

6.5.7 Summary and Discussion

The results of my empirical comparison may be summarized as follows:

- The tractability of naïve policy-space search is restricted to relatively small 2-agent problems.

- The centralized MILP outperforms OIS on 2-agent problems whose local decision models are relatively small. However, it scales much more poorly than does OIS as agents' local decision models become more complex, or as the number of agents is increased.

- When applied to my space of test problems, the computational advantages of OIS over naïve policy-space search are well characterized by my earlier empirical evaluation of influence space size. The same sets of problems that have a low degree of influence also tend to result in a lower computation time of OIS relative to that of policy-space search.

- For algorithms that prune the policy space in their own way (e.g., SPIDER) or instead search the space in a fundamentally different manner (e.g., the centralized MILP), their computation time is less sensitive to the degree of influence. It is thus not surprising that degree of influence has less effect on their computation time than it does on OIS's. As a consequence, OIS's relative computational advantage is less predictable with respect to these algorithms than with respect to naïve policy-space search. In comparison with algorithms other than naïve policy-space search, the constrainedness of agents' influences (as captured by the size of the window of interaction) appears to be a stronger predictor of OIS's computational advantage.

- For weakly-coupled problems wherein agents' influences are relatively constrained, OIS computes optimal solutions orders of magnitude faster than either of the state-of-the-art algorithms (SPIDER and SBP) across the entire space of test problems. Moreover, the advantage of OIS in these circumstances tends to increase as agents' local decision models become more complex.

- I have demonstrated that OIS can compute optimal solutions for transition-dependent problems with more agents than has been achieved by any other algorithm (that is not substantially restricted in its applicability). The computation of optimal solutions for problems with five agents, where the previous state of the art was three, is a significant advance considering that, for all known algorithms, computational complexity is doubly exponential in the number of agents (by Observation 3.20).

Despite its advantages over existing algorithms, OIS has several key shortcomings. For small problems, in particular those involving tightly-coupled agents, the overhead of OIS's influence-based abstraction makes it slower than other algorithms. Furthermore, for problems that are not weakly coupled, OIS loses its ability to reduce the size of the search space, inevitably relinquishing its advantage over alternative solution methods.
Although I have shown OIS to scale beyond 2 agents, it cannot escape the exponential increase in runtime with each new agent, and hence is limited in scalability to just a handful of weakly-coupled agents (using the implementation I have presented).

These empirical results fulfill an important purpose in the overall scheme of this dissertation. Ever since Chapter 1, I have claimed that the focus of this work is the study of transition-dependent problems that, in the presence of weakly-coupled interaction structure, admit efficient and scalable solution algorithms. I began by formalizing the TD-POMDP in Chapter 3, claiming that, although generally intractable, the class of TD-POMDPs contains sets of weakly-coupled problems that can be solved efficiently. My weak coupling theory, developed in Section 3.5, allowed me to be more concrete in this claim: all else being equal, problems that accommodate a lower degree of influence should be easier to solve. In Chapter 4, I introduced influence-based abstraction as a methodology by which to exploit weakly-coupled structure, and, in defense of this claim, empirically characterized those problems for which influence-based abstraction achieves a low degree of influence. However, my claim that weakly-coupled problems enable more efficiently-computed solutions remained unaddressed. It was not until developing a complete solution algorithm, in this chapter, that I was able to analyze the extent to which influence-based abstraction could be used to compute solutions efficiently. The empirical results that I have presented in the preceding subsections do just this. My comparison of the computational cost of OIS with that of other solution algorithms affirms that influence-based abstraction provides a significant advantage over existing methods in the computation of optimal solutions to weakly-coupled transition-dependent problems, in that OIS solves such problems in orders of magnitude less time.

These results do not just affirm the efficacy of influence-based abstraction. Bootstrapping off of my earlier analysis in Chapter 4, they also evaluate the circumstances under which influence-based abstraction gains the most traction in practice. Conversely, they have exposed circumstances under which influence-based abstraction is disadvantageous. By examining the benefits and limitations of OIS, these results may serve as a guide with which researchers and developers can make informed decisions about the suitability of influence-based abstraction and of optimal influence-space search to the problems that they address.

Having come full circle, I have now fulfilled the primary contributions set forth in Section 1.4. In the remainder of this chapter, I develop an extension of OIS for exploiting additional structure to yield more efficient solutions on problems with more than 2 agents. Then, in the next chapter, I develop extensions for computing approximate solutions. The development and evaluation of these last pieces is more preliminary and less systematic.

6.6 Scaling Beyond a Handful of Agents

The scaling of OIS to 5 agents in Section 6.5.6 is a significant achievement, but for larger agent teams, DF-OIS hits a wall just as other methods hit a wall at 2 or 3 agents. This is to be expected, since the DF-OIS search tree is exponential in the number of agents. However, I claim that in the presence of additional structure, we can overcome this barrier and scale optimal influence-space search to indefinitely many agents.
The way forward is to exploit structure in the interaction digraph. The depth-first optimal influence-space search only utilizes the interaction digraph to order agents' influence generations within the search tree. I now develop an extension that exploits structure in the connectivity of the interaction digraph to reduce computation. I begin by describing two situations (in Sections 6.6.1 and 6.6.2) wherein depth-first search performs redundant computation, providing suggestions of how such redundancy might be avoided. Afterwards, in Section 6.6.3, I present a more sophisticated algorithm that applies the bucket elimination paradigm (Dechter, 1999) to the problem of optimal influence-space search and demonstrate its scalability.

6.6.1 Independent Ancestors

Consider the interaction digraph shown in Figure 6.12, containing one agent that is influenced by all of its peers. This structure induces a depth-first ordering of the agents according to their interaction digraph indices, such that agents {1, ..., n − 1} occupy the upper levels of the search tree and agent n occupies the lowest level. To search the space, agent 1 would generate its feasible outgoing influence settings and pass those down to agent 2. For each of agent 1's settings, agent 2 would generate its feasible outgoing influence settings and pass those down to agent 3. As we proceed down the search tree, the number of nodes at each level grows exponentially. Agent 3 will receive on the order of $\|\Gamma_i\|^2$ combinations of influences from agents 1 and 2. In turn, agent 3 will call GenerateFeasibleInfluences $\|\Gamma_i\|^2$ times. However, according to the interaction digraph, agent 3 is uninfluenced by agents 1 and 2. Agent 3's local quality and outgoing influence settings are completely independent of agent 1's and agent 2's influences. Thus, agent 3 is performing an identical computation of feasible influences every time that it considers a different combination of influences from agents 1 and 2. The same is true for all of agents {2, ..., n − 1}. For this problem, the only branching that is required is at the bottom of the tree, wherein agent n considers all feasible combinations of all influence settings of its ancestors.

[Figure 6.12: An interaction digraph wherein parents are independent: agents 1, 2, ..., n − 1 each influence only agent n.]

As a high-level strategy for avoiding this redundancy, consider a refactoring of the search process such that an agent i's influence generation problem is explicitly separated from its influence evaluation problem. Regardless of the multitude of messages passed down by agent i's peers earlier in the ordering, each of which agent i will later respond to with unique value messages passed up the tree, i does not need to generate outgoing influences for each. Instead, i only needs to generate outgoing influences for each of the unique combinations of incoming influences on which its own decision model depends.

6.6.2 Conditionally Independent Descendants

In Figure 6.13, the digraph topology is such that there is only a single influencing agent (agent 1) who influences all of the remaining agents. Agents {2, ..., n} do not share any nonlocal features and so do not influence each other's local transitions or local qualities. Using Definition 3.34, agents {2, ..., n} are decision-independent of each other conditioned on the decisions of agent 1. For this problem, the depth-first search tree would accurately reflect that agent 1 is the only influencing agent, such that the only branching that occurs is from the root of the search tree.
Below the root, agents {2, ..., n} generate a single branch per influence setting from agent 1. In this case, the redundancy occurs as values are passed up the tree. Each agent calculates a separate best response for each combination of influence settings involving all n − 1 of agent 1's nonlocal features {n2, n3, n4, ...}. If there are on the order of $\|\Gamma_1(n_i)\|$ feasible settings that uniquely specify the transitions of each nonlocal feature, agents {2, ..., n} will each have to perform $\|\Gamma_1(n_i)\|^{n-1}$ best-response calculations.

[Figure 6.13: An interaction digraph wherein children are conditionally independent: agent 1 influences each of agents 2, ..., n.]

To avoid this redundancy, consider a restructuring of the search tree into 2 levels. As before, agent 1 sits at the root node. However, each branch corresponds to a feasible setting of only one of agent 1's influences $\Gamma_1(n_i)$ (instead of all of its influences $\{\Gamma_1(n_1), \Gamma_1(n_2), ...\}$). Each such branch leads to a leaf node controlled by agent i, such that agent i only needs to respond to $\|\Gamma_1(n_i)\|$ settings rather than $\|\Gamma_1(n_i)\|^{n-1}$. In addition to the savings from computing fewer best responses, agents {2, ..., n} will avoid unnecessary exchange of messages over the course of the search. More generally, for any two nodes i and j that are decision-independent conditioned on common ancestors' decisions, agents i and j need not exchange messages over the course of influence-space search.

6.6.3 Bucket Elimination for Optimal Influence Search

To take advantage of conditional independence relations among descendants as well as those among ancestors, I now present a more sophisticated reformulation of optimal influence-space search called Bucket Elimination OIS (BE-OIS). It follows the general scheme of Dechter's bucket elimination algorithm for constraint optimization (Dechter, 1999). Bucket elimination performs dynamic programming using a well-ordered elimination of variables, associating a bucket data structure with each variable to be eliminated. Given a collection of cost functions defined over subsets of variables, and a total order over variables, bucket elimination distributes the cost functions into buckets (each associated with a single variable), according to the latest-ordered variable referenced by the cost function. One by one, the algorithm processes each bucket by combining its cost functions by summation, eliminating its variable by maximization, and passing the reduced cost function down to the next bucket that references any of the remaining variables. Top-down bucket processing is then followed by bottom-up propagation of optimal variable assignments.

Analogous to the elimination of COP variables, here we would like to eliminate influence variables, which we can do by combining agents' value functions with respect to subsets of influence parameters. As such, we will create a bucket that corresponds to each agent's outgoing influences. For simplicity of exposition, let us assume that the agent interaction digraph contains no cycles.11 The collections of messages passed from one bucket to the next are of the same flavor as the influence evaluation messages passed up the DF-OIS search tree, containing an influence setting and a value. More precisely, in BE-OIS, each message consists of a setting for a subset of influence parameters and the summation of all agents' local values that are influenced by those parameters. However, BE-OIS has the potential to significantly reduce the number of such messages relative to DF-OIS.
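Before turning to the BE-OIS phases themselves, the following toy sketch illustrates the generic bucket-elimination scheme summarized above — placing functions into buckets, eliminating each variable by sum-and-max, and back-substituting assignments — on an invented two-function cost network. It illustrates Dechter's general paradigm, not BE-OIS itself; all cost tables and variable names are made up.

```python
from itertools import product

# Toy cost network: maximize f(A, B) + g(B, C) over A, B, C in {0, 1}.
DOMAIN = [0, 1]
f = {(a, b): 2 * a + b * (1 - a) for a, b in product(DOMAIN, DOMAIN)}  # invented costs
g = {(b, c): 3 * (b == c) for b, c in product(DOMAIN, DOMAIN)}

# Bucket of A holds f; eliminate A by maximizing, conditioned on B.
msg_A = {b: max(f[(a, b)] for a in DOMAIN) for b in DOMAIN}
arg_A = {b: max(DOMAIN, key=lambda a: f[(a, b)]) for b in DOMAIN}

# Bucket of B holds g plus the message from A's bucket; eliminate B, conditioned on C.
msg_B = {c: max(g[(b, c)] + msg_A[b] for b in DOMAIN) for c in DOMAIN}
arg_B = {c: max(DOMAIN, key=lambda b: g[(b, c)] + msg_A[b]) for c in DOMAIN}

# The last bucket eliminates C unconditionally ...
best_C = max(DOMAIN, key=lambda c: msg_B[c])
# ... and optimal assignments then propagate back up, bucket by bucket.
best_B = arg_B[best_C]
best_A = arg_A[best_B]
print(best_A, best_B, best_C, msg_B[best_C])
```

In BE-OIS, the eliminated "variables" are agents' outgoing influence settings and the combined functions are agents' local value functions, as developed next.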
BE-OIS does so by employing a more sophisticated decomposition of the agents' influence generation and evaluation. Figure 6.14 illustrates how BE-OIS searches the space. On the left is an agent interaction digraph, and on the right are the buckets, indexed by the agent whose outgoing influences are to be eliminated. The buckets are processed from the top-most bucket down, and the topology of message exchange during the course of bucket processing is depicted by the arrows connecting buckets. Appearing within each bucket are the single-agent value functions prior to processing and the multiple-agent value functions that have been processed by earlier buckets. The operation of BE-OIS proceeds in three phases: initialization, elimination, and assignment, each of which I describe as follows.

[Figure 6.14: Interaction digraph (left) and processing of buckets by BE-OIS (right); buckets are indexed by the influencing agents 4, 3, 2, and 1, and each bucket shows the component value functions that it combines.]

Initialization. As with DF-OIS, I assume that BE-OIS will be initialized and invoked by a central entity, but note that, thereafter, its operation is fully decentralized. To start, an order is selected over influencing agents that is the reverse of some ordering consistent with the partial order of the interaction digraph.12 In contrast to DF-OIS, which starts with agents that influence the most peers, BE-OIS begins by reasoning about the influences of the agents that influence fewer peers. In Figure 6.14, the influencing agents {4, 3, 2, 1} are considered in that order. For each agent in the ordering, a bucket is created for reasoning about the joint value of (and ultimately eliminating) the corresponding agent's outgoing influences.

Next, each agent i's local value function is placed into exactly one bucket. If i influences other agents, i's value function is placed into bucket i. Otherwise, i's value function is placed into the bucket indexed by i's earliest-ordered ancestor in the interaction digraph. This is consistent with Dechter's bucket elimination algorithm for the following reason. By Theorem 3.33, the local value function of agent i is independent of the decisions of agents other than i and i's ancestors Λ(i). Consequently, agent i's optimal local value with respect to peers' influences is independent of the portions of influence settings that do not pertain to the influences of i's ancestors $\Gamma_{\Lambda(i)}$:

$$V_i^*(\Gamma) = V_i^*\left(\Gamma_i, \Gamma_{\Lambda(i)}\right) \qquad (6.1)$$

A value function $V_i^*\left(\Gamma_i, \Gamma_{\Lambda(i)}\right)$, exactly of the form given in Equation 6.1, belongs in the earliest-ordered bucket whose influence it references. At the end of initialization, each bucket i should include i's local value function and the local value functions of any non-influencing children.13 Henceforth, I will denote bucket(i) as the set of indices of the component value functions in bucket i.

12 One ordering may yield less computation than another. Dechter (1999) describes algorithms for determining the best ordering.

13 In Figure 6.14, notice that buckets 1, 2 and 3 contain additional components. These are components that are added from other buckets during the elimination phase.

11 I make this assumption without loss of generality. The BE-OIS algorithm that I describe here can be extended to accommodate interaction digraph cycles using the same technique developed in Section 6.4 that enabled DF-OIS to accommodate cycles.
I will denote scope(i) as the set of indices, except for i, of all influence components referenced by any agent in bucket(i). Further, I will denote $\Gamma_{scope(i)}$ as the set of all influence parameters, except for $\Gamma_i$, referenced by any agent in bucket(i). Analogous to Dechter's bucket elimination, $\Gamma_{scope(i)}$ serves as the set of variables on which eliminated variables depend.

Elimination. Elimination involves the processing of each bucket by the indexed agent. For each bucket i, agent i may begin processing its bucket immediately, and in parallel with other agents' eliminations. However, agent i can only finish processing its bucket once all of the buckets earlier in the ordering have finished. For agent i, the objective in processing its bucket is to compute the optimal setting of its outgoing influences $\Gamma_i^{*|\Gamma_{scope(i)}} = \arg\max_{\Gamma_i} V\left(\Gamma_i, \Gamma_{scope(i)}\right)$ for each feasible setting of the peers' influences on which it depends, $\{\Gamma_{scope(i)}\}$. In order to compute each $\Gamma_i^{*|\Gamma_{scope(i)}}$, agent i decomposes this computation into the summation of local value functions in bucket i:

$$\Gamma_i^{*|\Gamma_{scope(i)}} = \arg\max_{\Gamma_i} \left[ \sum_{j \in bucket(i)} V_j^*\left(\Gamma_i, \Gamma_{scope(i)}\right) \right] \qquad (6.2)$$

Before agent i can evaluate Equation 6.2, it must recursively call upon all the agents in scope(i) to generate their feasible influence settings. In essence, agent i invokes a depth-first influence-space generation for the subset of agents whose influence parameters are referenced within i's bucket. However, unlike in DF-OIS, the generated influences are stored for later use by all ancestors. For the example problem in Figure 6.14, agent 4 is the first to process its bucket, and hence calls upon agent 2 to generate all of the feasible settings of $\Gamma_2$. Upon receiving all feasible combinations of ancestors' influence settings $\{\Gamma_{scope(i)}\}$, agent i decomposes these into the sets of feasible influence settings required to compute each local value function in Equation 6.2. For instance, agent 4 in Figure 6.14 decomposes the set of feasible influences $\{\Gamma_2\}$ into $\{\Gamma_2(n_4)\} \times \{\Gamma_2(n_{5b})\}$. Additionally, agent i generates its own feasible influences (if it has not already done so) for each combination of dependent ancestors' influence settings. All that remains is for agent i to pass the requisite combinations of settings of $\{\Gamma_i\} \times \{\Gamma_{scope(i)}\}$ to any descendant j that does not own a bucket, and to wait for evaluation messages back from j, each of the form $\left\langle \langle\Gamma_i, \Gamma_{scope(i)}\rangle, V_j^*\left(\Gamma_i, \Gamma_{scope(i)}\right) \right\rangle$. Additionally, agent i must wait for bucket-processing-completion messages from all buckets earlier in the ordering (before which additional evaluation messages may come).

Agent i concludes its processing of bucket i by performing the maximization in Equation 6.2 for every setting of dependent ancestors' influence settings, storing each optimal $\Gamma_i^{*|\Gamma_{scope(i)}}$, creating for each an evaluation message $\left\langle \Gamma_{scope(i)}, V_{i,\Psi(i)}^*\left(\Gamma_{scope(i)}\right) \right\rangle$, and sending all of these evaluation messages to the earliest-ordered agent referenced by the influences in the evaluation messages. In the example in Figure 6.14, agent 4 sends an evaluation message to agent 2, thereby inserting an additional component into agent 2's bucket. Once all evaluation messages are sent, agent i broadcasts a bucket-processing-completion message to all agents later in the ordering, notifying them that bucket i has been processed and they can go ahead and finish processing their own buckets.
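As a highly simplified illustration of the bucket-processing step in Equation 6.2, the sketch below assumes that each component value function in a bucket is available as a callable over (a setting of Γ_i, a setting of Γ_scope(i)); the toy settings, values, and the function name process_bucket are invented, and the real algorithm would obtain these quantities from the influence-generation and evaluation messages described above.

```python
def process_bucket(own_settings, scope_settings, value_functions):
    """Eliminate agent i's outgoing influence per Equation 6.2 (a sketch).

    own_settings    : feasible settings of Gamma_i
    scope_settings  : feasible settings of Gamma_scope(i)
    value_functions : callables V_j(gamma_i, gamma_scope) for j in bucket(i)
    Returns, for every gamma_scope, the optimal gamma_i and the summed value --
    i.e., the evaluation messages passed down to the next relevant bucket.
    """
    messages = {}
    for gamma_scope in scope_settings:
        best_gamma_i, best_value = None, float("-inf")
        for gamma_i in own_settings:
            total = sum(v(gamma_i, gamma_scope) for v in value_functions)
            if total > best_value:
                best_gamma_i, best_value = gamma_i, total
        messages[gamma_scope] = (best_gamma_i, best_value)
    return messages


# Toy usage: two component value functions in bucket i (all values invented).
V_a = lambda gi, gs: {"a": 1.0, "b": 2.5}[gi] + (0.5 if gs == "x" else 0.0)
V_b = lambda gi, gs: 1.0 if (gi, gs) == ("a", "y") else 0.0
print(process_bucket(["a", "b"], ["x", "y"], [V_a, V_b]))
```

The stored best_gamma_i values are exactly what the assignment phase, described next, propagates back up.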
If agent i is the last agent in the ordering, and hence the last agent to complete its bucket processing, it enters the assignment phase. Note that, in this case, agent i will have eliminated the last influence component, computing a single value $\Gamma_i^*$ that is the unconditionally optimal setting of agent i's outgoing influence.

Assignment. Whereas in the elimination phase, evaluation messages are passed down from bucket to bucket, in the assignment phase, optimal influence settings are passed up. The last agent in the ordering has computed its optimal influence assignment $\Gamma_{last}^*$, broadcasting this influence setting to previously-ordered agents in an assignment message. Recall that each such agent i has stored its optimal influence settings $\Gamma_i^{*|\Gamma_{scope(i)}}$. As soon as i receives all of the optimal settings of $\Gamma_{scope(i)}$, i can assign its optimal influence $\Gamma_i^* = \Gamma_i^{*|\Gamma_{scope(i)}^*}$, and broadcast an assignment message. This process continues until all agents have assigned their optimal outgoing influence settings. As in DF-OIS, once all influences have been assigned, each agent can compute its local component of the optimal joint policy by computing a best response to the optimal influence point.

6.6.4 Complexity of Bucket Elimination OIS

Intuitively, for problems like the one shown in Figure 6.14, BE-OIS has a lower asymptotic complexity than does DF-OIS because BE-OIS does not build a search tree whose depth is the number of agents. Instead, BE-OIS builds a set of smaller search trees, one for each bucket i, whose maximum depth is equal to $\|scope(i)\|$. Recall that $\|scope(i)\|$ is the number of other agents whose influences are referenced by the value functions in bucket i. In Dechter's bucket elimination terminology (Dechter, 2003), $\|scope(i)\|$ is the number of variables in bucket i minus 1. Dechter's complexity theory (Dechter, 1999) tells us that, given that bucket elimination uses the optimal ordering over agents, the maximum number of variables in any bucket is equal to the induced width $\omega^*$ of the constraint graph. Therefore, given that the induced width of the interaction digraph is $\omega$ (Def. 3.36), the number of feasible settings of $\Gamma_{scope(i)}$ that are received by agent i is at most $O\left(\|\Gamma_i^{max}\|^{\omega-1}\right)$, where $\|\Gamma_i^{max}\|$ is the largest number of feasible settings generated by any agent. In the worst case, for each of these settings, agent i must generate its feasible outgoing influence settings. Since there are at most n buckets, there will be at most $O\left(n \cdot \|\Gamma_i^{max}\|^{\omega-1}\right)$ generations. Agent i's generation of feasible influences adds another layer to the tree of generated influence settings for bucket i, raising the worst-case total of generated settings to $O\left(\|\Gamma_i^{max}\|^{\omega}\right)$. This means that the number of influence settings evaluated by any agent in any bucket is at most $O\left(\|\Gamma_i^{max}\|^{\omega}\right)$. Since there are n agents, there will be at most $O\left(n \cdot \|\Gamma_i^{max}\|^{\omega}\right)$ evaluations. Since generation and elimination of influence settings dominate all other operations of BE-OIS, the complexity is bounded by:

$$O\left(n \cdot C_E \cdot \|\Gamma_i^{max}\|^{\omega} + n \cdot C_G \cdot \|\Gamma_i^{max}\|^{\omega-1}\right) \qquad (6.3)$$

where $C_E$ is the worst-case complexity of the evaluation of any influence setting by any agent, and $C_G$ is the worst-case complexity of the generation of one agent's feasible outgoing influence settings.

6.6.5 Empirical Results

Notice from Equation 6.3 that the complexity of bucket elimination is linear, not exponential, in the number of agents. Depth-first search, on the other hand, is necessarily exponential in the number of agents.
Bucket elimination does not avoid an exponential term altogether. However, its exponent is bounded by the induced width of the interaction digraph. In theory, for problems whose interaction digraphs have a fixed induced width, BE-OIS should scale linearly in the number of agents.

To put this hypothesis to the test, I ran both BE-OIS and DF-OIS on a set of 25 random problems (per plotted point) whose interaction digraph is shown in Figure 6.15 (with a topology that I denote zigzag). As shown in Figure 6.16, BE-OIS is able to compute optimal solutions for 50 agents, and in orders of magnitude less time than it takes DF-OIS to compute optimal solutions for 6 agents. Given the ability of bucket elimination to exploit digraph structure (specifically, reduced agent scope), BE-OIS is able to scale well beyond DF-OIS. Moreover, it advances the art of transition-dependent agent planning into a whole new sphere of problems. I have provided compelling evidence that the techniques I have developed are applicable to very large teams of weakly-coupled agents with structured graph topologies, which is a far cry from the two-agent and three-agent limitations of past work.

[Figure 6.15: "chain" and "zigzag" interaction digraph topologies.]

This degree of scalability could not have been accomplished without the exploitation of two complementary aspects of weakly-coupled problem structure (degree of influence and agent scope). The topology of the agents' interaction digraph also plays an important role in this result. In both the zigzag topology and the chain topology (empirically tested in Section 6.5.6), each agent interacts with at most two other agents. However, due to the directionality of the influence arrows, one problem is significantly harder to solve than the other. In the chain topology, the maximum agent scope size (Def. 3.30) is the number of agents. Intuitively, since the last agent in the chain is influenced by all other agents, it must reason about all combinations of other agents' feasible influences. In the zigzag topology, agent scope size is at most three, enabling an efficient decomposition of the search through the space of all combinations of agents' influence settings into sub-searches, each through the space of just a couple of influence settings.

[Figure 6.16: Scalability of DF-OIS and BE-OIS on the "zigzag" topology. Mean runtime (seconds) against the number of agents n; T=5, tasksPerAgent=3, localWindowSize=0.5, uncertainty=0.5, NLATs=1, influenceType=state.]

CHAPTER 7

Flexible Approximation Techniques

Although OIS is competitive with other optimal algorithms on weakly-coupled problems and scales to problems with more agents than was previously possible, there are certainly problems for which computing the optimal solution (using OIS or any other algorithm) is intractable. In this chapter, I demonstrate that my influence-based framework is also suited to computing approximate solutions. Each of the three techniques that I present has the flavor of flexibly trading optimal solution quality for computational efficiency. In contrast to the previous chapters, this chapter presents a less systematic and more preliminary investigation.

7.1 Approximation of Influence Probabilities

Recall that the influence space searched by OIS consists of vectors of probability values corresponding to the conditional probabilities implied by feasible influences.
As I have developed in Section 5.6, generating new points in the influence space involves finding new probability values, component by component. OIS finds all probability combinations. Assuming a fixed influence encoding size, the more tightly coupled a problem is, the denser the space of probabilities. The idea behind probability approximation is to avoid the generation of any new influence whose probability parameters are all within ε of those of an influence found previously. This can be done using a very simple modification to OIS's generation algorithm (Sec. 5.6). Each time a new influence $\gamma_{found}$ is found, the two new intervals added to the explore queue are reduced to $\{(\gamma_{min}, \gamma_{found} - \epsilon), (\gamma_{found} + \epsilon, \gamma_{max})\}$ such that no parameter values within ε of $\gamma_{found}$ are considered by future MILPs.

Figure 7.1 presents initial empirical results, comparing different values of ε on a set of 25 random 4-agent (chain) problems (whose interaction digraphs take the form shown at the top of Figure 6.15). In each problem, each agent was given 3 tasks, each with 3 randomly-selected durations (whose probabilities were generated uniformly at random and normalized) and randomly-selected outcome qualities (whose values were drawn randomly from the set {1.0, 2.0, 3.0}). One of agent 1's tasks (chosen at random) was set to enable one of agent 2's tasks (chosen at random), one of agent 2's tasks (chosen at random) was set to enable one of agent 3's tasks (chosen at random), and one of agent 3's tasks (chosen at random) was set to enable one of agent 4's tasks (chosen at random). The time horizon was set to 6 and every task's window was set to the full duration of execution, making these strongly-coupled problems (relative to those for which, as I showed in Section 4.6.2.5, reducing the NLATWindow yields exponentially fewer influences).

[Figure 7.1: Empirical evaluation of ε-approximate OIS: a bar plot of runtime (log scale) and a table of solution quality for increasing values of ε.]

As shown in Figure 7.1, the bar plot indicates a substantial decrease in the runtime (plotted on a log scale) as the value of ε is increased. In contrast, the solution quality table1 shows that the normalized joint utility of the highest-valued influence found decreases very slightly until ε becomes larger than 0.1. For this particular set of problems, approximating the influence probability space achieves large computational savings at the expense of very little solution quality. The performance of ε-approximate OIS affirms the intuition that, although agents may forgo finding the optimal influence point by approximating the probability space, they are still guaranteed to search the space relatively evenly (to the extent that it is populated evenly with probabilities). Parameter ε specifies the resolution to which they search.

1 The third row of the table, labeled "improvement over uncoordinated local policies", measures the average percentage improvement over the policy computed by each agent maximizing its local utility without regard to the other agents in the system (assuming pessimistically that its peers will not enable it).

7.2 Time Commitment Abstraction

In the last section, we approximated the space of probabilities associated with each parameter. Alternatively, consider approximating the parameters themselves. That is, approximate the structure of the influence DBN. For example, there may be several features (e.g., cloud cover, time of day, and temperature) that are mutually modeled by a team of rovers, but that are not all equally informative in predicting the rovers' influences on each other.
Feature selection methods could be used to remove all but the most useful influence dependencies, thereby reducing the space of possible influences significantly. Alternatively, we could remove DBN connections, thereby imposing faux conditional independence relationships. Ultimately, the goal is to reduce the number of parameters that encode the influence, as well as the size of the influence space.

Here, I develop one particular approximation wherein the influence encoding $\Gamma(n_{ix})$ has been reduced to just two parameters (t and ρ) of the form $\Gamma(n_{ix}) = \left[\Pr(n_{ix}^t = true) \geq \rho\right]$, where t is a time value and ρ is a probability value. In contrast to the usual influence information, $\Pr(n_{ix}^t = true) \geq \rho$ does not express a single probability of interaction, but instead a range of probability values. The agent proposes to adopt a policy that sets bit $n_{ix}$ to true by time t with probability at least ρ. I call this a time commitment. In contrast to the more general notion of an influence, a time commitment has the implicit semantics that the nonlocal feature is event-driven (Def. 4.21): once the influencing agent sets it to true, it can never be set to false thereafter. Event-driven features are well suited for modeling interactions among service-oriented agents. After describing the service-oriented context in Section 7.2.1, I present the formal details of time commitments (Section 7.2.2), discuss issues that arise when modeling time commitments (Section 7.2.3), and characterize the search space of time commitments (Section 7.2.4).

7.2.1 Service Coordination

As a context for time commitments, consider a group of agents (such as is shown in Figure 7.2) who interact by performing services for one another. I refer to Agent 1 as a service-providing agent because it has various tasks that it can perform to fulfill the service requests of other agents. I refer to Agent 2 and Agent 3 as service-requesting agents because they can make use of the services provided by Agent 1. In particular, Agent 1 has three services {A, B, C}, where providing Service A entails the completion of Task A, providing Service B entails the completion of Task B, and providing Service C entails the completion of Task C (which must be preceded by the completion of Task B). These services in turn allow Agents 2 and 3 to complete their own tasks.

[Figure 7.2: Service Coordination example. Agent 1 (execution window [0, 8]) can perform Task A (duration 1, 2, or 3, each with probability 1/3), Task B (duration 1), and Task C (duration 2, enabled by Task B). Task A enables Agent 2's Task D (duration 4; local utility 3; request: Service A by time 4 with probability 1), and Task C enables Agent 3's Task E (duration 2 with probability 1/4, 3 with probability 1/2, or 4 with probability 1/4; local utility 3; request: Service C by time 4 with probability 1).]

7.2.2 Time Commitment Formalism

Within the context of service provision, time commitments are defined as follows:

Definition 7.1. A probabilistic time commitment $C_{ij}(s) = \langle t, \rho \rangle$ is a guarantee that agent i will perform (for agent j) the actions necessary to deliver service s by time t with probability no less than ρ.

Probabilistic time commitments allow agents to make promises to each other in the event that they cannot fully guarantee service provision. It can be extremely beneficial to model the inherent service uncertainty in this way. In our example, Agent 1 cannot guarantee provision of Service C until time 6.
If Agent 3 waits until time 6, it will only be able to complete Task E (in the case that Task E's duration is 2) with a probability of 1/4. However, Agent 1 can promise provision by time 5 with a probability of 2/3, giving Agent 3 a 3/4 × 2/3 = 1/2 chance of completing Task E. Thus, by Agent 1 committing to probabilistically providing Service C at time 5, Agent 3 can take advantage of the temporal uncertainty and effectively double its expected utility.

While this example illustrates that the semantics of the commitment's probability can capture uncertainty about whether a task will be completed in time to meet the time commitment, the probability can also summarize the likelihood that a task will even be attempted. That is, in some execution trajectories, a service provider might reach a state where it would be counterproductive to even begin one of the tasks about which it has made a commitment. This kind of behavior is captured in the commitment semantics: so long as the probability of encountering a trajectory that involves never starting a task, or not finishing it by time t, is no greater than 1 − ρ, then the provider can make the commitment to complete the task by t with probability at least ρ.

7.2.3 Modeling, Incompleteness, and Inconsistency

Modeling time commitments is a little trickier than modeling influences because a time commitment does not sufficiently encode the influencing agent's policy. As I describe below, the time commitment model is incomplete because it only specifies (a bound on) the probability at time t, leaving the remaining transition probabilities' values unknown. Further, because of the ≥ inequality, the transition probability with which the nonlocal feature changes from false to true at the given time may not be exactly equal to ρ; it may be greater than ρ. Whatever the value with which it is modeled, that value may be inconsistent with the true value implied by the service provider's policy. I describe one strategy for coping with this inconsistency below.

A service-requesting agent cannot itself control the provision of Service C, but is concerned with whether or not C will be or has been provided. Hence it should model a nonlocal feature Service-C-completed. A commitment can be thought of as a promise from a service-providing agent to be, with probability at least ρ, in a state at time t in which the corresponding nonlocal feature is set. To a service-requesting agent, the commitment is a promise that a nonlocal feature will be set at time t with probability no less than ρ. Thus, from a practical standpoint, the commitment probability ρ corresponds to a portion of the transition probabilities of the nonlocal features in the service-requesting agent's MDP.

Example 7.2. Consider a commitment $C_{13}(C) = \langle 5, 2/3 \rangle$ by which Agent 1 promises to Agent 3 to complete Task C by time 5 with probability ≥ 2/3. We can augment the transition model in Agent 3's local MDP to represent this committed behavior of Agent 1. As shown in Figure 7.3, the transition caused by taking action "N" in state "NN4" is expanded into two possible transitions. This is because Agent 1 has committed to setting the Service-C-completed feature by time 5 with probability 2/3. In this simple problem, there is only one transition at time 4 that is augmented by the modeled commitment, but in general, all transitions leading from time 4 to time 5 would be expanded in this manner.
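To make the augmentation of Example 7.2 concrete, here is a minimal sketch of how a requester's transition function might be split according to a commitment probability; the state encoding (feature value, local task status, time), the transition-table format, and the function name are simplifications invented for this illustration rather than the representation used in Figure 7.3.

```python
def augment_with_commitment(transitions, commit_time, rho, feature_set, feature_unset):
    """Split every transition that lands at `commit_time` with the committed
    nonlocal feature still unset into two outcomes: with probability `rho` the
    provider has set the feature, otherwise it remains unset.

    `transitions` maps (state, action) -> list of (next_state, probability),
    where a state is a tuple (feature_value, local_status, time).
    """
    augmented = {}
    for (state, action), outcomes in transitions.items():
        new_outcomes = []
        for (feat, status, t), p in outcomes:
            if t == commit_time and feat == feature_unset:
                new_outcomes.append(((feature_set, status, t), p * rho))
                new_outcomes.append(((feature_unset, status, t), p * (1 - rho)))
            else:
                new_outcomes.append(((feat, status, t), p))
        augmented[(state, action)] = new_outcomes
    return augmented


# Toy usage mirroring Example 7.2: action "N" in state ("N", "N", 4) now reaches
# a Service-C-completed state at time 5 with probability 2/3.
T = {(("N", "N", 4), "N"): [(("N", "N", 5), 1.0)]}
print(augment_with_commitment(T, commit_time=5, rho=2 / 3,
                              feature_set="F", feature_unset="N"))
```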
[Figure 7.3: A conservative model of a time commitment: Agent 3's local MDP. State space: ⟨ServiceC-completed, TaskE, time⟩, where ServiceC-completed ∈ {N (not completed), F (completed)}, TaskE ∈ {N (not started), 0 (started at time 0), 1, 2, 3, ..., F (finished)}, and time ∈ {0, 1, 2, ..., 8}. Action space: {E (execute TaskE), N (don't start executing anything new)}. The commitment $C_{13}(C) = \langle 5, 2/3 \rangle$ splits the transition out of state NN4 under action N.]

Notice that Agent 3 models Service-C-completed as "(N)ot completed" before the commitment time. There is no information encoded in the commitment about the value of the feature at times 0 through 4, nor is there information, in the case that the service is not provided by time 5, about the value of the bit at times 6 through 8. One method of dealing with the incomplete information of time commitments is for the requesting agent to construct a conservative model of the providing agent. That is, the agent assumes the worst: a zero probability that the services will be provided at all times not referred to by the time commitments. In our example, Agent 3 would assume that Service-C-completed takes on a value of N from times 0–4, and cannot change from N to F after time 5. Modeling the change in feature value of Service-C-completed only at time 5 leads to a very compact local model.

Note, however, that this model is not entirely consistent with the behavior of the service provider. Agent 1 has committed to setting the value of Service-C-completed to "F(inished)" by time 5, not at time 5. But Agent 3 models the feature as having value "N" at all times before 5. That is, Agent 3 is not modeling the possible completion of Task C any earlier than the commitment time, even though it is possible that Task C might finish at time 4. Similarly, given that Agent 1's policy is really to start Task C as soon as it finishes Tasks A and B, then if Task C does not finish at time 5 it must finish at time 6. However, the local model does not include that possibility. This inconsistency in the agent's local model has the effect that the policy it constructs cannot react quickly to early service provision and cannot react at all to late provision. Consequently, the time commitment abstraction provides approximate solutions. The loss in solution quality due to the approximation will depend on a variety of problem characteristics. Intuitively, the approximation will work well for scenarios involving a single critical time. However, for scenarios with more flexibility in the timing of service provision and utilization, and for highly-uncertain services with a large number of possible completion times, I expect that a time commitment search will perform more poorly.

7.2.4 Space of Time Commitments

Despite issues of incompleteness and inconsistency, an elegant aspect of the time commitment is that its domain is a well-structured two-dimensional space of times and probabilities with some nice properties. As shown in Figure 7.4, a time commitment could, in principle, take any combination of time and probability values. However, the feasible space of time-probability pairs is bounded from above and from the left by the maximum feasible probability boundary. Intuitively, if a particular commitment ⟨t, ρ⟩ is feasible, a more conservative commitment that assigns the same probability but at a later time (⟨t′ > t, ρ⟩) must also be feasible.
Similarly, any commitment ⟨t, ρ′ < ρ⟩ that promises a lower probability at time t must also be feasible. As a consequence, the feasible boundary is necessarily nondecreasing as a function of the commitment time.

Definition 7.3. The maximum feasible probability of a commitment $C_{ij}$ made at time t is the highest commitment probability that can be achieved by time t by any policy of agent i given its existing commitments (if any).

[Figure 7.4: The space of feasible time commitments: probability ρ plotted against time t, bounded above by the maximum feasible probability boundary and below by the maximum support-optimal probability boundary.]

Given the application of time commitments to service problems, I also assume that the values of commitments for both the provider and the requester are well structured. For the provider, considering commitments all with probability ρ, a commitment at a later time is never of lower local value than a commitment at an earlier time. Similarly, for any two commitments with the same time value, the provider's value for a commitment with a lower probability is never worse than the provider's value for a commitment with a higher probability. I define the highest probability that achieves the highest possible provider value as the maximum support-optimal probability:

Definition 7.4. The maximum support-optimal probability of a commitment $C_{ij}$ made at time t is the highest commitment probability that can be achieved by time t by any policy of agent i given its existing commitments (if any), without sacrificing any of i's local utility.

The opposite relationships hold for the requesting agent. That is, a higher probability of service provision always results in at least as much requester value as a lower probability of service provision; similarly, earlier time commitments for services can never yield lower requester value than later time commitments. Thus, forming time commitments entails compromise, with the objective of striking a balance (in both the time and the probability dimensions) between the requesting agent's local value and the providing agent's local value. Naturally, this balance should occur between the two boundaries shown in Figure 7.4, where the lower boundary, the maximum support-optimal probability boundary, denotes, for each time t, the highest probability ρ that does not sacrifice any provider utility. The next section presents one methodology for negotiating such a balance.

7.3 Greedy Service Negotiation

I now present an influence-space search algorithm that uses the time commitment abstraction of influence to greedily, but rapidly, converge on a feasible time commitment for each interaction. Inspired by service choreography (Papazoglou et al., 2007), the algorithm takes the form of a pairwise negotiation between a service-providing agent and a service-requesting agent. Although not guaranteed to return optimal time commitments, the negotiation algorithm has several advantageous properties when compared to my optimal influence-space search algorithms, which I list in the paragraphs below.

Greedy Search. Instead of exhaustively exploring the feasible space of influences, the service negotiation algorithm myopically assigns each influence setting one by one, never returning to a previous one. In contrast to OIS, service negotiation performs the equivalent of one depth-first pass down the search tree, at each level greedily selecting the time commitment for each service that maximizes the (heuristic) value associated with that service (without regard to the services negotiated thereafter).
Consequently, greedy negotiation scales linearly in the number of edges in the interaction digraph regardless of the connectivity, making it robust to strongly-coupled problems where the agent scope is high.

Value-based Pruning. The service negotiation algorithm takes advantage of the properties of time commitments described in Section 7.2.4 to prune large portions of the search space. In particular, it accounts for requester utility to rule out time commitments that must be of lower value than those already considered. As such, the service negotiation can be thought of as a branch-and-bound extension to OIS.

Negotiation, Not Enumeration. In contrast to OIS, which dictates that the influencing agent compute its own feasible outgoing influences, service negotiation involves the influenced agent requesting incoming influences in addition to the influencing agent proposing feasible influences. The agents thereby distribute the computational load of influence generation. Additionally, service negotiation begins, for each service, with an initial influence that maximizes the requesting agent's local utility. Although this influence may not be feasible, it starts the search in a fruitful location, and enables swift convergence of the requested and proposed influences.

In the subsections that follow, I develop and evaluate my greedy service negotiation algorithm.

7.3.1 Negotiation Protocol

To plan and coordinate the executions of agents' services, my algorithm utilizes a service choreography protocol. As shown in Figure 7.5, service-requesting agents submit requests to service-providing agents. The requests are dealt with through negotiations between the requester and provider that end in service provision agreements.

[Figure 7.5: Negotiation Protocol. The service requester's composite service goals drive (1) service request initialization; the service provider performs (2) feasibility and quality evaluation and (3) counterproposal; the requester performs (4) request revision; the negotiation ends in service provision agreement formulation.]

As I describe in the sections that follow, for steps 1 and 4, service-requesting agents employ temporal and stochastic planning to reason about the timing of when the services are needed in order for their own temporally constrained goals to be met. Because of the temporal uncertainty and service dependencies, service-providing agents also employ temporal and stochastic planning techniques in steps 2 and 3 to decide what services can be provided at what times and with what likelihoods.

The remainder of this section is structured as follows. In Section 7.3.2, I provide a methodology for service-provider reasoning: how to constrain its policy formulation based on its commitments and, in doing so, evaluate the feasibility of commitments (Figure 7.5, step 2), and how to search the space of commitment values when formulating counterproposals (step 3). In Section 7.3.3, I present a corresponding methodology for service requesters to evaluate counterproposals and formulate new service requests (steps 1 and 4). Having brought together all of the steps of the negotiation protocol, I discuss in Section 7.3.4 how the overarching problem of coordinating the service activities of the system of agents may be solved through commitment convergence. In Section 7.3.5, I provide empirical results on the scalability, and a discussion of the solution quality, of my approach.

7.3.2 Service Provider Reasoning

Next, I describe the inner workings of the negotiation protocol introduced in Figure 7.5.
I begin by showing how service-providing agents can evaluate the feasibility of a received request (step 2 of the protocol) and propose alternative commitments (step 3).

7.3.2.1 Forming Commitment-Constrained Policies

Service agents can solve the local models described in the previous section using standard MDP solution methods to compute execution policies. But in order to adhere to its probabilistic time commitments, a service provider needs to calculate a policy that keeps its promises. For enforcing commitments, I extend the techniques from Chapter 5 to address time commitments. We can directly modify the standard MDP LP from Equation 5.1 to constrain the solution policy to adhere to a set of temporal probabilistic commitments:

$$
\begin{aligned}
\max \quad & \sum_i \sum_a x_{ia} R(i, a) \\
\text{s.t.} \quad & \forall j, \;\; \sum_a x_{ja} - \sum_{a,i} x_{ia} P(j \mid i, a) = \alpha_j \\
& \forall i \, \forall a, \;\; x_{ia} \geq 0 \\
& \forall s, \;\; \sum_{\{i \mid time(i) = t_s \wedge Status_s(i) = F\}} \sum_a x_{ia} \geq \rho_s
\end{aligned}
\qquad (7.1)
$$

Equation 7.1 adds a third constraint, requiring that the committing agent's policy visit states with time $t_s$ and a Finished status of service s with probability no less than $\rho_s$. This constraint exploits the fact that an occupancy measure must equal the probability of ever visiting a state at time t and taking the action. Since states are time-indexed, no more than one state at time t can be visited in any one execution trajectory, nor can the probabilities of visiting any subset of states at time t sum to more than 1. Solving the new linear program yields a policy that is optimal for the committing agent with respect to its commitments to other agents, if such a policy exists. If no such policy exists, the agent is overcommitted, and so the linear program is over-constrained and has no solution. In this case, the LP solver outputs "NO SOLUTION".

7.3.2.2 Commitment Feasibility

When a service request cannot be honored as requested, the LP formulation will find no solution. Rather than replying "no" to the requester, the protocol expects the provider to supply one or more counterproposals that represent alternative requests that it could commit to fulfilling (step 3 in Figure 7.5). In considering the space of possible counterproposals, not all commitment probabilities and times need be considered. In the following sections, I present some techniques to prune suboptimal values from the space of potential commitment counterproposals.

7.3.2.3 Pruning Commitment Times

Recall that, for the service-providing agent, commitments pertain to the potential completion of its tasks. Each task has a certain discrete probability distribution over durations. So, to pick a time at which to promise a task completion with any probability greater than zero, it does not make sense to consider times that are less than the smallest positive-probability duration. In the example problem, the agent cannot complete Task A before time step 1. For tasks that depend on other tasks, we can push the earliest commitment time further forward by adding the minimum durations of all dependent tasks. Task C depends on the completion of Task B, so the earliest time that should be considered for completing C is 2 + 1 = 3. More sophisticated temporal reasoning may be used to push the earliest commitment time forward even further. For example, given an existing commitment by Agent 1 to deliver Service A at time 3, we can deduce that Task A must be started at time 0 and cannot finish any earlier than time step 1. So, given previously established commitments, Service C should not be committed to any earlier than time 4.
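The time-pruning rule just described can be sketched as a simple propagation of minimum durations along the task-enablement chain. The function name and data structures below are illustrative assumptions (the task data mirror Tasks B and C of the running example), and the further tightening that comes from previously established commitments is omitted.

```python
def earliest_commitment_time(task, min_duration, enabled_by, start_time=0):
    """Lower bound on the time by which `task` can first be completed: every
    enabling task in the chain must finish first, each taking at least its
    minimum-duration outcome.  (A sketch of the pruning rule only.)"""
    t = start_time
    chain = []
    while task is not None:          # walk up the enablement chain
        chain.append(task)
        task = enabled_by.get(task)
    for task in reversed(chain):     # accumulate minimum durations, earliest task first
        t += min_duration[task]
    return t


# Data mirroring the example: Task B (min duration 1) enables Task C (duration 2).
min_duration = {"A": 1, "B": 1, "C": 2}
enabled_by = {"C": "B"}              # Task C cannot start until Task B completes
print(earliest_commitment_time("C", min_duration, enabled_by))   # -> 3
```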
Though I do not incorporate this level of reasoning into the implementation I use for my empirical studies (Section 7.3.5), it could be automated by representing the tasks in a temporal constraint satisfaction problem (Dechter, 2003) and applying constraint-tightening techniques (e.g., Tsamardinos & Pollack, 2003).

7.3.2.4 Bounding Commitment Probabilities

Having reduced the commitment space with respect to the time dimension, let us now consider the probability dimension. If the service-providing agent makes a commitment to completing Task A at time 2, it makes sense to set the commitment probability equal to the probability with which it can complete A in two time steps or less: 2/3. If the agent promises a higher probability, it will not be able to meet its commitment. Thus, 2/3 is the maximum feasible probability for Agent 1's commitment to providing A at time 2. The maximum feasible probability (Def. 7.3) of a commitment to service $s_k$ can be computed using a linear program, slightly modified from Equation 7.1, that takes as input the service-providing agent's local MDP with all previously made commitments set to their promised values (denoted $\{\forall s \neq s_k, \langle \rho_s, t_s \rangle\}$), and (using occupancy measures) maximizes the probability of service $s_k$ being delivered at the given time:

$$
\begin{aligned}
\max \quad & \sum_{\{i \mid time(i) = t_{s_k} \wedge Status_{s_k}(i) = F\}} \sum_a x_{ia} \\
\text{s.t.} \quad & \forall j, \;\; \sum_a x_{ja} - \sum_{a,i} x_{ia} P(j \mid i, a) = \alpha_j \\
& \forall i \, \forall a, \;\; x_{ia} \geq 0 \\
& \forall s \neq s_k, \;\; \sum_{\{i \mid time(i) = t_s \wedge Status_s(i) = F\}} \sum_a x_{ia} \geq \rho_s
\end{aligned}
\qquad (7.2)
$$

In this new linear program, $\rho_{s_k}$ is a probability variable (unlike the rest of the $\{\rho_s\}$ constants), and the solution maximizes that probability instead of maximizing local utility (as was the case in Equation 7.1).

7.3.2.5 Forming Counterproposals

In OIS, influencing agents generate the entire space of their feasible outgoing influence settings. Here, I suggest a more efficient (though approximate) alternative for counter-proposing feasible time commitments. Instead of generating all feasible time commitments, let the service-providing agent instead compute feasible influences along its maximum feasible probability boundary (described in Section 7.2.4 and shown in Figure 7.4). When a request is deemed infeasible, the service provider informs the requester of its limitations, thereby taking a useful step forward in the negotiation process. For this purpose, the service provider can use the LP in Equation 7.2 repeatedly to calculate the maximum feasible probability (Def. 7.3) for all relevant commitment times.

Consider the example from Figure 7.2. The first request, for A to be completed by time step 3, can be honored, and a commitment $C_{12}(A) = \langle t = 3, \rho = 1.0 \rangle$ formed. But next, the service provider receives a request from Agent 3 to deliver C by time step 4 (with implicit probability 1). Given the first commitment made to Agent 2, a commitment $C_{13}(C) = \langle t = 4, \rho = 1.0 \rangle$ is not feasible. This is shown in Figure 7.6.

[Figure 7.6: An example of counterproposal: the infeasible request $C_{13}(C) = \langle t = 4, \rho = 1 \rangle$ plotted in the time-probability space, along with the maximum feasible probability boundary of feasible alternatives and the two counterproposals derived from the request.]
As shown in Figure 7.6, C can be delivered at the same time as the original request but with a smaller probability, yielding alternative commitment C′_{13}(C) = ⟨t = 4, ρ = 1/3⟩. Or C can be delivered by a later time, 6, with the same probability as the request, yielding C″_{13}(C) = ⟨t = 6, ρ = 1.0⟩. These two counterproposals give the requester a reasonable sense of the boundary capabilities of the provider near the region of the previous request. Other points along the boundary could be provided, depending on the details of the negotiation algorithm. However, my current implementation finds only these two commitment counterproposals.

The first of the two, C′ = ⟨t, ρ_2⟩, may be calculated using the probability-maximizing LP from Equation 7.2. The second, C″ = ⟨t_2, ρ⟩, requires instead a minimization of feasible commitment time.² In Equation 7.3, I define a MILP that does just this, adding Boolean variables f_t to account for whether or not a commitment is feasible by time t.

² If the provider cannot achieve the requested commitment probability ρ at any time, the second counterproposal is computed to be ⟨t_3 = the earliest time at which ρ_3 can be achieved, ρ_3 = the maximum probability achievable by the time horizon T⟩.

\[
\begin{aligned}
\max\ & \textstyle\sum_{t} f_t \\
\text{s.t. } \forall j:\ & \textstyle\sum_{a} x_{ja} - \sum_{i,a} x_{ia}\, P(j \mid i,a) = \alpha_j \\
\forall i, \forall a:\ & x_{ia} \geq 0 \\
\forall s \neq s_k:\ & \textstyle\sum_{\{i \mid time(i)=t_s \wedge Status_s(i)=F\}} \sum_{a} x_{ia} \geq \rho_s \\
\forall t < T:\ & -1 \leq \left( \textstyle\sum_{\{i \mid time(i)=t \wedge Status_{s_k}(i)=F\}} \sum_{a} x_{ia} \right) - \rho_{s_k} - f_t \leq 0 \\
\forall t:\ & f_t \in \{0, 1\}
\end{aligned}
\tag{7.3}
\]

In Equation 7.3, variable f_t can be set to 1 only if the commitment can be satisfied by time t with its original probability ρ_{s_k}. And so, in maximizing the number of f_t variables that get set to 1, we are effectively minimizing the time by which the commitment can be satisfied. The earliest feasible commitment time is then computed by finding the first f_t variable set to 1 (min_t {t | f_t = 1}). As shown in Figure 7.6, this new MILP allows for exploration of the maximum feasible probability boundary by considering horizontal slices through the commitment space (probabilities) instead of vertical slices (times, as with the LP from Equation 7.2).

7.3.2.6 Issues of Service-Provider Utility

The discussion so far has ignored the fact that service providers also have local utility, separate from the nonlocal utility that they can indirectly increase by fulfilling service requests. With the added consideration of service-provider utility, the space of commitments to consider grows, because the "best" commitment in terms of maximizing total utility might not lie along the maximal feasible probability boundary. That is, by reducing the probability with which it will satisfy another agent's request to a less-than-maximal value, the service provider might be able to develop a policy that improves its own local expected utility enough to more than compensate for the loss in the requesting agent's expected utility. Here I summarize an extension that may be used to factor in the service provider's local utility. Equation 7.4 introduces a new linear program that allows a service provider to compute its maximum support-optimal probability (Def. 7.4), which is the maximum probability for the commitment at a given time that still allows the provider to maximize its own local value.
\[
\begin{aligned}
\max\ & \rho_{s_k} \\
\text{s.t. } \forall j:\ & \textstyle\sum_{a} x_{ja} - \sum_{i,a} x_{ia}\, P(j \mid i,a) = \alpha_j \\
\forall i, \forall a:\ & x_{ia} \geq 0 \\
\forall s:\ & \textstyle\sum_{\{i \mid time(i)=t_s \wedge Status_s(i)=F\}} \sum_{a} x_{ia} \geq \rho_s \\
& \textstyle\sum_{i} \sum_{a} x_{ia}\, R(i,a) \geq V^*_{provider}
\end{aligned}
\tag{7.4}
\]

Note that this is only a slight modification of Equation 7.2: a constraint has been added to ensure that the expected utility of the policy is at least V^*_{provider}, the best local utility achievable by the service provider given its currently-enforced commitments (as computed by applying Equation 5.1 and evaluating the corresponding objective function).

7.3.3 Service Requester Reasoning

Next, I develop methods for requesting services.

7.3.3.1 Request Initialization

To begin the negotiation process, a service requester must formulate an initial request to send to the service provider (step 1 in Figure 7.5). Here I present one method by which all requests may be initialized. A service requester wants to formulate its best possible policy, which it can optimistically formulate by assuming that all of its commitment requests will be satisfied fully, as early as it wants. That is, it can imagine that all providers will agree to commitments at time zero with probability 1, and formulate its own optimal policy accordingly, yielding its maximal local expected utility V^*_{requester}. Then, given that it knows this maximal local expected utility, the requester can turn the optimization problem around to find the latest time for the commitments that can achieve this utility. I have developed a MILP, shown in Equation 7.5, for computing a policy that performs commitment-enabled actions as late as possible while maintaining that the local utility is no worse than V^*_{requester}.

In Equation 7.5, I introduce integer variables y_t ∈ {0, 1} that can only take a value of 1 if a commitment-utilizing action is performed at or before time t with probability greater than 0 (enforced by using a very small ε value). Minimizing the sum of the y values forces commitment-utilizing actions to be performed as late as possible. Upon solving the MILP, the earliest time of such an action may be calculated by finding the first y variable that has value 1: min_t {t | y_t > 0}. This earliest commitment-utilization time returned by the linear program is then used as a relaxation time for the requested commitment. These relaxed requests may still be overly optimistic (in terms of the service providers' capabilities), but at least they do not impose unnecessarily demanding requirements on the providers.

\[
\begin{aligned}
\min\ & \textstyle\sum_{t} y_t \\
\text{s.t. } \forall j:\ & \textstyle\sum_{a} x_{ja} - \sum_{i,a} x_{ia}\, P(j \mid i,a) = \alpha_j \\
\forall i, \forall a:\ & x_{ia} \geq 0 \\
\forall s:\ & \textstyle\sum_{\{i \mid time(i)=t_s \wedge Status_s(i)=F\}} \sum_{a} x_{ia} \geq \rho_s \\
& \textstyle\sum_{i} \sum_{a} x_{ia}\, R(i,a) \geq V^*_{requester} \\
\forall t < T:\ & -1 \leq \left( \textstyle\sum_{\{i,a \mid time(i) \leq t \wedge enables(C,a)\}} x_{ia} \right) - y_t - \varepsilon \leq 0 \\
\forall t:\ & y_t \in \{0, 1\}
\end{aligned}
\tag{7.5}
\]

7.3.3.2 Request Revision

Next I discuss how a service-requesting agent like Agent 3 would process the commitments counter-proposed by a service provider in its negotiations (step 4 in Figure 7.5). Just like the service provider, the service requester can evaluate the utilities of various counter-proposed commitments by solving local commitment-augmented MDPs (using the LP from Equation 7.1 in Section 7.3.2.1) and calculating the expected utilities of their respective solution policies (using Equation 2.7). The (self-interested) objective of the requester is to find the best possible feasible commitment and thereby maximize its local utility. Along these lines, one very simple method of formulating a new request is to evaluate each counterproposal, identify the best one, and request it.
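A minimal sketch of this simple selection strategy follows; `evaluate_commitment` stands in for solving the requester's commitment-augmented local MDP (e.g., with the LP of Equation 7.1) and reading off its expected utility, and is an assumed interface rather than code from the dissertation.

```python
# Sketch of the simple request-revision strategy just described: evaluate
# each counter-proposed commitment with the requester's local model and
# request the one yielding the highest expected local utility.

def revise_request(counterproposals, evaluate_commitment):
    """counterproposals: iterable of (time, probability) pairs.
    evaluate_commitment(time, probability): expected local utility of the
    requester's best policy when the commitment takes that value."""
    best, best_utility = None, float("-inf")
    for (t, rho) in counterproposals:
        u = evaluate_commitment(t, rho)
        if u > best_utility:
            best, best_utility = (t, rho), u
    return best, best_utility

# In the running example, choosing between <t=6, rho=1.0> worth 0.75 and
# <t=4, rho=1/3> worth 1.0 selects the latter.
```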
In my running example, Agent 3 could either choose time 6 with probability 1 (giving it an expected local utility of 0.75) or time 4 with probability 1/3 (giving it an expected utility of 1.0). Agent 3 would then request the latter. A slightly more advanced variation is to further consider a commitment time and probability between the bounds of the counterproposals. The requester can simply interpolate optimistically, computing and evaluating the potential utility of a request whose time is halfway between the two counterproposals and whose probability is equal to the maximum probability of the two counterproposals. In the case of my running example, this optimistically-interpolated request corresponds to commitment C‴_{13}(C) = ⟨t = 5, ρ = 1⟩. Although this interpolated commitment request will not be feasible, in this case the provider will respond with more counterproposals to better inform the requester of the boundary capabilities. By iterating back and forth in this way, the potential commitment time window narrows monotonically and (since time is discrete) the process must terminate when the requester is unable to interpolate further. This strategy of re-requesting is implemented in the commitment convergence algorithm presented in the next section.

From the perspective of the service requester, another response to counterproposals from potential service providers might be to consider them collectively, and accept multiple such proposals. In my running example, had there been a second potential provider for service C, the service request could have gone to it as well as to Agent 1. Assume for a moment that having service C at time 4 is important for the requester. The counterproposal from Agent 1 specifies that, at time 4, there is a probability of 1/3 that service C will be accomplished. If the other provider responded that, at time 4, it could provide C with a probability of 1/2, then the requester has options. It could certainly choose to enlist the other agent to provide C, because of the higher probability. But, assuming that the possible providers are otherwise idle, and that they can pursue C concurrently and independently, the requester could accept both counterproposals, so as to increase the probability that at least one provision of C will succeed to 2/3.

7.3.4 Negotiation-Driven Commitment Convergence

Each request made to a service provider may be handled using the negotiation protocol introduced in Figure 7.5. As in the running example problem, each service-providing agent is first given a sequence of these incoming requests. The idea is to consider each request one at a time, converging on an agreement with the service-providing agent(s) through negotiation before moving to the next request. Our agents therefore search the space of commitments of all service requests greedily, by setting the commitments one at a time. This strategy enables much quicker commitment convergence than would an exhaustive search (but at the potential loss of solution quality).

Pseudo-code for my commitment convergence algorithm is shown in Algorithm 7.1. Each step of the algorithm involves agents solving linear programs (as described in Sections 7.3.2 and 7.3.3) to reason about requests, counterproposals, and optimal local behavior. One by one, each original request is dealt with in a pairwise negotiation between provider and requester. The two agents iterate through sets of potential commitment values and eventually converge on a single agreed commitment for each requested service.
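To make this pairwise iteration concrete before presenting the full pseudo-code, here is a rough sketch of the negotiation loop between a single provider and requester. The `is_feasible`, `max_feasible_probability`, `earliest_feasible_time`, `evaluate_commitment`, and `form_commitment` methods are assumed interfaces standing in for the LP/MILP computations of Sections 7.3.2 and 7.3.3; the interpolation step mirrors the re-requesting strategy of Section 7.3.3.2, which is what makes the number of iterations logarithmic in the time horizon.

```python
def negotiate_commitment(request, provider, requester):
    """Rough sketch of one provider-requester negotiation (cf. Algorithm 7.1).
    `request` is a (time, probability) pair; `provider` and `requester`
    expose placeholder methods backed by the LPs/MILPs described above."""
    t, rho = request
    while not provider.is_feasible(t, rho):
        # Counterproposals along the maximum feasible probability boundary:
        # same time with a lower probability (Eq. 7.2), and the same
        # probability at a later time (Eq. 7.3).
        cp_prob = (t, provider.max_feasible_probability(t))
        cp_time = (provider.earliest_feasible_time(rho), rho)

        t_lo, t_hi = cp_prob[0], cp_time[0]
        if t_hi - t_lo <= 1:
            # No room left to interpolate: request the better counterproposal.
            t, rho = max((cp_prob, cp_time),
                         key=lambda c: requester.evaluate_commitment(*c))
        else:
            # Optimistic interpolation: halfway time, maximum probability.
            t, rho = (t_lo + t_hi) // 2, max(cp_prob[1], cp_time[1])
    return provider.form_commitment(t, rho)
```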
This convergence of commitment values is guaranteed (in a number of iterations logarithmic in the problem time horizon) because of the methods agents use for counter-proposing and re-requesting.

As is typical of greedy algorithms, a drawback of this particular commitment negotiation algorithm is that, by greedily maximizing the utility associated with the current commitment to a service provision, it can sacrifice potential solution quality of later service provisions. In my running example, if the service requests are handled in the order in which they are shown in Figure 7.2.1, negotiation yields commitments C_{12}(A) = ⟨t = 4, ρ = 1.0⟩ and C_{13}(C) = ⟨t = 5, ρ = 2/3⟩. Given that the completion of Task A by time 3 is worth a local utility gain of u_2 to Agent 2 and the completion of Task C by time 4 is worth a local utility gain of u_3 to Agent 3, these two commitments together provide the requesters a total expected gain of u_2 + (1/2)u_3.

If we were to reverse the order in which the requests are considered in the example problem, the negotiation protocol brings us to a different set of commitments. A commitment C_{13}(C) = ⟨t = 4, ρ = 1.0⟩ will be made to Agent 3, promising the completion of Task C by time step 4. But when the provider next negotiates with Agent 2, it can only make commitments involving the execution of Task A after Tasks B and C; otherwise its first commitment would be violated. In the example, this results in Task A finishing at time 4 with probability 1/3. And completion of Task A after time 4 does not benefit Agent 2 at all. Thus, by using this alternate request order, negotiations converge on a set of commitments that provide the requesters a total gain of (1/3)u_2 + u_3.

Which ordering produces the better solution depends upon the relative utility benefit values u_2 and u_3. Specifically, the first commitments are preferable when u_2 is worth at least 1/2 of u_3, but otherwise the other commitments would be preferred. Although additional ordering heuristics could be overlaid on top of the greedy protocol described here, it is difficult to ensure in general that the right ordering will be attempted. Furthermore, the optimal set of commitments might not be achievable by greedy convergence using any ordering. It may be that two requesting agents will receive the greatest collective utility if they both compromise on the probability and/or time of the competing services being provided to them. Such a compromise can only be achieved by considering both potential commitments simultaneously (instead of considering them one-by-one as with the greedy algorithm).

Algorithm 7.1 Greedy Request-Based Search for Commitments
1: procedure GRBS(p, agents)                     ⊲ Input: problem p, service agents
       ...Initialization...
2:   C ← ∅                                       ⊲ The commitment set (stored by all agents)
3:   for each agent ∈ agents do
4:     requests ← agent.FormInitialRequests(p)   ⊲ [see Sec. 7.3.3.1]
5:     agent.CommunicateRequests(requests, agents)
6:   end for
       ...Greedy Commitment Convergence...
7:   for each provider ∈ agents do
8:     for each r ∈ provider.ReadIncomingRequests() do
9:       requester ← r.sender
10:      acceptable ← provider.EvaluateFeasibility(r, p, C)     ⊲ [7.3.2.1]
11:      while acceptable = false do
12:        cp ← provider.GenerateCounterproposals(r, p, C)      ⊲ [7.3.2.5]
13:        requester.EvaluateAndMemorize(cp, p, C)
14:        r ← requester.GenerateNewRequest(cp, p, C)           ⊲ [7.3.3.2]
15:        acceptable ← provider.EvaluateFeasibility(r, p, C)
16:      end while
17:      r ← requester.RelaxRequest(r, p, C)                    ⊲ [7.3.3.1]
18:      c ← provider.FormCommitment(r)
19:      provider.CommunicateCommitment(c, agents)
20:      C ← C ∪ {c}                    ⊲ New commitment added (by all agents)
21:    end for
22:  end for
       ...Optimal Local Policy Formulation...
23:  for each agent_i ∈ agents do
24:    π*_i ← agent_i.ComputeConstrainedOptimalPolicy(p, C)
25:  end for
26:  return π ← ⟨π*_1, ..., π*_n⟩                 ⊲ Output: joint policy
27: end procedure

7.3.5 Empirical Results

Two separate empirical studies follow. In the first, I analyze the scalability of greedy service negotiation and compare its solution quality with that of an MMDP solver on several sets of randomly-generated service problems. In the second, I perform a very preliminary comparison with OIS on the problem sets tested in Chapter 6.

7.3.5.1 Comparison with MMDP Solver

The motivation for developing a greedy service negotiation approach is to be able to solve larger, more complex problems well with less computational effort. To this end, I evaluate how scaling up problem difficulty affects runtime, and how the greedy and approximate techniques impact solution quality. In this comparison, which summarizes my published results (Witwicki & Durfee, 2009), I compare greedy service negotiation against an MMDP (Boutilier, 1996) solver. My solver uses the same machinery as the "centralized MILP" approach described in Section 6.5.1.1, but does not constrain agents' observability. Instead, it computes a joint policy that assumes that each agent observes the full global state at every time step. In contrast to my service negotiation algorithm, the MMDP is a centralized planning model that finds optimal joint policies by simultaneously accounting for all agents' policy decisions.

Greedy service negotiation exploits the largely decoupled structure in service coordination problems, but produces only approximately-optimal solutions. Because the MMDP does not exploit structure or approximation as my approach can, its runtime should be viewed more as providing a worst-case bound on computational effort. On the other hand, because the MMDP solver produces optimal joint policies that assume each agent has full global state awareness at all times, the expected qualities of its joint policies provide a best-case bound on the agents' collective performance. In contrast, greedy service negotiation assumes that agents only know their local state and whether other agents have succeeded or failed in meeting their time commitments. Thus, while the bounds are not tight, the MMDP solver provides well-defined performance bounds against which to compare greedy service negotiation.

I begin with results that demonstrate the scalability of greedy service negotiation. Figure 7.7 shows the runtime on variations of exactly the example problem (presented in Figure 7.2.1), where each variation is scaled up by simply stretching out the timing of all tasks³ and extending the time horizon accordingly (from T=8 to T=96).

³ Tasks maintain the same number of discrete durations, but each possible duration is scaled.

This leads to larger MDPs, more LP constraints, and potentially
252 more iterations of commitment requesting and counter-proposing. As can be seen in Figure 7.7, the algorithm remains tractable for time horizons as a large as T=96 (at which point CPLEX is solving constrained MDPs with over 10,000 states), converging on commitments in a minute or less. I compare this runtime with solving the Multiagent MDP, which scales much worse with the problem time horizon, taking hours to return the optimal solution (utility = 6.0) for time horizons of greater than 40, whereas the commitment-based algorithm returned a near-optimal solution (utility = 5.5) in under a minute for problems with horizons as large as 96. This result provides some evidence that, although approximate, greedy service negotiation can produce reasonable solutions tractably, scaling gracefully with the problem time horizon. Figure 7.7: Scalability: problem time horizon. Next, I scale the local complexity of the example problem by adding random local tasks , each of which is not enabled by other agents’ services and may not be requested by other agents. This has the consequence that the agents’ problems are tied to one another with the same interaction structure as in the running example. But each of the agents problems becomes more complicated with additional local tasks (that may accrue utility) and additional local dependencies between internal tasks and services. Through random task additions, I automatically generate sets of random problems 253 (25 for each data point). Details of the problem generation schemes used throughout this section are provided in Appendix B. Figure 7.8 shows the results achieved on these augmented problems. Average running time is plotted on a log scale. This experiment offers strong evidence that greedy service negotiation scales to problems with increasingly complex local behavior. For random locally-augmented problems with more than 8 tasks, the MMDP model (not shown) takes more than an hour to solve on average. Note that as more local tasks are added, the agents’ individual decision problems are becoming larger, but also more weakly-coupled from one another. It is this weak coupling (as discussed in Section 7.2.3) that enables commitment-based negotiation to remain so much more efficient. Figure 7.8: Scalability: local complexity. In one more scalability test, I increase the size of the example problem by adding additional service-requesting agents. Maintaining the single-service-provider structure, randomly-generated agents are added, each with a small number of local tasks and a single service requirement. Further details are included in Appendix B. The average runtime for 25 random problems per data point is plotted in Figure 7.9. This time, notice that my commitment negotiation approach scales roughly linearly with the 254 number of agents. The MMDP scales exponentially, and therefore quickly becomes intractable. Figure 7.9: Scalability: number of agents. An alternative metric common in multiagent and service-oriented systems is the number of messages passed between agents. With my negotiation protocol, the number of messages scales linearly with the number of service provision relationships, regardless of the number of agents involved. Each relationship requires a minimum of 2 messages (for request and agreement) and can require a number of messages logarithmic in the number of time points in the worst case, if the requesting agent continues to request the optimistic interpolation (as described in Section 7.3.3.2). 
Of course, the number of messages can be decreased by creating more informative messages (and incurring the costs of forming those messages). For example, if the provider replies at the outset with the entire feasible probability boundary, then no iteration is needed. The computational benefits of greedy service negotiation over the MMDP come at a price in terms of the potential quality of the agents’ joint solution. Figure 7.10 shows the difference in quality by empirical comparison on a set of 25 randomly-generated service problems. Each problem contains 3 agents and a total of 9 tasks randomly distributed between the agents. There are 3 random enablement NLEs, but unlike in 255 the example problem, the services are not all provided by the same agent. Agents are not exclusively providers or requesters. The dependencies between each agent’s local tasks are random as are the task duration distributions. More details are provided in Appendix B. Although none of the individual problems are based on actual real-world service composition scenarios, I sought to generate a set of problems representative of a wide range of potential 3-agent scenarios, remaining impartial about characteristics such as service composition hierarchy, tightness of timing, and distribution of local utility. This evaluation provides preliminary evidence that my algorithms may produce coordinated, high-quality solutions for a variety of service composition problems. Figure 7.10: Average solution quality on 25 random problems The height of the bars represent solution quality, measured as the sum of the expected local utilities of the 3 agents. The corresponding error bars represent standard deviation in the solution qualities across the 25 problems. The number under each bar represents the average time to converge or (in the case of the MMDP) to compute the optimal solution. The left-most bar in Figure 7.10 indicates the average quality of a solution approach in which agents plan with completely-independent local models that do not consider any possibility of service provision from other agents. That is, agents build optimal local policies around an empty set of commitments. This approach serves as a lower bound over which coordinated agent behavior should rise. 256 And as shown in Figure 7.10, greedy service negotiation performs significantly better. My approach performs nearly as well as the optimal MMDP solution approach in a fraction of the computation time. Greedy service negotiation yields coordinated policies for these random service problems, achieving higher solution quality than that of uncoordinated policies, but this solution quality is, on average, lower than that of the optimal MMDP solution. Reasons for this gap in solution quality include the following. As described in Section 7.3.4, commitment values are converged upon greedily, one by one, and in a fixed order. I have included an intermediate data point to account for a portion of this loss of quality. The bar labeled “Optimally-Ordered Commitment Negotiation” represents the commitment convergence algorithm performed on all possible orderings of commitments so that the highest-quality joint policy (corresponding to the optimal commitment ordering) is selected. However, this approach still makes greedy choices (given the optimal ordering). The MMDP formulation, on the other hand, always makes the correct choices and always converges on the globally optimal joint policy for the agents. 
It always finds the best balance of service provision to multiple service requesters, as well as the best balance of provider and requester utility.

Another drawback as compared with the MMDP formulation is that, in order to achieve compactness and efficiency, agents' time-commitment-based models make some approximations of nonlocal agent behavior. For example, the agents forgo potential flexibility and sacrifice potential expected utility by representing each service commitment with just a single time and probability. That is, unlike the MMDP, which assumes agents have global awareness and can react suitably when a service is provided earlier (or later) than planned, my approach (as described) only allows agents to model and react at the service's committed time. However, there is nothing in the negotiation protocol that precludes making multiple (conditional) commitments at different times for each request. It is the subject of future work to combine service negotiation with richer influence representations. However, this would likely enlarge the influence space, and so should be done with care.

7.3.5.2 Comparison with OIS

I now present a very preliminary comparison of greedy service coordination and OIS. First, I demonstrate the superior scalability of greedy service negotiation on a set of 50 random problems (per number of agents) whose interaction digraphs have the chain topology (shown in Figure 6.15). Problems are generated using the same generator as described in Section 6.5.6, with randomly selected nonlocally-enabling tasks set as services. Figure 7.11 plots the computation times of OIS and greedy service negotiation on a logarithmic scale. The plot only extends to 10 agents, but greedy service negotiation is capable of scaling to hundreds. Regardless of the interaction digraph topology, greedy service negotiation scales linearly. Moreover, notice that for 2-agent problems, greedy service negotiation performs more than an order of magnitude faster than OIS, due to the advantages that I listed in Section 7.3.

Figure 7.11: Scalability: OIS vs. Greedy Service Negotiation on "chain" topology.

However, it is not fair to compare only the runtimes of these two algorithms. In addition to imposing stricter restrictions on the problems than OIS does, greedy service negotiation returns approximate solutions. Figure 7.12 shows the quality of solutions returned by greedy service negotiation relative to those returned by OIS. On top, the average absolute solution quality of both methods is plotted. On the bottom, the "Percentage Optimal" refers to the average ratio of returned solution quality to optimal solution quality. Clearly, greedy service negotiation trades some quality for its faster computation. However, further analysis is required to characterize the trade-off.

Figure 7.12: Solution quality of Greedy Service Negotiation on "chain" topology (top: mean value of returned solution; bottom: mean percentage of optimal; T=5, tasksPerAgent=3, localWindowSize=0.5, uncertainty=0.5, NLATs=1, influenceType=state).
259 CHAPTER 8 Conclusions The focus of this dissertation has been on the development of a general framework for abstracting agents’ influences and coordinating using those abstractions, with the ambition of advancing the state of the art in efficiency and scalability of planning for teams of weakly-coupled agents under uncertainty. Towards this goal, I have made several contributions to the field, each of which I summarize in Section 8.1. My work has also raised new research questions, several of which I discuss in Section 8.2. I conclude in Section 8.3, reflecting upon the accomplishments of this dissertation with respect to my longer-term aspirations. 8.1 Summary of Contributions In the subsections that follow, I organize my contributions into four thrusts, each of which is an integral component of my coordination framework. These thrusts and their constituent contributions correspond to the work that I have presented in Chapters 3, 4, 5, and 6. 8.1.1 Identifying Structure Chapter 3 focuses on the identification of weakly-coupled structure in transitiondependent problems. ❼ I have defined a model (the TD-POMDP), for teams of transition-dependent agents, that articulates several elements of exploitable structure that were previously exploited only in more restricted problem classes. This structure gives rise to an abstraction of transition probabilities and a subsequent decomposition of the joint model into efficiently-solvable local models. 260 ❼ I have developed a characterization that brings together three complementary aspects of weakly-coupled problem structure present in the TD-POMDP model. Along with this characterization, I contribute theory on the complexity of solving weakly-coupled problems, by exploiting the three aspects in concert, dependent on the degree to which each aspect of weakly-coupled structure is present. Aside from promoting a better understanding of the structural exploitations of other approaches, this theory motivates my investigation of influence-based abstraction as a method for reducing the size of the search space required for coordinating optimal behavior. It also guides my empirical analysis of the conditions under which influence-based abstraction is most effective at reducing the search space and most efficient in practice. 8.1.2 Abstracting Influences Chapter 4 focuses on the abstraction of influence information that is sufficient for optimal local planning and reasoning. ❼ I have developed a novel best-response model for TD-POMDP agents, the complexity of which depends on the number of shared state features regardless of either the number of agents or the complexity of peers’ behavior. ❼ Out of my best-response model comes a novel representation for policy ab- straction, influence, that encodes information from peers’ policies sufficient for optimal local reasoning, and whose encoding size is also a function of the number of shared state features irrespective of the number of agents. The conceptual contribution is the (later empirically validated) insight that, by formalizing agents’ transition influences, an influence space emerges that is potentially more efficient to search than the policy space, yet is still amenable to optimal solutions. ❼ In Section 4.6, I have presented an empirical analysis of the size of the feasible influence space in relation to the size of the policy space. The results contribute towards characterizing the circumstances under which influence-based policy abstraction is most advantageous. 
8.1.3 Proposing and Evaluating Influences Chapter 5 presents a principled approach to constraining policies around influences. ❼ By conceptually linking the probabilistic information encoded in the MDP LP occupation measures with that encoded in agents’ influences, I have formalized a 261 mapping between influence and policy. In one direction, an agent can evaluate the influence settings that its policy implies by solving a linear program corresponding to its best response and evaluating a formula for each influence parameter. In the other direction, an agent can constrain its policy to directly adhere to a proposed influence. ❼ I have proven that an alternative approach, reward shaping, though potentially more efficient than LP-based constrained policy formulation, is not guaranteed to enforce a prescribed influence (even if that influence is feasible) regardless of the amount of parameter tuning. Moreover, while reward shaping may only produce approximately optimal local policies with respect to an influence, influenceconstrained policy formulation ensures that the local policies will be optimal among policies that achieve the prescribed influence. ❼ I have developed a method for enumerating an agent’s space of feasible outgoing influences efficiently. Instead of having to consider each of its policies explicitly, my algorithm solves a number of MILPs that is linear in the number of feasible influence points. ❼ In employing influence constraints to perform a number of different functions in Chapters 5 and 7, I have contributed an arsenal of constrained policy formulation techniques that may be adapted and extended to solve other decision-making problems involving behavioral constraints. 8.1.4 Coordinating Influences ❼ My primary contribution in Chapter 6 is the development of my optimal influence- space search algorithm, which combines components from each of the earlier chapters so as to decompose the policy formulation problem into a series of wellordered influence generation and influence evaluation steps that are guaranteed to search all feasible combinations of agents’ influences, one of which I have proven must correspond to the optimal joint policy. ❼ I have performed an empirical comparison of the computation time of optimal influence-space search with that of several other optimal algorithms, thereby identifying the strengths and weaknesses of my methodology. In doing so, I contribute compelling evidence in support of my hypothesis that optimal influence-space search provides significant computational advantage over existing 262 methods in the computation of optimal solutions for certain classes of weaklycoupled problems. Moreover my analysis evaluates the circumstances under which influence-space search gains the most traction in practice, providing data with which researchers and practitioners can make informed decisions about the suitability of adapting or applying influence-based abstraction and optimal influence-space search to their own problems. ❼ In Section 6.6.3, my initial study of exploiting agent scope in combination with degree of influence suggests the following: For problems with a low degree of influence and a fixed agent scope, influence-space search can achieve scalability in the number of transition-dependent agents well beyond the state of the art in optimal policy computation. Moreover, this portion of my work contributes a novel application of Dechter’s Bucket Elimination to influence-space generation and optimization. 
❼ I present additional evaluations in Chapter 7 that contribute evidence of the efficacy of approximate influence-space search to reducing computation while still achieving near-optimal solution quality. 8.2 Open Questions This work has uncovered a number of interesting questions that were beyond the scope of this dissertation, but that are candidates for potentially-fruitful investigation in future work. 8.2.1 Quality-Bounded Influence Space Search I have devoted this dissertation primarily to the study of optimal influence-based methods. To date, I remain compelled by the formal guarantees that optimal algorithms provide. However, I am also interested in developing approximate algorithms with bounds on quality loss. Intuitively, the approximation of the influence probability space (Section 7.1) should, under some conditions, provide such guarantees. Inherently, it already guarantees consideration of an influence whose parameter settings are within some bound of the settings of the optimal influence point. Further investigation is needed to determine whether or not we can also bound quality, and whether or not those bounds would be useful in practice. 263 8.2.2 Influence Encoding Compaction Based on my characterization of influence encodings (e.g. state-dependent, historydependent, influence-dependent) in Section 4.3, different problems may call for different sufficient encodings. In particular, my empirical analysis in Section 4.6 considered two classes of problems: one in which a history- and influence-dependent encoding was needed and another in which a state-dependent encoding sufficed. I justified this latter sufficiency for acyclic cases with the knowledge that the nonlocal features were all event-driven (Def. 4.21), and with my theoretical result from Section 4.4. In my experiments, I manually set the state-dependent flag for such cases. What I did not do is to automate the determination of the appropriate encoding. Clearly, a smaller encoding is preferred for compactness of best-response models, for efficiency of influence generation, and for reduced influence space size (as my empirical results in Section 4.6.2.4 suggest). In the interest of automation, I pose the following questions: ❼ How can we automatically diagnose inefficiencies in the influence encoding and reduce the size of the encoding (e.g., by eliminating unnecessary variables, removing unnecessary connections, or instead revise the semantics of the encoding to allow a smaller number of parameters)? ❼ What are the circumstances under which extremely compact encodings (e.g., time commitments developed in Section 7.2) yield optimal solutions? ❼ Is there additional problem structure that could be exploited to reduce the size of the influence DBN? 8.2.3 Other Applications of Influence Abstraction In this dissertation, I have demonstrated that, for the problem of planning weaklycoupled cooperative agents’ decisions, influence-based abstraction is a powerful tool for decomposing one large joint problem into smaller, more efficiently-solvable local problems. The decomposition is made possible by the fact that, given few shared features, the resulting compact model of influence is a sufficient summary of nonlocal information for making local predictions. Note that neither decomposition nor prediction-making are specific to planning; nor are they specific to cooperative agents. 
I believe that there are other weakly-coupled problems that could similarly benefit from my influence formalism (as well as the structural aspects of my TD-POMDP model), such as the following: 264 ❼ learning (assuming that the structure of agents’ interactions is known a priori); ❼ reasoning about influences of adversaries in a competitive domain; ❼ incentivizing agents’ adoption of socially-beneficial influences (in the context of mechanism design). 8.3 Closing Remarks The research that I have presented in this dissertation is largely motivated by the long-term vision of applying multiagent sequential decision making models and techniques to solving real-world problems. Realizing this vision requires, among other things, bridging the gap between the limitations of the current state of Dec-POMDP research (which as of yet has been restricted to small toy problems involving few agents, or restrictive forms of interactions, or no bounds on solution quality) and the objectives of practitioners that might ultimately apply the Dec-POMDP technologies. My work aspires to narrow the gap by extending the state-of-the-art in efficient computation of optimal solutions to teams of weakly-coupled transition-dependent agents. Although small in respect to the size of this gap, my influence-based abstraction approach has inched out beyond the reach of other methods, computing solutions faster and scaling to more agents that was previously possible on a small, but well-characterized space of problems. I have thus accomplished that which I set out to accomplish in this dissertation. This achievement is a direct result of exploiting structure in transition-dependent problems. In fact, the greatest advances in agent scalability were made possible by simultaneously leveraging two different aspects of structure: locality of interaction and degree of influence; moreover, exploiting both yielded far greater gains that exploiting either one individually. This complementarity of structural exploitations, and of algorithmic frameworks, gives me hope that, by identifying and exploiting more structure, the field of Dec-POMDP research will one day close the gap, and successful applications will be realized. All it may take is a few more complementary advancements. 265 APPENDICES 266 APPENDIX A Comparison of EDI-DEC-MDP and TD-POMDP The Dec-MDP with Event-Driven Interactions (Becker et al., 2004a), otherwise known as the Event-Driven Dec-MDP (EDI-Dec-MDP), is the most closely-related subclass to that of the TD-POMDP. Here, I present the technical details of EDI-DecMDP and prove that it is no more general than the TD-POMDP. A.1 Event-Driven Dec-MDP Model The EDI-Dec-MDP is specified by the tuple M = hN , S, A, P, R, Ω, O, T, {dkij }i, wherein the usual suspects are as follows: N is the set of agents, S is a set of world states, A = A1 × A2 × ... × An is the joint action space, P : S × A × S → [0, 1] is the transition function, R : S ×A → Rn is the joint reward function, Ω = Ω1 ×Ω2 ×...×Ωn is the joint observation space, O : S × A × Ω → R is the observation function, and T is the finite horizon. Like the TD-POMDP, the world state S is factored into local state components Si . However, the EDI-Dec-MDPs factoring assumes no sharing of state features among local states: S = ×i∈N Si . Additionally, the observation function is restricted such that the EDI-Dec-MDP is locally fully observable (Definition 2.8): ∀oi , ∃si |P r(si |oi ) = 1. 
Further, the reward function is restricted such that EDI-DecMDP agents are reward independent (Definition 2.11) such that local rewards combine P by summation to equal the joint reward: R(s, a, s) = i∈N Ri (si , ai ). Event-Driven DEC-MDPs have structured transition dependencies Becker et al. (2004a), the set of which is denoted as {dkij }. In particular, one agent may influence the local state transitions of another through the occurrence of a proper event. Definition A.1. A primitive event e = (si , ai , s′i ) is a triplet of state, action, and outcome state that may occur in agent i’s execution history Φi = [s0i , a0i , s1i , a1i , ...]. 267 Definition A.2. An event E = {e1 , e2 , ..., eh } is a set of primitive events that is said to occur (Φi |= Eik ) in an execution sequence if one of the primitive events occurs in the execution sequence. Definition A.3. A primitive event is proper if it can occur at most once in any possible history. An event E = {e1 , e2 , ..., eh } is proper if all of its primitive events are proper and no two primitive events can both occur in any possible history. Interactions among EDI-Dec-MDP agents thereby occur though event dependencies of the form dkij = Eik , Djk , whereby an event in Eik brings about a change in the transitions, Djk (which is made up of state-action pairs), of agent j. Dependency satisfaction is captured by Boolean variable bksj ,aj , which is true when an event in Eik has occurred. Subsequently, the transition function P of a EDI-Dec-MDP is structured such that agent j’s local transition function Pj , in addition to depending on local state sj , depends on the nonlocally-affected boolean variable bksj ,aj . As such, the agents’ local state transitions are independent of one another with the exception of event dependencies captured by the bksj ,aj variables. A.2 Complexity of EDI-Dec-MDP Allen (2009) has recently proven that the computational complexity of the EDIDec-MDP is NEXP-complete, which is the same complexity class as the Dec-POMDP. This means that, to solve the EDI-Dec-MDP requires computation time irreducibly exponential, and space unbounded, in the model size the model size kMedi k). A.3 Reduction of EDI-Dec-MDP to TD-POMDP Any EDI-Dec-MDP can be reduced to an equivalent TD-POMDP simply by treating the event-driven boolean features bksj ,aj as nonlocal features. Theorem A.4 formalizes this sentiment. Theorem A.4. EDI-Dec-MDP ≤EXP TD-POMDP. Proof. Let us treat the reduction to the equivalent TD-POMDP td td td td U L Mtd = N td , {Sjtd }, {Atd j }, {Ωj }, {Oj }, {Rj }, {m̄j }, {Pj }, {Pj }, T (as per definition 3.15) one component at a time: ❼ N is identical to N edi . 268 ❼ Let the TD-POMDP world state include a boolean feature nj = bksj ,aj for every EDI-Dec-MDP even dependency variable bk . Further let each nj be a nonlocal feature controlled by agent i (where i is the other agent reference in bk . As such, each local state space of the TD-POMDP Sjtd is the local state space in the EDI-Dec-MDP augmented with each corresponding nonlocal features nj and each nonlocal feature of some other agent ni controlled by j. The remaining features in Sjtd are treated as locally-controlled features (Def. 3.12). As of yet, P k we have used time O(kdkij k) and space O( j∈N 2kdij k·kSj k ). edi ❼ Atd j is identical to Aj . edi ❼ Ωtd j is identical to Ωj . ❼ Ojtd is equivalent to Ojedi , such that agent j’s observations depend only on the edi local features of std j (or equivalently, all the features of sj ). 
This reduction is valid since the EDI-Dec-MDPs local full observability makes Ojedi more restrictive in its representation. ❼ Rjtd is identical to Rjedi (with the same conditional independence of nonlocal features as is the case in Ojtd ). ❼ m̄td j is the union of the set of nonlocal features {nj }, each controlled by another agent i, that influence agent j, and the set of nonlocal feature {ni }, each controlled by j. Since there is one nonlocal feature per dependency, each involving 2 agents, we will have performed a number of steps and used space proportional to to 2 · k{dkij }k. ❼ The TD-POMDP’s uncontrollable feature transition function PjU will be defined over an empty set since we are treating all features as locally-controlled. ❼ The only difference between the TD-POMDP’s local transition function PjL and Pjedi is that for Pjedi is defined over a smaller set of states, each of which do not encode the Boolean dependency variables of the form bksi ,ai related to agent j’s events but that influences i. Since PjL is defined over a state space that includes these these additional variables, PjL must expand the necessary transition information for each combination of value of bksj ,aj for each state s according to the whether or not the corresponding event Ejk has occurred. The k total size of the TD-POMDP local transition function PjL is kPjedi k · 2kdij k in the worst case, so this step of the reduction takes time and space proportional to k kPjedi k · 2kdij k . 269 ❼ T td = T edi . Upon completing the steps above, the result is a completely-specified TD-POMDP that captures the same semantics of the EDI-Dec-MDP from which it was reduced. k Since each step in the reduction takes at most O(kPjedi k · 2kdij k ) time an space, and each step has been validated (as per the semantics of each model component), the claim made in Theorem A.4 is proved. Corollary A.5. The TD-POMDP is NEXP-hard. Proof. The EDI-Dec-MDP is known to be NEXP-complete (Allen, 2009), which implies that the complexity of solving the EDI-Dec-MDP is irreducibly exponential in the model size kMedi k. By Theorem A.4, any EDI-Dec-MDP Medi can be solved by reducing it to an equivalent TD-POMDP Mtd , an operation which takes exponential time and space in kMedi k at worst, and solving the TD-POMDP. Since the computation required to solve the original EDI-Dec-MDP is no easier than exponential, the exponential computation required to reduce the problem does not affect the asymptotic complexity of solving the reduced problem Mtd , which must be at least NEXP. Therefore, the TD-POMDP is NEXP-hard. 270 APPENDIX B Random Service Problem Generation Local Complexity Scale-up Experiment. For the evaluation of local complexity, shown in Figure 7.8, tasks were generated and added one by one to each agent’s local problem (from the running example). These random local tasks were generated as follows. Each task duration distribution was computed by selecting an interval [1, k] of time units, where k was randomly chosen from a normal distribution centered around T , and then invoking a uniformly random number of possible durations from that 2 interval, each randomly valued within the interval and randomly assigned a probability (derived from normalizing a set of random numbers). For each local task added to an agent’s local problem, that task was probabilistically connected (via dependency) to an existing task. 
That is, with a probability of 0.3 the task was made to enable an existing task chosen at random with equal probability (but with the constraint that the existing task wasn’t already enabled by another task). Next, with a probability of 0.3 the task was made dependent on an existing local task chosen at random with equal probability. The utilities of these additional local tasks were selected uniformly randomly in the interval [0,3]. Agent Scale-up Experiment. For the evaluation of number of agents, shown in Figure 7.9, additional requesting agents were generated and added one by one to the example service coordination problem. These agents’ problems were generated as follows. First, a random number n of tasks was selected in the interval [1, 6] with equal probability. Next, one by one, each of the n tasks was randomly generated and added to the new agent’s problem exactly as dictated by the random task generation scheme in the preceding paragraph. Finally, one of these tasks was chosen at random 271 as being dependent on one of the services (selected randomly from those) provided by the service-provider. Utilities of tasks requiring the service of another agent were valued uniformly randomly over the interval [2, 5] and utilities of all other tasks were valued uniformly randomly over the interval [0, 3]. Solution Quality Experiment. For the evaluation of solution quality, shown in Figure 7.10, problems were generated randomly from scratch. Each of the problems was initialized with 3 agents, each containing empty local problems. 9 randomlygenerated tasks (as generated using the scheme in the local complexity experiment) were randomly distributed between the 3 agents. Next random service dependencies were introduced to connect the agents problems to one another. Three service enablements were imposed, each connecting two randomly selected tasks (constrained to come from two different agent’s local problems). The effect was to create random service compositional hierarchies. For each of these random coordination problems, there was no longer (necessarily) a single service-provider. Each agent had the potential of both providing services and requesting services. The utilities of tasks in these problems was valued as in the preceding paragraph. That is, utilities of tasks requiring the service of another agent were valued uniformly randomly over the interval [2, 5] and utilities of all other tasks were valued uniformly randomly over the interval [0, 3]. 272 BIBLIOGRAPHY 273 BIBLIOGRAPHY Allen, M. (2009). Agent Interactions in Decentralized Environments. Ph.D. thesis, University of Massachusetts, Amherst, Massachusetts. Amato, C., Bernstein, D., & Zilberstein, S. (2007). Optimizing memory-bounded controllers for decentralized POMDPs. UAI , (pp. 1–8). Atlas, J. (2009). A distributed constraint optimization approach for coordination under uncertainty. In Autonomous Agents and Multiagent Systems (AAMAS-09), (pp. 1263–1264). Barto, A. G., & Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13 (4), 341–379. Beaty, D., Grady, M., May, L., & Gardini, B. (2008). Preliminary planning for an international mars sample return mission. Report of the iMARS Working Group, (pp. 1–60). http://mepag.jpl.nasa.gov/reports/iMARS_FinalReport.pdf. Becker, R. (2006). Exploiting Structure in Decentralized Markov Decision Processes. Ph.D. thesis, University of Massachusetts Amherst. Becker, R., Zilberstein, S., & Lesser, V. (2004a). 
Decentralized Markov decision processes with event-driven interactions. In Autonomous Agents and Multiagent Systems (AAMAS-04), (pp. 302–309). Becker, R., Zilberstein, S., Lesser, V., & Goldman, C. (2004b). Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research, 22 , 423–455. Becker, R., Zilberstein, S., Lesser, V., & Goldman, C. V. (2003). Transitionindependent decentralized Markov decision processes. In International Conference on Autonomous Agents and Multi Agent Systems, (pp. 41–48). Melbourne, Australia. Bellman, R. E. (1957). Dynamic Programming. Princeton University Press. Bernstein, D., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27 (4), 819–840. Bernstein, D., Hansen, E., & Zilberstein, S. (2005). Bounded policy iteration for decentralized POMDPs. IJCAI , (pp. 1287–1292). 274 Bernstein, D. S. (2005). Complexity Analysis and Optimal Algorithms for Decentralized Decision Making. Ph.D. thesis, University of Massachusetts Amherst. Bernstein, D. S., Amato, C., Hansen, E. A., & Zilberstein, S. (2009). Policy iteration for decentralized control of Markov decision processes. Journal of Artificial Intelligence Research, 34 , 89–132. Bernstein, D. S., Zilberstein, S., & Immerman, N. (2000). The complexity of decentralized control of Markov decision processes. In Uncertainty in Artificial Intelligence, (pp. 32–37). Stanford, California. Bernstein, D. S., Zilberstein, S., Washington, R., & Bresina., J. L. (2001). Planetary rover control as a Markov decision process. In International Symposium on Artificial Intelligence, Robotics and Automation in Space. Montreal, Canada. Beynier, A., & Mouaddib, A. (2005). A polynomial algorithm for decentralized Markov decision processes with temporal constraints. In Autonomous Agents and Multiagent Systems (AAMAS-05), (pp. 963–969). Boutilier, C. (1996). Planning, learning and coordination in multiagent decision processes. In Theoretical Aspects of Rationality and Knowledge, (pp. 195–210). San Francisco, CA, USA. Boutilier, C., Brafman, R., & Geib, C. (1997). Prioritized goal decomposition of Markov decision processes: Towards a synthesis of classical and decision theoretic planning. In M. Pollack (Ed.) International Joint Conference on Artificial Intelligence, (pp. 1156–1163). Boutilier, C., Dean, T., & Hanks, S. (1999a). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11 , 1–94. Boutilier, C., Dean, T., & Hanks, S. (1999b). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11 , 1–94. Brafman, R. I., & Domshlak, C. (2008). From one to many: Planning for loosely coupled multi-agent systems. In International Conference on Automated Planning and Scheduling, (pp. 28–35). Cassandra, A. R., Kaelbling, L. P., & Kurien, J. A. (1996). Acting under uncertainty: Discrete Bayesian models for mobile-robot navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, (pp. 963–972). Cavallo, R., Parkes, D. C., & Singh, S. (2006). Optimal coordination of looselycoupled self-interested robots. In the Workshop on Auction Mechanisms for Robot Coordination, AAAI-06 . Boston, MA. 275 Clement, B. J., Durfee, E. H., & Barrett, A. C. (2007). Abstract reasoning for planning and coordination. 
Journal of Artificial Intelligence Research, 28 , 453–515. Cohen, P., & Levesque, H. (1991). Teamwork. Nous, 35 , 487–512. Cohen, P. R., & Levesque, H. J. (1990). Intention is choice with commitment. Artificial Intelligence, 42 (2-3), 213–261. Cox, J. S., & Durfee, E. H. (2003). Discovering and exploiting synergy between hierarchical planning agents. In International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-2003), (pp. 281–288). Dean, T., & Givan, R. (1997). Model minimization in Markov decision processes. In AAAI/IAAI , (pp. 106–111). Dean, T., Givan, R., & Kim, K.-E. (1998). Solving stochastic planning problems with large state and action spaces. In Artificial Intelligence Planning Systems, (pp. 102–110). Dean, T., & Lin, S.-H. (1995). Decomposition techniques for planning in stochastic domains. In International Joint Conference on Artificial Intelligence. Dearden, R., & Boutilier, C. (1997). Abstraction and approximate decision-theoretic planning. Artificial Intelligence, 89 (1-2), 219–283. Dechter, R. (1999). Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113 (1-2), 41–85. Dechter, R. (2003). Constraint Processing. Morgan Kaufmann. Decker, K. (1996). TAEMS: A framework for environment centered analysis & design of coordination mechanisms. In Foundations of Distributed Artificial Intelligence, Ch. 16 , (pp. 429–448). Decker, K., & Lesser, V. (1992). Generalizing the Partial Global Planning Algorithm. International Journal on Intelligent Cooperative Information Systems, 1 (2), 319– 346. D’Epenoux, F. (1963). A probabilistic production and inventory problem. Management Science, 10 , 98108. Dolgov, D. A., & Durfee, E. H. (2004a). Graphical models in local, asymmetric multiagent Markov decision processes. In International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-04), (pp. 956–963). New York. Dolgov, D. A., & Durfee, E. H. (2004b). Optimal resource allocation and policy formulation in loosely-coupled Markov decision processes. In International Conference on Automated Planning and Scheduling (ICAPS 04), (pp. 315–324). Whistler, BC. 276 Dolgov, D. A., & Durfee, E. H. (2005). Stationary deterministic policies for constrained MDPs with multiple rewards, costs, and discount factors. In International Joint Conference on Artificial Intelligence (IJCAI-05). Dolgov, D. A., & Durfee, E. H. (2006). Resource allocation among agents with MDP-induced preferences. Journal of Artificial Intelligence Research, 27 , 505–549. Durfee, E. H., & Lesser, V. R. (1991). Partial global planning: A coordination framework for distributed hypothesis formation. IEEE Transactions on Systems, Man, and Cybernetics, 21 (5), 1167–1183. Fikes, R., & Nilsson, N. J. (1971). STRIPS: A new approach to the application of theorem proving to problem solving. Artif. Intell., 2 (3/4), 189–208. Gmytrasiewicz, P. J., & Doshi, P. (2004). Interactive POMDPs: Properties and preliminary results. In International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-04), (pp. 1374–1375). Goldman, C., & Zilberstein, S. (2004). Decentralized control of cooperative systems: Categorization and complexity analysis. Journal of Artificial Intelligence Research, 22 , 143–174. Goldman, C. V., & Zilberstein, S. (2003). Optimizing information exchange in cooperative multi-agent systems. In International Conference on Autonomous Agents and Multi Agent Systems, (pp. 137–144). Melbourne, Australia. Grosz, B. J., & Kraus, S. (1996). 
Guestrin, C., & Gordon, G. (2002). Distributed planning in hierarchical factored MDPs. In Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), (pp. 197–206). Edmonton, Canada.
Guestrin, C., Koller, D., & Parr, R. (2001). Multiagent planning with factored MDPs. In Advances in Neural Information Processing Systems (NIPS), (pp. 1523–1530). Vancouver, Canada.
Guestrin, C., Koller, D., Parr, R., & Venkataraman, S. (2003). Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19, 399–468.
Guestrin, C., Venkataraman, S., & Koller, D. (2002). Context-specific multiagent coordination and planning with factored MDPs. In Eighteenth National Conference on Artificial Intelligence (AAAI-02), (pp. 253–259).
Guo, A., & Lesser, V. (2005). Planning for weakly-coupled partially observable stochastic games. In International Joint Conference on Artificial Intelligence, (pp. 1715–1716).
Hansen, E., Bernstein, D., & Zilberstein, S. (2004). Dynamic programming for partially observable stochastic games. In AAAI, (pp. 709–715).
Horling, B., Lesser, V., Vincent, R., & Wagner, T. (2006). The soft real-time agent control architecture. Journal of Autonomous Agents and Multiagent Systems, 12(1), 35–92.
Jennings, N. R. (1995). Controlling cooperative problem solving in industrial multiagent systems using joint intentions. Artificial Intelligence, 75(2), 195–240.
Jonsson, A., & Barto, A. (2005). A causal approach to hierarchical decomposition of factored MDPs. In ICML ’05: International Conference on Machine Learning, (pp. 401–408).
Jordan, M. I. (Ed.) (1999). Learning in Graphical Models. Cambridge, MA, USA: MIT Press.
Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99–134.
Kallenberg, L. (1983). Linear Programming and Finite Markovian Control Problems. Math. Centrum, Amsterdam.
Kearns, M. J., & Koller, D. (1999). Efficient reinforcement learning in factored MDPs. In IJCAI, (pp. 740–747).
Kim, Y., Nair, R., Varakantham, P., Tambe, M., & Yokoo, M. (2006). Exploiting locality of interaction in networked distributed POMDPs. In AAAI Spring Symposium on Distributed Planning and Scheduling.
Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Kube, C. R. (1997). Task modelling in collective robotics. Autonomous Robots, 4, 53–72.
Kumar, A., & Zilberstein, S. (2009). Constraint-based dynamic programming for decentralized POMDPs with structured interactions. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS-09), (pp. 561–568).
Lesser, V., Decker, K., Wagner, T., Carver, N., Garvey, A., Horling, B., Neiman, D., Podorozhny, R., Nagendra Prasad, M., Raja, A., Vincent, R., Xuan, P., & Zhang, X. (2004). Evolution of the GPGP/TAEMS domain-independent coordination framework. Journal of Autonomous Agents and Multiagent Systems, 9(1), 87–143.
Littman, M. L. (1996). Algorithms for Sequential Decision Making. Ph.D. thesis, Brown University, Providence, RI.
Littman, M. L., Cassandra, A. R., & Kaelbling, L. P. (1995a). Learning policies for partially observable environments: Scaling up. In International Conference on Machine Learning, (pp. 362–370).
Littman, M. L., Dean, T., & Kaelbling, L. P. (1995b). On the complexity of solving Markov decision problems. In UAI, (pp. 394–402).
Lusena, C., Goldsmith, J., & Mundhenk, M. (2001). Nonapproximability results for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 14, 83–103.
Madani, O., Hanks, S., & Condon, A. (1999). On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In AAAI ’99/IAAI ’99: National Conference on Artificial Intelligence and Innovative Applications of Artificial Intelligence, (pp. 541–548).
Marecki, J., Gupta, T., Varakantham, P., Tambe, M., & Yokoo, M. (2008). Not all agents are equal: Scaling up distributed POMDPs for agent networks. In International Joint Conference on Autonomous Agents and Multiagent Systems.
Marecki, J., & Tambe, M. (2007). On opportunistic techniques for solving decentralized MDPs with temporal constraints. In International Joint Conference on Autonomous Agents and Multi-agent Systems.
Marecki, J., & Tambe, M. (2009). Planning with continuous resources for agent teams. In Autonomous Agents and Multiagent Systems (AAMAS-09), (pp. 1089–1096).
Mataric, M. J. (1997). Reinforcement learning in the multi-robot domain. Autonomous Robots, 4(1), 73–83.
Melo, F. S. (2008). Exploiting locality of interactions using a policy-gradient approach in multiagent learning. In Proceedings of the 2008 Conference on ECAI 2008, (pp. 157–161).
Messias, J. V., Spaan, M. T., & Lima, P. U. (2010). Multi-robot planning under uncertainty with communication: A case study. In The MSDM Workshop at AAMAS-2010, (pp. 54–61). Toronto, Canada.
Meuleau, N., Hauskrecht, M., Kim, K.-E., Peshkin, L., Kaelbling, L. P., Dean, T., & Boutilier, C. (1998). Solving very large weakly coupled Markov decision processes. In AAAI ’98/IAAI ’98: Artificial Intelligence/Innovative Applications of Artificial Intelligence, (pp. 165–172).
Modi, P. J., Shen, W. M., Tambe, M., & Yokoo, M. (2005). ADOPT: Asynchronous distributed constraint optimization with quality guarantees. Artificial Intelligence, 161(1-2), 149–180.
Mostafa, H., & Lesser, V. (2009). Offline planning for communication by exploiting structured interactions in decentralized MDPs. In Intelligent Agent Technologies (IAT-09), (pp. 193–200). Milan, Italy.
Musliner, D. J., Durfee, E. H., Wu, J., Dolgov, D. A., Goldman, R. P., & Boddy, M. S. (2006). Coordinated plan management using multiagent MDPs. In Working Notes of the AAAI Spring Symposium on Distributed Plan and Schedule Management.
Nair, R., Tambe, M., Yokoo, M., Pynadath, D. V., & Marsella, S. (2003). Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In IJCAI-03, (pp. 705–711).
Nair, R., Varakantham, P., Tambe, M., & Yokoo, M. (2005). Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. In AAAI-05, (pp. 133–139).
Ogata, K. (1997). Modern Control Engineering (3rd ed.). Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
Oliehoek, F. A. (2010). Value-Based Planning for Teams of Agents in Stochastic Partially Observable Environments. Ph.D. thesis, Informatics Institute, University of Amsterdam.
Oliehoek, F. A., Spaan, M. T. J., & Vlassis, N. (2008a). Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32, 289–353.
Oliehoek, F. A., Spaan, M. T. J., Whiteson, S., & Vlassis, N. A. (2008b). Exploiting locality of interaction in factored Dec-POMDPs. In Autonomous Agents and Multiagent Systems (AAMAS-08), (pp. 517–524).
Osentoski, S., & Mahadevan, S. (2007). Learning state-action basis functions for hierarchical MDPs. In ICML ’07: International Conference on Machine Learning, (pp. 705–712).
Papadimitriou, C., & Tsitsiklis, J. N. (1987). The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 441–450.
Papadimitriou, C. H. (1994). Computational Complexity. Reading, Massachusetts: Addison-Wesley.
Papazoglou, M. P., Traverso, P., Dustdar, S., & Leymann, F. (2007). Service-oriented computing: State of the art and research challenges. Computer, 40(11), 38–45.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Peshkin, L., Kim, K.-E., Meuleau, N., & Kaelbling, L. P. (2000). Learning to cooperate via policy search. In UAI, (pp. 489–496).
Petcu, A., & Faltings, B. (2005). DPOP: A scalable method for multiagent constraint optimization. In IJCAI-05, (pp. 266–271). Edinburgh, Scotland.
Petrik, M., & Zilberstein, S. (2009). A bilinear programming approach for multiagent planning. Journal of Artificial Intelligence Research, 35, 235–274.
Pineau, J., Gordon, G., & Thrun, S. (2006). Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27.
Pinedo, M. L. (2008). Scheduling: Theory, Algorithms, and Systems. Springer.
Poupart, P., Boutilier, C., Patrascu, R., & Schuurmans, D. (2002). Piecewise linear value function approximation for factored MDPs. In Eighteenth National Conference on Artificial Intelligence, (pp. 292–299).
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.
Pynadath, D. V., & Tambe, M. (2002). The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 16, 389–423.
Rabinovich, Z., Goldman, C. V., & Rosenschein, J. S. (2003). The complexity of multiagent systems: The price of silence. In Autonomous Agents and Multiagent Systems (AAMAS-03), (pp. 1102–1103).
Rich, C., & Sidner, C. L. (1997). COLLAGEN: When agents collaborate with people. In International Conference on Autonomous Agents (Agents’97), (pp. 284–291).
Russell, S. J., Norvig, P., Candy, J. F., Malik, J. M., & Edwards, D. D. (1996). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
Seuken, S., & Zilberstein, S. (2007a). Improved memory-bounded dynamic programming for decentralized POMDPs. In Uncertainty in Artificial Intelligence, (pp. 344–351). Vancouver, British Columbia.
Seuken, S., & Zilberstein, S. (2007b). Memory-bounded dynamic programming for DEC-POMDPs. In IJCAI, (pp. 2009–2015).
Seuken, S., & Zilberstein, S. (2008). Formal models and algorithms for decentralized decision making under uncertainty. Journal of Autonomous Agents and Multiagent Systems.
Shen, J., Becker, R., & Lesser, V. (2006). Agent interaction in distributed MDPs and its implications on complexity. In International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-06), (pp. 529–536). Japan.
Singh, S., & Cohn, D. (1998). How to dynamically merge Markov decision processes. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.) Advances in Neural Information Processing Systems (NIPS), vol. 10.
Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5), 1071–1088.
Smith, R. G. (1980). The contract net protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers, C-29(12), 1104–1113.
Smith, S. F., Gallagher, A., & Zimmerman, T. L. (2007). Distributed management of flexible times schedules. In Autonomous Agents and Multiagent Systems (AAMAS-07), (p. 74).
Spaan, M. T. J., & Vlassis, N. (2005). Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24, 195–220.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S., Precup, D., & Singh, S. P. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), 181–211.
Szer, D., Charpillet, F., & Zilberstein, S. (2005). MAA*: A heuristic search algorithm for solving decentralized POMDPs. In UAI, (pp. 576–590).
Tambe, M. (1997). Towards flexible teamwork. Journal of Artificial Intelligence Research, 7, 83–124.
Tasaki, M., Yabu, Y., Iwanuri, Y., Yokoo, M., Marecki, J., Varakantham, P., & Tambe, M. (2010). Introducing communication in Dis-POMDPs with locality of interaction. Web Intelligence and Agent Systems (WIAS), 8(3).
Tsamardinos, I., & Pollack, M. E. (2003). Efficient solution techniques for disjunctive temporal problems. Artificial Intelligence, 151(1-2), 43–89.
Varakantham, P., Kwak, J., Taylor, M., Marecki, J., Scerri, P., & Tambe, M. (2009). Exploiting coordination locales in distributed POMDPs via social model shaping. In ICAPS-09.
Varakantham, P., Marecki, J., Yabu, Y., Tambe, M., & Yokoo, M. (2007). Letting loose a SPIDER on a network of POMDPs: Generating quality guaranteed policies. In Autonomous Agents and Multiagent Systems (AAMAS-07), (pp. 817–824).
Wagner, T., Horling, B., Lesser, V., Phelps, J., & Guralnik, V. (2003). The struggle for reuse: Pros and cons of generalization in TAEMS and its impact on technology transition. In International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE-2003).
Williamson, S. A., Gerding, E. H., & Jennings, N. R. (2009). Reward shaping for valuing communications during multi-agent coordination. In Autonomous Agents and Multiagent Systems (AAMAS-09), (pp. 641–648).
Witwicki, S. J., & Durfee, E. H. (2007). Commitment-driven distributed joint policy search. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS-2007), (pp. 480–487). Honolulu, Hawaii.
Witwicki, S. J., & Durfee, E. H. (2009). Commitment-based service coordination. International Journal of Agent-Oriented Software Engineering (IJAOSE), 3(1), 59–87.
Witwicki, S. J., & Durfee, E. H. (2010). Influence-based policy abstraction for weakly-coupled Dec-POMDPs. In International Conference on Automated Planning and Scheduling (ICAPS-2010). Toronto, Canada.
Wu, J., & Durfee, E. H. (2006). Mixed-integer linear programming for transition-independent decentralized MDPs. In International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-06), (pp. 1058–1060).
Wu, J., & Durfee, E. H. (2007). Solving large TÆMS problems efficiently by selective exploration and decomposition. In International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-07), (pp. 1–8).
Wu, J., & Durfee, E. H. (2010). Resource-driven mission-phasing techniques for constrained agents in stochastic environments. Journal of Artificial Intelligence Research, 38, 415–473.
Xuan, P., & Lesser, V. (1999). Incorporating uncertainty in agent commitments. In International Workshop on Agent Theories, Architectures, and Languages (ATAL-99), (pp. 57–70).
Yokoo, M., Durfee, E. H., Ishida, T., & Kuwabara, K. (1998). The distributed constraint satisfaction problem: Formalization and algorithms. IEEE Transactions on Knowledge and Data Engineering, 10(5), 673–685.