Many real-world decision problems are characterized by multiple conflicting objectives that must be balanced based on their relative importance. In the dynamic weights setting, the relative importance changes over time, and specialized algorithms that deal with such change, such as the tabular Reinforcement Learning (RL) algorithm by Natarajan & Tadepalli (2005), are required. However, this earlier work is not feasible for RL settings that necessitate the use of function approximators. We generalize across weight changes and high-dimensional inputs by proposing a multi-objective Q-network whose outputs are conditioned on the relative importance of objectives, and we introduce Diverse Experience Replay (DER) to counter the inherent non-stationarity of the dynamic weights setting. We perform an extensive experimental evaluation, compare our methods to adapted algorithms from Deep Multi-Task/Multi-Objective RL, and show that our proposed network in combination with DER dominates these ad...
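The core idea of conditioning a Q-network on the objective weights can be sketched as follows. This is a minimal illustration, not the paper's architecture: the dimensions, the random parameters standing in for a trained network, and the function names are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
STATE_DIM, N_OBJECTIVES, N_ACTIONS, HIDDEN = 4, 2, 3, 16

# Random parameters stand in for a trained network.
W1 = rng.normal(size=(STATE_DIM + N_OBJECTIVES, HIDDEN))
W2 = rng.normal(size=(HIDDEN, N_ACTIONS * N_OBJECTIVES))

def q_values(state, weights):
    """Q-network conditioned on the objective weights: the weight
    vector is concatenated to the state before the forward pass."""
    x = np.concatenate([state, weights])
    h = np.tanh(x @ W1)
    return (h @ W2).reshape(N_ACTIONS, N_OBJECTIVES)

def act(state, weights):
    """Greedy action under the current scalarization w . Q(s, a)."""
    return int(np.argmax(q_values(state, weights) @ weights))

state = rng.normal(size=STATE_DIM)
# When the relative importance shifts, only the conditioning input
# changes; the same network serves every weight vector.
a1 = act(state, np.array([0.9, 0.1]))
a2 = act(state, np.array([0.1, 0.9]))
print(a1, a2)
```

The design point is that weight changes become an input to the network rather than a reason to retrain from scratch.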
In multi-objective optimization, learning all the policies that reach Pareto-efficient solutions is an expensive process. The set of optimal policies can grow exponentially with the number of objectives, and recovering all solutions requires an exhaustive exploration of the entire state space. We propose Pareto Conditioned Networks (PCN), a method that uses a single neural network to encompass all non-dominated policies. PCN associates every past transition with its episode's return and trains the network such that, when conditioned on this same return, it reenacts said transition. In doing so, we transform the optimization problem into a classification problem. We recover a concrete policy by conditioning the network on the desired Pareto-efficient solution. Our method is stable as it learns in a supervised fashion, thus avoiding moving-target issues. Moreover, by using a single network, PCN scales efficiently with the number of objectives. Finally, it makes minimal assumption...
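The notion of "non-dominated" returns that PCN targets can be made concrete with a standard Pareto-dominance check. This is a generic sketch of the dominance relation, not PCN's training code; the example return vectors are invented for illustration.

```python
import numpy as np

def dominates(a, b):
    """True if return vector a Pareto-dominates b (all objectives
    are to be maximized): a is >= everywhere and > somewhere."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a >= b) and np.any(a > b))

def non_dominated(returns):
    """Filter a list of episode returns down to the Pareto front."""
    return [r for r in returns
            if not any(dominates(o, r) for o in returns if o is not r)]

episode_returns = [(3, 1), (1, 3), (2, 2), (1, 1)]
front = non_dominated(episode_returns)
print(front)  # (1, 1) is dominated by (2, 2); the rest survive
```

Conditioning the policy on any return in this front, as PCN does, then selects which trade-off the recovered policy pursues.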
We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL) in continuous environments. In well-behaved settings, RL algorithms have convergence guarantees in the limit. While these guarantees are valuable, they are insufficient for safety-critical applications. Furthermore, they are lost when applying advanced techniques such as deep RL. To recover guarantees when applying advanced RL algorithms to more complex environments with (i) reachability, (ii) safety-constrained reachability, or (iii) discounted-reward objectives, we build upon the DeepMDP framework introduced by Gelada et al. to derive new bisimulation bounds between the unknown environment and a learned discrete latent model of it. Our bisimulation bounds enable the application of formal methods for Markov decision processes. Finally, we show how one can use a policy obtained via state-of-the-art RL to efficiently train a variational autoencoder ...
Multi-agent reinforcement learning (MARL) enables us to create adaptive agents in challenging environments, even when the agents have limited observability. Modern MARL methods have hitherto focused on finding factorized value functions. While this approach has proven successful, the resulting methods have convoluted network structures. We take a radically different approach and build on the structure of independent Q-learners. Inspired by influence-based abstraction, we start from the observation that compact representations of the observation-action histories can be sufficient to learn close-to-optimal decentralized policies. Combining this observation with a dueling architecture, our algorithm, LAN, represents these policies as separate individual advantage functions w.r.t. a centralized critic. These local advantage networks condition only on a single agent's local observation-action history. The centralized value function conditions on the agents' representations as well as the f...
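The split between per-agent advantage heads and a centralized value function can be sketched generically. This is a toy illustration of the dueling-style decomposition described above, not LAN itself: the dimensions, the random parameters standing in for trained networks, and the mean-subtraction choice are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

N_AGENTS, HIST_DIM, N_ACTIONS = 2, 3, 2

# Random parameters stand in for trained networks (illustration only).
adv_params = [rng.normal(size=(HIST_DIM, N_ACTIONS)) for _ in range(N_AGENTS)]
val_params = rng.normal(size=(N_AGENTS * HIST_DIM,))

def local_advantages(i, history):
    """Agent i's advantage head: conditions only on that agent's own
    local observation-action history representation."""
    a = history @ adv_params[i]
    return a - a.mean()                 # dueling-style mean subtraction

def central_value(histories):
    """Centralized critic: conditions on all agents' representations.
    (Needed during training; execution stays decentralized.)"""
    return float(np.concatenate(histories) @ val_params)

histories = [rng.normal(size=HIST_DIM) for _ in range(N_AGENTS)]
# Decentralized execution: each agent acts greedily on its own head.
actions = [int(np.argmax(local_advantages(i, histories[i])))
           for i in range(N_AGENTS)]
v = central_value(histories)
print(actions, round(v, 3))
```

The point of the decomposition is that the centralized critic can be discarded at execution time, leaving each agent with only its lightweight local head.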
To Adapt or Not to Adapt - Consequences of Adapting Driver and Traffic Light Agents
Optimal Control in Large Stochastic Multi-agent Systems
Continuous-State Reinforcement Learning with Fuzzy Approximation
Using Evolutionary Game-Theory to Analyse the Performance of Trading Strategies in a Continuous Double Auction Market
Parallel Reinforcement Learning with Linear Function Approximation
Combining Reinforcement Learning with Symbolic Planning
Agent Interactions and Implicit Trust in IPD Environments
Collaborative Learning with Logic-Based Models
Priority Awareness: Towards a Computational Model of Human Fairness for Multi-agent Systems
Bifurcation Analysis of Reinforcement Learning Agents in the Selten's Horse Game
Bee Behaviour in Multi-agent Systems
Stable Cooperation in the N-Player Prisoner's Dilemma: The Importance of Community Structure
Solving Multi-stage Games with Hierarchical Learning Automata That Bootstrap
Auctions, Evolution, and Multi-agent Learning
Multi-agent Reinforcement Learning for Intrusion Detection
Networks of Learning Automata and Limiting Games
Multi-agent Learning by Distributed Feature Extraction
2012 IV International Congress on Ultra Modern Telecommunications and Control Systems, 2012
Dense Wavelength Division Multiplexing (DWDM) has emerged as the premier transport technology and has gained much traction in long-haul and metro/regional networks, while most Internet services have converged on the Internet Protocol (IP) layer. This has motivated the expectation that an IP-over-WDM multilayer infrastructure will be the preferred infrastructure for broadband communications in the near future. However, routing and connection setup in multi-domain networks that considers both multilayer routing policies, Physical Topology First (PTF) and Virtual Topology First (VTF), to handle circuit-switched connection setup on both the IP and optical WDM layers has not yet been sufficiently investigated. In this paper, we propose a routing scheme that can route a connection either on the optical or on the IP layer, which gives the opportunity to use the multilayer routing policies PTF and VTF in multi-domain networks.
A major problem in multi-agent reinforcement learning (MARL) research is to let multiple agents learn how to coordinate on some equilibrium. Coordination in single-stage problems, which are easily modeled as normal-form games from game theory, is already ...
With the advent of theories on evolutionary transitions in biological complexity, interest in kinship, population structure, and group selection has re-emerged. This paper focuses on the latter two concepts and analyzes their effects on the selection dynamics in an evolutionary game context. Concretely, we investigate the selection dynamics of an iterative system that produces groups of different composition. The specific process is called an intrademic multilevel selection process. This multilevel selection process is analyzed ...
In this paper, we address two topics which are currently under investigation at our research lab. The first concerns the question of how cooperation can emerge in a system with antagonistic agents and how this can be modelled through a system of Reinforcement Learning (RL) agents. Current problems result from the fact that RL systems try to model all agents active in the environment. As a solution, we are examining biological niching models and measures in order to reduce the complexity of the agent's learning model. The ...
Multi-agent coordination is prevalent in many real-world applications. However, such coordination is challenging due to its combinatorial nature. An important observation in this regard is that agents in the real world often only directly affect a limited set of neighboring agents. Leveraging such loose couplings among agents is key to making coordination in multi-agent systems feasible. In this work, we focus on learning to coordinate. Specifically, we consider the multi-agent multi-armed bandit framework, in which fully cooperative, loosely coupled agents must learn to coordinate their decisions to optimize a common objective. As opposed to the planning setting, it is challenging to establish theoretical guarantees for learning methods. We propose multi-agent Thompson sampling (MATS), a new Bayesian exploration-exploitation algorithm that leverages loose couplings. We provide a regret bound that is sublinear in time and low-order polynomial in the highest number of actions of a ...
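The interplay of Thompson sampling and loose couplings can be sketched on a toy coordination graph. This is a simplified illustration, not the MATS algorithm as published: the two-group graph, the crude running-mean "posterior", and the brute-force maximization (where MATS would exploit the graph structure) are all assumptions made for the example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Toy coordination graph: agents 0-2, two local reward groups that
# overlap on agent 1 (this structure is an illustrative assumption).
groups = [(0, 1), (1, 2)]
n_actions = 2

# Per-group posterior statistics for each local joint action:
# [running mean, pseudo-count] as a crude Gaussian posterior sketch.
stats = {g: {a: [0.0, 1e-6]
             for a in itertools.product(range(n_actions), repeat=len(g))}
         for g in groups}

def sample_joint_action():
    """Thompson-sampling step: sample a mean for every local joint
    action, then pick the global action maximizing the sampled sum."""
    sampled = {g: {a: rng.normal(m, 1.0 / np.sqrt(c))
                   for a, (m, c) in stats[g].items()} for g in groups}
    best, best_val = None, -np.inf
    for joint in itertools.product(range(n_actions), repeat=3):
        val = sum(sampled[g][tuple(joint[i] for i in g)] for g in groups)
        if val > best_val:
            best, best_val = joint, val
    return best

def update(joint, local_rewards):
    """Update each group's posterior with its observed local reward."""
    for g, r in zip(groups, local_rewards):
        key = tuple(joint[i] for i in g)
        m, c = stats[g][key]
        stats[g][key] = [(m * c + r) / (c + 1), c + 1]

joint = sample_joint_action()
update(joint, local_rewards=[1.0, 0.5])
print(joint)
```

Because beliefs are kept per group rather than per joint action, the statistics grow with the size of the largest group, not exponentially in the number of agents.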
We present a new model-based reinforcement learning algorithm, Cooperative Prioritized Sweeping, for efficient learning in multi-agent Markov decision processes. The algorithm allows for sample-efficient learning on large problems by exploiting a factorization to approximate the value function. Our approach only requires knowledge about the structure of the problem in the form of a dynamic decision network. Using this information, our method learns a model of the environment and performs temporal-difference updates that affect multiple joint states and actions at once. Batch updates are additionally performed to efficiently back-propagate knowledge throughout the factored Q-function. Our method outperforms the state-of-the-art sparse cooperative Q-learning algorithm, both on the well-known SysAdmin benchmark and on randomized environments.
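The prioritized-sweeping backbone that the cooperative, factored variant builds on can be shown in its classic single-agent form: value changes are pushed backwards through predecessor states in order of magnitude. The tiny deterministic chain MDP below is invented for illustration; the multi-agent algorithm above extends this idea to factored Q-functions.

```python
import heapq

# Tiny deterministic chain MDP (illustrative): states 0..3, a single
# "right" action moves +1; reward 1 only on entering state 3.
N, GAMMA, THETA = 4, 0.9, 1e-3
V = [0.0] * N
predecessors = {s: {s - 1} if s > 0 else set() for s in range(N)}

def backup(s):
    """One-step lookahead for the single 'right' action."""
    if s == N - 1:
        return 0.0                      # terminal state
    s2 = s + 1
    r = 1.0 if s2 == N - 1 else 0.0
    return r + GAMMA * V[s2]

# Prioritized sweeping: start from the rewarding transition and
# propagate value changes backwards through the predecessor graph.
pq = [(-1.0, N - 2)]                    # (negative priority, state)
while pq:
    _, s = heapq.heappop(pq)
    new_v = backup(s)
    change = abs(new_v - V[s])
    V[s] = new_v
    if change > THETA:
        for p in predecessors[s]:
            heapq.heappush(pq, (-change, p))

print([round(v, 3) for v in V])  # [0.81, 0.9, 1.0, 0.0]
```

A single observed reward thus updates a whole chain of upstream values, which is the sample-efficiency argument made in the abstract.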
Real-world decision-making tasks are generally complex, requiring trade-offs between multiple, often conflicting, objectives. Despite this, the majority of research in reinforcement learning and decision-theoretic planning either assumes only a single objective or that multiple objectives can be adequately handled via a simple linear combination. Such approaches may oversimplify the underlying problem and hence produce suboptimal results. This paper serves as a guide to the application of multi-objective methods to difficult problems. It is aimed at researchers who are already familiar with single-objective reinforcement learning and planning methods and who wish to adopt a multi-objective perspective on their research, as well as at practitioners who encounter multi-objective decision problems in practice. It identifies the factors that may influence the nature of the desired solution, and illustrates by example how these influence the design of multi-objective decision-making systems fo...
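The claim that a simple linear combination can oversimplify the problem has a standard concrete witness: a Pareto-optimal point in a concave region of the front is never selected by any linear weighting. The three value vectors below are invented for illustration.

```python
import numpy as np

# Three Pareto-optimal value vectors (two objectives, maximize both).
# (0.4, 0.4) is not dominated by either extreme, yet no linear
# weighting ever selects it: it lies in a concave part of the front.
points = np.array([[1.0, 0.0], [0.0, 1.0], [0.4, 0.4]])

def linear_winner(w):
    """Index of the point maximizing the scalarized value w . v."""
    return int(np.argmax(points @ w))

# Sweep weights w = (t, 1 - t) over a fine grid.
winners = {linear_winner(np.array([t, 1.0 - t]))
           for t in np.linspace(0.0, 1.0, 1001)}
print(winners)  # {0, 1}: point 2 is Pareto-optimal but never chosen
```

For any w = (t, 1 - t), the scalarized value of point 2 is 0.4, while max(t, 1 - t) >= 0.5, so the balanced trade-off is invisible to linear scalarization even though a user may prefer it.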
Proceedings by Ann Nowé
Papers by Ann Nowé