Dynamic Weights and Prior Reward in Policy Fusion for Compound Agent Learning

Published: 14 November 2023

    Abstract

    In the Deep Reinforcement Learning (DRL) domain, a compound learning task is often decomposed into several sub-tasks in a divide-and-conquer manner; each sub-task is trained separately, and the resulting policies are then fused to accomplish the original task, a process referred to as policy fusion. However, state-of-the-art (SOTA) policy fusion methods weight the sub-tasks equally throughout the task, ruling out the possibility of the agent relying on different sub-tasks at different stages. To address this limitation, we propose a generic policy fusion approach, referred to as Policy Fusion Learning with Dynamic Weights and Prior Reward (PFLDWPR), to automate the time-varying selection of sub-tasks. Specifically, PFLDWPR produces a time-varying one-hot vector over the sub-tasks to dynamically select a suitable sub-task and mask the rest throughout the task process, enabling the fused strategy to optimally guide the agent in executing the compound task. The sub-tasks weighted by the dynamic one-hot vector are then aggregated to obtain the action policy for the original task. Moreover, we collect the sub-tasks' rewards at the pre-training stage as a prior reward, which, together with the current reward, is used to train the policy fusion network; leveraging this prior experience reduces fusion bias. Experimental results on three popular learning tasks demonstrate that the proposed method significantly improves three SOTA policy fusion methods in terms of task duration, episode reward, and score difference.
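    To make the mechanism concrete, the sketch below is a minimal illustration, not the authors' implementation; the function `fuse_step`, the `gate_scores` input, and the mixing coefficient `beta` are assumptions. It shows a single fusion step as the abstract describes it: a gating (fusion) network scores the pre-trained sub-task policies, the argmax yields the time-varying one-hot vector that masks all but one sub-task, the masked outputs are aggregated into the compound-task action policy, and the selected sub-task's pre-training (prior) reward is blended with the current reward as the training signal for the fusion network.

```python
import numpy as np

def one_hot(index: int, n: int) -> np.ndarray:
    """Return an n-dimensional one-hot vector with a 1 at position `index`."""
    v = np.zeros(n)
    v[index] = 1.0
    return v

def fuse_step(state, sub_policies, gate_scores, prior_rewards, current_reward, beta=0.5):
    """Hypothetical fusion step (names, shapes, and `beta` are assumptions).

    sub_policies   : list of callables, each mapping a state to an action distribution
    gate_scores    : per-sub-task scores from the gating (fusion) network for this state
    prior_rewards  : per-sub-task rewards collected during pre-training
    current_reward : reward observed at the current time step
    beta           : assumed coefficient mixing prior and current reward
    """
    n = len(sub_policies)

    # Time-varying one-hot vector: select one sub-task for this step, mask the rest.
    mask = one_hot(int(np.argmax(gate_scores)), n)

    # Aggregate the masked sub-task outputs into the action policy for the compound task.
    action_dist = sum(m * policy(state) for m, policy in zip(mask, sub_policies))

    # Blend the selected sub-task's prior reward with the current reward; this
    # combined signal would drive the training of the policy fusion network.
    fusion_reward = beta * float(mask @ np.asarray(prior_rewards, dtype=float)) \
                    + (1.0 - beta) * current_reward
    return action_dist, fusion_reward
```

    Because the one-hot vector is recomputed at every step, the fused agent can lean on different sub-tasks at different stages of the task, which is the flexibility that equal-weighting fusion methods lack.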

    Supplementary Material

    3623405-supp (3623405-supp.pdf)
    Supplementary materials


    Cited By

    • Strengthening Cooperative Consensus in Multi-Robot Confrontation. ACM Transactions on Intelligent Systems and Technology 15, 2 (2024), 1–27. https://doi.org/10.1145/3639371. Online publication date: 22-Feb-2024.
    • Robust Recommender Systems with Rating Flip Noise. ACM Transactions on Intelligent Systems and Technology. https://doi.org/10.1145/3641285.


      Information

      Published In

      ACM Transactions on Intelligent Systems and Technology  Volume 14, Issue 6
      December 2023
      493 pages
      ISSN:2157-6904
      EISSN:2157-6912
      DOI:10.1145/3632517
      Editor: Huan Liu

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 14 November 2023
      Online AM: 11 September 2023
      Accepted: 25 August 2023
      Revised: 07 August 2023
      Received: 25 May 2023
      Published in TIST Volume 14, Issue 6


      Author Tags

      1. Compound agent learning
      2. deep reinforcement learning
      3. policy fusion
      4. dynamic weights
      5. prior reward

      Qualifiers

      • Research-article

      Funding Sources

      • Hong Kong Research Grant Council

      Article Metrics

      • Downloads (last 12 months): 666
      • Downloads (last 6 weeks): 33

