Dynamic Weights and Prior Reward in Policy Fusion for Compound Agent Learning

Published: 14 November 2023

    Abstract

    In the Deep Reinforcement Learning (DRL) domain, a compound learning task is often decomposed into several sub-tasks in a divide-and-conquer manner; each sub-task is trained separately, and the resulting policies are then fused to accomplish the original task, a process referred to as policy fusion. However, state-of-the-art (SOTA) policy fusion methods weight the sub-tasks equally throughout the task, ruling out the possibility of the agent relying on different sub-tasks at different stages. To address this limitation, we propose a generic policy fusion approach, referred to as Policy Fusion Learning with Dynamic Weights and Prior Reward (PFLDWPR), to automate the time-varying selection of sub-tasks. Specifically, PFLDWPR produces a time-varying one-hot vector over the sub-tasks to dynamically select a suitable sub-task and mask the rest throughout the task process, enabling the fused strategy to optimally guide the agent in executing the compound task. The sub-tasks weighted by the dynamic one-hot vector are then aggregated to obtain the action policy for the original task. Moreover, we collect the sub-tasks' rewards at the pre-training stage as a prior reward, which, together with the current reward, is used to train the policy fusion network; leveraging this prior experience reduces fusion bias. Experimental results on three popular learning tasks demonstrate that the proposed method significantly improves three SOTA policy fusion methods in terms of task duration, episode reward, and score difference.
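    To make the mechanism concrete, the sketch below is a minimal illustration, not the authors' implementation; the function `fuse_step`, the `gate_scores` input, and the mixing coefficient `beta` are assumptions. It shows a single fusion step as the abstract describes it: a gating (fusion) network scores the pre-trained sub-task policies, the argmax yields the time-varying one-hot vector that masks all but one sub-task, the masked outputs are aggregated into the compound-task action policy, and the selected sub-task's pre-training (prior) reward is blended with the current reward as the training signal for the fusion network.

```python
import numpy as np

def one_hot(index: int, n: int) -> np.ndarray:
    """Return an n-dimensional one-hot vector with a 1 at position `index`."""
    v = np.zeros(n)
    v[index] = 1.0
    return v

def fuse_step(state, sub_policies, gate_scores, prior_rewards, current_reward, beta=0.5):
    """Hypothetical fusion step (names, shapes, and `beta` are assumptions).

    sub_policies   : list of callables, each mapping a state to an action distribution
    gate_scores    : per-sub-task scores from the gating (fusion) network for this state
    prior_rewards  : per-sub-task rewards collected during pre-training
    current_reward : reward observed at the current time step
    beta           : assumed coefficient mixing prior and current reward
    """
    n = len(sub_policies)

    # Time-varying one-hot vector: select one sub-task for this step, mask the rest.
    mask = one_hot(int(np.argmax(gate_scores)), n)

    # Aggregate the masked sub-task outputs into the action policy for the compound task.
    action_dist = sum(m * policy(state) for m, policy in zip(mask, sub_policies))

    # Blend the selected sub-task's prior reward with the current reward; this
    # combined signal would drive the training of the policy fusion network.
    fusion_reward = beta * float(mask @ np.asarray(prior_rewards, dtype=float)) \
                    + (1.0 - beta) * current_reward
    return action_dist, fusion_reward
```

    Because the one-hot vector is recomputed at every step, the fused agent can lean on different sub-tasks at different stages of the task, which is the flexibility that equal-weighting fusion methods lack.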

    Supplementary Material

    3623405-supp (3623405-supp.pdf)
    Supplementary materials


    Cited By

    • Strengthening Cooperative Consensus in Multi-Robot Confrontation. ACM Transactions on Intelligent Systems and Technology 15, 2 (2024), 1–27. https://doi.org/10.1145/3639371. Online publication date: 22-Feb-2024.
    • Robust Recommender Systems with Rating Flip Noise. ACM Transactions on Intelligent Systems and Technology. https://doi.org/10.1145/3641285.


      Information

      Published In

      ACM Transactions on Intelligent Systems and Technology  Volume 14, Issue 6
      December 2023
      493 pages
      ISSN:2157-6904
      EISSN:2157-6912
      DOI:10.1145/3632517
      Editor: Huan Liu

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 14 November 2023
      Online AM: 11 September 2023
      Accepted: 25 August 2023
      Revised: 07 August 2023
      Received: 25 May 2023
      Published in TIST Volume 14, Issue 6


      Author Tags

      1. Compound agent learning
      2. deep reinforcement learning
      3. policy fusion
      4. dynamic weights
      5. prior reward

      Qualifiers

      • Research-article

      Funding Sources

      • Hong Kong Research Grant Council

      Article Metrics

      • Downloads (last 12 months): 666
      • Downloads (last 6 weeks): 33

