
Reward shaping using convolutional neural network

Published: 01 November 2023

Abstract

In this paper, we propose Value Iteration Network for Reward Shaping (VIN-RS), a potential-based reward shaping mechanism built on a Convolutional Neural Network (CNN). VIN-RS embeds a CNN trained on labels computed with the message passing mechanism of the Hidden Markov Model; the CNN processes images or graphs of the environment to predict the shaping values. Recent reward shaping methods are still limited in that they must be trained on an explicit representation of the Markov Decision Process (MDP) and must build an estimate of its transition matrix. The advantage of VIN-RS is that it constructs an effective potential function from an estimated MDP while inferring the transition matrix automatically: the transition matrix is captured by a self-learned convolution filter, and environment details are extracted from the input frames or sampled graphs. Motivated by (1) the previous success of message passing for reward shaping and (2) the planning behavior of the CNN, we use these messages as the training targets for the CNN of VIN-RS. Experiments are performed on tabular games, Atari 2600, and MuJoCo, covering both discrete and continuous action spaces. Our results show promising improvements in learning speed and maximum cumulative reward compared to the state-of-the-art, although, owing to the underlying nature of some environments, the improvement is observed only for a subset of the games. For the studied MuJoCo games, the maximum reward reached during the early stages of learning increases by 30% on average.
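
The abstract pins down two mechanisms: a value-iteration-style CNN whose learned convolution filter stands in for the transition matrix and which outputs a potential Φ per state, and a standard potential-based shaping term added to the environment reward. The sketch below illustrates these two ideas only; it is a minimal illustration assuming a PyTorch implementation, and the names PotentialVIN and shaped_reward, the layer sizes, and the iteration count are hypothetical rather than taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PotentialVIN(nn.Module):
    # CNN that runs value-iteration-style planning over an image of the
    # environment and outputs a potential Phi for every state/cell.
    def __init__(self, in_channels=2, hidden=64, n_actions=8, iterations=20):
        super().__init__()
        self.h = nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1)
        self.r = nn.Conv2d(hidden, 1, kernel_size=1)
        # This learned filter plays the role of the transition model:
        # the dynamics are inferred from data rather than given explicitly.
        self.q = nn.Conv2d(2, n_actions, kernel_size=3, padding=1, bias=False)
        self.iterations = iterations

    def forward(self, obs):
        r = self.r(F.relu(self.h(obs)))              # reward map, shape (B, 1, H, W)
        v = torch.zeros_like(r)                      # initial value map
        for _ in range(self.iterations):             # unrolled value-iteration loop
            q = self.q(torch.cat([r, v], dim=1))     # Q map, shape (B, A, H, W)
            v, _ = torch.max(q, dim=1, keepdim=True) # Bellman max over actions
        return v                                     # potentials Phi(s)

def shaped_reward(r_env, phi_s, phi_s_next, gamma=0.99):
    # Potential-based shaping (Ng et al., 1999): r + gamma * Phi(s') - Phi(s),
    # which leaves the optimal policy of the original MDP unchanged.
    return r_env + gamma * phi_s_next - phi_s

In VIN-RS itself, the labels used to train such a potential network come from Hidden Markov Model message passing over the sampled environment, as stated above; using the shaping form gamma*Phi(s') - Phi(s) is what keeps the shaped MDP's optimal policy identical to the original one.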


Cited By

  • (2024) Norm Enforcement with a Soft Touch: Faster Emergence, Happier Agents. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), pp. 1837–1846. DOI: 10.5555/3635637.3663046. Online publication date: 6 May 2024.

Published In

Information Sciences: an International Journal, Volume 648, Issue C
November 2023, 1590 pages

Publisher

Elsevier Science Inc.

United States

Publication History

Published: 01 November 2023

Author Tags

  1. Reinforcement learning
  2. Reward shaping
  3. Convolutional neural network
  4. Value iteration network
  5. Atari
  6. MuJoCo

Qualifiers

  • Research-article
