
Reward shaping using convolutional neural network

Published: 01 November 2023

Abstract

In this paper, we propose Value Iteration Network for Reward Shaping (VIN-RS), a potential-based reward shaping mechanism built on a Convolutional Neural Network (CNN). VIN-RS embeds a CNN trained on labels computed with the message passing mechanism of the Hidden Markov Model; the CNN processes images or graphs of the environment to predict the shaping values. Recent reward shaping methods are still limited in that they must be trained on an explicit representation of the Markov Decision Process (MDP) and must build an estimate of its transition matrix. The advantage of VIN-RS is that it constructs an effective potential function from an estimated MDP while inferring the transition matrix automatically: the transition matrix is captured by a self-learned convolution filter, and environment details are extracted from the input frames or sampled graphs. Motivated by (1) the previous success of message passing for reward shaping and (2) the planning behavior of the CNN, we use these messages as the training targets for the CNN of VIN-RS. Experiments are performed on tabular games, Atari 2600, and MuJoCo, covering both discrete and continuous action spaces. Our results show promising improvements in learning speed and maximum cumulative reward compared to the state-of-the-art, although, owing to the underlying nature of some environments, the improvement is observed only for a subset of the games. For the studied MuJoCo games, the maximum reward reached during the early stages of learning increases by 30% on average.
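
The abstract pins down two mechanisms: a value-iteration-style CNN whose learned convolution filter stands in for the transition matrix and which outputs a potential Φ per state, and a standard potential-based shaping term added to the environment reward. The sketch below illustrates these two ideas only; it is a minimal illustration assuming a PyTorch implementation, and the names PotentialVIN and shaped_reward, the layer sizes, and the iteration count are hypothetical rather than taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PotentialVIN(nn.Module):
    # CNN that runs value-iteration-style planning over an image of the
    # environment and outputs a potential Phi for every state/cell.
    def __init__(self, in_channels=2, hidden=64, n_actions=8, iterations=20):
        super().__init__()
        self.h = nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1)
        self.r = nn.Conv2d(hidden, 1, kernel_size=1)
        # This learned filter plays the role of the transition model:
        # the dynamics are inferred from data rather than given explicitly.
        self.q = nn.Conv2d(2, n_actions, kernel_size=3, padding=1, bias=False)
        self.iterations = iterations

    def forward(self, obs):
        r = self.r(F.relu(self.h(obs)))              # reward map, shape (B, 1, H, W)
        v = torch.zeros_like(r)                      # initial value map
        for _ in range(self.iterations):             # unrolled value-iteration loop
            q = self.q(torch.cat([r, v], dim=1))     # Q map, shape (B, A, H, W)
            v, _ = torch.max(q, dim=1, keepdim=True) # Bellman max over actions
        return v                                     # potentials Phi(s)

def shaped_reward(r_env, phi_s, phi_s_next, gamma=0.99):
    # Potential-based shaping (Ng et al., 1999): r + gamma * Phi(s') - Phi(s),
    # which leaves the optimal policy of the original MDP unchanged.
    return r_env + gamma * phi_s_next - phi_s

In VIN-RS itself, the labels used to train such a potential network come from Hidden Markov Model message passing over the sampled environment, as stated above; using the shaping form gamma*Phi(s') - Phi(s) is what keeps the shaped MDP's optimal policy identical to the original one.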


Cited By

  • (2024) Norm Enforcement with a Soft Touch: Faster Emergence, Happier Agents. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), pp. 1837–1846. DOI: 10.5555/3635637.3663046. Online publication date: 6 May 2024.

Published In

Information Sciences: an International Journal, Volume 648, Issue C
November 2023, 1590 pages

Publisher

Elsevier Science Inc.

United States

Publication History

Published: 01 November 2023

Author Tags

  1. Reinforcement learning
  2. Reward shaping
  3. Convolutional neural network
  4. Value iteration network
  5. Atari
  6. MuJoCo

Qualifiers

  • Research-article
