Abstract
Agents often need long exploration of the state-action space to learn how to act as expected in Partially Observable Markov Decision Processes (POMDPs). Reward shaping can guide real-time POMDP planning toward both greater reliability and higher speed. In this paper, we propose the Low Dimensional Policy Graph (LDPG), a new reward shaping method that reduces the dimension of the value function to extract the best state-action pairs; the reward function is then shaped using these key pairs. To further accelerate learning, we analyze the transition-function graph to discover significant paths to the learning agent’s goal. A direct comparison on five standard testbeds indicates that LDPG leads to faster, deterministic discovery of optimal actions regardless of the task type. Our method reaches the goals more quickly (a 41.48% improvement) and performs 61.57% better in terms of received reward in the \( 4 \times 5 \times 2 \) domain.
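The abstract outlines two ingredients of LDPG at a high level: key state-action pairs extracted from a low-dimensional view of the value function, and significant goal-directed paths found in the transition-function graph, both of which feed a shaped reward. The paper's exact procedure is not reproduced here; the following is a minimal Python sketch of the general idea under two assumptions of our own: path significance is approximated with betweenness centrality over the transition graph (via networkx), and the shaping term follows the standard potential-based form of Ng et al. (1999). The function names `key_states_from_transition_graph` and `shaped_reward` and all parameters are illustrative, not the authors' implementation.

```python
# Illustrative sketch only; not the authors' implementation of LDPG.
import networkx as nx
import numpy as np


def key_states_from_transition_graph(transitions, top_k=5):
    """Rank states by betweenness centrality in the transition graph.

    `transitions` is an iterable of (s, s_next, prob) triples. States lying on
    many likely paths serve here as a proxy for the "significant paths to the
    goal" described in the abstract.
    """
    g = nx.DiGraph()
    for s, s_next, prob in transitions:
        if prob > 0.0:
            # -log(prob) turns high-probability transitions into short edges,
            # so shortest paths correspond to likely trajectories.
            g.add_edge(s, s_next, weight=-np.log(prob))
    centrality = nx.betweenness_centrality(g, weight="weight")
    return sorted(centrality, key=centrality.get, reverse=True)[:top_k]


def shaped_reward(reward, s, s_next, potential, gamma=0.95):
    """Potential-based shaping (Ng et al., 1999): R' = R + gamma*Phi(s') - Phi(s)."""
    return reward + gamma * potential.get(s_next, 0.0) - potential.get(s, 0.0)


# Usage: give key states a positive potential and shape rewards toward them.
transitions = [("s0", "s1", 0.9), ("s1", "goal", 0.8), ("s0", "s2", 0.1)]
key_states = key_states_from_transition_graph(transitions, top_k=2)
potential = {s: 1.0 for s in key_states}
print(shaped_reward(0.0, "s0", "s1", potential))
```

Because the shaping term is potential-based, it biases exploration toward the extracted key states without changing which policy is optimal.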
Acknowledgements
This research is supported by research grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada. We thank the four anonymous reviewers for their thorough comments on this paper.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nahali, S., Ayadi, H., Huang, J.X., Pakizeh, E., Pedram, M.M., Safari, L. (2023). A Dynamic and Task-Independent Reward Shaping Approach for Discrete Partially Observable Markov Decision Processes. In: Kashima, H., Ide, T., Peng, WC. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2023. Lecture Notes in Computer Science, vol. 13936. Springer, Cham. https://doi.org/10.1007/978-3-031-33377-4_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33376-7
Online ISBN: 978-3-031-33377-4