
Safe Policy Improvement Approaches and Their Limitations

  • Conference paper

Part of the book: Agents and Artificial Intelligence (ICAART 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13786)

Abstract

Safe Policy Improvement (SPI) is an important technique for offline reinforcement learning in safety-critical applications, as it improves the behavior policy with high probability. We classify various SPI approaches from the literature into two groups, based on how they utilize the uncertainty of state-action pairs. Focusing on the Soft-SPIBB (Safe Policy Improvement with Soft Baseline Bootstrapping) algorithms, we show that their claim of being provably safe does not hold. Based on this finding, we develop adaptations, the Adv-Soft-SPIBB algorithms, and show that they are provably safe. A heuristic adaptation, Lower-Approx-Soft-SPIBB, yields the best performance among all SPIBB algorithms in extensive experiments on two benchmarks. We also check the safety guarantees of the provably safe algorithms and show that huge amounts of data are necessary for the safety bounds to become useful in practice.
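
The SPIBB family referred to in the abstract rests on one mechanism: estimate how uncertain each state-action pair is from its visit counts in the batch, and only let the improved policy deviate from the behavior (baseline) policy where that uncertainty is small. The sketch below illustrates this mechanism for a tabular MDP using the hard-bootstrapping variant and a Hoeffding-style count-based uncertainty; it is not the paper's Adv-Soft-SPIBB or Lower-Approx-Soft-SPIBB algorithm, and the names hoeffding_uncertainty, spibb_greedy_step, and the threshold n_wedge are illustrative assumptions.

```python
# Minimal sketch of the SPIBB idea (hard bootstrapping) on a tabular MDP.
# Names and the count threshold n_wedge are illustrative, not the paper's method.
import numpy as np


def hoeffding_uncertainty(counts, delta=0.05):
    """Hoeffding-style error bound per state-action pair; shrinks as counts grow."""
    safe_counts = np.maximum(counts, 1)  # avoid division by zero for unseen pairs
    return np.sqrt(2.0 * np.log(2.0 / delta) / safe_counts)


def spibb_greedy_step(q, pi_b, counts, n_wedge=10):
    """One greedy improvement step that bootstraps the baseline on rare pairs.

    q:       (S, A) action-value estimates under the baseline policy
    pi_b:    (S, A) baseline (behavior) policy
    counts:  (S, A) visit counts of each state-action pair in the batch
    n_wedge: count threshold below which baseline probabilities are kept
    """
    n_states, _ = q.shape
    pi = np.zeros_like(pi_b)
    for s in range(n_states):
        rare = counts[s] < n_wedge
        pi[s, rare] = pi_b[s, rare]              # keep baseline mass on uncertain pairs
        frequent = np.flatnonzero(~rare)
        if frequent.size == 0:
            pi[s] = pi_b[s]                      # nothing well observed: stay on baseline
            continue
        best = frequent[np.argmax(q[s, frequent])]
        pi[s, best] += 1.0 - pi[s, rare].sum()   # act greedily on well-observed actions
    return pi


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 4, 3
    counts = rng.integers(0, 30, size=(S, A))
    q = rng.normal(size=(S, A))
    pi_b = np.full((S, A), 1.0 / A)
    # Soft variants would use an error bound like this to limit deviation from pi_b.
    print("uncertainty:\n", hoeffding_uncertainty(counts))
    pi = spibb_greedy_step(q, pi_b, counts)
    print("row sums:", pi.sum(axis=1))           # each row remains a probability distribution
```

The soft variants discussed in the paper replace the hard count threshold with a constraint that bounds the uncertainty-weighted deviation of the new policy from the baseline across all actions of a state.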


Notes

  1. https://github.com/Philipp238/Safe-Policy-Improvement-Approaches-on-Discrete-Markov-Decision-Processes.

  2. https://github.com/Philipp238/Safe-Policy-Improvement-Approaches-on-Discrete-Markov-Decision-Processes/blob/master/auxiliary_tests/assumption_test.py.


Acknowledgements

FD was partly funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project 468830823. PS, CO and SU were partly funded by German Federal Ministry of Education and Research, project 01IS18049A (ALICE III).

Author information

Corresponding author

Correspondence to Philipp Scholl.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Scholl, P., Dietrich, F., Otte, C., Udluft, S. (2022). Safe Policy Improvement Approaches and Their Limitations. In: Rocha, A.P., Steels, L., van den Herik, J. (eds) Agents and Artificial Intelligence. ICAART 2022. Lecture Notes in Computer Science (LNAI), vol. 13786. Springer, Cham. https://doi.org/10.1007/978-3-031-22953-4_4

  • DOI: https://doi.org/10.1007/978-3-031-22953-4_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-22952-7

  • Online ISBN: 978-3-031-22953-4

  • eBook Packages: Computer Science, Computer Science (R0)
