DOI: 10.1145/3437992.3439927
CertRL: formalizing convergence proofs for value and policy iteration in Coq

Published: 20 January 2021

Abstract

Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward. The desire to use reinforcement learning in safety-critical settings inspires a recent line of work on formally constrained reinforcement learning; however, these methods place the implementation of the learning algorithm in their Trusted Computing Base. The crucial correctness property of these implementations is a guarantee that the learning algorithm converges to an optimal policy.
This paper begins the work of closing this gap by developing a Coq formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes. The central results are a formalization of the Bellman optimality principle and its proof, which uses a contraction property of the Bellman optimality operator to establish that a sequence converges in the infinite horizon limit. The CertRL development exemplifies how the Giry monad and mechanized metric coinduction streamline optimality proofs for reinforcement learning algorithms. The CertRL library provides a general framework for proving properties about Markov decision processes and reinforcement learning algorithms, paving the way for further work on the formalization of reinforcement learning algorithms.
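The value iteration algorithm whose convergence the paper formalizes can be sketched informally as follows. This is not the Coq development itself: the toy two-state MDP, the names, and the tolerance below are illustrative assumptions, chosen only to show the contraction-driven convergence of the Bellman optimality operator.

```python
# Illustrative sketch of value iteration on a tiny finite MDP.
# P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9  # discount factor; the Bellman operator is a GAMMA-contraction


def bellman_update(V):
    """One application of the Bellman optimality operator T."""
    return {
        s: max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)
            for outcomes in P[s].values()
        )
        for s in P
    }


def value_iteration(tol=1e-10):
    """Iterate T from the zero value function until successive
    iterates are within tol in the sup norm."""
    V = {s: 0.0 for s in P}
    while True:
        V_new = bellman_update(V)
        # Banach fixed-point argument: the sup-norm distance between
        # successive iterates shrinks by a factor of GAMMA each step.
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new


V_star = value_iteration()
```

At the fixed point, the greedy policy with respect to `V_star` is optimal in the infinite-horizon limit; CertRL's contribution is to make this contraction and limit argument fully formal via metric coinduction rather than relying on an informal proof.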

Supplementary Material

Auxiliary Archive (poplws21cppmain-p30-p-archive.zip)
This submission contains the source code of the formalization on which the paper is based.



Published In

CPP 2021: Proceedings of the 10th ACM SIGPLAN International Conference on Certified Programs and Proofs
January 2021
342 pages
ISBN:9781450382991
DOI:10.1145/3437992

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Coinduction
  2. Formal Verification
  3. Policy Iteration
  4. Reinforcement Learning
  5. Value Iteration

Qualifiers

  • Research-article

Conference

CPP '21

Acceptance Rates

Overall acceptance rate: 18 of 26 submissions (69%)
