DOI: 10.1145/3437992.3439927
CertRL: formalizing convergence proofs for value and policy iteration in Coq

Published: 20 January 2021

Abstract

Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward. The desire to use reinforcement learning in safety-critical settings inspires a recent line of work on formally constrained reinforcement learning; however, these methods place the implementation of the learning algorithm in their Trusted Computing Base. The crucial correctness property of these implementations is a guarantee that the learning algorithm converges to an optimal policy.
This paper begins the work of closing this gap by developing a Coq formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes. The central results are a formalization of the Bellman optimality principle and its proof, which uses a contraction property of the Bellman optimality operator to establish that a sequence converges in the infinite horizon limit. The CertRL development exemplifies how the Giry monad and mechanized metric coinduction streamline optimality proofs for reinforcement learning algorithms. The CertRL library provides a general framework for proving properties about Markov decision processes and reinforcement learning algorithms, paving the way for further work on the formalization of reinforcement learning algorithms.
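The value iteration algorithm whose convergence the paper formalizes can be sketched informally as follows. This is not the Coq development itself: the toy two-state MDP, the names, and the tolerance below are illustrative assumptions, chosen only to show the contraction-driven convergence of the Bellman optimality operator.

```python
# Illustrative sketch of value iteration on a tiny finite MDP.
# P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9  # discount factor; the Bellman operator is a GAMMA-contraction


def bellman_update(V):
    """One application of the Bellman optimality operator T."""
    return {
        s: max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)
            for outcomes in P[s].values()
        )
        for s in P
    }


def value_iteration(tol=1e-10):
    """Iterate T from the zero value function until successive
    iterates are within tol in the sup norm."""
    V = {s: 0.0 for s in P}
    while True:
        V_new = bellman_update(V)
        # Banach fixed-point argument: the sup-norm distance between
        # successive iterates shrinks by a factor of GAMMA each step.
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new


V_star = value_iteration()
```

At the fixed point, the greedy policy with respect to `V_star` is optimal in the infinite-horizon limit; CertRL's contribution is to make this contraction and limit argument fully formal via metric coinduction rather than relying on an informal proof.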

Supplementary Material

Auxiliary Archive (poplws21cppmain-p30-p-archive.zip)
This submission contains the source code of the formalization on which the paper is based.



Published In

CPP 2021: Proceedings of the 10th ACM SIGPLAN International Conference on Certified Programs and Proofs
January 2021
342 pages
ISBN:9781450382991
DOI:10.1145/3437992

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Coinduction
  2. Formal Verification
  3. Policy Iteration
  4. Reinforcement Learning
  5. Value Iteration

Qualifiers

  • Research-article

Conference

CPP '21

Acceptance Rates

Overall acceptance rate: 18 of 26 submissions (69%)
