Sim-to-Lab-to-Real: Safe reinforcement learning with shielding and generalization guarantees

Published: 01 January 2023

Abstract

Safety is a critical requirement for autonomous systems and remains a key obstacle to deploying learning-based policies in the real world. In particular, policies learned with reinforcement learning often fail to generalize to novel environments due to unsafe behavior. In this paper, we propose Sim-to-Lab-to-Real to bridge the reality gap with a probabilistically guaranteed, safety-aware policy distribution. To improve safety, we apply a dual-policy setup: a performance policy is trained using the cumulative task reward, and a backup (safety) policy is trained by solving the Safety Bellman Equation based on Hamilton-Jacobi (HJ) reachability analysis. In Sim-to-Lab transfer, we apply a supervisory control scheme to shield unsafe actions during exploration; in Lab-to-Real transfer, we leverage the Probably Approximately Correct (PAC)-Bayes framework to provide lower bounds on the expected performance and safety of policies in unseen environments. Moreover, because the safety policy inherits from HJ reachability analysis, the bound accounts for the expectation over the worst-case safety in each environment. We empirically study the proposed framework for ego-vision navigation in two types of indoor environments with varying degrees of photorealism, and demonstrate strong generalization performance through hardware experiments in real indoor spaces with a quadrupedal robot. See https://sites.google.com/princeton.edu/sim-to-lab-to-real for supplementary material.
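
The supervisory shielding scheme the abstract describes can be sketched compactly. Below is a minimal Python illustration of value-based shielding under stated assumptions: a safety critic q_safety(obs, action) trained via the Safety Bellman Equation (where, following the HJ reachability convention assumed here, values above a threshold predict eventual constraint violation), a task-reward policy perf_policy, and a backup policy backup_policy. All of these names and the threshold eps are illustrative, not the paper's actual interfaces.

```python
def shielded_action(obs, perf_policy, backup_policy, q_safety, eps=0.0):
    """Supervisory shield: execute the performance policy's action unless
    the safety critic predicts it leads to a constraint violation."""
    a_perf = perf_policy(obs)
    # The safety critic approximates the worst-case future safety margin
    # from (obs, a_perf); a value above the threshold flags the action
    # as unsafe under the sign convention assumed above.
    if q_safety(obs, a_perf) > eps:
        return backup_policy(obs)  # fall back to the safety policy
    return a_perf
```

During Sim-to-Lab training, a shield of this kind filters exploratory actions, so the agent keeps learning from the task reward while the backup policy intervenes only when the critic rejects the proposed action.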
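For the Lab-to-Real guarantee, the abstract invokes the PAC-Bayes framework. As one concrete form, a standard McAllester-style bound (not necessarily the exact inequality used in the paper) reads: with probability at least 1 - \delta over the draw of N Lab environments, a posterior policy distribution P with prior P_0 satisfies

```latex
R(P) \;\ge\; \hat{R}(P) \;-\;
\sqrt{\frac{\mathrm{KL}(P \,\|\, P_0) + \ln\frac{2\sqrt{N}}{\delta}}{2N}}
```

where \hat{R}(P) is the empirical success (or safety) rate over the N training environments and R(P) its expectation over unseen environments from the same distribution; maximizing the right-hand side over P yields the certified lower bound on expected performance and safety.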

Published In

Artificial Intelligence, Volume 314, Issue C
January 2023, 467 pages

Publisher

Elsevier Science Publishers Ltd., United Kingdom


Author Tags

1. Reinforcement learning
2. Sim-to-Real transfer
3. Safety analysis
4. Generalization

Qualifiers

• Research-article
