Abstract
As robots acquire increasingly sophisticated skills and see increasingly complex and varied environments, the threat of an edge case or anomalous failure is ever present. For example, Tesla cars have seen interesting failure modes ranging from autopilot disengagements due to inactive traffic lights carried by trucks to phantom braking caused by images of stop signs on roadside billboards. These system-level failures are not due to failures of any individual component of the autonomy stack, but rather to deficiencies in the system's semantic reasoning. Such edge cases, which we call semantic anomalies, are simple for a human to disentangle yet require insightful reasoning. To this end, we study the application of large language models (LLMs), endowed with broad contextual understanding and reasoning capabilities, to recognize such edge cases and introduce a monitoring framework for semantic anomaly detection in vision-based policies. Our experiments apply this framework to a finite state machine policy for autonomous driving and a learned policy for object manipulation. These experiments demonstrate that the LLM-based monitor can effectively identify semantic anomalies in a manner that agrees with human reasoning. Finally, we provide an extended discussion on the strengths and weaknesses of this approach and motivate a research outlook on how we can further use foundation models for semantic anomaly detection. Our project webpage can be found at https://sites.google.com/view/llm-anomaly-detection.
Availability of data and materials
Relevant documentation, data, and/or code are readily available upon request to verify the validity of the presented results.
Notes
Although we use YOLOv8 (Jocher et al., 2023) in our vehicle planner, we find that DETR yields similar performance. We apply the baselines to DETR because it is trained on the same dataset as YOLOv8 and its architecture is more amenable to traditional OOD detectors.
This task is adapted from the put-blocks-in-bowl task defined by Shridhar et al. (2021).
In these experiments, we generated the scene descriptions using privileged simulator information. In principle, an object detector could have been used to identify the objects involved in our experiments; however, we found that the simulator visuals were not amenable to pretrained detection models.
References
Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D., Liu, L., Ghavamzadeh, M., Fieguth, P., Cao, X., Khosravi, A., Acharya, U. R., Makarenkov, V., & Nahavandi, S. (2021). A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76, 243–297.
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). Flamingo: A visual language model for few-shot learning. In Advances in neural information processing systems.
Amini, A., Schwarting, W., Soleimany, A., & Rus, D. (2020). Deep evidential regression. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Advances in neural information processing systems, (vol. 33, pp. 14927–14937). Curran Associates, Inc.
Antonante, P., Spivak, D. I., & Carlone, L. (2021). Monitoring and diagnosability of perception systems. In 2021 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 168–175).
Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.
Banerjee, S., Sharma, A., Schmerling, E., Spolaor, M., Nemerouf, M., & Pavone, M. (2023). Data lifecycle management in evolving input distributions for learning-based aerospace applications. In L. Karlinsky, T. Michaeli, & K. Nishino (Eds.), Computer vision–ECCV 2022 workshops (pp. 127–142). Cham: Springer.
Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., Ibarz, J., Irpan, A., Jang, E., Julian, R., et al. (2023). Do as I can, not as I say: Grounding language in robotic affordances. In Conference on robot learning (pp. 287–318). PMLR.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. In Advances in neural information processing systems, (Vol. 33, pp. 1877–1901).
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European conference on computer vision (pp. 213–229). Springer.
Chen, W., Hu, S., Talak, R., & Carlone, L. (2022). Leveraging large language models for robot 3d scene understanding.
Cui, Y., Niekum, S., Gupta, A., Kumar, V., & Rajeswaran, A. (2022). Can foundation models perform zero-shot task specification for robot manipulation? In Learning for dynamics and control conference (pp. 893–905). PMLR.
Daftry, S., Zeng, S., Bagnell, J. A., & Hebert, M. (2016). Introspective perception: Learning to predict failures in vision systems. In 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1743–1750).
de Haan, P., Jayaraman, D., & Levine, S. (2019). Causal confusion in imitation learning. In Advances in neural information processing systems (Vol. 32). Curran Associates, Inc.
De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., & Tuytelaars, T. (2022). A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7), 3366–3385.
Denouden, T., Salay, R., Czarnecki, K., Abdelzad, V., Phan, B., & Vernekar, S. (2018). Improving reconstruction autoencoder out-of-distribution detection with mahalanobis distance. arXiv preprint arXiv:1812.02765.
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., & Koltun, V. (2017). Carla: An open urban driving simulator. In Conference on robot learning (pp. 1–16). PMLR.
Downs, L., Francis, A., Koenig, N., Kinman, B., Hickman, R., Reymann, K., McHugh, T. B, & Vanhoucke, V. (2022). Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 international conference on robotics and automation (ICRA) (pp. 2553–2560). IEEE.
Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., & Florence, P. (2023). PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In M. F. Balcan & K. Q. Weinberger (Eds.), Proceedings of the 33rd international conference on machine learning, volume 48 of proceedings of machine learning research (pp. 1050–1059). PMLR.
Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., & Wichmann, F. A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11), 665–673.
Gomez-Donoso, F., Castano-Amoros, J., Escalona, F., & Cazorla, M. (2023). Three-dimensional reconstruction using SFM for actual pedestrian classification. Expert Systems with Applications, 213, 119006.
Gulrajani, I., & Lopez-Paz, D. (2021). In search of lost domain generalization. In International conference on learning representations.
Hendrycks, D., & Gimpel, K. (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International conference on learning representations.
Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., Zeng, A., Tompson, J., Mordatch, I., Chebotar, Y., et al. (2022). Inner monologue: Embodied reasoning through planning with language models. In 6th annual conference on robot learning.
Japkowicz, N., Myers, C. E., & Gluck, M. A. (1995). A novelty detection approach to classification. In International joint conference on artificial intelligence.
Jocher, G., Chaurasia, A., & Qiu, J. (2023). YOLO by Ultralytics.
Koh, P. W., et al. (2021). WILDS: A benchmark of in-the-wild distribution shifts. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th international conference on machine learning, volume 139 of proceedings of machine learning research (pp. 5637–5664). PMLR.
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. In Advances in neural information processing systems.
Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems. (Vol. 30). Curran Associates Inc.
Lee, M. A., Tan, M., Zhu, Y., & Bohg, J. (2021). Detect, reject, correct: Crossmodal compensation of corrupted sensors. In 2021 IEEE international conference on robotics and automation (ICRA) (pp. 909–916).
Lee, K., Lee, K., Lee, H., & Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems. (Vol. 31). Curran Associates Inc.
Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning (pp. 12888–12900). PMLR.
Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Florence, P., Zeng, A., et al. (2022). Code as policies: Language model programs for embodied control. In Workshop on language and robotics at CoRL 2022.
Liang, S., Li, Y., & Srikant, R. (2018). Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th international conference on learning representations, ICLR 2018.
Lin, K., Agia, C., Migimatsu, T., Pavone, M., & Bohg, J. (2023). Text2motion: From natural language instructions to feasible plans. arXiv preprint arXiv:2303.12153.
Lin, Z., Roy, S. D., & Li, Y. (2021). Mood: Multi-level out-of-distribution detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15313–15323).
Liu, J., Lin, Z., Padhy, S., Tran, D., Bedrax Weiss, T., & Lakshminarayanan, B. (2020). Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Advances in neural information processing systems (Vol. 33, pp. 7498–7512). Curran Associates, Inc.
Liu, Z., Bahety, A., & Song, S. (2023). Reflect: Summarizing robot experiences for failure explanation and correction. arXiv preprint arXiv:2306.15724.
Madaan, A., Zhou, S., Alon, U., Yang, Y., & Neubig, G. (2022). Language models of code are few-shot commonsense learners. arXiv preprint arXiv:2210.07128.
McAllister, R., Kahn, G., Clune, J., & Levine, S. (2019). Robustness to out-of-distribution inputs via task-aware generative uncertainty. In 2019 international conference on robotics and automation (ICRA) (pp. 2083–2089).
Michels, F., Adaloglou, N., Kaiser, T., & Kollmann, M. (2023). Contrastive language-image pretrained (CLIP) models are powerful out-of-distribution detectors.
Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kipf, T., & Houlsby, N. (2022). Simple open-vocabulary object detection with vision transformers. In ECCV.
OpenAI. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Osband, I., Wen, Z., Asghari, S. M., Dwaracherla, V., Ibrahimi, M., Lu, X., & Van Roy, B. (2023). Epistemic neural networks.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J. V., Lakshminarayanan, B., & Snoek, J. (2019). Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Proceedings of the 33rd international conference on neural information processing systems, Red Hook, NY, USA. Curran Associates Inc.
Oza, P., & Patel, V. M. (2019). C2ae: Class conditioned auto-encoder for open-set recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2307–2316).
Rabiee, S., & Biswas, J. (2019). IVOA: Introspective vision for obstacle avoidance. In 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1230–1235). IEEE Press.
Ren, A. Z., Dixit, A., Bodrova, A., Singh, S., Tu, S., Brown, N., Xu, P., Takayama, L., Xia, F., Varley, J., et al. (2023). Robots that ask for help: Uncertainty alignment for large language model planners. arXiv preprint arXiv:2307.01928.
Richter, C., & Roy, N. (2017). Safe visual navigation via deep learning and novelty detection. In Robotics: Science and systems (RSS).
Ritter, H., Botev, A., & Barber, D. (2018). A scalable Laplace approximation for neural networks. In 6th international conference on learning representations (ICLR 2018).
Rosinol, A., Violette, A., Abate, M., Hughes, N., Chang, Y., Shi, J., Gupta, A., & Carlone, L. (2021). Kimera: From slam to spatial perception with 3d dynamic scene graphs. The International Journal of Robotics Research, 40(12–14), 1510–1546.
Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Müller, E., & Kloft, M. (2018). Deep one-class classification. In Proceedings of the 35th international conference on machine learning, volume 80 of proceedings of machine learning research (pp. 4393–4402). PMLR.
Ruff, L., Kauffmann, J. R., Vandermeulen, R. A., Montavon, G., Samek, W., Kloft, M., Dietterich, T. G., & Müller, K.-R. (2021). A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE, 109(5), 756–795.
Salehi, M., Mirzaei, H., Hendrycks, D., Li, Y., Rohban, M. H., & Sabokrou, M. (2021). A unified survey on anomaly, novelty, open-set, and out-of-distribution detection: Solutions and future challenges.
Shah, D., Osiński, B., Levine, S., et al. (2023). LM-NAV: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on robot learning (pp. 492–504). PMLR.
Sharma, A., Azizan, N., & Pavone, M. (2021). Sketching curvature for efficient out-of-distribution detection for deep neural networks. In Uncertainty in artificial intelligence (pp. 1958–1967). PMLR.
Shridhar, M., Manuelli, L., & Fox, D. (2021). Cliport: What and where pathways for robotic manipulation. In Proceedings of the 5th conference on robot learning (CoRL).
Sinha, R., Sharma, A., Banerjee, S., Lew, T., Luo, R., Richards, S. M., Sun, Y., Schmerling, E., & Pavone, M. (2022). A system-level view on out-of-distribution data in robotics. arXiv preprint arXiv:2212.14020.
Srivastava, M., Goodman, N., & Sadigh, D. (2023). Generating language corrections for teaching physical control tasks. arXiv preprint arXiv:2306.07012.
Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In CVPR, 2011 (pp. 1521–1528).
Volk, G., Müller, S., von Bernuth, A., Hospach, D., & Bringmann, O. (2019). Towards robust CNN-based object detection through augmentation with synthetic rain variations. In 2019 IEEE intelligent transportation systems conference (ITSC) (pp. 285–292).
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E. H., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in neural information processing systems.
Wilson, G., & Cook, D. J. (2020). A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology, 11(5), 1–46.
Yang, J., Zhou, K., Li, Y., & Liu, Z. (2021). Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334.
Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., & Darrell, T. (2020). BDD100K: A diverse driving dataset for heterogeneous multitask learning. In IEEE/CVF conference on computer vision and pattern recognition (CVPR).
Zeng, A., Florence, P., Tompson, J., Welker, S., Chien, J., Attarian, M., Armstrong, T., Krasin, I., Duong, D., Sindhwani, V., & Lee, J. (2020). Transporter networks: Rearranging the visual world for robotic manipulation. In Conference on robot learning (CoRL).
Zeng, A., Wong, A., Welker, S., Choromanski, K., Tombari, F., Purohit, A., Ryoo, M., Sindhwani, V., Lee, J., Vanhoucke, V., et al. (2022). Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598.
Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu, C., Cho, D., & Chen, H. (2018). Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International conference on learning representations.
Funding
The NASA University Leadership Initiative (Grant #80NSSC20M0163) provided funds to assist the authors with their research. Amine Elhafsi is supported by a NASA NSTGRO fellowship (Grant #80NSSC19K1143). This article solely reflects the opinions and conclusions of its authors and not any NASA entity.
Author information
Contributions
AE initiated the project, developed the methodology, performed prompt tuning, and implemented and conducted the experiments. RS prepared the structure for the CARLA autonomous vehicle stack, conducted autonomous vehicle experiments, computed autoencoder OOD detector baseline metrics, processed experimental results, and performed data analysis. CA implemented the autoencoder OOD detector baseline for the learned policy experiments. ES implemented the autonomous vehicle traffic light classification, performed data analysis, and advised the project. IADN advised the project. MP was the primary advisor for the project. The manuscript was jointly written by AE, RS, and ES. All authors reviewed and revised the manuscript.
Ethics declarations
Conflict of interest
Not applicable.
Consent to participate
Not applicable.
Consent for publication
The authors unanimously endorsed the content and provided explicit consent for submission. They also ensured that consent was obtained from the responsible authorities at the institute(s)/organization(s) where the research was conducted.
Code availability
Code and data will be provided at https://sites.google.com/view/llm-anomaly-detection.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Additional details: reasoning-based policy
The following template was designed to prompt an analysis of the autonomous vehicle's scene observations. Placeholders, indicated by braces, are substituted with the relevant information at each query.
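The template itself is presented as a figure in the paper. As a rough illustration only, the following is a minimal Python sketch of how such a template might be constructed and filled at each query; the prompt wording, placeholder names, and helper function are our assumptions, not the authors' exact prompt.

```python
# A minimal sketch, not the authors' actual template: the prompt wording,
# placeholder names, and helper function below are illustrative assumptions.

DRIVING_PROMPT_TEMPLATE = """\
You are monitoring an autonomous vehicle for semantic anomalies.
The vehicle's policy: {policy_description}

The vehicle currently observes the following objects:
{scene_observation}

Reason step by step about whether any of these observations could cause
the policy to misbehave, then conclude with 'anomaly' or 'nominal'.
"""

def build_driving_prompt(policy_description: str, scene_observation: str) -> str:
    """Substitute the braced placeholders with the per-query information."""
    return DRIVING_PROMPT_TEMPLATE.format(
        policy_description=policy_description,
        scene_observation=scene_observation,
    )

# Example query, using one of the motivating failure modes:
prompt = build_driving_prompt(
    policy_description="Stops for red traffic lights and stop signs.",
    scene_observation=(
        "- a truck ahead carrying inactive traffic lights\n"
        "- a roadside billboard showing a stop sign"
    ),
)
```

Separating the static policy description from the per-step scene observation lets the same monitor template be reused across time steps and policies.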
Appendix B Additional experimental details: learned policy
B.1 Prompt template
The following prompt was designed to elicit from the LLM a comparison of the distractor objects with the blocks and bowls. Placeholders, indicated by braces, are substituted with the relevant information at each query.
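As in Appendix A, the exact template appears in the paper; the sketch below only illustrates, under assumed placeholder names and wording, how a scene description might be substituted into such a comparison prompt.

```python
# A minimal sketch under assumed placeholder names; the paper's appendix
# contains the actual template.

MANIPULATION_PROMPT_TEMPLATE = """\
A robot must put the {block_color} blocks in the {bowl_color} bowl.
The scene contains the following objects:
{object_list}

For each object that is not a task block or bowl, compare its color and
shape to the blocks and bowls. Could it plausibly be mistaken for either?
Answer for each object, then conclude with 'anomaly' or 'nominal'.
"""

def build_manipulation_prompt(block_color: str, bowl_color: str, objects: list[str]) -> str:
    """Fill the braced placeholders from the scene description."""
    object_list = "\n".join(f"- {obj}" for obj in objects)
    return MANIPULATION_PROMPT_TEMPLATE.format(
        block_color=block_color,
        bowl_color=bowl_color,
        object_list=object_list,
    )

# Example with a semantic distractor (an apple resembling a red block):
prompt = build_manipulation_prompt("red", "blue", ["red block", "blue bowl", "red apple"])
```

In our experiments, the object list was populated from privileged simulator information, as noted earlier.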
We chose to abstain from few-shot prompting for this set of experiments. The common household object classes used as distractors are far more diverse than driving-related object classes, such as traffic lights and signs, which are standardized to some degree; this diversity necessitated zero-shot reasoning by the LLM. The zero-shot prompting strategy encouraged the LLM to leverage its inherent knowledge of common objects more effectively. In contrast, when few-shot prompted, the responses tended to overfit to the provided examples, negatively impacting the LLM's function as a monitor.
B.2 Semantic and neutral distractors
See Table 7.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Elhafsi, A., Sinha, R., Agia, C. et al. Semantic anomaly detection with large language models. Auton Robot 47, 1035–1055 (2023). https://doi.org/10.1007/s10514-023-10132-6