Just Accepted

Deep Learning Library Testing: Definition, Methods and Challenges

Online AM: 05 February 2025


Software systems powered by deep learning (DL) have come to support many aspects of people’s lives. As the backbone of these systems, DL libraries carry out the underlying optimization and computation. However, like traditional software, DL libraries are not immune to bugs. These bugs can propagate to the programs and software built on DL libraries, posing serious threats to users’ personal property and safety. Studying the characteristics of DL libraries, their associated bugs, and the corresponding testing methods is therefore crucial for improving the security of DL systems and advancing the widespread application of DL technology. This paper provides an overview of testing research on DL libraries, discusses the strengths and weaknesses of existing methods, and offers guidance and reference for applying DL library testing methods. It first introduces the workflow of the underlying DL libraries and the characteristics of the three kinds of DL libraries involved, namely DL frameworks, DL compilers, and DL hardware libraries. It then constructs a literature collection pipeline and comprehensively summarizes existing testing methods for these DL libraries to analyze their effectiveness and limitations. Finally, it reports findings and the challenges that existing DL library testing faces in real-world applications, as directions for future research.


2010. IEEE Standard Classification for Software Anomalies. IEEE Std 1044-2009 (Revision of IEEE Std 1044-1993) (2010), 1–23.
2024. Our repository with more experiment details. https://github.com/shiningrain/CSUR_DL_library_survey.
Marcel Aach, Eray Inanc, Rakesh Sarma, Morris Riedel, and Andreas Lintermann. 2023. Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks. Journal of Big Data 10, 1 (2023), 1–23.
Martín Abadi. 2016. TensorFlow: learning functions at scale. In Proceedings of the 21st ACM SIGPLAN international conference on functional programming. 1–1.
Aitor Arrieta. 2022. Multi-objective metamorphic follow-up test case selection for deep learning systems. In Proceedings of the Genetic and Evolutionary Computation Conference. 1327–1335.
Tal Ben-Nun, Maciej Besta, Simon Huber, Alexandros Nikolaos Ziogas, Daniel Peter, and Torsten Hoefler. 2019. A modular benchmarking infrastructure for high-performance and reproducible deep learning. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 66–77.
Houssem Ben Braiek and Foutse Khomh. 2020. On testing machine learning programs. Journal of Systems and Software 164 (2020), 110542.
Junming Cao, Bihuan Chen, Chao Sun, Longjie Hu, Shuaihong Wu, and Xin Peng. 2022. Understanding performance problems in deep learning systems. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 357–369.
Jinyin Chen, Chengyu Jia, Yunjie Yan, Jie Ge, Haibin Zheng, and Yao Cheng. 2024. A Miss Is as Good as A Mile: Metamorphic Testing for Deep Learning Operators. Proceedings of the ACM on Software Engineering 1, FSE(2024), 2005–2027.
Junjie Chen, Yihua Liang, Qingchao Shen, Jiajun Jiang, and Shuochuan Li. 2023. Toward understanding deep learning framework bugs. ACM Transactions on Software Engineering and Methodology 32, 6(2023), 1–31.
Junjie Chen and Chenyao Suo. 2022. Boosting compiler testing via compiler optimization exploration. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 4(2022), 1–33.
Junjie Chen, Chenyao Suo, Jiajun Jiang, Peiqi Chen, and Xingjian Li. 2023. Compiler test-program generation via memoized configuration search. In Proc. 45th International Conference on Software Engineering.
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. {TVM}: An automated {End-to-End} optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594.
Tsong Y Chen, Shing C Cheung, and Shiu Ming Yiu. 1998. Metamorphic testing: a new approach for generating next test cases. technical report hkust-cs98-01(1998).
Zhenpeng Chen, Yanbin Cao, Yuanqiang Liu, Haoyu Wang, Tao Xie, and Xuanzhe Liu. 2020. A comprehensive study on challenges in deploying deep learning based software. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 750–762.
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759(2014).
Neophytos Christou, Di Jin, Vaggelis Atlidakis, Baishakhi Ray, and Vasileios P Kemerlis. 2023. IvySyn: Automated Vulnerability Discovery in Deep Learning Frameworks. In 32nd USENIX Security Symposium (USENIX Security 23). 2383–2400.
Di Cui, Xingyu Li, Feiyang Liu, Siqi Wang, Jie Dai, Lu Wang, and Qingshan Li. 2022. Towards Demystifying the Impact of Dependency Structures on Bug Locations in Deep Learning Libraries. In Proceedings of the 16th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. 249–260.
Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT international symposium on software testing and analysis. 423–435.
Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2024. Large Language Models are Edge-Case Generators: Crafting Unusual Programs for Fuzzing Deep Learning Libraries. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 70, 13 pages.
Yinlin Deng, Chenyuan Yang, Anjiang Wei, and Lingming Zhang. 2022. Fuzzing deep-learning libraries via automated relational api inference. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 44–56.
Junhua Ding, Xiaojun Kang, and Xin-Hua Hu. 2017. Validating a deep learning framework by metamorphic testing. In 2017 IEEE/ACM 2nd International Workshop on Metamorphic Testing (MET). IEEE, 28–34.
Jack Dongarra, Sven Hammarling, Nicholas J Higham, Samuel D Relton, Pedro Valero-Lara, and Mawussi Zounon. 2017. The design and performance of batched BLAS on modern high-performance computing systems. Procedia Computer Science 108 (2017), 495–504.
Xiaoting Du, Yulei Sui, Zhihao Liu, and Jun Ai. 2022. An empirical study of fault triggers in deep learning frameworks. IEEE Transactions on Dependable and Secure Computing (2022).
Xiaoting Du, Zheng Zheng, Lei Ma, and Jianjun Zhao. 2021. An Empirical Study on Common Bugs in Deep Learning Compilers. In 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE). IEEE, 184–195.
Khashayar Etemadi, Bardia Mohammadi, Zhendong Su, and Martin Monperrus. 2024. Mokav: Execution-driven Differential Testing with LLMs. arXiv preprint arXiv:2406.10375(2024).
Jian Ge, Huiqun Yu, Guisheng Fan, Jianhao Tang, and Zijie Huang. 2023. Just-In-Time Defect Prediction for Intellignet Computing Frameworks (in Chinese). Journal of Software 34, 9 (2023), 0–0.
Bahar Gezici and Ayça Kolukısa Tarhan. 2022. Systematic literature review on software quality for AI-based software. Empirical Software Engineering 27, 3 (2022), 66.
Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. 2020. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics 37, 3 (2020), 362–386.
Alex Groce, Gerard Holzmann, and Rajeev Joshi. 2007. Randomized differential testing as a prelude to formal verification. In 29th International Conference on Software Engineering (ICSE’07). IEEE, 621–631.
Diandian Gu, Yining Shi, Haozhe Liu, Ge Wu, Haiou Jiang, Yaoshuai Zhao, and Yun Ma. 2022. Defect Detection for Deep Learning Frameworks Based on Meta Operators (in Chinese). Chinese Journal Of Computers 45, 2 (2022), 240–255.
Jiazhen Gu, Xuchuan Luo, Yangfan Zhou, and Xin Wang. 2022. Muffin: Testing deep learning libraries via neural architecture fuzzing. In Proceedings of the 44th International Conference on Software Engineering. 1418–1430.
Qianyu Guo, Sen Chen, Xiaofei Xie, Lei Ma, Qiang Hu, Hongtao Liu, Yang Liu, Jianjun Zhao, and Xiaohong Li. 2019. An empirical study towards characterizing deep learning development and deployment across different frameworks and platforms. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 810–822.
Qianyu Guo, Xiaofei Xie, Yi Li, Xiaoyu Zhang, Yang Liu, Xiaohong Li, and Chao Shen. 2020. Audee: Automated testing for deep learning frameworks. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 486–498.
Nima Shiri Harzevili, Jiho Shin, Junjie Wang, Song Wang, and Nachiappan Nagappan. 2023. Characterizing and understanding software security vulnerabilities in machine learning libraries. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR). IEEE, 27–38.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
Yi He, Takumi Uezono, and Yanjing Li. 2021. Efficient functional in-field self-test for deep learning accelerators. In 2021 IEEE International Test Conference (ITC). IEEE, 93–102.
Steffen Herbold and Tobias Haar. 2022. Smoke testing for machine learning: simple tests to discover severe bugs. Empirical Software Engineering 27, 2 (2022), 45.
Shuo Hong, Hailong Sun, Xiang Gao, and Shin Hwei Tan. 2024. Investigating and Detecting Silent Bugs in PyTorch Programs. In 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 272–283.
Soneya Binta Hossain and Matthew Dwyer. 2024. TOGLL: Correct and Strong Test Oracle Generation with LLMs. arXiv preprint arXiv:2405.03786(2024).
Qianchao Hu, Feng Wang, Binglin Liu, and Haitian Liu. 2023. Research on Deep Neural Network Testing Techniques. In Proceedings of the 2023 4th International Conference on Machine Learning and Computer Application. 113–119.
Kaifeng Huang, Bihuan Chen, Susheng Wu, Junming Cao, Lei Ma, and Xin Peng. 2023. Demystifying dependency bugs in deep learning stack. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 450–462.
Md Johirul Islam, Giang Nguyen, Rangeet Pan, and Hridesh Rajan. 2019. A comprehensive study on deep learning bug characteristics. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 510–520.
Jiahe Ji, Wei Kong, Jianwen Tian, Taotao Gu, Yuanping Nie, and Xiaohui Kuang. 2023. Survey on Fuzzing Techniques in Deep Learning Libraries. In 2023 8th International Conference on Data Science in Cyberspace (DSC). IEEE, 461–467.
Li Jia, Hao Zhong, and Linpeng Huang. 2021. The unit test quality of deep learning libraries: A mutation analysis. In 2021 IEEE international conference on software maintenance and evolution (ICSME). IEEE, 47–57.
Li Jia, Hao Zhong, Xiaoyin Wang, Linpeng Huang, and Zexuan Li. 2022. How Do Injected Bugs Affect Deep Learning?. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 793–804.
Li Jia, Hao Zhong, Xiaoyin Wang, Linpeng Huang, and Xuansheng Lu. 2020. An empirical study on bugs inside tensorflow. In Database Systems for Advanced Applications: 25th International Conference, DASFAA 2020, Jeju, South Korea, September 24–27, 2020, Proceedings, Part I 25. Springer, 604–620.
Li Jia, Hao Zhong, Xiaoyin Wang, Linpeng Huang, and Xuansheng Lu. 2021. The symptoms, causes, and repairs of bugs inside a deep learning library. Journal of Systems and Software 177 (2021), 110935.
Bo Jiang, Xiaoyan Wang, Wing Kwong Chan, TH Tse, Na Li, Yongfeng Yin, and Zhenyu Zhang. 2020. Cudasmith: A fuzzer for CUDA compilers. In 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 861–871.
Guoliang Jin, Linhai Song, Xiaoming Shi, Joel Scherpelz, and Shan Lu. 2012. Understanding and detecting real-world performance bugs. ACM SIGPLAN Notices 47, 6 (2012), 77–88.
Haifeng Jin, Qingquan Song, and Xia Hu. 2019. Auto-Keras: An Efficient Neural Architecture Search System. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1946–1956.
Marc Jorda, Pedro Valero-Lara, and Antonio J Pena. 2019. Performance evaluation of cudnn convolution algorithms on nvidia volta gpus. IEEE Access 7(2019), 70461–70473.
René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 international symposium on software testing and analysis. 437–440.
Hong Jin Kang, Pattarakrit Rattanukul, Stefanus Agus Haryono, Truong Giang Nguyen, Chaiyong Ragkhitwetsagul, Corina Pasareanu, and David Lo. 2022. SkipFuzz: Active Learning-based Input Selection for Fuzzing Deep Learning Libraries. arXiv preprint arXiv:2212.04038(2022).
K Ken. [n. d.]. Exclusive: surveillance footage of tesla crash on sf’s bay bridge hours after elon musk announces “self-driving” feature. https://theintercept.com/2023/01/10/tesla-crash-footage-autopilot/
Nikhil Ketkar and Nikhil Ketkar. 2017. Introduction to keras. Deep learning with python: a hands-on introduction (2017), 97–111.
Misoo Kim, Youngkyoung Kim, and Eunseok Lee. 2021. Denchmark: A bug benchmark of deep learning-related software. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). IEEE, 540–544.
Barbara Kitchenham, O Pearl Brereton, David Budgen, Mark Turner, John Bailey, and Stephen Linkman. 2009. Systematic literature reviews in software engineering–a systematic literature review. Information and software technology 51, 1 (2009), 7–15.
Barbara Kitchenham, Lech Madeyski, and David Budgen. 2022. SEGRESS: Software engineering guidelines for reporting secondary studies. IEEE Transactions on Software Engineering 49, 3 (2022), 1273–1298.
Eliska Kloberdanz, Kyle G Kloberdanz, and Wei Le. 2022. DeepStability: A study of unstable numerical methods and their solutions in deep learning. In Proceedings of the 44th International Conference on Software Engineering. 586–597.
Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. Quantifying the Carbon Emissions of Machine Learning. arXiv preprint arXiv:1910.09700(2019).
Zhongzheng Lai, Huaming Chen, Ruoxi Sun, Yu Zhang, Minhui Xue, and Dong Yuan. 2024. On Security Weaknesses and Vulnerabilities in Deep Learning Systems. arXiv preprint arXiv:2406.08688(2024).
Maksim Levental and Elena Orlova. 2020. Comparing the costs of abstraction for dl frameworks. arXiv preprint arXiv:2012.07163(2020).
Ang Li, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan R Tallent, and Kevin J Barker. 2019. Evaluating modern gpu interconnect: Pcie, nvlink, nv-sli, nvswitch and gpudirect. IEEE Transactions on Parallel and Distributed Systems 31, 1 (2019), 94–110.
Ang Li, Shuaiwen Leon Song, Jieyang Chen, Xu Liu, Nathan Tallent, and Kevin Barker. 2018. Tartan: evaluating modern GPU interconnect via a multi-GPU benchmark suite. In 2018 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 191–202.
Cheng Li, Abdul Dakkak, Jinjun Xiong, and Wen-mei Hwu. 2020. Benanza: Automatic μBenchmark Generation to Compute” Lower-bound” Latency and Inform Optimizations of Deep Learning Models on GPUs. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 440–450.
Hang Li. 2018. Deep learning for natural language processing: advantages and challenges. National Science Review 5, 1 (2018), 24–26.
Junqiang Li, Senyi Li, Jiawei Wu, Long Luo, Yang Bai, and Hongfang Yu. 2022. MMOS: Multi-Staged Mutation Operator Scheduling for Deep Learning Library Testing. In GLOBECOM 2022-2022 IEEE Global Communications Conference. IEEE, 6103–6108.
Meiziniu Li, Jialun Cao, Yongqiang Tian, Tsz On Li, Ming Wen, and Shing-Chi Cheung. 2023. Comet: Coverage-guided model generation for deep learning library testing. ACM Transactions on Software Engineering and Methodology 32, 5(2023), 1–34.
Mingzhen Li, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang, Zhongzhi Luan, Lin Gan, Guangwen Yang, and Depei Qian. 2020. The deep learning compiler: A comprehensive survey. IEEE Transactions on Parallel and Distributed Systems 32, 3 (2020), 708–727.
Xiaoting Li, Xiao Liu, Lingwei Chen, Rupesh Prajapati, and Dinghao Wu. 2022. ALPHAPROG: reinforcement generation of valid programs for compiler fuzzing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol.  36. 12559–12565.
Zengyang Li, Sicheng Wang, Wenshuo Wang, Peng Liang, Ran Mo, and Bing Li. 2023. Understanding bugs in multi-language deep learning frameworks. In 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC). IEEE, 328–338.
Jie Liang, Mingzhe Wang, Yuanliang Chen, Yu Jiang, and Renwei Zhang. 2018. Fuzz testing in practice: Obstacles and solutions. In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 562–566.
Yunkai Liang, Yun Lin, Xuezhi Song, Jun Sun, Zhiyong Feng, and Jin Song Dong. 2022. gDefects4DL: a dataset of general real-world deep learning program defects. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 90–94.
Shuyan Liao and Chun Shan. 2024. A PSO-based Method to Test Deep Learning Library at API Level. In Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering. 117–130.
Kuiliang Lin, Xiangpu Song, Yingpei Zeng, and Shanqing Guo. 2023. DeepDiffer: Find Deep Learning Compiler Bugs via Priority-guided Differential Fuzzing. In 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS). IEEE, 616–627.
Bingchang Liu, Liang Shi, Zhuhua Cai, and Min Li. 2012. Software vulnerability discovery techniques: A survey. In 2012 fourth international conference on multimedia information networking and security. IEEE, 152–156.
Jiakun Liu, Qiao Huang, Xin Xia, Emad Shihab, David Lo, and Shanping Li. 2020. Is using deep learning frameworks free? characterizing technical debt in deep learning frameworks. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Society. 1–10.
Jiawei Liu, Yuheng Huang, Zhijie Wang, Lei Ma, Chunrong Fang, Mingzheng Gu, Xufan Zhang, and Zhenyu Chen. 2023. Generation-based Differential Fuzzing for Deep Learning Libraries. ACM Transactions on Software Engineering and Methodology 33, 2(2023), 1–28.
Jiawei Liu, Jinkun Lin, Fabian Ruffy, Cheng Tan, Jinyang Li, Aurojit Panda, and Lingming Zhang. 2023. Nnsmith: Generating diverse and valid test cases for deep learning compilers. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 530–543.
Jiawei Liu, Jinjun Peng, Yuyao Wang, and Lingming Zhang. 2023. Neuri: Diversifying dnn generation via inductive rule inference. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 657–669.
Jiawei Liu, Yuxiang Wei, Sen Yang, Yinlin Deng, and Lingming Zhang. 2022. Coverage-guided tensor compiler fuzzing with joint ir-pass mutation. Proceedings of the ACM on Programming Languages 6, OOPSLA1(2022), 1–26.
Ling Liu, Yanzhao Wu, Wenqi Wei, Wenqi Cao, Semih Sahin, and Qi Zhang. 2018. Benchmarking deep learning frameworks: Design considerations, metrics and beyond. In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS). IEEE, 1258–1269.
Xuanzhe Liu, Diandian Gu, Zhenpeng Chen, Jinfeng Wen, Zili Zhang, Yun Ma, Haoyu Wang, and Xin Jin. 2023. Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering Perspective. ACM Transactions on Software Engineering and Methodology 32, 6(2023), 1–26.
Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al. [n. d.]. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. In Forty-first International Conference on Machine Learning.
Zhihao Liu, Yang Zheng, Xiaoting Du, Zheng Hu, Wenjie Ding, Yanming Miao, and Zheng Zheng. 2022. Taxonomy of Aging-related Bugs in Deep Learning Libraries. In 2022 IEEE 33rd International Symposium on Software Reliability Engineering (ISSRE). IEEE, 423–434.
Nikolaos Louloudakis, Perry Gibson, José Cano, and Ajitha Rajan. 2023. DeltaNN: Assessing the impact of computational environment parameters on the performance of image recognition models. In 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 414–424.
Haoyang Ma, Qingchao Shen, Yongqiang Tian, Junjie Chen, and Shing-Chi Cheung. 2023. Fuzzing Deep Learning Compilers with HirGen. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 248–260.
Xiangyue Ma, Xiaoting Du, Qing Cai, Yang Zheng, Jing Hu, and Zheng Zheng. 2023. A Survey on Testing of Deep Learning Frameworks (in Chinese). Journal of Software (2023).
Valentin JM Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J Schwartz, and Maverick Woo. 2019. The art, science, and engineering of fuzzing: A survey. IEEE Transactions on Software Engineering 47, 11 (2019), 2312–2331.
Silverio Martínez-Fernández, Justus Bogner, Xavier Franch, Marc Oriol, Julien Siebert, Adam Trendowicz, Anna Maria Vollmer, and Stefan Wagner. 2022. Software engineering for AI-based systems: a survey. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 2(2022), 1–59.
William M McKeeman. 1998. Differential testing for software. Digital Technical Journal 10, 1 (1998), 100–107.
Microsoft. 2023. ONNX Github repository. https://github.com/onnx/onnx
Yanzhou Mu, Juan Zhai, Chunrong Fang, Xiang Chen, Zhixiang Cao, Peiran Yang, Yinglong Zou, Tao Zheng, and Zhenyu Chen. 2024. DevMuT: Testing Deep Learning Framework via Developer Expertise-Based Mutation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (Sacramento, CA, USA) (ASE ’24). Association for Computing Machinery, New York, NY, USA, 1533–1544.
Zhumakhan Nazir, Vladislav Yarovenko, and Jurn-Gyu Park. 2023. Interpretable ML enhanced CNN Performance Analysis of cuBLAS, cuDNN and TensorRT. In Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing. 1260–1265.
Mahdi Nejadgholi and Jinqiu Yang. 2019. A study of oracle approximations in testing deep learning libraries. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 785–796.
Yuanping Nie, Xiong Xiao, Bing Yang, Hanqing Li, Long Luo, Hongfang Yu, and Gang Sun. 2024. Python Coverage Guided Fuzzing for Deep Learning Framework. In 2024 International Conference on Electronic Engineering and Information Systems (EEISS). IEEE, 1–6.
Adrian Nistor, Po-Chun Chang, Cosmin Radoi, and Shan Lu. 2015. Caramel: Detecting and fixing performance problems that have non-intrusive fixes. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol.  1. IEEE, 902–912.
Peter Oehlert. 2005. Violating assumptions with fuzzing. IEEE Security & Privacy 3, 2 (2005), 58–62.
Işıl Öz. 2024. Quantitative Performance Analysis of BLAS Libraries on GPU Architectures. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen ve Mühendislik Dergisi 26, 76 (2024), 40–48.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019).
David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350(2021).
Hung Viet Pham, Thibaud Lutellier, Weizhen Qi, and Lin Tan. 2019. CRADLE: cross-backend validation to detect and localize bugs in deep learning libraries. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 1027–1038.
Alexander Prochnow and Jinqiu Yang. 2022. DiffWatch: watch out for the evolving differential testing in deep learning libraries. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 46–50.
Lili Quan, Qianyu Guo, Xiaofei Xie, Sen Chen, Xiaohong Li, and Yang Liu. 2022. Towards understanding the faults of javascript-based deep learning systems. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–13.
Luyao Ren, ZiHeng Wang, Yingfei Xiong, Li Zhang, Guoyue Jiang, and Tao Xie. 2023. Effective Random Test Generation for Deep Learning Compilers. arXiv preprint arXiv:2302.00842(2023).
Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, Roman Levenstein, et al. 2018. Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907(2018).
Richard Schumi and Jun Sun. 2022. ExAIS: executable AI semantics. In Proceedings of the 44th International Conference on Software Engineering. 859–870.
Sergio Segura, Gordon Fraser, Ana B Sanchez, and Antonio Ruiz-Cortés. 2016. A survey on metamorphic testing. IEEE Transactions on software engineering 42, 9 (2016), 805–824.
Qingchao Shen, Haoyang Ma, Junjie Chen, Yongqiang Tian, Shing-Chi Cheung, and Xiang Chen. 2021. A comprehensive study of deep learning compiler bugs. In Proceedings of the 29th ACM Joint meeting on european software engineering conference and symposium on the foundations of software engineering. 968–980.
Qingchao Shen, Yongqiang Tian, Haoyang Ma, Junjie Chen, Lili Huang, Ruifeng Fu, Shing-Chi Cheung, and Zan Wang. 2024. A Tale of Two DL Cities: When Library Tests Meet Compiler. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 305–316.
Xiangzhong Shen, Jieyi Zhang, Xiaonan Wang, Hongfang Yu, and Gang Sun. 2021. Deep learning framework fuzzing based on model mutation. In 2021 IEEE Sixth International Conference on Data Science in Cyberspace (DSC). IEEE, 375–380.
Jingyi Shi, Yang Xiao, Yuekang Li, Yeting Li, Dongsong Yu, Chendong Yu, Hui Su, Yufeng Chen, and Wei Huo. 2023. Acetest: Automated constraint extraction for testing deep learning operators. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 690–702.
ABC7.com staff. [n. d.]. Uber gives up testing of self-driving cars in California in wake of fatal Arizona crash. https://abc7.com/self-driving-uber-crash-video-pedestrian-hit-by-car-autonomous-vehicles/3269690/
Y Sun. 2020. Tesla and PyTorch: PyTorch Developer Conference Highlights. https://medium.com/data-science-bootcamp/tesla-and-pytorch-pytorch-developer-conference-highlights-part-3ed36f2c9d5e
Yifan Sun, Saoni Mukherjee, Trinayan Baruah, Shi Dong, Julian Gutierrez, Prannoy Mohan, and David Kaeli. 2018. Evaluating performance tradeoffs on the radeon open compute platform. In 2018 IEEE international symposium on performance analysis of systems and software (ISPASS). IEEE, 209–218.
Florian Tambon, Amin Nikanjam, Le An, Foutse Khomh, and Giuliano Antoniol. 2024. Silent bugs in deep learning frameworks: an empirical study of keras and tensorflow. Empirical Software Engineering 29, 1 (2024), 10.
Qiuming Tao, Wei Wu, Chen Zhao, and Wuwei Shen. 2010. An automatic testing approach for compiler based on metamorphic testing technique. In 2010 Asia Pacific Software Engineering Conference. IEEE, 270–279.
TensorFlow. 2020. Learn how TensorFlow solves real, everyday machine learning problems. https://www.tensorflow.org/about/case-studies
Takumi Uezono, Yi He, and Yanjing Li. 2022. Achieving automotive safety requirements through functional in-field self-test for deep learning accelerators. In 2022 IEEE International Test Conference (ITC). IEEE, 465–473.
Tatiana Castro Vélez, Raffi Khatchadourian, Mehdi Bagherzadeh, and Anita Raja. 2022. Challenges in migrating imperative deep learning programs to graph execution: an empirical study. In Proceedings of the 19th international conference on mining software repositories. 469–481.
Gaurav Verma, Swetang Finviya, Abid M Malik, Murali Emani, and Barbara Chapman. 2022. Towards neural architecture-aware exploration of compiler optimizations in a deep learning {graph} compiler. In Proceedings of the 19th ACM International Conference on Computing Frontiers. 244–250.
Gaurav Verma, Yashi Gupta, Abid M Malik, and Barbara Chapman. 2021. Performance evaluation of deep learning compilers for edge inference. In 2021 IEEE international parallel and distributed processing symposium workshops (IPDPSW). IEEE, 858–865.
Chaojin Wang, Jian Shen, Chunrong Fang, Xiangsheng Guan, Kaitao Wu, and Jiang Wang. 2020. Accuracy measurement of deep neural network accelerator via metamorphic testing. In 2020 IEEE International Conference On Artificial Intelligence Testing (AITest). IEEE, 55–61.
Jiannan Wang, Thibaud Lutellier, Shangshu Qian, Hung Viet Pham, and Lin Tan. 2022. EAGLE: creating equivalent graphs to test deep learning libraries. In Proceedings of the 44th International Conference on Software Engineering. 798–810.
Jun Wang, Guanping Xiao, Shuai Zhang, Huashan Lei, Yepang Liu, and Yulei Sui. 2023. Compatibility issues in deep learning systems: Problems and opportunities. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 476–488.
Jiyuan Wang, Qian Zhang, Guoqing Harry Xu, and Miryung Kim. 2021. Qdiff: Differential testing of quantum software stacks. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 692–704.
Zihan Wang, Pengbo Nie, Xinyuan Miao, Yuting Chen, Chengcheng Wan, Lei Bu, and Jianjun Zhao. 2023. GenCoG: A DSL-Based Approach to Generating Computation Graphs for TVM Testing. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 904–916.
Zan Wang, Ming Yan, Junjie Chen, Shuang Liu, and Dongdi Zhang. 2020. Deep learning library testing via effective model generation. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 788–799.
Mohammad Wardat, Wei Le, and Hridesh Rajan. 2021. DeepLocalize: fault localization for deep neural networks. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 251–262.
Anjiang Wei, Yinlin Deng, Chenyuan Yang, and Lingming Zhang. 2022. Free lunch for testing: Fuzzing deep-learning libraries from open source. In Proceedings of the 44th International Conference on Software Engineering. 995–1007.
Moshi Wei, Nima Shiri Harzevili, YueKai Huang, Jinqiu Yang, Junjie Wang, and Song Wang. 2024. Demystifying and Detecting Misuses of Deep Learning APIs. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–12.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
Jiawei Wu, Senyi Li, Junqiang Li, Long Luo, Hongfang Yu, and Gang Sun. 2022. DeepCov: Coverage Guided Deep Learning Framework Fuzzing. In 2022 7th IEEE International Conference on Data Science in Cyberspace (DSC). IEEE, 399–404.
Mingyuan Wu, Minghai Lu, Heming Cui, Junjie Chen, Yuqun Zhang, and Lingming Zhang. 2023. Jitfuzz: Coverage-guided fuzzing for jvm just-in-time compilers. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 56–68.
Dongwei Xiao, Zhibo Liu, Yuanyuan Yuan, Qi Pang, and Shuai Wang. 2022. Metamorphic testing of deep learning compilers. Proceedings of the ACM on Measurement and Analysis of Computing Systems 6, 1 (2022), 1–28.
Danning Xie, Yitong Li, Mijung Kim, Hung Viet Pham, Lin Tan, Xiangyu Zhang, and Michael W Godfrey. 2022. DocTer: documentation-guided fuzzing for testing deep learning API functions. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 176–188.
Danning Xie, Jiannan Wang, Hung Viet Pham, Lin Tan, Yu Guo, Adnan Aziz, and Erik Meijer. 2024. CEDAR: Continuous Testing of Deep Learning Libraries. In International Conference on Software Analysis, Evolution, and Reengineering. IEEE.
Chenyuan Yang, Yinlin Deng, Jiayi Yao, Yuxing Tu, Hanchi Li, and Lingming Zhang. 2023. Fuzzing automatic differentiation in deep-learning libraries. arXiv preprint arXiv:2302.04351 (2023).
Yilin Yang, Tianxing He, Zhilong Xia, and Yang Feng. 2022. A comprehensive empirical study on bug characteristics of deep learning frameworks. Information and Software Technology 151 (2022), 107004.
Jie M Zhang, Mark Harman, Lei Ma, and Yang Liu. 2020. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering 48, 1 (2020), 1–36.
Ru Zhang, Wencong Xiao, Hongyu Zhang, Yu Liu, Haoxiang Lin, and Mao Yang. 2020. An empirical study on program failures of deep learning jobs. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 1159–1170.
Tianyi Zhang, Cuiyun Gao, Lei Ma, Michael Lyu, and Miryung Kim. 2019. An empirical study of common challenges in developing deep learning applications. In 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 104–115.
Xufan Zhang, Jiawei Liu, Ning Sun, Chunrong Fang, Jia Liu, Jiang Wang, Dong Chai, and Zhenyu Chen. 2021. Duo: Differential fuzzing for deep learning operators. IEEE Transactions on Reliability 70, 4 (2021), 1671–1685.
Xiaoyu Zhang, Chao Shen, Chenhao Lin, Qian Li, Qian Wang, Qi Li, and Xiaohong Guan. 2022. The Testing and Repairing Methods for Machine Learning Model Security. Acta Electronica Sinica 50, 12 (2022), 2884.
Xufan Zhang, Ning Sun, Chunrong Fang, Jiawei Liu, Jia Liu, Dong Chai, Jiang Wang, and Zhenyu Chen. 2021. Predoo: precision testing of deep learning operators. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. 400–412.
Xiaoyu Zhang, Juan Zhai, Shiqing Ma, and Chao Shen. 2021. Autotrainer: An automatic dnn training problem detection and repair system. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 359–371.
Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Shiwei Wang, and Chao Shen. 2024. CITADEL: Context Similarity Based Deep Learning Framework Bug Finding. arXiv preprint arXiv:2406.12196 (2024).
Yuhao Zhang, Yifan Chen, Shing-Chi Cheung, Yingfei Xiong, and Lu Zhang. 2018. An empirical study on TensorFlow program bugs. In Proceedings of the 27th ACM SIGSOFT international symposium on software testing and analysis. 129–140.
Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D Lee, Wotao Yin, Mingyi Hong, et al. [n. d.]. Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark. In Forty-first International Conference on Machine Learning.
Zhiyi Zhang, Pu Wang, Hongjing Guo, Ziyuan Wang, Yuqian Zhou, and Zhiqiu Huang. 2021. Deepbackground: Metamorphic testing for deep-learning-driven image recognition systems accompanied by background-relevance. Information and Software Technology 140 (2021), 106701.
Hao Zhong. 2022. Enriching compiler testing with real program from bug report. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–12.
Chijin Zhou, Bingzhou Qian, Gwihwan Go, Quan Zhang, Shanshan Li, and Yu Jiang. 2024. PolyJuice: Detecting Mis-compilation Bugs in Tensor Compilers with Equality Saturation Based Rewriting. Proc. ACM Program. Lang. 8, OOPSLA2, Article 317 (Oct. 2024), 27 pages.
Ruofan Zhu, Ganhao Chen, Wenbo Shen, Xiaofei Xie, and Rui Chang. 2025. My Model is Malware to You: Transforming AI Models into Malware by Abusing TensorFlow APIs. In Proceedings of the 2025 IEEE Symposium on Security and Privacy (S&P). IEEE.
Yinglong Zou, Haofeng Sun, Chunrong Fang, Jiawei Liu, and Zhenping Zhang. 2023. Deep learning framework testing via hierarchical and heuristic model generation. Journal of Systems and Software 201 (2023), 111681.





Published In

ACM Computing Surveys Just Accepted


Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 05 February 2025
Accepted: 03 February 2025
Revised: 14 January 2025
Received: 26 February 2024

Author Tags

  1. Deep Learning Testing
  2. Deep Learning Library Testing
  3. Deep Learning
  4. Software Testing


  • Survey

