DOI: 10.1145/3533767.3534220
Research article · Open access

DocTer: documentation-guided fuzzing for testing deep learning API functions

Published: 18 July 2022

Abstract

Input constraints are useful for many software development tasks. For example, input constraints of a function enable the generation of valid inputs, i.e., inputs that follow these constraints, to test the function more deeply. API functions of deep learning (DL) libraries have DL-specific input constraints, which are described informally in free-form API documentation. Existing constraint-extraction techniques are ineffective at extracting DL-specific input constraints.
To fill this gap, we design and implement a new technique—DocTer—to analyze API documentation to extract DL-specific input constraints for DL API functions. DocTer features a novel algorithm that automatically constructs rules to extract API parameter constraints from syntactic patterns in the form of dependency parse trees of API descriptions. These rules are then applied to a large volume of API documents in popular DL libraries to extract their input parameter constraints. To demonstrate the effectiveness of the extracted constraints, DocTer uses the constraints to enable the automatic generation of valid and invalid inputs to test DL API functions.
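As a concrete illustration of the pipeline's first half, below is a minimal, hypothetical sketch of pattern-based constraint extraction. It is not DocTer's implementation: DocTer constructs its extraction rules automatically from dependency parse trees of annotated API descriptions, whereas this sketch hard-codes a single pattern, uses spaCy (with the en_core_web_sm model) as a stand-in parser, and assumes an illustrative DTYPES vocabulary and function name.

    # Hypothetical sketch (assumptions flagged above): fire one hand-written
    # pattern over a dependency parse to extract a dtype constraint from a
    # parameter description. DocTer mines such patterns automatically.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
    DTYPES = {"float16", "float32", "float64", "int32", "int64", "uint8", "bool"}

    def extract_dtype_constraint(description):
        """Return the set of dtypes the description allows, if the pattern fires."""
        doc = nlp(description)
        for tok in doc:
            if tok.lemma_ == "type":  # trigger word of this one pattern
                hits = {t.text for t in tok.subtree if t.text in DTYPES}
                if hits:
                    return hits
        # fall back to a flat scan if the parser attaches the list elsewhere
        return {t.text for t in doc if t.text in DTYPES}

    print(extract_dtype_constraint(
        "A Tensor. Must be one of the following types: float32, int64."))
    # e.g. {'float32', 'int64'}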
Our evaluation on three popular DL libraries (TensorFlow, PyTorch, and MXNet) shows that DocTer’s precision in extracting input constraints is 85.4%. DocTer detects 94 bugs from 174 API functions, including one previously unknown security vulnerability that is now documented in the CVE database, while a baseline technique without input constraints detects only 59 bugs. Most (63) of the 94 bugs are previously unknown, 54 of which have been fixed or confirmed by developers after we report them. In addition, DocTer detects 43 inconsistencies in documents, 39 of which are fixed or confirmed.
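To ground the second half of the pipeline, here is a similarly hedged sketch of constraint-guided input generation. The constraint dictionary format below is an assumed simplification, not DocTer's actual representation: a conforming input exercises the function body beyond its input-validation code, while an input that deliberately violates exactly one constraint should be rejected gracefully, so a crash signals a bug.

    # Hypothetical sketch: generate valid/invalid inputs from an extracted
    # constraint. The {"dtype", "ndim"} format is an assumed simplification.
    import numpy as np

    def gen_valid(constraint, rng):
        """Sample an input satisfying the extracted dtype and ndim constraints."""
        dtype = rng.choice(sorted(constraint["dtype"]))
        shape = tuple(rng.integers(1, 4, size=constraint["ndim"]))
        return (rng.standard_normal(shape) * 10).astype(dtype)

    def gen_invalid(constraint, rng):
        """Violate exactly one constraint: pick a dtype outside the allowed set."""
        shape = tuple(rng.integers(1, 4, size=constraint["ndim"]))
        return rng.standard_normal(shape).astype("complex64")

    rng = np.random.default_rng(0)
    c = {"dtype": {"float32", "int64"}, "ndim": 2}
    x_ok, x_bad = gen_valid(c, rng), gen_invalid(c, rng)
    print(x_ok.dtype, x_ok.shape)    # conforming input, for deep testing
    print(x_bad.dtype, x_bad.shape)  # dtype outside the allowed set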




Published In

ISSTA 2022: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2022
808 pages
ISBN: 9781450393799
DOI: 10.1145/3533767
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. deep learning
  2. test generation
  3. testing
  4. text analytics


Conference

ISSTA '22

Acceptance Rates

Overall acceptance rate: 58 of 213 submissions (27%)



Article Metrics

  • Downloads (last 12 months): 685
  • Downloads (last 6 weeks): 79
Reflects downloads up to 08 Feb 2025


Cited By

  • (2025) Deep Learning Library Testing: Definition, Methods and Challenges. ACM Computing Surveys. https://doi.org/10.1145/3716497 (5 Feb 2025)
  • (2025) D3: Differential Testing of Distributed Deep Learning With Model Generation. IEEE Transactions on Software Engineering 51(1), 38-52. https://doi.org/10.1109/TSE.2024.3461657 (1 Jan 2025)
  • (2024) WhiteFox: White-Box Compiler Fuzzing Empowered by Large Language Models. Proceedings of the ACM on Programming Languages 8(OOPSLA2), 709-735. https://doi.org/10.1145/3689736 (8 Oct 2024)
  • (2024) History-Driven Fuzzing for Deep Learning Libraries. ACM Transactions on Software Engineering and Methodology 34(1), 1-29. https://doi.org/10.1145/3688838 (28 Dec 2024)
  • (2024) A PSO-based Method to Test Deep Learning Library at API Level. Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering, 117-130. https://doi.org/10.1145/3672758.3672777 (26 Jan 2024)
  • (2024) A Miss Is as Good as A Mile: Metamorphic Testing for Deep Learning Operators. Proceedings of the ACM on Software Engineering 1(FSE), 2005-2027. https://doi.org/10.1145/3660796 (12 Jul 2024)
  • (2024) Contract-based Validation of Conceptual Design Bugs for Engineering Complex Machine Learning Software. Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, 155-161. https://doi.org/10.1145/3652620.3688201 (22 Sep 2024)
  • (2024) Practitioners' Expectations on Automated Test Generation. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1618-1630. https://doi.org/10.1145/3650212.3680386 (11 Sep 2024)
  • (2024) Large Language Models Can Connect the Dots: Exploring Model Optimization Bugs with Domain Knowledge-Aware Prompts. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1579-1591. https://doi.org/10.1145/3650212.3680383 (11 Sep 2024)
  • (2024) Towards More Complete Constraints for Deep Learning Library Testing via Complementary Set Guided Refinement. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1338-1350. https://doi.org/10.1145/3650212.3680364 (11 Sep 2024)
