research-article

DeepGini: prioritizing massive tests to enhance the robustness of deep neural networks

Authors:

Zhenyu Chen

ISSTA 2020: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pages 177 - 188

https://doi.org/10.1145/3395363.3397357

Published: 18 July 2020

Abstract

Deep neural networks (DNNs) have been deployed in many software systems to assist in various classification tasks. Despite their effectiveness in classification, DNNs can also exhibit incorrect behaviors that result in accidents and losses. Testing techniques that detect incorrect DNN behaviors and improve DNN quality are therefore critical. However, the test oracle, which defines the correct output for a given input, is often unavailable in automated testing. To obtain oracle information, testing DNN-based systems usually requires expensive human effort to label the test data, which significantly slows down quality assurance.

To mitigate this problem, we propose DeepGini, a test prioritization technique based on a statistical perspective of DNNs. This perspective allows us to reduce the problem of measuring misclassification probability to the problem of measuring set impurity, which in turn lets us quickly identify tests that are likely to be misclassified. To evaluate DeepGini, we conduct an extensive empirical study on popular datasets and prevalent DNN models. The experimental results demonstrate that DeepGini outperforms existing coverage-based techniques in prioritizing tests, in both effectiveness and efficiency. We also observe that the tests DeepGini ranks first are more effective at improving DNN quality than those selected by coverage-based techniques.
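
The reduction from misclassification probability to set impurity can be illustrated with a small sketch. It assumes, as the name DeepGini suggests, that impurity is measured as the Gini impurity 1 - sum(p_i^2) of a test's predicted class probabilities; the abstract does not spell out the formula. The helper names `gini_score` and `prioritize` and the sample outputs are hypothetical.

```python
from typing import List, Sequence


def gini_score(probs: Sequence[float]) -> float:
    """Gini impurity of one predicted class distribution: 1 - sum(p_i^2).

    Assumption: probs is a softmax-style output (non-negative, sums to 1).
    """
    return 1.0 - sum(p * p for p in probs)


def prioritize(outputs: Sequence[Sequence[float]]) -> List[int]:
    """Return test indices ordered from highest to lowest impurity."""
    scores = [gini_score(p) for p in outputs]
    return sorted(range(len(outputs)), key=lambda i: scores[i], reverse=True)


# Hypothetical model outputs for three unlabeled tests on a 3-class task.
outputs = [
    [0.98, 0.01, 0.01],  # confident prediction, low impurity
    [0.34, 0.33, 0.33],  # near-uniform prediction, high impurity
    [0.60, 0.30, 0.10],  # in between
]

order = prioritize(outputs)
print(order)  # indices of the tests to label first
```

Under this assumption, scoring needs only the model's output vector for each test, so no labels and no coverage instrumentation are required to produce the ranking.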

Cited By

Lin C, Zhang X, Shen C (2024). DeepLogic: Priority Testing of Deep Learning Through Interpretable Logic Units. Chinese Journal of Electronics 33:4 (948-964). Online publication date: Jul-2024. https://doi.org/10.23919/cje.2022.00.451

Jiang Z, Li H, Tian X, Wang R (2024). Semantic feature-based test selection for deep neural networks: A frequency domain perspective. Computer Science and Information Systems 21:4 (1499-1522). Online publication date: 2024. https://doi.org/10.2298/CSIS230907045J

Shen J, Li Z, Pan M, Li X, Filkov V, Ray B, Zhou M (2024). Prioritizing Test Inputs for DNNs Using Training Dynamics. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (1219-1231). Online publication date: 27-Oct-2024. https://dl.acm.org/doi/10.1145/3691620.3695498

Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        1. Software testing and debugging

Published In

ISSTA 2020: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis

July 2020

591 pages

ISBN:9781450380089

DOI:10.1145/3395363

General Chair: Sarfraz Khurshid, University of Texas at Austin, USA

Program Chair: Corina S. Păsăreanu, Carnegie Mellon University Silicon Valley / NASA Ames Research Center, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISSTA '20

Sponsor:

SIGSOFT

ISSTA '20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis

July 18 - 22, 2020

Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%
