
QuoTe: Quality-oriented Testing for Deep Learning Systems

Published: 22 July 2023

Abstract

Recently, there has been significant growth of interest in applying software engineering techniques to the quality assurance of deep learning (DL) systems. One popular direction is DL testing—that is, given a property to test, defects of DL systems are found either by fuzzing or by guided search with the help of certain testing metrics. However, recent studies have revealed that the neuron coverage metrics commonly used by most existing DL testing approaches are not necessarily correlated with model quality (e.g., robustness, the most studied model property), nor are they an effective measure of confidence in model quality after testing. In this work, we address this gap by proposing a novel testing framework called QuoTe (i.e., Quality-oriented Testing). A key part of QuoTe is a quantitative measurement of (1) the value of each test case in enhancing the model property of interest (often via retraining) and (2) the convergence quality of the model property improvement. QuoTe utilizes the proposed metric to automatically select or generate valuable test cases for improving model quality. The proposed metric is also a lightweight yet strong indicator of how well the improvement has converged. Extensive experiments on both image and tabular datasets with a variety of model architectures confirm the effectiveness and efficiency of QuoTe in improving DL model quality—that is, robustness and fairness. As a generic quality-oriented testing framework, QuoTe can also be adapted to other domains (e.g., text) and other model properties.
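
To make the selection step concrete, the sketch below shows one plausible way such a metric could drive test selection: each candidate test case is scored by the gradient norm of the loss with respect to the input (in the spirit of the FOL metric referenced in Appendix A), and the highest-scoring cases are kept for retraining. This is a minimal illustration under stated assumptions (PyTorch, cross-entropy classification); the function names and the exact scoring rule are placeholders, not the paper's precise formulation.

    # Hedged sketch: quality-oriented test selection via a gradient-based score.
    # Assumption: the score (input-gradient norm, echoing the FOL metric in the
    # appendix figures) and all names below are illustrative, not the exact method.
    import torch
    import torch.nn.functional as F

    def fol_score(model, x, y):
        # Score one test case by the L2 norm of the loss gradient w.r.t. the input.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        (grad,) = torch.autograd.grad(loss, x)
        return grad.norm().item()

    def select_valuable_tests(model, candidates, labels, k):
        # Keep the k highest-scoring candidates for retraining.
        scores = [fol_score(model, x, y) for x, y in zip(candidates, labels)]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [candidates[i] for i in ranked], [labels[i] for i in ranked]

    # Usage (hypothetical): selected, sel_y = select_valuable_tests(model, tests, test_labels, k=500)
    # The selected cases would then be added to the training data and the model retrained.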
Appendix

A Additional Figures

Figure A.1. The FOL distribution of FGSM and PGD attacks for different models.
Figure A.2. The FOL distribution of the AEQUITAS and ADF testing algorithms for different models.
Figure A.3. The trend of robustness improvement (ATTACK) over iterations on the MNIST and CIFAR-10 datasets (with 95% confidence intervals).



Published In

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 5
September 2023
905 pages
ISSN:1049-331X
EISSN:1557-7392
DOI:10.1145/3610417
Editor: Mauro Pezzè

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 July 2023
Online AM: 10 February 2023
Accepted: 16 December 2022
Revised: 04 December 2022
Received: 02 March 2022
Published in TOSEM Volume 32, Issue 5


Author Tags

  1. Deep learning
  2. testing
  3. robustness
  4. fairness

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China
  • Key R&D Program of Zhejiang
  • NSFC Program
  • Fundamental Research Funds for Central Universities
