https://doi.org/10.1145/3551349.3556929

QATest: A Uniform Fuzzing Framework for Question Answering Systems

Published: 05 January 2023


The tremendous advancements in deep learning techniques have empowered question answering (QA) systems with the capability of dealing with various tasks. Many commercial QA systems, such as Siri, Google Home, and Alexa, have been deployed to assist people in different daily activities. However, modern QA systems are often designed to deal with different topics and task formats, which makes both test collection and labeling difficult and thus threatens their quality.
To alleviate this challenge, in this paper, we design and implement a fuzzing framework for QA systems, namely QATest, based on the metamorphic testing theory. It provides the first uniform solution to generate tests with oracle information automatically for various QA systems, such as machine reading comprehension, open-domain QA, and QA on knowledge bases. To further improve testing efficiency and generate more tests detecting erroneous behaviors, we design N-Gram coverage and perplexity priority based on the features of the question data to guide the generation process. To evaluate the performance of QATest, we experiment with it on four QA systems that are designed for different tasks. The experiment results show that the tests generated by QATest detect hundreds of erroneous behaviors of QA systems efficiently. Also, the results confirm that the testing criteria can improve test diversity and fuzzing efficiency.
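The abstract does not spell out how the metamorphic oracle or the N-Gram coverage criterion is computed, so the following is only a minimal sketch of one plausible reading, not the QATest implementation. It assumes that a mutated question is kept only if it contributes at least one previously unseen word n-gram, and that the metamorphic relation is "a semantics-preserving rewrite of a question should yield the same answer". The names `ngrams`, `coverage_gain`, `violates_relation`, and `fuzz`, and the `mutate` and `qa_system` callables, are hypothetical; perplexity priority is not modeled here.

```python
from typing import Callable, Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(question: str, n: int = 2) -> Set[NGram]:
    """Return the set of word n-grams of a question (lowercased, whitespace-tokenized)."""
    tokens = question.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def coverage_gain(candidate: str, seen: Set[NGram], n: int = 2) -> int:
    """Count the n-grams in the candidate that have not been observed yet."""
    return len(ngrams(candidate, n) - seen)


def violates_relation(answer_seed: str, answer_mutant: str) -> bool:
    """Assumed metamorphic relation: both answers should match after normalization."""
    return answer_seed.strip().lower() != answer_mutant.strip().lower()


def fuzz(
    seeds: List[str],
    mutate: Callable[[str], Iterable[str]],
    qa_system: Callable[[str], str],
    n: int = 2,
) -> List[Tuple[str, str]]:
    """Return (seed, mutant) pairs whose answers disagree.

    Mutants that add no new n-gram coverage are skipped, so the
    system under test is only queried for questions that look novel.
    """
    seen: Set[NGram] = set()
    for seed in seeds:
        seen |= ngrams(seed, n)

    failures: List[Tuple[str, str]] = []
    for seed in seeds:
        expected = qa_system(seed)
        for mutant in mutate(seed):
            if coverage_gain(mutant, seen, n) == 0:
                continue  # no new n-grams: discard the mutant
            seen |= ngrams(mutant, n)
            if violates_relation(expected, qa_system(mutant)):
                failures.append((seed, mutant))
    return failures
```

No labeled answers are needed in this sketch: the seed's own answer serves as the oracle for its mutants, which is the property of metamorphic testing that lets tests be generated together with oracle information.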






Information & Contributors


Published In

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
October 2022
2006 pages


Association for Computing Machinery

New York, NY, United States



  • Distinguished Paper

Author Tags

  1. automated testing
  2. fuzz testing
  3. natural language processing
  4. question answering systems


  • Research-article
  • Research
  • Refereed limited


ASE '22

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%



Bibliometrics & Citations


Article Metrics

  • Downloads (Last 12 months): 176
  • Downloads (Last 6 weeks): 21
Reflects downloads up to 20 Jan 2025



Cited By

  • (2024) NLPLego: Assembling Test Generation for Natural Language Processing Applications. ACM Transactions on Software Engineering and Methodology 34:2 (1-36). https://doi.org/10.1145/3691631. Online publication date: 5-Oct-2024
  • (2024) Word Closure-Based Metamorphic Testing for Machine Translation. ACM Transactions on Software Engineering and Methodology 33:8 (1-46). https://doi.org/10.1145/3675396. Online publication date: 22-Nov-2024
  • (2024) MicroFuzz: An Efficient Fuzzing Framework for Microservices. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice (216-227). https://doi.org/10.1145/3639477.3639723. Online publication date: 14-Apr-2024
  • (2024) DialTest‐EA: An Enhanced Fuzzing Approach With Energy Adjustment for Dialogue Systems via Metamorphic Testing. Software Testing, Verification and Reliability. https://doi.org/10.1002/stvr.1897. Online publication date: 10-Oct-2024
  • (2024) Hybrid mutation driven testing for natural language inference. Journal of Software: Evolution and Process. https://doi.org/10.1002/smr.2694. Online publication date: 17-Jun-2024
  • (2023) Fuzzing with Sequence Diversity Inference for Sequential Decision-making Model Testing. 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE) (706-717). https://doi.org/10.1109/ISSRE59848.2023.00041. Online publication date: 9-Oct-2023
  • (2023) Software Testing of Generative AI Systems: Challenges and Opportunities. 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE) (4-14). https://doi.org/10.1109/ICSE-FoSE59343.2023.00009. Online publication date: 14-May-2023
  • (2023) Information Technology for Finding Answers to Questions from Open Web Resources. 2023 IEEE 18th International Conference on Computer Science and Information Technologies (CSIT) (1-7). https://doi.org/10.1109/CSIT61576.2023.10324087. Online publication date: 19-Oct-2023
