QATest: A Uniform Fuzzing Framework for Question Answering Systems

Authors:

Baowen Xu

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering

Article No.: 81, Pages 1 - 12

https://doi.org/10.1145/3551349.3556929

Published: 05 January 2023

Abstract

The tremendous advancements in deep learning techniques have empowered question answering (QA) systems with the capability of dealing with various tasks. Many commercial QA systems, such as Siri, Google Home, and Alexa, have been deployed to assist people in different daily activities. However, modern QA systems are often designed to deal with different topics and task formats, which makes both test collection and test labeling difficult and thus threatens their quality.

To alleviate this challenge, we design and implement QATest, a fuzzing framework for QA systems based on metamorphic testing theory. It provides the first uniform solution for automatically generating tests with oracle information for various QA systems, such as machine reading comprehension, open-domain QA, and QA on knowledge bases. To further improve testing efficiency and generate more tests that detect erroneous behaviors, we design two testing criteria, N-Gram coverage and perplexity priority, based on the features of the question data to guide the generation process. To evaluate the performance of QATest, we experiment with it on four QA systems that are designed for different tasks. The experimental results show that the tests generated by QATest efficiently detect hundreds of erroneous behaviors of QA systems. The results also confirm that the testing criteria can improve test diversity and fuzzing efficiency.
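To make the idea concrete, the following is a minimal Python sketch of coverage-guided metamorphic fuzzing for a QA system. It is an illustration under stated assumptions, not the QATest implementation: the function names (`ngrams`, `new_ngram_ratio`, `violates_relation`, `fuzz_step`), the whitespace tokenizer, the "same answer for a meaning-preserving mutant" relation, and the coverage threshold are all hypothetical choices made for this example. Perplexity-based prioritization is omitted because it would require a language model.

```python
from typing import Callable, Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(question: str, n: int = 2) -> Set[NGram]:
    """Return the set of word-level n-grams (lowercased, whitespace-tokenized)."""
    tokens = question.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def new_ngram_ratio(candidate: str, covered: Set[NGram], n: int = 2) -> float:
    """Fraction of the candidate's n-grams that are not yet in the covered set."""
    grams = ngrams(candidate, n)
    if not grams:
        return 0.0
    return len(grams - covered) / len(grams)


def violates_relation(qa_system: Callable[[str], str], seed: str, mutant: str) -> bool:
    """Assumed metamorphic relation: a meaning-preserving mutant gets the seed's answer."""
    return qa_system(seed).strip().lower() != qa_system(mutant).strip().lower()


def fuzz_step(
    qa_system: Callable[[str], str],
    seed: str,
    mutants: Iterable[str],
    covered: Set[NGram],
    n: int = 2,
    threshold: float = 0.0,
) -> List[str]:
    """Keep mutants that add new n-grams and return those that violate the relation."""
    failures = []
    for mutant in mutants:
        if new_ngram_ratio(mutant, covered, n) > threshold:
            covered |= ngrams(mutant, n)  # update coverage in place
            if violates_relation(qa_system, seed, mutant):
                failures.append(mutant)
    return failures


if __name__ == "__main__":
    # Toy stand-in for a QA system, used only to exercise the sketch.
    def toy_qa(question: str) -> str:
        return "paris" if "capital" in question.lower() else "unknown"

    seed = "What is the capital of France"
    mutants = [
        "What is the capital of France",
        "Which city is the capital of France",
        "What is the main city of France",
    ]
    print(fuzz_step(toy_qa, seed, mutants, ngrams(seed)))
```

In this sketch, the oracle comes from the relation itself rather than from labeled answers: a mutant is reported only when its answer differs from the answer to its seed, and the coverage filter discards mutants that contribute no unseen n-grams.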

Published In

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering

October 2022

2006 pages

ISBN:9781450394758

DOI:10.1145/3551349

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Distinguished Paper

Qualifiers

Research-article
Research
Refereed limited

Conference

ASE '22

ASE '22: 37th IEEE/ACM International Conference on Automated Software Engineering

October 10 - 14, 2022

Rochester, MI, USA

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 8
Total Downloads: 522
Downloads (Last 12 months): 176
Downloads (Last 6 weeks): 21

Reflects downloads up to 20 Jan 2025

Citations

Cited By

Ji P, Feng Y, Zhang R, Xue R, Zhang Y, Huang W, Liu J, Zhao Z (2024). NLPLego: Assembling Test Generation for Natural Language Processing Applications. ACM Transactions on Software Engineering and Methodology 34:2 (1-36). Online publication date: 5-Oct-2024.
https://dl.acm.org/doi/10.1145/3691631
Xie X, Jin S, Chen S, Cheung S (2024). Word Closure-Based Metamorphic Testing for Machine Translation. ACM Transactions on Software Engineering and Methodology 33:8 (1-46). Online publication date: 22-Nov-2024.
https://dl.acm.org/doi/10.1145/3675396
Di P, Liu B, Gao Y, Roychoudhury A, Paiva A, Abreu R, Storey M, Aniche M, Nagappan N (2024). MicroFuzz: An Efficient Fuzzing Framework for Microservices. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice (216-227). Online publication date: 14-Apr-2024.
https://dl.acm.org/doi/10.1145/3639477.3639723
Chen H, Chen J, Wu Y, Cai S, Ahmad B, Huang R, Wang S, Zhang C (2024). DialTest-EA: An Enhanced Fuzzing Approach With Energy Adjustment for Dialogue Systems via Metamorphic Testing. Software Testing, Verification and Reliability. Online publication date: 10-Oct-2024.
https://doi.org/10.1002/stvr.1897
Meng L, Li Y, Chen L, Ma M, Zhou Y, Xu B (2024). Hybrid mutation driven testing for natural language inference. Journal of Software: Evolution and Process. Online publication date: 17-Jun-2024.
https://doi.org/10.1002/smr.2694
Wang K, Wang Y, Wang J, Wang Q (2023). Fuzzing with Sequence Diversity Inference for Sequential Decision-making Model Testing. 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE) (706-717). Online publication date: 9-Oct-2023.
https://doi.org/10.1109/ISSRE59848.2023.00041
Aleti A (2023). Software Testing of Generative AI Systems: Challenges and Opportunities. 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE) (4-14). Online publication date: 14-May-2023.
https://doi.org/10.1109/ICSE-FoSE59343.2023.00009
Zdebskyi P, Berko A, Vysotska V (2023). Information Technology for Finding Answers to Questions from Open Web Resources. 2023 IEEE 18th International Conference on Computer Science and Information Technologies (CSIT) (1-7). Online publication date: 19-Oct-2023.
https://doi.org/10.1109/CSIT61576.2023.10324087
