QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension

Published: 02 February 2023

Abstract

Alongside the huge volume of research on deep learning models in NLP in recent years, there has been much work on the benchmark datasets needed to track modeling progress. Question answering and reading comprehension have been particularly prolific in this regard, with more than 80 new datasets appearing in the past two years. This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work. We further discuss the current classifications of “skills” that question answering/reading comprehension systems are supposed to acquire and propose a new taxonomy. The supplementary materials survey the current multilingual resources and the monolingual resources for languages other than English, and we discuss the implications of over-focusing on English. The study is aimed both at practitioners looking for pointers to the wealth of existing data and at researchers working on new resources.

References

[1]
Mostafa Abdou, Cezar Sas, Rahul Aralikatte, Isabelle Augenstein, and Anders Søgaard. 2019. X-WikiRE: A large, multilingual resource for relation extraction as machine comprehension. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo’19). 265–274.
[2]
Abdalghani Abujabal, Rishiraj Saha Roy, Mohamed Yahya, and Gerhard Weikum. 2019. ComQA: A community-sourced dataset for complex factoid question answering with paraphrase clusters. In Proceedings of the 17th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). 307–317. https://aclweb.org/anthology/papers/N/N19/N19-1027/.
[3]
Manoj Acharya, Karan Jariwala, and Christopher Kanan. 2019. VQD: Visual query detection in natural scenes. In Proceedings of the 17th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). 1955–1961.
[4]
Manoj Acharya, Kushal Kafle, and Christopher Kanan. 2019. TallyQA: Answering complex counting questions. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI’19). 8076–8084.
[5]
Douglas Adams. 2009. The Hitchhiker’s Guide to the Galaxay. Ballantine Books, New York, NY. PR6051.D3352 H5 2009
[6]
Vaibhav Adlakha, Shehzaad Dhuliawala, Kaheer Suleman, Harm de Vries, and Siva Reddy. 2022. TopiOCQA: Open-domain conversational question answering with topic switching. Transactions of the Association for Computational Linguistics 10 (April2022), 468–483.
[7]
Arjun Akula, Soravit Changpinyo, Boqing Gong, Piyush Sharma, Song-Chun Zhu, and Radu Soricut. 2021. CrossVQA: Scalably generating benchmarks for systematically testing VQA generalization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 2148–2166.
[8]
Jan Alexandersson, Bianka Buschbeck-Wolf, Tsutomu Fujinami, Michael Kipp, Stephan Koch, Elisabeth Maier, Norbert Reithinger, Birte Schmitz, and Melanie Siegel. 1998. Dialogue Acts in Verbmobil 2. Technical Report. Verbmobil.
[9]
Raviteja Anantha, Svitlana Vakulenko, Zhucheng Tu, Shayne Longpre, Stephen Pulman, and Srinivas Chappidi. 2021. Open-domain question answering goes conversational via question rewriting. In Proceedings of the 19th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’21). 520–534. https://aclanthology.org/2021.naacl-main.44.
[10]
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV’15). 2425–2433. http://ieeexplore.ieee.org/document/7410636/.
[11]
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 4623–4637.
[12]
Akari Asai and Eunsol Choi. 2021. Challenges in information-seeking QA: Unanswerable questions and paragraph retrieval. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP’21). 1492–1504.
[13]
Akari Asai, Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2018. Multilingual extractive reading comprehension by runtime machine translation. arXiv:1809.03275 [CS] (2018). http://arxiv.org/abs/1809.03275.
[14]
Akari Asai, Jungo Kasai, Jonathan H. Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2020. XOR QA: Cross-lingual open-retrieval question answering. arXiv:2010.11856 [CS] (2020). http://arxiv.org/abs/2010.11856.
[15]
Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. arXiv:1704.00057 [CS] (2017). http://arxiv.org/abs/1704.00057.
[16]
Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, and Isabelle Augenstein. 2020. Generating fact checking explanations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 7352–7364.
[17]
Isabelle Augenstein, Christina Lioma, Dongsheng Wang, Lucas Chaves Lima, Casper Hansen, Christian Hansen, and Jakob Grue Simonsen. 2019. MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 4685–4697.
[18]
Kathleen M. Bailey. 2018. Multiple-choice item format. In The TESOL Encyclopedia of English Language Teaching. Wiley, 1–8.
[19]
Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, et al. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv:1611.09268 [CS] (2016). http://arxiv.org/abs/1611.09268.
[20]
Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. 2017. Embracing data abundance: BookTest dataset for reading comprehension. In Proceedings of the 5th International Conference on Learning Representations (ICLR’17). https://openreview.net/pdf?id=H1U4mhVFe.
[21]
Somnath Banerjee, Sudip Kumar Naskar, and Paolo Rosso. 2016. The first cross-script code-mixed question answering corpus. In Proceedings of the Workshop on Modeling, Learning, and Mining for Cross/Multilinguality (MultiLingMine’16) Co-located with the 2016 European Conference on Information Retrieval (ECIR’16). 1–10. 56–65. http://ceur-ws.org/Vol-1589/MultiLingMine6.pdf.
[22]
Paul Bartha. 2019. Analogy and analogical reasoning. In The Stanford Encyclopedia of Philosophy (Spring 2019 ed.), Edward N. Zalta (Ed.). Metaphysics Research Lab, Stanford University, Stanford, CA. https://plato.stanford.edu/archives/spr2019/entries/reasoning-analogy/.
[23]
Emily M. Bender. 2019. The #BenderRule: On naming the languages we study and why it matters. The Gradient. Retrieved September 16, 2022 from https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/.
[24]
Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6 (2018), 587–604.
[25]
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’13). 1533–1544. https://www.aclweb.org/anthology/D13-1160.
[26]
Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, Peter Clark, and Christopher D. Manning. 2014. Modeling biological processes for reading comprehension. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’14). 1499–1510.
[27]
Yevgeni Berzak, Jonathan Malmaud, and Roger Levy. 2020. STARC: Structured annotations for reading comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 5726–5735. https://www.aclweb.org/anthology/2020.acl-main.507.
[28]
Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-Tau Yih, and Yejin Choi. 2019. Abductive commonsense reasoning. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19). https://openreview.net/forum?id=Byg1v1HKDB.
[29]
Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, et al. 2020. Experience grounds language. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 8718–8735.
[30]
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI’20). 7432–7439.
[31]
Johannes Bjerva, Nikita Bhutani, Behzad Golshan, Wang-Chiew Tan, and Isabelle Augenstein. 2020. SubjQA: A dataset for subjectivity and review comprehension. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 5480–5494.
[32]
[33]
[34]
Elisa Bone and Mike Prosser. 2020. Multiple Choice Questions: An Introductory Guide. Retrieved September 16, 2022 from https://melbourne-cshe.unimelb.edu.au/__data/assets/pdf_file/0010/3430648/multiple-choice-questions_final.pdf.
[35]
Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv:1506.02075 [CS] (2015). http://arxiv.org/abs/1506.02075.
[36]
Jordan Boyd-Graber. 2019. What question answering can learn from trivia nerds. arXiv:1910.14464 [CS] (2019). http://arxiv.org/abs/1910.14464.
[37]
Jordan Boyd-Graber, Shi Feng, and Pedro Rodriguez. 2018. Human-computer question answering: The case for quizbowl. In Proceedings of the NIPS’17 Competition: Building Intelligent Systems, Sergio Escalera and Markus Weimer (Eds.). Springer Series on Challenges in Machine Learning. Springer International, Cham, Switzerland, 169–180.
[38]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. Language models are few-shot learners. arXiv:2005.14165 [CS] (2020). http://arxiv.org/abs/2005.14165.
[39]
Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ—A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 5016–5026.
[40]
B. Barla Cambazoglu, Mark Sanderson, Falk Scholer, and Bruce Croft. 2020. A review of public datasets in question answering research. ACM SIGIR Forum 54, 2 (2020), 23. http://www.sigir.org/wp-content/uploads/2020/12/p07.pdf.
[41]
Jon Ander Campos, Arantxa Otegi, Aitor Soroa, Jan Deriu, Mark Cieliebak, and Eneko Agirre. 2020. DoQA-accessing domain-specific FAQs via conversational QA. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 7302–7314. https://aclanthology.org/2020.acl-main.652/.
[42]
Shulin Cao, Jiaxin Shi, Liangming Pan, Lunyiu Nie, Yutong Xiang, Lei Hou, Juanzi Li, Bin He, and Hanwang Zhang. 2022. KQA Pro: A dataset with explicit compositional programs for complex question answering over knowledge base. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 6101–6119.
[43]
Casimiro Pio Carrino, Marta R. Costa-Jussà, and José A. R. Fonollosa. 2019. Automatic Spanish translation of the SQuAD dataset for multilingual question answering. arXiv:1912.05200 [CS] (2019). http://arxiv.org/abs/1912.05200.
[44]
Vittorio Castelli, Rishav Chakravarti, Saswati Dana, Anthony Ferritto, Radu Florian, Martin Franz, Dinesh Garg, et al. 2020. The TechQA dataset. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 1269–1278. https://www.aclweb.org/anthology/2020.acl-main.117.
[45]
Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. Evaluation of text generation: A survey. arXiv:2006.14799 [CS] (2020). http://arxiv.org/abs/2006.14799.
[46]
Khyathi Chandu, Ekaterina Loginova, Vishal Gupta, Josef van Genabith, Günter Neumann, Manoj Chinnakotla, Eric Nyberg, and Alan W. Black. 2018. Code-mixed question answering challenge: Crowd-sourcing data and techniques. In Proceedings of the 3rd Workshop on Computational Approaches to Linguistic Code-Switching. 29–38.
[47]
Graham Chapman, John Cleese, Terry Gilliam, Eric Idle, Terry Jones, Michael Palin, John Goldstone, Spike Milligan, Monty Python (Comedy troupe), Handmade Films, and Criterion Collection (Firm). 1999. Life of Brian.
[48]
Prithvijit Chattopadhyay, Ramakrishna Vedantam, Ramprasaath R. Selvaraju, Dhruv Batra, and Devi Parikh. 2017. Counting everyday objects in everyday scenes. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 1135–1144. https://openaccess.thecvf.com/content_cvpr_2017/html/Chattopadhyay_Counting_Everyday_Objects_CVPR_2017_paper.html.
[49]
Anthony Chen, Pallavi Gudipati, Shayne Longpre, Xiao Ling, and Sameer Singh. 2021. Evaluating entity disambiguation and the role of popularity in retrieval-based NLP. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL’20). 4472–4485.
[50]
Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Evaluating question answering evaluation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering. 119–124.
[51]
Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2020. MOCHA: A dataset for training and evaluating generative reading comprehension metrics. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 6521–6532. https://www.aclweb.org/anthology/2020.emnlp-main.528.
[52]
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17). 1870–1879.
[53]
Danqi Chen and Wen-Tau Yih. 2020. Open-domain question answering. In Proceedings of ACL: Tutorial Abstracts. 34–37.
[54]
Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Yang Wang. 2020. HybridQA: A dataset of multi-hop question answering over tabular and textual data. In Findings of EMNLP’20. 1026–1036.
[55]
Xingyu Chen, Zihan Zhao, Lu Chen, JiaBao Ji, Danyang Zhang, Ao Luo, Yuxuan Xiong, and Kai Yu. 2021. WebSRC: A dataset for web-based structural reading comprehension. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 4173–4185.
[56]
Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, et al. 2021. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 3697–3711.
[57]
Carlos Iván Chesnevar, Ana Gabriela Maguitman, and Ronald Prescott Loui. 2000. Logical models of argument. ACM Computing Surveys 32, 4 (2000), 337–383.
[58]
Minseok Cho, Reinald Kim Amplayo, Seung-Won Hwang, and Jonghyuck Park. 2018. Adversarial TableQA: Attention supervision for question answering on tables. In Proceedings of Machine Learning Research. 391–406. http://proceedings.mlr.press/v95/cho18a/cho18a.pdf.
[59]
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-Tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 2174–2184. http://aclweb.org/anthology/D18-1241.
[60]
Eunsol Choi, Jennimaria Palomaki, Matthew Lamm, Tom Kwiatkowski, Dipanjan Das, and Michael Collins. 2021. Decontextualization: Making sentences stand-alone. Transactions of the Association for Computational Linguistics 9 (April 2021), 447–461.
[61]
Sagnik Ray Choudhury, Anna Rogers, and Isabelle Augenstein. 2022. Machine reading, fast and slow: When do models “Understand” language? In Proceedings of the 29th International Conference on Computational Linguistics. International Committee on Computational Linguistics, 78–93. https://aclanthology.org/2022.coling-1.8.
[62]
Philipp Christmann, Rishiraj Saha Roy, Abdalghani Abujabal, Jyotsna Singh, and Gerhard Weikum. 2019. Look before you hop: Conversational question answering over knowledge graphs using judicious context expansion. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. ACM, New York, NY, 729–738.
[63]
Manuel Ciosici, Joe Cecil, Dong-Ho Lee, Alex Hedges, Marjorie Freedman, and Ralph Weischedel. 2021. Perhaps PTLMs should go to school—A task to assess open book and closed book QA. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 6104–6111.
[64]
Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). 845–855.
[65]
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 17th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). 2924–2936. https://aclweb.org/anthology/papers/N/N19/N19-1300/.
[66]
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics 8 (July 2020), 454–470.
[67]
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv:1803.05457 [CS] (2018). http://arxiv.org/abs/1803.05457.
[68]
Anthony Colas, Seokhwan Kim, Franck Dernoncourt, Siddhesh Gupte, Zhe Wang, and Doo Soon Kim. 2020. TutorialVQA: Question answering dataset for tutorial videos. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’20). 5450–5455. https://www.aclweb.org/anthology/2020.lrec-1.670.
[69]
Tarcísio Souza Costa, Simon Gottschalk, and Elena Demidova. 2020. Event-QA: A dataset for event-centric question answering over knowledge graphs. arXiv:2004.11861 [CS] (2020). http://arxiv.org/abs/2004.11861.
[70]
Danilo Croce, Alexandra Zelenanska, and Roberto Basili. 2019. Enabling deep learning for large scale question answering in Italian. Intelligenza Artificiale 13, 1 (Jan. 2019), 49–61.
[71]
Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. 2019. A span-extraction dataset for Chinese machine reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 5883–5889.
[72]
Yiming Cui, Ting Liu, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. 2018. Dataset for the first evaluation on chinese machine reading comprehension. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’18). https://www.aclweb.org/anthology/L18-1431.
[73]
Yiming Cui, Ting Liu, Zhipeng Chen, Shijin Wang, and Guoping Hu. 2016. Consensus attention-based neural networks for chinese reading comprehension. In Proceedings of the International Conference on Computational Linguistics (COLING’16). 1777–1786. https://www.aclweb.org/anthology/C16-1167.
[74]
Yiming Cui, Ting Liu, Ziqing Yang, Zhipeng Chen, Wentao Ma, Wanxiang Che, Shijin Wang, and Guoping Hu. 2020. A sentence Cloze dataset for Chinese machine reading comprehension. In Proceedings of the International Conference on Computational Linguistics (COLING’20). 6717–6723.
[75]
Pradeep Dasigi, Nelson F. Liu, Ana Marasović, Noah A. Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 5924–5931.
[76]
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. 2021. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 19th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’21). 4599–4610.
[77]
Lorenz Demey, Barteld Kooi, and Joshua Sack. 2019. Logic and probability. In The Stanford Encyclopedia of Philosophy (Summer 2019 ed.), Edward N. Zalta (Ed.). Metaphysics Research Lab, Stanford University, Stanford, CA. https://plato.stanford.edu/archives/sum2019/entries/logic-probability/.
[78]
Martin d’Hoffschmidt, Maxime Vidal, Wacim Belblidia, and Tom Brendlé. 2020. FQuAD: French question answering dataset. arXiv:2002.06071 [CS] (2020). http://arxiv.org/abs/2002.06071.
[79]
Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. arXiv:1811.01241 [CS] (2018). http://arxiv.org/abs/1811.01241.
[80]
Cícero dos Santos, Luciano Barbosa, Dasha Bogdanova, and Bianca Zadrozny. 2015. Learning hybrid representations to retrieve semantically equivalent questions. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 5th International Joint Conference on Natural Language Processing (ACL-IJCNLP’21). 694–699.
[81]
Igor Douven. 2017. Abduction. In The Stanford Encyclopedia of Philosophy (Summer 2017 ed.), Edward N. Zalta (Ed.). Metaphysics Research Lab, Stanford University, Stanford, CA. https://plato.stanford.edu/archives/sum2017/entries/abduction/.
[82]
Dheeru Dua, Ananth Gottumukkala, Alon Talmor, Matt Gardner, and Sameer Singh. 2019. ORB: An open reading benchmark for comprehensive evaluation of machine reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering. 147–153.
[83]
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 17th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). 2368–2378. https://aclweb.org/anthology/papers/N/N19/N19-1246/.
[84]
Nan Duan and Duyu Tang. 2018. Overview of the NLPCC 2017 shared task: Open domain Chinese question answering. In Natural Language Processing and Chinese Computing, Xuanjing Huang, Jing Jiang, Dongyan Zhao, Yansong Feng, and Yu Hong (Eds.). Springer, Cham, Switzerland, 954–961.
[85]
Jesse Dunietz, Greg Burnham, Akash Bharadwaj, Owen Rambow, Jennifer Chu-Carroll, and Dave Ferrucci. 2020. To test machine comprehension, start by defining comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 7839–7859. https://www.aclweb.org/anthology/2020.acl-main.701.
[86]
Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv:1704.05179 [CS] (2017). http://arxiv.org/abs/1704.05179.
[87]
Daria Dzendzik, Jennifer Foster, and Carl Vogel. 2021. English machine reading comprehension datasets: A survey. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 8784–8804.
[88]
Pavel Efimov, Andrey Chertok, Leonid Boytsov, and Pavel Braslavski. 2020. SberQuAD—Russian reading comprehension dataset: Description and analysis. arXiv:1912.09723 [CS] (2020).
[89]
Ahmed Elgohary, Denis Peskov, and Jordan Boyd-Graber. 2019. Can you unpack that? Learning to rewrite questions-in-context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 5918–5924.
[90]
Mihail Eric and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. arXiv:1705.05414 [CS] (2017). http://arxiv.org/abs/1705.05414.
[91]
Allyson Ettinger. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics 8 (2020), 34–48. arXiv:1907.13528
[92]
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 3558–3567.
[93]
Manaal Faruqui and Dipanjan Das. 2018. Identifying well-formed natural language questions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing(EMNLP’18). 798–803.
[94]
H. M. Fayek and J. Johnson. 2020. Temporal reasoning via audio question answering. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), 2283–2294.
[95]
Alena Fenogenova, Vladislav Mikhailov, and Denis Shevelev. 2020. Read and reason with MuSeRC and RuCoS: Datasets for machine reading comprehension for Russian. In Proceedings of the International Conference on Computational Linguistics (COLING’20). 6481–6497.
[96]
James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A dataset of incomplete information reading comprehension questions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 1137–1147.
[97]
Noa Garcia, Chentao Ye, Zihua Liu, Qingtao Hu, Mayu Otani, Chenhui Chu, Yuta Nakashima, and Teruko Mitamura. 2020. A dataset and baselines for visual question answering on art. arXiv:2008.12520 [CS] (2020). http://arxiv.org/abs/2008.12520.
[98]
Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, et al. 2020. Evaluating models’ local decision boundaries via contrast sets. In Findings of EMNLP’20. 1307–1323.
[99]
Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2019. Question answering is a format; when is it useful? arXiv:1909.11291 (2019).
[100]
Matt Gardner, William Merrill, Jesse Dodge, Matthew Peters, Alexis Ross, Sameer Singh, and Noah A. Smith. 2021. Competency problems: On finding and removing artifacts in language data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 1801–1813.
[101]
Siddhant Garg, Thuy Vu, and Alessandro Moschitti. 2020. TANDA: Transfer and adapt pre-trained transformer models for answer sentence selection. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI’20). 7780–7788.
[102]
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2020. Datasheets for datasets. arXiv:1803.09010 [CS] (2020). http://arxiv.org/abs/1803.09010.
[103]
Atticus Geiger, Ignacio Cases, Lauri Karttunen, and Christopher Potts. 2019. Posing fair generalization tasks for natural language inference. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 4475–4485.
[104]
Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 1161–1166.
[105]
Taisia Glushkova, Alexey Machnev, Alena Fenogenova, Tatiana Shavrina, Ekaterina Artemova, and Dmitry I. Ignatov. 2020. DaNetQA: A yes/no question answering dataset for the russian language. arXiv:2010.02605 [CS] (2020). http://arxiv.org/abs/2010.02605.
[106]
Yoav Goldberg. 2019. Assessing BERT’s syntactic abilities. arXiv:1901.05287 [CS] (2019). http://arxiv.org/abs/1901.05287.
[107]
Ana Valeria González, Anna Rogers, and Anders Søgaard. 2021. On the interaction of belief bias and explanations. In Findings of ACL-IJCNLP’21. 2930–2942. https://aclanthology.org/2021.findings-acl.259.
[108]
Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2012. SemEval-2012 Task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Proceedings of the 1st Joint Conference on Lexical and Computational Semantics (*SEM’12). 394–398. https://aclweb.org/anthology/papers/S/S12/S12-1052/.
[109]
Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. 2018. IQA: Visual question answering in interactive environments. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18). 4089–4098.
[110]
Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su. 2021. Beyond I.I.D.: Three levels of generalization for question answering on knowledge bases. arXiv:2011.07743 [CS] (2021).
[111]
Mandy Guo, Yinfei Yang, Daniel Cer, Qinlan Shen, and Noah Constant. 2020. MultiReQA: A cross-domain evaluation for retrieval question answering models. arXiv:2005.02507 [CS] (2020). http://arxiv.org/abs/2005.02507.
[112]
Shangmin Guo, Kang Liu, Shizhu He, Cao Liu, Jun Zhao, and Zhuoyu Wei. 2017. IJCNLP-2017 Task 5: Multi-choice question answering in examinations. In Proceedings of the IJCNLP’17, Shared Tasks. 34–40. https://www.aclweb.org/anthology/I17-4005.
[113]
Aditya Gupta, Jiacheng Xu, Shyam Upadhyay, Diyi Yang, and Manaal Faruqui. 2021. Disfl-QA: A benchmark dataset for understanding disfluencies in question answering. In Findings of ACL 2021. https://arxiv.org/abs/2106.04016.
[114]
Deepak Gupta, Surabhi Kumari, Asif Ekbal, and Pushpak Bhattacharyya. 2018. MMQA: A multi-domain multi-lingual question-answering framework for English and Hindi. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’18). https://www.aclweb.org/anthology/L18-1440.
[115]
Mansi Gupta, Nitish Kulkarni, Raghuveer Chanda, Anirudha Rayasam, and Zachary C. Lipton. 2019. AmazonQA: A review-based question answering task. In Proceedings of the 2019 International Joint Conference on Artificial Intelligence (IJCAI’19). 4996–5002.
[116]
Vishal Gupta, Manoj Chinnakotla, and Manish Shrivastava. 2018. Transliteration better than translation? Answering code-mixed questions over a knowledge base. In Proceedings of the 3rd Workshop on Computational Approaches to Linguistic Code-Switching. 39–50.
[117]
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 16th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18). 107–112.
[118]
Rujun Han, I-Hung Hsu, Jiao Sun, Julia Baylon, Qiang Ning, Dan Roth, and Nanyun Peng. 2021. ESTER: A machine reading comprehension dataset for reasoning about event semantic relations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 7543–7559.
[119]
Helia Hashemi, Mohammad Aliannejadi, Hamed Zamani, and W. Bruce Croft. 2019. ANTIQUE: A non-factoid question answering benchmark. arXiv:1905.08957 [CS] (2019). http://arxiv.org/abs/1905.08957.
[120]
Naeemul Hassan, Fatma Arslan, Chengkai Li, and Mark Tremayne. 2017. Toward automated fact-checking: Detecting check-worthy factual claims by ClaimBuster. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’17). ACM, New York, NY, 1803–1812. http://dblp.uni-trier.de/db/conf/kdd/kdd2017.html#HassanALT17.
[121]
James Hawthorne. 2021. Inductive logic. In The Stanford Encyclopedia of Philosophy (Spring 2022 ed.), Edward N. Zalta (Ed.). Metaphysics Research Lab, Stanford University, Stanford, CA. https://plato.stanford.edu/archives/spr2021/entries/logic-inductive/.
[122]
Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, et al. 2018. DuReader: A Chinese machine reading comprehension dataset from real-world applications. In Proceedings of the Workshop on Machine Reading for Question Answering. 37–46.
[123]
Nancy Hedberg, Juan M. Sosa, and Lorna Fadden. 2004. Meanings and configurations of questions in English. In Proceedings of the International Conference on Speech Prosody. 309–312. https://www.isca-speech.org/archive/sp2004/papers/sp04_309.pdf.
[124]
Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24–27, 1990. https://www.aclweb.org/anthology/H90-1021.
[125]
Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS’15). 1693–1701. http://dl.acm.org/citation.cfm?id=2969239.2969428.
[126]
Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. WikiReading: A novel large-scale language understanding task over Wikipedia. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL’16). 1535–1545.
[127]
Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The goldilocks principle: Reading children’s books with explicit memory representations. arXiv:1511.02301 [CS] (2015). http://arxiv.org/abs/1511.02301.
[128]
Andrea Horbach, Itziar Aldabe, Marie Bexte, Oier Lopez de Lacalle, and Montse Maritxalar. 2020. Linguistic appropriateness and pedagogic usefulness of reading comprehension questions. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’20). 1753–1762. https://www.aclweb.org/anthology/2020.lrec-1.217.
[129]
Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 2391–2401.
[130]
Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). 6700–6709. https://openaccess.thecvf.com/content_CVPR_2019/html/Hudson_GQA_A_New_Dataset_for_Real-World_Visual_Reasoning_and_Compositional_CVPR_2019_paper.html.
[131]
SRI International. 2011. SRI’s Amex Travel Agent Data. Retrieved September 16, 2022 from http://www.ai.sri.com/~communic/amex/amex.html.
[132]
Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 2758–2766. https://openaccess.thecvf.com/content_cvpr_2017/html/Jang_TGIF-QA_Toward_Spatio-Temporal_CVPR_2017_paper.html.
[133]
Dalton Jeffrey, Xiong Chenyan, and Callan Jamie. 2019. CAsT 2019: The conversational assistance track overview. In Proceedings of the Text REtrival Conference (TREC’19).
[134]
Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’17). 2021–2031.
[135]
Zhen Jia, Abdalghani Abujabal, Rishiraj Saha Roy, Jannik Strötgen, and Gerhard Weikum. 2018. TempQuestions: A benchmark for temporal question answering. In Companion of WWW’18. ACM, New York, NY, 1057–1062.
[136]
Zhen Jia, Abdalghani Abujabal, Rishiraj Saha Roy, Jannik Strötgen, and Gerhard Weikum. 2018. TEQUILA: Temporal question answering over knowledge bases. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM’18). ACM, New York, NY, 1807–1810.
[137]
Kelvin Jiang, Dekun Wu, and Hui Jiang. 2019. FreebaseQA: A new factoid QA data set matching trivia-style question-answer pairs with Freebase. In Proceedings of the 17th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). 318–323. https://aclweb.org/anthology/papers/N/N19/N19-1028/.
[138]
Carlos E. Jimenez, Olga Russakovsky, and Karthik Narasimhan. 2022. CARETS: A consistency and robustness evaluative test suite for VQA. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 6392–6405.
[139]
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 2567–2577.
[140]
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 2901–2910.
[141]
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17). 1601–1611.
[142]
Yannis Katsis, Saneem Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj, Mustafa Canim, Michael Glass, Alfio Gliozzo, et al. 2022. AIT-QA: Question answering dataset over complex tables in the airline industry. In Proceedings of the 20th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’22). 305–314.
[143]
Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2019. Learning the difference that makes a difference with counterfactually-augmented data. In Proceedings of the International Conference on Learning Representations(ICLR’19). https://openreview.net/forum?id=Sklgs0NFvr.
[144]
Rachel Keraron, Guillaume Lancrenon, Mathilde Bras, Frédéric Allary, Gilles Moyse, Thomas Scialom, Edmundo-Pavel Soriano-Morales, and Jacopo Staiano. 2020. Project PIAF: Building a native French question-answering dataset. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’20). 5481–5490. https://www.aclweb.org/anthology/2020.lrec-1.673.
[145]
Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 16th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18). 252–262.
[146]
Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. arXiv:2005.00700 [CS] (2020). https://arxiv.org/abs/2005.00700.
[147]
Kyung-Min Kim, Min-Oh Heo, Seong-Ho Choi, and Byoung-Tak Zhang. 2017. DeepStory: Video story QA by deep embedded memory networks. In Proceedings of the 2017 International Joint Conference on Artificial Intelligence (IJCAI’17). https://openreview.net/forum?id=ryZczSz_bS.
[148]
Seokhwan Kim, Luis Ferdinando D’Haro, Rafael E. Banchs, Matthew Henderson, Jason Willisams, and Koichiro Yoshino. 2016. Dialog State Tracking Challenge 5 Handbook v.3.1. Retrieved September 16, 2022 from http://workshop.colips.org/dstc5/.
[149]
Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics 6 (2018), 317–328. http://aclweb.org/anthology/Q18-1023.
[150]
Xiang Kong, Varun Gangal, and Eduard Hovy. 2020. SCDE: Sentence Cloze dataset with high quality distractors from examinations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 5668–5683.
[151]
Robert Koons. 2017. Defeasible reasoning. In The Stanford Encyclopedia of Philosophy (Winter 2017 ed.), Edward N. Zalta (Ed.). Metaphysics Research Lab, Stanford University, Stanford, CA. https://plato.stanford.edu/archives/win2017/entries/reasoning-defeasible/.
[152]
Vladislav Korablinov and Pavel Braslavski. 2020. RuBQ: A Russian dataset for question answering over wikidata. arXiv:2005.10659 [CS] (2020). http://arxiv.org/abs/2005.10659.
[153]
Kalpesh Krishna, Aurko Roy, and Mohit Iyyer. 2021. Hurdles to progress in long-form question answering. In Proceedings of the 19th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’21). 4940–4957.
[154]
Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 271–281.
[155]
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, et al. 2019. Natural questions: A benchmark for question answering research. Transactions of Association for Computational Linguistics 7 (2019), 452–466. https://ai.google/research/pubs/pub47761.
[156]
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’17). 785–794.
[157]
C. Lee, S. Wang, H. Chang, and H. Lee. 2018. ODSQA: Open-domain spoken question answering dataset. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT’18). 949–956.
[158]
Kyungjae Lee, Kyoungho Yoon, Sunghyun Park, and Seung-Won Hwang. 2018. Semi-supervised training data generation for multilingual question answering. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’18). https://www.aclweb.org/anthology/L18-1437.
[159]
Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. 2018. TVQA: Localized, compositional video question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 1369–1379. https://www.aclweb.org/anthology/papers/D/D18/D18-1167/.
[160]
Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd Schema Challenge. In Proceedings of the 13th International Conference on Principles of Knowledge Representation and Reasoning. 552–561.
[161]
Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL’17). 333–342.
[162]
Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. MLQA: Evaluating cross-lingual extractive question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 7315–7330. https://www.aclweb.org/anthology/2020.acl-main.653/.
[163]
Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu, and Hung-Yi Lee. 2018. Spoken SQuAD: A study of mitigating the impact of speech recognition errors on listening comprehension. arXiv:1804.00320 [CS] (2018). http://arxiv.org/abs/1804.00320.
[164]
Haonan Li, Martin Tomko, Maria Vasardani, and Timothy Baldwin. 2022. MultiSpanQA: A dataset for multi-span question answering. In Proceedings of the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’22). 1250–1260.
[165]
Jiaqi Li, Ming Liu, Min-Yen Kan, Zihao Zheng, Zekun Wang, Wenqiang Lei, Ting Liu, and Bing Qin. 2020. Molweni: A challenge multiparty dialogues-based machine reading comprehension dataset with discourse structure. arXiv:2004.05080 [CS] (2020). http://arxiv.org/abs/2004.05080.
[166]
Jing Li, Shangping Zhong, and Kaizhi Chen. 2021. MLEC-QA: A Chinese multi-choice biomedical question answering dataset. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 8862–8874.
[167]
Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, and Wei Xu. 2016. Dataset and neural recurrent sequence labeling model for open-domain factoid question answering. arXiv:1607.06275 [CS] (2016). http://arxiv.org/abs/1607.06275.
[168]
Xiaoya Li, Fan Yin, Zijun Sun, Xiayu Li, Arianna Yuan, Duo Chai, Mingxin Zhou, and Jiwei Li. 2019. Entity-relation extraction as multi-turn question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 1340–1350.
[169]
Yongqi Li, Wenjie Li, and Liqiang Nie. 2022. MMCoQA: Conversational question answering over text, tables, and images. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 4220–4231.
[170]
Yichan Liang, Jianheng Li, and Jian Yin. 2019. A new multi-choice reading comprehension dataset for curriculum learning. In Proceedings of the 11th Asian Conference on Machine Learning. 742–757. http://proceedings.mlr.press/v101/liang19a.html.
[171]
Seungyoung Lim, Myungji Kim, and Jooyoul Lee. 2019. KorQuAD1.0: Korean QA dataset for machine reading comprehension. arXiv:1909.07005 [CS] (2019). http://arxiv.org/abs/1909.07005.
[172]
Bill Yuchen Lin, Seyeon Lee, Rahul Khanna, and Xiang Ren. 2020. Birds have four legs? NumerSense: Probing numerical commonsense knowledge of pre-trained language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 6862–6868.
[173]
Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. 2019. Reasoning over paragraph effects in situations. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering. 58–62.
[174]
Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 3214–3252.
[175]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision—ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer, Cham, Switzerland, 740–755.
[176]
Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? arXiv:2005.00955 [CS] (2020). https://arxiv.org/pdf/2005.00955.pdf.
[177]
Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the 2020 International Joint Conference on Artificial Intelligence (IJCAI’20). 3622–3628.
[178]
Jiahua Liu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2019. XQA: A cross-lingual open-domain question answering dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 2358–2368.
[179]
Pengyuan Liu, Yuning Deng, Chenghao Zhu, and Han Hu. 2019. XCMRC: Evaluating cross-lingual machine reading comprehension. In Natural Language Processing and Chinese Computing, Jie Tang, Min-Yen Kan, Dongyan Zhao, Sujian Li, and Hongying Zan (Eds.). Springer, Cham, Switzerland, 552–564.
[180]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692 [CS] (2019). http://arxiv.org/abs/1907.11692.
[181]
Teng Long, Emmanuel Bengio, Ryan Lowe, Jackie Chi Kit Cheung, and Doina Precup. 2017. World knowledge for reading comprehension: Rare entity prediction with hierarchical LSTMs using external descriptions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’17). 825–834.
[182]
Shayne Longpre, Yi Lu, and Joachim Daiber. 2020. MKQA: A linguistically diverse benchmark for multilingual open domain question answering. arXiv:2007.15207 [CS] (2020). http://arxiv.org/abs/2007.15207.
[183]
Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv:1506.08909 [CS] (2015). http://arxiv.org/abs/1506.08909.
[184]
Kaixin Ma, Tomasz Jurczyk, and Jinho D. Choi. 2018. Challenging reading comprehension on daily conversation: Passage completion on multiparty dialog. In Proceedings of the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’22). 2039–2048.
[185]
Leigh-Ann MacFarlane and Genviève Boulet. 2017. Multiple-choice tests can support deep learning! Proceedings of the Atlantic Universities’ Teaching Showcase 21 (2017), 61–66. https://ojs.library.dal.ca/auts/article/view/8430.
[186]
Tegan Maharaj, Nicolas Ballas, Anna Rohrbach, Aaron Courville, and Christopher Pal. 2017. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. arXiv:1611.07810 [CS] (2017). http://arxiv.org/abs/1611.07810.
[187]
Cheryl L. Marcham, Treasa M. Turnbeaugh, Susan Gould, and Joel T. Nadler. 2018. Developing certification exam questions: More deliberate than you may think. Professional Safety 63, 5 (May 2018), 44–49. https://onepetro.org/PS/article/63/05/44/33528/Developing-Certification-Exam-Questions-More.
[188]
Michał Marcinczuk, Marcin Ptak, Adam Radziszewski, and Maciej Piasecki. 2013. Open dataset for development of Polish question answering systems. In Proceedings of the 6th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. https://www.researchgate.net/profile/Maciej-Piasecki/publication/272685856_Open_dataset_for_development_of_Polish_Question_Answering_systems.
[189]
Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of ACL’22. 2263–2279.
[190]
Julian McAuley and Alex Yang. 2016. Addressing complex and subjective product-related queries with customer reviews. In Proceedings of the 25th International Conference on World Wide Web (WWW’16). 625–635.
[191]
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv:1806.08730 [CS, STAT] (2018). http://arxiv.org/abs/1806.08730.
[192]
John McCarthy and Patrick Hayes. 1969. Some philosophical problems from the standpoint of artificial intelligence. In Machine Intelligence 4, B. Meltzer and Donald Michie (Eds.). Edinburgh University Press, 463–502.
[193]
Tom McCoy, Junghyun Min, and Tal Linzen. 2019. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. arXiv:1911.02969 [CS] (2019). http://arxiv.org/abs/1911.02969.
[194]
Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 3428–3448.
[195]
Danielle S. McNamara and Joe Magliano. 2009. Toward a comprehensive model of comprehension. In The Psychology of Learning and Motivation. Psychology of Learning and Motivation Series, Vol. 51. Academic Press, Cambridge, MA, 297–384.
[196]
Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 975–984. https://www.aclweb.org/anthology/2020.acl-main.92.
[197]
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 2381–2391. http://aclweb.org/anthology/D18-1260.
[198]
Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. arXiv:2004.10645 [CS] (2020). http://arxiv.org/abs/2004.10645.
[199]
Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. 2021. SPARTQA: A textual question answering benchmark for spatial reasoning. In Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’21). 4582–4598.
[200]
Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, and Chitta Baral. 2020. Towards question format independent numerical reasoning: A set of prerequisite tasks. arXiv preprint arXiv:2005.08516 (2020). https://arxiv.org/abs/2005.08516.
[201]
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*’19). ACM, New York, NY, 220–229.
[202]
Ashutosh Modi, Tatjana Anikina, Simon Ostermann, and Manfred Pinkal. 2016. InScript: Narrative texts annotated with script information. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’16). 3485–3493. https://www.aclweb.org/anthology/L16-1555.
[203]
Timo Möller, Anthony Reina, Raghavan Jayakumar, and Malte Pietsch. 2020. COVID-QA: A question answering dataset for COVID-19. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL’20. https://www.aclweb.org/anthology/2020.nlpcovid19-acl.18.
[204]
Nasrin Mostafazadeh, Michael Roth, Nathanael Chambers, and Annie Louis. 2017. LSDSem 2017 shared task: The story Cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential, and Discourse-Level Semantics. 46–51. http://www.aclweb.org/anthology/W17-0900.
[205]
Hussein Mozannar, Elie Maamary, Karl El Hajal, and Hazem Hajj. 2019. Neural Arabic question answering. In Proceedings of the 4th Arabic Natural Language Processing Workshop. 108–118.
[206]
Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung, and Bohyung Han. 2017. MarioQA: Answering questions by watching gameplay videos. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17). http://arxiv.org/abs/1612.01669.
[207]
Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, and Karin Verspoor. 2017. SemEval-2017 Task 3: Community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval’17). 27–48. http://www.aclweb.org/anthology/S17-2003.
[208]
Preslav Nakov, Lluís Màrquez, Walid Magdy, Alessandro Moschitti, Jim Glass, and Bilal Randeree. 2015. SemEval-2015 Task 3: Answer selection in community question answering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval’15). 269–281.
[209]
Preslav Nakov, Lluís Màrquez, Alessandro Moschitti, Walid Magdy, Hamdy Mubarak, abed Alhakim Freihat, Jim Glass, and Bilal Randeree. 2016. SemEval-2016 Task 3: Community question answering. 525–545.
[210]
Kiet Nguyen, Vu Nguyen, Anh Nguyen, and Ngan Nguyen. 2020. A Vietnamese dataset for evaluating machine reading comprehension. In Proceedings of the International Conference on Computational Linguistics (COLING’20). 2595–2605.
[211]
Qiang Ning, Hao Wu, Rujun Han, Nanyun Peng, Matt Gardner, and Dan Roth. 2020. TORQUE: A reading comprehension dataset of temporal ordering questions. arXiv:2005.00242 [CS] (2020). http://arxiv.org/abs/2005.00242.
[212]
Kazumasa Omura, Daisuke Kawahara, and Sadao Kurohashi. 2020. A method for building a commonsense inference dataset based on basic events. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 2450–2460. https://www.aclweb.org/anthology/2020.emnlp-main.192.
[213]
Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did what: A large-scale person-centered Cloze dataset. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’16). 2230–2235.
[214]
Simon Ostermann, Ashutosh Modi, Michael Roth, Stefan Thater, and Manfred Pinkal. 2018. MCScript: A novel dataset for assessing machine comprehension using script knowledge. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’18). https://www.aclweb.org/anthology/L18-1564.
[215]
Simon Ostermann, Michael Roth, Ashutosh Modi, Stefan Thater, and Manfred Pinkal. 2018. SemEval-2018 Task 11: Machine comprehension using commonsense knowledge. In Proceedings of the 12th International Workshop on Semantic Evaluation. 747–757.
[216]
Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A large corpus for question answering on electronic medical records. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 2357–2368. http://aclweb.org/anthology/D18-1258.
[217]
Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, et al. 2022. QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’22). 5336–5358.
[218]
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv:1606.06031 [CS] (2016). http://arxiv.org/abs/1606.06031.
[219]
Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP’15). 1470–1480.
[220]
Alkesh Patel, Akanksha Bindal, Hadas Kotek, Christopher Klein, and Jason Williams. 2020. Generating natural questions from images for multimodal assistants. arXiv:2012.03678 [CS] (2020). http://arxiv.org/abs/2012.03678.
[221]
Anselmo Peñas, Christina Unger, and Axel-Cyrille Ngonga Ngomo. 2014. Overview of CLEF question answering track 2014. In Information Access Evaluation. Multilinguality, Multimodality, and Interaction. Springer, Cham, Switzerland, 300–306.
[222]
Anselmo Peñas, Christina Unger, Georgios Paliouras, and Ioannis Kakadiaris. 2015. Overview of the CLEF question answering track 2015. In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Lecture Notes in Computer Science, Vol. 9283. Springer, 539–544.
[223]
Gustavo Penha, Alexandru Balan, and Claudia Hauff. 2019. Introducing MANtIS: A novel multi-domain information seeking dialogues dataset. arXiv:1912.04639 [CS] (2019). http://arxiv.org/abs/1912.04639.
[224]
Denis Peskov, Nancy Clarke, Jason Krone, Brigi Fodor, Yi Zhang, Adel Youssef, and Mona Diab. 2019. Multi-domain goal-oriented dialogues (MultiDoGO): Strategies toward curating and annotating large scale dialogue data. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 4526–4536.
[225]
Jonas Pfeiffer, Gregor Geigle, Aishwarya Kamath, Jan-Martin Steitz, Stefan Roth, Ivan Vulić, and Iryna Gurevych. 2022. xGQA: Cross-lingual visual question answering. In Findings of ACL’22. 2497–2511.
[226]
Eric Price. 2014. The NIPS Experiment. Retrieved September 16, 2022 from http://blog.mrtz.org/2014/12/15/the-nips-experiment.html.
[227]
Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, and Zachary C. Lipton. 2019. Learning to deceive with attention-based explanations. arXiv:1909.07913 [CS] (2019). http://arxiv.org/abs/1909.07913.
[228]
Lianhui Qin, Aditya Gupta, Shyam Upadhyay, Luheng He, Yejin Choi, and Manaal Faruqui. 2021. TIMEDIAL: Temporal commonsense reasoning in dialog. arXiv:2106.04571 [CS.CL] (2021).
[229]
Boyu Qiu, Xu Chen, Jungang Xu, and Yingfei Sun. 2019. A survey on neural machine reading comprehension. arXiv:1906.03824 [CS] (2019). http://arxiv.org/abs/1906.03824.
[230]
Chen Qu, Liu Yang, W. Bruce Croft, Johanne R. Trippas, Yongfeng Zhang, and Minghui Qiu. 2018. Analyzing and characterizing user intent in information-seeking conversations. In Proceedings of the 41st ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’18). ACM, New York, NY, 989–992.
[231]
Filip Radlinski, Krisztian Balog, Bill Byrne, and Karthik Krishnamoorthi. 2019. Coached conversational preference elicitation: A case study in understanding movie preferences. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue. https://research.google/pubs/pub48414/.
[232]
Khyathi Chandu Raghavi, Manoj Kumar Chinnakotla, and Manish Shrivastava. 2015. “Answer ka type kya he?”: Learning to classify questions in code-mixed language. In Proceedings of the 24th International Conference on World Wide Web (WWW’15 Companion). ACM, New York, NY, 853–858.
[233]
Nazneen Fatema Rajani, Ben Krause, Wenpeng Yin, Tong Niu, Richard Socher, and Caiming Xiong. 2020. Explaining and improving model behavior with k nearest neighbor representations. arXiv:2010.09030 [CS] (2020). http://arxiv.org/abs/2010.09030.
[234]
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). 784–789. http://aclweb.org/anthology/P18-2124.
[235]
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’16). 2383–2392.
[236]
Alan Ramponi and Barbara Plank. 2020. Neural unsupervised domain adaptation in NLP—A survey. In Proceedings of the International Conference on Computational Linguistics (COLING’20). 6838–6855.
[237]
Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). 463–473.
[238]
Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics 7 (March 2019), 249–266.
[239]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16). ACM, New York, NY, 1135–1144.
[240]
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 4902–4912. https://www.aclweb.org/anthology/2020.acl-main.442.
[241]
Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’13). 193–203.
[242]
Pedro Rodriguez and Jordan Boyd-Graber. 2021. Evaluation paradigms in question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 9630–9642.
[243]
Pedro Rodriguez, Paul Crook, Seungwhan Moon, and Zhiguang Wang. 2020. Information seeking in the spirit of learning: A dataset for conversational curiosity. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 8153–8172.
[244]
Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan Boyd-Graber. 2021. Quizbowl: The case for incremental question answering. arXiv:1904.04792 [CS] (2021).
[245]
Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Proceedings of the AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning. 6.
[246]
Anna Rogers. 2019. How the Transformers Broke NLP Leaderboards. Retrieved September 16, 2022 from https://hackingsemantics.xyz/2019/leaderboards/.
[247]
Anna Rogers. 2021. Changing the world by changing the data. In Proceedings of the Conference of the Association for Computational Linguistics (ACL’21). 2182–2194. https://aclanthology.org/2021.acl-long.170.
[248]
Anna Rogers and Isabelle Augenstein. 2020. What can we do to improve peer review in NLP? In Findings of EMNLP’20. 1256–1262. https://www.aclweb.org/anthology/2020.findings-emnlp.112/.
[249]
Anna Rogers, Olga Kovaleva, Matthew Downey, and Anna Rumshisky. 2020. Getting closer to AI complete question answering: A set of prerequisite real tasks. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI’20). 8722–8731. https://aaai.org/ojs/index.php/AAAI/article/view/6398.
[250]
Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Barua, Aaron Phillips, and Yinfei Yang. 2020. LAReQA: Language-agnostic answer retrieval from a multilingual pool. arXiv:2004.05484 [CS] (2020). http://arxiv.org/abs/2004.05484.
[251]
Sebastian Ruder and Avirup Sil. 2021. Multi-domain multilingual question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21).
[252]
Rachel Rudinger, Vered Shwartz, Jena D. Hwang, Chandra Bhagavatula, Maxwell Forbes, Ronan Le Bras, Noah A. Smith, and Yejin Choi. 2020. Thinking like a skeptic: Defeasible inference in natural language. In Findings of EMNLP’20. 4661–4675.
[253]
Barbara Rychalska, Dominika Basaj, Anna Wróblewska, and Przemyslaw Biecek. 2018. Does it care what you asked? Understanding importance of verbs in deep learning QA system. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP. 322–324. http://aclweb.org/anthology/W18-5436.
[254]
Mrinmaya Sachan, Kumar Dubey, Eric Xing, and Matthew Richardson. 2015. Learning answer-entailing structures for machine comprehension. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP’15). 239–249.
[255]
Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation of natural language rules in conversational machine reading. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 2087–2097.
[256]
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. WinoGrande: An adversarial Winograd Schema Challenge at scale. arXiv:1907.10641 [CS] (2019). http://arxiv.org/abs/1907.10641.
[257]
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQA: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 4453–4463.
[258]
Viktor Schlegel, Goran Nenadic, and Riza Batista-Navarro. 2020. Beyond leaderboards: A survey of methods for revealing weaknesses in natural language inference data and models. arXiv:2005.14709 [CS] (2020). http://arxiv.org/abs/2005.14709.
[259]
Viktor Schlegel, Marco Valentino, André Freitas, Goran Nenadic, and Riza Batista-Navarro. 2020. A framework for evaluation of machine reading comprehension gold standards. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’20). http://arxiv.org/abs/2003.04642.
[260]
Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2015. A survey of available corpora for building data-driven dialogue systems. arXiv:1512.05742 [CS, STAT] (2015). http://arxiv.org/abs/1512.05742.
[261]
Chih Chieh Shao, Trois Liu, Yuting Lai, Yiying Tseng, and Sam Tsai. 2019. DRCD: A Chinese machine reading comprehension dataset. arXiv:1806.00920 [CS] (2019). http://arxiv.org/abs/1806.00920.
[262]
Shuming Shi, Yuehui Wang, Chin-Yew Lin, Xiaojiang Liu, and Yong Rui. 2015. Automatically solving number word problems by semantic parsing and reasoning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’15). 1132–1142.
[263]
Hideyuki Shibuki, Kotaro Sakamoto, Yoshinobu Kano, Teruko Mitamura, Madoka Ishioroshi, Kelly Y. Itakura, Di Wang, Tatsunori Mori, and Noriko Kando. 2014. Overview of the NTCIR-11 QA-lab task. In Proceedings of the 11th NTCIR Conference. 518–529. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings11/pdf/NTCIR/OVERVIEW/01-NTCIR11-OV-QALAB-ShibukiH.pdf.
[264]
Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. 2019. CLUTRR: A diagnostic benchmark for inductive reasoning from text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 4496–4505.
[265]
Sunayana Sitaram, Khyathi Raghavi Chandu, Sai Krishna Rallabandi, and Alan W. Black. 2020. A survey of code-switched speech and language processing. arXiv:1904.00784 [CS, STAT] (2020). http://arxiv.org/abs/1904.00784.
[266]
Amir Soleimani, Christof Monz, and Marcel Worring. 2021. NLQuAD: A non-factoid long question answering data set. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (EACL’21). 1245–1255. https://aclanthology.org/2021.eacl-main.106.
[267]
Saku Sugawara and Akiko Aizawa. 2016. An analysis of prerequisite skills for reading comprehension. In Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods. 1–5.
[268]
Saku Sugawara, Yusuke Kido, Hikaru Yokono, and Akiko Aizawa. 2017. Evaluation metrics for machine reading comprehension: Prerequisite skills and readability. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17). 806–817.
[269]
Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and Akiko Aizawa. 2020. Assessing the benchmarking capacity of machine reading comprehension datasets. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI’20). http://arxiv.org/abs/1911.09241.
[270]
Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. 2017. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17). 217–223.
[271]
Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 6418–6428.
[272]
Haitian Sun, William Cohen, and Ruslan Salakhutdinov. 2022. ConditionalQA: A complex reading comprehension dataset with conditional answers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 3627–3637.
[273]
Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics 7 (April 2019), 217–231.
[274]
Ningyuan Sun, Xuefeng Yang, and Yunfeng Liu. 2020. TableQA: A large-scale Chinese Text-to-SQL dataset for table-aware SQL generation. arXiv:2006.06434 [CS] (2020). http://arxiv.org/abs/2006.06434.
[275]
Simon Suster and Walter Daelemans. 2018. CliCR: A dataset of clinical case reports for machine reading comprehension. In Proceedings of the 16th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18). 1551–1563.
[276]
Oyvind Tafjord, Peter Clark, Matt Gardner, Wen-Tau Yih, and Ashish Sabharwal. 2019. QuaRel: A dataset and models for answering questions about qualitative relationships. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI’19).
[277]
Oyvind Tafjord, Matt Gardner, Kevin Lin, and Peter Clark. 2019. QuaRTz: An open-domain dataset of qualitative relationship questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 5941–5946.
[278]
Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 16th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18). 641–651.
[279]
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 17th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). 4149–4158. https://www.aclweb.org/anthology/papers/N/N19/N19-1421/.
[280]
Alon Talmor, Ori Yoran, Amnon Catav, Dan Lahav, Yizhong Wang, Akari Asai, Gabriel Ilharco, Hannaneh Hajishirzi, and Jonathan Berant. 2021. MultimodalQA: Complex question answering over text, tables and images. In Proceedings of the 9th International Conference on Learning Representations (ICLR’21). 12. https://openreview.net/pdf/f3dad930cb55abce99a229e35cc131a2db791b66.pdf.
[281]
Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).
[282]
Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the 35th Conference on Neural Information Processing Systems, Datasets, and Benchmarks Track. https://openreview.net/forum?id=wCu6T5xFjeJ.
[283]
Paul Thomas, Daniel McDuff, Mary Czerwinski, and Nick Craswell. 2017. MISC: A data set of information-seeking conversations. In Proceedings of the 1st International Workshop on Conversational Approaches to Information Retrieval (CAIR’17). https://www.microsoft.com/en-us/research/wp-content/uploads/2017/07/Thomas-etal-CAIR17.pdf.
[284]
Jesse Thomason, Daniel Gordon, and Yonatan Bisk. 2019. Shifting the baseline: Single modality performance on visual navigation & QA. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’19). 1977–1983. https://www.aclweb.org/anthology/papers/N/N19/N19-1197/.
[285]
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the 16th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18). 809–819.
[286]
Johanne R. Trippas, Damiano Spina, Lawrence Cavedon, Hideo Joho, and Mark Sanderson. 2018. Informing the design of spoken conversational search: Perspective paper. In Proceedings of the 2018 Conference on Human Information Interaction and Retrieval (CHIIR’18). ACM, New York, NY, 32–41.
[287]
Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP. 191–200.
[288]
George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R. Alvers, Dirk Weissenborn, et al. 2015. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 1 (April 2015), 138.
[289]
Bo-Hsiang Tseng, Sheng-Syun Shen, Hung-Yi Lee, and Lin-Shan Lee. 2016. Towards machine comprehension of spoken content: Initial TOEFL listening comprehension test by machine. In Proceedings of the 17th Annual Conference of the International Speech Communication Association (Interspeech’16). 2731–2735.
[290]
Shyam Upadhyay and Ming-Wei Chang. 2017. Annotating derivations: A new evaluation strategy and dataset for algebra word problems. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL’17). 494–504. https://www.aclweb.org/anthology/E17-1047.
[291]
Svitlana Vakulenko and Vadim Savenkov. 2017. TableQA: Question answering on tabular data. arXiv:1705.06504 [CS] (2017). http://arxiv.org/abs/1705.06504.
[292]
Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation. 355–368.
[293]
Elke van der Meer, Reinhard Beyer, Bertram Heinze, and Isolde Badel. 2002. Temporal order relations in language comprehension. Journal of Experimental Psychology. Learning, Memory, and Cognition 28, 4 (July 2002), 770–779.
[294]
Teun A. van Dijk and Walter Kintsch. 1983. Strategies of Discourse Comprehension. Academic Press, New York, NY. P302 .D472 1983
[295]
David Vilares and Carlos Gómez-Rodríguez. 2019. HEAD-QA: A healthcare dataset for complex reasoning. arXiv:1906.04701 [CS] (2019). http://arxiv.org/abs/1906.04701.
[296]
Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’00). ACM, New York, NY, 200–207.
[297]
Eric Wallace and Jordan Boyd-Graber. 2018. Trick me if you can: Adversarial writing of trivia challenge questions. In Proceedings of the Association for Computational Linguistics Student Research Workshop (ACL-SRW’18). 127–133. http://aclweb.org/anthology/P18-3018.
[298]
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’19). http://arxiv.org/abs/1908.07125.
[299]
Bingning Wang, Ting Yao, Qi Zhang, Jingfang Xu, and Xiaochuan Wang. 2020. ReCO: A large scale Chinese reading comprehension dataset on opinion. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI’20). 8. https://www.aaai.org/Papers/AAAI/2020GB/AAAI-WangB.2547.pdf.
[300]
Jiexin Wang, Adam Jatowt, Michael Färber, and Masatoshi Yoshikawa. 2021. Improving question answering for event-focused questions in temporal collections of news articles. Information Retrieval Journal 24, 1 (Feb. 2021), 29–54.
[301]
Jiexin Wang, Adam Jatowt, and Masatoshi Yoshikawa. 2022. ArchivalQA: A large-scale benchmark dataset for open domain question answering over historical news collections. arXiv:2109.03438 [CS] (2022). http://arxiv.org/abs/2109.03438.
[302]
Ping Wang, Tian Shi, and Chandan K. Reddy. 2020. Text-to-SQL generation for question answering on electronic medical records. In Proceedings of the Web Conference 2020 (WWW’20). ACM, New York, NY, 350–361.
[303]
Takuto Watarai and Masatoshi Tsuchiya. 2020. Developing dataset of Japanese slot filling quizzes designed for evaluation of machine reading comprehension. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’20). 6895–6901. https://www.aclweb.org/anthology/2020.lrec-1.852.
[304]
Dirk Weissenborn, Pasquale Minervini, Isabelle Augenstein, Johannes Welbl, Tim Rocktäschel, Matko Bošnjak, Jeff Mitchell, et al. 2018. Jack the Reader—A machine reading framework. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). 25–30.
[305]
Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv:1502.05698 [CS] (2015). http://arxiv.org/abs/1502.05698.
[306]
Michael White, Graham Chapman, John Cleese, Eric Idle, Terry Gilliam, Terry Jones, Michael Palin, et al. 2001. Monty Python and the Holy Grail.
[307]
Yuk Wah Wong and Raymond Mooney. 2006. Learning for semantic parsing with statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference. 439–446. https://www.aclweb.org/anthology/N06-1056.
[308]
Chien-Sheng Wu, Andrea Madotto, Wenhao Liu, Pascale Fung, and Caiming Xiong. 2022. QAConv: Question answering on informative conversations. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 5389–5411.
[309]
Wenhan Xiong, Jiawei Wu, Hong Wang, Vivek Kulkarni, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2019. TWEETQA: A social media focused question answering dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 5020–5031.
[310]
Canwen Xu, Jiaxin Pei, Hongtao Wu, Yiyu Liu, and Chenliang Li. 2020. MATINF: A jointly labeled large-scale dataset for classification, question answering and summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 3586–3596. https://www.aclweb.org/anthology/2020.acl-main.330.
[311]
Ying Xu, Dakuo Wang, Mo Yu, Daniel Ritchie, Bingsheng Yao, Tongshuang Wu, Zheng Zhang, et al. 2022. Fantastic questions and where to find them: FairytaleQA—An authentic dataset for narrative comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 447–460.
[312]
Yi Yang, Wen-Tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’15). 2013–2018. http://aclweb.org/anthology/D15-1237.
[313]
Zhengzhe Yang and Jinho D. Choi. 2019. FriendsQA: Open-domain question answering on TV show transcripts. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue. 188–197.
[314]
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 2369–2380. http://aclweb.org/anthology/D18-1259.
[315]
Mark Yatskar. 2019. A qualitative comparison of CoQA, SQuAD 2.0, and QuAC. In Proceedings of the 17th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). 2318–2323. https://www.aclweb.org/anthology/papers/N/N19/N19-1241/.
[316]
Fan Yin, Zhouxing Shi, Cho-Jui Hsieh, and Kai-Wei Chang. 2021. On the faithfulness measurements for model interpretations. arXiv:2104.08782 [CS] (2021). http://arxiv.org/abs/2104.08782.
[317]
Chenyu You, Nuo Chen, Fenglin Liu, Dongchao Yang, and Yuexian Zou. 2020. Towards data distillation for end-to-end spoken conversational question answering. arXiv:2010.08923 [CS, EESS] (2020). http://arxiv.org/abs/2010.08923.
[318]
Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. ReClor: A reading comprehension dataset requiring logical reasoning. In Proceedings of the 8th International Conference on Learning Representations (ICLR’20). https://openreview.net/forum?id=HJgJtT4tvB.
[319]
Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 93–104. http://aclweb.org/anthology/D18-1009.
[320]
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the Conference of the Association for Computational Linguistics (ACL’19). http://arxiv.org/abs/1905.07830.
[321]
Michael Zhang and Eunsol Choi. 2021. SituatedQA: Incorporating extra-linguistic contexts into QA. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 7371–7387.
[322]
Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv:1810.12885 [CS] (Oct. 2018). http://arxiv.org/abs/1810.12885.
[323]
Yian Zhang, Alex Warstadt, Haau-Sing Li, and Samuel R. Bowman. 2020. When do you need billions of words of pretraining data? arXiv:2011.04946 [CS] (Nov. 2020). http://arxiv.org/abs/2011.04946.
[324]
Zhuosheng Zhang and Hai Zhao. 2018. One-shot learning for question-answering in Gaokao history challenge. In Proceedings of the International Conference on Computational Linguistics (COLING’18). 449–461. https://www.aclweb.org/anthology/C18-1038.
[325]
Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning (ICML’21). http://arxiv.org/abs/2102.09690.
[326]
Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv:1709.00103 [CS] (2017). http://arxiv.org/abs/1709.00103.
[327]
Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. “Going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 3361–3367.
[328]
Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL’21). 3277–3287.
[329]
Fengbin Zhu, Wenqiang Lei, Chao Wang, Jianming Zheng, Soujanya Poria, and Tat-Seng Chua. 2021. Retrieving and reading: A comprehensive survey on open-domain question answering. arXiv:2101.00774 [CS] (2021). http://arxiv.org/abs/2101.00774.
[330]
Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexander G. Hauptmann. 2017. Uncovering the temporal context for video question answering. International Journal of Computer Vision 124, 3 (Sept. 2017), 409–421.
[331]
Ming Zhu, Aman Ahuja, Da-Cheng Juan, Wei Wei, and Chandan K. Reddy. 2020. Question answering with long multiple-span answers. In Findings of EMNLP’20. 3840–3849.
[332]
Rolf A. Zwaan. 2016. Situation models, mental simulations, and abstract concepts in discourse comprehension. Psychonomic Bulletin & Review 23, 4 (Aug. 2016), 1028–1034.

Published In

ACM Computing Surveys  Volume 55, Issue 10
October 2023
772 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3567475

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 February 2023
Online AM: 13 September 2022
Accepted: 11 August 2022
Revised: 03 July 2022
Received: 27 July 2021
Published in CSUR Volume 55, Issue 10

Author Tags

  1. Reading comprehension
  2. natural language understanding

Qualifiers

  • Survey
  • Refereed
