Abstract
Machine Reading Comprehension (MRC) is a task that enables machines to mirror key cognitive processes involved in reading a text passage, comprehending it, and answering questions about it. There has been significant progress on this task for English in recent years, with recent systems not only surpassing human-level performance but also demonstrating advances in emulating complex human cognitive processes. However, the development of Arabic MRC has not kept pace, owing to the challenges of the language and the lack of large-scale, high-quality datasets. Existing datasets are either small, of low quality, or released as part of large multilingual corpora. We present the Arabic Question Answering Dataset (ArQuAD), a large MRC dataset for the Arabic language. The dataset comprises 16,020 questions posed by language experts on passages extracted from Arabic Wikipedia articles, where the answer to each question is a text segment from the corresponding reading passage. Besides providing various dataset analyses, we fine-tuned several pre-trained language models to obtain benchmark results. Among the compared methods, AraBERTv0.2-large achieved the best performance, with an exact match of 68.95% and an F1-score of 87.15%. The considerably higher human performance (exact match of 86% and F1-score of 95.5%) suggests substantial room for improvement in future research. We release the dataset publicly at https://github.com/RashaMObeidat/ArQuAD to encourage further development of language-aware MRC models for the Arabic language.
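The exact-match and F1 figures above follow the standard SQuAD-style span-extraction metrics: exact match checks whether the normalized predicted span equals the normalized gold span, while F1 measures token-level overlap between the two. The sketch below illustrates these metrics under a simplified normalization (lowercasing and punctuation stripping); it is not ArQuAD's official evaluation script, whose Arabic-specific normalization (e.g. diacritic handling) is not detailed here.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    # Simplified normalization: strip punctuation, lowercase,
    # and collapse whitespace. An Arabic-aware script might also
    # remove diacritics and normalize letter variants.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.lower().split())


def exact_match(prediction: str, gold: str) -> float:
    # 1.0 if the normalized strings are identical, else 0.0.
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    # Token-level F1: harmonic mean of precision and recall
    # over the multiset of shared tokens.
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

In benchmark reporting, both metrics are typically computed per question (taking the maximum over multiple gold answers, when available) and averaged over the dataset.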
Data Availability
The dataset generated during and/or analyzed during the current study is available in the ArQuAD repository, https://github.com/RashaMObeidat/ArQuAD.
Funding
This study was funded by the Deanship Of Research at Jordan University of Science and Technology (grant number 20210222).
Ethics declarations
Ethics Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Competing Interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Obeidat, R., Al-Harbi, M., Al-Ayyoub, M. et al. ArQuAD: An Expert-Annotated Arabic Machine Reading Comprehension Dataset. Cogn Comput 16, 984–1003 (2024). https://doi.org/10.1007/s12559-024-10248-6