research-article

New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles

Authors:

Kiet Van Nguyen,

Anh Gia-Tuan Nguyen,

Ngan Luu-Thuy Nguyen

Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 5

Article No.: 105, Pages 1 - 28

https://doi.org/10.1145/3527631

Published: 23 September 2022

Abstract

Machine reading comprehension (MRC) is a natural language understanding task in which a computing system reads a text and then finds the answer to a specific question posed by a human. Large-scale, high-quality corpora are necessary for evaluating MRC models. MRC for the health sector has potential for practical applications; nevertheless, research in this domain is currently scarce. This article presents UIT-ViNewsQA, a new Vietnamese corpus for evaluating MRC models in the healthcare text domain. The corpus consists of 22,057 human-generated question-answer pairs. Crowd-workers created the questions and answers from a collection of 4,416 online Vietnamese healthcare news articles, where each answer is a text span extracted from the corresponding article. We introduce a process for creating a high-quality corpus for the Vietnamese MRC task. Linguistically, our corpus covers diverse question and answer types. In addition, we conduct experiments comparing the effectiveness of different MRC methods based on neural network and transformer architectures. Experimental results on our corpus show that the MRC system based on the ALBERT architecture outperforms the neural network architectures and the BERT-based approach, achieving an exact match score of 65.26% and an F1-score of 84.89%. The F1-score of the best machine model is still about 10.90% below that of humans, which indicates that building models that surpass human performance on UIT-ViNewsQA remains a challenge for future research. Our corpus is publicly available for research purposes on our website: http://nlp.uit.edu.vn/datasets.
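The exact match and F1 scores quoted in the abstract are the usual metrics for span-extraction MRC. The sketch below shows how such scores are commonly computed for a single predicted span against a single gold span. It assumes a SQuAD-style token-overlap F1 with a simplified normalization (lowercasing and whitespace splitting); the function names are illustrative, and the evaluation script actually used for UIT-ViNewsQA is not shown in this section and may differ.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, simplified normalization)."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer span and a gold answer span."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # If either span is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction contains one extra token.
gold_answer = "sốt xuất huyết"
predicted_answer = "bệnh sốt xuất huyết"
em = exact_match(predicted_answer, gold_answer)
f1 = f1_score(predicted_answer, gold_answer)
print(f"EM = {em:.2f}, F1 = {f1:.4f}")
```

Corpus-level scores are typically the average of these per-question values. When a question has several gold answers, the maximum over them is usually taken, which is why F1 can be much higher than exact match when predictions overlap the gold span without matching it exactly.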

Index Terms

New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles
1. Computer systems organization
  1. Embedded and cyber-physical systems
    1. Embedded systems
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Sentiment analysis

Information

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 21, Issue 5

September 2022

486 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3533669

Editor:
Imed Zitouni
Google, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 September 2022

Online AM: 02 May 2022

Accepted: 03 February 2022

Revised: 26 November 2021

Received: 29 May 2020

Published in TALLIP Volume 21, Issue 5

Qualifiers

Research-article
Refereed

Funding Sources

Master, PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF)
Institute of Big Data
Vietnam National University HoChiMinh City

Cited By

Pham Q, Le H, Dang Nhat M, Tran T. K, Tran-Tien M, Dang V, Vu H, Nguyen M, Phan X (2024). Towards Vietnamese Question and Answer Generation: An Empirical Study. ACM Transactions on Asian and Low-Resource Language Information Processing. Online publication date: 29-Jun-2024.
https://dl.acm.org/doi/10.1145/3675781
Van Nguyen K, Le T, Do T (2024). Numerical reasoning reading comprehension on Vietnamese COVID-19 news: task, corpus, and challenges. Neural Computing and Applications. Online publication date: 3-May-2024.
https://doi.org/10.1007/s00521-024-09744-5
Gao F, Hou J, Gu J, Zhang L (2023). Knowledge Graph based Mutual Attention for Machine Reading Comprehension over Anti-Terrorism Corpus. Data Intelligence 5:3 (685-706). Online publication date: 1-Aug-2023.
https://doi.org/10.1162/dint_a_00210
Nguyen S, Vo T, Nguyen D, Tran D, Nguyen K (2023). ViQP: Dataset for Vietnamese Question Paraphrasing. 2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR) (1-6). Online publication date: 5-Oct-2023.
https://doi.org/10.1109/MAPR59823.2023.10288738
Van Nguyen K, Do P, Nguyen N, Nguyen A, Nguyen N (2023). Multi-stage transfer learning with BERTology-based language models for question answering system in vietnamese. International Journal of Machine Learning and Cybernetics 14:5 (1877-1902). Online publication date: 30-Jan-2023.
https://doi.org/10.1007/s13042-022-01735-z
Le H, Ho V, Nguyen D, Nguyen N (2022). Integrating Semantic Information into Sketchy Reading Module of Retro-Reader for Vietnamese Machine Reading Comprehension. 2022 9th NAFOSTED Conference on Information and Computer Science (NICS) (53-58). Online publication date: 31-Oct-2022.
https://doi.org/10.1109/NICS56915.2022.10013390
