New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles

Published: 23 September 2022

    Abstract

    Machine reading comprehension (MRC) is a natural language understanding task in which a computing system must read a text and then find the answer to a specific question posed by a human. Large-scale, high-quality corpora are necessary for evaluating MRC models. Furthermore, MRC for the health sector has potential for practical applications; nevertheless, MRC research in this domain is currently scarce. This article presents UIT-ViNewsQA, a new Vietnamese corpus for evaluating MRC models in the healthcare text domain. The corpus consists of 22,057 human-generated question-answer pairs. Crowd-workers created the questions and answers on a collection of 4,416 online Vietnamese healthcare news articles, where the answers are textual spans extracted from the corresponding articles. We introduce a process for creating a high-quality corpus for the Vietnamese machine reading comprehension task. Linguistically, our corpus accommodates diversity in question and answer types. In addition, we conduct experiments and compare the effectiveness of different MRC methods based on neural network and transformer architectures. Experimental results on our corpus show that the MRC system based on the ALBERT architecture outperforms the neural network architectures and the BERT-based approach, achieving an exact match score of 65.26% and an F1-score of 84.89%. The best machine model scores about 10.90% lower in F1-score than humans, which indicates that building machine models that surpass human performance on UIT-ViNewsQA remains a challenge for future research. Our corpus is publicly available on our website: http://nlp.uit.edu.vn/datasets for research purposes.
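
    The abstract reports an exact match score and an F1-score for span-extraction answers. As a rough illustration of how such metrics are commonly computed, here is a minimal sketch in the style of the SQuAD evaluation. It assumes lowercasing and whitespace tokenization, which may differ from the normalization the authors actually used; the function names `normalize`, `exact_match`, and `f1_score` are hypothetical and chosen only for this example.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Assumption: lowercase and split on whitespace. The paper's exact
    # normalization (punctuation handling, word segmentation) may differ.
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    # 1.0 when the normalized predicted span equals the gold span, else 0.0.
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over the
    # multiset of tokens shared by the predicted and gold spans.
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # If either span is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

    A corpus-level score would then typically be the mean of these per-question values, reported as a percentage. Exact match rewards only fully correct spans, while F1 gives partial credit for overlapping tokens, which is one reason the two numbers can differ widely.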

    Information

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 5
    September 2022
    486 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3533669

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 September 2022
    Online AM: 02 May 2022
    Accepted: 03 February 2022
    Revised: 26 November 2021
    Received: 29 May 2020
    Published in TALLIP Volume 21, Issue 5

    Author Tags

    1. Machine reading comprehension
    2. question answering
    3. Vietnamese

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • Master, PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF)
    • Institute of Big Data
    • Vietnam National University HoChiMinh City

    Article Metrics

    • Downloads (last 12 months): 86
    • Downloads (last 6 weeks): 9
    Reflects downloads up to 11 Aug 2024

    Cited By

    • (2024) Towards Vietnamese Question and Answer Generation: An Empirical Study. ACM Transactions on Asian and Low-Resource Language Information Processing. DOI: 10.1145/3675781. Online publication date: 29-Jun-2024
    • (2024) Numerical reasoning reading comprehension on Vietnamese COVID-19 news: task, corpus, and challenges. Neural Computing and Applications. DOI: 10.1007/s00521-024-09744-5. Online publication date: 3-May-2024
    • (2023) Knowledge Graph based Mutual Attention for Machine Reading Comprehension over Anti-Terrorism Corpus. Data Intelligence 5:3 (685-706). DOI: 10.1162/dint_a_00210. Online publication date: 1-Aug-2023
    • (2023) ViQP: Dataset for Vietnamese Question Paraphrasing. 2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR) (1-6). DOI: 10.1109/MAPR59823.2023.10288738. Online publication date: 5-Oct-2023
    • (2023) Multi-stage transfer learning with BERTology-based language models for question answering system in Vietnamese. International Journal of Machine Learning and Cybernetics 14:5 (1877-1902). DOI: 10.1007/s13042-022-01735-z. Online publication date: 30-Jan-2023
    • (2022) Integrating Semantic Information into Sketchy Reading Module of Retro-Reader for Vietnamese Machine Reading Comprehension. 2022 9th NAFOSTED Conference on Information and Computer Science (NICS) (53-58). DOI: 10.1109/NICS56915.2022.10013390. Online publication date: 31-Oct-2022
