DOI: 10.1145/3459637.3481961

WikiCheck: An End-to-end Open Source Automatic Fact-Checking API based on Wikipedia

Published: 30 October 2021

Abstract

With the growth of fake news and disinformation, the NLP community has been working to assist humans in fact-checking. However, most academic research has focused on model accuracy without paying attention to resource efficiency, which is crucial in real-life scenarios. In this work, we review state-of-the-art datasets and solutions for automatic fact-checking and test their applicability in production environments. We discover overfitting issues in those models and propose a data filtering method that improves the models' performance and generalization. We then design an unsupervised fine-tuning procedure for masked language models that improves their accuracy when working with Wikipedia. We also propose a novel query-enhancement method that improves evidence discovery through the Wikipedia Search API. Finally, we present a new fact-checking system, the WikiCheck API, which automatically performs fact validation against the Wikipedia knowledge base. It is comparable to SOTA solutions in terms of accuracy and can be used on low-memory CPU instances.
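The query-enhancement step mentioned in the abstract (deriving better Wikipedia search queries from a claim before evidence retrieval) can be illustrated with a minimal sketch. The capitalized-span heuristic and the function name `candidate_queries` below are illustrative assumptions, not the authors' actual method, which is described in the full paper.

```python
def candidate_queries(claim: str) -> list[str]:
    # Illustrative heuristic (not the paper's method): collect maximal
    # runs of capitalized words as candidate entity mentions, since
    # entity-focused queries tend to retrieve better evidence pages
    # from a search API than the raw claim alone.
    queries, run = [], []
    for token in claim.split():
        word = token.strip(".,;:!?\"'")
        if word[:1].isupper():
            run.append(word)
        else:
            if run:
                queries.append(" ".join(run))
            run = []
    if run:
        queries.append(" ".join(run))
    # Fall back to the full claim as a query of last resort.
    queries.append(claim)
    return queries

print(candidate_queries("Albert Einstein was born in Ulm."))
```

Each returned string would then be issued as a separate query against a search endpoint, and the retrieved sentences passed to an NLI model for claim verification.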


    Published In

    CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
    October 2021
    4966 pages
    ISBN:9781450384469
    DOI:10.1145/3459637
    Publisher: Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. applied research
    2. fact-checking
    3. nli
    4. nlp
    5. wikipedia

    Qualifiers

    • Research-article

    Conference

    CIKM '21

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
