DOI: 10.1145/3459637.3481961

WikiCheck: An End-to-end Open Source Automatic Fact-Checking API based on Wikipedia

Published: 30 October 2021

Abstract

With the growth of fake news and disinformation, the NLP community has been working to assist humans in fact-checking. However, most academic research has focused on model accuracy without paying attention to resource efficiency, which is crucial in real-life scenarios. In this work, we review state-of-the-art datasets and solutions for automatic fact-checking and test their applicability in production environments. We discover overfitting issues in those models and propose a data filtering method that improves the models' performance and generalization. We then design an unsupervised fine-tuning procedure for masked language models that improves their accuracy when working with Wikipedia. We also propose a novel query-enhancement method that improves evidence discovery through the Wikipedia Search API. Finally, we present a new fact-checking system, the WikiCheck API, which automatically performs fact validation against the Wikipedia knowledge base. It is comparable to SOTA solutions in terms of accuracy and can be used on low-memory CPU instances.
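The query-enhancement step mentioned in the abstract (deriving better Wikipedia search queries from a claim before evidence retrieval) can be illustrated with a minimal sketch. The capitalized-span heuristic and the function name `candidate_queries` below are illustrative assumptions, not the authors' actual method, which is described in the full paper.

```python
def candidate_queries(claim: str) -> list[str]:
    # Illustrative heuristic (not the paper's method): collect maximal
    # runs of capitalized words as candidate entity mentions, since
    # entity-focused queries tend to retrieve better evidence pages
    # from a search API than the raw claim alone.
    queries, run = [], []
    for token in claim.split():
        word = token.strip(".,;:!?\"'")
        if word[:1].isupper():
            run.append(word)
        else:
            if run:
                queries.append(" ".join(run))
            run = []
    if run:
        queries.append(" ".join(run))
    # Fall back to the full claim as a query of last resort.
    queries.append(claim)
    return queries

print(candidate_queries("Albert Einstein was born in Ulm."))
```

Each returned string would then be issued as a separate query against a search endpoint, and the retrieved sentences passed to an NLI model for claim verification.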


    Published In

    CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
    October 2021
    4966 pages
    ISBN:9781450384469
    DOI:10.1145/3459637
    Publisher: Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. applied research
    2. fact-checking
    3. nli
    4. nlp
    5. wikipedia

    Qualifiers

    • Research-article

    Conference

    CIKM '21

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
