Toward Veracity Assessment in RDF Knowledge Bases: An Exploratory Analysis

Published: 20 February 2018

Abstract

Among the characteristics of knowledge bases, data quality is one of the most important for maximizing the benefit of the information they provide. Knowledge base quality assessment poses a number of big data challenges, such as high volume, variety, velocity, and veracity. In this article, we focus on questions related to assessing the veracity of facts through Deep Fact Validation (DeFacto), a triple validation framework designed to assess facts in RDF knowledge bases. Despite current developments in the research area, the underlying framework faces many challenges. This article pinpoints and discusses these issues and conducts a thorough analysis of the framework’s pipeline, aiming to reduce error propagation through its components. Furthermore, we discuss recent developments in fact validation and describe the advantages and drawbacks of state-of-the-art models. As a result of this exploratory analysis, we give insights and directions toward a better architecture to tackle the complex task of fact-checking in knowledge bases.

Information & Contributors

Information

Published In

Journal of Data and Information Quality, Volume 9, Issue 3
Special Issue on Improving the Veracity and Value of Big Data
September 2017
140 pages
ISSN: 1936-1955
EISSN: 1936-1963
DOI: 10.1145/3183573
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 February 2018
Accepted: 01 January 2018
Revised: 01 December 2017
Received: 01 April 2017
Published in JDIQ Volume 9, Issue 3

Author Tags

  1. DeFacto
  2. benchmark
  3. data quality
  4. exploratory data analysis
  5. fact checking
  6. linked data
  7. trustworthiness

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 23
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 04 Oct 2024

Citations

Cited By

  • (2024) Efficient and Reliable Estimation of Knowledge Graph Accuracy. Proceedings of the VLDB Endowment 17:9 (2392-2403). DOI: 10.14778/3665844.3665865. Online publication date: 1-May-2024
  • (2023) Examining Knowledge Extraction Processes from Heterogeneous Data Sources. Brilliant Engineering 4:1 (1-8). DOI: 10.36937/ben.2023.4798. Online publication date: 8-Feb-2023
  • (2021) Knowledge Graphs. Synthesis Lectures on Data, Semantics, and Knowledge 12:2 (1-257). DOI: 10.2200/S01125ED1V01Y202109DSK022. Online publication date: 8-Nov-2021
  • (2021) Advances in Data Management in the Big Data Era. Advancing Research in Information and Communication Technology (99-126). DOI: 10.1007/978-3-030-81701-5_4. Online publication date: 4-Aug-2021
  • (2020) How to Build a Knowledge Graph. Knowledge Graphs (11-68). DOI: 10.1007/978-3-030-37439-6_2. Online publication date: 1-Feb-2020
  • (2019) TISCO: Temporal Scoping of Facts. Companion Proceedings of The 2019 World Wide Web Conference (959-960). DOI: 10.1145/3308560.3316524. Online publication date: 13-May-2019
  • (n.d.) TISCO: Temporal Scoping of Facts. SSRN Electronic Journal. DOI: 10.2139/ssrn.3254234
