
Knowledge-based trust: estimating the trustworthiness of web sources

Published: 01 May 2015

Abstract

The quality of web sources has been traditionally evaluated using exogenous signals such as the hyperlink structure of the graph. We propose a new approach that relies on endogenous signals, namely, the correctness of factual information provided by the source. A source that has few false facts is considered to be trustworthy.
The facts are automatically extracted from each source by information extraction methods commonly used to construct knowledge bases. We propose a way to distinguish errors made in the extraction process from factual errors in the web source per se, by using joint inference in a novel multi-layer probabilistic model.
We call the resulting trustworthiness score Knowledge-Based Trust (KBT). On synthetic data, we show that our method can reliably recover the true trustworthiness levels of the sources. We then apply it to a database of 2.8B facts extracted from the web, and thereby estimate the trustworthiness of 119M webpages. Manual evaluation of a subset of the results confirms the effectiveness of the method.
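
To make the idea of an endogenous trust signal concrete, the following is a minimal sketch of a simple iterative truth-finding loop, in which a source's score is the average believability of the values it provides. It is an illustration only and is not the multi-layer probabilistic model described in the abstract: in particular, it does not separate extraction errors from errors in the source itself. The function name `estimate_trust`, the `(source, data_item, value)` triple format, the initial accuracy of 0.8, and the toy claims are all assumptions made for this example.

```python
from collections import defaultdict


def estimate_trust(claims, iterations=10, initial_accuracy=0.8):
    """Toy single-layer truth finding (hypothetical helper, not the KBT model).

    claims: list of (source, data_item, value) triples.
    Returns (accuracy per source, belief per (data_item, value)).
    """
    sources = {source for source, _, _ in claims}
    accuracy = {source: initial_accuracy for source in sources}
    belief = {}

    for _ in range(iterations):
        # Step 1: each value collects votes weighted by its sources' accuracy.
        votes = defaultdict(lambda: defaultdict(float))
        for source, item, value in claims:
            votes[item][value] += accuracy[source]

        # Normalise the votes per data item into a belief in each value.
        belief = {}
        for item, by_value in votes.items():
            total = sum(by_value.values())
            for value, weight in by_value.items():
                belief[(item, value)] = weight / total

        # Step 2: a source's accuracy is the mean belief of what it claims.
        sums = defaultdict(float)
        counts = defaultdict(int)
        for source, item, value in claims:
            sums[source] += belief[(item, value)]
            counts[source] += 1
        accuracy = {source: sums[source] / counts[source] for source in sources}

    return accuracy, belief


# Toy input: sources A and B agree on both data items, source C disagrees.
claims = [
    ("A", "item1", "x"), ("B", "item1", "x"), ("C", "item1", "y"),
    ("A", "item2", "p"), ("B", "item2", "p"), ("C", "item2", "q"),
]
accuracy, belief = estimate_trust(claims)
```

On this toy input the two agreeing sources end up with a higher score than the dissenting one. Because the sketch relies purely on agreement between sources, a popular but wrong value would be rewarded, which is one reason a richer model is needed in practice.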

Published In

Proceedings of the VLDB Endowment, Volume 8, Issue 9
May 2015
76 pages
ISSN: 2150-8097
Editors: Chen Li, Volker Markl

Publisher

VLDB Endowment

Qualifiers

  • Research-article
