Large-scale linked data integration using probabilistic reasoning and crowdsourcing

Authors: Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux

The VLDB Journal — The International Journal on Very Large Data Bases, Volume 22, Issue 5

Pages 665 - 687

https://doi.org/10.1007/s00778-013-0324-z

Published: 01 October 2013

Abstract

We tackle two problems: semiautomatically matching linked data sets, and linking large collections of Web pages to linked data. Our system, ZenCrowd, (1) uses a three-stage blocking technique to obtain the best possible instance matches while minimizing both computational complexity and latency, and (2) identifies entities in natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud. First, we use structured inverted indices to quickly find candidate matches among the entities that have been indexed in our system. Our system then analyzes the candidate matches and, whenever deemed necessary, refines them using computationally more expensive queries on a graph database. Finally, we resort to human computation by dynamically generating crowdsourcing tasks when the algorithmic components fail to produce convincing results. We integrate all results from the inverted indices, the graph database, and the crowd using a probabilistic framework in order to make sensible decisions about candidate matches and to identify unreliable human workers. In the following, we give an overview of the architecture of our system and describe in detail our novel three-stage blocking technique and our probabilistic decision framework. We also report on a series of experimental results on a standard data set, showing that our system can achieve 95% average accuracy on instance matching (compared to the initial 88% average accuracy of the purely automatic baseline) while drastically limiting the amount of work performed by the crowd. The experimental evaluation of our system on the entity linking task shows an average relative improvement of 14% over our best automatic approach.
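The three-stage cascade described in the abstract can be illustrated with a minimal Python sketch. This is not the ZenCrowd implementation: the function names, the two confidence thresholds, and the naive-Bayes combination of worker votes are all assumptions made for illustration, since the abstract does not specify how scores are computed or combined. Worker reliabilities are taken as given inputs and are assumed to lie strictly between 0 and 1.

```python
from typing import Callable, Dict, Tuple

# Hypothetical thresholds; the abstract does not state the actual values.
ACCEPT = 0.9  # confidence at or above which a candidate match is accepted outright
REJECT = 0.1  # confidence at or below which a candidate match is discarded


def posterior_match(prior: float,
                    votes: Dict[str, bool],
                    reliability: Dict[str, float]) -> float:
    """Combine a prior match probability with independent worker votes.

    reliability[w] is the assumed probability that worker w answers
    correctly (strictly between 0 and 1). This is a plain naive-Bayes
    update, used here as a stand-in for the paper's probabilistic framework.
    """
    p_match, p_no_match = prior, 1.0 - prior
    for worker, says_match in votes.items():
        r = reliability[worker]
        p_match *= r if says_match else 1.0 - r
        p_no_match *= 1.0 - r if says_match else r
    return p_match / (p_match + p_no_match)


def decide(candidate: str,
           index_score: Callable[[str], float],
           graph_score: Callable[[str, float], float],
           ask_crowd: Callable[[str], Dict[str, bool]],
           reliability: Dict[str, float]) -> Tuple[bool, float, str]:
    """Escalate a candidate through cheap, expensive, and human stages.

    Returns (is_match, confidence, last_stage_used).
    """
    # Stage 1: cheap score from the inverted index.
    p, stage = index_score(candidate), "index"
    # Stage 2: more expensive graph query, only for uncertain candidates.
    if REJECT < p < ACCEPT:
        p, stage = graph_score(candidate, p), "graph"
    # Stage 3: crowdsourcing, only if the algorithms remain unconvinced.
    if REJECT < p < ACCEPT:
        p, stage = posterior_match(p, ask_crowd(candidate), reliability), "crowd"
    return p >= 0.5, p, stage


if __name__ == "__main__":
    workers = {"w1": 0.8, "w2": 0.8}  # assumed reliabilities
    print(decide("dbpedia:Zurich",
                 index_score=lambda c: 0.5,
                 graph_score=lambda c, p: 0.6,
                 ask_crowd=lambda c: {"w1": True, "w2": True},
                 reliability=workers))
```

The point of the cascade is that each stage runs only when the previous one leaves the confidence between the two thresholds, which is how crowd work stays limited to the hard cases. How the real system estimates worker reliability is not covered by the abstract, so the sketch leaves that step out.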

Index Terms

Large-scale linked data integration using probabilistic reasoning and crowdsourcing
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval

Index terms have been assigned to the content through auto-classification.

Published In

The VLDB Journal — The International Journal on Very Large Data Bases Volume 22, Issue 5

October 2013

137 pages

ISSN:1066-8888

Copyright © 2013 Springer-Verlag Berlin Heidelberg.

Publisher

Springer-Verlag

Berlin, Heidelberg
