Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
article
Free access

Large-scale linked data integration using probabilistic reasoning and crowdsourcing

Published: 01 October 2013 Publication History
  • Get Citation Alerts
  • Abstract

    We tackle the problems of semiautomatically matching linked data sets and of linking large collections of Web pages to linked data. Our system, ZenCrowd, (1) uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and (2) identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud. First, we use structured inverted indices to quickly find potential candidate results from entities that have been indexed in our system. Our system then analyzes the candidate matches and refines them whenever deemed necessary using computationally more expensive queries on a graph database. Finally, we resort to human computation by dynamically generating crowdsourcing tasks in case the algorithmic components fail to come up with convincing results. We integrate all results from the inverted indices, from the graph database and from the crowd using a probabilistic framework in order to make sensible decisions about candidate matches and to identify unreliable human workers. In the following, we give an overview of the architecture of our system and describe in detail our novel three-stage blocking technique and our probabilistic decision framework. We also report on a series of experimental results on a standard data set, showing that our system can achieve a 95 % average accuracy on instance matching (as compared to the initial 88 % average accuracy of the purely automatic baseline) while drastically limiting the amount of work performed by the crowd. The experimental evaluation of our system on the entity linking task shows an average relative improvement of 14 % over our best automatic approach.

    References

    [1]
    Alonso, O., Baeza-Yates, R.A.: Design and implementation of relevance assessments using crowdsourcing. In: ECIR, pp. 153---164 (2011).
    [2]
    Bailey, P., de Vries, A.P., Craswell, N., Soboroff, I.: Overview of the TREC 2007 enterprise track. In: TREC (2007)
    [3]
    Balog, K., Serdyukov, P., de Vries, A.P.: Overview of the TREC 2010 entity track. In: TREC (2010)
    [4]
    Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. In: IJCAI, pp. 2670---2676 (2007)
    [5]
    Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, KDD '03, pp. 39---48. ACM, New York (2003).
    [6]
    Blanco, R., Halpin, H., Herzig, D., Mika, P., Pound, J., Thompson, H.S., Tran, D.T.: Repeatable and reliable search system evaluation using crowdsourcing. In: SIGIR, pp. 923---932 (2011)
    [7]
    Blanco, R., Mika, P., Vigna, S.: Effective and efficient entity search in RDF data. In: International Semantic Web Conference (ISWC), pp. 83---97 (2011)
    [8]
    Bouquet, P., Stoermer, H., Niederee, C., Mana, A.: Entity name system: the backbone of an open and scalable web of data. In: Proceedings of the IEEE International Conference on Semantic Computing (ICSC), pp. 554---561 (2008)
    [9]
    Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL, vol. 6 (2006)
    [10]
    Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: EACL (2006)
    [11]
    Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24(9), 1537---1555 (2012).
    [12]
    Ciaramita, M., Altun, Y.: Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pp. 594---602. ACL, Stroudsburg (2006). http://dl.acm.org/citation.cfm?id=1610075.1610158
    [13]
    Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of EMNLP-CoNLL, vol. 2007, pp. 708---716 (2007)
    [14]
    Cudré-Mauroux, P., Aberer, K., Feher, A.: Probabilistic message passing in peer data management systems. In: International Conference on Data Engineering (ICDE) (2006)
    [15]
    Cudré-Mauroux, P., Haghani, P., Jost, M., Aberer, K., De Meer, H.: idMesh: graph-based disambiguation of linked data. In: WWW '09, pp. 591---600. ACM, New York (2009).
    [16]
    Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the ACL (2002)
    [17]
    Demartini, G., Difallah, D.E., Cudré-Mauroux, P.: Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In: Proceedings of the 21st International Conference on World Wide Web, WWW '12, pp. 469---478. ACM, New York (2012).
    [18]
    Demartini, G., Iofciu, T., de Vries, A.P.: Overview of the INEX 2009 entity ranking track. In: INEX, pp. 254---264 (2009)
    [19]
    Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39 (1977)
    [20]
    Difallah, D.E., Demartini, G., Cudré-Mauroux, P.: Pick-A-Crowd: Tell me what you like, and I'll tell you what to do. In: WWW'13. ACM, New York (2013)
    [21]
    Dong, X., Halevy, A., Madhavan, J.: Reference reconciliation in complex information spaces. In: SIGMOD, pp. 85---96. ACM, New York (2005)
    [22]
    Feng, A., Franklin, M.J., Kossmann, D., Kraska, T., Madden, S., Ramesh, S., Wang, A., Xin, R.: CrowdDB: Query Processing with the VLDB Crowd. PVLDB 4(11), 1387---1390 (2011)
    [23]
    Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating named entities in Twitter data with crowdsourcing. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, CSLDAMT '10, pp. 80---88 (2010)
    [24]
    Getoor, L., Machanavajjhala, A.: Entity resolution: tutorial. In: VLDB (2012)
    [25]
    Haas, K., Mika, P., Tarjan, P., Blanco, R.: Enhanced results for web search. In: SIGIR, pp. 725---734 (2011)
    [26]
    Han, X., Zhao, J.: Named entity disambiguation by leveraging wikipedia semantic knowledge. In: Proceeding of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pp. 215---224. ACM, New York (2009).
    [27]
    Jaro, M.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa. Florida. J. Am. Stat. Assoc. 84(406), 414---420 (1989)
    [28]
    Kazai, G.: In search of quality in crowdsourcing for search engine evaluation. In: ECIR, pp. 165---176 (2011)
    [29]
    Kazai, G., Kamps, J., Koolen, M., Milic-Frayling, N.: Crowdsourcing for book search evaluation: impact of hit design on comparative system ranking. In: SIGIR, pp. 205---214 (2011)
    [30]
    Klein, D., Manning, C.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423---430 (2003)
    [31]
    Kohlschütter, C., Fankhauser, P., Nejdl, W.: Boilerplate detection using shallow text features. In: WSDM, pp. 441---450 (2010)
    [32]
    Kschischang, F., Frey, B., Loeliger, H.A.: Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory 47(2) (2001)
    [33]
    Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet Physics Doklady, vol. 10, pp. 707---710 (1966)
    [34]
    Lim, E.P., Srivastava, J., Prabhakar, S., Richardson, J.: Entity identification in database integration. Inform. Sci. 89(12), 1---38 (1996).
    [35]
    Liu, X., Lu, M., Ooi, B.C., Shen, Y., Wu, S., Zhang, M.: Cdas: a crowdsourcing data analytics system. Proc. VLDB Endow. 5(10), 1040---1051 (2012). http://dl.acm.org/citation.cfm?id=2336664.2336676
    [36]
    Marcus, A., Wu, E., Karger, D.R., Madden, S., Miller, R.C.: Human-powered sorts and joins. PVLDB 5(1), 13---24 (2011)
    [37]
    Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems (I-Semantics) (2011)
    [38]
    Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge. In: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM '07, pp. 233---242. ACM, New York (2007).
    [39]
    Murphy, K.M., Weiss, Y., Jordan, M.I.: Loopy belief propagation for approximate inference: an empirical study. In: Uncertainty in Artificial Intelligence (UAI) (1999)
    [40]
    On, B., Koudas, N., Lee, D., Srivastava, D.: Group linkage. In: Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE), pp. 496---505 (2007)
    [41]
    Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., Nejdl, W.: Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12, pp. 53---62. ACM, New York (2012).
    [42]
    Pound, J., Mika, P., Zaragoza, H.: Ad-hoc object retrieval in the web of data. In: WWW, pp. 771---780 (2010)
    [43]
    Selke, J., Lofi, C., Balke, W.T.: Pushing the boundaries of crowd-enabled databases with query-driven schema expansion. Proc. VLDB Endow. 5(6), 538---549 (2012). http://dl.acm.org/citation.cfm?id=2168651.2168655
    [44]
    Shen, W., Wang, J., Luo, P., Wang, M.: Liege: link entities in web lists with knowledge base. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '12, pp. 1424---1432. ACM, New York (2012).
    [45]
    Tonon, A., Demartini, G., Cudré-Mauroux, P.: Combining inverted indices and structured search for ad-hoc object retrieval. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12, pp. 125---134. ACM, New York (2012).
    [46]
    von Ahn, L., Dabbish, L.: Designing games with a purpose. Commun. ACM 51(8), 58---67 (2008).
    [47]
    von Ahn, L., Liu, R., Blum, M.: Peekaboom: a game for locating objects in images. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '06, pp. 55---64. ACM, New York (2006).
    [48]
    Wang, J., Kraska, T., Franklin, M.J., Feng, J.: CrowdER: crowdsourcing entity resolution. PVLDB 5(11), 1483---1494 (2012)
    [49]
    Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, SIGMOD '09, pp. 219---232. ACM, New York (2009).
    [50]
    Winkler, W.: The state of record linkage and current research problems. US Census Bureau. In: Statistical Research Division (1999)
    [51]
    Wylot, M., Pont, J., Wisniewski, M., Cudré-Mauroux, P.: dipLODocus{RDF}--short and long-tail rdf analytics for massive webs of data. In: International Semantic Web Conference (ISWC), pp. 778---793 (2011)

    Cited By

    View all

    Index Terms

    1. Large-scale linked data integration using probabilistic reasoning and crowdsourcing
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Information & Contributors

        Information

        Published In

        cover image The VLDB Journal — The International Journal on Very Large Data Bases
        The VLDB Journal — The International Journal on Very Large Data Bases  Volume 22, Issue 5
        October 2013
        137 pages

        Publisher

        Springer-Verlag

        Berlin, Heidelberg

        Publication History

        Published: 01 October 2013

        Author Tags

        1. Crowdsourcing
        2. Data integration
        3. Entity linking
        4. Instance matching
        5. Probabilistic reasoning

        Qualifiers

        • Article

        Contributors

        Other Metrics

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • Downloads (Last 12 months)20
        • Downloads (Last 6 weeks)3

        Other Metrics

        Citations

        Cited By

        View all
        • (2023)Causal Data IntegrationProceedings of the VLDB Endowment10.14778/3603581.360360216:10(2659-2665)Online publication date: 1-Jun-2023
        • (2021)Wiki2Prop: A Multimodal Approach for Predicting Wikidata Properties from WikipediaProceedings of the Web Conference 202110.1145/3442381.3450082(2357-2366)Online publication date: 19-Apr-2021
        • (2020)Leveraging Knowledge Graphs for Big Data IntegrationSemantic Web10.3233/SW-19037111:1(13-17)Online publication date: 1-Jan-2020
        • (2020)An Overview of End-to-End Entity Resolution for Big DataACM Computing Surveys10.1145/341889653:6(1-42)Online publication date: 6-Dec-2020
        • (2019)Expanding the Web of KnowledgeProceedings of the 30th ACM Conference on Hypertext and Social Media10.1145/3342220.3343671(9-18)Online publication date: 12-Sep-2019
        • (2019)Beyond Monetary IncentivesACM Transactions on Social Computing10.1145/33217002:2(1-31)Online publication date: 13-Jun-2019
        • (2019)Deadline-Aware Fair Scheduling for Multi-Tenant Crowd-Powered SystemsACM Transactions on Social Computing10.1145/33010032:1(1-29)Online publication date: 21-Feb-2019
        • (2018)Linked data schemataSemantic Web10.3233/SW-1702719:1(53-75)Online publication date: 1-Jan-2018
        • (2018)Using microtasks to crowdsource DBpedia entity classificationSemantic Web10.3233/SW-1702619:3(337-354)Online publication date: 1-Jan-2018
        • (2018)Quality Control in CrowdsourcingACM Computing Surveys10.1145/314814851:1(1-40)Online publication date: 4-Jan-2018
        • Show More Cited By

        View Options

        View options

        PDF

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        Get Access

        Login options

        Full Access

        Media

        Figures

        Other

        Tables

        Share

        Share

        Share this Publication link

        Share on social media