Abstract
Entity Resolution (ER) in data integration systems is the problem of identifying groups of tuples, from one or multiple data sources, that represent the same real-world entity. It is a crucial stage of data integration processes, which often need to integrate data at query time. The task becomes even more challenging when data sources are dynamic or when a large volume of data needs to be integrated. New ER solutions have been proposed to deal with large volumes of data. One approach performs the ER process over query results rather than over the whole set of tuples being integrated. Additionally, the results of previous ER tasks can be reused to reduce the number of comparisons between pairs of tuples at query time. Similarly, indexing techniques can help identify equivalent tuples and further reduce the number of pairwise comparisons. In this context, this work proposes an incremental ER process over query results. The contributions of this work are the specification, implementation, and evaluation of the proposed incremental process. Our experiments indicate that incremental ER at query time is more efficient than traditional ER processes.
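The three ideas named in the abstract (resolving only the tuples a query returns, reusing earlier ER results, and using an index to limit comparisons) can be illustrated with a minimal Python sketch. This is not the process specified in the article. The class name `IncrementalResolver`, the token-based Jaccard matcher, the 0.8 threshold, the first-letter blocking key, and the `id`/`name` tuple layout are all assumptions made for illustration only.

```python
def similar(a, b, threshold=0.8):
    """Assumed matcher: Jaccard similarity over lowercase name tokens."""
    ta = set(a["name"].lower().split())
    tb = set(b["name"].lower().split())
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold


class IncrementalResolver:
    """Hypothetical resolver that keeps its state across queries."""

    def __init__(self):
        self.cluster_of = {}   # tuple id -> cluster id (reused across queries)
        self.tuples = {}       # tuple id -> tuple already resolved
        self.index = {}        # blocking key -> ids of resolved tuples
        self.comparisons = 0   # number of pairwise comparisons performed

    def _block_key(self, t):
        # Assumed blocking key: first letter of the name, lowercased.
        return t["name"][:1].lower()

    def resolve(self, query_result):
        """Resolve only the tuples in this query result."""
        for t in query_result:
            tid = t["id"]
            if tid in self.cluster_of:
                continue  # already resolved by an earlier query: reuse it
            self.tuples[tid] = t
            key = self._block_key(t)
            cluster = None
            # Compare only against tuples that share the blocking key.
            for other_id in self.index.get(key, []):
                self.comparisons += 1
                if similar(t, self.tuples[other_id]):
                    cluster = self.cluster_of[other_id]
                    break
            if cluster is None:
                cluster = tid  # no match found: start a new cluster
            self.cluster_of[tid] = cluster
            self.index.setdefault(key, []).append(tid)

        # Group the ids of this query result by their cluster.
        groups = {}
        for t in query_result:
            groups.setdefault(self.cluster_of[t["id"]], []).append(t["id"])
        return sorted(sorted(g) for g in groups.values())


resolver = IncrementalResolver()
q1 = [
    {"id": 1, "name": "Ana Silva"},
    {"id": 2, "name": "ana silva"},
    {"id": 3, "name": "Bruno Costa"},
]
first = resolver.resolve(q1)
comparisons_after_q1 = resolver.comparisons

# The second query overlaps the first; only tuple 4 is new.
q2 = [
    {"id": 2, "name": "ana silva"},
    {"id": 3, "name": "Bruno Costa"},
    {"id": 4, "name": "Bruno Costa"},
]
second = resolver.resolve(q2)
comparisons_after_q2 = resolver.comparisons
```

In this sketch the second query triggers a single new comparison, because tuples 2 and 3 were already resolved and tuple 4 is compared only against tuples in its own block. A batch process over the same four tuples without blocking would compare all six pairs.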
Acknowledgements
The authors thank the Center of Informatics at the Federal University of Pernambuco, Brazil, for providing the infrastructure used to develop this research.
Cite this article
Vieira, P.K.M., Lóscio, B.F. & Salgado, A.C. Incremental entity resolution process over query results for data integration systems. J Intell Inf Syst 52, 451–471 (2019). https://doi.org/10.1007/s10844-019-00544-1
DOI: https://doi.org/10.1007/s10844-019-00544-1