Abstract
To address the entity resolution problem, existing studies usually follow a two-step process. Given two lists of records, in the first step a small set of likely duplicate records (a candidate set) is selected using index structures and algorithms for efficient entity resolution. Then, a given similarity function is applied to quantify the similarity of the records in the candidate set. In real applications, however, selecting appropriate indexing techniques and similarity functions is a non-trivial task. In this paper, we tackle the problem of indexing and similarity function identification using both discriminative and deterministic approaches that select the best indexing techniques and similarity measures. According to our experimental results, the proposed solution, which combines the discriminative and deterministic approaches, achieves an average accuracy of more than 90% within hundreds of seconds.
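The two-step process described above can be illustrated with a minimal sketch. It is not the method proposed in the paper: the blocking key (first three characters), the Jaccard token similarity, the threshold of 0.5, and all function names are assumptions chosen only to make the candidate-selection and similarity steps concrete.

```python
from collections import defaultdict


def blocking_key(record: str) -> str:
    """Hypothetical blocking key: first three characters, lowercased."""
    return record.strip().lower()[:3]


def candidate_pairs(list_a, list_b):
    """Step 1: index list_b by blocking key and pair each record in
    list_a only with records that share its key (the candidate set)."""
    index = defaultdict(list)
    for rec in list_b:
        index[blocking_key(rec)].append(rec)
    for rec_a in list_a:
        for rec_b in index.get(blocking_key(rec_a), []):
            yield rec_a, rec_b


def jaccard(a: str, b: str) -> float:
    """Step 2: Jaccard similarity over lowercased word tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def resolve(list_a, list_b, threshold=0.5):
    """Score every candidate pair and keep those at or above the
    (assumed) similarity threshold."""
    matches = []
    for rec_a, rec_b in candidate_pairs(list_a, list_b):
        score = jaccard(rec_a, rec_b)
        if score >= threshold:
            matches.append((rec_a, rec_b, score))
    return matches


if __name__ == "__main__":
    left = ["John Smith", "Jane Doe"]
    right = ["john smith jr", "Jane Roe", "Bob Smith"]
    for rec_a, rec_b, score in resolve(left, right):
        print(f"{rec_a!r} ~ {rec_b!r} (similarity {score:.2f})")
```

Swapping `blocking_key` or `jaccard` for another indexing technique or similarity function changes both the candidate set and the final matches, which is the selection problem the abstract refers to.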
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (2013-012524) for the first author, the Energy Efficiency & Resources program of the Korea Institute of Energy Technology Evaluation and Planning (KETEP) grant funded by the Korea government Ministry of Knowledge Economy (No. 20132010101800) for the first and second authors, and the 2012 Yeungnam University Research Grant for the corresponding author.
Cite this article
On, BW., Lee, I., Choi, G.S. et al. Discriminative and deterministic approaches towards entity resolution. J Intell Inf Syst 43, 101–127 (2014). https://doi.org/10.1007/s10844-014-0308-5
DOI: https://doi.org/10.1007/s10844-014-0308-5