DOI: 10.1145/1559845.1559870

Entity resolution with iterative blocking

Published: 29 June 2009

Abstract

Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process involves computing the similarities between pairs of records, which can be very expensive for large datasets. Various blocking techniques can be used to enhance the performance of ER by dividing the records into blocks in multiple ways and only comparing records within the same block. However, most blocking techniques process blocks separately and do not exploit the results of other blocks. In this paper, we propose an iterative blocking framework where the ER results of blocks are reflected to subsequently processed blocks. Blocks are now iteratively processed until no block contains any more matching records. Compared to simple blocking, iterative blocking may achieve higher accuracy because reflecting the ER results of blocks to other blocks may generate additional record matches. Iterative blocking may also be more efficient because processing a block now saves the processing time for other blocks. We implement a scalable iterative blocking system and demonstrate that iterative blocking can be more accurate and efficient than blocking for large datasets.
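
The idea of reflecting one block's results into later blocks can be illustrated with a minimal sketch. This is not the system described in the paper: the record representation (a set of attribute/value pairs), the `match` rule (at least two shared pairs), the merge step (set union), and the helper names `build_blocks` and `iterative_blocking` are all assumptions made for illustration.

```python
from itertools import combinations

def match(a, b, threshold=2):
    # Hypothetical match rule: two records match if they share
    # at least `threshold` (attribute, value) pairs.
    return len(a & b) >= threshold

def build_blocks(records, attributes):
    # One block per distinct value of each blocking attribute;
    # a record may therefore fall into several blocks.
    blocks = {}
    for rid, rec in records.items():
        for attr, value in rec:
            if attr in attributes:
                blocks.setdefault((attr, value), set()).add(rid)
    return list(blocks.values())

def iterative_blocking(records, blocks):
    root = {rid: rid for rid in records}  # original id -> current cluster id
    merged = dict(records)                # cluster id -> merged record
    changed = True
    while changed:                        # repeat until a full pass finds no match
        changed = False
        for block in blocks:
            progress = True
            while progress:               # resolve this block to a fixpoint
                progress = False
                # A block sees the merged records produced by other blocks.
                clusters = sorted({root[rid] for rid in block})
                for a, b in combinations(clusters, 2):
                    if match(merged[a], merged[b]):
                        # Merge b into a (assumed merge: union of the pairs).
                        merged[a] = merged[a] | merged.pop(b)
                        for rid, c in root.items():
                            if c == b:
                                root[rid] = a
                        progress = changed = True
                        break
    groups = {}
    for rid, c in root.items():
        groups.setdefault(c, set()).add(rid)
    return sorted(sorted(g) for g in groups.values())

records = {
    "r1": frozenset({("name", "John Doe"), ("email", "jd@a.org"), ("city", "Palo Alto")}),
    "r2": frozenset({("name", "John Doe"), ("email", "jd@a.org"), ("phone", "555-0100")}),
    "r3": frozenset({("name", "Jon Doe"), ("phone", "555-0100"), ("city", "Palo Alto")}),
}
blocks = build_blocks(records, {"email", "phone", "city"})
print(iterative_blocking(records, blocks))  # [['r1', 'r2', 'r3']]
```

In this toy data, r3 shares only one pair with r1 (in the city block) and one with r2 (in the phone block), so processing those blocks separately finds no match for r3. Once the email block merges r1 and r2, the merged record carries both the phone and the city, and it matches r3 when the other blocks are revisited.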

Published In

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
June 2009
1168 pages
ISBN: 9781605585512
DOI: 10.1145/1559845
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. blocking
  2. entity resolution
  3. iterative blocking

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '09: International Conference on Management of Data
June 29 - July 2, 2009
Providence, Rhode Island, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
