DOI: 10.1145/1008694.1008697
Article

Iterative record linkage for cleaning and integration

Published: 13 June 2004

Abstract

    Record linkage, the problem of determining when two records refer to the same entity, has applications both in data cleaning (deduplication) and in integrating data from multiple sources. Traditional approaches use a similarity measure that compares tuples' attribute values; tuples with similarity scores above a certain threshold are declared to be matches. While this method can perform quite well in many domains, particularly those where the data is not very noisy, in some domains looking only at tuple values is not enough. By also examining the context of a tuple, i.e., the other tuples to which it is linked, we can reach a more accurate linkage decision. This additional accuracy comes at a price, however: to find all duplicates correctly, we may need to make multiple passes over the data, because linkages, once discovered, may in turn allow us to discover additional linkages. We present results that illustrate the power and feasibility of using join information when comparing records.
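
    The idea of combining attribute similarity with context and repeating passes until no new linkages appear can be sketched roughly as follows. This is an illustrative toy, not the algorithm evaluated in the paper: the token-level Jaccard measure, the equal weighting `alpha`, the `threshold` value, the union-find bookkeeping, and the sample records are all assumptions made for the example.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty (no evidence)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def find(parent, x):
    """Union-find lookup with path halving: returns the cluster id of x."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def link_records(records, threshold=0.45, alpha=0.5, max_passes=10):
    """Hypothetical iterative linkage over {id: (text, linked_ids)}.

    Score = alpha * value similarity + (1 - alpha) * context similarity,
    where context compares the *clusters* of the linked records, so a
    merge found in one pass can enable further merges in the next.
    """
    parent = {rid: rid for rid in records}
    passes, changed = 0, True
    while changed and passes < max_passes:
        changed = False
        passes += 1
        for r, s in combinations(sorted(records), 2):
            if find(parent, r) == find(parent, s):
                continue  # already linked
            (text_r, links_r), (text_s, links_s) = records[r], records[s]
            value = jaccard(set(text_r.split()), set(text_s.split()))
            context = jaccard({find(parent, n) for n in links_r},
                              {find(parent, n) for n in links_s})
            if alpha * value + (1 - alpha) * context >= threshold:
                parent[find(parent, r)] = find(parent, s)
                changed = True
    clusters = {}
    for rid in records:
        clusters.setdefault(find(parent, rid), set()).add(rid)
    return sorted(sorted(c) for c in clusters.values()), passes

# Made-up data: two paper records and the author record each links to.
records = {
    "p1": ("iterative record linkage", {"a1"}),
    "p2": ("iterative record linkage", {"a2"}),
    "a1": ("j smith", {"p1"}),
    "a2": ("john smith", {"p2"}),
}

one_pass = link_records(records, max_passes=1)
iterated = link_records(records)
print(one_pass)   # only the papers are linked
print(iterated)   # the authors are linked too, on a later pass
```

    In this toy data the two author names are too dissimilar to match on values alone. They are linked only after the papers they point to have been merged, which is why a single pass misses them.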



      Published In

      DMKD '04: Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
      June 2004
      85 pages
      ISBN:158113908X
      DOI:10.1145/1008694

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. clustering
      2. deduplication
      3. distance measure
      4. record linkage

      Qualifiers

      • Article

      Conference

      DMKD04