Open access

Data Preparation for Duplicate Detection

Authors: Ioannis Koumarelas, Felix Naumann

Journal of Data and Information Quality (JDIQ), Volume 12, Issue 3

Article No.: 15, Pages 1 - 24

https://doi.org/10.1145/3377878

Published: 13 June 2020

Abstract

Data errors represent a major issue in most application workflows. Before any important task can take place, a certain level of data quality has to be guaranteed by eliminating the various errors that may appear in the data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by systematically applying data preparation operations prior to performing duplicate detection.

Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints on domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space, ineffective data preparations are discarded based on whether they improve or worsen pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
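
The two-phase selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the preparation names, the token-based Jaccard similarity, the toy labeled pairs, and the use of the raw similarity as the ranking score (instead of a trained classifier) are all assumptions made for illustration. AUC-PR is approximated here by average precision.

```python
from statistics import mean

# Hypothetical preparations; names and functions are illustrative assumptions.
PREPARATIONS = {
    "lowercase": str.lower,
    "casefold": str.casefold,
    "remove_punctuation": lambda s: "".join(c for c in s if c.isalnum() or c.isspace()),
    "first_token_only": lambda s: s.split()[0] if s.split() else s,
}

def prepare(value, names):
    """Apply the named preparations to a value, in order."""
    for name in names:
        value = PREPARATIONS[name](value)
    return value

def jaccard(a, b):
    """Jaccard similarity over whitespace-separated tokens (assumed measure)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def pair_scores(pairs, names):
    return [jaccard(prepare(a, names), prepare(b, names)) for a, b, _ in pairs]

def auc_pr(pairs, names):
    """Average precision of the similarity ranking, a step-wise AUC-PR estimate."""
    labels = [is_dup for _, _, is_dup in pairs]
    ranked = sorted(zip(pair_scores(pairs, names), labels), key=lambda x: -x[0])
    hits, total = 0, 0.0
    for rank, (_, is_dup) in enumerate(ranked, start=1):
        if is_dup:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def phase1(pairs, names):
    """Keep preparations that raise duplicate similarity more than non-duplicate similarity."""
    base = pair_scores(pairs, [])
    kept = []
    for name in names:
        after = pair_scores(pairs, [name])
        deltas = [(x - y, p[2]) for x, y, p in zip(after, base, pairs)]
        gain = mean(d for d, dup in deltas if dup) - mean(d for d, dup in deltas if not dup)
        if gain > 0:
            kept.append(name)
    return kept

def phase2(pairs, names):
    """Iteratively drop the preparation whose removal costs no AUC-PR."""
    selected = list(names)
    current = auc_pr(pairs, selected)
    while selected:
        trials = [(auc_pr(pairs, [n for n in selected if n != drop]), drop) for drop in selected]
        best_score, drop = max(trials, key=lambda t: t[0])
        if best_score < current:
            break  # every remaining preparation contributes
        selected.remove(drop)
        current = best_score
    return selected, current

# Toy gold-standard sample: (value_a, value_b, is_duplicate). Invented for illustration.
SAMPLE = [
    ("John Smith", "john smith", True),
    ("Acme, Inc.", "Acme Inc", True),
    ("Jane Doe", "Jane Roe", False),
    ("Bob Stone", "Alice Stone", False),
]

candidates = phase1(SAMPLE, list(PREPARATIONS))
selected, score = phase2(SAMPLE, candidates)
print(candidates, selected, score)
```

On this toy sample, the first phase discards the harmful preparation (first_token_only), and the second phase drops one of the two equivalent case-normalizing preparations as redundant, since removing it leaves the score unchanged.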

Index Terms

Data Preparation for Duplicate Detection
1. Information systems
  1. Data management systems
    1. Information integration
  2. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification

Published In

Journal of Data and Information Quality Volume 12, Issue 3

On the Horizon and Regular Articles

September 2020

104 pages

ISSN:1936-1955

EISSN:1936-1963

DOI:10.1145/3404101

Editor:
Tiziana Catarci
Sapienza University of Rome, Rome, Italy

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 June 2020

Online AM: 07 May 2020

Accepted: 01 January 2020

Revised: 01 December 2019

Received: 01 February 2019

Published in JDIQ Volume 12, Issue 3

Qualifiers

Research-article
Research
Refereed
