research-article
Open access

Data Preparation for Duplicate Detection

Published: 13 June 2020

    Abstract

    Data errors represent a major issue in most application workflows. Before any important task can take place, a certain level of data quality has to be guaranteed by eliminating the different kinds of errors that may appear in the data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by systematically applying data preparation operations prior to performing duplicate detection.
    Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints on domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space, ineffective data preparations are discarded based on whether they improve or worsen pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and identifies the redundant ones based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
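    The two-phase selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the similarity measure (difflib's ratio), the phase-one criterion (duplicates should become more similar and non-duplicates less similar), the use of raw similarities as classifier scores, the average-precision estimate of AUC-PR, and all names and sample records are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in string similarity (assumption; the abstract names no specific measure).
    return SequenceMatcher(None, a, b).ratio()


def auc_pr(scores, labels):
    # Average precision: a step-wise estimate of the area under the PR curve.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, is_dup) in enumerate(ranked, start=1):
        if is_dup:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def apply_all(preps, names, value):
    # Apply the named preparations in order to one record value.
    for name in names:
        value = preps[name](value)
    return value


def evaluate(preps, names, records, pairs, labels):
    # Score every labeled pair after preparation and summarize as AUC-PR.
    sims = [
        similarity(apply_all(preps, names, records[i]),
                   apply_all(preps, names, records[j]))
        for i, j in pairs
    ]
    return auc_pr(sims, labels)


def select_preparations(records, pairs, labels, preps):
    # Phase 1: keep a preparation only if it moves pair similarities in the
    # right direction overall (hypothetical criterion).
    kept = []
    for name, prep in preps.items():
        gain = 0.0
        for (i, j), is_dup in zip(pairs, labels):
            before = similarity(records[i], records[j])
            after = similarity(prep(records[i]), prep(records[j]))
            gain += (after - before) if is_dup else (before - after)
        if gain > 0:
            kept.append(name)

    # Phase 2: leave-one-out; drop a preparation if AUC-PR does not fall without it.
    best = evaluate(preps, kept, records, pairs, labels)
    changed = True
    while changed:
        changed = False
        for name in list(kept):
            rest = [n for n in kept if n != name]
            score = evaluate(preps, rest, records, pairs, labels)
            if score >= best:
                kept, best, changed = rest, score, True
                break
    return kept, best


# Hypothetical gold-standard sample: record values plus labeled pairs.
records = ["John  Smith", "john smith", "Mary Jones", "MARY JONES ", "Peter Brown"]
pairs = [(0, 1), (2, 3), (0, 2), (1, 4), (3, 4)]
labels = [True, True, False, False, False]
preps = {
    "lowercase": str.lower,
    "normalize_whitespace": lambda s: " ".join(s.split()),
    "drop_digits": lambda s: "".join(c for c in s if not c.isdigit()),
}

kept, best = select_preparations(records, pairs, labels, preps)
print(kept, round(best, 3))
```

    In this sketch, phase one is cheap because it evaluates each preparation in isolation, while phase two re-scores the whole labeled sample for every candidate removal, which is why shrinking the candidate set first matters.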

    Published In

    Journal of Data and Information Quality  Volume 12, Issue 3
    On the Horizon and Regular Articles
    September 2020
    104 pages
    ISSN:1936-1955
    EISSN:1936-1963
    DOI:10.1145/3404101
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 June 2020
    Online AM: 07 May 2020
    Accepted: 01 January 2020
    Revised: 01 December 2019
    Received: 01 February 2019
    Published in JDIQ Volume 12, Issue 3

    Author Tags

    1. Data preparation
    2. data wrangling
    3. duplicate detection
    4. record linkage
    5. similarity measures

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2024) BibDedupe: An Open-Source Python Library for Bibliographic Record Deduplication. Journal of Open Source Software 9:97 (6318). DOI: 10.21105/joss.06318. Online publication date: May-2024
    • (2024) Company Name Matching Using Job Market Data Enrichment. IT Professional 26:2 (76-82). DOI: 10.1109/MITP.2024.3371179. Online publication date: Mar-2024
    • (2024) Repairing raw metadata for metadata management. Information Systems 122 (102344). DOI: 10.1016/j.is.2024.102344. Online publication date: May-2024
    • (2024) Deep learning for nano-photonic materials – The solution to everything!? Current Opinion in Solid State and Materials Science 28 (101129). DOI: 10.1016/j.cossms.2023.101129. Online publication date: Feb-2024
    • (2023) Forecasting financial markets using advanced machine learning algorithms. E3S Web of Conferences 403 (08007). DOI: 10.1051/e3sconf/202340308007. Online publication date: 25-Jul-2023
    • (2022) Frost. Proceedings of the VLDB Endowment 15:12 (3292-3305). DOI: 10.14778/3554821.3554823. Online publication date: 1-Aug-2022
    • (2021) Hierarchical Semantics Matching For Heterogeneous Spatio-temporal Sources. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (565-575). DOI: 10.1145/3459637.3482350. Online publication date: 26-Oct-2021
    • (2021) Multi-Agent Systems and Digital Twins for Smarter Cities. Proceedings of the 2021 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (45-55). DOI: 10.1145/3437959.3459254. Online publication date: 21-May-2021
    • (2021) Evaluation of Duplicate Detection Algorithms: From Quality Measures to Test Data Generation. 2021 IEEE 37th International Conference on Data Engineering (ICDE) (2373-2376). DOI: 10.1109/ICDE51399.2021.00269. Online publication date: Apr-2021
