research-article

Distributed data deduplication

Published: 01 July 2016

    Abstract

    Data deduplication is the process of identifying tuples in a relation that refer to the same real-world entity. The problem is inherently quadratic in the number of tuples, since a similarity value must be computed for every pair of tuples. To avoid comparing tuple pairs that are obviously non-duplicates, blocking techniques divide the tuples into blocks, and only tuples within the same block are compared. However, even with blocking, data deduplication remains costly for large datasets. In this paper, we show how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment. Our main contribution is a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees. We demonstrate the effectiveness of the proposed strategy through extensive experiments on both synthetic datasets with varying block size distributions and real-world datasets.
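    As a rough illustration of the ideas in the abstract, the sketch below groups records by a blocking key, compares only pairs inside the same block, and then estimates the maximum per-worker load when whole blocks are assigned to workers. It is not the paper's Dis-Dedup strategy: the record layout, the `blocking_key` and `similar` functions, and the `greedy_max_load` helper (a generic longest-processing-time-first heuristic that never splits a block) are all assumptions made for this example.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking function: first letter of the name, lowercased.
    return record["name"][0].lower()


def similar(a, b):
    # Hypothetical similarity predicate: names match after trimming and lowercasing.
    return a["name"].strip().lower() == b["name"].strip().lower()


def build_blocks(records):
    # Group records so that only records sharing a key are ever compared.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return blocks


def dedup_with_blocking(records):
    # Compare every pair inside each block and count the comparisons made.
    comparisons = 0
    duplicates = []
    for block in build_blocks(records).values():
        for a, b in combinations(block, 2):
            comparisons += 1
            if similar(a, b):
                duplicates.append((a["id"], b["id"]))
    return duplicates, comparisons


def all_pairs_count(n):
    # Number of comparisons without blocking: n choose 2.
    return n * (n - 1) // 2


def greedy_max_load(block_sizes, num_workers):
    # Generic greedy baseline (not Dis-Dedup): assign each block, largest
    # first, to the currently least-loaded worker. A block of size s costs
    # s * (s - 1) / 2 comparisons, so one large block can dominate the load.
    costs = sorted((s * (s - 1) // 2 for s in block_sizes), reverse=True)
    loads = [0] * num_workers
    heapq.heapify(loads)
    for cost in costs:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + cost)
    return max(loads)


records = [
    {"id": 1, "name": "Alice Smith"},
    {"id": 2, "name": "alice smith "},
    {"id": 3, "name": "Bob Jones"},
    {"id": 4, "name": "Bob Jones"},
    {"id": 5, "name": "Carol White"},
]
duplicates, comparisons = dedup_with_blocking(records)
print(duplicates, comparisons, all_pairs_count(len(records)))
print(greedy_max_load([len(b) for b in build_blocks(records).values()], 2))
```

    On this toy input, blocking performs 2 comparisons instead of the 10 needed for all pairs. Because the greedy helper keeps each block on a single worker, a heavily skewed block size distribution would leave one worker with most of the comparisons, which is the kind of imbalance a workload-aware distribution strategy aims to avoid.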

      Published In

      Proceedings of the VLDB Endowment, Volume 9, Issue 11
      July 2016
      60 pages
      ISSN: 2150-8097

      Publisher

      VLDB Endowment

      Publication History

      Published: 01 July 2016
      Published in PVLDB Volume 9, Issue 11

      Qualifiers

      • Research-article

      Cited By

      • (2024) Record Fusion via Inference and Data Augmentation. ACM / IMS Journal of Data Science 1:1 (1-23). DOI: 10.1145/3593579. Online publication date: 16-Jan-2024
      • (2023) SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications. Proceedings of the ACM on Management of Data 1:3 (1-26). DOI: 10.1145/3617338. Online publication date: 13-Nov-2023
      • (2023) Making It Tractable to Catch Duplicates and Conflicts in Graphs. Proceedings of the ACM on Management of Data 1:1 (1-28). DOI: 10.1145/3588940. Online publication date: 30-May-2023
      • (2022) Serving deep learning models with deduplication from relational databases. Proceedings of the VLDB Endowment 15:10 (2230-2243). DOI: 10.14778/3547305.3547325. Online publication date: 7-Sep-2022
      • (2022) Multidimensional Assignment Problem for Multipartite Entity Resolution. Journal of Global Optimization 84:2 (491-523). DOI: 10.1007/s10898-022-01141-3. Online publication date: 1-Oct-2022
      • (2021) Parallel discrepancy detection and incremental detection. Proceedings of the VLDB Endowment 14:8 (1351-1364). DOI: 10.14778/3457390.3457400. Online publication date: 21-Oct-2021
      • (2021) New Algorithms for Monotone Classification. Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (260-272). DOI: 10.1145/3452021.3458324. Online publication date: 20-Jun-2021
      • (2021) Auto-FuzzyJoin. Proceedings of the 2021 International Conference on Management of Data (1064-1076). DOI: 10.1145/3448016.3452824. Online publication date: 9-Jun-2021
      • (2020) An Overview of End-to-End Entity Resolution for Big Data. ACM Computing Surveys 53:6 (1-42). DOI: 10.1145/3418896. Online publication date: 6-Dec-2020
      • (2020) Blocking and Filtering Techniques for Entity Resolution. ACM Computing Surveys 53:2 (1-42). DOI: 10.1145/3377455. Online publication date: 20-Mar-2020
