Distributed data deduplication

Authors:

Paraschos Koutris

Proceedings of the VLDB Endowment, Volume 9, Issue 11

Pages 864 - 875

https://doi.org/10.14778/2983200.2983203

Published: 01 July 2016

Abstract

Data deduplication is the process of identifying tuples in a relation that refer to the same real-world entity. The problem is inherently quadratic in the number of tuples, since a similarity value must be computed for every pair of tuples. To avoid comparing tuple pairs that are obviously non-duplicates, blocking techniques divide the tuples into blocks, and only tuples within the same block are compared. Even with blocking, however, data deduplication remains costly for large datasets. In this paper, we show how to further speed up data deduplication by leveraging parallelism in a shared-nothing computing environment. Our main contribution is a distribution strategy, called Dis-Dedup, that minimizes the maximum workload across all worker nodes and provides strong theoretical guarantees. We demonstrate the effectiveness of the proposed strategy through extensive experiments on synthetic datasets with varying block-size distributions and on real-world datasets.
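The ideas in the abstract can be illustrated with a minimal sketch. It is not the Dis-Dedup strategy itself, whose details are not given here. It assumes a toy relation of (name, city) tuples and a hypothetical blocking key (the city field), and it uses a simple largest-block-first greedy assignment as a stand-in baseline for spreading comparison work over workers. All function names and sample records are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of comparisons when every pair of n tuples is compared."""
    return n * (n - 1) // 2


def block_tuples(tuples, blocking_key):
    """Group tuples by a blocking key (a hypothetical key function)."""
    blocks = defaultdict(list)
    for t in tuples:
        blocks[blocking_key(t)].append(t)
    return blocks


def candidate_pairs(blocks):
    """Yield only the pairs of tuples that share a block."""
    for members in blocks.values():
        yield from combinations(members, 2)


def greedy_assign(blocks, num_workers):
    """Illustrative baseline, NOT Dis-Dedup: hand each block, largest
    first, to the worker with the smallest comparison load so far."""
    loads = [0] * num_workers
    assignment = {}
    for key, members in sorted(blocks.items(), key=lambda kv: -len(kv[1])):
        worker = loads.index(min(loads))
        assignment[key] = worker
        loads[worker] += all_pairs_count(len(members))
    return assignment, loads


# Made-up sample records; the second field serves as the blocking key.
records = [
    ("Alice Smith", "Seattle"),
    ("Alice Smyth", "Seattle"),
    ("Bob Jones", "Boston"),
    ("Robert Jones", "Boston"),
    ("Carol White", "Austin"),
    ("A. Smith", "Seattle"),
]
blocks = block_tuples(records, blocking_key=lambda r: r[1])
pairs = list(candidate_pairs(blocks))
assignment, loads = greedy_assign(blocks, num_workers=2)
print(all_pairs_count(len(records)), len(pairs), loads)
```

With these six records, blocking reduces the 15 all-pairs comparisons to 4, and the greedy baseline yields per-worker loads of 3 and 1 comparisons. The baseline never splits a block across workers, so a single large block bounds the maximum load from below.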

Index Terms

Distributed data deduplication
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Index terms have been assigned to the content through auto-classification.

Published In

Proceedings of the VLDB Endowment Volume 9, Issue 11

July 2016

60 pages

ISSN:2150-8097

Editors: Surajit Chaudhuri (Microsoft Research), Jayant Haritsa (I.I.Sc. Bangalore)

Publisher

VLDB Endowment

Qualifiers

Research-article
