Cluster and Single-Node Analysis of Long-Term Deduplication Patterns

Authors:

Zhen “Jason” Sun,

Geoff Kuenning,

Philip Shilane,

Vasily Tarasov,

Erez Zadok

ACM Transactions on Storage (TOS), Volume 14, Issue 2

Article No.: 13, Pages 1 - 27

https://doi.org/10.1145/3183890

Published: 11 May 2018

Abstract

Deduplication has become essential in disk-based backup systems, but there have been few long-term studies of backup workloads. Most past studies either were of a small static snapshot or covered only a short period that was not representative of how a backup system evolves over time. For this article, we first collected 21 months of data from a shared user file system; 33 users and over 4,000 snapshots are covered. We then analyzed the dataset, examining a variety of essential characteristics across two dimensions: single-node deduplication and cluster deduplication. For single-node deduplication analysis, our primary focus was individual-user data. Despite apparently similar roles and behavior among all of our users, we found significant differences in their deduplication ratios. Moreover, the data that some users share with others had a much higher deduplication ratio than average. For cluster deduplication analysis, we implemented seven published data-routing algorithms and created a detailed comparison of their performance with respect to deduplication ratio, load distribution, and communication overhead. We found that per-file routing achieves a higher deduplication ratio than routing by super-chunk (multiple consecutive chunks), but it also leads to high data skew (imbalance of space usage across nodes). We also found that large chunking sizes are better for cluster deduplication, as they significantly reduce data-routing overhead, while their negative impact on deduplication ratios is small and acceptable. We draw interesting conclusions from both single-node and cluster deduplication analysis and make recommendations for future deduplication systems design.
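The two cluster metrics named in the abstract, deduplication ratio and data skew, can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the article: chunks are treated as equal-sized, a routing unit is sent to a node by hashing its first chunk, and skew is computed as the largest node's stored size divided by the mean. The function names are hypothetical, and the article's seven routing algorithms are not reproduced.

```python
import hashlib


def dedup_ratio(chunk_hashes):
    """Logical chunk count divided by unique chunk count (equal-size chunks assumed)."""
    return len(chunk_hashes) / len(set(chunk_hashes))


def super_chunks(file_chunks, size):
    """Group a file's chunks into runs of `size` consecutive chunks."""
    return [file_chunks[i:i + size] for i in range(0, len(file_chunks), size)]


def route(units, num_nodes):
    """Assign each non-empty unit (a list of chunk hashes) to one node.

    Assumption: the node is picked by hashing the unit's first chunk.
    Each node only deduplicates against the chunks it already holds.
    """
    nodes = [set() for _ in range(num_nodes)]
    for unit in units:
        digest = hashlib.sha1(unit[0].encode()).digest()
        node = int.from_bytes(digest[:4], "big") % num_nodes
        nodes[node].update(unit)
    return nodes


def cluster_stats(units, num_nodes):
    """Return (cluster deduplication ratio, data skew) for a routing of `units`."""
    nodes = route(units, num_nodes)
    logical = sum(len(unit) for unit in units)
    stored = [len(node) for node in nodes]
    ratio = logical / sum(stored)
    # Assumed skew definition: most-loaded node relative to the mean load.
    skew = max(stored) / (sum(stored) / num_nodes)
    return ratio, skew


# Two files that share three of their four chunks.
files = [["a", "b", "c", "d"], ["a", "b", "c", "e"]]
per_file_units = files
super_chunk_units = [sc for f in files for sc in super_chunks(f, 2)]

print(cluster_stats(per_file_units, 4))
print(cluster_stats(super_chunk_units, 4))
```

Under these assumptions, routing whole files keeps each file's chunks together on one node, while super-chunk routing spreads a file over several nodes. The cluster ratio can never exceed the single-node ratio, because a chunk stored on two nodes is counted twice.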

Index Terms

Cluster and Single-Node Analysis of Long-Term Deduplication Patterns
1. General and reference
  1. Cross-computing tools and techniques
    1. Evaluation
    2. Performance

Published In

ACM Transactions on Storage Volume 14, Issue 2

May 2018

210 pages

ISSN:1553-3077

EISSN:1553-3093

DOI:10.1145/3208078

Editor:
Sam H. Noh
Ulsan National Institute of Science and Technology, Ulsan, Korea

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 May 2018

Accepted: 01 January 2018

Revised: 01 December 2017

Received: 01 September 2017

Published in TOS Volume 14, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

China 863
ONR
National Natural Science Foundation of China
NSF
Dell-EMC

Cited By

Jackowski A, Ślusarczyk Ł, Lichota K, Wełnicki M, Wijata R, Kielar M, Kopeć T, Dubnicki C, Iwanicki K. (2023). ObjDedup: High-Throughput Object Storage Layer for Backup Systems With Block-Level Deduplication. IEEE Transactions on Parallel and Distributed Systems 34:7 (2180-2197). Online publication date: Jul-2023.
https://doi.org/10.1109/TPDS.2023.3250501
Wong T, Thakkar S, Hsieh K, Tom Z, Saraiya H, Shilane P. (2023). Dataset Similarity Detection for Global Deduplication in the DD File System. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (3322-3335). Online publication date: Apr-2023.
https://doi.org/10.1109/ICDE55515.2023.00255
Li J, Huang S, Ren Y, Yang Z, Lee P, Zhang X, Hao Y. (2022). Enabling Secure and Space-Efficient Metadata Management in Encrypted Deduplication. IEEE Transactions on Computers 71:4 (959-970). Online publication date: 1-Apr-2022.
https://doi.org/10.1109/TC.2021.3067326
Gururaj R, Moh M, Moh T, Shilane P, Bhanjois B. (2022). Performance Centric Primary Storage Deduplication Systems Exploiting Caching and Block Similarity. 2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM) (1-8). Online publication date: 3-Jan-2022.
https://doi.org/10.1109/IMCOM53663.2022.9721761
Yuan X, Moh M, Moh T. (2022). Whole-File Chunk-Based Deduplication Using Reinforcement Learning for Cloud Storage. 2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (269-276). Online publication date: 10-Nov-2022.
https://doi.org/10.1109/ASONAM55673.2022.10068661
Duggal A, Jenkins F, Shilane P, Chinthekindi R, Shah R, Kamat M, Dan T, Dahlia M. (2019). Data domain cloud tier. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference (647-660). Online publication date: 10-Jul-2019.
https://dl.acm.org/doi/10.5555/3358807.3358862
Li J, Lee P, Ren Y, Zhang X. (2019). Metadedup: Deduplicating Metadata in Encrypted Deduplication via Indirection. 2019 35th Symposium on Mass Storage Systems and Technologies (MSST) (269-281). Online publication date: May-2019.
https://doi.org/10.1109/MSST.2019.00007
Zhang C, Qi D, Cai Z, Huang W, Huang X, Li W, Guo J. (2019). MII: A Novel Content Defined Chunking Algorithm for Finding Incremental data in Data Synchronization. IEEE Access (1-1). Online publication date: 2019.
https://doi.org/10.1109/ACCESS.2019.2926195
