research-article
Public Access

Cluster and Single-Node Analysis of Long-Term Deduplication Patterns

Published: 11 May 2018

    Abstract

    Deduplication has become essential in disk-based backup systems, but there have been few long-term studies of backup workloads. Most past studies either examined a small static snapshot or covered only a short period, which is not representative of how a backup system evolves over time. For this article, we first collected 21 months of data from a shared user file system; the dataset covers 33 users and over 4,000 snapshots. We then analyzed the dataset, examining a variety of essential characteristics across two dimensions: single-node deduplication and cluster deduplication. For single-node deduplication analysis, our primary focus was individual-user data. Despite apparently similar roles and behavior among all of our users, we found significant differences in their deduplication ratios. Moreover, the data that some users share with others had a much higher deduplication ratio than average. For cluster deduplication analysis, we implemented seven published data-routing algorithms and created a detailed comparison of their performance with respect to deduplication ratio, load distribution, and communication overhead. We found that per-file routing achieves a higher deduplication ratio than routing by super-chunk (multiple consecutive chunks), but it also leads to high data skew (imbalance of space usage across nodes). We also found that large chunking sizes are better for cluster deduplication, as they significantly reduce data-routing overhead, while their negative impact on deduplication ratios is small and acceptable. We draw conclusions from both the single-node and cluster deduplication analyses and make recommendations for the design of future deduplication systems.
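
The quantities named in the abstract (deduplication ratio, per-file versus super-chunk routing, and data skew) can be illustrated with a minimal Python sketch. It is not one of the seven routing algorithms evaluated in the article. All function names are hypothetical, and several choices are assumptions made only for illustration: chunks are identified by SHA-1 digest, each routing unit is sent to the node selected by its minimum chunk fingerprint, and skew is measured as the largest node's stored bytes divided by the mean across nodes.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Identify a chunk by the SHA-1 digest of its content (assumed hash)."""
    return hashlib.sha1(chunk).hexdigest()


def dedup_ratio(chunks):
    """Logical bytes divided by the bytes left after dropping duplicate chunks."""
    logical = sum(len(c) for c in chunks)
    unique = {fingerprint(c): len(c) for c in chunks}
    stored = sum(unique.values())
    return logical / stored if stored else 1.0


def route(files, num_nodes, super_chunk_len=None):
    """Place chunks on nodes; each node deduplicates only its own chunks.

    files is a list of files, each a list of byte-string chunks.
    super_chunk_len=None routes each whole file as one unit (per-file
    routing); otherwise each run of super_chunk_len consecutive chunks
    is routed as one unit (super-chunk routing).
    """
    nodes = [{} for _ in range(num_nodes)]  # per node: fingerprint -> size
    for chunks in files:
        if super_chunk_len is None:
            units = [chunks]
        else:
            units = [chunks[i:i + super_chunk_len]
                     for i in range(0, len(chunks), super_chunk_len)]
        for unit in units:
            if not unit:
                continue
            fps = [fingerprint(c) for c in unit]
            # Assumed representative: the minimum fingerprint in the unit.
            target = nodes[int(min(fps), 16) % num_nodes]
            for fp, c in zip(fps, unit):
                target[fp] = len(c)
    return nodes


def cluster_stats(files, nodes):
    """Return (cluster-wide deduplication ratio, data skew)."""
    logical = sum(len(c) for chunks in files for c in chunks)
    per_node = [sum(n.values()) for n in nodes]
    stored = sum(per_node)
    mean = stored / len(nodes)
    skew = max(per_node) / mean if mean else 1.0  # assumed skew metric
    ratio = logical / stored if stored else 1.0
    return ratio, skew


if __name__ == "__main__":
    files = [[b"a" * 4, b"b" * 4, b"a" * 4], [b"b" * 4, b"c" * 4]]
    print("single node:", dedup_ratio([c for f in files for c in f]))
    print("per-file:", cluster_stats(files, route(files, 4)))
    print("super-chunk:", cluster_stats(files, route(files, 4, super_chunk_len=2)))
```

Because each node in this sketch deduplicates only against its own contents, a chunk that lands on two nodes is stored twice, so the cluster-wide ratio can never exceed the single-node ratio for the same data. Larger routing units keep more related chunks together at the cost of a less even spread, which is the tradeoff between deduplication ratio and data skew that the abstract describes.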

        Published In

        ACM Transactions on Storage, Volume 14, Issue 2
        May 2018
        210 pages
        ISSN:1553-3077
        EISSN:1553-3093
        DOI:10.1145/3208078
        Editor: Sam H. Noh
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 11 May 2018
        Accepted: 01 January 2018
        Revised: 01 December 2017
        Received: 01 September 2017
        Published in TOS Volume 14, Issue 2

        Author Tags

        1. User study
        2. data routing algorithms
        3. large data set

        Qualifiers

        • Research-article
        • Research
        • Refereed

        Funding Sources

        • China 863
        • ONR
        • National Natural Science Foundation of China
        • NSF
        • Dell-EMC

        Cited By

        • (2023) ObjDedup: High-Throughput Object Storage Layer for Backup Systems With Block-Level Deduplication. IEEE Transactions on Parallel and Distributed Systems 34:7 (2180-2197). DOI: 10.1109/TPDS.2023.3250501. Online publication date: Jul-2023
        • (2023) Dataset Similarity Detection for Global Deduplication in the DD File System. 2023 IEEE 39th International Conference on Data Engineering (ICDE) (3322-3335). DOI: 10.1109/ICDE55515.2023.00255. Online publication date: Apr-2023
        • (2022) Enabling Secure and Space-Efficient Metadata Management in Encrypted Deduplication. IEEE Transactions on Computers 71:4 (959-970). DOI: 10.1109/TC.2021.3067326. Online publication date: 1-Apr-2022
        • (2022) Performance Centric Primary Storage Deduplication Systems Exploiting Caching and Block Similarity. 2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM) (1-8). DOI: 10.1109/IMCOM53663.2022.9721761. Online publication date: 3-Jan-2022
        • (2022) Whole-File Chunk-Based Deduplication Using Reinforcement Learning for Cloud Storage. 2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (269-276). DOI: 10.1109/ASONAM55673.2022.10068661. Online publication date: 10-Nov-2022
        • (2019) Data domain cloud tier. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference (647-660). DOI: 10.5555/3358807.3358862. Online publication date: 10-Jul-2019
        • (2019) Metadedup: Deduplicating Metadata in Encrypted Deduplication via Indirection. 2019 35th Symposium on Mass Storage Systems and Technologies (MSST) (269-281). DOI: 10.1109/MSST.2019.00007. Online publication date: May-2019
        • (2019) MII: A Novel Content Defined Chunking Algorithm for Finding Incremental data in Data Synchronization. IEEE Access (1-1). DOI: 10.1109/ACCESS.2019.2926195. Online publication date: 2019
