Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Lazy Exact Deduplication

Published: 10 June 2017 Publication History

Abstract

Deduplication aims to reduce duplicate data in storage systems by removing redundant copies of data blocks, which are compared to one another using fingerprints. However, repeated on-disk fingerprint lookups lead to high disk traffic, which results in a bottleneck.
In this article, we propose a “lazy” data deduplication method, which buffers incoming fingerprints that are used to perform on-disk lookups in batches, with the aim of improving subsequent prefetching. In deduplication in general, prefetching is used to improve the cache hit rate by exploiting locality within the incoming fingerprint stream. For lazy deduplication, we design a buffering strategy that preserves locality in order to facilitate prefetching. Furthermore, as the proportion of deduplication time spent on I/O decreases, the proportion spent on fingerprint calculation and chunking increases. Thus, we also utilize parallel approaches (utilizing multiple CPU cores and a graphics processing unit) to further improve the overall performance.
Experimental results indicate that the lazy method improves fingerprint identification performance by over 50% compared with an “eager” method with the same data layout. The GPU improves the hash calculation by a factor of 4.6 and multithreaded chunking by a factor of 4.16. Deduplication performance can be improved by over 45% on SSD and 80% on HDD in the last round on the real datasets.

References

[1]
D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge. 2009. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proceedings of IEEE International Symposium on Modeling, Analysis 8 Simulation of Computer and Telecommunication Systems (MASCOTS’09). 1--9.
[2]
P. Bhatotia, R. Rodrigues, and A. Verma. 2012. Shredder: GPU-accelerated incremental storage and computation. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’12). 1--15.
[3]
B. H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426.
[4]
P. Bose, H. Guo, E. Kranakis, A. Maheshwari, P. Morin, J. Morrison, M. Smid, and Y. Tang. 2008. On the false-positive rate of bloom filters. Inform. Process. Lett. 108, 4 (2008), 210--213.
[5]
F. C. Botelho, P. Shilane, N. Garg, and W. Hsu. 2013. Memory efficient sanitization of a deduplicated storage system. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’13). 81--94.
[6]
A. T. Clements, I. Ahmad, M. Vilayannur, and J. Li. 2009. Decentralized deduplication in SAN cluster file systems. In Proceedings of USENIX Annual Technical Conference (ATC’09). 101--114.
[7]
B. Debnath, S. Sengupta, and J. Li. 2010. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proceedings of USENIX Annual Technical Conference (ATC’10).
[8]
B. Debnath, S. Sengupta, and J. Li. 2011. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In Proceedings of ACM International Conference on Management of Data (SIGMOD’11). 25--36.
[9]
M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, F. Huang, and Q. Liu. 2014. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In Proceedings of USENIX Annual Technical Conference (ATC’14). 181--192.
[10]
F. Guo and P. Efstathopoulos. 2011. Building a high-performance deduplication system. In Proceedings of USENIX Annual Technical Conference (ATC’11). 25--25.
[11]
K. Jin and E. L. Miller. 2009. The effectiveness of deduplication on virtual machine disk images. In Proceedings of ACM Israeli Experimental Systems Conference (SYSTOR’09), Vol. 7.
[12]
C. Kim, K.-W. Park, K. Park, and K. H. Park. 2011. Rethinking deduplication in cloud: From data profiling to blueprint. In Proceedings of IEEE International Conference on Networked Computing and Advanced Information Management (NCM’11). 101--104.
[13]
J. Kim, C. Lee, S. Lee, I. Son, J. Choi, S. Yoon, H. u. Lee, S. Kang, Y. Won, and J. Cha. 2012a. Deduplication in SSDs: Model and quantitative analysis. In 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST’12). 1--12.
[14]
J. Kim, C. Lee, S. Lee, I. Son, J. Choi, S. Yoon, H. u. Lee, S. Kang, Y. Won, and J. Cha. 2012b. Deduplication in SSDs: Model and quantitative analysis. In Proceedings of IEEE Symposium on Mass Storage Systems and Technologies (MSST’12). 1--12.
[15]
R. Koller and R. Rangaswami. 2010. I/O deduplication: Utilizing content similarity to improve I/O performance. ACM Trans. Storage 6, 3 (2010), 13.
[16]
H. Li, T. Liang, and J. Chiu. 2013. A compound OpenMP/MPI program development toolkit for hybrid CPU/GPU clusters. J. Supercomput. 66, 1 (2013), 381--405.
[17]
X. Li and D. J. Lilja. 2009. A highly parallel GPU-based hash accelerator for a data deduplication system. In Proceedings of IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS’09). 268--275.
[18]
M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble. 2009. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’09). 111--123.
[19]
X. Lin, F. Douglis, J. Li, X. Li, R. Ricci, S. Smaldone, and G. Wallace. 2015a. Metadata considered harmful... to deduplication. In Proceedings of USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’15). 11--16.
[20]
X. Lin, M. Hibler, E. Eide, and R. Ricci. 2015b. Using deduplicating storage for efficient disk image deployment. EAI Endorsed Trans. Scalable Information Systems 2, 6 (2015), e1.
[21]
G. Lu, Y. J. Nam, and D. H. C. Du. 2012. BloomStore: Bloom-filter based memory-efficient key-value store for indexing of data deduplication on flash. In Proceedings of IEEE Symposium on Mass Storage Systems and Technologies (MSST’12). 1--11.
[22]
J. Ma, R. J. Stones, Y. Ma, J. Wang, J. Ren, G. Wang, and X. Liu. 2016. Lazy exact deduplication. In Proceedings of IEEE 32th Symposium on Mass Storage Systems and Technologies (MSST’16). 1--10.
[23]
L. Ma, C. Zhen, B. Zhao, J. Ma, G. Wang, and X. Liu. 2010. Towards fast de-duplication using low energy coprocessor. In Proceedings of IEEE International Conference on Networking, Architecture and Storage (NAS’10). 395--402.
[24]
U. Manber. 1994. Finding similar files in a large file system. In Usenix Winter, Vol. 94. 1--10.
[25]
B. Mao, H. Jiang, S. Wu, Y. Fu, and L. Tian. 2014. Read-performance optimization for deduplication-based storage systems in the cloud. ACM Trans. Storage (TOS) 10, 2 (2014), 6:1--6:22.
[26]
D. Meister and A. Brinkmann. 2010. dedupv1: Improving deduplication throughput using solid state drives (SSD). In Proceedings of IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST’10). 1--6.
[27]
D. T. Meyer and W. J. Bolosky. 2011. A study of practical deduplication. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’11) (2011), 1--13.
[28]
D. T. Meyer and W. J. Bolosky. 2012. A study of practical deduplication. ACM Trans. Storage (TOS) 7, 4 (2012), 14:1--14:20.
[29]
J. Min, D. Yoon, and Y. Won. 2011. Efficient deduplication techniques for modern backup operation. IEEE Trans. Comput. 60, 6 (2011), 824--840.
[30]
C.-H. Ng and P. P. C. Lee. 2013. RevDedup: A reverse deduplication storage system optimized for reads to latest backups. In Proceedings of ACM Asia-Pacific Workshop on Systems, Vol. 15.
[31]
NVIDIA. 2013. NVIDIA CUDA. Retrieved from https://developer.nvidia.com/cuda-downloads (July 2013).
[32]
R. Pagh and F. F. Rodler. 2001. Cuckoo Hashing. Springer.
[33]
J. Paulo and J. Pereira. 2014. A survey and classification of storage deduplication systems. ACM Comput. Surv. (CSUR) 47, 1 (2014), 11:1--11:30.
[34]
C. Policroniades and I. Pratt. 2004. Alternatives for detecting redundancy in storage systems data. In Proceedings of USENIX Annual Technical Conference (ATC’04). 73--86.
[35]
S. Quinlan and S. Dorward. 2002. Venti: A new approach to archival data storage. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’02), Vol. 4. 89--102.
[36]
M. O. Rabin. 1981. Fingerprinting by Random Polynomials. Center for Research in Computing Technology, Aiken Computation Laboratory.
[37]
P. Shilane, M. Huang, G. Wallace, and W. Hsu. 2012. WAN-optimized replication of backup datasets using stream-informed delta compression. ACM Trans. Storage (TOS) 8, 4 (2012), 13:1--13:26.
[38]
K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti. 2012. iDedup: Latency-aware, inline data deduplication for primary storage. In Proceedings of USENIX Conference on File and Storage Technologies (FAST’12). 24--24.
[39]
Y. Tan, Z. Yan, D. Feng, X. He, Q. Zou, and L. Yang. 2015. De-Frag: An efficient scheme to improve deduplication performance via reducing data placement de-linearization. Cluster Comput. 18, 1 (2015), 79--92.
[40]
V. Tarasov, A. Mudrankit, W. Buik, P. Shilane, G. Kuenning, and E. Zadok. 2012. Generating realistic datasets for deduplication analysis. In Proceedings of USENIX Annual Technical Conference (ATC’12). 261--272.
[41]
VIA Technologies. 2008. VIA nano processor. Retrieved from http://www.viatech.com.cn/cn/downloads/whitepapers/processors/WP080529VIA_Nano.pdf.
[42]
M. M. Waldrop. 2016. The chips are down for moores law. Nature 530 (2016), 144--147.
[43]
W. Xia, H. Jiang, D. Feng, and Y. Hua. 2015. Similarity and locality based indexing for high performance data deduplication. IEEE Trans. Comput. 64, 4 (2015), 1162--1176.
[44]
W. Xia, H. Jiang, D. Feng, and H. Yu. 2011. SiLo: A similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput. In Proceedings of USENIX Annual Technical Conference (ATC’11). 26--28.
[45]
B. Zhu, K. Li, and H. Patterson. 2008. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’08). 269--282.

Cited By

View all
  • (2023)InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container UtilizationACM Transactions on Storage10.1145/356842619:1(1-27)Online publication date: 11-Jan-2023
  • (2022)Security-Aware and Efficient Data Deduplication for Edge-Assisted Cloud Storage SystemsIEEE Transactions on Services Computing10.1109/TSC.2022.3195318(1-12)Online publication date: 2022
  • (2021)Introduction to data deduplication approachesData Deduplication Approaches10.1016/B978-0-12-823395-5.00019-7(1-15)Online publication date: 2021
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Storage
ACM Transactions on Storage  Volume 13, Issue 2
Special Issue on MSST 2016 and Regular Papers
May 2017
199 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/3098275
  • Editor:
  • Sam H. Noh
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 June 2017
Accepted: 01 March 2017
Revised: 01 February 2017
Received: 01 November 2016
Published in TOS Volume 13, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Deduplication
  2. GPU
  3. disk I/O
  4. lazy method

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NSF of China
  • Natural Science Foundation of Tianjin
  • PhD Candidate Research Innovation Fund of Nankai University

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)16
  • Downloads (Last 6 weeks)0
Reflects downloads up to 08 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2023)InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container UtilizationACM Transactions on Storage10.1145/356842619:1(1-27)Online publication date: 11-Jan-2023
  • (2022)Security-Aware and Efficient Data Deduplication for Edge-Assisted Cloud Storage SystemsIEEE Transactions on Services Computing10.1109/TSC.2022.3195318(1-12)Online publication date: 2022
  • (2021)Introduction to data deduplication approachesData Deduplication Approaches10.1016/B978-0-12-823395-5.00019-7(1-15)Online publication date: 2021
  • (2020)Towards Optimizing Deduplication on Persistent MemoryNetwork and Parallel Computing10.1007/978-3-030-79478-1_39(465-477)Online publication date: 28-Sep-2020
  • (2020)A Cross-Domain Secure Deduplication Scheme Based on Threshold Blind SignatureMobile Wireless Middleware, Operating Systems and Applications10.1007/978-3-030-62205-3_7(78-91)Online publication date: 5-Nov-2020
  • (2019)Flash-Based Storage Deduplication TechniquesInternational Journal of Embedded and Real-Time Communication Systems10.4018/IJERTCS.201907010310:3(32-48)Online publication date: 1-Jul-2019
  • (2019)A High-Performance Processing-in-Memory Accelerator for Inline Data Deduplication2019 IEEE 37th International Conference on Computer Design (ICCD)10.1109/ICCD46524.2019.00077(515-523)Online publication date: Nov-2019
  • (2019)CIDR: A Cost-Effective In-Line Data Reduction System for Terabit-Per-Second Scale SSD Arrays2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)10.1109/HPCA.2019.00025(28-41)Online publication date: Feb-2019
  • (2018)Survey on Deduplication Techniques in Flash-Based StorageProceedings of the 22st Conference of Open Innovations Association FRUCT10.5555/3266365.3266369(25-33)Online publication date: 21-May-2018
  • (2018)Survey on Deduplication Techniques in Flash-Based Storage2018 22nd Conference of Open Innovations Association (FRUCT)10.23919/FRUCT.2018.8468295(25-33)Online publication date: May-2018
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media