A study of practical deduplication

Published: 02 February 2012

Abstract

We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of redundancy. We found that whole-file deduplication achieves about three quarters of the space savings of the most aggressive block-level deduplication for storage of live file systems, and 87% of the savings for backup images. We also studied file fragmentation, finding that it is not prevalent, and updated prior file system metadata studies, finding that the distribution of file sizes continues to skew toward very large unstructured files.

Published In

ACM Transactions on Storage  Volume 7, Issue 4
January 2012
65 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/2078861
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 February 2012
Accepted: 01 September 2011
Received: 01 September 2011
Published in TOS Volume 7, Issue 4

Author Tags

  1. Deduplication
  2. Windows
  3. data
  4. filesystem
  5. study

Qualifiers

  • Research-article
  • Research
  • Refereed
