10.5555/2827701.2827712

Metadata considered harmful ... to deduplication

Published: 06 July 2015

Abstract

Deduplication is widely used to improve space efficiency in storage systems. While much attention has been paid to making the process of deduplication fast and scalable, the effectiveness of deduplication can vary dramatically depending on the data stored. We show that many file formats suffer from a fundamental design property that is incompatible with deduplication: they intersperse metadata with data in ways that make otherwise identical data appear different. We examine three models for improving deduplication in the presence of embedded metadata: deduplication-friendly data formats, application-level post-processing, and format-aware deduplication. Working with real-world file formats and datasets, we find that separating metadata from data improves deduplication ratios significantly, in some cases by as much as 5.6×.
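
The effect described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the 512-byte header layout (loosely modeled on a tar block), the fixed 4096-byte chunk size, the file sizes, and all function names are assumptions chosen for illustration. The sketch backs up the same files twice with different modification times and compares a layout that interleaves headers with data against one that stores metadata and data in separate streams.

```python
import hashlib

CHUNK = 4096   # fixed-size deduplication chunk (assumed; real systems vary)
HEADER = 512   # per-file header size, loosely modeled on a tar block (assumed)


def make_files(n=64, size=1536):
    """Build n distinct, deterministic payloads of `size` bytes each."""
    files = []
    for i in range(n):
        seed = hashlib.sha256(str(i).encode()).digest()
        files.append((seed * (size // len(seed) + 1))[:size])
    return files


def header(name, mtime):
    """Hypothetical header: file name and modification time, zero-padded."""
    return f"{name}|{mtime}".encode().ljust(HEADER, b"\0")


def interleaved(files, mtime):
    """Metadata interspersed with data: header, payload, header, payload, ..."""
    return b"".join(header(f"f{i}", mtime) + data
                    for i, data in enumerate(files))


def separated(files, mtime):
    """Metadata and data kept in two separate streams."""
    meta = b"".join(header(f"f{i}", mtime) for i in range(len(files)))
    data = b"".join(files)
    return meta, data


def chunk_hashes(stream):
    """Split a stream into fixed-size chunks and fingerprint each one."""
    return [hashlib.sha256(stream[i:i + CHUNK]).digest()
            for i in range(0, len(stream), CHUNK)]


def dedup_ratio(streams):
    """Total chunks divided by unique chunks across all streams."""
    hashes = [h for s in streams for h in chunk_hashes(s)]
    return len(hashes) / len(set(hashes))


if __name__ == "__main__":
    files = make_files()
    # Two backups of identical file contents; only the mtime differs.
    ratio_mixed = dedup_ratio([interleaved(files, 1000),
                               interleaved(files, 2000)])
    meta1, data1 = separated(files, 1000)
    meta2, data2 = separated(files, 2000)
    ratio_split = dedup_ratio([meta1, data1, meta2, data2])
    print("interleaved:", ratio_mixed)
    print("separated:  ", ratio_split)
```

Under these assumed sizes, every chunk of the interleaved archive contains at least one header, so the changed modification time makes every chunk unique and nothing deduplicates across the two backups. With the streams separated, the data stream is byte-identical in both backups and deduplicates completely; only the metadata stream is stored twice.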

Published In

HotStorage'15: Proceedings of the 7th USENIX Conference on Hot Topics in Storage and File Systems
July 2015
17 pages

Sponsors

  • VMware
  • NetApp
  • Google Inc.
  • Facebook
  • HP

Publisher

USENIX Association

United States
