
WAN-optimized replication of backup datasets using stream-informed delta compression

Published: 06 December 2012

Abstract

Replicating data off site is critical for disaster recovery, but the current approach of transferring tapes is cumbersome and error-prone. Replicating across a wide area network (WAN) is a promising alternative, but fast network connections are expensive or impractical in many remote locations, so improved compression is needed to make WAN replication truly practical. We present a new technique for replicating backup datasets across a WAN that not only eliminates duplicate regions of files (deduplication) but also compresses similar regions of files with delta compression. The technique is available as a feature of EMC Data Domain systems.
Our main contribution is an architecture that adds stream-informed delta compression to existing deduplication systems and eliminates the need for new, persistent indexes. Unlike techniques that rely on knowing a file's version or on a memory cache, our approach achieves delta compression across all data replicated to a server at any time in the past. A detailed analysis of datasets and statistics from hundreds of customers using our product shows that delta compression achieves an additional 2X compression beyond deduplication and local compression, which enables customers to replicate data that would otherwise fail to complete within their backup window.
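
The three-way choice the abstract describes (skip exact duplicates, delta-compress similar regions, fall back to local compression) can be illustrated with a minimal sketch. It is not the authors' implementation: the names fingerprint, delta_encode, delta_decode, plan_transfer, and remote_store are hypothetical, SHA-1 is an assumed fingerprint function, and zlib with a preset dictionary stands in for a real delta encoder. How a similar base chunk is located, which is where the stream-informed architecture comes in, is not shown; the caller simply supplies the base fingerprint.

```python
import hashlib
import zlib


def fingerprint(chunk: bytes) -> str:
    """Identify a chunk by a cryptographic hash (SHA-1 is an assumption)."""
    return hashlib.sha1(chunk).hexdigest()


def delta_encode(target: bytes, base: bytes) -> bytes:
    """Stand-in for delta encoding: deflate `target` with `base` as a preset
    dictionary, so regions shared with `base` become short back-references."""
    compressor = zlib.compressobj(level=9, zdict=base)
    return compressor.compress(target) + compressor.flush()


def delta_decode(delta: bytes, base: bytes) -> bytes:
    """Rebuild the target from the delta and the same base chunk."""
    decompressor = zlib.decompressobj(zdict=base)
    return decompressor.decompress(delta) + decompressor.flush()


def plan_transfer(chunk, remote_store, similar_base_fp=None):
    """Decide what crosses the WAN for one chunk.

    remote_store is a hypothetical dict mapping fingerprint -> chunk bytes
    already held by the replica. Returns a (kind, payload) pair.
    """
    fp = fingerprint(chunk)
    if fp in remote_store:
        # Exact duplicate: only the fingerprint needs to be sent.
        return ("dup", fp.encode())
    if similar_base_fp is not None and similar_base_fp in remote_store:
        # Similar chunk already on the replica: send only the differences.
        return ("delta", delta_encode(chunk, remote_store[similar_base_fp]))
    # Nothing to exploit: fall back to local compression.
    return ("raw", zlib.compress(chunk, 9))
```

In this sketch the replica would rebuild a "delta" chunk by calling delta_decode with the same base it already stores. The preset dictionary only helps for bases within zlib's 32 KB window, which is one reason a dedicated delta encoder would be preferred in practice.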


Published In

ACM Transactions on Storage  Volume 8, Issue 4
November 2012
82 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/2385603
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 December 2012
Accepted: 01 August 2012
Received: 01 August 2012
Published in TOS Volume 8, Issue 4

Author Tags

  1. Backup storage
  2. deduplication
  3. delta compression
  4. network replication

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 38
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 18 Jan 2025

Cited By

  • (2024) SimEnc. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (615-630). DOI: 10.5555/3691992.3692030. Online publication date: 10-Jul-2024
  • (2024) Encrypted Data Reduction: Removing Redundancy from Encrypted Data in Outsourced Storage. ACM Transactions on Storage 20:4 (1-30). DOI: 10.1145/3685278. Online publication date: 29-Jul-2024
  • (2024) The Design of Fast Delta Encoding for Delta Compression Based Storage Systems. ACM Transactions on Storage 20:4 (1-30). DOI: 10.1145/3664817. Online publication date: 14-May-2024
  • (2024) Efficient Time-Series Data Delivery in IoT with Xender. IEEE Transactions on Mobile Computing (1-15). DOI: 10.1109/TMC.2023.3296608. Online publication date: 2024
  • (2024) The Design of a Lossless Deduplication Scheme to Eliminate Fine-Grained Redundancy for JPEG Image Storage Systems. IEEE Transactions on Computers 73:5 (1385-1399). DOI: 10.1109/TC.2024.3363456. Online publication date: 7-Feb-2024
  • (2024) Applying Delta Compression to Packed Datasets for Efficient Data Reduction. IEEE Transactions on Computers 73:1 (73-85). DOI: 10.1109/TC.2023.3318404. Online publication date: Jan-2024
  • (2024) ERD: AVX-512-based Enhancement of Resemblance Detection for Post-Deduplication Delta Compression. 2024 International Conference on Networking, Architecture and Storage (NAS) (1-4). DOI: 10.1109/NAS63802.2024.10781356. Online publication date: 9-Nov-2024
  • (2024) SuperDelta: Multiple Referenced Base Chunks Scheme for Fine-grained Deduplication Backup Storage System. 2024 Data Compression Conference (DCC) (362-371). DOI: 10.1109/DCC58796.2024.00044. Online publication date: 19-Mar-2024
  • (2024) Chunk2vec: A novel resemblance detection scheme based on Sentence‐BERT for post‐deduplication delta compression in network transmission. IET Communications. DOI: 10.1049/cmu2.12719. Online publication date: 4-Jan-2024
  • (2024) RESIST: Randomized Encryption for Deduplicated Cloud Storage System. Arabian Journal for Science and Engineering. DOI: 10.1007/s13369-024-09658-3. Online publication date: 25-Oct-2024
