Deduplication
The Storage Systems Research Center is a pioneer in the use of deduplication to consolidate storage. Deduplication identifies and eliminates duplicate data by matching data sequences in new data against identical or similar sequences in already-stored data. Instead of writing the new data again, the system keeps only references to the existing copy.
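The following Python sketch illustrates the basic mechanism under simplifying assumptions: fixed-size chunks, SHA-256 fingerprints, and an in-memory chunk store. Real systems typically use content-defined chunking and an on-disk index, so this is a toy model of the idea rather than any particular implementation.

    # Minimal chunk-level deduplication sketch (illustrative only).
    import hashlib

    CHUNK_SIZE = 4096        # assumed fixed chunk size; real systems often vary it
    chunk_store = {}         # fingerprint -> chunk data (the already-stored data)

    def write_file(data: bytes) -> list[str]:
        """Store a file; return its recipe, a list of chunk references."""
        recipe = []
        for off in range(0, len(data), CHUNK_SIZE):
            chunk = data[off:off + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in chunk_store:
                chunk_store[fp] = chunk      # new data: stored exactly once
            recipe.append(fp)                # duplicate data: only a reference is kept
        return recipe

    def read_file(recipe: list[str]) -> bytes:
        """Reassemble a file from its chunk references."""
        return b"".join(chunk_store[fp] for fp in recipe)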
The most challenging problems in data deduplication are avoiding the disk bottleneck and building a system that scales gracefully from terabytes to petabytes of data. At the SSRC we have developed techniques that keep a low RAM footprint, preserve the deduplication ratio, and can be parallelized for smooth scale-out. The trade-offs to consider are RAM usage, ingest speed, and deduplication ratio.
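To make the RAM trade-off concrete, the sketch below places a small in-memory Bloom filter in front of a hypothetical on-disk fingerprint index, so that lookups for most genuinely new chunks never touch the disk. This is one well-known way to sidestep the disk bottleneck, shown here only for illustration; the filter size and hash count are arbitrary, and the approach is not specific to our systems.

    # Bloom-filter pre-check in front of an on-disk fingerprint index (sketch).
    import hashlib

    class BloomFilter:
        def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key: str):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, key: str) -> None:
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def maybe_contains(self, key: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    def is_duplicate(fp: str, bloom: BloomFilter, on_disk_index) -> bool:
        if not bloom.maybe_contains(fp):
            return False                # definitely new: no disk I/O needed
        return fp in on_disk_index      # possible duplicate: one index lookup

    # When a new chunk is actually stored, its fingerprint is added: bloom.add(fp)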
While storage space savings make deduplication an attractive solution, there are other trade-offs to consider as well. As common pieces of data are shared across multiple files, "fragmentation" occurs: pieces of a file end up scattered across disks, so many disk seeks are required to read a deduplicated file and restore speeds suffer. Our current work attempts to group file data together before writing it to disk. This grouping is expected to improve read performance by amortizing the total number of seeks required to read a file. However, such grouping reduces the deduplication ratio, because some redundancy must be reintroduced to build the groups.
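The toy model below contrasts the two placements under simplifying assumptions: a grouped write packs all of a file's chunks into one new container even when copies already exist elsewhere, while a dedup-only write stores each chunk exactly once wherever it first landed. The number of distinct containers a restore touches stands in for the number of seeks. This is not our placement algorithm, only a sketch of the trade-off.

    # Grouped vs. dedup-only placement (illustrative sketch, not the SSRC algorithm).

    def place_grouped(recipe, chunk_bytes, containers, location):
        """Pack the whole file into one new container; redundancy is
        reintroduced so that a restore stays local to a single container."""
        containers.append({fp: chunk_bytes[fp] for fp in recipe})
        for fp in recipe:
            location.setdefault(fp, len(containers) - 1)
        return {fp: len(containers) - 1 for fp in recipe}

    def place_dedup_only(recipe, chunk_bytes, containers, location):
        """Store each chunk exactly once; the file's chunks may end up
        scattered across many previously written containers."""
        new = {fp: chunk_bytes[fp] for fp in recipe if fp not in location}
        if new:
            containers.append(new)
            for fp in new:
                location[fp] = len(containers) - 1
        return {fp: location[fp] for fp in recipe}

    def containers_touched(file_map):
        """Approximate restore cost: distinct containers (roughly, seeks) read."""
        return len(set(file_map.values()))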
Keeping disks spun down to reduce power consumption is another approach to lowering operational costs. However, when deduplicated data becomes fragmented, pieces of files are spread across many disks. To read a deduplicated file, more disks need to be powered on; in the worst case, one disk per piece of file data. Our initial results show a direct correlation between the amount of space saved by deduplication and the number of disk accesses required to retrieve a file.
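A small sketch makes this relationship concrete: given a hypothetical mapping from each chunk fingerprint to the disk holding its single stored copy, the number of disks that must be powered on to restore a file is simply the number of distinct disks its chunks reference.

    # Counting the disks that must spin up to read one deduplicated file (sketch).

    def disks_to_spin_up(recipe, chunk_to_disk):
        """recipe: the file's ordered chunk fingerprints;
        chunk_to_disk: fingerprint -> disk id of the single stored copy."""
        return len({chunk_to_disk[fp] for fp in recipe})

    # Hypothetical layout: a six-chunk file whose shared chunks landed on three
    # different disks needs three disks powered on; an undeduplicated copy could
    # have stayed on one disk.
    layout = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}
    print(disks_to_spin_up(["a", "b", "c", "d", "e", "f"], layout))  # -> 3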
Status
The goal of our work investigating the trade-offs among deduplication, power consumption, and fragmentation is to find a "sweet spot" that achieves the best deduplication while reducing both power consumption and the number of disk seeks for read operations. Our preliminary results look promising, and we are testing new data placement algorithms to investigate the trade-off between deduplication and power consumption.
We are also looking at problems related to secure deduplication and premature data destruction attacks. Specifically, our ongoing work examines the secure deletion of deduplicated data and the problem of reference counting with cryptographic constructs.