Deduplication
The Storage Systems Research Center is a pioneer in the use of deduplication to consolidate storage. Deduplication identifies and eliminates duplicate data by matching data sequences in new data against identical or similar sequences in already-stored data. Instead of writing the new data again, the system keeps only references to the existing copy.
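The following Python sketch illustrates the basic mechanism under simplifying assumptions: fixed-size chunks, SHA-256 fingerprints, and an in-memory chunk store. Real systems typically use content-defined chunking and an on-disk index, so this is a toy model of the idea rather than any particular implementation.

    # Minimal chunk-level deduplication sketch (illustrative only).
    import hashlib

    CHUNK_SIZE = 4096        # assumed fixed chunk size; real systems often vary it
    chunk_store = {}         # fingerprint -> chunk data (the already-stored data)

    def write_file(data: bytes) -> list[str]:
        """Store a file; return its recipe, a list of chunk references."""
        recipe = []
        for off in range(0, len(data), CHUNK_SIZE):
            chunk = data[off:off + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in chunk_store:
                chunk_store[fp] = chunk      # new data: stored exactly once
            recipe.append(fp)                # duplicate data: only a reference is kept
        return recipe

    def read_file(recipe: list[str]) -> bytes:
        """Reassemble a file from its chunk references."""
        return b"".join(chunk_store[fp] for fp in recipe)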
The most challenging problems in data deduplication are avoiding the disk bottleneck and building a system that scales gracefully from terabytes to petabytes of data. At the SSRC we have developed techniques that keep a low RAM footprint, preserve the deduplication ratio, and can be parallelized for smooth scale-out. The trade-offs to consider are RAM usage, ingest speed, and deduplication ratio.
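To make the RAM trade-off concrete, the sketch below places a small in-memory Bloom filter in front of a hypothetical on-disk fingerprint index, so that lookups for most genuinely new chunks never touch the disk. This is one well-known way to sidestep the disk bottleneck, shown here only for illustration; the filter size and hash count are arbitrary, and the approach is not specific to our systems.

    # Bloom-filter pre-check in front of an on-disk fingerprint index (sketch).
    import hashlib

    class BloomFilter:
        def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key: str):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, key: str) -> None:
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def maybe_contains(self, key: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    def is_duplicate(fp: str, bloom: BloomFilter, on_disk_index) -> bool:
        if not bloom.maybe_contains(fp):
            return False                # definitely new: no disk I/O needed
        return fp in on_disk_index      # possible duplicate: one index lookup

    # When a new chunk is actually stored, its fingerprint is added: bloom.add(fp)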
While storage space savings make deduplication an attractive solution, there are other trade-offs to consider as well. As common pieces of data are shared across multiple files, "fragmentation" occurs: pieces of a file end up scattered across disks, so many disk seeks are required to read a deduplicated file and restore speeds suffer. Our current work attempts to group file data together before writing it to disk. This grouping is expected to improve read performance by amortizing the total number of seeks required to read a file. However, such grouping reduces the deduplication ratio, because some redundancy must be reintroduced to build the groups.
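The toy model below contrasts the two placements under simplifying assumptions: a grouped write packs all of a file's chunks into one new container even when copies already exist elsewhere, while a dedup-only write stores each chunk exactly once wherever it first landed. The number of distinct containers a restore touches stands in for the number of seeks. This is not our placement algorithm, only a sketch of the trade-off.

    # Grouped vs. dedup-only placement (illustrative sketch, not the SSRC algorithm).

    def place_grouped(recipe, chunk_bytes, containers, location):
        """Pack the whole file into one new container; redundancy is
        reintroduced so that a restore stays local to a single container."""
        containers.append({fp: chunk_bytes[fp] for fp in recipe})
        for fp in recipe:
            location.setdefault(fp, len(containers) - 1)
        return {fp: len(containers) - 1 for fp in recipe}

    def place_dedup_only(recipe, chunk_bytes, containers, location):
        """Store each chunk exactly once; the file's chunks may end up
        scattered across many previously written containers."""
        new = {fp: chunk_bytes[fp] for fp in recipe if fp not in location}
        if new:
            containers.append(new)
            for fp in new:
                location[fp] = len(containers) - 1
        return {fp: location[fp] for fp in recipe}

    def containers_touched(file_map):
        """Approximate restore cost: distinct containers (roughly, seeks) read."""
        return len(set(file_map.values()))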
Keeping disks spun down to reduce power consumption is another approach to lowering operational costs. However, when deduplicated data becomes fragmented, pieces of files are spread across many disks. To read a deduplicated file, more disks need to be powered on; in the worst case, one disk per piece of file data. Our initial results show a direct correlation between the amount of space saved by deduplication and the number of disk accesses required to retrieve a file.
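A small sketch makes this relationship concrete: given a hypothetical mapping from each chunk fingerprint to the disk holding its single stored copy, the number of disks that must be powered on to restore a file is simply the number of distinct disks its chunks reference.

    # Counting the disks that must spin up to read one deduplicated file (sketch).

    def disks_to_spin_up(recipe, chunk_to_disk):
        """recipe: the file's ordered chunk fingerprints;
        chunk_to_disk: fingerprint -> disk id of the single stored copy."""
        return len({chunk_to_disk[fp] for fp in recipe})

    # Hypothetical layout: a six-chunk file whose shared chunks landed on three
    # different disks needs three disks powered on; an undeduplicated copy could
    # have stayed on one disk.
    layout = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}
    print(disks_to_spin_up(["a", "b", "c", "d", "e", "f"], layout))  # -> 3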
Status
The goal of our work investigating the trade-offs among deduplication, power consumption, and fragmentation is to find a "sweet spot" that achieves the best deduplication while reducing both power consumption and the number of disk seeks for read operations. Our preliminary results look promising, and we are testing new data placement algorithms to investigate the trade-off between deduplication and power consumption.
We are also looking at problems related to secure deduplication and premature data destruction attacks. Specifically, our ongoing work examines the secure deletion of deduplicated data and the problem of reference counting with cryptographic constructs.