Improving duplicate elimination in storage systems

Published: 01 November 2006

Abstract

Minimizing the amount of data that must be stored and managed is a key goal for any storage architecture that purports to be scalable. One way to achieve this goal is to avoid maintaining duplicate copies of the same data. Eliminating redundant data at the source, by not writing data that has already been stored, not only reduces storage overhead but can also improve bandwidth utilization. For these reasons, in the face of today's exponentially growing data volumes, redundant data elimination techniques have assumed critical significance in the design of modern storage systems.

Intelligent object partitioning techniques identify the data that is new when objects are updated, and transfer only these chunks to a storage server. In this article, we propose a new object partitioning technique, called fingerdiff, that improves upon existing schemes in several important respects. Most notably, fingerdiff dynamically chooses a partitioning strategy for a data object based on its similarities with previously stored objects in order to improve storage and bandwidth utilization. We present a detailed evaluation of fingerdiff and other existing object partitioning schemes, using a set of real-world workloads. We show that for these workloads, the duplicate elimination strategies employed by fingerdiff improve storage utilization on average by 25%, and bandwidth utilization on average by 40%, over comparable techniques.
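To make the general idea of object partitioning concrete, the following is a minimal Python sketch of content-defined chunking combined with a content-addressed chunk store. It is not fingerdiff and is not taken from the article: it only illustrates the baseline approach the abstract describes, in which an updated object is split into chunks and only chunks that are not already stored are transferred. The names `split_chunks` and `ChunkStore`, and the constants `WINDOW`, `MIN_CHUNK`, and `MASK`, are hypothetical, and a SHA-1 hash of a sliding window stands in for a true rolling Rabin fingerprint purely for simplicity.

```python
import hashlib

# Illustrative parameters (assumptions, not values from the article).
WINDOW = 16      # sliding-window width in bytes
MIN_CHUNK = 16   # minimum chunk length; must be >= WINDOW
MASK = 0x3F      # cut when the low 6 bits of the window hash are zero


def split_chunks(data: bytes) -> list:
    """Split data into variable-size chunks at content-defined boundaries.

    A boundary is declared after a byte when the hash of the preceding
    WINDOW bytes matches the mask, so boundaries depend on local content
    rather than on absolute offsets.
    """
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        if end - start < MIN_CHUNK:
            continue
        # Stand-in for a rolling fingerprint: hash the window from scratch.
        digest = hashlib.sha1(data[end - WINDOW:end]).digest()
        if int.from_bytes(digest[:4], "big") & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a boundary
    return chunks


class ChunkStore:
    """Toy content-addressed store: each chunk is keyed by its SHA-1 hash."""

    def __init__(self):
        self.chunks = {}

    def put(self, data: bytes):
        """Store an object; return (recipe, number of new bytes stored)."""
        recipe, new_bytes = [], 0
        for chunk in split_chunks(data):
            key = hashlib.sha1(chunk).hexdigest()
            if key not in self.chunks:   # only unseen chunks are written
                self.chunks[key] = chunk
                new_bytes += len(chunk)
            recipe.append(key)
        return recipe, new_bytes

    def get(self, recipe) -> bytes:
        """Rebuild an object from its list of chunk keys."""
        return b"".join(self.chunks[key] for key in recipe)


if __name__ == "__main__":
    # Deterministic pseudo-random test data, then a small insertion.
    v1 = b"".join(hashlib.sha1(str(i).encode()).digest() for i in range(200))
    v2 = v1[:2000] + b"INSERTED" + v1[2000:]
    store = ChunkStore()
    _, first = store.put(v1)
    _, second = store.put(v2)
    print("bytes stored for v1:", first, "for v2:", second)
```

Because boundaries are chosen from local content, a small insertion tends to change only the chunks around the edit, and the remaining chunks of the new version hash to keys the store already holds. A real system would use an incremental fingerprint instead of rehashing every window, and would choose much larger chunk sizes.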

Published In

ACM Transactions on Storage  Volume 2, Issue 4
November 2006
133 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/1210596

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Rabin's fingerprints
  2. storage management
  3. content-based addressing
  4. duplicate elimination
