Efficient Deduplication in a Distributed Primary Storage Infrastructure

Authors: José Pereira

ACM Transactions on Storage (TOS), Volume 12, Issue 4

Article No.: 20, Pages 1 - 35

https://doi.org/10.1145/2876509

Published: 20 May 2016

Abstract

A large amount of duplicate data typically exists across volumes of virtual machines in cloud computing infrastructures. Deduplication allows reclaiming these duplicates while improving the cost-effectiveness of large-scale multitenant infrastructures. However, traditional archival and backup deduplication systems impose prohibitive storage overhead for virtual machines hosting latency-sensitive applications. Primary deduplication systems reduce this penalty but rely on special cluster filesystems, centralized components, or restrictive workload assumptions. In addition, some of these systems reduce storage overhead by confining deduplication to off-peak periods, which may be scarce in a cloud environment.

We present DEDIS, a dependable and fully decentralized system that performs cluster-wide off-line deduplication of virtual machines’ primary volumes. DEDIS works on top of any unsophisticated storage backend, centralized or distributed, as long as it exports a basic shared block device interface. Also, DEDIS does not rely on data locality assumptions and incorporates novel optimizations for reducing deduplication overhead and increasing its reliability.

The evaluation of an open-source prototype shows that minimal I/O overhead is achievable even when deduplication and intensive storage I/O are executed simultaneously. Our design also scales out and allows collocating DEDIS components and virtual machines on the same servers, thus sparing the need for additional hardware.
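
To make the idea of off-line (out-of-band) block deduplication more concrete, the following is a minimal single-process sketch in Python. It is not the DEDIS protocol: the abstract does not describe the system's data structures or its distributed coordination, so `BlockStore`, `deduplicate`, the 4 KiB block size, and the use of SHA-256 are all assumptions chosen for illustration. The sketch only shows the general pattern in which writes are stored immediately, outside any deduplication logic, and a background pass later finds identical blocks and makes their logical addresses share one physical copy.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; not taken from the paper


class BlockStore:
    """Toy stand-in for a shared block device with per-volume logical maps."""

    def __init__(self):
        self.physical = {}   # physical address -> block contents
        self.refcount = {}   # physical address -> number of logical references
        self.mapping = {}    # (volume, logical address) -> physical address
        self.dirty = []      # writes not yet examined by the background pass
        self._next_addr = 0  # physical addresses are never reused in this toy

    def write(self, volume, laddr, data):
        """Foreground write: store the block right away and defer deduplication."""
        old = self.mapping.get((volume, laddr))
        if old is not None:
            self._release(old)
        paddr = self._next_addr
        self._next_addr += 1
        self.physical[paddr] = data
        self.refcount[paddr] = 1
        self.mapping[(volume, laddr)] = paddr
        self.dirty.append((volume, laddr))

    def read(self, volume, laddr):
        return self.physical[self.mapping[(volume, laddr)]]

    def _release(self, paddr):
        """Drop one reference and free the block when nobody points at it."""
        self.refcount[paddr] -= 1
        if self.refcount[paddr] == 0:
            del self.refcount[paddr]
            del self.physical[paddr]


def deduplicate(store, index):
    """Background pass: remap recently written duplicates to one shared block.

    `index` maps a content digest to the physical address holding that content.
    Returns the number of physical blocks reclaimed.
    """
    reclaimed = 0
    pending, store.dirty = store.dirty, []
    for key in pending:
        paddr = store.mapping[key]
        data = store.physical[paddr]
        digest = hashlib.sha256(data).hexdigest()
        target = index.get(digest)
        if target is None or target not in store.physical:
            # First copy seen, or the indexed copy has since been freed.
            index[digest] = paddr
        elif target != paddr and store.physical[target] == data:
            # Byte comparison guards against acting on a digest collision.
            store.mapping[key] = target
            store.refcount[target] += 1
            store._release(paddr)
            reclaimed += 1
    return reclaimed


if __name__ == "__main__":
    store, index = BlockStore(), {}
    store.write("vm1", 0, b"A" * BLOCK_SIZE)
    store.write("vm2", 0, b"A" * BLOCK_SIZE)
    print("blocks reclaimed:", deduplicate(store, index))
```

In this toy model an overwrite always allocates a fresh physical block, so a shared block is never modified in place. A real system would also need persistence, concurrency control, and coordination between nodes, none of which is modeled here.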

Efficient Deduplication in a Distributed Primary Storage Infrastructure
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
    2. Software system structures

Published In

ACM Transactions on Storage Volume 12, Issue 4

August 2016

213 pages

ISSN:1553-3077

EISSN:1553-3093

DOI:10.1145/2940403

Editor:
Darrell D. E. Long
University of California Santa Cruz, USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 May 2016

Accepted: 01 January 2016

Revised: 01 September 2015

Received: 01 October 2014

Published in TOS Volume 12, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology)
ERDF - European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness)

Cited By

Jackowski A, Ślusarczyk Ł, Lichota K, Wełnicki M, Wijata R, Kielar M, Kopeć T, Dubnicki C, Iwanicki K (2023). ObjDedup: High-Throughput Object Storage Layer for Backup Systems With Block-Level Deduplication. IEEE Transactions on Parallel and Distributed Systems 34:7 (2180-2197). Online publication date: Jul-2023.
https://doi.org/10.1109/TPDS.2023.3250501
Gharib M, Fazli M (2023). Secure cloud storage with anonymous deduplication using ID-based key management. The Journal of Supercomputing 79:2 (2356-2382). Online publication date: 1-Feb-2023.
https://dl.acm.org/doi/10.1007/s11227-022-04751-6
Yuan J, Zou X, Xu H, Cao Z, Li S, Xia W, Wang P, Chen L (2022). A Focused Garbage Collection Approach for Primary Deduplicated Storage with Low Memory Overhead. 2022 IEEE 40th International Conference on Computer Design (ICCD) (315-323). Online publication date: Oct-2022.
https://doi.org/10.1109/ICCD56317.2022.00053
Godavari A, Sudhakar C, Ramesh T (2022). File Semantic Aware Primary Storage Deduplication System. IETE Journal of Research (1-13). Online publication date: 16-Mar-2022.
https://doi.org/10.1080/03772063.2022.2050306
Miranda M, Esteves T, Portela B, Paulo J, Wassermann B, Malka M, Chidambaram V, Raz D (2021). S2Dedup. Proceedings of the 14th ACM International Conference on Systems and Storage (1-12). Online publication date: 14-Jun-2021.
https://dl.acm.org/doi/10.1145/3456727.3463773
Yin J, Tang Y, Deng S, Zheng B, Zomaya A (2021). MUSE: A Multi-Tierd and SLA-Driven Deduplication Framework for Cloud Storage Systems. IEEE Transactions on Computers 70:5 (759-774). Online publication date: 1-May-2021.
https://doi.org/10.1109/TC.2020.2996638
Wu S, Mao B, Jiang H, Luan H, Zhou J (2019). PFP: Improving the Reliability of Deduplication-based Storage Systems with Per-File Parity. IEEE Transactions on Parallel and Distributed Systems 30:9 (2117-2129). Online publication date: 6-Aug-2019.
https://dl.acm.org/doi/10.1109/TPDS.2019.2898942
Fu Y, Xiao N, Jiang H, Hu G, Chen W (2019). Application-Aware Big Data Deduplication in Cloud Environment. IEEE Transactions on Cloud Computing 7:4 (921-934). Online publication date: 1-Oct-2019.
https://doi.org/10.1109/TCC.2017.2710043
Li J, Hou M (2018). Improving Data Availability for Deduplication in Cloud Storage. International Journal of Grid and High Performance Computing 10:2 (70-89). Online publication date: 1-Apr-2018.
https://dl.acm.org/doi/10.4018/IJGHPC.2018040106
Wu H, Wang C, Fu Y, Sakr S, Lu K, Zhu L (2018). A Differentiated Caching Mechanism to Enable Primary Storage Deduplication in Clouds. IEEE Transactions on Parallel and Distributed Systems 29:6 (1202-1216). Online publication date: 1-Jun-2018.
https://doi.org/10.1109/TPDS.2018.2790946
