Client-Side Journaling for Durable Shared Storage

Published: 17 November 2017

Abstract

Hardware consolidation in the datacenter often leads to scalability bottlenecks from heavy utilization of critical resources, such as storage and network bandwidth. Client-side caching on durable media is already applied at the block level to reduce the load on the storage backend, but it has been criticized for added overhead, restricted sharing, and possible data loss when a client crashes. We introduce a journal into the kernel-level client of an object-based distributed filesystem to improve durability while achieving high I/O performance and reducing the utilization of shared resources. Storage virtualization at the file interface achieves clear consistency semantics across data and metadata, supports native file sharing among clients, and provides flexible configuration of durable data staging at the host. Using a prototype that we implemented, we experimentally quantify the performance and efficiency of the proposed Arion system in comparison to a production system. We run microbenchmarks and application-level workloads over a local cluster and a public cloud. We demonstrate latency reduced by 60% and performance improved by up to 150%, with server network and disk bandwidth reduced by 41% and 77%, respectively. The performance improvement reaches 92% for 16 relational databases as clients and gets as high as 11.3x with two key-value stores as clients.

Published In

ACM Transactions on Storage, Volume 13, Issue 4
Special Issue on MSST 2017 and Regular Papers
November 2017
329 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3160863
Editor: Sam H. Noh
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 November 2017
Accepted: 01 September 2017
Revised: 01 August 2017
Received: 01 December 2016
Published in TOS Volume 13, Issue 4

Author Tags

  1. Cloud storage
  2. crash consistency
  3. distributed filesystems
  4. failure recovery
  5. scalability

Qualifiers

  • Research-article
  • Research
  • Refereed
