research-article

DAOS and friends: a proposal for an exascale storage system

Published: 13 November 2016

Abstract

The DOE Extreme-Scale Technology Acceleration Fast Forward Storage and IO Stack project will have a significant impact on storage system design within and beyond the HPC community. With phase two of the project starting, this is an excellent opportunity to explore the complete design and how it will address the needs of extreme-scale platforms. This paper examines each layer of the proposed stack in some detail, along with cross-cutting topics such as transactions and metadata management.
This paper not only provides a timely summary of important aspects of the design specifications but also captures the underlying reasoning that is not available elsewhere. We encourage the broader community to study the design, its intent, and its future directions to foster discussion guiding phase two and the ultimate production storage stack based on this work. An initial performance evaluation of the early prototype implementation is also provided to validate the presented design.
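The transaction model the abstract alludes to is epoch-based: writes are tagged with an epoch and become globally visible only once that epoch commits, so readers always observe a consistent snapshot without locking. The toy sketch below illustrates that visibility rule only; it is an assumed, simplified model, and the names (`EpochStore`, `write`, `commit`) are hypothetical, not the actual DAOS API.

```python
# Toy sketch of epoch-based transactional visibility, loosely modeled on the
# Fast Forward stack's transaction concept. Illustrative only; NOT the DAOS API.

class EpochStore:
    """In-memory object store where writes land in a pending epoch and
    become readable only after that epoch commits."""

    def __init__(self):
        self.pending = {}        # epoch -> {key: value}, not yet visible
        self.committed = {}      # key -> (epoch, value), visible to readers
        self.highest_committed = 0

    def write(self, epoch, key, value):
        # Writes may only target epochs that have not yet committed.
        if epoch <= self.highest_committed:
            raise ValueError("cannot write to an already-committed epoch")
        self.pending.setdefault(epoch, {})[key] = value

    def commit(self, epoch):
        # Epochs commit in order so readers see a consistent snapshot.
        if epoch != self.highest_committed + 1:
            raise ValueError("epochs must commit in order")
        for key, value in self.pending.pop(epoch, {}).items():
            self.committed[key] = (epoch, value)
        self.highest_committed = epoch

    def read(self, key):
        # Readers only ever observe fully committed data.
        entry = self.committed.get(key)
        return entry[1] if entry else None

store = EpochStore()
store.write(1, "checkpoint/rank0", b"data-v1")
assert store.read("checkpoint/rank0") is None   # epoch 1 not yet committed
store.commit(1)
assert store.read("checkpoint/rank0") == b"data-v1"
```

Grouping many writes into one epoch is what lets a burst of checkpoint I/O from thousands of ranks appear atomically, which is the failure-containment property the design argues for.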



Published In

cover image ACM Conferences
SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2016
1034 pages
ISBN:9781467388153
Conference Chair: John West

Publisher

IEEE Press



Conference

SC16

Acceptance Rates

SC '16 Paper Acceptance Rate: 81 of 442 submissions, 18%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%

Cited By

  • (2019) Automatic, application-aware I/O forwarding resource allocation. Proceedings of the 17th USENIX Conference on File and Storage Technologies, pages 265-279. doi:10.5555/3323298.3323323
  • (2019) End-to-end I/O monitoring on a leading supercomputer. Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation, pages 379-394. doi:10.5555/3323234.3323267
  • (2018) Contention-Aware Resource Scheduling for Burst Buffer Systems. Workshop Proceedings of the 47th International Conference on Parallel Processing, pages 1-7. doi:10.1145/3229710.3229718
  • (2018) Toward scalable and asynchronous object-centric data management for HPC. Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 113-122. doi:10.1109/CCGRID.2018.00026
  • (2017) Tagit. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-12. doi:10.1145/3126908.3126929
  • (2017) UNITY. Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2017, pages 1-8. doi:10.1145/3095770.3095776
