research-article

DAOS and friends: a proposal for an exascale storage system

Published: 13 November 2016

Abstract

The DOE Extreme-Scale Technology Acceleration Fast Forward Storage and IO Stack project will have a significant impact on storage system design within and beyond the HPC community. With phase two of the project starting, this is an excellent opportunity to explore the complete design and how it will address the needs of extreme-scale platforms. This paper examines each layer of the proposed stack in some detail, along with cross-cutting topics such as transactions and metadata management.
This paper not only provides a timely summary of important aspects of the design specifications but also captures the underlying reasoning that is not available elsewhere. We encourage the broader community to study the design, its intent, and its future directions to foster discussion guiding phase two and the ultimate production storage stack based on this work. An initial performance evaluation of the early prototype implementation is also provided to validate the presented design.
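The transaction model the abstract alludes to is epoch-based: writes are tagged with an epoch and become globally visible only once that epoch commits, so readers always observe a consistent snapshot without locking. The toy sketch below illustrates that visibility rule only; it is an assumed, simplified model, and the names (`EpochStore`, `write`, `commit`) are hypothetical, not the actual DAOS API.

```python
# Toy sketch of epoch-based transactional visibility, loosely modeled on the
# Fast Forward stack's transaction concept. Illustrative only; NOT the DAOS API.

class EpochStore:
    """In-memory object store where writes land in a pending epoch and
    become readable only after that epoch commits."""

    def __init__(self):
        self.pending = {}        # epoch -> {key: value}, not yet visible
        self.committed = {}      # key -> (epoch, value), visible to readers
        self.highest_committed = 0

    def write(self, epoch, key, value):
        # Writes may only target epochs that have not yet committed.
        if epoch <= self.highest_committed:
            raise ValueError("cannot write to an already-committed epoch")
        self.pending.setdefault(epoch, {})[key] = value

    def commit(self, epoch):
        # Epochs commit in order so readers see a consistent snapshot.
        if epoch != self.highest_committed + 1:
            raise ValueError("epochs must commit in order")
        for key, value in self.pending.pop(epoch, {}).items():
            self.committed[key] = (epoch, value)
        self.highest_committed = epoch

    def read(self, key):
        # Readers only ever observe fully committed data.
        entry = self.committed.get(key)
        return entry[1] if entry else None

store = EpochStore()
store.write(1, "checkpoint/rank0", b"data-v1")
assert store.read("checkpoint/rank0") is None   # epoch 1 not yet committed
store.commit(1)
assert store.read("checkpoint/rank0") == b"data-v1"
```

Grouping many writes into one epoch is what lets a burst of checkpoint I/O from thousands of ranks appear atomically, which is the failure-containment property the design argues for.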



Published In

cover image ACM Conferences
SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2016
1034 pages
ISBN:9781467388153
Conference Chair: John West

Publisher

IEEE Press



Conference

SC16

Acceptance Rates

SC '16 Paper Acceptance Rate: 81 of 442 submissions, 18%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%

Cited By

  • (2019) Automatic, application-aware I/O forwarding resource allocation. Proceedings of the 17th USENIX Conference on File and Storage Technologies, pages 265-279. doi:10.5555/3323298.3323323
  • (2019) End-to-end I/O monitoring on a leading supercomputer. Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation, pages 379-394. doi:10.5555/3323234.3323267
  • (2018) Contention-Aware Resource Scheduling for Burst Buffer Systems. Workshop Proceedings of the 47th International Conference on Parallel Processing, pages 1-7. doi:10.1145/3229710.3229718
  • (2018) Toward scalable and asynchronous object-centric data management for HPC. Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 113-122. doi:10.1109/CCGRID.2018.00026
  • (2017) Tagit. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-12. doi:10.1145/3126908.3126929
  • (2017) UNITY. Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2017, pages 1-8. doi:10.1145/3095770.3095776
