MaDaTS: Managing Data on Tiered Storage for Scientific Workflows

Authors:

Devarshi Ghoshal,

Lavanya Ramakrishnan

HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing

Pages 41 - 52

https://doi.org/10.1145/3078597.3078611

Published: 26 June 2017

Abstract

Scientific workflows are increasingly used in High Performance Computing (HPC) environments to manage complex simulations and analyses, often consuming and generating large amounts of data. However, workflow tools have limited support for managing input, output and intermediate data. The data elements of a workflow are often managed by the user through scripts or other ad-hoc mechanisms. Technology advances for future HPC systems are redefining the memory and storage subsystem by introducing additional tiers to improve the I/O performance of data-intensive applications. These architectural changes make managing data for scientific workflows more complex. Thus, we need to manage scientific workflow data across the tiered storage system on HPC machines. In this paper, we present the design and implementation of MaDaTS (Managing Data on Tiered Storage for Scientific Workflows), a software architecture that manages data for scientific workflows. We introduce Virtual Data Space (VDS), an abstraction of the data in a workflow that hides the complexities of the underlying storage system while allowing users to control data management strategies. We evaluate the data management strategies with real scientific and synthetic workflows, and demonstrate the capabilities of MaDaTS. Our experiments demonstrate the flexibility, performance and scalability gains of MaDaTS as compared to the traditional approach of managing data in scientific workflows.
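
The idea of a data abstraction that hides storage tiers while leaving the placement strategy under user control can be illustrated with a small sketch. This is not the MaDaTS API: the class `VirtualDataSpace`, the tier names, the `DataObject` fields and the strategy function below are all hypothetical, chosen only to show the shape such an abstraction might take.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical tier names, ordered fastest to slowest (not taken from the paper).
TIERS = ("burst_buffer", "parallel_fs", "archive")


@dataclass(frozen=True)
class DataObject:
    name: str       # logical name that workflow tasks refer to
    size_gb: float  # approximate size, available to strategies that need it
    kind: str       # "input", "intermediate", or "output"


# A strategy is any function that picks a tier for a data object.
Strategy = Callable[[DataObject], str]


def intermediates_on_fast_tier(obj: DataObject) -> str:
    """Example user-chosen strategy: intermediate data goes to the fastest tier."""
    return TIERS[0] if obj.kind == "intermediate" else TIERS[1]


@dataclass
class VirtualDataSpace:
    """Illustrative stand-in for a VDS: maps logical names to tier paths."""
    strategy: Strategy
    _placement: Dict[str, str] = field(default_factory=dict)

    def add(self, obj: DataObject) -> None:
        tier = self.strategy(obj)
        if tier not in TIERS:
            raise ValueError(f"unknown tier: {tier}")
        self._placement[obj.name] = tier

    def locate(self, name: str) -> str:
        # Tasks ask for a logical name and never hard-code a tier path.
        return f"/{self._placement[name]}/{name}"


vds = VirtualDataSpace(strategy=intermediates_on_fast_tier)
vds.add(DataObject("raw.h5", 120.0, "input"))
vds.add(DataObject("tmp.h5", 40.0, "intermediate"))
print(vds.locate("raw.h5"))
print(vds.locate("tmp.h5"))
```

In this sketch, switching to a different placement policy means passing a different strategy function; the workflow tasks that call `locate` stay unchanged.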

Index Terms

MaDaTS: Managing Data on Tiered Storage for Scientific Workflows
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Software infrastructure
        Middleware
    2. Software system structures
      1. Abstraction, modeling and modularity
      2. Software architectures
        Data flow architectures

Published In

HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing

June 2017

254 pages

ISBN:9781450346993

DOI:10.1145/3078597

General Chairs:
Howie Huang, George Washington University, USA
Jon Weissman, University of Minnesota, USA

Program Chairs:
Adriana Iamnitchi, University of South Florida, USA
Alexandru Iosup, Vrije Universiteit Amsterdam and Delft University of Technology, NLD

Copyright © 2017 ACM.

© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

SIGHPC: ACM Special Interest Group on High Performance Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Advanced Scientific Computing Research

Conference

HPDC '17

HPDC '17: The 26th International Symposium on High-Performance Parallel and Distributed Computing

June 26 - 30, 2017

Washington, DC, USA

Acceptance Rates

HPDC '17 Paper Acceptance Rate 19 of 100 submissions, 19%;

Overall Acceptance Rate 166 of 966 submissions, 17%
