Public Access

E-HPC: a library for elastic resource management in HPC environments

Published: 12 November 2017

  Abstract

    Next-generation data-intensive scientific workflows need to support streaming and real-time applications with dynamic resource needs on high-performance computing (HPC) platforms. The static resource allocation model on current HPC systems, designed for monolithic MPI applications, is insufficient to support the elastic resource needs of current and future workflows. In this paper, we discuss the design, implementation, and evaluation of Elastic-HPC (E-HPC), an elastic framework for managing resources for scientific workflows on current HPC systems. E-HPC treats a workflow's resource slot as an elastic window that may map to different physical resources over the duration of the workflow. Our framework uses checkpoint-restart as the underlying mechanism to migrate workflow execution across this dynamic window of resources. E-HPC provides the foundation necessary to enable dynamic allocation of the HPC resources needed by streaming and real-time workflows, and it has negligible overhead beyond the cost of checkpointing. Additionally, E-HPC decreases workflow turnaround time compared to the traditional model of resource allocation, where resources are allocated per workflow stage. Our evaluation shows that E-HPC improves core-hour utilization for common workflow resource-use patterns and provides an effective framework for elastic expansion of resources for applications with dynamic resource needs.
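    The sketch below illustrates the elastic-window idea from the abstract: a workflow starts inside one batch allocation, its tasks are checkpointed, the allocation is released, and a restart job is submitted on a differently sized allocation, so the same logical resource window maps to new physical nodes. This is a minimal, hypothetical Python sketch, not E-HPC's actual API: the SLURM commands (sbatch, scancel), the DMTCP-style checkpoint call, and every name in it (submit_allocation, resize_window, launch_workflow.sh, dmtcp_restart_script.sh, the host and port values) are assumptions made only for illustration.

        import subprocess
        import time

        def submit_allocation(nodes, walltime, batch_script):
            """Request a batch allocation (placeholder sbatch call); return the job ID."""
            out = subprocess.run(
                ["sbatch", "--nodes=%d" % nodes, "--time=%s" % walltime, batch_script],
                capture_output=True, text=True, check=True)
            return out.stdout.split()[-1]  # sbatch prints "Submitted batch job <id>"

        def checkpoint_workflow(coord_host, coord_port):
            """Checkpoint all running workflow tasks via a DMTCP-style coordinator."""
            subprocess.run(["dmtcp_command", "--coord-host", coord_host,
                            "--coord-port", str(coord_port), "--checkpoint"], check=True)

        def resize_window(job_id, new_nodes, walltime, coord_host, coord_port):
            """Move the elastic window to a new allocation size: checkpoint the
            running tasks, release the current allocation, and resubmit a restart
            job that resumes the workflow from its checkpoint images."""
            checkpoint_workflow(coord_host, coord_port)
            subprocess.run(["scancel", job_id], check=True)
            return submit_allocation(new_nodes, walltime, "dmtcp_restart_script.sh")

        if __name__ == "__main__":
            # Hypothetical usage: start on 4 nodes, then expand to 16 nodes when a
            # data-intensive stage of the workflow begins.
            job = submit_allocation(4, "02:00:00", "launch_workflow.sh")
            time.sleep(3600)  # stage monitoring / triggering logic would go here
            job = resize_window(job, 16, "02:00:00", "login01", 7779)

    In the paper's terms, the checkpoint-restart step is what allows the resource window to grow or shrink across allocations without the workflow itself being rewritten to handle resizing.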

        Published In

        WORKS '17: Proceedings of the 12th Workshop on Workflows in Support of Large-Scale Science
        November 2017
        87 pages
        ISBN: 9781450351294
        DOI: 10.1145/3150994
        Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 12 November 2017

        Author Tags

        1. HPC systems
        2. elastic resource management
        3. scientific workflows

        Qualifiers

        • Research-article

        Conference

        SC '17

        Acceptance Rates

        WORKS '17 Paper Acceptance Rate: 8 of 25 submissions (32%)
        Overall Acceptance Rate: 30 of 54 submissions (56%)

        Article Metrics

        • Downloads (last 12 months): 133
        • Downloads (last 6 weeks): 16
        Reflects downloads up to 26 Jul 2024

        Cited By

        • (2024) Dynamic Resource Management for Elastic Scientific Workflows using PMIx. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 686-695. DOI: 10.1109/IPDPSW63119.2024.00131. Online publication date: 27-May-2024.
        • (2023) CarbonScaler: Leveraging Cloud Workload Elasticity for Optimizing Carbon-Efficiency. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7(3), 1-28. DOI: 10.1145/3626788. Online publication date: 7-Dec-2023.
        • (2023) Beacons: An End-to-End Compiler Framework for Predicting and Utilizing Dynamic Loop Characteristics. Proceedings of the ACM on Programming Languages, 7(OOPSLA2), 173-203. DOI: 10.1145/3622803. Online publication date: 16-Oct-2023.
        • (2023) Towards elastic in situ analysis for high-performance computing simulations. Journal of Parallel and Distributed Computing, 177, 106-116. DOI: 10.1016/j.jpdc.2023.02.014. Online publication date: Jul-2023.
        • (2023) Adaptive elasticity policies for staging-based in situ visualization. Future Generation Computer Systems, 142, 75-89. DOI: 10.1016/j.future.2022.12.010. Online publication date: May-2023.
        • (2022) An End-to-end and Adaptive I/O Optimization Tool for Modern HPC Storage Systems. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 1294-1304. DOI: 10.1109/IPDPS53621.2022.00128. Online publication date: May-2022.
        • (2022) Colza: Enabling Elastic In Situ Visualization for High-performance Computing Simulations. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 538-548. DOI: 10.1109/IPDPS53621.2022.00059. Online publication date: May-2022.
        • (2022) Adaptive parallel applications: from shared memory architectures to fog computing (2002–2022). Cluster Computing, 25(6), 4439-4461. DOI: 10.1007/s10586-022-03692-2. Online publication date: 2-Aug-2022.
        • (2021) HatRPC. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1145/3458817.3476191. Online publication date: 14-Nov-2021.
        • (2021) An Adaptive Elasticity Policy For Staging Based In-Situ Processing. 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS), 33-41. DOI: 10.1109/WORKS54523.2021.00010. Online publication date: Dec-2021.
