Public Access

E-HPC: a library for elastic resource management in HPC environments

Published: 12 November 2017

  Abstract

    Next-generation data-intensive scientific workflows need to support streaming and real-time applications with dynamic resource needs on high-performance computing (HPC) platforms. The static resource allocation model on current HPC systems, designed for monolithic MPI applications, is insufficient to support the elastic resource needs of current and future workflows. In this paper, we discuss the design, implementation, and evaluation of Elastic-HPC (E-HPC), an elastic framework for managing resources for scientific workflows on current HPC systems. E-HPC treats a workflow's resource slot as an elastic window that may map to different physical resources over the duration of the workflow. Our framework uses checkpoint-restart as the underlying mechanism to migrate workflow execution across this dynamic window of resources. E-HPC provides the foundation necessary to enable dynamic allocation of the HPC resources needed by streaming and real-time workflows, and it has negligible overhead beyond the cost of checkpointing. Additionally, E-HPC decreases workflow turnaround time compared to the traditional model of resource allocation, where resources are allocated per workflow stage. Our evaluation shows that E-HPC improves core-hour utilization for common workflow resource-use patterns and provides an effective framework for elastic expansion of resources for applications with dynamic resource needs.
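    The sketch below illustrates the elastic-window idea from the abstract: a workflow starts inside one batch allocation, its tasks are checkpointed, the allocation is released, and a restart job is submitted on a differently sized allocation, so the same logical resource window maps to new physical nodes. This is a minimal, hypothetical Python sketch, not E-HPC's actual API: the SLURM commands (sbatch, scancel), the DMTCP-style checkpoint call, and every name in it (submit_allocation, resize_window, launch_workflow.sh, dmtcp_restart_script.sh, the host and port values) are assumptions made only for illustration.

        import subprocess
        import time

        def submit_allocation(nodes, walltime, batch_script):
            """Request a batch allocation (placeholder sbatch call); return the job ID."""
            out = subprocess.run(
                ["sbatch", "--nodes=%d" % nodes, "--time=%s" % walltime, batch_script],
                capture_output=True, text=True, check=True)
            return out.stdout.split()[-1]  # sbatch prints "Submitted batch job <id>"

        def checkpoint_workflow(coord_host, coord_port):
            """Checkpoint all running workflow tasks via a DMTCP-style coordinator."""
            subprocess.run(["dmtcp_command", "--coord-host", coord_host,
                            "--coord-port", str(coord_port), "--checkpoint"], check=True)

        def resize_window(job_id, new_nodes, walltime, coord_host, coord_port):
            """Move the elastic window to a new allocation size: checkpoint the
            running tasks, release the current allocation, and resubmit a restart
            job that resumes the workflow from its checkpoint images."""
            checkpoint_workflow(coord_host, coord_port)
            subprocess.run(["scancel", job_id], check=True)
            return submit_allocation(new_nodes, walltime, "dmtcp_restart_script.sh")

        if __name__ == "__main__":
            # Hypothetical usage: start on 4 nodes, then expand to 16 nodes when a
            # data-intensive stage of the workflow begins.
            job = submit_allocation(4, "02:00:00", "launch_workflow.sh")
            time.sleep(3600)  # stage monitoring / triggering logic would go here
            job = resize_window(job, 16, "02:00:00", "login01", 7779)

    In the paper's terms, the checkpoint-restart step is what allows the resource window to grow or shrink across allocations without the workflow itself being rewritten to handle resizing.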

        Published In

        WORKS '17: Proceedings of the 12th Workshop on Workflows in Support of Large-Scale Science
        November 2017
        87 pages
        ISBN: 9781450351294
        DOI: 10.1145/3150994
        Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 12 November 2017

        Author Tags

        1. HPC systems
        2. elastic resource management
        3. scientific workflows

        Qualifiers

        • Research-article

        Conference

        SC '17

        Acceptance Rates

        WORKS '17 Paper Acceptance Rate: 8 of 25 submissions (32%)
        Overall Acceptance Rate: 30 of 54 submissions (56%)

        Article Metrics

        • Downloads (last 12 months): 133
        • Downloads (last 6 weeks): 16
        Reflects downloads up to 26 Jul 2024

        Cited By

        • (2024) Dynamic Resource Management for Elastic Scientific Workflows using PMIx. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 686-695. DOI: 10.1109/IPDPSW63119.2024.00131. Online publication date: 27-May-2024.
        • (2023) CarbonScaler: Leveraging Cloud Workload Elasticity for Optimizing Carbon-Efficiency. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7(3), 1-28. DOI: 10.1145/3626788. Online publication date: 7-Dec-2023.
        • (2023) Beacons: An End-to-End Compiler Framework for Predicting and Utilizing Dynamic Loop Characteristics. Proceedings of the ACM on Programming Languages, 7(OOPSLA2), 173-203. DOI: 10.1145/3622803. Online publication date: 16-Oct-2023.
        • (2023) Towards elastic in situ analysis for high-performance computing simulations. Journal of Parallel and Distributed Computing, 177, 106-116. DOI: 10.1016/j.jpdc.2023.02.014. Online publication date: Jul-2023.
        • (2023) Adaptive elasticity policies for staging-based in situ visualization. Future Generation Computer Systems, 142, 75-89. DOI: 10.1016/j.future.2022.12.010. Online publication date: May-2023.
        • (2022) An End-to-end and Adaptive I/O Optimization Tool for Modern HPC Storage Systems. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 1294-1304. DOI: 10.1109/IPDPS53621.2022.00128. Online publication date: May-2022.
        • (2022) Colza: Enabling Elastic In Situ Visualization for High-performance Computing Simulations. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 538-548. DOI: 10.1109/IPDPS53621.2022.00059. Online publication date: May-2022.
        • (2022) Adaptive parallel applications: from shared memory architectures to fog computing (2002–2022). Cluster Computing, 25(6), 4439-4461. DOI: 10.1007/s10586-022-03692-2. Online publication date: 2-Aug-2022.
        • (2021) HatRPC. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1145/3458817.3476191. Online publication date: 14-Nov-2021.
        • (2021) An Adaptive Elasticity Policy For Staging Based In-Situ Processing. 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS), 33-41. DOI: 10.1109/WORKS54523.2021.00010. Online publication date: Dec-2021.
