DOI: 10.1145/3078597.3078604

Enabling Workflow-Aware Scheduling on HPC Systems

Published: 26 June 2017

Abstract

Scientific workflows are increasingly common in the workloads of current High Performance Computing (HPC) systems. However, HPC schedulers do not incorporate workflow-specific mechanisms beyond the capacity to declare dependencies between jobs. Thus, workflows are run as sets of batch jobs with dependencies, which induces long intermediate wait times and, consequently, long workflow turnaround times. Alternatively, to reduce their turnaround time, workflows may be submitted as single pilot jobs that are allocated their maximum required resources for their entire runtime. Pilot jobs achieve shorter turnaround times but reduce the HPC system's utilization because resources may idle during the workflow's execution. We present a workflow-aware scheduling (WoAS) system that enables existing scheduling algorithms to exploit fine-grained information on a workflow's resource requirements and structure without modification. The current implementation of WoAS is integrated into Slurm, a widely used HPC batch scheduler. We evaluate the system in a simulator with real and synthetic workflows and a synthetic baseline workload that captures job patterns observed in three years of workload data from Edison, a large supercomputer hosted at the National Energy Research Scientific Computing Center. Our results show that WoAS reduces workflow turnaround times and improves system utilization without significantly slowing down conventional jobs.
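
The trade-off between the two submission styles described in the abstract can be illustrated with a small back-of-the-envelope model. This is only a sketch under stated assumptions, not the paper's simulator or the WoAS implementation: the `Stage` type, the three-stage workflow, and the fixed 4-hour queue wait are all hypothetical, and the model assumes stages run strictly one after another and that a pilot job waits in the queue as long as any other job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One workflow stage (hypothetical toy type, not from the paper)."""
    cores: int
    runtime_h: float


def chained_jobs(stages, queue_wait_h):
    """Each stage is its own batch job, so every stage waits in the queue."""
    turnaround = sum(queue_wait_h + s.runtime_h for s in stages)
    allocated = sum(s.cores * s.runtime_h for s in stages)
    return turnaround, allocated


def pilot_job(stages, queue_wait_h):
    """One job holds the peak core count for the whole workflow runtime."""
    runtime = sum(s.runtime_h for s in stages)
    peak = max(s.cores for s in stages)
    return queue_wait_h + runtime, peak * runtime


# Assumed workflow: wide stage, narrow stage, wide stage.
stages = [Stage(480, 1.0), Stage(24, 2.0), Stage(480, 1.0)]
wait_h = 4.0  # assumed constant queue wait per submitted job

used = sum(s.cores * s.runtime_h for s in stages)
chain_turnaround, chain_alloc = chained_jobs(stages, wait_h)
pilot_turnaround, pilot_alloc = pilot_job(stages, wait_h)

print(f"chained: turnaround {chain_turnaround} h, idle {chain_alloc - used} core-h")
print(f"pilot:   turnaround {pilot_turnaround} h, idle {pilot_alloc - used} core-h")
```

In this toy setting the chained submission pays the queue wait three times (16 h turnaround, no idle allocation), while the pilot job pays it once (8 h turnaround) but leaves 912 of its 1920 allocated core-hours idle during the narrow stage. The abstract's claim is that a workflow-aware scheduler can aim for the shorter turnaround without the idle allocation.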


Published In

HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
June 2017
254 pages
ISBN:9781450346993
DOI:10.1145/3078597
© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. high performance computing
  2. hpc
  3. scheduling
  4. simulation
  5. slurm
  6. workflows

Qualifiers

  • Research-article

Conference

HPDC '17
Acceptance Rates

HPDC '17 Paper Acceptance Rate 19 of 100 submissions, 19%;
Overall Acceptance Rate 166 of 966 submissions, 17%

Article Metrics

  • Downloads (Last 12 months): 221
  • Downloads (Last 6 weeks): 25
Reflects downloads up to 04 Oct 2024

Cited By

  • (2024) StarShip: Mitigating I/O Bottlenecks in Serverless Computing for Scientific Workflows. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8:1 (1-29). DOI: 10.1145/3639028. Online publication date: 21-Feb-2024
  • (2024) Adaptive Task-Oriented Resource Allocation for Large Dynamic Workflows on Opportunistic Resources. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (300-311). DOI: 10.1109/IPDPS57955.2024.00034. Online publication date: 27-May-2024
  • (2023) How Workflow Engines Should Talk to Resource Managers: A Proposal for a Common Workflow Scheduling Interface. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (166-179). DOI: 10.1109/CCGrid57682.2023.00025. Online publication date: May-2023
  • (2022) DayDream. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (1-18). DOI: 10.5555/3571885.3571914. Online publication date: 13-Nov-2022
  • (2022) Mashup. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (46-60). DOI: 10.1145/3503221.3508407. Online publication date: 2-Apr-2022
  • (2022) DayDream: Executing Dynamic Scientific Workflows on Serverless Platforms with Hot Starts. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (1-18). DOI: 10.1109/SC41404.2022.00027. Online publication date: Nov-2022
  • (2022) Towards Advanced Monitoring for Scientific Workflows. 2022 IEEE International Conference on Big Data (Big Data) (2709-2715). DOI: 10.1109/BigData55660.2022.10020864. Online publication date: 17-Dec-2022
  • (2021) Not All Tasks Are Created Equal: Adaptive Resource Allocation for Heterogeneous Tasks in Dynamic Workflows. 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS) (17-24). DOI: 10.1109/WORKS54523.2021.00008. Online publication date: Nov-2021
  • (2021) Joint Task Scheduling and Containerizing for Efficient Edge Computing. IEEE Transactions on Parallel and Distributed Systems 32:8 (2086-2100). DOI: 10.1109/TPDS.2021.3059447. Online publication date: 1-Aug-2021
  • (2020) InferLine. Proceedings of the 11th ACM Symposium on Cloud Computing (477-491). DOI: 10.1145/3419111.3421285. Online publication date: 12-Oct-2020
