research-article

Enabling Workflow-Aware Scheduling on HPC Systems

Authors:

Gonzalo P. Rodrigo, Per-Olov Östberg, Lavanya Ramakrishnan

HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing

Pages 3 - 14

https://doi.org/10.1145/3078597.3078604

Published: 26 June 2017

Abstract

Scientific workflows are increasingly common in the workloads of current High Performance Computing (HPC) systems. However, HPC schedulers do not incorporate workflow-specific mechanisms beyond the capacity to declare dependencies between jobs. Thus, workflows are run as sets of batch jobs with dependencies, which induces long intermediate wait times and, consequently, long workflow turnaround times. Alternatively, to reduce their turnaround time, workflows may be submitted as single pilot jobs that are allocated their maximum required resources for their entire runtime. Pilot jobs achieve shorter turnaround times but reduce the HPC system's utilization because resources may idle during the workflow's execution. We present a workflow-aware scheduling (WoAS) system that enables existing scheduling algorithms to exploit fine-grained information on a workflow's resource requirements and structure without modification. The current implementation of WoAS is integrated into Slurm, a widely used HPC batch scheduler. We evaluate the system in a simulator, using real and synthetic workflows and a synthetic baseline workload that captures job patterns observed over three years of workload data from Edison, a large supercomputer hosted at the National Energy Research Scientific Computing Center. Our results show that WoAS reduces workflow turnaround times and improves system utilization without significantly slowing down conventional jobs.
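The trade-off between the two submission strategies described in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not the WoAS algorithm and is not taken from the paper; the `Stage` type, the three-stage workflow, and the fixed queue wait `QUEUE_WAIT_H` are all hypothetical values chosen for illustration. It assumes a linear chain of stages and, as a simplification, the same queue wait for every submitted job regardless of its size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One workflow stage (hypothetical model, not from the paper)."""
    cores: int
    runtime_h: float


def chained_jobs(stages, wait_h):
    """Each stage is a separate batch job that queues after its predecessor ends.

    Returns (turnaround hours, allocated core-hours).
    """
    turnaround = sum(wait_h + s.runtime_h for s in stages)
    allocated = sum(s.cores * s.runtime_h for s in stages)
    return turnaround, allocated


def pilot_job(stages, wait_h):
    """One job holds the peak core count for the whole workflow runtime.

    Returns (turnaround hours, allocated core-hours).
    """
    total_runtime = sum(s.runtime_h for s in stages)
    peak_cores = max(s.cores for s in stages)
    return wait_h + total_runtime, peak_cores * total_runtime


# Hypothetical workflow: a wide middle stage between two narrow ones.
workflow = [
    Stage(cores=64, runtime_h=1.0),
    Stage(cores=1024, runtime_h=0.5),
    Stage(cores=64, runtime_h=2.0),
]
QUEUE_WAIT_H = 4.0  # assumed identical for every job, a simplification

chained = chained_jobs(workflow, QUEUE_WAIT_H)
pilot = pilot_job(workflow, QUEUE_WAIT_H)
idle_core_hours = pilot[1] - chained[1]

print(f"chained jobs: turnaround {chained[0]} h, allocated {chained[1]} core-h")
print(f"pilot job:    turnaround {pilot[0]} h, allocated {pilot[1]} core-h")
print(f"core-hours idling inside the pilot job: {idle_core_hours}")
```

Under these assumed numbers the chained submission pays the queue wait three times, while the pilot job pays it once but keeps 1024 cores allocated during stages that need only 64. In a real system the wait time would also depend on job size and system load, which this sketch ignores.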

Index Terms
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
2. Theory of computation
  1. Design and analysis of algorithms
    1. Approximation algorithms analysis

Published In

HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing

June 2017

254 pages

ISBN:9781450346993

DOI:10.1145/3078597

General Chairs:
Howie Huang, George Washington University, USA
Jon Weissman, University of Minnesota, USA

Program Chairs:
Adriana Iamnitchi, University of South Florida, USA
Alexandru Iosup, Vrije Universiteit Amsterdam and Delft University of Technology, NLD

Copyright © 2017 ACM.

© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

SIGHPC: ACM Special Interest Group on High Performance Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Vetenskapsrådet
Advanced Scientific Computing Research
Swedish Government's strategic effort eSSENCE

Conference

HPDC '17: The 26th International Symposium on High-Performance Parallel and Distributed Computing

June 26 - 30, 2017

Washington, DC, USA

Acceptance Rates

HPDC '17 Paper Acceptance Rate 19 of 100 submissions, 19%;

Overall Acceptance Rate 166 of 966 submissions, 17%
