DOI: 10.1109/INFOCOM.2018.8486422

Online Job Scheduling in Distributed Machine Learning Clusters

Published: 16 April 2018

Abstract

Large-scale distributed machine learning systems are now deployed to support various analytics and intelligence services in IT firms. To train a prediction/inference model, e.g., a deep neural network, on a large dataset, multiple workers run in parallel on partitions of the input data and update shared model parameters. In a shared cluster handling multiple training jobs, a fundamental issue is how to efficiently schedule the jobs and set the number of concurrent workers for each, such that server resources are maximally utilized and model training completes in time. Targeting a distributed machine learning system built on the parameter server framework, we design an online algorithm that schedules arriving jobs and adjusts the numbers of concurrent workers and parameter servers for each job over its course, so as to maximize the overall utility of all jobs, contingent on their completion times. Our design couples a primal-dual framework with efficient dual subroutines, achieving good long-term performance guarantees with polynomial time complexity. We evaluate the online algorithm's practical effectiveness through trace-driven simulation and testbed experiments, which show that it outperforms scheduling algorithms commonly adopted in today's cloud systems.
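To make the primal-dual idea in the abstract concrete, the following is a minimal sketch of how such an online scheduler can be structured: jobs arrive one by one; a greedy dual subroutine picks the cheapest feasible worker slots for a job at the current dual (resource) prices; the job is admitted only if its utility exceeds that dual cost, and the prices on the consumed slots are then raised. Everything below, the class names, the single worker resource, and the multiplicative price-update rule, is an illustrative assumption, not the paper's actual algorithm (which also sizes parameter servers and handles multiple resource types).

```python
# Hedged sketch of an online primal-dual job scheduler, in the spirit of the
# abstract. All names and parameters are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Job:
    arrival: int        # time slot the job arrives
    deadline: int       # last usable slot (utility assumed 0 afterwards)
    epochs: int         # total worker-slots of training work required
    max_workers: int    # cap on concurrent workers
    utility: float      # payoff if the job finishes by its deadline


class PrimalDualScheduler:
    def __init__(self, horizon: int, capacity: int, growth: float = 2.0):
        self.capacity = capacity           # workers available per time slot
        self.used = [0] * horizon          # workers already committed per slot
        self.price = [1e-3] * horizon      # dual price per worker-slot
        self.growth = growth               # multiplicative price-update base

    def dual_subroutine(self, job: Job) -> Tuple[Optional[Dict[int, int]], float]:
        """Greedy stand-in for the dual subroutine: fill the cheapest
        feasible slots in [arrival, deadline] until the work is covered."""
        slots = sorted(range(job.arrival, job.deadline + 1),
                       key=lambda t: self.price[t])
        plan: Dict[int, int] = {}
        remaining, cost = job.epochs, 0.0
        for t in slots:
            w = min(self.capacity - self.used[t], job.max_workers, remaining)
            if w > 0:
                plan[t] = w
                cost += w * self.price[t]
                remaining -= w
            if remaining == 0:
                return plan, cost
        return None, float("inf")   # infeasible before the deadline

    def on_arrival(self, job: Job) -> Optional[Dict[int, int]]:
        plan, cost = self.dual_subroutine(job)
        # Admit only if the job's utility beats its dual (opportunity) cost.
        if plan is None or job.utility <= cost:
            return None
        for t, w in plan.items():
            self.used[t] += w
            # Raise prices on consumed slots so later jobs see the scarcity.
            self.price[t] *= self.growth ** (w / self.capacity)
        return plan


sched = PrimalDualScheduler(horizon=10, capacity=8)
print(sched.on_arrival(Job(arrival=0, deadline=4, epochs=12,
                           max_workers=4, utility=5.0)))
```

The multiplicative price update is the standard device from the online primal-dual literature: prices grow exponentially in utilization, so early jobs see cheap resources while later jobs are admitted only if their utility justifies the accumulated scarcity, which is what yields long-term competitive guarantees.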

Information

Published In

IEEE INFOCOM 2018 - IEEE Conference on Computer Communications
Apr 2018
2776 pages

Publisher

IEEE Press

Qualifiers

  • Research-article

Cited By

  • CyberStar. In Proceedings of the 2024 USENIX Annual Technical Conference, pp. 227–246, 10 July 2024. DOI: 10.5555/3691992.3692006
  • Training Job Placement in Clusters with Statistical In-Network Aggregation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 420–434, 27 April 2024. DOI: 10.1145/3617232.3624863
  • Byzantine Machine Learning: A Primer. ACM Computing Surveys 56(7), pp. 1–39, 9 April 2024. DOI: 10.1145/3616537
  • Embracing Uncertainty for Equity in Resource Allocation in ML Training. In Proceedings of the 52nd International Conference on Parallel Processing, pp. 423–432, 7 August 2023. DOI: 10.1145/3605573.3605583
  • Asynchronous Decentralized Online Learning. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 20185–20196, 6 December 2021. DOI: 10.5555/3540261.3541805
