DOI: 10.1109/INFOCOM.2018.8486422

Online Job Scheduling in Distributed Machine Learning Clusters

Published: 16 April 2018

Abstract

Large-scale distributed machine learning systems are now deployed to support various analytics and intelligence services in IT firms. To train a prediction/inference model, e.g., a deep neural network, on a large dataset, multiple workers run in parallel on partitions of the input data and update shared model parameters. In a shared cluster handling multiple training jobs, a fundamental issue is how to efficiently schedule the jobs and set the number of concurrent workers for each, such that server resources are maximally utilized and model training completes in time. Targeting a distributed machine learning system built on the parameter server framework, we design an online algorithm that schedules arriving jobs and adjusts the numbers of concurrent workers and parameter servers for each job over its course, so as to maximize the overall utility of all jobs, contingent on their completion times. Our design couples a primal-dual framework with efficient dual subroutines, achieving good long-term performance guarantees with polynomial time complexity. We evaluate the online algorithm's practical effectiveness through trace-driven simulation and testbed experiments, which show that it outperforms scheduling algorithms commonly adopted in today's cloud systems.
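To make the primal-dual idea in the abstract concrete, the following is a minimal sketch of how such an online scheduler can be structured: jobs arrive one by one; a greedy dual subroutine picks the cheapest feasible worker slots for a job at the current dual (resource) prices; the job is admitted only if its utility exceeds that dual cost, and the prices on the consumed slots are then raised. Everything below, the class names, the single worker resource, and the multiplicative price-update rule, is an illustrative assumption, not the paper's actual algorithm (which also sizes parameter servers and handles multiple resource types).

```python
# Hedged sketch of an online primal-dual job scheduler, in the spirit of the
# abstract. All names and parameters are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Job:
    arrival: int        # time slot the job arrives
    deadline: int       # last usable slot (utility assumed 0 afterwards)
    epochs: int         # total worker-slots of training work required
    max_workers: int    # cap on concurrent workers
    utility: float      # payoff if the job finishes by its deadline


class PrimalDualScheduler:
    def __init__(self, horizon: int, capacity: int, growth: float = 2.0):
        self.capacity = capacity           # workers available per time slot
        self.used = [0] * horizon          # workers already committed per slot
        self.price = [1e-3] * horizon      # dual price per worker-slot
        self.growth = growth               # multiplicative price-update base

    def dual_subroutine(self, job: Job) -> Tuple[Optional[Dict[int, int]], float]:
        """Greedy stand-in for the dual subroutine: fill the cheapest
        feasible slots in [arrival, deadline] until the work is covered."""
        slots = sorted(range(job.arrival, job.deadline + 1),
                       key=lambda t: self.price[t])
        plan: Dict[int, int] = {}
        remaining, cost = job.epochs, 0.0
        for t in slots:
            w = min(self.capacity - self.used[t], job.max_workers, remaining)
            if w > 0:
                plan[t] = w
                cost += w * self.price[t]
                remaining -= w
            if remaining == 0:
                return plan, cost
        return None, float("inf")   # infeasible before the deadline

    def on_arrival(self, job: Job) -> Optional[Dict[int, int]]:
        plan, cost = self.dual_subroutine(job)
        # Admit only if the job's utility beats its dual (opportunity) cost.
        if plan is None or job.utility <= cost:
            return None
        for t, w in plan.items():
            self.used[t] += w
            # Raise prices on consumed slots so later jobs see the scarcity.
            self.price[t] *= self.growth ** (w / self.capacity)
        return plan


sched = PrimalDualScheduler(horizon=10, capacity=8)
print(sched.on_arrival(Job(arrival=0, deadline=4, epochs=12,
                           max_workers=4, utility=5.0)))
```

The multiplicative price update is the standard device from the online primal-dual literature: prices grow exponentially in utilization, so early jobs see cheap resources while later jobs are admitted only if their utility justifies the accumulated scarcity, which is what yields long-term competitive guarantees.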

Information

Published In

IEEE INFOCOM 2018 - IEEE Conference on Computer Communications
Apr 2018
2776 pages

Publisher

IEEE Press

Qualifiers

  • Research-article

Cited By

  • CyberStar. In Proceedings of the 2024 USENIX Annual Technical Conference, pp. 227–246, 10 July 2024. DOI: 10.5555/3691992.3692006
  • Training Job Placement in Clusters with Statistical In-Network Aggregation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 420–434, 27 April 2024. DOI: 10.1145/3617232.3624863
  • Byzantine Machine Learning: A Primer. ACM Computing Surveys 56(7), pp. 1–39, 9 April 2024. DOI: 10.1145/3616537
  • Embracing Uncertainty for Equity in Resource Allocation in ML Training. In Proceedings of the 52nd International Conference on Parallel Processing, pp. 423–432, 7 August 2023. DOI: 10.1145/3605573.3605583
  • Asynchronous Decentralized Online Learning. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 20185–20196, 6 December 2021. DOI: 10.5555/3540261.3541805
