DOI: 10.1145/3605573.3605583
Research Article · Open Access

Embracing Uncertainty for Equity in Resource Allocation in ML Training

Published: 13 September 2023

Abstract

To reduce Deep Learning (DL) model training time, and hence resource consumption, it is critical to avoid stragglers. However, the dynamic and uncertain nature of resource availability makes stragglers hard to avoid. To handle this challenge, we propose a Straggler-Avoiding job Scheduling approach (SAS), which ensures that the tasks of a job receive resources with similar dynamics and uncertainty, so that the tasks complete at approximately the same time. Specifically, SAS uses an ML method to predict the amount of resources that will be available at future times, together with the probability of each prediction; it then groups nodes with similar predicted availability and probability, and assigns each job to one node group with the objective of minimizing job completion time (JCT). To reduce decision-making time, we also propose a reinforcement learning (RL) based scheduling approach (SAS-RL) that assigns each job to a node group. In addition, we propose a distributed parameter server (PS) load reassignment method to handle PS stragglers. Our trace-driven real-world experiments show that SAS reduces JCT by up to 45% and the number of stragglers by up to 63% compared with existing job schedulers, and our PS load reassignment method reduces JCT by up to 48% compared with a previous PS load distribution scheme.
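
To make the scheduling idea concrete, the following is a minimal Python sketch of the grouping-and-assignment step described above. It is an illustration under stated assumptions, not the paper's implementation: k-means stands in for SAS's node-grouping method, the pessimistic throughput model (predicted mean minus a multiple of the prediction's standard deviation) is an invented stand-in for the paper's probabilistic JCT objective, and all function and variable names are hypothetical.

# Minimal sketch (illustrative only): predict each node's future resource
# availability with an uncertainty estimate, cluster nodes whose predictions
# are similar, and place a job on the group minimizing its estimated JCT.
import numpy as np
from sklearn.cluster import KMeans

def group_nodes(pred_mean, pred_std, n_groups=4):
    """Cluster nodes by predicted availability and prediction uncertainty."""
    features = np.column_stack([pred_mean, pred_std])
    return KMeans(n_clusters=n_groups, n_init=10).fit_predict(features)

def estimate_jct(job_work, group_mean, group_std, penalty=1.0):
    """Hypothetical JCT model: pessimistic per-node throughput is the
    predicted mean minus penalty * std; with synchronous training, the
    slowest node in the group bounds the job."""
    effective = np.maximum(group_mean - penalty * group_std, 1e-6)
    return job_work / effective.min()

def schedule(job_work, pred_mean, pred_std, labels):
    """Assign the job to the node group with the smallest estimated JCT."""
    best_group, best_jct = None, float("inf")
    for g in np.unique(labels):
        jct = estimate_jct(job_work, pred_mean[labels == g],
                           pred_std[labels == g])
        if jct < best_jct:
            best_group, best_jct = g, jct
    return best_group, best_jct

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred_mean = rng.uniform(0.3, 1.0, size=16)  # predicted available capacity
    pred_std = rng.uniform(0.01, 0.2, size=16)  # uncertainty of each prediction
    labels = group_nodes(pred_mean, pred_std)
    print(schedule(100.0, pred_mean, pred_std, labels))

In this sketch the slowest node in a group bounds the job's step time, which mirrors why grouping nodes with similar availability matters: mixing one low-availability node into an otherwise fast group drags a synchronous job down to the straggler's pace.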


Cited By

  • InArt: In-Network Aggregation with Route Selection for Accelerating Distributed Training. In Proceedings of the ACM Web Conference 2024, 2879-2889. DOI: 10.1145/3589334.3645394. Published 13 May 2024.
  • Proactive, Accuracy-aware Straggler Mitigation in Machine Learning Clusters. In 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 1196-1198. DOI: 10.1109/IPDPSW63119.2024.00204. Published 27 May 2024.


Published In

ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing
August 2023
858 pages
ISBN:9798400708435
DOI:10.1145/3605573
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 September 2023




Conference

ICPP 2023: 52nd International Conference on Parallel Processing
August 7-10, 2023
Salt Lake City, UT, USA

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

