DOI: 10.1145/3605573.3605583
Research Article · Open Access

Embracing Uncertainty for Equity in Resource Allocation in ML Training

Published: 13 September 2023

Abstract

To reduce Deep Learning (DL) model training time, and hence resource consumption, it is critical to avoid stragglers. However, the dynamic and uncertain nature of resource availability makes stragglers hard to avoid. To handle this challenge, we propose a Straggler-Avoiding job Scheduling approach (SAS), which ensures that the tasks of a job receive resources with similar dynamics and uncertainty, so that the tasks complete at approximately the same time. Specifically, SAS uses an ML method to predict the amount of resources that will be available at future times, together with the probability of each prediction; it then groups nodes with similar predicted availability and probability, and assigns each job to one node group with the objective of minimizing job completion time (JCT). To reduce decision-making time, we also propose a reinforcement learning (RL) based scheduling approach (SAS-RL) that assigns each job to a node group. In addition, we propose a distributed parameter server (PS) load reassignment method to handle PS stragglers. Our trace-driven real-world experiments show that SAS reduces JCT by up to 45% and the number of stragglers by up to 63% compared with existing job schedulers, and our PS load reassignment method reduces JCT by up to 48% compared with a previous PS load distribution scheme.
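
To make the scheduling idea concrete, the following is a minimal Python sketch of the grouping-and-assignment step described above. It is an illustration under stated assumptions, not the paper's implementation: k-means stands in for SAS's node-grouping method, the pessimistic throughput model (predicted mean minus a multiple of the prediction's standard deviation) is an invented stand-in for the paper's probabilistic JCT objective, and all function and variable names are hypothetical.

# Minimal sketch (illustrative only): predict each node's future resource
# availability with an uncertainty estimate, cluster nodes whose predictions
# are similar, and place a job on the group minimizing its estimated JCT.
import numpy as np
from sklearn.cluster import KMeans

def group_nodes(pred_mean, pred_std, n_groups=4):
    """Cluster nodes by predicted availability and prediction uncertainty."""
    features = np.column_stack([pred_mean, pred_std])
    return KMeans(n_clusters=n_groups, n_init=10).fit_predict(features)

def estimate_jct(job_work, group_mean, group_std, penalty=1.0):
    """Hypothetical JCT model: pessimistic per-node throughput is the
    predicted mean minus penalty * std; with synchronous training, the
    slowest node in the group bounds the job."""
    effective = np.maximum(group_mean - penalty * group_std, 1e-6)
    return job_work / effective.min()

def schedule(job_work, pred_mean, pred_std, labels):
    """Assign the job to the node group with the smallest estimated JCT."""
    best_group, best_jct = None, float("inf")
    for g in np.unique(labels):
        jct = estimate_jct(job_work, pred_mean[labels == g],
                           pred_std[labels == g])
        if jct < best_jct:
            best_group, best_jct = g, jct
    return best_group, best_jct

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred_mean = rng.uniform(0.3, 1.0, size=16)  # predicted available capacity
    pred_std = rng.uniform(0.01, 0.2, size=16)  # uncertainty of each prediction
    labels = group_nodes(pred_mean, pred_std)
    print(schedule(100.0, pred_mean, pred_std, labels))

In this sketch the slowest node in a group bounds the job's step time, which mirrors why grouping nodes with similar availability matters: mixing one low-availability node into an otherwise fast group drags a synchronous job down to the straggler's pace.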


Cited By

  • InArt: In-Network Aggregation with Route Selection for Accelerating Distributed Training. In Proceedings of the ACM Web Conference 2024, 2879-2889. DOI: 10.1145/3589334.3645394. Published 13 May 2024.
  • Proactive, Accuracy-aware Straggler Mitigation in Machine Learning Clusters. In 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 1196-1198. DOI: 10.1109/IPDPSW63119.2024.00204. Published 27 May 2024.


Published In

ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing
August 2023
858 pages
ISBN:9798400708435
DOI:10.1145/3605573
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 September 2023




Conference

ICPP 2023: 52nd International Conference on Parallel Processing
August 7-10, 2023
Salt Lake City, UT, USA

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

