
Optimizing distributed training deployment in heterogeneous GPU clusters

Published: 24 November 2020
DOI: 10.1145/3386367.3432728

Editorial Notes

The authors have requested minor, non-substantive changes to the VoR and, in accordance with ACM policies, a Corrected VoR was published on December 7, 2020. For reference purposes the VoR may still be accessed via the Supplemental Material section on this page.

Abstract

This paper proposes HeteroG, an automatic module to accelerate deep neural network training in heterogeneous GPU clusters. To train a deep learning model with large amounts of data, distributed training using data or model parallelism has been widely adopted, mostly over homogeneous devices (GPUs, network bandwidth). Heterogeneous training environments often arise in shared clusters, with GPUs of different models purchased in different batches and network connections of differing bandwidth availability (e.g., due to contention). Classic data parallelism does not work well in a heterogeneous cluster, while model-parallel training is hard to plan. HeteroG enables highly efficient distributed training over heterogeneous devices by automatically converting a single-GPU training model to a distributed one according to the deep learning graph and the available resources. HeteroG embraces operation-level hybrid parallelism, communication architecture selection and execution scheduling, based on a carefully designed strategy framework that exploits both GNN-based learning and combinatorial optimization. We compare HeteroG with existing parallelism schemes and show that it achieves up to 222% training speed-up. HeteroG also enables efficient training of large models over a set of heterogeneous devices where simple parallelism is infeasible.
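Illustrative example (not from the paper). The minimal Python sketch below is our own and purely hypothetical: it shows, in the simplest data-parallel setting, two of the decisions the abstract refers to, namely sizing each device's share of the global batch according to its measured throughput, and choosing a gradient-aggregation architecture from a rough communication-cost estimate. The function names, device numbers, and cost formulas are assumptions made for the example; HeteroG itself plans at the level of individual graph operations using GNN-based learning combined with combinatorial optimization.

# Hedged sketch, not HeteroG's algorithm: heterogeneity-aware batch splitting and
# a crude comparison of two gradient-aggregation architectures. All names and
# numbers are illustrative assumptions.

def split_batch(global_batch, throughputs):
    """Assign per-GPU batch sizes proportional to measured samples/sec."""
    total = sum(throughputs)
    shares = [int(global_batch * t / total) for t in throughputs]
    # Hand any rounding remainder to the fastest device.
    shares[throughputs.index(max(throughputs))] += global_batch - sum(shares)
    return shares

def pick_aggregation(model_bytes, n_workers, worst_link_gbps, ps_link_gbps):
    """Rough per-step sync-time estimates: ring all-reduce vs. one parameter server."""
    bytes_per_sec_ring = worst_link_gbps * 1e9 / 8   # slowest link bounds the ring
    bytes_per_sec_ps = ps_link_gbps * 1e9 / 8        # PS link carries all traffic
    ring = 2 * (n_workers - 1) / n_workers * model_bytes / bytes_per_sec_ring
    ps = 2 * n_workers * model_bytes / bytes_per_sec_ps
    return ("allreduce", ring) if ring <= ps else ("parameter_server", ps)

if __name__ == "__main__":
    # Hypothetical mixed cluster: two fast GPUs and two slower ones.
    throughputs = [400.0, 400.0, 120.0, 120.0]       # samples/sec, measured offline
    print(split_batch(1024, throughputs))            # per-GPU batch sizes
    print(pick_aggregation(400e6, 4, 10.0, 25.0))    # (architecture, est. seconds)

In a heterogeneity-aware deployment, proportional batch sizing keeps per-step compute times roughly aligned so fast GPUs do not idle waiting for slow ones, while the aggregation choice trades the slowest ring link against the fan-in at a central server; HeteroG makes analogous but much finer-grained choices per operation and per tensor.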

Supplementary Material

3432728-vor (3432728-vor.pdf)
Version of Record for "Optimizing distributed training deployment in heterogeneous GPU clusters" by Yi et al., Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies (CoNEXT '20).
MP4 File (3386367.3432728.mp4)
Optimizing Distributed Training Deployment in Heterogeneous GPU Clusters




      Published In

      CoNEXT '20: Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies
      November 2020
      585 pages
      ISBN:9781450379489
      DOI:10.1145/3386367

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. deep learning
      2. distributed training
      3. heterogeneous environment

      Qualifiers

      • Research-article

      Conference

      CoNEXT '20

      Acceptance Rates

      Overall Acceptance Rate 198 of 789 submissions, 25%

      Article Metrics

      • Downloads (last 12 months): 221
      • Downloads (last 6 weeks): 23
      Reflects downloads up to 22 Jan 2025

      Cited By

      • (2024) Metis. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, 563-578. DOI: 10.5555/3691992.3692027. Online publication date: 10-Jul-2024.
      • (2024) Cannikin: Optimal Adaptive Distributed DNN Training over Heterogeneous Clusters. Proceedings of the 25th International Middleware Conference, 299-312. DOI: 10.1145/3652892.3700767. Online publication date: 2-Dec-2024.
      • (2024) HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis. Proceedings of the Nineteenth European Conference on Computer Systems, 524-541. DOI: 10.1145/3627703.3629580. Online publication date: 22-Apr-2024.
      • (2024) Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 499-513. DOI: 10.1145/3620665.3640375. Online publication date: 27-Apr-2024.
      • (2024) Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration in Heterogeneous Systems. IEEE/ACM Transactions on Networking 32(5), 3715-3729. DOI: 10.1109/TNET.2024.3415089. Online publication date: Oct-2024.
      • (2024) Deep Learning Acceleration Optimization of Stress Boundary Value Problem Solvers. IEEE Transactions on Computers 73(12), 2844-2854. DOI: 10.1109/TC.2024.3441828. Online publication date: 1-Dec-2024.
      • (2024) BalanceNet Orchestrator: A KQV-based Dynamic Task Allocation for Distributed Deep Learning. 2024 International Conference on Information Networking (ICOIN), 385-390. DOI: 10.1109/ICOIN59985.2024.10572141. Online publication date: 17-Jan-2024.
      • (2023) Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. Proceedings of the VLDB Endowment 17(2), 211-224. DOI: 10.14778/3626292.3626303. Online publication date: 1-Oct-2023.
      • (2023) Lyra: Elastic Scheduling for Deep Learning Clusters. Proceedings of the Eighteenth European Conference on Computer Systems, 835-850. DOI: 10.1145/3552326.3587445. Online publication date: 8-May-2023.
      • (2023) Expediting Distributed DNN Training With Device Topology-Aware Graph Deployment. IEEE Transactions on Parallel and Distributed Systems 34(4), 1281-1293. DOI: 10.1109/TPDS.2023.3243261. Online publication date: Apr-2023.
