Hoplite: efficient and fault-tolerant collective communication for task-based distributed systems

Authors:

Ion Stoica

SIGCOMM '21: Proceedings of the 2021 ACM SIGCOMM 2021 Conference

Pages 641 - 656

https://doi.org/10.1145/3452296.3472897

Published: 09 August 2021

Abstract

Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving. As more data-intensive applications move to run on top of task-based systems, collective communication efficiency has become an important problem. Unfortunately, traditional collective communication libraries (e.g., MPI, Horovod, NCCL) are an ill fit, because they require the communication schedule to be known before runtime and they do not provide fault tolerance.

We design and implement Hoplite, an efficient and fault-tolerant collective communication layer for task-based distributed systems. Our key technique is to compute data transfer schedules on the fly and execute the schedules efficiently through fine-grained pipelining. At the same time, when a task fails, the data transfer schedule adapts quickly to allow other tasks to keep making progress. We apply Hoplite to a popular task-based distributed framework, Ray. We show that Hoplite speeds up three workloads that are difficult to execute efficiently with traditional collective communication: asynchronous stochastic gradient descent, reinforcement learning, and serving an ensemble of machine learning models, by up to 7.8x, 3.9x, and 3.3x, respectively.
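
To illustrate why fine-grained pipelining can matter for a transfer that passes through several nodes, the following is a minimal back-of-the-envelope sketch. It is not Hoplite's scheduling algorithm or API: the function names, the chain-of-nodes setup, and the uniform-bandwidth, zero-latency cost model are all assumptions made for illustration.

```python
import math


def store_and_forward_time(size_bytes, hops, bytes_per_sec):
    """Hypothetical model: each hop waits for the whole object before forwarding it."""
    return hops * size_bytes / bytes_per_sec


def pipelined_time(size_bytes, hops, bytes_per_sec, chunk_bytes):
    """Hypothetical model: each hop forwards a chunk as soon as it has received it.

    Assumes every chunk costs the same to send, so this is an upper bound
    when the last chunk is smaller than chunk_bytes.
    """
    chunks = math.ceil(size_bytes / chunk_bytes)
    per_chunk = chunk_bytes / bytes_per_sec
    # The first chunk needs `hops` steps to reach the end of the chain;
    # every later chunk arrives one step after the previous one.
    return (hops + chunks - 1) * per_chunk


if __name__ == "__main__":
    size, hops, bandwidth = 1000, 4, 100
    print("store-and-forward:", store_and_forward_time(size, hops, bandwidth))
    print("pipelined, 10-byte chunks:", pipelined_time(size, hops, bandwidth, 10))
```

Under this simplified model, smaller chunks bring the end-to-end time closer to that of a single hop, while a chunk as large as the whole object degenerates to store-and-forward. A real system would also have to account for per-chunk overhead, which the model ignores.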

Supplementary Material

chen-public-reiview (73-public-review.pdf)

Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems: Public Review

56.73 KB

MP4 File (video-presentation.mp4)

Conference Presentation Video

95.11 MB

MP4 File (video-long.mp4)

Long Version Video

94.12 MB
