DOI: 10.1145/3652892.3700781
Research article
Open access

Deep Optimizer States: Towards Scalable Training of Transformer Models using Interleaved Offloading

Published: 02 December 2024

Abstract

Transformers and large language models (LLMs) have seen rapid adoption across all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, training transformers is very expensive and often hits a "memory wall": even when 3D parallelism (pipeline, tensor, data) is used to aggregate the memory of many GPUs, it is still not enough to hold the necessary data structures (model parameters, optimizer state, gradients, activations) in GPU memory. To compensate, state-of-the-art approaches offload the optimizer state, at least partially, to the host memory and perform hybrid CPU-GPU computations. However, the management of the combined host-GPU memory is often suboptimal and results in poor overlapping between data movements and computations. This leads to missed opportunities to simultaneously leverage the interconnect bandwidth and the computational capabilities of CPUs and GPUs. In this paper, we leverage a key observation that the interleaving of the forward, backward, and update phases generates fluctuations in GPU memory utilization, which can be exploited to dynamically move a part of the optimizer state between the host and the GPU memory at each iteration. To this end, we design and implement Deep Optimizer States, a novel technique that splits the LLM into subgroups whose update phase is scheduled on either the CPU or the GPU based on our proposed performance model, which addresses the trade-off between the data movement cost, the acceleration on GPUs versus CPUs, and the competition for shared resources. We integrate our approach with DeepSpeed and demonstrate, through extensive experiments, 2.5× faster iterations than state-of-the-art approaches.
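
The placement decision sketched in the abstract can be illustrated with a small, hypothetical example (this is not the authors' implementation). The snippet below assumes per-subgroup cost estimates and a greedy policy: for each subgroup of the optimizer state, it compares an estimated GPU-side cost (host-device transfers plus the GPU update) against the CPU update cost and picks the cheaper placement, subject to the GPU memory that is transiently free during that iteration. All names, cost terms, and the greedy policy are assumptions made for illustration.

```python
# Hypothetical sketch of scheduling the update phase of optimizer-state
# subgroups on the CPU or the GPU, assuming per-subgroup cost estimates.
from dataclasses import dataclass
from typing import List

@dataclass
class Subgroup:
    num_bytes: int          # size of this subgroup's optimizer state
    gpu_update_time: float  # estimated optimizer step time on the GPU (s)
    cpu_update_time: float  # estimated optimizer step time on the CPU (s)

def schedule_updates(subgroups: List[Subgroup],
                     h2d_bw: float,        # host-to-device bandwidth (bytes/s)
                     d2h_bw: float,        # device-to-host bandwidth (bytes/s)
                     free_gpu_bytes: int   # GPU memory transiently free this iteration
                     ) -> List[str]:
    """Greedily assign each subgroup's update phase to 'gpu' or 'cpu'."""
    placements = []
    for g in subgroups:
        # Moving the subgroup to the GPU and back costs two transfers.
        transfer_time = g.num_bytes / h2d_bw + g.num_bytes / d2h_bw
        gpu_cost = transfer_time + g.gpu_update_time
        if gpu_cost < g.cpu_update_time and g.num_bytes <= free_gpu_bytes:
            placements.append("gpu")
            free_gpu_bytes -= g.num_bytes   # reserve the transient GPU memory
        else:
            placements.append("cpu")
    return placements

# Example: three 4 GiB subgroups, a 25 GB/s link, 8 GiB transiently free.
if __name__ == "__main__":
    groups = [Subgroup(4 << 30, 0.05, 0.60) for _ in range(3)]
    print(schedule_updates(groups, 25e9, 25e9, 8 << 30))  # ['gpu', 'gpu', 'cpu']
```

In this toy run, the first two subgroups fit in the transiently free GPU memory and are cheaper to update on the GPU despite the transfer cost, while the third falls back to the CPU. The actual technique additionally overlaps these data movements with the forward, backward, and update computations and accounts for contention on shared resources, which this sketch ignores.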


        Published In

Middleware '24: Proceedings of the 25th International Middleware Conference
December 2024
515 pages
ISBN: 9798400706233
DOI: 10.1145/3652892
This work is licensed under a Creative Commons Attribution 4.0 International License.

        In-Cooperation

        • IFIP
• USENIX

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. scalable training of large language models
        2. hybrid CPU-GPU I/O performance tuning and middleware
        3. data management for hybrid LLM training
        4. scalable optimization methods for ML

        Conference

Middleware '24: 25th International Middleware Conference
December 2-6, 2024
Hong Kong, Hong Kong

        Acceptance Rates

        Overall Acceptance Rate 203 of 948 submissions, 21%
