DOI: 10.1145/3613424.3614248
Research article · Open access

Grape: Practical and Efficient Graphed Execution for Dynamic Deep Neural Networks on GPUs

Published: 08 December 2023

Abstract

Achieving high performance in machine learning workloads is a crucial yet difficult task. To achieve high runtime performance on hardware platforms such as GPUs, graph-based executions such as CUDA graphs are often used to eliminate CPU runtime overheads by submitting jobs in the granularity of multiple kernels. However, many machine learning workloads, especially dynamic deep neural networks (DNNs) with varying-sized inputs or data-dependent control flows, face challenges when directly using CUDA graphs to achieve optimal performance. We observe that the use of graph-based executions poses three key challenges in terms of efficiency and even practicability: (1) Extra data movements when copying input values to graphs’ placeholders. (2) High GPU memory consumption due to the numerous CUDA graphs created to efficiently support dynamic-shape workloads. (3) Inability to handle data-dependent control flows.
To address those challenges, we propose Grape, a new graph compiler that enables practical and efficient graph-based executions for dynamic DNNs on GPUs. Grape comprises three key components: (1) an alias predictor that automatically removes extra data movements by leveraging code positions at the Python frontend, (2) a metadata compressor that efficiently utilizes the data redundancy in CUDA graphs’ memory regions by compressing them, and (3) a predication rewriter that safely replaces control flows with predication contexts while preserving programs’ semantics. The three components improve the efficiency and broaden the optimization scope of graph-based executions while allowing machine learning practitioners to program dynamic DNNs at the Python level with minimal source code changes.
We evaluate Grape on state-of-the-art text generation (GPT-2, GPT-J) and speech recognition (Wav2Vec2) workloads, which include both training and inference, using real systems with modern GPUs. Our evaluation shows that Grape achieves up to 36.43× less GPU memory consumption and up to 1.26× better performance than prior works on graph-based executions that directly use CUDA graphs. Furthermore, Grape can optimize workloads that are impractical for prior works due to the three key challenges, achieving 1.78× and 1.82× better performance on GPT-J and Wav2Vec2, respectively, than the original implementations that do not use graph-based executions.
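To make the placeholder issue concrete, the following minimal sketch is an editor's illustration rather than code from the paper; it assumes PyTorch 1.10 or newer with a CUDA device and uses the public torch.cuda.CUDAGraph API. A model is captured once into a CUDA graph, and every later batch must be copied into the graph's static input tensor before replay, which is the extra data movement named in challenge (1).

    # Editor's sketch (not from the paper): assumes PyTorch >= 1.10 and a CUDA device.
    import torch

    device = torch.device("cuda")
    model = torch.nn.Linear(1024, 1024).to(device).eval()

    # Placeholder (static) input tensor that the captured graph always reads from.
    static_input = torch.zeros(8, 1024, device=device)

    # Warm up on a side stream before capture (recommended in PyTorch's CUDA graph examples).
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    # Capture the forward pass once; all launched kernels are recorded into one graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph), torch.no_grad():
        static_output = model(static_input)

    # Replay: every new input is first copied into the placeholder.
    new_batch = torch.randn(8, 1024, device=device)
    static_input.copy_(new_batch)   # device-to-device copy into the placeholder
    graph.replay()                  # one CPU-side launch replays all captured kernels
    result = static_output.clone()  # read results out of the static output buffer

Challenge (3) exists because a data-dependent Python branch, e.g. "if (score > threshold).item(): ...", forces a CPU-GPU synchronization and cannot be recorded into such a graph. As a loose analogy to what a predication rewriter does, sketched here with hypothetical branch bodies f and g, both sides can be evaluated and the result selected with a predicate so that control flow never leaves the GPU:

    cond = score > threshold             # boolean tensor; never read on the CPU
    x = torch.where(cond, f(x), g(x))    # predicated select instead of a branch

Grape's predication contexts are more involved than this torch.where analogy, but the principle is the same: turning branches into data keeps the whole region capturable as a single graph.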


        Published In

        MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture
        October 2023
        1528 pages
        ISBN: 9798400703294
        DOI: 10.1145/3613424
        This work is licensed under a Creative Commons Attribution 4.0 International License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. CUDA graphs
        2. dynamic neural networks
        3. machine learning compilers

        Funding Sources

        • NSERC grant

        Conference

        MICRO '23
        Acceptance Rates

        Overall Acceptance Rate 484 of 2,242 submissions, 22%
