DOI: 10.1145/3503222.3507723

AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures

Published: 22 February 2022

Abstract

This work reveals that memory-intensive computation is a rising performance-critical factor in recent machine learning models. Due to a unique set of new challenges, existing ML optimizing compilers cannot perform efficient fusion under complex two-level dependencies combined with just-in-time demand; they face a dilemma between costly fusion with heavy redundant computation and skipping fusion at the cost of a massive number of kernels. Furthermore, they often suffer from low parallelism because they lack support for the irregular tensor shapes common in real-world production workloads. To address these rising challenges, we propose AStitch, a machine learning optimizing compiler that opens a new multi-dimensional optimization space for memory-intensive ML computations. It systematically abstracts four operator-stitching schemes under multi-dimensional optimization objectives, tackles complex computation-graph dependencies with novel hierarchical data reuse, and efficiently handles diverse tensor shapes via adaptive thread mapping. Finally, AStitch provides just-in-time support, incorporating our proposed optimizations, for both ML training and inference. Although AStitch is a stand-alone compiler engine portable to any version of TensorFlow, its basic ideas apply generally to other ML frameworks and optimizing compilers. Experimental results show that AStitch achieves an average 1.84x speedup (up to 2.73x) over Google's state-of-the-art XLA solution across five production workloads. We have also deployed AStitch on a production ML cluster with thousands of GPUs; the system has been in operation for more than 10 months and saves about 20,000 GPU hours across 70,000 tasks per week.
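The fusion dilemma described in the abstract can be made concrete with a small sketch (illustrative only, written in plain Python rather than GPU code; the function names and the three-op chain are hypothetical and do not reproduce AStitch's actual stitching schemes). Executed op by op, a chain of memory-intensive elementwise operators materializes one intermediate tensor per step, so each element is written to and re-read from memory repeatedly; a fused kernel performs the whole chain in a single pass per element, which is the memory-traffic reduction that fusion targets:

```python
# Illustrative sketch of elementwise-op fusion (hypothetical example,
# not AStitch's implementation).

def unfused(xs):
    # Each step stands in for a separate GPU kernel launch that
    # materializes an intermediate tensor in device memory.
    t1 = [v * 2.0 for v in xs]        # kernel 1: read xs, write t1
    t2 = [v + 1.0 for v in t1]        # kernel 2: read t1, write t2
    return [max(v, 0.0) for v in t2]  # kernel 3: read t2, write output

def fused(xs):
    # One pass per element: each input is read once and each output
    # written once -- a single kernel, no intermediate tensors.
    return [max(v * 2.0 + 1.0, 0.0) for v in xs]

# Both variants compute the same result; they differ only in memory traffic.
print(fused([-1.0, 0.0, 3.0]))  # [0.0, 1.0, 7.0]
```

On memory-bound workloads the unfused form pays roughly one extra round trip to memory per intermediate, which is why skipping fusion produces a massive number of slow kernels even though each kernel does little compute.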



      Published In

      ASPLOS '22: Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
      February 2022
      1164 pages
      ISBN:9781450392051
      DOI:10.1145/3503222

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Compiler Optimization
      2. Fusion
      3. Machine Learning
      4. Memory-Intensive Computation

      Qualifiers

      • Research-article

      Conference

      ASPLOS '22

      Acceptance Rates

      Overall Acceptance Rate 535 of 2,713 submissions, 20%

      Article Metrics

• Downloads (last 12 months): 555
• Downloads (last 6 weeks): 59
Reflects downloads up to 09 Nov 2024

      Cited By

• (2024) FreeStencil: A Fine-Grained Solver Compiler with Graph and Kernel Optimizations on Structured Meshes for Modern GPUs. Proceedings of the 53rd International Conference on Parallel Processing, 1022–1031. DOI: 10.1145/3673038.3673076. Published online: 12 Aug 2024.
• (2024) BrickDL: Graph-Level Optimizations for DNNs with Fine-Grained Data Blocking on GPUs. Proceedings of the 53rd International Conference on Parallel Processing, 576–586. DOI: 10.1145/3673038.3673046. Published online: 12 Aug 2024.
• (2024) Accelerated Auto-Tuning of GPU Kernels for Tensor Computations. Proceedings of the 38th ACM International Conference on Supercomputing, 549–561. DOI: 10.1145/3650200.3656626. Published online: 30 May 2024.
• (2024) Optimizing Dynamic-Shape Neural Networks on Accelerators via On-the-Fly Micro-Kernel Polymerization. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 797–812. DOI: 10.1145/3620665.3640390. Published online: 27 Apr 2024.
• (2024) Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 286–301. DOI: 10.1145/3617232.3624858. Published online: 27 Apr 2024.
• (2024) AtRec: Accelerating Recommendation Model Training on CPUs. IEEE Transactions on Parallel and Distributed Systems, 35(6), 905–918. DOI: 10.1109/TPDS.2024.3381186. Published online: Jun 2024.
• (2024) Exploiting Structured Feature and Runtime Isolation for High-Performant Recommendation Serving. IEEE Transactions on Computers, 73(11), 2474–2487. DOI: 10.1109/TC.2024.3449749. Published online: Nov 2024.
• (2024) Enabling Resource-Efficient AIoT System With Cross-Level Optimization: A Survey. IEEE Communications Surveys & Tutorials, 26(1), 389–427. DOI: 10.1109/COMST.2023.3319952. Published online: Sep 2025.
• (2024) Using Mixed Wavefront for Accelerating Convolutional Neural Network Parallelization. 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE), 1336–1342. DOI: 10.1109/CISCE62493.2024.10653109. Published online: 10 May 2024.
• (2024) Non-Invasive, Memory Access-Triggered Near-Data Processing for DNN Training Acceleration on GPUs. IEEE Access, 12, 142651–142667. DOI: 10.1109/ACCESS.2024.3465789. Published online: 2024.
