research-article

Parallelizing Neural Network Models Effectively on GPU by Implementing Reductions Atomically

Authors:

Zhiliang GanAuthors Info & Claims

PACT '22: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques

Pages 451 - 466

https://doi.org/10.1145/3559009.3569656

Published: 27 January 2023 Publication History

Get Access

Abstract

Due to the missing of a good orchestration of loop transformations, existing optimizing compilers for deploying neural networks on GPU either parallelize reductions ineffectively or miss the fusion opportunities with other operators. Neural network models thus exhibit sub-optimal performance on GPU. We present a practical approach called Panamera for the effective parallelization of reductions in neural networks on GPU. Panamera first leverages loop coalescing to flatten the loop dimensions of reductions, converting all reduction operators into canonical forms eligible for the polyhedral model. Next, Panamera uses polyhedral transformations to reduce the data movements caused by unfused reductions and perform multi-block hardware binding not considered by many compilers. Finally, Panamera embeds a highly optimized routine implemented using GPU atomic instructions, further improving the performance of neural network models while guaranteeing the correctness of parallel reductions. The experimental results demonstrate the effectiveness of our approach: for single operators our code obtains a mean speedup of 33.7×, 3.5×, 5.4× and 9.6× over cuDNN, CUB, TVM and Ansor, for sub-graphs our approach outperforms cuDNN, TVM and Ansor by 9.5×, 2.6× and 2.7×, and for end-to-end workloads, a tensor compiler integrated with our approach outperforms them by 122.5%, 19.3% and 15.2%.

References

[1]

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, Savannah, GA, 265--283. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi

Abstract

References

Index Terms

Recommendations

Neural acceleration for GPU throughput processors

Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing

Comments

Information

Published In

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Badges

Author Tags

Qualifiers

Funding Sources

Conference

Acceptance Rates

Upcoming Conference

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Get Access

Login options

Full Access

View options

PDF

eReader

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations