DOI: 10.1145/3497776.3517770

MLIR-based code generation for GPU tensor cores

Published: 18 March 2022
  Abstract

    The state of the art in high-performance deep learning today is driven primarily by libraries that expert programmers develop, optimize, and highly tune by hand using low-level abstractions, with significant effort. This effort is often repeated for similar hardware, and again for future generations of it. In this work, we pursue and evaluate a more modular and reusable approach: using compiler IR infrastructure to generate libraries by encoding all the required optimizations as a sequence of transformations and customized passes on an IR. We believe that until the recent introduction of MLIR (Multi-Level Intermediate Representation), it had been hard to represent and transform computation at various levels of abstraction within a single IR.
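    To make the multi-level idea concrete, here is a minimal sketch (hypothetical function name and shapes; exact op spellings vary across MLIR versions) of a matmul expressed at MLIR's high-level linalg abstraction, the kind of form a lowering pipeline can progressively rewrite through loop- and GPU-level dialects down to tensor-core operations:

        // FP16-accumulate matmul on buffers; an FP32-accumulate variant
        // would use a memref<1024x1024xf32> output buffer instead.
        func.func @matmul(%A: memref<1024x1024xf16>,
                          %B: memref<1024x1024xf16>,
                          %C: memref<1024x1024xf16>) {
          linalg.matmul ins(%A, %B : memref<1024x1024xf16>, memref<1024x1024xf16>)
                        outs(%C : memref<1024x1024xf16>)
          return
        }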
    Using the MLIR infrastructure, we build a transformation and lowering pipeline that automatically generates near-peak-performance code for matrix-matrix multiplication (matmul), as well as for matmul fused with simple pointwise operators, targeting tensor cores on NVIDIA GPUs. On a set of problem sizes ranging from 256 to 16384, our performance evaluation shows that we obtain 0.95× to 1.19× the performance of cuBLAS with FP32 accumulation, and 0.80× to 1.60× with FP16 accumulation, on NVIDIA's Ampere-based GeForce RTX 3090. Furthermore, by allowing the fusion of common pointwise operations with matrix-matrix multiplication, we obtain performance ranging from 0.95× to 1.67× of a cuBLAS-based implementation. Additionally, we present matmul-like examples, such as 3-D contraction and batched matmul, which the pipeline handles efficiently while providing competitive performance. We believe these results motivate further research and engineering on automatic domain-specific library generation using compiler IR infrastructure for similar specialized accelerators.
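    As a hedged illustration of the fused cases (hypothetical names and shapes, not the paper's actual pipeline input; iterator_types and op spellings follow roughly the MLIR of the paper's era), a matmul followed by a pointwise bias addition can be expressed at the same linalg level, and a fusing pipeline can then merge the elementwise loop nest into the matmul kernel's epilogue:

        func.func @matmul_bias(%A: memref<1024x1024xf16>,
                               %B: memref<1024x1024xf16>,
                               %C: memref<1024x1024xf16>,
                               %bias: memref<1024x1024xf16>) {
          linalg.matmul ins(%A, %B : memref<1024x1024xf16>, memref<1024x1024xf16>)
                        outs(%C : memref<1024x1024xf16>)
          // Pointwise consumer: C = C + bias, a fusion candidate.
          linalg.generic {
              indexing_maps = [affine_map<(i, j) -> (i, j)>,
                               affine_map<(i, j) -> (i, j)>],
              iterator_types = ["parallel", "parallel"]}
              ins(%bias : memref<1024x1024xf16>)
              outs(%C : memref<1024x1024xf16>) {
            ^bb0(%b: f16, %c: f16):
              %sum = arith.addf %b, %c : f16
              linalg.yield %sum : f16
          }
          return
        }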


    Cited By

    • (2024) oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation. 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 460-470. https://doi.org/10.1109/CGO57630.2024.10444871. Online publication date: 2-Mar-2024.
    • (2022) Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway Processor. Proceedings of the 51st International Conference on Parallel Processing, 1-12. https://doi.org/10.1145/3545008.3545031. Online publication date: 29-Aug-2022.



      Published In

      CC 2022: Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction
      March 2022
      253 pages
      ISBN:9781450391832
      DOI:10.1145/3497776
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. GPU
      2. MLIR
      3. matrix-matrix multiplication
      4. tensor cores

      Qualifiers

      • Research-article

      Conference

      CC '22

      Article Metrics

      • Downloads (Last 12 months)281
      • Downloads (Last 6 weeks)21

