DOI: 10.1145/3497776.3517770

MLIR-based code generation for GPU tensor cores

Published: 18 March 2022
  Abstract

    The state of the art in high-performance deep learning today is driven primarily by libraries that expert programmers develop, optimize, and highly tune by hand using low-level abstractions, with significant effort. This effort is often repeated for similar hardware, and again for future generations of it. In this work, we pursue and evaluate a more modular and reusable approach: using compiler IR infrastructure to generate libraries by encoding all the required optimizations as a sequence of transformations and customized passes on an IR. We believe that until the recent introduction of MLIR (Multi-Level Intermediate Representation), it had been hard to represent and transform computation at various levels of abstraction within a single IR.
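    To make the multi-level idea concrete, here is a minimal sketch (hypothetical function name and shapes; exact op spellings vary across MLIR versions) of a matmul expressed at MLIR's high-level linalg abstraction, the kind of form a lowering pipeline can progressively rewrite through loop- and GPU-level dialects down to tensor-core operations:

        // FP16-accumulate matmul on buffers; an FP32-accumulate variant
        // would use a memref<1024x1024xf32> output buffer instead.
        func.func @matmul(%A: memref<1024x1024xf16>,
                          %B: memref<1024x1024xf16>,
                          %C: memref<1024x1024xf16>) {
          linalg.matmul ins(%A, %B : memref<1024x1024xf16>, memref<1024x1024xf16>)
                        outs(%C : memref<1024x1024xf16>)
          return
        }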
    Using the MLIR infrastructure, we build a transformation and lowering pipeline that automatically generates near-peak-performance code for matrix-matrix multiplication (matmul), as well as for matmul fused with simple pointwise operators, targeting tensor cores on NVIDIA GPUs. On a set of problem sizes ranging from 256 to 16384, our performance evaluation shows that we obtain 0.95× to 1.19× the performance of cuBLAS with FP32 accumulation, and 0.80× to 1.60× with FP16 accumulation, on NVIDIA's Ampere-based GeForce RTX 3090. Furthermore, by allowing the fusion of common pointwise operations with matrix-matrix multiplication, we obtain performance ranging from 0.95× to 1.67× of a cuBLAS-based implementation. Additionally, we present matmul-like examples, such as 3-D contraction and batched matmul, which the pipeline handles efficiently while providing competitive performance. We believe these results motivate further research and engineering on automatic domain-specific library generation using compiler IR infrastructure for similar specialized accelerators.
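    As a hedged illustration of the fused cases (hypothetical names and shapes, not the paper's actual pipeline input; iterator_types and op spellings follow roughly the MLIR of the paper's era), a matmul followed by a pointwise bias addition can be expressed at the same linalg level, and a fusing pipeline can then merge the elementwise loop nest into the matmul kernel's epilogue:

        func.func @matmul_bias(%A: memref<1024x1024xf16>,
                               %B: memref<1024x1024xf16>,
                               %C: memref<1024x1024xf16>,
                               %bias: memref<1024x1024xf16>) {
          linalg.matmul ins(%A, %B : memref<1024x1024xf16>, memref<1024x1024xf16>)
                        outs(%C : memref<1024x1024xf16>)
          // Pointwise consumer: C = C + bias, a fusion candidate.
          linalg.generic {
              indexing_maps = [affine_map<(i, j) -> (i, j)>,
                               affine_map<(i, j) -> (i, j)>],
              iterator_types = ["parallel", "parallel"]}
              ins(%bias : memref<1024x1024xf16>)
              outs(%C : memref<1024x1024xf16>) {
            ^bb0(%b: f16, %c: f16):
              %sum = arith.addf %b, %c : f16
              linalg.yield %sum : f16
          }
          return
        }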


    Cited By

    • (2024) oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation. 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 460-470. https://doi.org/10.1109/CGO57630.2024.10444871. Online publication date: 2-Mar-2024.
    • (2022) Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway Processor. Proceedings of the 51st International Conference on Parallel Processing, 1-12. https://doi.org/10.1145/3545008.3545031. Online publication date: 29-Aug-2022.



      Published In

      CC 2022: Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction
      March 2022
      253 pages
      ISBN:9781450391832
      DOI:10.1145/3497776
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. GPU
      2. MLIR
      3. matrix-matrix multiplication
      4. tensor cores

      Qualifiers

      • Research-article

      Conference

      CC '22

      Article Metrics

      • Downloads (Last 12 months)281
      • Downloads (Last 6 weeks)21

