DOI: 10.1145/3458817.3476138
Research article (Public Access)

E.T.: re-thinking self-attention for transformer models on GPUs

Published: 13 November 2021

Abstract

Transformer-based deep learning models have become a ubiquitous vehicle for driving a variety of Natural Language Processing (NLP) tasks beyond their accuracy ceiling. However, these models also suffer from two pronounced challenges: gigantic model size and prolonged turnaround time. To this end, we introduce E.T., which rE-thinks self-attention computation for Transformer models on GPUs, with the following contributions. First, we introduce a novel self-attention architecture that encompasses two tailored self-attention operators with corresponding sequence-length-aware optimizations, as well as operation-reordering optimizations. Second, we present an attention-aware pruning design that judiciously applies various pruning algorithms to reduce more computation and hence achieve a significantly shorter turnaround time. For the pruning algorithms, we not only revamp existing pruning algorithms but also tailor new ones for Transformer models. Taken together, we evaluate E.T. across a variety of benchmarks for Transformer, BERT-Base, and DistilBERT, where E.T. delivers superior performance over mainstream projects, including the popular NVIDIA enterprise solutions TensorRT and FasterTransformer.
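
To make the two contributions concrete, the following minimal NumPy sketch (an illustration only, not the paper's E.T. kernels or API) computes standard multi-head self-attention and applies a hypothetical magnitude-based head mask in the spirit of attention-aware pruning. Every name in it (self_attention, head_mask, the median-norm threshold) is an assumption introduced for this example.

    # Illustrative sketch only: multi-head self-attention with a hypothetical
    # magnitude-based head-pruning mask. Not the paper's E.T. implementation.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(x, w_q, w_k, w_v, n_heads, head_mask=None):
        """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_model); head_mask: (n_heads,) of 0/1."""
        seq_len, d_model = x.shape
        d_head = d_model // n_heads
        # Project and split into heads: (n_heads, seq_len, d_head).
        q = (x @ w_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
        k = (x @ w_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
        v = (x @ w_v).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
        # Scaled dot-product attention per head; cost grows with the sequence length.
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq_len, seq_len)
        ctx = softmax(scores, axis=-1) @ v                     # (n_heads, seq_len, d_head)
        if head_mask is not None:
            # Attention-aware pruning (illustrative): zero out heads deemed unimportant.
            ctx = ctx * head_mask[:, None, None]
        return ctx.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads

    # Toy usage: keep only the heads whose query-projection weights have the largest L2 norm.
    rng = np.random.default_rng(0)
    seq_len, d_model, n_heads = 128, 768, 12
    x = rng.standard_normal((seq_len, d_model)).astype(np.float32)
    w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)).astype(np.float32) * 0.02
                     for _ in range(3))
    head_norms = np.linalg.norm(w_q.reshape(d_model, n_heads, -1), axis=(0, 2))
    head_mask = (head_norms >= np.median(head_norms)).astype(np.float32)  # keep roughly half
    out = self_attention(x, w_q, w_k, w_v, n_heads, head_mask)
    print(out.shape)  # (128, 768)

In a real GPU implementation such as the one the paper describes, the pruning decision would presumably be folded into the attention and GEMM kernels rather than applied as a post-hoc mask; the sketch only indicates where sequence length and head importance enter the computation.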

Supplementary Material

MP4 File (Large Scale Neural Network Training_ Part I - ET_ Re-Thinking Self-Attention for Transformer Models on GPUs.mp4)
Presentation video

Information

      Published In

      SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
      November 2021
      1493 pages
ISBN: 9781450384421
DOI: 10.1145/3458817

      In-Cooperation

      • IEEE CS

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 November 2021

      Qualifiers

      • Research-article

      Conference

      SC '21

      Acceptance Rates

      Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

      Article Metrics

• Downloads (Last 12 months): 756
• Downloads (Last 6 weeks): 62
      Reflects downloads up to 02 Feb 2025

      Cited By

• (2024) Accelerating Sparse DNNs Based on Tiled GEMM. IEEE Transactions on Computers 73:5 (1275-1289). 10.1109/TC.2024.3365942. Online publication date: May 2024.
• (2024) FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 1458-1473. 10.1109/MICRO61859.2024.00107. Online publication date: 2 Nov 2024.
• (2024) DDK. Neural Networks 173:C. 10.1016/j.neunet.2024.106164. Online publication date: 2 Jul 2024.
• (2024) Kernel fusion in atomistic spin dynamics simulations on Nvidia GPUs using tensor core. Journal of Computational Science 81 (102357). 10.1016/j.jocs.2024.102357. Online publication date: Sep 2024.
• (2023) Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs. ACM Transactions on Architecture and Code Optimization 20:4 (1-22). 10.1145/3617689. Online publication date: 26 Oct 2023.
• (2023) ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 344-355. 10.1109/IPDPS54959.2023.00042. Online publication date: May 2023.
• (2023) Automatic Kernel Generation for Large Language Models on Deep Learning Accelerators. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 1-9. 10.1109/ICCAD57390.2023.10323944. Online publication date: 28 Oct 2023.
• (2023) Pay attention to the hidden semanteme. Information Sciences: an International Journal 640:C. 10.1016/j.ins.2023.119076. Online publication date: 1 Sep 2023.
• (2022) TCB: Accelerating Transformer Inference Services with Request Concatenation. Proceedings of the 51st International Conference on Parallel Processing, 1-11. 10.1145/3545008.3545052. Online publication date: 29 Aug 2022.
• (2022) DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. 10.1109/SC41404.2022.00051. Online publication date: Nov 2022.
