DOI: 10.1145/3458817.3476138
Research article (Public Access)

E.T.: re-thinking self-attention for transformer models on GPUs

Published: 13 November 2021

Abstract

Transformer-based deep learning models have become a ubiquitous vehicle for driving a variety of Natural Language Processing (NLP) tasks beyond their accuracy ceiling. However, these models also suffer from two pronounced challenges: gigantic model size and prolonged turnaround time. To this end, we introduce E.T., which rE-thinks self-attention computation for Transformer models on GPUs, with the following contributions. First, we introduce a novel self-attention architecture that encompasses two tailored self-attention operators with corresponding sequence-length-aware optimizations, as well as operation-reordering optimizations. Second, we present an attention-aware pruning design that judiciously applies various pruning algorithms to reduce more computation and hence achieve a significantly shorter turnaround time. For the pruning algorithms, we not only revamp existing pruning algorithms but also tailor new ones for Transformer models. Taken together, we evaluate E.T. across a variety of benchmarks for Transformer, BERT-Base, and DistilBERT, where E.T. delivers superior performance over mainstream projects, including the popular NVIDIA enterprise solutions TensorRT and FasterTransformer.
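
To make the two contributions concrete, the following minimal NumPy sketch (an illustration only, not the paper's E.T. kernels or API) computes standard multi-head self-attention and applies a hypothetical magnitude-based head mask in the spirit of attention-aware pruning. Every name in it (self_attention, head_mask, the median-norm threshold) is an assumption introduced for this example.

    # Illustrative sketch only: multi-head self-attention with a hypothetical
    # magnitude-based head-pruning mask. Not the paper's E.T. implementation.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(x, w_q, w_k, w_v, n_heads, head_mask=None):
        """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_model); head_mask: (n_heads,) of 0/1."""
        seq_len, d_model = x.shape
        d_head = d_model // n_heads
        # Project and split into heads: (n_heads, seq_len, d_head).
        q = (x @ w_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
        k = (x @ w_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
        v = (x @ w_v).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
        # Scaled dot-product attention per head; cost grows with the sequence length.
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq_len, seq_len)
        ctx = softmax(scores, axis=-1) @ v                     # (n_heads, seq_len, d_head)
        if head_mask is not None:
            # Attention-aware pruning (illustrative): zero out heads deemed unimportant.
            ctx = ctx * head_mask[:, None, None]
        return ctx.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads

    # Toy usage: keep only the heads whose query-projection weights have the largest L2 norm.
    rng = np.random.default_rng(0)
    seq_len, d_model, n_heads = 128, 768, 12
    x = rng.standard_normal((seq_len, d_model)).astype(np.float32)
    w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)).astype(np.float32) * 0.02
                     for _ in range(3))
    head_norms = np.linalg.norm(w_q.reshape(d_model, n_heads, -1), axis=(0, 2))
    head_mask = (head_norms >= np.median(head_norms)).astype(np.float32)  # keep roughly half
    out = self_attention(x, w_q, w_k, w_v, n_heads, head_mask)
    print(out.shape)  # (128, 768)

In a real GPU implementation such as the one the paper describes, the pruning decision would presumably be folded into the attention and GEMM kernels rather than applied as a post-hoc mask; the sketch only indicates where sequence length and head importance enter the computation.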

Supplementary Material

MP4 File (Large Scale Neural Network Training_ Part I - ET_ Re-Thinking Self-Attention for Transformer Models on GPUs.mp4)
Presentation video

Information

      Published In

      SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
      November 2021
      1493 pages
ISBN: 9781450384421
DOI: 10.1145/3458817

      In-Cooperation

      • IEEE CS

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 November 2021

      Qualifiers

      • Research-article

      Conference

      SC '21

      Acceptance Rates

      Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

      Article Metrics

• Downloads (Last 12 months): 756
• Downloads (Last 6 weeks): 62
      Reflects downloads up to 02 Feb 2025

      Cited By

• (2024) Accelerating Sparse DNNs Based on Tiled GEMM. IEEE Transactions on Computers 73:5 (1275-1289). 10.1109/TC.2024.3365942. Online publication date: May 2024.
• (2024) FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 1458-1473. 10.1109/MICRO61859.2024.00107. Online publication date: 2 Nov 2024.
• (2024) DDK. Neural Networks 173:C. 10.1016/j.neunet.2024.106164. Online publication date: 2 Jul 2024.
• (2024) Kernel fusion in atomistic spin dynamics simulations on Nvidia GPUs using tensor core. Journal of Computational Science 81 (102357). 10.1016/j.jocs.2024.102357. Online publication date: Sep 2024.
• (2023) Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs. ACM Transactions on Architecture and Code Optimization 20:4 (1-22). 10.1145/3617689. Online publication date: 26 Oct 2023.
• (2023) ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 344-355. 10.1109/IPDPS54959.2023.00042. Online publication date: May 2023.
• (2023) Automatic Kernel Generation for Large Language Models on Deep Learning Accelerators. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 1-9. 10.1109/ICCAD57390.2023.10323944. Online publication date: 28 Oct 2023.
• (2023) Pay attention to the hidden semanteme. Information Sciences: an International Journal 640:C. 10.1016/j.ins.2023.119076. Online publication date: 1 Sep 2023.
• (2022) TCB: Accelerating Transformer Inference Services with Request Concatenation. Proceedings of the 51st International Conference on Parallel Processing, 1-11. 10.1145/3545008.3545052. Online publication date: 29 Aug 2022.
• (2022) DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. 10.1109/SC41404.2022.00051. Online publication date: Nov 2022.
