DOI: 10.1145/3524059.3532372

Handling heavy-tailed input of transformer inference on GPUs

Published: 28 June 2022

Abstract

Transformer-based models achieve superior accuracy in natural language processing (NLP) and are increasingly deployed in production. Graphics processing units (GPUs), a popular deployment platform, typically rely on batch processing to infer transformer-based models and achieve high hardware utilization. However, because the input sequence lengths of NLP tasks are variable and follow a heavy-tailed distribution, batch processing introduces a large amount of redundant computation on padding and hurts practical efficiency.
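To make the padding overhead concrete, the following small Python sketch (with assumed, illustrative numbers, not figures from the paper) estimates how much of a padded batch is wasted when sequence lengths follow a heavy-tailed distribution:

import numpy as np

# Heavy-tailed sequence lengths: mostly short sentences, a few very long ones.
rng = np.random.default_rng(0)
lengths = np.minimum(rng.pareto(a=1.5, size=32).astype(int) * 8 + 4, 512)

padded_tokens = len(lengths) * lengths.max()   # tokens computed when padding to the longest sequence
valid_tokens = lengths.sum()                   # tokens that carry real content
print(f"padding overhead: {1 - valid_tokens / padded_tokens:.1%} of the batch is wasted work")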
In this paper, we propose a unified solution that eliminates most of this redundant computation and improves performance when handling heavy-tailed input during transformer-based model inference on GPUs. The solution comprises three strategies, targeting the self-attention module, the multilayer perceptron (MLP) module, and the entire transformer-based model, respectively. For the self-attention module, we design a fine-grained strategy that orchestrates fine-grained parallelism by indexing only the valid block matrix multiplications. For the MLP module, we adopt the common word-accumulation strategy, which packs all sequences in a batch densely. For the entire model, we design a block-organized strategy that links the fine-grained strategy with the word-accumulation strategy by organizing the data layout of the self-attention module at block granularity. Applying our solution to eight corpora of the GLUE benchmark yields an average latency reduction of 63.9% in the self-attention module and 28.1% in the BERT-base model.
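As a rough illustration of the word-accumulation idea, the PyTorch sketch below (a minimal reading of the abstract, not the authors' implementation; pack_tokens and unpack_tokens are hypothetical helpers) packs only the valid tokens of a padded batch into a dense tensor before a token-wise MLP and scatters the results back afterwards:

import torch

def pack_tokens(x, lengths):
    # Gather only the valid tokens of a padded (batch, max_len, hidden) tensor
    # into a dense (num_valid_tokens, hidden) tensor.
    batch, max_len, hidden = x.shape
    mask = torch.arange(max_len, device=x.device)[None, :] < lengths[:, None]
    idx = mask.reshape(-1).nonzero(as_tuple=True)[0]
    return x.reshape(-1, hidden).index_select(0, idx), idx

def unpack_tokens(packed, idx, shape):
    # Scatter the densely computed tokens back to the padded layout (zeros elsewhere).
    out = torch.zeros(shape[0] * shape[1], shape[2], device=packed.device, dtype=packed.dtype)
    out.index_copy_(0, idx, packed)
    return out.reshape(shape)

# Toy heavy-tailed batch: one long sequence, several short ones.
lengths = torch.tensor([128, 9, 7, 5])
batch, max_len, hidden = len(lengths), int(lengths.max()), 768
x = torch.randn(batch, max_len, hidden)
mlp = torch.nn.Sequential(torch.nn.Linear(hidden, 4 * hidden),
                          torch.nn.GELU(),
                          torch.nn.Linear(4 * hidden, hidden))

packed, idx = pack_tokens(x, lengths)          # sum(lengths) rows instead of batch * max_len
y = unpack_tokens(mlp(packed), idx, x.shape)   # padded layout restored for the next module
print(f"tokens computed: {packed.shape[0]} vs. padded: {batch * max_len}")

Because the MLP operates on each token independently, packing changes only the amount of work, not the results at valid positions; the fine-grained and block-organized strategies described above target the self-attention module, where positions within a sequence do interact.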


Cited By

  • Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs. ACM Transactions on Architecture and Code Optimization 20(4): 1-22. DOI: 10.1145/3617689. Online publication date: 26-Oct-2023.
  • CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system. The Journal of Supercomputing 79(13): 14172-14199. DOI: 10.1007/s11227-023-05183-6. Online publication date: 4-Apr-2023.
  • Edge Resource Autoscaling for Hierarchical Federated Learning Over Public Edge Platforms. 2022 IEEE SmartWorld/UIC/ScalCom/DigitalTwin/PriComp/Metaverse: 806-814. DOI: 10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00123. Online publication date: Dec-2022.


Information & Contributors

Information

Published In

ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing
June 2022
514 pages
ISBN:9781450392815
DOI:10.1145/3524059
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 June 2022


Author Tags

  1. GPU
  2. NLP
  3. deep learning
  4. transformer

Qualifiers

  • Research-article

Funding Sources

  • Zhejiang Lab
  • Program for Guangdong Introducing Innovative and Entrepreneurial Teams
  • Natural Science Foundation of China
  • National Key R&D Program of China
  • CCF-Baidu Open Fund

Conference

ICS '22

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%


Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months)106
  • Downloads (Last 6 weeks)5
Reflects downloads up to 03 Oct 2024


