DOI: 10.1145/3524059.3532372

Handling heavy-tailed input of transformer inference on GPUs

Published: 28 June 2022

Abstract

Transformer-based models achieve superior accuracy in natural language processing (NLP) and are increasingly deployed in production. Graphics processing units (GPUs), a popular deployment platform, typically rely on batch processing to infer transformer-based models and achieve high hardware utilization. However, because the input sequence lengths of NLP tasks are variable and follow a heavy-tailed distribution, batch processing introduces a large amount of redundant computation on padding and hurts practical efficiency.
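To make the padding overhead concrete, the following small Python sketch (with assumed, illustrative numbers, not figures from the paper) estimates how much of a padded batch is wasted when sequence lengths follow a heavy-tailed distribution:

import numpy as np

# Heavy-tailed sequence lengths: mostly short sentences, a few very long ones.
rng = np.random.default_rng(0)
lengths = np.minimum(rng.pareto(a=1.5, size=32).astype(int) * 8 + 4, 512)

padded_tokens = len(lengths) * lengths.max()   # tokens computed when padding to the longest sequence
valid_tokens = lengths.sum()                   # tokens that carry real content
print(f"padding overhead: {1 - valid_tokens / padded_tokens:.1%} of the batch is wasted work")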
In this paper, we propose a unified solution that eliminates most of this redundant computation and improves performance when handling heavy-tailed input during transformer-based model inference on GPUs. The solution comprises three strategies, targeting the self-attention module, the multilayer perceptron (MLP) module, and the entire transformer-based model, respectively. For the self-attention module, we design a fine-grained strategy that orchestrates fine-grained parallelism by indexing only the valid block matrix multiplications. For the MLP module, we adopt the common word-accumulation strategy, which packs all sequences in a batch densely. For the entire model, we design a block-organized strategy that links the fine-grained strategy with the word-accumulation strategy by organizing the data layout of the self-attention module at block granularity. Applying our solution to eight corpora of the GLUE benchmark yields an average latency reduction of 63.9% in the self-attention module and 28.1% in the BERT-base model.
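As a rough illustration of the word-accumulation idea, the PyTorch sketch below (a minimal reading of the abstract, not the authors' implementation; pack_tokens and unpack_tokens are hypothetical helpers) packs only the valid tokens of a padded batch into a dense tensor before a token-wise MLP and scatters the results back afterwards:

import torch

def pack_tokens(x, lengths):
    # Gather only the valid tokens of a padded (batch, max_len, hidden) tensor
    # into a dense (num_valid_tokens, hidden) tensor.
    batch, max_len, hidden = x.shape
    mask = torch.arange(max_len, device=x.device)[None, :] < lengths[:, None]
    idx = mask.reshape(-1).nonzero(as_tuple=True)[0]
    return x.reshape(-1, hidden).index_select(0, idx), idx

def unpack_tokens(packed, idx, shape):
    # Scatter the densely computed tokens back to the padded layout (zeros elsewhere).
    out = torch.zeros(shape[0] * shape[1], shape[2], device=packed.device, dtype=packed.dtype)
    out.index_copy_(0, idx, packed)
    return out.reshape(shape)

# Toy heavy-tailed batch: one long sequence, several short ones.
lengths = torch.tensor([128, 9, 7, 5])
batch, max_len, hidden = len(lengths), int(lengths.max()), 768
x = torch.randn(batch, max_len, hidden)
mlp = torch.nn.Sequential(torch.nn.Linear(hidden, 4 * hidden),
                          torch.nn.GELU(),
                          torch.nn.Linear(4 * hidden, hidden))

packed, idx = pack_tokens(x, lengths)          # sum(lengths) rows instead of batch * max_len
y = unpack_tokens(mlp(packed), idx, x.shape)   # padded layout restored for the next module
print(f"tokens computed: {packed.shape[0]} vs. padded: {batch * max_len}")

Because the MLP operates on each token independently, packing changes only the amount of work, not the results at valid positions; the fine-grained and block-organized strategies described above target the self-attention module, where positions within a sequence do interact.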


Cited By

  • Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs. ACM Transactions on Architecture and Code Optimization 20(4): 1-22. DOI: 10.1145/3617689. Online publication date: 26-Oct-2023.
  • CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system. The Journal of Supercomputing 79(13): 14172-14199. DOI: 10.1007/s11227-023-05183-6. Online publication date: 4-Apr-2023.
  • Edge Resource Autoscaling for Hierarchical Federated Learning Over Public Edge Platforms. 2022 IEEE SmartWorld/UIC/ScalCom/DigitalTwin/PriComp/Metaverse: 806-814. DOI: 10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00123. Online publication date: Dec-2022.


Information & Contributors

Information

Published In

ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing
June 2022
514 pages
ISBN:9781450392815
DOI:10.1145/3524059
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 June 2022


Author Tags

  1. GPU
  2. NLP
  3. deep learning
  4. transformer

Qualifiers

  • Research-article

Funding Sources

  • Zhejiang Lab
  • Program for Guangdong Introducing Innovative and Entrepreneurial Teams
  • Natural Science Foundation of China
  • National Key R&D Program of China
  • CCF-Baidu Open Fund

Conference

ICS '22

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%


Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months)106
  • Downloads (Last 6 weeks)5
Reflects downloads up to 03 Oct 2024


