1 Introduction
Transformer models are being widely deployed in real-world applications due to their remarkable accuracy on tasks such as text generation, question answering, and language translation. Transformer models, e.g.,
bidirectional encoder representations from transformers (BERT) [
14] and
generative pre-trained transformers (GPT) [
24], use the self-attention mechanism [
28] to capture the dependency between any two words in a sequence. This enables the model to learn long-range dependencies between different parts of the text.
The parallel architecture of
graphics processing units (GPUs) makes them ideal for deploying transformer models. To take full advantage of the computational power of modern GPUs, deep learning workloads are typically processed in batches, where multiple samples are processed concurrently to exploit high parallelism. This method reduces the kernel launch overhead and significantly improves hardware performance for most deep learning structures. For instance, Amazon [
23] and HuggingFace [
3] both use batch processing in their enterprise-level services to saturate the hardware and improve throughput. Despite the advantages that batch processing offers to deep learning workloads, it is inefficient for
natural language processing (NLP) tasks because input sequences vary in length, so the matrices of different samples differ in shape. A naive solution is to preset a maximum sequence length that covers all cases and pad all input sequences to this length, known as zero padding. Our statistics on
general language understanding evaluation (GLUE) [
29] benchmark reveal that sequence lengths in all corpora follow a heavy-tailed distribution, with the majority of sequences being significantly shorter than the preset maximum length. As a consequence, the amount of redundant computation can even exceed that of the valid computation. While the combination of batch processing and zero padding improves hardware performance by exploiting higher parallelism, it results in redundant computation and severely degrades practical performance.
Transformer models comprise three main modules: the word embedding module, the self-attention module, and the
multi-layer perceptron (MLP) module, which will be introduced in more detail in Section
3. In this article, we classify all linear functions into the MLP module and treat the remaining operations of the self-attention mechanism as the self-attention module. The self-attention and MLP modules are repeated multiple times in a transformer model and contribute most of the computation, and both suffer from the redundant computation problem. Therefore, our primary focus is on the computationally intensive parts of the model, i.e., the self-attention and MLP modules.
Several prevailing frameworks, such as ByteTransformer [
32], FasterTransformer [
21], and TurboTransformers [
10], have addressed the problem of redundant computation in MLP modules by using a word-accumulation approach that removes padding for the MLP modules and rebuilds it before each self-attention module. We refer to this as the EFF-rebuild solution, since it was first introduced by EffectiveTransformer. However, the EFF-rebuild solution has two shortcomings. First, it involves extra data movement for removing and rebuilding padding. Second, it does not eliminate the redundant computation in the self-attention module.
In this article, we propose a unified solution for improving both computation and memory efficiency in transformer model inference on GPUs with heavy-tailed input. The main contributions of this article are as follows:
•
We show that the input of NLP tasks often follows a heavy-tailed distribution, which leads to severe redundancy in both computation and memory during transformer inference.
•
We propose a unified solution. To improve computation efficiency, it applies the fine-grained approach to the self-attention module, the word-accumulation approach to the MLP module, and the block-organized approach to the entire model. On top of these approaches, it adopts the chunk-based approach for better memory management.
•
We propose a fine-grained approach that reduces the redundant computation in the self-attention module by indexing only the valid block matrix multiplications.
•
We propose a block-organized approach that unifies the redundant computation reduction methods of the MLP and self-attention modules by organizing the data layout of the self-attention module at block granularity.
•
We propose a chunk-based memory management approach that balances the memory footprint and allocation/free efficiency.
4 The Unified Solution
In this section, we introduce our unified solution for improving both computation and memory efficiency. To eliminate the redundancy in computation, as shown in Figure 6, the unified solution includes three approaches for the self-attention module, the MLP module, and the entire model, respectively. To eliminate the redundancy in memory, as shown in Figure 11, the unified solution adopts the chunk-based approach to balance the memory footprint and the allocation/free efficiency.
The fine-grained approach is based on block matrix multiplication and leverages potential fine-grained parallelism. Specifically, it determines the execution order of the valid mini-block matrix multiplications at the beginning and stores this order in index arrays. Using these index arrays, the GPU executes only the valid mini-block matrix multiplications in the self-attention module, eliminating most of the redundant computation. We implement the fine-grained approach using three techniques, i.e., the mini-block index (MBIX), the shared memory block transpose (SMBT), and the efficient atomic operation (EAOP).
The word-accumulation approach is not novel. Since the existing implementation of the linear function parallelizes over the word dimension rather than the batch dimension, it is intuitive to pack the sequences of a batch densely and thereby eliminate the redundant computation of the MLP module.
As MLP modules pack sequences tightly while self-attention modules arrange them loosely, a data layout switch is necessary where they connect, leading to data movement. To address this issue, the block-organized approach is introduced to enable better connectivity. With the fine-grained approach, memory usage becomes inefficient, as many padded areas never participate in computation. Therefore, the block-organized approach incorporates a novel data layout, called block padding, that pads sequences in blocks to satisfy the minimum requirement of the fine-grained approach. To switch between data layouts, two customized layout switch kernels are developed.
Benefiting from the aforementioned approaches, the memory footprint required by intermediate results changes frequently during execution. Instead of allocating sufficient memory space in advance, the chunk-based approach aims to release unnecessary memory while ensuring efficient allocation. It allocates or frees memory in chunks to enable more precise memory management. Additionally, it includes a pre-schedule method that hides the memory allocation overhead behind computation. In this way, our solution requires a smaller memory footprint most of the time and releases more memory for higher-level schedulers [6, 12] to co-locate multiple tasks on a single GPU, which is the common case in cloud scenarios.
4.1 Fine-grained Approach
By breaking the matrix multiplications of the \(Q \times K^T\) and \(QK^T \times V\) functions into multiple mini-block matrix multiplications following the block matrix multiplication in Equation (2), we can orchestrate only the valid mini-blocks for efficient computation.
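For clarity, the block decomposition we rely on is the standard identity (a restatement consistent with Equation (2), written here with our mini-block indices):
\[
(AB)_{ij} \;=\; \sum_{k} A_{ik} B_{kj},
\]
where \(A_{ik}\), \(B_{kj}\), and \((AB)_{ij}\) denote mini-blocks; only the mini-block products whose operands contain valid words need to be scheduled and executed.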
4.1.1 Mini-block Index.
Figure
7(a) illustrates the block matrix multiplication of the \(Q \times K^T\) function, where all mini-blocks have the same size and behavior. MBIX leverages two index arrays to record the execution order. Based on the valid lengths (val_len) of the input sequences, we can compute the offsets of the valid query and key mini-blocks. However, these two arrays alone lack the necessary 3D structural information to correctly position each result in the resulting matrix for write-back. To address this, MBIX builds a third offset array for the resulting matrix, which records the offset of each relevant mini-block in the \(QK^T\) matrix for writing back the results.
For a three-word sequence, the q_offset and k_offset arrays record [0,2,0, ..., 10] and [0,2,8, ..., 10], respectively, from which the corresponding offsets ([0,0,2, ..., 14]) in the resulting matrix can be calculated. During execution, the GPU first accesses the query and key offset arrays, then uses these offsets to access and process each valid mini-block pair one by one. As most values in the mini-block pairs are valid, a majority of the redundant computation is avoided. Similarly, Figure
7(b) illustrates the 2D partitioning of MBIX on the
\(Q K^T \times V\) function, leveraging the same concept.
While MBIX can eliminate a significant amount of redundant computation, it introduces new overhead, such as index array building, irregular data access, and extra memory usage. The block size used to decompose the matrix multiplication is a critical parameter for balancing redundant computation against the newly introduced overhead. A smaller mini-block better fits irregular data structures, resulting in less redundant computation. However, smaller mini-blocks produce very large index arrays, which take longer to build and consume more memory, and they also lead to poor spatial locality. Here, we empirically select a \(32\times 32\) block size for the 2D partitioning, which suffices to validate our unified solution; further engineering is required for a mature implementation.
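As an illustration of the index-building step, the host-side sketch below is our own simplification with hypothetical names and layout assumptions (row-major per-sequence storage, head_dim a multiple of the block size, a single head), not the paper's implementation; it enumerates the valid mini-block pairs of the \(Q \times K^T\) function and records their offsets, which would then be copied to the GPU.

```cuda
// Host-side sketch of MBIX offset generation for Q x K^T.
// Assumed layout: Q and K of one sequence are row-major [padded_len x head_dim],
// and QK^T of one sequence is row-major [padded_len x padded_len].
#include <vector>

constexpr int BS = 32;  // mini-block size of the 2D partitioning

struct MiniBlockIndex {
    std::vector<int> q_off;  // offset of each valid query mini-block
    std::vector<int> k_off;  // offset of each valid key mini-block
    std::vector<int> o_off;  // offset of the target mini-block in QK^T
};

MiniBlockIndex build_qk_index(const std::vector<int>& val_len, int head_dim) {
    MiniBlockIndex idx;
    int base_in = 0, base_out = 0;
    for (int len : val_len) {
        int nblk   = (len + BS - 1) / BS;   // blocks covering the valid words
        int nred   = head_dim / BS;         // reduction slices over head_dim
        int ld_out = nblk * BS;             // block-padded leading dimension of QK^T
        for (int bi = 0; bi < nblk; ++bi)             // rows of QK^T
            for (int bj = 0; bj < nblk; ++bj)         // columns of QK^T
                for (int bk = 0; bk < nred; ++bk) {   // reduction dimension
                    idx.q_off.push_back(base_in + bi * BS * head_dim + bk * BS);
                    idx.k_off.push_back(base_in + bj * BS * head_dim + bk * BS);
                    // The same output block appears once per bk, which is why the
                    // kernel accumulates atomically (Section 4.1.3 reduces this cost).
                    idx.o_off.push_back(base_out + bi * BS * ld_out + bj * BS);
                }
        base_in  += nblk * BS * head_dim;   // advance to the next (block-padded) sequence
        base_out += nblk * BS * ld_out;
    }
    return idx;
}
```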
4.1.2 Shared Memory Block Transpose.
When executing a matrix multiplication on a pair of mini-blocks, each vector in the first mini-block is multiplied with all vectors in the second mini-block, resulting in multiple accesses to each vector. Shared memory is a potent feature of modern GPUs, with significantly lower access latency and higher throughput than local and global memory. Prefetching mini-block data from global memory to shared memory is therefore desirable to minimize memory access overhead.
However, merely migrating mini-block data from global memory to shared memory yields very limited benefits. To achieve high memory bandwidth, each SM's shared memory is divided into 32 equally sized banks, matching the number of threads in a warp, so adjacent values fall in different banks. Unfortunately, as shown on the left side of Figure 8, direct copying causes memory requests from multiple threads to hit the same bank, resulting in severe bank conflicts. To solve this issue, we adjust the data layout in shared memory, ensuring that threads within the same warp access data in different banks. As illustrated on the right side of Figure 8, mini-blocks are transposed while moving data from global memory to shared memory. Threads within the same warp are no longer forced to access data in the same bank, and shared memory bank conflicts are avoided.
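The device-side sketch below (again our own, with hypothetical parameter names) combines MBIX and SMBT: each thread block processes one valid mini-block pair selected by the index arrays, stages the key tile in transposed form in shared memory so that the inner-product loop reads consecutive banks, and accumulates its partial result atomically. The one-column padding of the shared tile is a common extra trick we add to also avoid conflicts during the transposed store; it is not claimed by the paper.

```cuda
// MBIX + SMBT sketch for Q x K^T: one 32x32 thread block per valid mini-block pair.
#define BS 32

__global__ void qk_miniblock_kernel(const float* __restrict__ Q,
                                    const float* __restrict__ K,
                                    float*       __restrict__ QK,
                                    const int* __restrict__ q_off,
                                    const int* __restrict__ k_off,
                                    const int* __restrict__ o_off,
                                    int ldq, int ldk, int ldo)
{
    __shared__ float q_tile[BS][BS];
    __shared__ float kT_tile[BS][BS + 1];   // +1 column: conflict-free transposed store

    int pair = blockIdx.x;                  // which valid mini-block pair
    int row  = threadIdx.y, col = threadIdx.x;

    q_tile[row][col]  = Q[q_off[pair] + row * ldq + col];
    // SMBT: transpose the key tile while copying it from global to shared memory.
    kT_tile[col][row] = K[k_off[pair] + row * ldk + col];
    __syncthreads();

    float acc = 0.f;
    for (int i = 0; i < BS; ++i)
        acc += q_tile[row][i] * kT_tile[i][col];   // stride-1, bank-conflict-free reads

    // Several pairs (one per reduction slice) target the same output tile,
    // so the write-back accumulates atomically.
    atomicAdd(&QK[o_off[pair] + row * ldo + col], acc);
}
// Launch (illustrative): qk_miniblock_kernel<<<num_pairs, dim3(BS, BS)>>>(...);
```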
4.1.3 Efficient Atomic Operation.
In both Figures
7(a) and
7(b), multiple mini-block pairs in the 2D partitioning write their results to the same location. Hence, ensuring the correctness of the fine-grained approach requires atomic operations, which cause contention that undermines performance. EAOP aims to reduce the overhead arising from this contention.
As the query and key are identical in shape, the \(Q \times K^T\) function degenerates into a more specialized case. Figure 9 demonstrates this with a block size of \(1\times 4\), which we call 1D partitioning. With 1D partitioning, each location in the output matrix corresponds to a single mini-block pair, so the \(Q \times K^T\) function can be computed without atomic operations. Consequently, in our implementation, we empirically select \(64\times 32\) as the block size.
For the
\(QK^T \times V\) function, reducing the overhead of atomic operations involves reorganizing the offset generation and enlarging the write interval between values intended to be written to the same location. The offset generation function consists of several loops in different dimensions. For example, in Figure
7(b), mini-blocks from the blue matrix in the first row are multiplied with mini-blocks from the yellow matrix in the first column. Two such mini-block pairs write their results to the same location, generating contention. To improve this, we can modify the offset order, setting it to [0,12,2,14,0,12,2,14] and [0,0,8,8,2,2,10,10]. Enlarging the write interval between dependent values effectively relieves the contention, and the interval can be further increased by introducing loops over other dimensions, such as the batch dimension, to suppress contention even more effectively.
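To make the reordering concrete, the sketch below is our own illustration with hypothetical names; it generates the \(QK^T \times V\) pair list with the reduction-slice loop outermost, so that entries accumulating into the same output mini-block are separated by a full pass over the output blocks rather than being adjacent (the concrete offset values in the example above come from the layout of Figure 7(b) and will differ).

```cuda
// EAOP-style offset ordering for QK^T x V: the same o_off value repeats only
// once every n_row * n_col entries, enlarging the write interval between
// atomicAdd operations that target the same output mini-block.
#include <vector>

struct PairList { std::vector<int> a_off, b_off, o_off; };

PairList build_av_index(int n_row, int n_col, int n_red,
                        int lda, int ldb, int ldo, int bs = 32) {
    PairList p;
    for (int bk = 0; bk < n_red; ++bk)            // reduction slices outermost
        for (int bi = 0; bi < n_row; ++bi)
            for (int bj = 0; bj < n_col; ++bj) {
                p.a_off.push_back(bi * bs * lda + bk * bs);   // block (bi, bk) of QK^T
                p.b_off.push_back(bk * bs * ldb + bj * bs);   // block (bk, bj) of V
                p.o_off.push_back(bi * bs * ldo + bj * bs);   // block (bi, bj) of the output
            }
    return p;
}
```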
4.1.4 Half-precision on Tensor Core.
Despite some loss of accuracy, lower-precision methods are widely studied for inference tasks. They represent weights and activations with lower precision to reduce the memory footprint and speed up inference. In addition to single-precision on CUDA cores, we further extend the fine-grained approach to half-precision and adopt the latest tensor core architecture of NVIDIA GPUs.
A tensor core contains a \(4\times 4\times 4\) matrix processing array that performs the operation \(D = A \times B + C\) in half-precision, where these matrices have a size of \(4\times 4\). Tensor cores are manipulated through the warp matrix multiply accumulate (WMMA) API, which combines multiple tensor core operations and is implemented at the warp level. The WMMA API loads a \(16\times 16\) matrix from another memory space into registers using the \(load\_matrix\_sync()\) function, performs matrix multiplication and accumulation on a pair of \(16\times 16\) matrices using the \(mma\_sync()\) function, and moves the resulting product back using the \(store\_matrix\_sync()\) function.
The fine-grained approach also fits the tensor core's half-precision computation; the difference lies in how each pair of mini-blocks is processed. Since we choose \(32\times 32\) as the block size in the fine-grained approach, our goal is to decompose the \(32\times 32\) matrix multiplication at the granularity of \(16\times 16\) to fit the WMMA API. A single \(32\times 32\) matrix multiplication can be decomposed into 8 \(16 \times 16\) matrix multiplications, leading to 8 WMMA API calls. As shown in Figure 10, every two matrix multiplications should be accumulated together. To accelerate data loading, we also move the \(32\times 32\) matrices (A and B) to shared memory following SMBT. Although two resulting matrices are accumulated in the same place, there is no contention, since the multiple WMMA operations are executed serially within a warp.
Furthermore, we adopt a pipeline method to reduce the data movement overhead between shared memory and registers. Each pair of matrices requires loading data (A, B, and C), performing the computation (\(D = A \times B + C\)), and then offloading the result (D). Figure 10 shows that the matrix multiplication results of two pairs of matrices are accumulated together, so we pipeline these two pairs: after loading the data of the first pair, the computation starts while the data of the next pair is loaded simultaneously. Moreover, we can keep the resulting matrix in registers, without using any other memory space for the accumulation. Consequently, the pipeline method (1) hides data movement overhead behind computation and (2) reduces memory movement by storing the resulting matrix directly in registers.
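The warp-level sketch below is our own, assuming the A and B tiles are stored row-major in shared memory with a leading dimension of 32; it shows how one \(32\times 32\) mini-block product decomposes into the 8 WMMA calls described above while each \(16\times 16\) output tile stays in registers across its two accumulation steps. The double-buffered pipeline across pairs is omitted for brevity.

```cuda
// Decomposing a 32x32 half-precision mini-block product into 16x16 WMMA tiles.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__device__ void mma_32x32(const half* A, const half* B, half* C) {
    const int LD = 32;  // leading dimension of the 32x32 tiles
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc;

    for (int ti = 0; ti < 2; ++ti)            // four 16x16 output tiles ...
        for (int tj = 0; tj < 2; ++tj) {
            wmma::fill_fragment(acc, __float2half(0.0f));
            for (int tk = 0; tk < 2; ++tk) {  // ... each accumulates two products
                wmma::load_matrix_sync(a_frag, A + ti * 16 * LD + tk * 16, LD);
                wmma::load_matrix_sync(b_frag, B + tk * 16 * LD + tj * 16, LD);
                wmma::mma_sync(acc, a_frag, b_frag, acc);  // result stays in registers
            }
            wmma::store_matrix_sync(C + ti * 16 * LD + tj * 16, acc, LD,
                                    wmma::mem_row_major);
        }
}
```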
4.2 Block-organized Approach
In the word-accumulation approach of the MLP module, the dense arrangement data layout is used, as shown in Figure
11. The dense arrangement has no padding between sequences and can be processed directly by GEMM without redundant computation. However, the densely arranged data layout is not compatible with our fine-grained approach, because mini-blocks on sequence boundaries would intrude into values that are invalid for their sequence and produce incorrect results. To support the fine-grained approach, we can use the batch padding data layout, in which all sequences in a batch are padded to the same length. Setting the word-accumulation approach aside, batch padding does not introduce any computation overhead.
However, the densely arranged layout of the MLP module conflicts with the batch padding layout of the self-attention module where they connect. Integrating these two data layouts requires continuously switching between them, thus introducing data movement costs. To mitigate these costs, we design a block padding layout for the self-attention module. As shown in Figure
11, the block padding layout is organized at block granularity, where only the values required by the self-attention module's fine-grained approach are padded. Thus, only the partial block at the end of a sequence, whose length is not a multiple of the block size, is padded up to the block boundary, making the padding area far smaller than that of the batch padding layout. Although the block padding layout still requires data movement to become compatible with the preceding densely arranged layout, the overhead is largely reduced. Furthermore, the peak memory demand is lower than that of the batch padding approach, because values are arranged more densely. The block-organized approach requires customizing both the offset generation function and the data layout switch function. Here, we perform the offset generation on the CPU serially.
Regarding the layout switch function, both the FasterTransformer [
21] and TurboTransformer [
10] frameworks fuse it with the transpose operation, which significantly reduces the kernel launch and data movement overhead. TurboTransformer uses
\(batch\_size \times head\_num \times seq\_len\) GPU warps to implement it and decides in each thread whether to move a value. FasterTransformer, in contrast, focuses on the valid values, building a prefix-sum array over the valid words to reduce redundant data movement. Our implementation mainly follows the approach used in FasterTransformer. To implement the layout switch logic, we build an index array that records block offsets instead of the word offsets used in FasterTransformer. This index array is considerably smaller than FasterTransformer's, and each warp is responsible for moving one block. The reverse process, which switches the block padding layout back to the densely arranged layout, follows the same principles.
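A simplified version of such a block-wise switch kernel is sketched below (our own, with hypothetical array names, and without the transpose that the real kernels fuse in): each 32-thread block copies one data block from the densely arranged layout to the block padding layout and zero-fills the padded tail, guided by a per-block index built on the host.

```cuda
// Dense -> block-padded layout switch; one warp (a 32-thread block here)
// moves one block of BS words, each holding `hidden` values.
#define BS 32

__global__ void dense_to_block_padded(const float* __restrict__ src,
                                      float*       __restrict__ dst,
                                      const int* __restrict__ src_word,  // first word of the block in the dense layout
                                      const int* __restrict__ dst_word,  // first word of the block in the padded layout
                                      const int* __restrict__ n_valid,   // valid words in the block (<= BS)
                                      int hidden)
{
    int blk   = blockIdx.x;
    int lane  = threadIdx.x;                 // 0..31
    int valid = n_valid[blk];

    for (int w = 0; w < BS; ++w) {
        float* d = dst + (size_t)(dst_word[blk] + w) * hidden;
        if (w < valid) {
            const float* s = src + (size_t)(src_word[blk] + w) * hidden;
            for (int h = lane; h < hidden; h += 32) d[h] = s[h];
        } else {
            for (int h = lane; h < hidden; h += 32) d[h] = 0.f;  // block padding
        }
    }
}
// Launch (illustrative): dense_to_block_padded<<<num_blocks, 32>>>(src, dst, src_word, dst_word, n_valid, hidden);
```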
4.3 Chunk-based Approach
In addition to optimizing computation efficiency, memory efficiency is also critical for an inference runtime system. In the case of fixed-length input, the intermediate tensors are also fixed in size, so we can determine the lifecycles of all intermediate tensors and schedule them in advance, effectively avoiding memory allocation overhead. However, with the computational approaches of the unified solution, the sizes of intermediate tensors keep changing during the serving process under variable-length input. Although we could determine the upper limit of the memory footprint and reserve a large enough memory space in advance, this wastes considerable memory, especially when the input lengths follow a heavy-tailed distribution. Conversely, if we re-allocate the optimal memory size in each run, the memory allocation overhead severely impacts the execution speed. Thus, we propose the chunk-based approach to balance the memory footprint and allocation/free efficiency.
Memory management for fixed-length input has been studied in PyTorch [
22] and TFLite [
8]. They exploit the computation graph to analyze the lifecycles of all intermediate tensors; tensors that do not coexist in time can share the same memory space. During the first inference, the system incrementally allocates memory for intermediate tensors until the maximum memory requirement is reached. Afterward, the memory is cached in the system and reused in subsequent inferences. However, for variable-length input, this incremental allocation method cannot achieve optimal memory usage, since it does not free memory when new inputs become smaller.
The chunk-based approach organizes memory in chunks to enhance the efficiency of allocation and freeing. By reusing most memory chunks, the allocator only needs to allocate or free a small amount of memory as the input size changes. To determine the size and number of chunks, the chunk-based approach employs a scheduling algorithm. Once the sequence lengths of a batch are known, the scheduling algorithm simulates the entire execution process and produces the memory plan for this batch based on the computation graph. As shown in Figure 12, for each batch, the algorithm schedules the tensor placement based on the computation graph and the existing chunks, from which we can determine how many chunks to free or allocate for the next batch.
The scheduling algorithm seeks to place intermediate tensors in existing chunks while minimizing the memory footprint. It first sorts the intermediate tensors of a batch by memory size in non-increasing order. Then, for each tensor in this list, it checks whether the allocated chunks can satisfy both the memory space and the time window, which is essentially a 2D strip packing problem. Here, we execute the Greedy-by-Size algorithm [
8,
10] multiple times to check whether there is enough space in the chunk list. If an existing chunk can satisfy the tensor, the tensor is assigned the corresponding offset; otherwise, a new chunk is allocated and appended to the chunk list. The size of a new chunk is determined by two hyperparameters: a fixed scaling factor applied to the tensor size and a minimum size requirement. We also exploit layer similarity to reduce the complexity of the scheduling algorithm.
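The placement logic can be sketched as follows; this is a host-side simplification of our own, in which the chunk-sizing constants, field names, and the candidate-offset heuristic are illustrative rather than the exact Greedy-by-Size variant, and the layer-similarity shortcut is omitted.

```cuda
// Greedy placement of intermediate tensors into memory chunks (simplified).
#include <algorithm>
#include <cstddef>
#include <vector>

struct Tensor { size_t size; int start, end; size_t offset = 0; int chunk = -1; };
struct Placed { size_t offset, size; int start, end; };
struct Chunk  { size_t size; std::vector<Placed> placed; };

// Two placements conflict only if they overlap both in lifetime and in address range.
bool conflicts(const Placed& p, const Tensor& t, size_t off) {
    bool time  = p.start < t.end && t.start < p.end;
    bool space = p.offset < off + t.size && off < p.offset + p.size;
    return time && space;
}

void schedule(std::vector<Tensor>& tensors, std::vector<Chunk>& chunks,
              size_t min_chunk = 32u << 20, double scale = 1.5) {
    // Sort tensors by size in non-increasing order (greedy-by-size).
    std::vector<Tensor*> order;
    for (auto& t : tensors) order.push_back(&t);
    std::sort(order.begin(), order.end(),
              [](const Tensor* a, const Tensor* b) { return a->size > b->size; });

    for (Tensor* t : order) {
        // Try to reuse an existing chunk: candidate offsets are 0 and the end
        // of every placement already in the chunk.
        for (size_t c = 0; c < chunks.size() && t->chunk < 0; ++c) {
            std::vector<size_t> cand = {0};
            for (const auto& p : chunks[c].placed) cand.push_back(p.offset + p.size);
            std::sort(cand.begin(), cand.end());
            for (size_t off : cand) {
                if (off + t->size > chunks[c].size) continue;
                bool ok = true;
                for (const auto& p : chunks[c].placed) ok = ok && !conflicts(p, *t, off);
                if (ok) {
                    t->chunk = (int)c; t->offset = off;
                    chunks[c].placed.push_back({off, t->size, t->start, t->end});
                    break;
                }
            }
        }
        if (t->chunk < 0) {  // no existing chunk fits: allocate a new one
            size_t sz = std::max(min_chunk, (size_t)(t->size * scale));
            chunks.push_back({sz, {{0, t->size, t->start, t->end}}});
            t->chunk = (int)chunks.size() - 1;
            t->offset = 0;
        }
    }
}
```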
The chunk-based approach further enhances performance by overlapping the memory allocation overhead with computation. As shown in Figure 12, we extract and visualize the memory placement during the execution phase of the self-attention module. Based on the chunks of batch 0, the scheduling algorithm determines that one chunk can be freed for processing batch 1; based on the chunks of batch 1, a new chunk should be allocated for batch 2. Requests arrive randomly in real-world serving systems, so a scheduling thread can run the scheduling algorithm as soon as it observes the sequence lengths of a new batch, and it can also perform the chunk memory allocation in advance. After the previous batch finishes, the chunk-based approach frees any extra chunks or appends the newly allocated chunks to the chunk list.
5 Evaluation
We demonstrate the effectiveness of our unified solution in two steps. First, we present modular evaluations of the three approaches. Then, we integrate the unified solution into a prevailing transformer inference system and evaluate the overall latency improvement.
5.1 Experiment Setup
For the modular evaluations, we present the effectiveness of the fine-grained approach, the block-organized approach, and the chunk-based approach. For the fine-grained approach, we evaluate the average latency of the self-attention module, i.e., the \(Q \times K^T\) and \(QK^T \times V\) functions, across an entire corpus. As the baseline, we take the corresponding approach of FasterTransformer and TurboTransformer, both of which implement the self-attention module with the cublasGemmStridedBatchedEx kernel of the cublas library. The configuration is the same as in FasterTransformer: CUBLAS_GEMM_ALGO0 is used for the \(Q \times K^T\) function and CUBLAS_GEMM_ALGO1 for \(QK^T \times V\). In particular, we show the performance of the fine-grained approach in both half- and single-precision, since it is the only approach that directly improves computation.
For the modular evaluation of the block-organized approach, we evaluate the switching time between the densely arranged layout and the block padding layout, while the switch between the densely arranged layout and the batch padding layout serves as the baseline. The two different implementations in FasterTransformer and TurboTransformer are both compared.
For the modular evaluation of the chunk-based approach, we evaluate the memory footprint of the entire model's intermediate results and report the average cost of the scheduling algorithm. The baseline is the memory management method used in PyTorch [
22] and TFLite [
8], which incrementally grows a cache of memory and reassigns the cached memory to later executions.
For our overall evaluation, we pick the EFF-rebuild solution, which eliminates redundant computation in MLP modules, as our baseline and implement it using FasterTransformer. We then apply our unified solution to this baseline to compare computation efficiency. The changes include replacing the cublasGemmStridedBatchedEx kernels with our fine-grained kernels, constructing the offset generation functions, adopting block-organized padding for the self-attention module, adopting the new data layout switch method, and adopting the chunk-based allocator for memory management. Note that the index arrays only need to be built once for all self-attention modules in a transformer-based model.
The experiments are conducted on a node equipped with an Intel Xeon Silver 4208 CPU and a Tesla V100 GPU (16 GB). We use Ubuntu 18.04, GCC 7.5.0, and CUDA 11.3 as the running environment. For our fine-grained approach in single-precision, we take both BERT-base and BERT-large [
14] configurations. For the other evaluations, we only present the results of BERT-base, as model scaling does not affect the time breakdown for either our solution or the baseline. We select eight corpora from the well-known GLUE benchmark, specifically CoLA, SST-2, WNLI, STS-B, QQP, MNLI, QNLI, and RTE. The preset sequence lengths (
seq_len) used are shown in Table
3, which are derived from corpus statistics after removing outlier sequence lengths. To ensure reliability, we repeat each case three times and report the average result.
5.2 Evaluation of the Fine-grained Approach
Figure
13 illustrates the latency of executing
\(Q \times K^T\) and
\(QK^T \times V\) functions on eight corpora. We develop two host functions to create two sets of offsets for the
\(Q \times K^T\) and
\(QK^T \times V\) functions, and then copy the offsets from host to device. Relative to the kernel execution time, the offset generation process accounts for approximately 5%. Since all self-attention modules in a transformer-based model behave in the same way and have the same structure, the offset generation process only needs to be executed once per batch, contributing minimally to the total execution time. Moreover, both the generation and the host-to-device copy of the offsets can be overlapped with the word embedding module during practical inference, so they are excluded here.
Regarding the cublas implementation shown in Figure
13, corpora with the same
\(seq\_len\) have comparable latency. As
\(seq\_len\) doubles, the latency increases significantly, because the computation amount grows quadratically with \(seq\_len\). In contrast, the latency of our fine-grained kernels mainly depends on the average sequence length of a corpus. For instance, when comparing WNLI with STS-B, the cublas latency on WNLI is much lower than that on STS-B, since WNLI has a shorter \(seq\_len\); however, the fine-grained kernels' latency on WNLI is similar to that on STS-B. Furthermore, in some scenarios the fine-grained kernels' latency on WNLI even exceeds that of the cublas kernels, resulting in negative optimization.
Comparing Figures
13(a) with
13(b), we observe that the latency in the BERT-large configuration increases, while the ratio between the fine-grained approach and the cublas approach remains stable. This is because the self-attention module of BERT-large differs from that of BERT-base only in the number of heads. Additionally, when the batch size increases, both approaches improve throughput by saturating the hardware in most cases, and this trend gradually weakens once the batch size is already large. As a result, our fine-grained approach decreases the self-attention module's latency by up to 87.8% on the QQP corpus and by 63.9% on average across the eight GLUE corpora relative to the cublas approach.
Figure 13(c) illustrates the half-precision comparison between the fine-grained approach and the cublas approach. Comparing the half-precision performance with the preceding single-precision results, it is evident that half-precision computation on tensor cores indeed achieves considerable acceleration. Specifically, by switching to half-precision, the cublas implementation delivers an average acceleration of 2.98
\(\times\), while the fine-grained approach yields a comparatively lower acceleration ratio of 2.24
\(\times\). However, both acceleration ratios fall well below the potential speedup of the Tesla V100 GPU, whose single-precision performance is 14 TFLOPS and whose half-precision (tensor core) performance is 112 TFLOPS, an eightfold ratio. Many factors, such as memory bandwidth, can cause this disparity, and they may also explain why our fine-grained approach achieves lower acceleration: the data volume is halved while the memory bandwidth stays unchanged, yet the peak hardware performance grows eightfold from CUDA cores to tensor cores. This asymmetric change in memory bandwidth and computation requirements leads to the lower acceleration of the fine-grained approach. For the WNLI corpus, our fine-grained approach becomes worse than the cublas approach. Even so, we believe our approach remains effective on the tensor core architecture.
5.3 Evaluation of the Block-organized Approach
In Figure
14, we present the modular evaluation of the block-organized approach. Ours denotes our switch between the densely arranged layout and the block padding layout. Fast denotes the switch between the densely arranged layout and the batch padding layout implemented in FasterTransformer, and Turbo denotes the corresponding switch implemented in TurboTransformer.
Regarding the switching function in the block-organized approach, we observe that the time gradually increases as the batch size doubles, and the upward trend becomes more evident as the workload grows. When the batch size increases from 16 to 32 on CoLA and SST-2, the time only increases by 1.05\(\times\) and 1.16\(\times\), respectively, whereas on the MNLI, QNLI, and RTE corpora it increases by 1.36\(\times\), 1.4\(\times\), and 1.48\(\times\), respectively. In fact, when we keep increasing the batch size, for example from 512 to 1024, the time starts to double. This is because each GPU thread in our implementation moves a column of values, and the number of GPU warps cannot saturate the device when the workload is small.
As shown in Figure
14, Turbo exhibits the worst performance in almost all cases. This is because TurboTransformer's implementation computes the offsets inside the kernel function and uses conditional statements to determine whether a value is redundant. As a result, corpora with the same \(seq\_len\) have comparable switching times in TurboTransformer, since its performance primarily depends on the preset length.
Furthermore, we notice that the performance of Ours and Fast primarily depends on the average length: corpora with longer average lengths take more time. FasterTransformer's implementation also builds a prefix-sum index and records the number of valid words. As a result, both methods deal with fewer redundant values, making the performance proportional to the average length of a corpus. Moreover, our method achieves better data layout switch performance. This is because, in FasterTransformer, each GPU warp is responsible for moving only one valid word, so each thread is assigned less work; it also needs to build a more detailed prefix-sum index and transfer more data between host memory and device memory. In comparison, each GPU warp moves an entire block in our approach.
5.4 Evaluation of the Chunk-based Approach
For the modular evaluation of the chunk-based approach, we analyze the memory footprint and the newly introduced time cost at the level of the entire BERT model. From the perspective of the memory footprint, Figure
15 shows the memory footprint of 30 consecutive batches when processing the eight corpora with batch sizes of 8 and 16. The chunk-based approach achieves a significant reduction in memory footprint, which fluctuates as sequence lengths change. As with the computation optimization, the effect is highly dependent on the distribution of each corpus. We compute the average memory footprint of each corpus relative to the baseline. The results show that, for the WNLI corpus, the chunk-based approach requires 30.1
\(\%\) and 28.0
\(\%\) of the baseline memory footprint for batch sizes 8 and 16, while for the QQP corpus, it requires only 8.4
\(\%\) and 8.2
\(\%\) of the baseline memory footprint for batch sizes 8 and 16. This is because, although WNLI and QQP have similar average lengths, their
\(seq\_len\) values differ drastically, with the
\(seq\_len\) of QQP far exceeding that of WNLI. In addition, Figure
15 shows that even corpora with highly volatile memory footprints experience periods in which the footprint remains unchanged for several batches. During these steady periods, the sequence length sums on the x-axis are similar but not identical, indicating that the chunk-based approach reduces the number of memory allocations.
From the perspective of memory allocation cost, the chunk-based approach continuously incurs scheduling time and memory allocation/free time throughout the serving process, which introduces overhead. Table 4 presents the total time cost, the number of allocations, and the total number of batches for each corpus. The ratio between batch and allocation counts varies widely across corpora, which is reflected in the trend shown in Table 4. Further, our measurements indicate that the performance cost of the chunk-based approach accounts for only 0.10\(\%\)–1.05\(\%\) of the overall model inference time across these corpora. Moreover, if a new batch arrives during the execution of the previous batch, the memory management time can be overlapped with computation; after the previous batch completes, the allocator can provide the memory for the current batch simply by adding or removing pointers in the chunk list. Therefore, despite the overhead of active scheduling and memory allocation, the chunk-based approach achieves a significant memory footprint reduction with very limited overhead by effectively exploiting the memory behavior of the preceding approaches. Moreover, on all GLUE corpora the memory footprint stays at a relatively low level most of the time, which is useful when multiple tasks are co-located and uniformly scheduled.
5.5 Overall Evaluation of the Unified Solution
Figure
16 presents the overall evaluation. Fast denotes the model executed with the EFF-rebuild solution of FasterTransformer, in which the redundant computation of the MLP modules is eliminated, and Ours denotes the model with our unified solution applied. All experiments presented here use single-precision.
Figure
16 compares the model inference latency of the unified solution and the EFF-rebuild solution on the eight corpora. Unlike in the modular evaluation of the fine-grained approach, the latency of the baseline depends not only on the \(seq\_len\) but also on the average length of the corpus, because the computation amount of the MLP module is determined by the average length.
The comparison of the latency between Fast and Ours reveals that, on most corpora, our unified solution offers a significant performance improvement, particularly when there are numerous short sequences. However, for WNLI the optimization effect is negative, because its average sequence length is long relative to its preset \(seq\_len\), leaving little redundant computation to eliminate. It is important to note that the fine-grained approach used in our unified solution is less hardware-efficient than cublasGemmStridedBatchedEx in cublas; our main improvement originates from the elimination of redundant computation. On average, our unified solution reduces latency by 28.1\(\%\) compared with the baseline implementation on FasterTransformer.
In addition to the gap between the average sequence length and \(seq\_len\), we find that our unified solution performs better on corpora with larger \(seq\_len\). For instance, in the modular evaluation of the fine-grained approach, QQP and QNLI see more than a 70\(\%\) reduction in latency, compared with the 50\(\%\) reduction in the overall evaluation. This is because the computation amount of the linear functions grows linearly with the sequence length, whereas that of the \(Q \times K^T\) and \(QK^T \times V\) functions grows quadratically. Consequently, the self-attention module accounts for a larger portion of the computation in corpora with larger \(seq\_len\), making our optimization more effective.
5.6 Summary
By removing a huge amount of redundant computation, our fine-grained approach largely improves the practical performance of the self-attention module for most corpora. The block-organized approach reduces the layout switch overhead from the whole-model view, and the chunk-based approach effectively exploits the memory behavior of the preceding approaches to significantly reduce the memory footprint. Compared with prevailing frameworks, our unified solution decreases the average latency by 28\(\%\) for the entire model and 63.8\(\%\) for the self-attention module, and reduces the memory footprint of intermediate results by 7.8\(\times\) across eight corpora of the GLUE benchmark.
6 Related Work
In recent years, there has been growing interest in optimizing both transformer training and inference, particularly in industry. As this study focuses on optimizing transformer inference, we only present the works relevant to transformer inference optimization.
Although training frameworks, such as PaddlePaddle [
16] and TensorFlow [
1], are capable of executing deep learning inference, their inability to fully utilize advanced hardware for inference workloads has led to the development of inference-specific frameworks. General inference frameworks, such as TensorRT [
27], TVM [
5], and XLA [
26], have been developed. They abstract a target model into a computation graph to schedule the optimal execution order and fuse kernels. With the increasing demand for transformer models, several transformer-specific inference frameworks and methods have been proposed. NVIDIA started the FasterTransformer [21] project in 2019; it has an active open-source community that keeps updating and collecting practical optimizations. ByteTransformer [32] by ByteDance proposes a padding-rebuild method for mitigating the variable-length problem in the MLP module. It originates from the early FasterTransformer and further reduces execution time and memory consumption, especially for large batch sizes. TurboTransformer [10] by Tencent introduces new kernel fusion methods and presents the design of a transformer serving system for the first time. Although these specific inference frameworks contain many customized kernels, they all use cublas to perform the self-attention module, which cannot remove the redundant computation caused by the variable-length problem in the self-attention module. Instead of handling the variable-length problem in transformer inference, BatchMaker [11] adopts the novel cellular batching technique to reduce redundant computation in recurrent neural network inference. Further, as transformer models grow large, distributed inference has been studied in ORCA [31] and DeepSpeed [2].
More specifically, some works study self-attention module optimization. Jiang et al. [13] design NUMA-aware thread scheduling for the variable-length problem on ARM platforms. ReTransformer [30] proposes a new sub-matrix pipeline design for multi-head self-attention in processing-in-memory. E.T. [4] proposes a kernel fusion design for the attention module; however, constrained by the shared memory capacity, it only outperforms FasterTransformer in a limited range, and the related methods become unusable as the sequence length grows. Besides optimizing the computation, Synthesizer [
25], Reformer [
15], and FlashAttention [
7] improve the vanilla self-attention to require less computation or to be more expressive. FlashAttention, which is now being applied in large-scale models, is a fast and memory-efficient exact attention algorithm. It makes the attention mechanism IO-aware, using tiling to reduce the number of memory accesses between GPU high-bandwidth memory and on-chip SRAM. Similarly, Narang et al. [19] study making recurrent neural networks sparse to remove redundancy, which is a promising direction for improving transformer models.
7 Conclusion
The variable-length nature of real-world requests makes existing approaches to transformer model inference inefficient in both computation and memory, preventing serving from reaching practical high performance. In this article, we propose a unified solution to handle the heavy-tailed input of transformer inference. For this purpose, it comprises three novel approaches: a computation approach that eliminates the redundant computation of the self-attention module on GPUs, a data layout approach that better unifies the redundant computation elimination approaches of different modules, and a memory allocation approach that strikes a balance between memory footprint and allocation efficiency. On eight corpora of the GLUE benchmark, our experimental results show that the self-attention module achieves a 63.8\(\%\) average latency reduction, the entire model achieves a 28.1\(\%\) average latency reduction, and the memory footprint of intermediate results is reduced by 7.8\(\times\), compared with the popular FasterTransformer.
Moreover, although we focus on GPU architectures and propose many GPU-specific optimizations in this article, the basic idea of recording and processing only the valid block matrix multiplications can be used to improve transformer inference on many other accelerators with many-core architectures. In contrast, our solution may perform poorly on multi-core processors such as CPUs, because their parallelism is relatively low and can be saturated without batch processing. Whether our solution is applicable thus heavily depends on the parallel capability of the target processor.