1 Introduction
Transformer models are being widely deployed in real-world applications due to their remarkable accuracy on tasks such as text generation, question answering, and language translation. Transformer models, e.g.,
bidirectional encoder representations from transformers (BERT) [
14] and
generative pre-trained transformers (GPT) [
24], use the self-attention mechanism [
28] to capture the dependency between any two words in a sequence. This enables the model to learn long-range dependencies between different parts of the text.
The parallel architecture of
graphics processing units (GPUs) makes them ideal for deploying transformer models. To take full advantage of the computational power of modern GPUs, deep learning workloads are typically processed in batches, where multiple samples are processed concurrently to exploit high parallelism. This method reduces the kernel launch overhead and significantly improves hardware performance for most deep learning structures. For instance, Amazon [
23] and HuggingFace [
3] both use batch processing in their enterprise-level services to saturate the hardware and improve throughput. Despite the advantages that batch processing offers to deep learning workloads, it is inefficient for
natural language processing (NLP) tasks because input sequences vary in length, so the matrices of different samples differ in shape. A naive solution is to preset a maximum sequence length that covers all cases and pad all input sequences to this length, known as zero padding. Our statistics on
general language understanding evaluation (GLUE) [
29] benchmark reveal that sequence lengths in all corpora follow a heavy-tailed distribution, with the majority of sequences being significantly shorter than the preset maximum length. As a consequence, the amount of redundant computation can even exceed that of the valid computation. While the combination of batch processing and zero padding improves hardware performance by exploiting higher parallelism, it results in redundant computation and severely degrades practical performance.
Transformer models comprise three main modules: the word embedding module, the self-attention module, and the
multi-layer perceptron (MLP) module, which will be introduced in more detail in Section
3. In this article, we classify all linear functions into the MLP module and treat the remaining operations of the self-attention mechanism as the self-attention module. The self-attention and MLP modules are repeated multiple times in a transformer model and contribute most of the computation, and both suffer from the redundant computation problem. Therefore, our primary focus is on the computationally intensive parts of the model, i.e., the self-attention and MLP modules.
Several prevailing frameworks, such as ByteTransformer [
32], FasterTransformer [
21], and TurboTransformers [
10], have addressed the problem of redundant computation in MLP modules by using a word-accumulation approach that removes padding for the MLP modules and rebuilds it before each self-attention module. We refer to this as the EFF-rebuild solution, since it was first introduced by EffectiveTransformer. However, the EFF-rebuild solution has two shortcomings. First, it involves extra data movement for removing and rebuilding padding. Second, it does not eliminate the redundant computation in the self-attention module.
In this article, we propose a unified solution for improving both computation and memory efficiency in transformer model inference on GPUs with heavy-tailed input. The main contributions of this article are as follows:
•
We show that the input of NLP tasks often follows a heavy-tailed distribution, which leads to severe redundancy in both computation and memory during transformer inference.
•
We propose a unified solution. To improve computation efficiency, it applies the fine-grained approach to the self-attention module, the word-accumulation approach to the MLP module, and the block-organized approach to the entire model. On top of these approaches, it adopts the chunk-based approach for better memory management.
•
We propose a fine-grained approach that reduces the redundant computation in the self-attention module by indexing only the valid block matrix multiplications.
•
We propose a block-organized approach that unifies the redundant computation reduction methods of the MLP and self-attention modules by organizing the data layout of the self-attention module at block granularity.
•
We propose a chunk-based memory management approach that balances the memory footprint and allocation/free efficiency.
4 The Unified Solution
In this section, we introduce our unified solution for improving both computation and memory efficiency. To eliminate the redundancy in computation, as shown in Figure 6, the unified solution includes three approaches for the self-attention module, the MLP module, and the entire model, respectively. To eliminate the redundancy in memory, as shown in Figure 11, the unified solution adopts the chunk-based approach to balance the memory footprint and the allocation/free efficiency.
The fine-grained approach is based on block matrix multiplication and leverages potential fine-grained parallelism. Specifically, it determines the execution order of the valid mini-block matrix multiplications at the beginning and stores this order in index arrays. Using these index arrays, the GPU executes only the valid mini-block matrix multiplications in the self-attention module, eliminating most of the redundant computation. We implement the fine-grained approach using three techniques, i.e., the mini-block index (MBIX), the shared memory block transpose (SMBT), and the efficient atomic operation (EAOP).
The word-accumulation approach is not novel. Since the existing implementation of the linear function parallelizes over the word dimension rather than the batch dimension, it is intuitive to pack the sequences of a batch densely and thereby eliminate the redundant computation of the MLP module.
As MLP modules pack sequences tightly while self-attention modules arrange them loosely, a data layout switch is necessary where they connect, leading to data movement. To address this issue, the block-organized approach is introduced to enable better connectivity. With the fine-grained approach, memory usage becomes inefficient, as many padded areas never participate in computation. Therefore, the block-organized approach incorporates a novel data layout, called block padding, that pads sequences in blocks to satisfy the minimum requirement of the fine-grained approach. To switch between data layouts, two customized layout switch kernels are developed.
Benefiting from the aforementioned approaches, the memory footprint required by intermediate results changes frequently during execution. Instead of allocating sufficient memory space in advance, the chunk-based approach aims to release unnecessary memory while ensuring efficient allocation. It allocates or frees memory in chunks to enable more precise memory management. Additionally, it includes a pre-schedule method that hides the memory allocation overhead behind computation. In this way, our solution requires a smaller memory footprint most of the time and releases more memory for higher-level schedulers [6, 12] to co-locate multiple tasks on a single GPU, which is the common case in cloud scenarios.
4.1 Fine-grained Approach
By breaking the matrix multiplications of the \(Q \times K^T\) and \(QK^T \times V\) functions into multiple mini-block matrix multiplications following the block matrix multiplication in Equation (2), we can orchestrate only the valid mini-blocks for efficient computation.
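For clarity, the block decomposition we rely on is the standard identity (a restatement consistent with Equation (2), written here with our mini-block indices):
\[
(AB)_{ij} \;=\; \sum_{k} A_{ik} B_{kj},
\]
where \(A_{ik}\), \(B_{kj}\), and \((AB)_{ij}\) denote mini-blocks; only the mini-block products whose operands contain valid words need to be scheduled and executed.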
4.1.1 Mini-block Index.
Figure
7(a) illustrates the block matrix multiplication of the \(Q \times K^T\) function, where all mini-blocks have the same size and behavior. MBIX leverages two index arrays to record the execution order. Based on the valid lengths (val_len) of the input sequences, we can compute the offsets of the valid query and key mini-blocks. However, these two arrays alone lack the necessary 3D structural information to correctly position each result in the resulting matrix for write-back. To address this, MBIX builds a third offset array for the resulting matrix, which records the offset of each relevant mini-block in the \(QK^T\) matrix for writing back the results.
For a three-word sequence, the q_offset and k_offset arrays record [0,2,0, ..., 10] and [0,2,8, ..., 10], respectively, from which the corresponding offsets ([0,0,2, ..., 14]) in the resulting matrix can be calculated. During execution, the GPU first accesses the query and key offset arrays, then uses these offsets to access and process each valid mini-block pair one by one. As most values in the mini-block pairs are valid, a majority of the redundant computation is avoided. Similarly, Figure
7(b) illustrates the 2D partitioning of MBIX on the
\(Q K^T \times V\) function, leveraging the same concept.
While MBIX can eliminate a significant amount of redundant computation, it introduces new overhead, such as index array building, irregular data access, and extra memory usage. The block size used to decompose the matrix multiplication is a critical parameter for balancing redundant computation against the newly introduced overhead. A smaller mini-block better fits irregular data structures, resulting in less redundant computation. However, smaller mini-blocks produce very large index arrays, which take longer to build and consume more memory, and they also lead to poor spatial locality. Here, we empirically select a \(32\times 32\) block size for the 2D partitioning, which suffices to validate our unified solution; further engineering is required for a mature implementation.
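As an illustration of the index-building step, the host-side sketch below is our own simplification with hypothetical names and layout assumptions (row-major per-sequence storage, head_dim a multiple of the block size, a single head), not the paper's implementation; it enumerates the valid mini-block pairs of the \(Q \times K^T\) function and records their offsets, which would then be copied to the GPU.

```cuda
// Host-side sketch of MBIX offset generation for Q x K^T.
// Assumed layout: Q and K of one sequence are row-major [padded_len x head_dim],
// and QK^T of one sequence is row-major [padded_len x padded_len].
#include <vector>

constexpr int BS = 32;  // mini-block size of the 2D partitioning

struct MiniBlockIndex {
    std::vector<int> q_off;  // offset of each valid query mini-block
    std::vector<int> k_off;  // offset of each valid key mini-block
    std::vector<int> o_off;  // offset of the target mini-block in QK^T
};

MiniBlockIndex build_qk_index(const std::vector<int>& val_len, int head_dim) {
    MiniBlockIndex idx;
    int base_in = 0, base_out = 0;
    for (int len : val_len) {
        int nblk   = (len + BS - 1) / BS;   // blocks covering the valid words
        int nred   = head_dim / BS;         // reduction slices over head_dim
        int ld_out = nblk * BS;             // block-padded leading dimension of QK^T
        for (int bi = 0; bi < nblk; ++bi)             // rows of QK^T
            for (int bj = 0; bj < nblk; ++bj)         // columns of QK^T
                for (int bk = 0; bk < nred; ++bk) {   // reduction dimension
                    idx.q_off.push_back(base_in + bi * BS * head_dim + bk * BS);
                    idx.k_off.push_back(base_in + bj * BS * head_dim + bk * BS);
                    // The same output block appears once per bk, which is why the
                    // kernel accumulates atomically (Section 4.1.3 reduces this cost).
                    idx.o_off.push_back(base_out + bi * BS * ld_out + bj * BS);
                }
        base_in  += nblk * BS * head_dim;   // advance to the next (block-padded) sequence
        base_out += nblk * BS * ld_out;
    }
    return idx;
}
```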
4.1.2 Shared Memory Block Transpose.
When executing a matrix multiplication on a pair of mini-blocks, each vector in the first mini-block is multiplied with all vectors in the second mini-block, resulting in multiple accesses to each vector. Shared memory is a potent feature of modern GPUs, with significantly lower access latency and higher throughput than local and global memory. Prefetching mini-block data from global memory to shared memory is therefore desirable to minimize memory access overhead.
However, merely migrating mini-block data from global memory to shared memory yields very limited benefits. To achieve high memory bandwidth, each SM's shared memory is divided into 32 equally sized banks, matching the number of threads in a warp, so adjacent values fall in different banks. Unfortunately, as shown on the left side of Figure 8, direct copying causes memory requests from multiple threads to hit the same bank, resulting in severe bank conflicts. To solve this issue, we adjust the data layout in shared memory, ensuring that threads within the same warp access data in different banks. As illustrated on the right side of Figure 8, mini-blocks are transposed while moving data from global memory to shared memory. Threads within the same warp are no longer forced to access data in the same bank, and shared memory bank conflicts are avoided.
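The device-side sketch below (again our own, with hypothetical parameter names) combines MBIX and SMBT: each thread block processes one valid mini-block pair selected by the index arrays, stages the key tile in transposed form in shared memory so that the inner-product loop reads consecutive banks, and accumulates its partial result atomically. The one-column padding of the shared tile is a common extra trick we add to also avoid conflicts during the transposed store; it is not claimed by the paper.

```cuda
// MBIX + SMBT sketch for Q x K^T: one 32x32 thread block per valid mini-block pair.
#define BS 32

__global__ void qk_miniblock_kernel(const float* __restrict__ Q,
                                    const float* __restrict__ K,
                                    float*       __restrict__ QK,
                                    const int* __restrict__ q_off,
                                    const int* __restrict__ k_off,
                                    const int* __restrict__ o_off,
                                    int ldq, int ldk, int ldo)
{
    __shared__ float q_tile[BS][BS];
    __shared__ float kT_tile[BS][BS + 1];   // +1 column: conflict-free transposed store

    int pair = blockIdx.x;                  // which valid mini-block pair
    int row  = threadIdx.y, col = threadIdx.x;

    q_tile[row][col]  = Q[q_off[pair] + row * ldq + col];
    // SMBT: transpose the key tile while copying it from global to shared memory.
    kT_tile[col][row] = K[k_off[pair] + row * ldk + col];
    __syncthreads();

    float acc = 0.f;
    for (int i = 0; i < BS; ++i)
        acc += q_tile[row][i] * kT_tile[i][col];   // stride-1, bank-conflict-free reads

    // Several pairs (one per reduction slice) target the same output tile,
    // so the write-back accumulates atomically.
    atomicAdd(&QK[o_off[pair] + row * ldo + col], acc);
}
// Launch (illustrative): qk_miniblock_kernel<<<num_pairs, dim3(BS, BS)>>>(...);
```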
4.1.3 Efficient Atomic Operation.
In both Figures
7(a) and
7(b), multiple mini-block pairs in the 2D partitioning write their results to the same location. Hence, ensuring the correctness of the fine-grained approach requires atomic operations, which cause contention that undermines performance. EAOP aims to reduce the overhead arising from this contention.
As the query and key are identical in shape, the \(Q \times K^T\) function degenerates into a more specialized case. Figure 9 demonstrates this with a block size of \(1\times 4\), which we call 1D partitioning. With 1D partitioning, each location in the output matrix corresponds to a single mini-block pair, so the \(Q \times K^T\) function can be computed without atomic operations. Consequently, in our implementation, we empirically select \(64\times 32\) as the block size.
For the
\(QK^T \times V\) function, reducing the overhead of atomic operations involves reorganizing the offset generation and enlarging the write interval between values intended to be written to the same location. The offset generation function consists of several loops in different dimensions. For example, in Figure
7(b), mini-blocks from the blue matrix in the first row are multiplied with mini-blocks from the yellow matrix in the first column. Two such mini-block pairs write their results to the same location, generating contention. To improve this, we can modify the offset order, setting it to [0,12,2,14,0,12,2,14] and [0,0,8,8,2,2,10,10]. Enlarging the write interval between dependent values effectively relieves the contention, and the interval can be further increased by introducing loops over other dimensions, such as the batch dimension, to suppress contention even more effectively.
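To make the reordering concrete, the sketch below is our own illustration with hypothetical names; it generates the \(QK^T \times V\) pair list with the reduction-slice loop outermost, so that entries accumulating into the same output mini-block are separated by a full pass over the output blocks rather than being adjacent (the concrete offset values in the example above come from the layout of Figure 7(b) and will differ).

```cuda
// EAOP-style offset ordering for QK^T x V: the same o_off value repeats only
// once every n_row * n_col entries, enlarging the write interval between
// atomicAdd operations that target the same output mini-block.
#include <vector>

struct PairList { std::vector<int> a_off, b_off, o_off; };

PairList build_av_index(int n_row, int n_col, int n_red,
                        int lda, int ldb, int ldo, int bs = 32) {
    PairList p;
    for (int bk = 0; bk < n_red; ++bk)            // reduction slices outermost
        for (int bi = 0; bi < n_row; ++bi)
            for (int bj = 0; bj < n_col; ++bj) {
                p.a_off.push_back(bi * bs * lda + bk * bs);   // block (bi, bk) of QK^T
                p.b_off.push_back(bk * bs * ldb + bj * bs);   // block (bk, bj) of V
                p.o_off.push_back(bi * bs * ldo + bj * bs);   // block (bi, bj) of the output
            }
    return p;
}
```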
4.1.4 Half-precision on Tensor Core.
Despite some loss of accuracy, lower-precision methods are widely studied for inference tasks. They represent weights and activations with lower precision to reduce the memory footprint and speed up inference. In addition to single-precision on CUDA cores, we further extend the fine-grained approach to half-precision and adopt the latest tensor core architecture of NVIDIA GPUs.
A tensor core contains a \(4\times 4\times 4\) matrix processing array that performs the operation \(D = A \times B + C\) in half-precision, where these matrices have a size of \(4\times 4\). Tensor cores are manipulated through the warp matrix multiply accumulate (WMMA) API, which combines multiple tensor core operations and is implemented at the warp level. The WMMA API loads a \(16\times 16\) matrix from another memory space into registers using the \(load\_matrix\_sync()\) function, performs matrix multiplication and accumulation on a pair of \(16\times 16\) matrices using the \(mma\_sync()\) function, and moves the resulting product back using the \(store\_matrix\_sync()\) function.
The fine-grained approach also fits the tensor core's half-precision computation; the difference lies in how each pair of mini-blocks is processed. Since we choose \(32\times 32\) as the block size in the fine-grained approach, our goal is to decompose the \(32\times 32\) matrix multiplication at the granularity of \(16\times 16\) to fit the WMMA API. A single \(32\times 32\) matrix multiplication can be decomposed into 8 \(16 \times 16\) matrix multiplications, leading to 8 WMMA API calls. As shown in Figure 10, every two matrix multiplications should be accumulated together. To accelerate data loading, we also move the \(32\times 32\) matrices (A and B) to shared memory following SMBT. Although two resulting matrices are accumulated in the same place, there is no contention, since the multiple WMMA operations are executed serially within a warp.
Furthermore, we adopt a pipeline method to reduce the data movement overhead between shared memory and registers. Each pair of matrices requires loading data (A, B, and C), performing the computation (\(D = A \times B + C\)), and then offloading the result (D). Figure 10 shows that the matrix multiplication results of two pairs of matrices are accumulated together, so we pipeline these two pairs: after loading the data of the first pair, the computation starts while the data of the next pair is loaded simultaneously. Moreover, we can keep the resulting matrix in registers, without using any other memory space for the accumulation. Consequently, the pipeline method (1) hides data movement overhead behind computation and (2) reduces memory movement by storing the resulting matrix directly in registers.
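The warp-level sketch below is our own, assuming the A and B tiles are stored row-major in shared memory with a leading dimension of 32; it shows how one \(32\times 32\) mini-block product decomposes into the 8 WMMA calls described above while each \(16\times 16\) output tile stays in registers across its two accumulation steps. The double-buffered pipeline across pairs is omitted for brevity.

```cuda
// Decomposing a 32x32 half-precision mini-block product into 16x16 WMMA tiles.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__device__ void mma_32x32(const half* A, const half* B, half* C) {
    const int LD = 32;  // leading dimension of the 32x32 tiles
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc;

    for (int ti = 0; ti < 2; ++ti)            // four 16x16 output tiles ...
        for (int tj = 0; tj < 2; ++tj) {
            wmma::fill_fragment(acc, __float2half(0.0f));
            for (int tk = 0; tk < 2; ++tk) {  // ... each accumulates two products
                wmma::load_matrix_sync(a_frag, A + ti * 16 * LD + tk * 16, LD);
                wmma::load_matrix_sync(b_frag, B + tk * 16 * LD + tj * 16, LD);
                wmma::mma_sync(acc, a_frag, b_frag, acc);  // result stays in registers
            }
            wmma::store_matrix_sync(C + ti * 16 * LD + tj * 16, acc, LD,
                                    wmma::mem_row_major);
        }
}
```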
4.2 Block-organized Approach
In the word-accumulation approach of the MLP module, the dense arrangement data layout is used, as shown in Figure
11. The dense arrangement has no padding between sequences and can be processed directly by GEMM without redundant computation. However, the densely arranged data layout is not compatible with our fine-grained approach, because mini-blocks on sequence boundaries would intrude into values that are invalid for their sequence and produce incorrect results. To support the fine-grained approach, we can use the batch padding data layout, in which all sequences in a batch are padded to the same length. Setting the word-accumulation approach aside, batch padding does not introduce any computation overhead.
However, the densely arranged layout of the MLP module conflicts with the batch padding layout of the self-attention module where they connect. Integrating these two data layouts requires continuously switching between them, thus introducing data movement costs. To mitigate these costs, we design a block padding layout for the self-attention module. As shown in Figure
11, the block padding layout is organized at block granularity, where only the values required by the self-attention module's fine-grained approach are padded. Thus, only the partial block at the end of a sequence, whose length is not a multiple of the block size, is padded up to the block boundary, making the padding area far smaller than that of the batch padding layout. Although the block padding layout still requires data movement to become compatible with the preceding densely arranged layout, the overhead is largely reduced. Furthermore, the peak memory demand is lower than that of the batch padding approach, because values are arranged more densely. The block-organized approach requires customizing both the offset generation function and the data layout switch function. Here, we perform the offset generation on the CPU serially.
Regarding the layout switch function, both the FasterTransformer [
21] and TurboTransformer [
10] frameworks fuse it with the transpose operation, which significantly reduces the kernel launch and data movement overhead. TurboTransformer uses
\(batch\_size \times head\_num \times seq\_len\) GPU warps to implement it and decides in each thread whether to move a value. FasterTransformer, in contrast, focuses on the valid values, building a prefix-sum array over the valid words to reduce redundant data movement. Our implementation mainly follows the approach used in FasterTransformer. To implement the layout switch logic, we build an index array that records block offsets instead of the word offsets used in FasterTransformer. This index array is considerably smaller than FasterTransformer's, and each warp is responsible for moving one block. The reverse process, which switches the block padding layout back to the densely arranged layout, follows the same principles.
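A simplified version of such a block-wise switch kernel is sketched below (our own, with hypothetical array names, and without the transpose that the real kernels fuse in): each 32-thread block copies one data block from the densely arranged layout to the block padding layout and zero-fills the padded tail, guided by a per-block index built on the host.

```cuda
// Dense -> block-padded layout switch; one warp (a 32-thread block here)
// moves one block of BS words, each holding `hidden` values.
#define BS 32

__global__ void dense_to_block_padded(const float* __restrict__ src,
                                      float*       __restrict__ dst,
                                      const int* __restrict__ src_word,  // first word of the block in the dense layout
                                      const int* __restrict__ dst_word,  // first word of the block in the padded layout
                                      const int* __restrict__ n_valid,   // valid words in the block (<= BS)
                                      int hidden)
{
    int blk   = blockIdx.x;
    int lane  = threadIdx.x;                 // 0..31
    int valid = n_valid[blk];

    for (int w = 0; w < BS; ++w) {
        float* d = dst + (size_t)(dst_word[blk] + w) * hidden;
        if (w < valid) {
            const float* s = src + (size_t)(src_word[blk] + w) * hidden;
            for (int h = lane; h < hidden; h += 32) d[h] = s[h];
        } else {
            for (int h = lane; h < hidden; h += 32) d[h] = 0.f;  // block padding
        }
    }
}
// Launch (illustrative): dense_to_block_padded<<<num_blocks, 32>>>(src, dst, src_word, dst_word, n_valid, hidden);
```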
4.3 Chunk-based Approach
In addition to optimizing computation efficiency, memory efficiency is also critical for an inference runtime system. In the case of fixed-length input, the intermediate tensors are also fixed in size, so we can determine the lifecycles of all intermediate tensors and schedule them in advance, effectively avoiding memory allocation overhead. However, with the computational approaches of the unified solution, the sizes of intermediate tensors keep changing during the serving process under variable-length input. Although we could determine the upper limit of the memory footprint and reserve a large enough memory space in advance, this wastes considerable memory, especially when the input lengths follow a heavy-tailed distribution. Conversely, if we re-allocate the optimal memory size in each run, the memory allocation overhead severely impacts the execution speed. Thus, we propose the chunk-based approach to balance the memory footprint and allocation/free efficiency.
Memory management for fixed-length input has been studied in PyTorch [
22] and TFLite [
8]. They exploit the computation graph to analyze the lifecycles of all intermediate tensors; tensors that do not coexist in time can share the same memory space. During the first inference, the system incrementally allocates memory for intermediate tensors until the maximum memory requirement is reached. Afterward, the memory is cached in the system and reused in subsequent inferences. However, for variable-length input, this incremental allocation method cannot achieve optimal memory usage, since it does not free memory when new inputs become smaller.
The chunk-based approach organizes memory in chunks to enhance the efficiency of allocation and freeing. By reusing most memory chunks, the allocator only needs to allocate or free a small amount of memory as the input size changes. To determine the size and number of chunks, the chunk-based approach employs a scheduling algorithm. Once the sequence lengths of a batch are known, the scheduling algorithm simulates the entire execution process and produces the memory plan for this batch based on the computation graph. As shown in Figure 12, for each batch, the algorithm schedules the tensor placement based on the computation graph and the existing chunks, from which we can determine how many chunks to free or allocate for the next batch.
The scheduling algorithm seeks to place intermediate tensors in existing chunks while minimizing the memory footprint. It first sorts the intermediate tensors of a batch by memory size in non-increasing order. Then, for each tensor in this list, it checks whether the allocated chunks can satisfy both the memory space and the time window, which is essentially a 2D strip packing problem. Here, we execute the Greedy-by-Size algorithm [
8,
10] multiple times to check whether there is enough space in the chunk list. If an existing chunk can satisfy the tensor, the tensor is assigned the corresponding offset; otherwise, a new chunk is allocated and appended to the chunk list. The size of a new chunk is determined by two hyperparameters: a fixed scaling factor applied to the tensor size and a minimum size requirement. We also exploit layer similarity to reduce the complexity of the scheduling algorithm.
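The placement logic can be sketched as follows; this is a host-side simplification of our own, in which the chunk-sizing constants, field names, and the candidate-offset heuristic are illustrative rather than the exact Greedy-by-Size variant, and the layer-similarity shortcut is omitted.

```cuda
// Greedy placement of intermediate tensors into memory chunks (simplified).
#include <algorithm>
#include <cstddef>
#include <vector>

struct Tensor { size_t size; int start, end; size_t offset = 0; int chunk = -1; };
struct Placed { size_t offset, size; int start, end; };
struct Chunk  { size_t size; std::vector<Placed> placed; };

// Two placements conflict only if they overlap both in lifetime and in address range.
bool conflicts(const Placed& p, const Tensor& t, size_t off) {
    bool time  = p.start < t.end && t.start < p.end;
    bool space = p.offset < off + t.size && off < p.offset + p.size;
    return time && space;
}

void schedule(std::vector<Tensor>& tensors, std::vector<Chunk>& chunks,
              size_t min_chunk = 32u << 20, double scale = 1.5) {
    // Sort tensors by size in non-increasing order (greedy-by-size).
    std::vector<Tensor*> order;
    for (auto& t : tensors) order.push_back(&t);
    std::sort(order.begin(), order.end(),
              [](const Tensor* a, const Tensor* b) { return a->size > b->size; });

    for (Tensor* t : order) {
        // Try to reuse an existing chunk: candidate offsets are 0 and the end
        // of every placement already in the chunk.
        for (size_t c = 0; c < chunks.size() && t->chunk < 0; ++c) {
            std::vector<size_t> cand = {0};
            for (const auto& p : chunks[c].placed) cand.push_back(p.offset + p.size);
            std::sort(cand.begin(), cand.end());
            for (size_t off : cand) {
                if (off + t->size > chunks[c].size) continue;
                bool ok = true;
                for (const auto& p : chunks[c].placed) ok = ok && !conflicts(p, *t, off);
                if (ok) {
                    t->chunk = (int)c; t->offset = off;
                    chunks[c].placed.push_back({off, t->size, t->start, t->end});
                    break;
                }
            }
        }
        if (t->chunk < 0) {  // no existing chunk fits: allocate a new one
            size_t sz = std::max(min_chunk, (size_t)(t->size * scale));
            chunks.push_back({sz, {{0, t->size, t->start, t->end}}});
            t->chunk = (int)chunks.size() - 1;
            t->offset = 0;
        }
    }
}
```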
The chunk-based approach further enhances performance by overlapping the memory allocation overhead with computation. As shown in Figure 12, we extract and visualize the memory placement during the execution phase of the self-attention module. Based on the chunks of batch 0, the scheduling algorithm determines that one chunk can be freed for processing batch 1; based on the chunks of batch 1, a new chunk should be allocated for batch 2. Requests arrive randomly in real-world serving systems, so a scheduling thread can run the scheduling algorithm as soon as it observes the sequence lengths of a new batch, and it can also perform the chunk memory allocation in advance. After the previous batch finishes, the chunk-based approach frees any extra chunks or appends the newly allocated chunks to the chunk list.
5 Evaluation
We demonstrate the effectiveness of our unified solution in two steps. First, we present modular evaluations of the three approaches. Then, we integrate the unified solution into a prevailing transformer inference system and evaluate the overall latency improvement.
5.1 Experiment Setup
For the modular evaluations, we present the effectiveness of the fine-grained approach, the block-organized approach, and the chunk-based approach. For the fine-grained approach, we evaluate the average latency of the self-attention module, i.e., the \(Q \times K^T\) and \(QK^T \times V\) functions, across an entire corpus. As the baseline, we take the corresponding approach of FasterTransformer and TurboTransformer, both of which implement the self-attention module with the cublasGemmStridedBatchedEx kernel of the cublas library. The configuration is the same as in FasterTransformer: CUBLAS_GEMM_ALGO0 is used for the \(Q \times K^T\) function and CUBLAS_GEMM_ALGO1 for \(QK^T \times V\). In particular, we show the performance of the fine-grained approach in both half- and single-precision, since it is the only approach that directly improves computation.
For the modular evaluation of the block-organized approach, we evaluate the switching time between the densely arranged layout and the block padding layout, while the switch between the densely arranged layout and the batch padding layout serves as the baseline. The two different implementations in FasterTransformer and TurboTransformer are both compared.
For the modular evaluation of the chunk-based approach, we evaluate the memory footprint of the entire model's intermediate results and report the average cost of the scheduling algorithm. The baseline is the memory management method used in PyTorch [
22] and TFLite [
8], which incrementally grows a cache of memory and reassigns the cached memory to later executions.
For our overall evaluation, we pick the EFF-rebuild solution, which eliminates redundant computation in MLP modules, as our baseline and implement it using FasterTransformer. We then apply our unified solution to this baseline to compare computation efficiency. The changes include replacing the cublasGemmStridedBatchedEx kernels with our fine-grained kernels, constructing the offset generation functions, adopting block-organized padding for the self-attention module, adopting the new data layout switch method, and adopting the chunk-based allocator for memory management. Note that the index arrays only need to be built once for all self-attention modules in a transformer-based model.
The experiments are conducted on a node equipped with an Intel Xeon Silver 4208 CPU and a Tesla V100 GPU (16 GB). We use Ubuntu 18.04, GCC 7.5.0, and CUDA 11.3 as the running environment. For our fine-grained approach in single-precision, we take both BERT-base and BERT-large [
14] configurations. For the other evaluations, we only present the results of BERT-base, as model scaling does not affect the time breakdown for either our solution or the baseline. We select eight corpora from the well-known GLUE benchmark, specifically CoLA, SST-2, WNLI, STS-B, QQP, MNLI, QNLI, and RTE. The preset sequence lengths (
seq_len) used are shown in Table
3, which are derived from corpus statistics after removing outlier sequence lengths. To ensure reliability, we repeat each case three times and report the average result.
5.2 Evaluation of the Fine-grained Approach
Figure
13 illustrates the latency of executing
\(Q \times K^T\) and
\(QK^T \times V\) functions on eight corpora. We develop two host functions to create two sets of offsets for the
\(Q \times K^T\) and
\(QK^T \times V\) functions, and then copy the offsets from host to device. Relative to the kernel execution time, the offset generation process accounts for approximately 5%. Since all self-attention modules in a transformer-based model behave in the same way and have the same structure, the offset generation process only needs to be executed once per batch, contributing minimally to the total execution time. Moreover, both the generation and the host-to-device copy of the offsets can be overlapped with the word embedding module during practical inference, so they are excluded here.
Regarding the cublas implementation shown in Figure
13, corpora with the same
\(seq\_len\) have comparable latency. As
\(seq\_len\) doubles, the latency increases significantly, because the computation amount grows quadratically with \(seq\_len\). In contrast, the latency of our fine-grained kernels mainly depends on the average sequence length of a corpus. For instance, when comparing WNLI with STS-B, the cublas latency on WNLI is much lower than that on STS-B, since WNLI has a shorter \(seq\_len\); however, the fine-grained kernels' latency on WNLI is similar to that on STS-B. Furthermore, in some scenarios the fine-grained kernels' latency on WNLI even exceeds that of the cublas kernels, resulting in negative optimization.
Comparing Figures
13(a) with
13(b), we observe that the latency in the BERT-large configuration increases, while the ratio between the fine-grained approach and the cublas approach remains stable. This is because the self-attention module of BERT-large differs from that of BERT-base only in the number of heads. Additionally, when the batch size increases, both approaches improve throughput by saturating the hardware in most cases, and this trend gradually weakens once the batch size is already large. As a result, our fine-grained approach decreases the self-attention module's latency by up to 87.8% on the QQP corpus and by 63.9% on average across the eight GLUE corpora relative to the cublas approach.
Figure 13(c) illustrates the half-precision comparison between the fine-grained approach and the cublas approach. Comparing the half-precision performance with the preceding single-precision results, it is evident that half-precision computation on tensor cores indeed achieves considerable acceleration. Specifically, by switching to half-precision, the cublas implementation delivers an average acceleration of 2.98
\(\times\), while the fine-grained approach yields a comparatively lower acceleration ratio of 2.24
\(\times\). However, both acceleration ratios fall well below the potential speedup of the Tesla V100 GPU, whose single-precision performance is 14 TFLOPS and whose half-precision (tensor core) performance is 112 TFLOPS, an eightfold ratio. Many factors, such as memory bandwidth, can cause this disparity, and they may also explain why our fine-grained approach achieves lower acceleration: the data volume is halved while the memory bandwidth stays unchanged, yet the peak hardware performance grows eightfold from CUDA cores to tensor cores. This asymmetric change in memory bandwidth and computation requirements leads to the lower acceleration of the fine-grained approach. For the WNLI corpus, our fine-grained approach becomes worse than the cublas approach. Even so, we believe our approach remains effective on the tensor core architecture.
5.3 Evaluation of the Block-organized Approach
In Figure
14, we present the modular evaluation of the block-organized approach. Ours denotes our switch between the densely arranged layout and the block padding layout. Fast denotes the switch between the densely arranged layout and the batch padding layout implemented in FasterTransformer, and Turbo denotes the corresponding switch implemented in TurboTransformer.
Regarding the switching function in the block-organized approach, we observe that the time gradually increases as the batch size doubles, and the upward trend becomes more evident as the workload grows. When the batch size increases from 16 to 32 on CoLA and SST-2, the time only increases by 1.05\(\times\) and 1.16\(\times\), respectively, whereas on the MNLI, QNLI, and RTE corpora it increases by 1.36\(\times\), 1.4\(\times\), and 1.48\(\times\), respectively. In fact, when we keep increasing the batch size, for example from 512 to 1024, the time starts to double. This is because each GPU thread in our implementation moves a column of values, and the number of GPU warps cannot saturate the device when the workload is small.
As shown in Figure
14, Turbo exhibits the worst performance in almost all cases. This is because TurboTransformer's implementation computes the offsets inside the kernel function and uses conditional statements to determine whether a value is redundant. As a result, corpora with the same \(seq\_len\) have comparable switching times in TurboTransformer, since its performance primarily depends on the preset length.
Furthermore, we notice that the performance of Ours and Fast primarily depends on the average length: corpora with longer average lengths take more time. FasterTransformer's implementation also builds a prefix-sum index and records the number of valid words. As a result, both methods deal with fewer redundant values, making the performance proportional to the average length of a corpus. Moreover, our method achieves better data layout switch performance. This is because, in FasterTransformer, each GPU warp is responsible for moving only one valid word, so each thread is assigned less work; it also needs to build a more detailed prefix-sum index and transfer more data between host memory and device memory. In comparison, each GPU warp moves an entire block in our approach.
5.4 Evaluation of the Chunk-based Approach
For the modular evaluation of the chunk-based approach, we analyze the memory footprint and the newly introduced time cost at the level of the entire BERT model. From the perspective of the memory footprint, Figure
15 shows the memory footprint of 30 consecutive batches when processing the eight corpora with batch sizes of 8 and 16. The chunk-based approach achieves a significant reduction in memory footprint, which fluctuates as sequence lengths change. As with the computation optimization, the effect is highly dependent on the distribution of each corpus. We compute the average memory footprint of each corpus relative to the baseline. The results show that, for the WNLI corpus, the chunk-based approach requires 30.1
\(\%\) and 28.0
\(\%\) of the baseline memory footprint for batch sizes 8 and 16, while for the QQP corpus, it requires only 8.4
\(\%\) and 8.2
\(\%\) of the baseline memory footprint for batch sizes 8 and 16. This is because, although WNLI and QQP have similar average lengths, their
\(seq\_len\) values differ drastically, with the
\(seq\_len\) of QQP far exceeding that of WNLI. In addition, Figure
15 shows that even corpora with highly volatile memory footprints experience periods in which the footprint remains unchanged for several batches. During these steady periods, the sequence length sums on the x-axis are similar but not identical, indicating that the chunk-based approach reduces the number of memory allocations.
From the perspective of memory allocation cost, the chunk-based approach continuously incurs scheduling time and memory allocation/free time throughout the serving process, which introduces overhead. Table 4 presents the total time cost, the number of allocations, and the total number of batches for each corpus. The ratio between batch and allocation counts varies widely across corpora, which is reflected in the trend shown in Table 4. Further, our measurements indicate that the performance cost of the chunk-based approach accounts for only 0.10\(\%\)–1.05\(\%\) of the overall model inference time across these corpora. Moreover, if a new batch arrives during the execution of the previous batch, the memory management time can be overlapped with computation; after the previous batch completes, the allocator can provide the memory for the current batch simply by adding or removing pointers in the chunk list. Therefore, despite the overhead of active scheduling and memory allocation, the chunk-based approach achieves a significant memory footprint reduction with very limited overhead by effectively exploiting the memory behavior of the preceding approaches. Moreover, on all GLUE corpora the memory footprint stays at a relatively low level most of the time, which is useful when multiple tasks are co-located and uniformly scheduled.
5.5 Overall Evaluation of the Unified Solution
Figure
16 presents the overall evaluation. Fast denotes the model executed with the EFF-rebuild solution of FasterTransformer, in which the redundant computation of the MLP modules is eliminated, and Ours denotes the model with our unified solution applied. All experiments presented here use single-precision.
Figure
16 compares the model inference latency of the unified solution and the EFF-rebuild solution on the eight corpora. Unlike in the modular evaluation of the fine-grained approach, the latency of the baseline depends not only on the \(seq\_len\) but also on the average length of the corpus, because the computation amount of the MLP module is determined by the average length.
The comparison of the latency between Fast and Ours reveals that, on most corpora, our unified solution offers a significant performance improvement, particularly when there are numerous short sequences. However, for WNLI the optimization effect is negative, because its average sequence length is long relative to its preset \(seq\_len\), leaving little redundant computation to eliminate. It is important to note that the fine-grained approach used in our unified solution is less hardware-efficient than cublasGemmStridedBatchedEx in cublas; our main improvement originates from the elimination of redundant computation. On average, our unified solution reduces latency by 28.1\(\%\) compared with the baseline implementation on FasterTransformer.
In addition to the gap between the average sequence length and \(seq\_len\), we find that our unified solution performs better on corpora with larger \(seq\_len\). For instance, in the modular evaluation of the fine-grained approach, QQP and QNLI see more than a 70\(\%\) reduction in latency, compared with the 50\(\%\) reduction in the overall evaluation. This is because the computation amount of the linear functions grows linearly with the sequence length, whereas that of the \(Q \times K^T\) and \(QK^T \times V\) functions grows quadratically. Consequently, the self-attention module accounts for a larger portion of the computation in corpora with larger \(seq\_len\), making our optimization more effective.
5.6 Summary
By removing a huge amount of redundant computation, our fine-grained approach largely improves the practical performance of the self-attention module for most corpora. The block-organized approach reduces the layout switch overhead from the whole-model view, and the chunk-based approach effectively exploits the memory behavior of the preceding approaches to significantly reduce the memory footprint. Compared with prevailing frameworks, our unified solution decreases the average latency by 28\(\%\) for the entire model and 63.8\(\%\) for the self-attention module, and reduces the memory footprint of intermediate results by 7.8\(\times\) across eight corpora of the GLUE benchmark.
6 Related Work
In recent years, there has been growing interest in optimizing both transformer training and inference, particularly in industry. As this study focuses on optimizing transformer inference, we only present the works relevant to transformer inference optimization.
Although training frameworks, such as PaddlePaddle [
16] and TensorFlow [
1], are capable of executing deep learning inference, their inability to fully utilize advanced hardware for inference workloads has led to the development of inference-specific frameworks. General inference frameworks, such as TensorRT [
27], TVM [
5], and XLA [
26], have been developed. They abstract a target model into a computation graph to schedule the optimal execution order and fuse kernels. With the increasing demand for transformer models, several transformer-specific inference frameworks and methods have been proposed. NVIDIA started the FasterTransformer [21] project in 2019; it has an active open-source community that keeps updating and collecting practical optimizations. ByteTransformer [32] by ByteDance proposes a padding-rebuild method for mitigating the variable-length problem in the MLP module. It originates from the early FasterTransformer and further reduces execution time and memory consumption, especially for large batch sizes. TurboTransformer [10] by Tencent introduces new kernel fusion methods and presents the design of a transformer serving system for the first time. Although these specific inference frameworks contain many customized kernels, they all use cublas to perform the self-attention module, which cannot remove the redundant computation caused by the variable-length problem in the self-attention module. Instead of handling the variable-length problem in transformer inference, BatchMaker [11] adopts the novel cellular batching technique to reduce redundant computation in recurrent neural network inference. Further, as transformer models grow large, distributed inference has been studied in ORCA [31] and DeepSpeed [2].
More specifically, some works study self-attention module optimization. Jiang et al. [13] design NUMA-aware thread scheduling for the variable-length problem on ARM platforms. ReTransformer [30] proposes a new sub-matrix pipeline design for multi-head self-attention in processing-in-memory. E.T. [4] proposes a kernel fusion design for the attention module; however, constrained by the shared memory capacity, it only outperforms FasterTransformer in a limited range, and the related methods become unusable as the sequence length grows. Besides optimizing the computation, Synthesizer [
25], Reformer [
15], and FlashAttention [
7] improve the vanilla self-attention to require less computation or to be more expressive. FlashAttention, which is now being applied in large-scale models, is a fast and memory-efficient exact attention algorithm. It makes the attention mechanism IO-aware, using tiling to reduce the number of memory accesses between GPU high-bandwidth memory and on-chip SRAM. Similarly, Narang et al. [19] study making recurrent neural networks sparse to remove redundancy, which is a promising direction for improving transformer models.
7 Conclusion
The variable-length nature of real-world requests makes existing approaches to transformer model inference inefficient in both computation and memory, preventing serving from reaching practical high performance. In this article, we propose a unified solution to handle the heavy-tailed input of transformer inference. For this purpose, it comprises three novel approaches: a computation approach that eliminates the redundant computation of the self-attention module on GPUs, a data layout approach that better unifies the redundant computation elimination approaches of different modules, and a memory allocation approach that strikes a balance between memory footprint and allocation efficiency. On eight corpora of the GLUE benchmark, our experimental results show that the self-attention module achieves a 63.8\(\%\) average latency reduction, the entire model achieves a 28.1\(\%\) average latency reduction, and the memory footprint of intermediate results is reduced by 7.8\(\times\), compared with the popular FasterTransformer.
Moreover, although we focus on GPU architectures and propose many GPU-specific optimizations in this article, the basic idea of recording and processing only the valid block matrix multiplications can be used to improve transformer inference on many other accelerators with many-core architectures. In contrast, our solution may perform poorly on multi-core processors such as CPUs, because their parallelism is relatively low and can be saturated without batch processing. Whether our solution is applicable thus heavily depends on the parallel capability of the target processor.