1 Introduction
Recently, deep neural network (DNN)-based natural language processing (NLP) models have been developing rapidly, bringing breakthroughs to the field of NLP. There are two types of DNN-based NLP models, and both show high accuracy in various NLP tasks. For example, recurrent neural network-based models [30] are good at speech recognition [16]. Meanwhile, attention-based models show good performance on tasks such as question answering and language modeling.
There are various types of services that use NLP models. Some NLP tasks have strict real-time constraints (i.e., real-time interactive services [11]), while others are performed offline with relaxed latency requirements (e.g., text summarization). Real-time NLP tasks require fast batch-1 inference for immediate responses. However, accelerating NLP models at a single batch is challenging due to the following characteristics of the models: (1) diverse and complex operations, (2) a wide range of dimensions, (3) various parameter configurations, and (4) heterogeneous vector operations. In this article, we focus on accelerating NLP tasks for real-time interactive NLP services.
These characteristics incur three challenges: non-negligible latency of vector operations, a wide range of dimensions with irregular matrix operations, and heterogeneity of vector operations. First, challenge 1: we need to reduce the overhead of vector operations. Due to characteristic (1), NLP models contain many complex vector operations, which result in non-negligible overhead. Second, challenge 2: we should cover a wide range of dimensions and deal with irregular matrix operations. Due to characteristics (2) and (3), NLP models' sizes and dimensions span a very wide range, and the models contain many irregular matrix operations. Last, challenge 3: we should handle the heterogeneity of vector operations because, as characteristic (4) indicates, NLP models consist of vector operations of different types, orders, and lengths.
However, existing accelerators [1, 11, 12, 13, 26] cannot solve these three challenges. For example, GPUs are throughput-oriented, so they show low utilization when running NLP models at a single batch or when running models with small dimensions. Other works design ASICs for specific target models, but an ASIC built for a particular model and configuration [13] performs poorly on different models or on the same model with different configurations. Therefore, prior approaches degrade the performance of NLP models they do not originally target. They also fail to support various NLP models because they lack the functional units that some models require.
In this article, we propose FlexRun, an FPGA-based modular architecture approach to solve the three challenges of accelerating NLP models. FlexRun exploits the high reconfigurability of FPGAs to dynamically adapt the architecture to the target model and its configuration. FlexRun includes three main schemes, FlexRun:Architecture, FlexRun:Algorithm, and FlexRun:Automation.
First, FlexRun:Architecture is an FPGA-based flexible base architecture template. Our base architecture template alleviates the overhead of vector operations by adopting a deeply pipelined architecture, resolving challenge 1. Most importantly, it consists of parameterized pre-defined basic modules so we can configure the architecture to fit the input model and its configuration.
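To make the idea of parameterized, pre-defined modules concrete, the following is a minimal sketch of how such a template could be captured in software before being lowered to FPGA logic. All class names, parameters, and default values here are illustrative assumptions on our part, not FlexRun's actual interface.

```python
# A minimal sketch, using hypothetical names, of a parameterized base
# architecture template; illustrative only, not FlexRun's actual code.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MatrixUnitConfig:
    rows: int = 16    # parallel dot-product lanes
    cols: int = 16    # elements consumed per lane per cycle
    tiles: int = 4    # independent tiles of the matrix multiplication unit

@dataclass
class VectorUnitConfig:
    # The types, order, and number of vector operators are template
    # parameters; this default chain is only an example.
    operators: List[str] = field(
        default_factory=lambda: ["add", "mul", "exp", "reduce"])

@dataclass
class ArchTemplate:
    matrix_unit: MatrixUnitConfig
    vector_unit: VectorUnitConfig

    def describe(self) -> str:
        mu, vu = self.matrix_unit, self.vector_unit
        return (f"matrix unit {mu.rows}x{mu.cols}x{mu.tiles}, "
                f"vector chain {' -> '.join(vu.operators)}")

# Instantiate one candidate configuration of the template.
print(ArchTemplate(MatrixUnitConfig(32, 16, 2),
                   VectorUnitConfig(["add", "exp", "reduce"])).describe())
```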
Next, we suggest FlexRun:Algorithm, a set of design space exploration algorithms that finds the optimal compute unit (i.e., matrix unit and vector unit) design by selecting the best modules and parameters for the input model, resolving challenges 2 and 3. For FlexRun:Algorithm, we define the design space of the base architecture template (i.e., matrix unit: the three dimensions of the matrix multiplication unit; vector unit: the types, order, and number of vector operators).
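As a rough illustration of this kind of search, the sketch below enumerates candidate matrix-unit dimensions under a resource budget and scores them with a simple tiling cost model. The cost model, candidate values, and budget are our own placeholder assumptions, not the ones used by FlexRun:Algorithm.

```python
# A rough sketch of design space exploration over the matrix unit's three
# dimensions; the cost model and candidate values are placeholder assumptions.
from itertools import product

def estimated_cycles(gemm_shapes, rows, cols, tiles):
    # Placeholder cost: number of (rows x cols x tiles) tiles needed to
    # cover each (M, K, N) matrix multiplication of the input model.
    total = 0
    for m, k, n in gemm_shapes:
        total += (-(-m // rows)) * (-(-k // cols)) * (-(-n // tiles))
    return total

def explore(gemm_shapes, resource_budget):
    best = None
    for rows, cols, tiles in product([8, 16, 32, 64], repeat=3):
        if rows * cols * tiles > resource_budget:  # exceeds compute budget
            continue
        cycles = estimated_cycles(gemm_shapes, rows, cols, tiles)
        if best is None or cycles < best[0]:
            best = (cycles, (rows, cols, tiles))
    return best

# Example: a few BERT-base-like GEMM shapes (sequence length 128, hidden 768).
bert_gemms = [(128, 768, 768), (128, 768, 3072), (128, 3072, 768)]
print(explore(bert_gemms, resource_budget=8192))
```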
Last, we propose FlexRun:Automation, an automatic tool that automates the entire flow of finding the best architecture and implementing it. FlexRun:Automation reconfigures the compute units, memory units, and interconnects according to the results of FlexRun:Algorithm. It also generates a new decoder so that instructions are properly decoded for the modified architecture.
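Continuing the hypothetical sketches above, the automated flow could be summarized roughly as follows; the decoder table here is only a simplified stand-in for the actual decoder regeneration step, and the lowering to RTL and a bitstream is elided.

```python
# A rough sketch of the automated flow, reusing ArchTemplate and explore()
# from the sketches above; the decoder table is a simplified stand-in for
# regenerating the instruction decoder of the reconfigured architecture.
def automate(gemm_shapes, resource_budget, vector_ops):
    # 1. Design space exploration picks the best matrix-unit dimensions.
    _, (rows, cols, tiles) = explore(gemm_shapes, resource_budget)
    # 2. The template is reconfigured with the chosen parameters.
    arch = ArchTemplate(MatrixUnitConfig(rows, cols, tiles),
                        VectorUnitConfig(vector_ops))
    # 3. A new decoder maps instruction opcodes onto the chosen vector chain.
    decoder_table = {op: idx for idx, op in enumerate(vector_ops)}
    return arch, decoder_table  # in practice, lowered to RTL and a bitstream

arch, decoder = automate(bert_gemms, resource_budget=8192,
                         vector_ops=["add", "exp", "reduce"])
print(arch.describe(), decoder)
```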
For evaluation, we compare FlexRun with Intel's Brainwave-like architecture [26] on FPGAs (Stratix10 GX and MX) and an equivalent GPU (Tesla V100) with tensor cores enabled. First, compared to the FPGA baseline, FlexRun achieves an average speedup of 1.59\(\times\) over various configurations of BERT. For GPT2, FlexRun achieves a 1.31\(\times\) average speedup. Next, compared to the GPU implementation, FlexRun improves performance by 2.79\(\times\) and 2.59\(\times\) for BERT and GPT2, respectively. Last, we evaluate the scalability of FlexRun by doubling the compute and memory resources of the FPGAs, modeling hypothetical next-generation FPGAs. The results show that FlexRun achieves scalable performance improvements, with an additional 1.57\(\times\) speedup over current-generation FPGAs.
7 Related Work
Various prior works accelerate NLP models, exploiting different methods.
First, there are studies that use quantization to accelerate NLP models and reduce model sizes [39, 40, 41]. References [39, 41] suggest new quantization methods that express the parameters of BERT with 3 bits. Reference [40] represents BERT's parameters with 8 bits, targeting INT8.
Similar to quantization, many studies apply pruning to NLP models. Reference [15] uses weight pruning to reduce the size of LSTMs and designs an architecture for sparse LSTMs. Reference [38] proposes block-circulant weight matrices to resolve irregularities in the neural network, in addition to pruning. For attention-based NLP models like BERT, Reference [25] proposes structured pruning, while Reference [25] uses structured dropout.
References [18, 19, 20, 21] utilize model partitioning for acceleration. Reference [19] defines parallelizable dimensions in DNNs and finds the best parallelization strategy for the target model. Reference [20] applies holistic model partitioning to all operations in attention-based NLP models. Reference [18] exploits model partitioning to accelerate large RNN models by enabling multi-FPGA execution.
Also, some works design accelerators for NLP models. Reference [13] targets the attention operations in NLP models and builds attention-specialized units. Reference [12] exploits PIM technologies to minimize the memory overhead of NLP models. References [6, 11, 26] exploit FPGAs as their hardware platforms to accelerate NLP models. However, none of these works addresses the three challenges of NLP models.
Last, References [36, 42, 43] take a modular approach to accelerating DNNs. However, these works focus on convolutional neural networks (CNNs) rather than NLP models. Reference [36] suggests a modular accelerator generator for CNNs. References [42, 43] use FPGAs to build accelerators through their design space exploration tools in cloud and edge-computing environments.