NLP Cookbook
ABSTRACT In recent years, Natural Language Processing (NLP) models have achieved phenomenal success in linguistic and semantic tasks like text classification, machine translation, cognitive dialogue systems, information retrieval via Natural Language Understanding (NLU), and Natural Language Generation (NLG). This feat is primarily attributed to the seminal Transformer architecture, leading to designs such as BERT, GPT (I, II, III), etc. Although these large models have achieved unprecedented performance, they come at high computational cost. Consequently, some recent NLP architectures have utilized concepts of transfer learning, pruning, quantization, and knowledge distillation to achieve moderate model sizes while keeping performance nearly on par with their predecessors. Additionally, to mitigate the data-size challenge raised by language models from a knowledge extraction perspective, Knowledge Retrievers have been built to extract explicit data documents from large corpora with greater efficiency and accuracy. Recent research has also focused on superior inference by providing efficient attention over longer input sequences. In this paper, we summarize and examine the current state-of-the-art (SOTA) NLP models that have been employed for numerous NLP tasks for optimal performance and efficiency. We provide a detailed understanding and functioning of the different architectures, a taxonomy of NLP designs, comparative evaluations, and future directions in NLP.
INDEX TERMS Deep Learning, Natural Language Processing (NLP), Natural Language Understanding (NLU), Natural
Language Generation (NLG), Information Retrieval (IR), Knowledge Distillation (KD), Pruning, Quantization
The above-mentioned architectures are primarily understanding models, where a natural dialect is mapped to a formal interpretation. Here the initial goal is the translation of an input user utterance into a conventional phrase representation. For Natural Language Understanding (NLU), the intermediate representation for the above models' end goal is dictated by the downstream tasks.
Meanwhile, fine-tuning was proving progressively more challenging for task-specific roles in NLU models, as it required a greater sample size to learn a particular task, which deprived such models of generalization [16]. This triggered the advent of Natural Language Generation (NLG) models that, contrary to NLU training, generate dialect utterances learned from their corresponding masked or corrupted input semantics. Such models operate differently from a routine downstream approach of cursory language comprehension and are optimal for sequence-to-sequence generation tasks, such as language translation. Models like T5 [17], BART [18], mBART [19], and T-NLG [20] were pre-trained on a large corpus of corrupted text and generated its corresponding cleaned text via a denoising objective [21]. This transition was useful as the additional fine-tuning layer required for NLU tasks was not needed for NLG purposes. It further enhanced prediction ability via zero- or few-shot learning, which enabled sequence generation with minimal or no fine-tuning. For instance, if a model's semantic embedding space is pre-trained with animal identification of "cat", "lion" and "chimpanzee", it could still correctly predict "dog" without fine-tuning. Despite superior sequence generation capabilities, NLG model sizes surged exponentially with the subsequent release of GPT-III [22], which was the largest model before the release of GShard [23].
Since NLU and NLG's exceptionally large models required several GPUs to load, they turned out costly and resource prohibitive in most practical situations. Further, when trained for several days or weeks on GPU clusters, these colossal models came at an exorbitant energy cost. To mitigate such computational costs [24], Knowledge Distillation (KD) [25] based models like DistilBERT [26], TinyBERT [27], and MobileBERT [28] were introduced at reduced inference cost and size. These smaller student models capitalized on the inductive bias of larger teacher models (BERT) to achieve faster training times. Similarly, pruning and quantization [29] techniques became popular for building economically sized models. Pruning can be classified into three categories: weight pruning, layer pruning, and head pruning, where minimally contributing weights, layers, and attention heads are removed from the model. Like pruning, training-aware quantization is performed to achieve less than 32-bit precision, thereby reducing model size.
For higher performance, greater learning was required, which resulted in larger data storage and model size. Due to the language model's enormity and implicit knowledge storage, its learning ability had caveats in terms of efficient information access. Current Knowledge Retrieval models like ORQA [30], REALM [31], RAG [32], and DPR [33] attempt to alleviate the implicit storage concerns of language models by providing external access to interpretable modular knowledge. This was achieved by supplementing the language model's pre-training with a 'knowledge retriever' that facilitated the model to effectively retrieve and attend over explicit target documents from a large corpus like Wikipedia.
Further, the Transformer model's inability to handle input sequences beyond a fixed token span inhibited it from comprehending large textual bodies holistically. This was particularly evident when related words were farther apart than the input length. Hence, to enhance contextual understanding, architectures like Transformer-XL [34], Longformer [35], ETC [36], and Big Bird [37] were introduced with modified attention mechanisms to process longer sequences.
Also, due to the surge in demand for NLP models to be economically viable and readily available on edge devices, innovative compressed models were launched based on generic techniques, apart from the Distillation, Pruning, and Quantization techniques described earlier. Such models deploy a wide range of computing optimization procedures ranging from hashing [38], sparse attention [39], factorized embedding parameterization [40], replaced token detection [41], and inter-layer parameter sharing [42], to combinations of the above.

II. RELATED REVIEWS/TAXONOMY
We propose a novel NLP-based taxonomy providing a unique classification of current NLP models from six different perspectives:
➢ NLU Models: NLU models excel in classification, structured prediction, and/or query generation tasks. This is accomplished through pre-training and fine-tuning motivated by the downstream task.
➢ NLG Models: Contrary to NLU models, these stand out in sequence-to-sequence generation tasks. They generate clean text via few- and single-shot learning from corresponding corrupted utterances.
➢ Model Size Reduction: These models use compression-based techniques like KD, Pruning, and Quantization to make large models economical and pragmatic. This is useful for the real-time deployment of large language models on edge devices.
➢ Information Retrieval (IR): Contextual open-domain question answering (QA) is reliant on effective and efficient document retrieval. Hence, IR systems, via superior lexical and semantic extraction of physical
documents from a large textual corpus, create SOTA in the QA domain on multiple benchmarks, outperforming contemporary language models.
➢ Long Sequence Models: Attention-based computational complexity in Transformers scales quadratically with input length, hence it is usually fixed to 512 tokens. This might be acceptable for co-reference resolution tasks that benefit from smaller input lengths [43]; however, it is inadequate for Question Answering (QA) tasks where reasoning is required across multiple lengthy documents, e.g., the HotpotQA dataset [44].
➢ Computationally Efficient Architectures: Memory-efficient architectures with comparable accuracies to large language models were built to reduce the high training time of such models.
The above-mentioned is a generalized categorization and not a hard classification; a few models might serve dual purposes and could be used interchangeably, however, there is a clear demarcation despite this insignificant universality. Figure 1 depicts this taxonomy, giving a visual breakdown of the significant models belonging to the different categories along with their launch years.

Figure 1: Taxonomy of NLP Architectures

y^{*} = \operatorname{argmax} P(y_t \mid y_1, y_2, y_3, \dots, y_{t-1})    (1)
P(y_t \mid y_1, y_2, y_3, \dots, y_{t-1}) = P(y_t \mid y_1^{t-1})    (2)

Such a system is empirically found to give superior results to vanilla RNNs, LSTMs [46], or GRUs [47] by implementing conditional probabilities of phrase pairs in machine translation, sequence-to-sequence mapping, or text summarization tasks.

Encoder:
H_{t+1} = \sigma(U_{t+1} \cdot X_{t+1} + W_{t+1} \cdot H_t + b_t)    (3)
E_{t+1} = \sigma(V_{t+1} \cdot H_{t+1} + b_t)    (4)
Decoder:
H'_{t+1} = \sigma(U'_{t+1} \cdot E_{t+1} + W'_{t+1} \cdot H'_t + b_{t+1})    (5)
O_{t+1} = \mathrm{Softmax}(H'_{t+1} \cdot V'_{t+1} + b_{t+1})    (6)

FIGURE 3. Attention Mechanism on Encoder-Decoder Model
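To make the recurrence in equations (3)-(6) concrete, the following is a minimal NumPy sketch of a single encoder-decoder step followed by the greedy selection of equation (1). The dimensions, random weights, and the use of the logistic sigmoid for σ are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_hid, vocab = 8, 16, 50                       # hypothetical sizes

# Encoder parameters (U, W, V, b) and decoder parameters (U', W', V', b')
U,  W,  V,  b  = (rng.normal(size=(d_hid, d_in)), rng.normal(size=(d_hid, d_hid)),
                  rng.normal(size=(d_hid, d_hid)), np.zeros(d_hid))
Ud, Wd, Vd, bd = (rng.normal(size=(d_hid, d_hid)), rng.normal(size=(d_hid, d_hid)),
                  rng.normal(size=(vocab, d_hid)), np.zeros(vocab))

x_t1, H_t, Hd_t = rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid)

H_t1  = sigmoid(U @ x_t1 + W @ H_t + b)      # eq (3): H_{t+1} = sigma(U.X_{t+1} + W.H_t + b)
E_t1  = sigmoid(V @ H_t1 + b)                # eq (4): E_{t+1} = sigma(V.H_{t+1} + b)
Hd_t1 = sigmoid(Ud @ E_t1 + Wd @ Hd_t + b)   # eq (5): H'_{t+1} = sigma(U'.E_{t+1} + W'.H'_t + b')
O_t1  = softmax(Vd @ Hd_t1 + bd)             # eq (6): distribution over the vocabulary
y_star = int(np.argmax(O_t1))                # eq (1): greedy choice of the next token
print(y_star, float(O_t1[y_star]))
```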
shown in (13). Likewise, in (14), with its order reversed, the backward language model forecasts the prior tokens given the future tokens.

p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1})    (13)
p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)    (14)

This is further implemented through a softmax on top of the final LSTM layer as shown in Figure 5.

to greater task agnosticism than the then-SOTA models like ELMo and ULMFiT [60], and succeeded in more sophisticated tasks like common-sense reasoning, semantic similarity, and reading comprehension. The pre-training of GPT-I can be modeled as a maximization of the likelihood of unsupervised tokens {u_i, . . . , u_n}:

L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)    (18)

where k is the context window size and the conditional probability is parametrized via Θ. With multi-headed attention and feed-forward layers, a target token-based probability distribution via softmax is produced.

h_l = \mathrm{transformer\_block}(h_{l-1}) \;\; \forall l \in [1, n]    (19)
h_0 = U W_e + W_p    (20)
P(u) = \mathrm{softmax}(h_n W_e^T)    (21)
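A compact NumPy sketch of the stack defined by equations (19)-(21) follows; the sizes are arbitrary, and the transformer_block here is only a stand-in (a random residual transformation) for the real multi-head attention plus feed-forward block.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
vocab, n_ctx, d_model, n_layers = 100, 16, 32, 2      # hypothetical sizes

W_e = rng.normal(0, 0.02, size=(vocab, d_model))      # token embedding matrix
W_p = rng.normal(0, 0.02, size=(n_ctx, d_model))      # learned positional embeddings
block_W = [rng.normal(0, 0.02, size=(d_model, d_model)) for _ in range(n_layers)]

def transformer_block(h, l):
    # stand-in for multi-head attention + feed-forward: a simple residual mixing layer
    return h + np.tanh(h @ block_W[l])

tokens = rng.integers(0, vocab, size=n_ctx)
U = np.eye(vocab)[tokens]                             # one-hot context matrix U
h = U @ W_e + W_p                                     # eq (20): h_0 = U W_e + W_p
for l in range(n_layers):
    h = transformer_block(h, l)                       # eq (19): h_l = transformer_block(h_{l-1})
P_u = softmax(h @ W_e.T)                              # eq (21): softmax with tied output embedding
print(P_u.shape)                                      # (n_ctx, vocab)
```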
GPT performs various tasks like classification, entailment, similarity scoring, and Multiple-Choice Questions (MCQ), as shown in figure 6. The extraction phase distills features from the textual bodies, before which the text is separated via the 'Delimiter' token during text pre-processing. This token is not required for classification tasks, since classification does not need to gauge the relationship between multiple sequences. Moreover, Q&A and textual entailment tasks involve defined inputs like ordered sentence pairs or triplets in a document. For MCQ tasks, contextual alterations are required at the input to achieve the correct results. This is done via a Transformer-based decoder training objective where input transformations are fine-tuned for their respective answers.
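The input transformations described above can be sketched as simple token-layout helpers; the special-token names below (<s>, <$>, <e>) are hypothetical placeholders and not the identifiers used by the original implementation.

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"    # hypothetical special tokens

def format_classification(text):
    # single sequence, no delimiter needed
    return [START, *text.split(), EXTRACT]

def format_entailment(premise, hypothesis):
    # ordered sentence pair separated by the delimiter token
    return [START, *premise.split(), DELIM, *hypothesis.split(), EXTRACT]

def format_mcq(context, question, answers):
    # one delimited sequence per candidate answer; each is scored independently
    return [[START, *context.split(), *question.split(), DELIM, *a.split(), EXTRACT]
            for a in answers]

print(format_entailment("A man inspects a uniform", "The man is sleeping"))
```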
IV-C BIDIRECTIONAL ENCODER REPRESENTATIONS FROM TRANSFORMER: BERT
BERT is a stack of pre-trained Transformer encoders that overcomes prior models' restrictive expressiveness, i.e., GPT's lack of bidirectional context and ELMo's shallow concatenation of dual contexts. BERT's deeper model provides a token with several contexts through its multiple layers, and the bi-directional model provides a richer learning environment. However, bi-directionality raises the concern that tokens could implicitly foresee future tokens during pre-training, resulting in minimal learning and leading to trivial predictions. To effectively train such a model, BERT implements Masked Language Modeling (MLM), which randomly masks 15% of all input tokens in each input sequence. Predicting these masked words is the new requirement, unlike recreating the entire output sequence in a unidirectional LM. BERT masks only during pre-training, hence the [MASK] token does not appear during fine-tuning, creating a mismatch, as the "masked" tokens are not replaced there. To overcome this disparity, subtle modeling modifications are performed during the pre-training phase. If a token T_i is chosen to be masked, then 80% of the time it is replaced with the [MASK] token, 10% of the time a random token is chosen, and for the remaining 10% it remains unchanged. Thereafter the cross-entropy loss on T_i predicts the original token; the unchanged-token step is employed to maintain a bias towards the correct prediction. This methodology creates a state of randomness and constant learning for the Transformer encoder, which is compelled to maintain a distributed contextual representation of each token. Further, as random replacement arises for a mere 1.5% of all tokens (10% of 15%), it does not seem to impair the language model's understanding ability.
Language modeling could not explicitly comprehend the association between multiple sequences; therefore it was deemed sub-optimal for inference and Q&A tasks. To overcome this, BERT was pre-trained with a monolingual corpus for a binarized Next Sentence Prediction (NSP) task. As shown in Figure 7, sentences Y (He came [MASK] from home) and Z (Earth [MASK] around Sun) do not form any continuity or relationship. Since Z is not the actual next sentence following Y, the output classification label [NotNext] gets activated, and [IsNext] activates when sequences are coherent.

FIGURE 7. The architecture of BERT's MLM and NSP functionality

IV-D GENERALIZED AUTOREGRESSIVE PRETRAINING FOR LANGUAGE UNDERSTANDING: XLNet
XLNet captures the best of both worlds, where it preserves the benefits of Auto-Regressive (AR) modeling and bidirectional contextual capture. To better comprehend why XLNet outperforms BERT, consider the 5-token sequence [San, Francisco, is, a, city]. The two tokens chosen for prediction are [San, Francisco], hence BERT and XLNet maximize log p(San Francisco | is a city) as follows:

\mathcal{L}_{BERT} = \log p(San \mid is\ a\ city) + \log p(Francisco \mid is\ a\ city)
\mathcal{L}_{XLNet} = \log p(San \mid is\ a\ city) + \log p(Francisco \mid San\ is\ a\ city)

The above can further be generalized for the target set (𝒯) and non-target token set (𝒩): BERT and XLNet will maximize log p(𝒯 | 𝒩) with the following different interpretations:

\mathcal{L}_{BERT} = \sum_{x \in \mathcal{T}} \log p(x \mid \mathcal{N})    (25)
\mathcal{L}_{XLNet} = \sum_{x \in \mathcal{T}} \log p(x \mid \mathcal{N} \cup \mathcal{T}_{<x})    (26)

XLNet considers the target as well as the remaining tokens for prediction, whereas BERT only considers the non-target tokens. Hence, XLNet captures the inter-pair dependency [San, Francisco], unlike BERT, where either [San] or [Francisco] alone leads to a correct prediction. Further, via AR, XLNet performs factorized ordering over all possible token permutations (L! = 5!) of sequence length L in the set, i.e., {[1, 2, 3, 4, 5], [1, 2, 5, 4, 3], . . ., [5, 4, 3, 2, 1]} ≅ [is, San, Francisco, a, city], etc.

\max_{\theta} \; E_{z \sim \mathcal{Z}_T} \left[ \sum_{t=1}^{T} \log p_{\theta}(x_{z_t} \mid x_{z<t}) \right]    (27)

where the set 𝒵_T contains all permutations of the length-T index sequence [1, 2, .., T] and x_{z_t} is the reference token. Hence the target learns from numerous combinations, attaining richer contextualized learning. Further, for all permutable factorization orders, the model parameters are shared to build knowledge and bidirectional context from all factorizations, as demonstrated via equation 27.
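The 80/10/10 corruption rule of Section IV-C is easy to express directly; the snippet below is a minimal, self-contained sketch (toy vocabulary, standard library only), not the original BERT data pipeline.

```python
import random

MASK = "[MASK]"
VOCAB = ["dog", "cat", "runs", "fast", "the", "a"]   # toy vocabulary for random replacement

def mlm_corrupt(tokens, mask_prob=0.15, seed=0):
    """Select ~15% of positions; of those, 80% -> [MASK], 10% -> random token,
    10% -> unchanged. Returns the corrupted tokens and the prediction targets."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        targets[i] = tok                       # loss is computed only on selected positions
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(VOCAB)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets

print(mlm_corrupt("the quick brown fox jumps over the lazy dog".split()))
```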
IV-D.1. Masking
There is a challenge in determining the word order in the sequence, as the token (x_{z_t}) determining the autoregression is not considered. This word order is partially achieved via positional encoding; however, for contextual understanding XLNet employs masking. Consider a generated permutation [2, 1, 3] of a 3-token sequence, where the first token in the order, i.e., 2, has no context; hence full masking results in [0,0,0] in the 2nd row of the 3×3 masking matrix. Similarly, the 2nd and 3rd masks result in [0,1,0] and [1,1,0] in the 1st and 3rd rows of the Query Stream (QS) masking matrix, where a token cannot see itself. The QS matrix with an all-one diagonal added constitutes the Content Stream (CS) masking matrix, where each token can see itself. This 3-token sequence masking is demonstrated in figure 8 below.

IV-D.2. Model Architecture
Figure 9 demonstrates the model's two-stream attention framework, which consists of a content and a query stream attention process to achieve greater understanding via contextualization. This process is initiated via a target-aware representation, where the target position is baked into the input for subsequent token generation purposes.

(i) Target-Aware Representation: A vanilla implementation of Transformer-based parametrization does not suffice for complex permutation-based language modeling. This is because the next-token distribution p_θ(X_{Z_t} | x_{z<t}) is independent of the target position, i.e., Z_t. Consequently, a redundant distribution is generated which is unable to discover effective representations; hence a target position-aware re-parametrization of the next-token distribution is proposed as follows:

p_{\theta}(X_{Z_t} = x \mid x_{z<t}) = \frac{\exp(e(x)^T h_{\theta}(x_{z<t}))}{\sum_{x'} \exp(e(x')^T h_{\theta}(x_{z<t}))}    (28)
p_{\theta}(X_{Z_t} = x \mid x_{z<t}) = \frac{\exp(e(x)^T g_{\theta}(x_{z<t}, Z_t))}{\sum_{x'} \exp(e(x')^T g_{\theta}(x_{z<t}, Z_t))}    (29)

where g_θ(x_{z<t}, Z_t) is a modified representation that additionally considers the target position Z_t as an input.

(ii) Two-Stream Self-Attention: The formulation of g_θ remains a challenge despite the above resolution, as the goal is to rely on the target position Z_t to gather contextual information x_{z<t} via attention; hence: (1) to predict x_{Z_t}, g_θ should utilize only the position Z_t, not the content x_{Z_t}, to incorporate greater learning; (2) to predict the other tokens x_{Z_j} with j > t, g_θ should encode the content x_{Z_t} to provide full contextual understanding.
To resolve this conflict, the authors propose two sets of hidden representations instead, as follows:
❖ The hidden content representation h_θ(x_{z≤t}) ≅ h_{Z_t}, which encodes both the context and the content x_{Z_t}
❖ The query representation g_θ(x_{z<t}, Z_t) ≅ g_{Z_t}, which solely accesses the contextual information x_{z<t} and the position Z_t, without the content x_{Z_t}
The above two attention streams are parametrically shared and updated for every self-attention layer m as:

Attention(Q = h_{Z_t}^{(m-1)}, KV = h_{Z \le t}^{(m-1)}; \theta) \rightarrow h_{Z_t}^{(m)}   (Content Stream: utilizes both Z_t and x_{Z_t})
Attention(Q = g_{Z_t}^{(m-1)}, KV = h_{Z < t}^{(m-1)}; \theta) \rightarrow g_{Z_t}^{(m)}   (Query Stream: uses Z_t without seeing x_{Z_t})

This dual attention is pictorially expressed in figure 9. For simplicity, consider the prediction of token t_i, which is not allowed to access its corresponding embedding from the preceding layer. However, to predict t_{i+1}, the token t_i needs to access its embedding, and both operations must occur in a single pass. Therefore, two hidden representations are implemented, where h_{Z_t}^{(m)} is initialized via token embeddings and g_{Z_t}^{(m)} via weighted transformations. From the above equations, h_{Z_t}^{(m)} can access the history including the current position, whereas g_{Z_t}^{(m)} can access only the preceding positions' hidden states. The token prediction happens in the final layer via g_{Z_t}^{(m)}.
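The Query-Stream and Content-Stream masks of Section IV-D.1 can be rebuilt mechanically from a factorization order; the following sketch reproduces the [2, 1, 3] example (the function name and 1-indexing are choices made here for illustration).

```python
import numpy as np

def xlnet_masks(perm):
    """Build Query-Stream (QS) and Content-Stream (CS) masks for a factorization
    order `perm` (1-indexed token positions). mask[i, j] = 1 means token i+1 may
    attend to token j+1."""
    n = len(perm)
    order = {tok: rank for rank, tok in enumerate(perm)}   # position -> place in the order
    qs = np.zeros((n, n), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if order[j] < order[i]:        # j precedes i in the factorization order
                qs[i - 1, j - 1] = 1       # query stream: context only, token cannot see itself
    cs = qs | np.eye(n, dtype=int)         # content stream: additionally sees its own content
    return qs, cs

qs, cs = xlnet_masks([2, 1, 3])
print(qs)   # rows: token 1 -> [0 1 0], token 2 -> [0 0 0], token 3 -> [1 1 0]
print(cs)   # same with an all-one diagonal
```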
For greater sequence-length processing, the memory blocks are derived from Transformer-XL, which can process longer input sequences than the standard Transformer. The hidden representations mentioned above are also stored in the memory blocks.

IV-E A ROBUSTLY OPTIMIZED BERT PRETRAINING APPROACH: RoBERTa
This paper claimed that BERT was considerably undertrained; as a result, RoBERTa incorporated a more training-intensive regime so that BERT-based models could match or exceed the prior methods. Its revisions include: (i) longer training duration with greater data and batch sizes, (ii) eliminating BERT's NSP objective, (iii) longer sequence training, and (iv) dynamically modifying the training data's masking pattern. The authors claim superior performance over BERT on downstream tasks with the more diverse and voluminous CC-News dataset.
Further, BERT implements an inefficient static masking scheme to avoid redundant masks. For instance, the training data is duplicated 10 times so that a sequence is masked in 10 different ways over 40 training epochs, with each training sequence seen with the same mask 4 times. RoBERTa provides slightly enhanced results by incorporating dynamic masking, where a masking pattern is generated each time the model is fed a sequence while pre-training on larger datasets. Recent work has questioned the role of BERT's NSP [61], which was conjectured to play a key role in its performance on language inference and Q&A tasks. RoBERTa amalgamates both hypotheses and provides numerous supplementary training formats that perform like BERT and outperform it for full-sentence training excluding the NSP loss. RoBERTa provides similar and marginally better results than BERT on the GLUE benchmark, as well as on the RACE and SQuAD datasets, without fine-tuning for multiple tasks.

IV-E MEGATRON LANGUAGE MODEL (LM)
Megatron was the largest model when released, with a size of 24 × BERT and 5.6 × GPT-2, and could not fit on a single GPU. Hence the key engineering implementation was the induction of its 8- and 64-way model- and data-parallelized version, where parameters were split across (~512) GPUs. It sustained high performance (15.1 Petaflops) and scaling efficiency (76%), whereas BERT showed performance degradation with size growth. This feat was primarily attributed to layer normalization and residual connection re-ordering within the transformer layers, which led to monotonically superior performance on downstream tasks with increased model size.
Megatron overcomes the prior models' memory constraint by splitting the model across several accelerators. This not only resolves the memory usage but enhances model parallelism irrespective of batch size. It incorporates distributed tensor computations to scale up model size or acceleration, and it parallelizes the attention head computation. This does not require a new compiler or code re-write and is implementable with a few parameters. First, the Multi-Layer Perceptron (MLP) block partitions the first GEMM in parallel along its columns, enabling the GeLU nonlinearity to be applied independently to each partitioned GEMM. This GeLU output is fed directly to the row-wise parallelized GEMM, whose output is reduced via a single all-reduce operator (g and f) in the forward and backward passes before being passed to the dropout layer.

FIGURE 10. Parallelized Megatron's MLP and Self-Attention blocks

Parallelism in the self-attention block is achieved by partitioning the GEMMs column-wise for each key, query, and value set. Hence, the workload is split across all GPUs, as the matrix multiplication for each attention head is performed on a single GPU. The resulting GEMM output, like the MLP, undergoes an all-reduce operation and is parallelized across rows, as shown above in figure 10. This technique eliminates the need for synchronization between the GEMMs of the MLP and attention blocks.

V. NLG ARCHITECTURES
In NLU models, the sheer amount of data and compute required for learning numerous post-pre-training 'fine-tuned' tasks is parametrically inefficient, as an entirely new model is required for every task. These models can be characterized as narrow experts rather than proficient generalists. Therefore, NLG models provide a transition towards building generic systems that accomplish several tasks without the necessity to manually create and label a training dataset for each task. Moreover, MLM in NLU models is unable to capture a rich relationship between multiple sequences. Further, the most effective NLU models derive their methodologies from MLM variants, which are denoising autoencoders trained on text reconstruction where a random subset of words is masked out. Consequently, NLG models have in the last few years made tremendous progress on tasks like text translation and summarization, Q&A, NLI, conversational engagement, and picture description, with unprecedented accuracies.
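As a small numerical check of the Megatron MLP sharding described in Section IV-E: splitting the first GEMM column-wise and the second row-wise, with GeLU applied per shard and a final all-reduce (sum), reproduces the serial result exactly. The two-way split below is only illustrative; Megatron shards across many GPUs.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 8))          # token activations
A = rng.normal(size=(8, 32))         # first MLP GEMM
B = rng.normal(size=(32, 8))         # second MLP GEMM

Y_ref = gelu(X @ A) @ B              # serial reference: Y = GeLU(X A) B

A1, A2 = np.split(A, 2, axis=1)      # each "GPU" holds one column shard of A
B1, B2 = np.split(B, 2, axis=0)      # ... and the matching row shard of B

Y1 = gelu(X @ A1) @ B1               # GPU 1: GeLU applies independently to its shard
Y2 = gelu(X @ A2) @ B2               # GPU 2
Y_parallel = Y1 + Y2                 # the all-reduce sums the partial outputs

print(np.allclose(Y_ref, Y_parallel))   # True: no synchronization needed inside the block
```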
V-A LANGUAGE MODELS ARE UNSUPERVISED MULTI-TASK LEARNERS: GPT-II
GPT-II [62] was possibly the first model that ushered in the rise of NLG models. It was trained in an unsupervised manner and is capable of learning complex tasks including machine translation, reading comprehension, and summarization without explicit fine-tuning. Task-specific training on a corresponding dataset was the core reason behind the generalization deficiency witnessed in prior models; hence robust models would likely require training and performance gauges on a variety of task domains. GPT-II incorporates a generic probabilistic model where numerous tasks can be performed for the same input as p(output | input, task). The training and test set performance improves as model size is scaled up, and as a result it underfits on the huge WebText dataset. The 1.5-billion-parameter GPT-2 outperformed its predecessors on most datasets for the previously mentioned tasks in a zero-shot environment. It is an extension of the GPT-I decoder-only architecture trained on significantly more data.

V-B BIDIRECTIONAL AND AUTOREGRESSIVE TRANSFORMERS: BART
A denoising autoencoder, BART is a sequence-to-sequence [63] model that incorporates two-stage pre-training: (1) corruption of the original text via a random noising function, and (2) recreation of the text via training the model. Noising flexibility is the major benefit of the model, where random transformations not limited to length alterations are applied to the original text. Two such noising variations that stand out are random order shuffling of the original sentences and an infilling scheme where text spans of any length are randomly replaced by a single masked token. BART deploys all possible document corruption schemes, as shown below in figure 11, where in the severest circumstance all source information is lost and BART behaves like a language model.

FIGURE 11. Denoised BART Model and its Noising Schemes

This forces the model to develop greater reasoning across the overall sequence length, enabling greater input transformations, which results in superior generalization compared to BERT. BART is pre-trained by optimizing a reconstruction loss on corrupted input documents, i.e., the cross-entropy between the decoder's output and the original document. For machine translation tasks, BART's encoder embedding layer is replaced with an arbitrarily initialized encoder that is trained end-to-end with the pre-trained model, as shown in Figure 12. This encoder maps its foreign vocabulary to BART's input, which is denoised to its target language, English. The source encoder is trained in two stages, both of which share the backpropagation of the cross-entropy loss from BART's output. Firstly, most BART parameters are frozen, and only the arbitrarily initialized encoder, BART's positional embeddings, and its encoder's self-attention input projection matrix are updated. Secondly, all model parameters are jointly trained for a few iterations. BART achieves state-of-the-art performance on several text generation tasks, fueling further exploration of NLG models. It achieves comparable results on discriminative tasks when compared with RoBERTa.

FIGURE 12. Denoised BART Model for fine-tuned MT tasks

V-C MULTILINGUAL DENOISING PRE-TRAINING FOR NEURAL MACHINE TRANSLATION: mBART
V-C.1. Supervised Machine Translation
mBART demonstrates that considerable performance gains are achieved over prior techniques [64], [65] by autoregressively pre-training BART with a sequence-reconstruction denoising objective across 25 languages from the Common Crawl (CC-25) corpus [66]. mBART's parametric fine-tuning can be supervised or unsupervised, for any linguistic pair, without task-specific revision. For instance, fine-tuning on one language pair, e.g. (German-English), enables the model to translate from any language in the monolingual pre-training set, e.g. (French-English), without further training. Since each language contains significantly varying numbers of tokens, the corpus is balanced via textual up/down-sampling from each language i with the ratio λ_i:

\lambda_i = \frac{1}{p_i} \cdot \frac{p_i^{\alpha}}{\sum_i p_i^{\alpha}}    (30)

where p_i is each language's percentage in the dataset, with a smoothing parameter α = 0.7. The training data encompasses K languages: 𝒞 = {𝒞_1, ..., 𝒞_K}, where each 𝒞_i is the i-th language's monolingual document collection. Consider a text-corrupting noising function g(X), where the model is trained to predict the original text X; hence the objective ℒ_θ is maximized as:

\mathcal{L}_{\theta} = \sum_{\mathcal{C}_i \in \mathcal{C}} \sum_{X \in \mathcal{C}_i} \log P(X \mid g(X); \theta)    (31)

where language i has an instance X and the above distribution P is defined via a sequence-to-sequence model.

V-C.2. Unsupervised Machine Translation
mBART is evaluated on tasks where target bi-text or text pairs are not available, in these 3 different formats:
❖ No bi-text of any kind is available; here back-translation (BT) [67],[68] is a familiar solution. mBART
offers a clean and effective initialization scheme for such techniques.
❖ The bi-text for the target pair is unavailable; however, the pair is available in the target language's bi-text corpora for other language pairs.
❖ Bi-text is not available for the target pair; however, it is available for translation from a different language to the target language. This novel evaluation scheme demonstrates mBART's transfer learning capability despite the absence of the source language's bi-text.
mBART is pre-trained for all 25 languages and fine-tuned for the target language as shown in figure 13.

Clean Crawled Corpus (C4) was developed, twice as large as Wikipedia.
The authors concluded that causal masking limits the model's capability to attend only up to the i-th input entry of a sequence, which turns detrimental. Hence T5 incorporates fully visible masking during the sequence's prefix section (prefix LM), whereas causal masking is incorporated for training the target's prediction. The following conclusions were made after surveying the current transfer learning landscape:
❖ Model Configuration: Normally, models with Encoder-Decoder architectures outperformed decoder-based language models.
❖ Pre-Training Goals: Denoising worked best for fill-in-the-blank roles where the model is pre-trained to retrieve missing input words at an acceptable computational cost.
❖ In-Domain Datasets: In-domain data training turns out to be effective; however, pre-training on small datasets generally leads to overfitting.
❖ Training Approaches: A pre-train, fine-tune methodology for multi-task learning could be effective; however, each task's training frequency needs to be monitored.
❖ Scaling Economically: To efficiently use the finite computing resources, an evaluation among model size scaling, training time, and the number of ensembled models is performed.

V-E TURING NATURAL LANGUAGE GENERATION: T-NLG
T-NLG is a 78-layer Transformer-based generative language model that outsizes T5 with its 17 billion trainable parameters. It possesses greater speedup than Nvidia's Megatron, which was based on interconnecting multiple machines via low-latency buses. T-NLG is a progressively larger model, pre-trained with a greater variety and quantity of data. It provides superior results on generalized downstream tasks with fewer fine-tuning samples. Hence, its authors conceptualized training a huge centralized multi-task model with its resources shared across various tasks, rather than allocating one model per task. Consequently, the model effectively performs question answering without prior context, leading to enhanced zero-shot learning. The Zero Redundancy Optimizer (ZeRO) achieves both model and data parallelism concurrently, which perhaps is the primary reason T-NLG can be trained with high throughput.
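The causal versus prefix-LM masking contrast from the T5 discussion above can be visualized with two small mask builders; this is an illustrative NumPy sketch (0/1 masks), not T5's actual implementation.

```python
import numpy as np

def causal_mask(n):
    """Standard LM masking: position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=int))

def prefix_lm_mask(n, prefix_len):
    """Prefix-LM masking: fully visible attention inside the prefix (conditioning
    section), causal masking over the remaining target section."""
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = 1     # prefix tokens see each other bidirectionally
    return mask

print(causal_mask(6))
print(prefix_lm_mask(6, prefix_len=3))
```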
\mathcal{L}_s(y, p(z_s, T)) = \sum_i -y_i \log(p_i(z_{s_i}, T))    (38)

knowledge transfer process by inducting 3 loss functions: (i) the Embedding Layer Output, (ii) the Attention Matrices and Hidden States from the Transformer, and (iii) the Output Logits. This not only led TinyBERT to retain over 96% of BERT's performance at a drastically reduced size, but it also deployed a meager 28% of the parameters and 31% of the inference time across all BERT-based distillation models. Further, it leveraged the untapped extractable potential of BERT's learned attention weights [70]; for the (M+1)-th layer, the knowledge acquired is enhanced by minimizing the corresponding layer-wise loss.
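Equation (38) is the usual temperature-softened distillation cross-entropy; a minimal sketch follows, taking y as the teacher's softened distribution (a common choice; the original target definition may differ).

```python
import numpy as np

def softened_probs(logits, T):
    """Temperature-scaled softmax: p_i(z_i, T) = exp(z_i/T) / sum_j exp(z_j/T)."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Soft cross-entropy of eq. (38): L = -sum_i y_i log p_i(z_s_i, T)."""
    y = softened_probs(teacher_logits, T)      # soft targets from the teacher
    p = softened_probs(student_logits, T)      # student's softened prediction
    return float(-(y * np.log(p + 1e-12)).sum())

print(distillation_loss([4.0, 1.0, 0.2], [3.5, 1.2, 0.1]))
```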
replacement of Layer Normalization and the gelu activation with a simpler Hadamard product (∘) based linear transformation:

NoNorm(h) = \gamma \circ h + \beta, \quad where\ \gamma, \beta \in \mathbb{R}^n    (45)

For knowledge transfer, the mean squared error between the feature maps of MobileBERT and IB-BERT is implemented as a transfer objective:

\mathcal{L}_{FMT}^{l} = \frac{1}{TN} \sum_{t=1}^{T} \sum_{n=1}^{N} (H_{t,l,n}^{tr} - H_{t,l,n}^{st})^2    (46)

where l is the layer index, T is the sequence length, and N is the feature map size. For TinyBERT to harness the attention capability from BERT, the KL-divergence is minimized between the per-head distributions of the two models, where A denotes the number of attention heads:

\mathcal{L}_{AT}^{l} = \frac{1}{TA} \sum_{t=1}^{T} \sum_{a=1}^{A} D_{KL}(a_{t,l,a}^{tr} \| a_{t,l,a}^{st})    (47)

Alternatively, a new KD loss can be implemented during MobileBERT's pre-training as a linear combination with BERT's MLM and NSP losses, where α is a hyperparameter in (0,1):

\mathcal{L}_{PD} = \alpha \mathcal{L}_{MLM} + (1 - \alpha) \mathcal{L}_{KD} + \mathcal{L}_{NSP}    (48)

For the above-outlined objectives, 3 training strategies are proposed:
(i) Auxiliary Knowledge Transfer: Intermediary transfer via a linear combination of all layer transfer losses and the distilled pre-training loss.
(ii) Joint Knowledge Transfer: For superior results, 2 separate losses are proposed, where MobileBERT is first trained with all layer transfer losses jointly and then performs pre-training distillation.
(iii) Progressive Knowledge Transfer: To minimize error propagation from lower to higher layers, it is proposed to divide the knowledge transfer of the L layers into L stages, where each layer is trained progressively.

VI-B PRUNING
Pruning [71] is a methodology where certain weights, biases, layers, and activations are zeroed out so that they are no longer part of the model's backpropagation. This introduces sparsity in such elements, which is visible after the ReLU layer that converts negative values to zero (ReLU(x) = max(0, x)). Iterative pruning learns the key weights, eliminates the least critical ones based on threshold values, and retrains the model, enabling it to recuperate from pruning by adapting to the remaining weights. NLP models like BERT, RoBERTa, and XLNet have been pruned by 40% while retaining 98% of their performance, which is comparable to DistilBERT.

require re-training a new model from scratch, as opposed to training one network from which multiple shallow models are extracted. This sub-network sampling, like Dropout [73] and DropConnect [74], builds an efficient pruning-robust network if a smartly chosen simultaneous group of weights is dropped. Formally, pruning robustness in regularizing networks can be achieved by independently dropping each weight via a Bernoulli distribution, where a parameter p > 0 regulates the drop rate. This is comparable to the pointwise product of the weight matrix W with an arbitrarily sampled {0, 1} mask matrix M, W_d = M ⊙ W.
The most effective layer-dropping strategy is to drop every other layer, with pruning rate p and dropping layers at depth d such that d ≡ 0 (mod ⌊1/p⌋). For N groups with a fixed drop ratio p, the average number of groups utilized during training of the network is N(1 − p); hence, for a pruned size of r groups, the ideal drop rate is p* = 1 − r/N. This approach has been highly effective on numerous NLP tasks and has led to models of size comparable to the distilled versions of BERT that demonstrate better performance.

VI-B.1-B POOR MAN'S BERT
Due to the over-parameterization of deep neural networks, the availability of all parameters is not required at inference time; hence a few layers can be strategically dropped, resulting in competitive results on downstream tasks [75]. The odd-alternate dropping strategy gave superior results compared to top and even-alternate dropping for span K = 2 across all tasks. For instance, in a 12-layer network, dropping: top – {11, 12}; even-alternate – {10, 12}; odd-alternate – {9, 11}, led to the conclusions that (i) dropping the final two layers consecutively is more detrimental than eliminating alternate layers, and (ii) preserving the final layer has greater significance than the other top layers.
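The dropping strategies compared in Poor Man's BERT can be enumerated directly; the helper below reproduces the 12-layer example above (the function and strategy names are chosen here for illustration).

```python
def kept_layers(n_layers=12, k=2, strategy="odd_alternate"):
    """Return the encoder layers retained (1-indexed) after dropping k layers,
    mirroring the strategies compared for a 12-layer model:
    top {11, 12}, even-alternate {10, 12}, odd-alternate {9, 11}."""
    layers = list(range(1, n_layers + 1))
    if strategy == "top":
        dropped = layers[-k:]                                  # last k layers
    elif strategy == "even_alternate":
        dropped = [l for l in layers if l % 2 == 0][-k:]       # top-most even layers
    elif strategy == "odd_alternate":
        dropped = [l for l in layers if l % 2 == 1][-k:]       # top-most odd layers
    else:
        raise ValueError(strategy)
    return [l for l in layers if l not in dropped]

for s in ("top", "even_alternate", "odd_alternate"):
    print(s, kept_layers(strategy=s))
```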
pruning. The weight matrices were factorized into a product of two smaller matrices with a diagonal mask that is pruned during training via an l_0 regularizer that controls the end sparsity of the model. This generic method, FLOP (Factorized L0 Pruning), can be employed for any matrix multiplication. Consider a neural network f(·; θ) parameterized by θ = {θ_j}_{j=1}^{n}, where each θ_j represents an individual weight or a block of weights (e.g., a column of a matrix) and n denotes the number of blocks. Consider a binarized pruning variable z = {z_j}_{j=1}^{n} with z_j ∈ {0, 1}; then θ̃ = {θ̃_j} denotes the model parameter set after pruning via l_0 normalization:

\tilde{\theta} = \theta \odot z, \quad \forall j \;\; \tilde{\theta}_j = \theta_j z_j    (49)

Consider a matrix W factorized into a product of two smaller matrices P and Q, where W = PQ and r is the number of columns of P (equivalently, rows of Q). Structured pruning of each component is attained via a pruning variable z_k:

W = PGQ = \sum_{k=1}^{r} z_k \times (p_k \times q_k)    (50)

L_0(g_1, \dots, g_h) = \sum_{i=1}^{h} (1 - [[g_i = 0]])    (51)

However, the L_0 norm is non-differentiable; hence it cannot be inducted as a regularization term in the objective function. Therefore, a stochastic relaxation is applied where each gate g_i is randomly drawn from a head distribution obtained by stretching (0,1) to (−ε, 1+ε) and collapsing the probability mass of (−ε, 0] and [1, 1+ε) onto the singular points 0 and 1. This rectified stretching results in a distribution over [0,1] that is mixed discrete-continuous. The sum of the probabilities of the heads being non-zero can then be used as a relaxed L0 norm:

\mathcal{L}_C(\phi) = \sum_{i=1}^{h} (1 - P(g_i = 0 \mid \phi_i))    (52)

The modified training regime can be expressed as ℒ(θ, φ) = ℒ_xent(θ, φ) + λ ℒ_C(φ), where θ are the original Transformer's parameters, ℒ_xent(θ, φ) is the translation model's cross-entropy loss and ℒ_C(φ) is the regularizer.
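A small NumPy sketch of the factorization in equations (49)-(50): a diagonal gate G = diag(z) between the factors P and Q zeroes entire rank-1 components, and the zeroed components can then be removed outright. The sizes and the hard 0/1 gates are simplifying assumptions; FLOP learns relaxed gates via the L0 regularizer.

```python
import numpy as np

rng = np.random.default_rng(3)
d_out, d_in, r = 6, 5, 4                       # W is d_out x d_in, factorized with rank r
P = rng.normal(size=(d_out, r))                # columns p_k
Q = rng.normal(size=(r, d_in))                 # rows q_k
z = np.array([1, 0, 1, 0])                     # binary gates (learned via the relaxed L0 penalty)

# W = P G Q = sum_k z_k * (p_k outer q_k)       -- eq (50), with G = diag(z)
W_masked = P @ np.diag(z) @ Q
W_sum = sum(z[k] * np.outer(P[:, k], Q[k, :]) for k in range(r))
print(np.allclose(W_masked, W_sum))            # True

# After training, zeroed components can be physically removed (structured sparsity):
keep = z.astype(bool)
W_pruned = P[:, keep] @ Q[keep, :]
print(np.allclose(W_masked, W_pruned))         # True: smaller factors, same product
```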
VI-B.3-B ARE 16 HEADS REALLY BETTER THAN ONE?
In multi-headed attention (MHA), consider a sequence of n d-dimensional vectors x = x_1, .., x_n ∈ ℝ^d and a query vector q ∈ ℝ^d. The MHA layer parameters are W_q^h, W_k^h, W_v^h ∈ ℝ^{d_h×d} and W_o^h ∈ ℝ^{d×d_h}, with d_h = d. For masking attention heads, the original Transformer equation is modified as:

MHAttn(x, q) = \sum_{h=1}^{N_h} \xi_h \, Att_{W_q^h, W_k^h, W_v^h, W_o^h}(x, q)    (53)

where ξ_h are masking variables with values in {0,1} and Att_h(x) is the output of head h for input x. The following experiments yielded the best results [83] when pruning different numbers of heads at test time:

(i) Pruning just one head: If the model's performance significantly degrades while masking head h, then h is a key head; otherwise it is redundant given the rest of the model. A mere 8 (out of 96) heads trigger a significant change in performance when removed from the model, of which half result in a higher BLEU score.
(ii) Pruning all heads except one: A single head for most layers was deemed sufficient at test time, even for networks with 12 or 16 attention heads, resulting in a drastic parameter reduction. However, multiple attention heads are a requirement for specific layers, e.g., the final layer of the encoder-decoder attention, where performance degrades by a massive 13.5 BLEU points with a single head.

The expected sensitivity of the model to the masking variables ξ is evaluated as a proxy score for head significance:

I_h = E_{x \sim X} \left| \frac{\partial \mathcal{L}(x)}{\partial \xi_h} \right|    (54)
I_h = E_{x \sim X} \left| Att_h(x)^T \frac{\partial \mathcal{L}(x)}{\partial Att_h(x)} \right|    (55)

where X is the data distribution and ℒ(x) is the loss on sample x. If I_h is high, then modifying ξ_h will likely have a significant effect on the model; hence low-I_h heads are iteratively pruned out.

VI-C QUANTIZATION
32-bit floating point (FP32) has been the predominant numerical format for deep learning; however, the current surge in demand for reduced bandwidth and compute resources has propelled the implementation of lower-precision formats. It has been demonstrated that weight and activation representations via 8-bit integers (INT8) do not lead to an evident accuracy loss. For instance, BERT's quantization to 16/8-bit weight formats resulted in 4× model compression with minimal accuracy loss; consequently, a scaled-up BERT serves a billion CPU requests daily.

VI-C.1 LQ-NETS
This model [84] introduces a simple-to-train quantization mechanism for network weights and activations via joint training of a deep neural network. It quantizes with variable bit-precision capabilities, unlike fixed or manual schemes [85],[86]. Generally, a linear quantization function can represent floating-point weights w and activations a in a few bits as:

Q(x) = q_l, \; if\ x \in (t_l, t_{l+1}], \; where\ l = 1, \dots, L    (56)

Here q_l and (t_l, t_{l+1}] are the quantization levels and intervals, respectively. To preserve quick inference times, quantization functions need to be compatible with bitwise operations, which is achieved via a uniform mapping of floating-point numbers to their nearest fixed-point integers with a normalization factor. The LQ learnable quantization function can be expressed as:

Q_{LQ}(x, v) = v^T e_l, \; if\ x \in (t_l, t_{l+1}]    (57)

where v ∈ ℝ^K is the learnable floating-point basis and e_l ∈ {−1, 1}^K, for l = 1, .., L, enumerates the K-bit binary encodings from [−1, .., −1] to [1, .., 1]. The inner product of quantized weights and activations is computed by the following bitwise operations, with weight bit-width K_w and activation bit-width K_a:

Q_{LQ}(w, v^w)^T Q_{LQ}(a, v^a) = \sum_{i=1}^{K_w} \sum_{j=1}^{K_a} v_i^w v_j^a (b_i^w \odot b_j^a)    (58)

where w, a ∈ ℝ^N are encoded by the vectors b_i^w, b_j^a ∈ {−1, 1}^N with i = 1, ..., K_w and j = 1, ..., K_a, v^w ∈ ℝ^{K_w}, v^a ∈ ℝ^{K_a}, and ⊙ denotes the bitwise inner product (xnor) operation.

VI-C.2 QBERT
QBERT [87] deploys a two-way BERT quantization with input x ∈ X and its corresponding label y ∈ Y, via a cross-entropy based loss function:

L(\theta) = \sum_{(x_i, y_i)} CE(\mathrm{softmax}(W_c(W_n(\dots W_1(W_e(x_i))))), y_i)    (59)

where W_e is the embedding table, W_1, W_2, …, W_n are the encoder layers, and W_c is the classifier. Assigning the same bit-size representation to different encoder layers with varying sensitivity, attending to different structures [5], is sub-optimal, and it gets intricate for small target sizes (2/4 bits) requiring ultra-low precision. Hence, via Hessian Aware Quantization (HAWQ), more bits are assigned to the more sensitive layers to retain performance. The Hessian matrix is computed via a computationally economical matrix-free iteration technique, where the first encoder layer's gradient g_1 for an arbitrary vector v gives:

\frac{\partial g_1^T v}{\partial W_1} = \frac{\partial g_1^T}{\partial W_1} v + g_1^T \frac{\partial v}{\partial W_1} = \frac{\partial g_1^T}{\partial W_1} v = H_1 v    (60)

where H_1 is the Hessian matrix of the first encoder layer and v is independent of W_1. This approach determines the top eigenvalues for the different layers, and more aggressive quantization is deployed for layers with smaller eigenvalues. For further optimization via group-wise quantization, each dense matrix is treated as a group with its own quantization range and is partitioned following each continuous output neuron.

VI-C.3 Q8BERT
To quantize weights and activations to 8 bits, symmetric linear quantization is implemented [88], where S^x is the
quantized scaling factor for input x and M = 2^{b−1} − 1 is the highest quantized value when quantizing to b bits:

Quantize(x \mid S^x, M) := Clamp(\lfloor x \times S^x \rceil, -M, M)    (61)
Clamp(x, a, b) = \min(\max(x, a), b)

Implementing a combination of fake quantization [89] and the Straight-Through Estimator (STE) [90], inference-time quantization is emulated during training with full-precision backpropagation, enabling the FP32 weights to overcome the quantization errors. Here ∂x^q/∂x = 1⃗, where x^q is the result of fake-quantizing x.

VII. INFORMATION RETRIEVAL
For knowledge-intensive tasks like efficient data updating and retrieval, huge implicit knowledge storage is required. Standard language models are not adept at these tasks and do not match up with task-specific architectures, which can be crucial for open-domain Q&A. For instance, BERT can predict the missing word in the sentence, "The __ is the currency of the US" (answer: "dollar"). However, since this knowledge is stored implicitly in its parameters, the model size substantially increases to store further data. This constraint raises the network latency, and it turns out prohibitively expensive to store information, as storage space is limited by the size constraints of the network.

VII-A GOLDEN RETRIEVER
A conventional multi-hop open-domain QA task involves a question q and, from a large corpus, S relevant contextual (gold) documents d_1, .., d_S that form a sequence of reasoning via textual similarities leading to a preferred answer a. However, GoldEn Retriever's [91] first hop generates a search query q_1 that retrieves document d for a given question q; thereafter, for the consequent reasoning steps (k = 2, .., S), a query q_k is generated from the question (q) and the available context (d_1, .., d_{k−1}). GoldEn retrieves more contextual documents iteratively while concatenating the retrieved context for its QA model to answer. It is independent of the dataset- and task-specific IR models, where indexing of additional documents or question types leads to inefficiencies. A lightweight RNN model is adapted in which text spans are extracted from the contextual data to potentially reduce the large query space. The goal is to generate a search query q_k that helps retrieve d_k for the following reasoning step, where q_k is based on a textual span selected from the context C_k and question q via a trained document reader.

q_k = G_k(q, C_k)    (62),    C_{k+1} = C_k \; concat \; IR_n(q_k)    (63)

where G_k is the query generator and IR_n(q_k) are the top-n documents retrieved via q_k.

VII-B ORQA
The reader and retriever components are trained jointly in an end-to-end fashion, where BERT is implemented for parameter scoring. It can retrieve any text from an open corpus and is not constrained to returning a fixed set of documents like a typical IR model. The retrieval score is computed as the dense inner product of the question q with the evidence block b:

h_q = W_q \, BERT_Q(q)[CLS]    (64)
h_b = W_b \, BERT_B(b)[CLS]    (65),    S_{retr}(b, q) = h_q^T h_b    (66)

where the matrices W_q and W_b project the BERT output into 128-dimensional vectors. Similarly, the reader is BERT's span variant of the reading model:

h_{start} = BERT_R(q, b)[START(s)]    (67)
h_{end} = BERT_R(q, b)[END(s)]    (68)
S_{read}(b, s, q) = MLP([h_{start}; h_{end}])    (69)

The retrieval model is pre-trained with an Inverse Cloze Task (ICT), where the sentence context is semantically relevant and is used to extrapolate the data missing from the sequence q:

P_{ICT}(b \mid q) = \frac{\exp(S_{retr}(b, q))}{\sum_{b' \in BATCH} \exp(S_{retr}(b', q))}    (70)

where q is treated as a pseudo-question, b is the text surrounding q, and BATCH is the set of evidence blocks employed for sampling negatives. Apart from learning word-matching features, it also learns abstract representations, as the pseudo-question might or might not be present in the evidence. Post ICT, learning is defined as a distribution over answer derivations:

P_{learn}(b, s \mid q) = \frac{\exp(S(b, s, q))}{\sum_{b' \in TOP(k)} \sum_{s' \in b'} \exp(S(b', s', q))}    (71)

where TOP(k) are the top retrieved blocks based on S_retr. In this framework, evidence retrieval from the whole of Wikipedia is implemented as a latent variable, which is unfeasible to train from scratch; hence the retriever is pre-trained with an ICT.

VII-C REALM
This framework explicitly attends to a vast corpus like Wikipedia; however, its retriever learns via backpropagation and performs Maximum Inner Product Search (MIPS) via cosine similarity to choose document appropriateness. The retriever is designed to cache and asynchronously update each document to overcome the computational challenge of retrieving candidate documents on a multi-million scale.
In pre-training, the model needs to predict the randomly masked tokens via the knowledge retrieval relevance score f(x, z), the inner product of the vector embeddings of x and z (MIPS). To implement a knowledge-based encoder, the combination of input x and retrieved document z from a corpus Ƶ is fed as a sequence to fine-tune the Transformer p(y | z, x), as shown in figure 17. This enables complete cross-attention between x and z to predict the output y, where:

f(x, z) = Embed_{input}(x)^T Embed_{doc}(z)
p(z \mid x) = \frac{\exp f(x, z)}{\sum_{z'} \exp f(x, z')}    (72)
p(y \mid x) = \sum_{z \in Ƶ} p(y \mid z, x) \, p(z \mid x)    (73)

Like ORQA, BERT is implemented for the embedding:

join_{BERT}(x) = [CLS]x[SEP]    (74)
join_{BERT}(x_1, x_2) = [CLS]x_1[SEP]x_2[SEP]    (75)

In the pre-training of BERT's masked language modeling task, each mask in token x needs to be predicted as:
where BERT_START(s) and BERT_END(s) denote the Transformer output vectors corresponding to the start and end tokens of span s, and MLP denotes a feed-forward neural network.

VII-D RETRIEVAL AUGMENTED GENERATION: RAG
RAG is a flexible combination of the 'closed-book' (i.e., parametric model) and the 'open-book' (i.e., retrieval model) approaches, outperforming current language models. The parametric memory is a sequence-to-sequence pre-trained model, whereas a Wikipedia representation via a dense vector index constitutes the non-parametric memory, which is accessed via a pre-trained neural retriever. Since RAG is built as a culmination of the two, it does not require prior training, as knowledge is available via the retrieved pre-trained data, unlike former non-parametric architectures [92]. To achieve greater context in output sequence (y) generation, the general-purpose RAG incorporates retrieved text passages z for a given input x, which involves two major components:
(i) Retriever p_η(z | x), parameterized via η, which returns the top-matched content from the text passages for a query x. The RAG-Sequence architecture's retrieved passage acts as a latent variable that is marginalized to achieve the maximum probability p(y | x) across the top-K approximations:

p_{RAG-Sequence}(y \mid x) = \sum_{z \in top\text{-}k(p_{\eta}(\cdot \mid x))} p_{\eta}(z \mid x) \prod_{i=1}^{N} p_{\theta}(y_i \mid x, z, y_{1:i-1})

To effectively compute the top-k(p_η(· | x)) elements z with the highest probability p_η(z | x), DPR employs a MIPS index, where BART is used as the generator p_θ(y_i | x, z_i, y_{1:i−1}). The retriever and generator are trained in conjunction to retrieve the target document in a semi-unsupervised manner.

VII-D DENSE PASSAGE RETRIEVAL: DPR
DPR enhances open-domain QA retrieval using the dual-encoder approach, unlike the computationally intensive ICT. Its dense encoder E_P(·) indexes all M passages in a continuous, low-dimensional (d) space from which the top relevant passages for a query can be retrieved effectively at run time. A separate encoder E_Q(·) is deployed to map the query to a d-dimensional vector at run time and retrieve the k passages most relevant to the question vector. The dot product between the query and passage determines their similarity, sim(q, p) = E_Q(q)^T · E_P(p). The goal is to learn a superior embedding function by training the encoders, which involves the creation of a vector space where relevant question-passage pairs possess smaller distances, i.e., greater similarity, than irrelevant ones. Assume training data with m instances 𝒟 = {⟨q_i, p_i^+, p_{i,1}^-, .., p_{i,n}^-⟩}_{i=1}^{m}
where each instance contains one query q_i, one positive (relevant) passage p_i^+, and n negative (irrelevant) passages p_{i,j}^-. The loss function can be optimized as the negative log-likelihood of the positive passage:

L(q_i, p_i^+, p_{i,1}^-, \dots, p_{i,n}^-) = -\log \frac{e^{sim(q_i, p_i^+)}}{e^{sim(q_i, p_i^+)} + \sum_{j=1}^{n} e^{sim(q_i, p_{i,j}^-)}}    (84)

VIII. LONG SEQUENCE MODELS
Vanilla Transformers break input sequences into chunks if their length exceeds 512 tokens, which results in a loss of context when related words exist in different chunks. This constraint results in a lack of contextual information, leading to inefficient prediction and compromised performance, and it gave rise to such models.

VIII-A DEEPER SELF-ATTENTION
This 64-layer Transformer [93] was built on the discovery that it possessed greater character-level modeling of longer-range sequences: information is swiftly transmitted over arbitrary distances, compared to an RNN's unitary step progression. However, the three following supporting losses were added to the vanilla Transformer, which accelerated convergence and provided the ability to train deeper networks.
(i) Prediction across Multiple Positions: Generally, causal prediction occurs at a single position in the final layer; here, however, all positions are used for prediction. These auxiliary losses compel the model to predict on smaller contexts and accelerate training without weight decay.
(ii) Predictions from Intermediate Layers: Apart from the final layer, predictions from all intermediate layers are added for a given sequence. As training progresses, the weightage of the lower layers is progressively reduced: for n layers, the contribution of the l-th intermediate layer ceases to exist after completing l/2n of the training.
(iii) Multiple Target Prediction: The model is modified to generate two or more predictions of future characters, where a separate classifier is introduced for every new target. The extra target losses are weighed in half before being added to the corresponding layer loss.
The above 3 implementations are expressed in figure 19. For sequence length L, the language model computes a joint autoregressive probability distribution over token sequences:

P(t_{0:L}) = P(t_0) \prod_{i=1}^{L} P(t_i \mid t_{0:i-1})    (85)

VIII-B TRANSFORMER-XL
To mitigate context fragmentation in vanilla Transformers, XL incorporates lengthier dependencies, where it reuses and caches the prior hidden states from which data is propagated via recurrence. Given a corpus of tokens x = (x_1, x_2, ..., x_T), a language model computes the joint probability P(x) autoregressively, where the context x_{<t} is encoded into a fixed-size hidden state:

P(x) = \prod_t P(x_t \mid x_{<t})    (86)

Assume two consecutive segments of length L, s_τ = [x_{τ,1}, …, x_{τ,L}] and s_{τ+1} = [x_{τ+1,1}, …, x_{τ+1,L}], where the n-th layer hidden state sequence produced by the τ-th segment s_τ is h_τ^n ∈ ℝ^{L×d}, with d the hidden dimension. The n-th hidden layer state for the segment s_{τ+1} is computed as follows:

\tilde{h}_{\tau+1}^{n-1} = [SG(h_{\tau}^{n-1}) \bullet h_{\tau+1}^{n-1}]    (87)
q_{\tau+1}^{n}, k_{\tau+1}^{n}, v_{\tau+1}^{n} = h_{\tau+1}^{n-1} W_Q^T, \; \tilde{h}_{\tau+1}^{n-1} W_K^T, \; \tilde{h}_{\tau+1}^{n-1} W_V^T    (88)
h_{\tau+1}^{n} = \mathrm{Transformer\text{-}Layer}(q_{\tau+1}^{n}, k_{\tau+1}^{n}, v_{\tau+1}^{n})    (89)

where SG(·) represents stop-gradient, [h_u • h_v] is the concatenation of the two hidden sequences, and W are the model parameters. The key distinction from the original Transformer lies in modeling the key k_{τ+1}^n and value v_{τ+1}^n
with respect to the extended context h̃_{τ+1}^{n−1}, and hence the preceding h_τ^{n−1} are cached. This can be seen in figure 20 above, where the prior attention span is cached by the latter, forming an elongated caching mechanism. Such recurrence is applied to every two consecutive segments to create a segment-level recurrence via hidden states. In the original Transformer, the attention score within the same segment between the query (q_i) and key (k_j) vectors is:

A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j    (90)

From the perspective of relative positional encoding, the above equation is remodeled in the following manner:

A_{i,j}^{rel} = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}    (91)

FIGURE 22. (a) Sparse Transformer Architecture (b) Decoder-based Full Attention with Causal Masking (c) Strided Sparsity (d) Fixed Sparsity

The strided attention is implemented in two dimensions, where one head attends to the previous l locations and the other attends to every l-th location, where the stride value l is close to √n. This is expressed as A_i^{(1)} = {t, t+1, .., i} for t = max(0, i − l) and A_i^{(2)} = {j : (i − j) mod l = 0}. This linear transformation leads to this dense attention:

Attention(X) = W_p \cdot attend(X, S)    (92)

where W_p is the post-attention matrix. Similarly, to implement factorized attention heads, one attention type is used alternately per residual block, or interleaved, or a hyperparameter determines the ratio:

Attention(X) = W_p \cdot attend(X, A^{(r \bmod p)})    (93)

where r is the current residual block index and p is the factorized head count. An alternative merged-head approach has one head attend to the target locations that both factorized heads would attend to. This approach is computationally more expensive by a constant factor:

Attention(X) = W_p \cdot attend\left(X, \bigcup_{m=1}^{p} A^{(m)}\right)    (94)

where n ∈ (input sequence size), w ∈ (window size)
global attention = (2 × n × s), where s ∈ (number of tokens with full attention)
Window Attention Size = n_0, hence (n_0 = w)
Total attention complexity = n(n_0 + 2s) ∈ O(n), if n_0 ≠ n
Total Memory Requirements = n(n_0 + 2s) × Number of Transformer Layers

Global attention enables chunk-less document processing; however, its space-time complexity will be greater than RoBERTa's if the sequence length exceeds the window size:

\{O(RoBERTa) = O(n_0^2)\} < \{O(Longformer) = O(n(n_0 + 2s))\}, \; if\ n > n_0

IX-B REFORMER
Reformer reduces the Transformer attention complexity to O(L log L) via locality-sensitive hashing (LSH). This assigns each vector x to a hash h(x), where, with high probability, neighboring vectors obtain the same hash within hash buckets of similar size and remote ones do not. The modified LSH attention equation is:

o_i = \sum_{j \in P_i} \exp(q_i \cdot k_j - z(i, P_i)) \, v_j, \quad where\ P_i = \{j : i \ge j\}    (96)

P_i is the set of positions that the i-th query attends to, and z is the partition function, which covers the range of nearby keys that a query attends to. For batching purposes, attention is performed over P̃_i = {0, 1, ..., l} ⊇ P_i, where P_i is a subset of P̃_i and elements not in P_i are masked:

o_i = \sum_{j \in \tilde{P}_i} \exp(q_i \cdot k_j - m(j, P_i) - z(i, P_i)) \, v_j, \quad where\ m(j, P_i) = \begin{cases} \infty, & if\ j \notin P_i \\ 0, & otherwise \end{cases}    (97)

The decoder implements masking to prevent access to future query positions. The target items in the set P_i can only be attended to by a query at the i-th position by enabling attention within a single hash bucket. To further reduce the probability of similar items falling into different buckets, several parallel hashing rounds (n_rounds) are performed with distinct hash functions {h^{(1)}, h^{(2)}, ...}:

\mathcal{P}_i = \bigcup_{r=1}^{n_{rounds}} \mathcal{P}_i^{(r)}, \quad where\ \mathcal{P}_i^{(r)} = \{j : h^{(r)}(q_i) = h^{(r)}(q_j)\}    (98)

Attention is done on chunks of sorted queries and keys for batching:

\tilde{\mathcal{P}}_i^{(r)} = \left\{ j : \left\lfloor \frac{s_i^{(r)}}{m} \right\rfloor - 1 \le \left\lfloor \frac{s_j^{(r)}}{m} \right\rfloor \le \left\lfloor \frac{s_i^{(r)}}{m} \right\rfloor \right\}    (99)

From (96) and (97) we can write:

o_i = \sum_{j \in \tilde{\mathcal{P}}_i} \exp(q_i \cdot k_j - m(j, \mathcal{P}_i) - z(i, \mathcal{P}_i)) \, v_j    (100)
\;\;\; = \sum_{r=1}^{n_{rounds}} \exp(z(i, \mathcal{P}_i^{(r)}) - z(i, \mathcal{P}_i)) \sum_{j \in \tilde{\mathcal{P}}_i^{(r)}} \frac{1}{N_{i,j}} \exp(q_i \cdot k_j - m(j, \mathcal{P}_i^{(r)}) - z(i, \mathcal{P}_i^{(r)})) \, v_j    (101)
\;\;\; = \sum_{r=1}^{n_{rounds}} \exp(z(i, \mathcal{P}_i^{(r)}) - z(i, \mathcal{P}_i)) \, o_i^{(r)}    (102)
o_i^{(r)} = \sum_{j \in \tilde{\mathcal{P}}_i^{(r)}} \exp(q_i \cdot k_j - m(j, \mathcal{P}_i^{(r)}) - z(i, \mathcal{P}_i^{(r)})) \, v_j    (103)
21
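A compact NumPy sketch of the bucketing idea follows (single hashing round, no chunked batching, random-rotation LSH; names and sizes are assumed for illustration and this is not the Reformer code). Queries and keys share the same vectors, and a position only scores the earlier positions that fall in its bucket.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Angular LSH via a random rotation: vectors pointing in similar
    directions land in the same bucket with high probability."""
    r = rng.normal(size=(x.shape[-1], n_buckets // 2))
    rotated = x @ r
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

def lsh_attention(qk, v, n_buckets, rng):
    """Single-round LSH attention with shared query/key vectors.
    Position i attends only to positions j <= i in the same bucket."""
    L = qk.shape[0]
    buckets = lsh_buckets(qk, n_buckets, rng)
    out = np.zeros_like(v)
    for i in range(L):
        same = np.where((buckets == buckets[i]) & (np.arange(L) <= i))[0]
        scores = qk[same] @ qk[i] / np.sqrt(qk.shape[-1])
        w = np.exp(scores - scores.max())          # softmax within the bucket
        out[i] = (w / w.sum()) @ v[same]
    return out

rng = np.random.default_rng(0)
qk = rng.normal(size=(128, 16))     # shared Q = K, as Reformer ties them
v = rng.normal(size=(128, 16))
print(lsh_attention(qk, v, n_buckets=8, rng=rng).shape)   # (128, 16)
```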
The following example in figure 23 comprehensively demonstrates the various working mechanisms of the Reformer.

FIGURE 23. (a) Bucket formation of similar Attention vectors (b) Simple bucketing of a Query-Key pair (c) Query-Key sequence distribution based on (a) before Bucketing (d), (e) Bucketing and Chunking of (c)

Further, the Reformer employs reversible residual layers, where the inputs of each block are recomputed from the next layer's activations as:

Y_1 = X_1 + f(X_2, Layer 2),  Y_2 = X_2 + f(X_1, Layer 1)    (104)
X_1 = Y_1 − f(Y_2, Layer 2),  X_2 = Y_2 − f(Y_1, Layer 1)    (105)
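A minimal NumPy sketch of a reversible block follows (an illustration, not the Reformer implementation; it uses the standard RevNet-style coupling, which is what equations (104)-(105) describe up to notation): the inputs are recovered exactly from the outputs, so intermediate activations need not be stored.

```python
import numpy as np

def rev_forward(x1, x2, f, g):
    """Reversible residual block (cf. eq. 104): outputs can be inverted,
    so activations need not be stored for backpropagation."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2, f, g):
    """Recompute the block's inputs from the next layer's activations (cf. eq. 105)."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

# toy residual functions standing in for attention / feed-forward sub-layers
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
f = lambda h: np.tanh(h @ W1)
g = lambda h: np.tanh(h @ W2)

x1, x2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
y1, y2 = rev_forward(x1, x2, f, g)
r1, r2 = rev_inverse(y1, y2, f, g)
print(np.allclose(x1, r1), np.allclose(x2, r2))   # True True
```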
FIGURE 25. (a) Smaller model via Embedding size reduction (b) Effective learning via sharing of Attention parameters

FIGURE 26. (a) BERT's NSP learning via simple non-reversed pair order (b) ALBERT's SOP dual sentiment learning via sentence order reversal.
Like BERT's NSP, ALBERT's sentence-order prediction (SOP) loss incorporated two-pronged learning from two positive successive text segments that also included their corresponding negative samples with orders reversed, as demonstrated in figure 26. This influences the model to learn contextually the finer-grained discrepancies in any discourse, giving superior coherent performance. Its MLM target implements n-gram masking comprising word sequences of up to 3-grams, like "World Cup Football" or "Natural Language Processing", where the length n of each masked n-gram is drawn with probability:

p(n) = \frac{1/n}{\sum_{k=1}^{N} 1/k}    (106)

with N the maximum n-gram length (here 3).
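A tiny Python sketch of equation (106) follows (illustrative only; the maximum length N = 3 and the 15% masking budget are assumptions taken from the surrounding text, not code from ALBERT):

```python
import numpy as np

def ngram_length_probs(N=3):
    """p(n) = (1/n) / sum_{k=1..N} 1/k -- shorter n-grams are more likely."""
    weights = np.array([1.0 / n for n in range(1, N + 1)])
    return weights / weights.sum()

p = ngram_length_probs(3)
print(dict(zip([1, 2, 3], p.round(3))))   # {1: 0.545, 2: 0.273, 3: 0.182}

# sample masked-span lengths until an assumed 15% masking budget is filled
rng = np.random.default_rng(0)
tokens = 512
budget = int(0.15 * tokens)
lengths = []
while sum(lengths) < budget:
    lengths.append(int(rng.choice([1, 2, 3], p=p)))
print(sum(lengths), lengths[:10])
```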
IX-D ELECTRA
The advantage lies in its contextual learning via effective discrimination, where it learns from all input tokens, unlike BERT, which learns from a mere 15% masked-out subset. ELECTRA implements "replaced token detection", as shown in figure 27, where contamination occurs by replacing a few random tokens with probabilistically meaningful substitutions via a Generator (G), a small 'masked language model'.

FIGURE 27. Replaced token detection via model's combined training

Simultaneously, via binary classification, a larger model, the Discriminator (D), is jointly pre-trained to predict whether each token was restored correctly by the generator.

\mathcal{L}_{MLM}(x, \theta_G) = \mathbb{E}\Big( \sum_{i \in m} -\log p_G(x_i \mid x^{masked}) \Big)    (107)
\mathcal{L}_{Disc}(x, \theta_D) = \mathbb{E}\Big( \sum_{t=1}^{n} -\mathbb{1}(x_t^{corr} = x_t)\log D(x^{corr}, t) - \mathbb{1}(x_t^{corr} \neq x_t)\log(1 - D(x^{corr}, t)) \Big)    (108)

The two encoder-based networks (G, D) transform an input token sequence x = [x_1, .., x_n] into a contextualized vector representation h_x = [h_1, .., h_n]. Via a softmax, G yields the likelihood of generating the token x_t at the t-th position, where x_t = [MASK]:

p_G(x_t \mid x) = \frac{\exp(e(x_t)^T h_G(x)_t)}{\sum_{x'} \exp(e(x')^T h_G(x)_t)}    (109)

The combined loss over a large corpus \chi is minimized as:

\min_{\theta_G, \theta_D} \sum_{x \in \chi} \mathcal{L}_{MLM}(x, \theta_G) + \lambda \mathcal{L}_{Disc}(x, \theta_D)    (110)
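A schematic PyTorch-style sketch of the joint objective in (107)-(110) is shown below. The tensor names and shapes are assumptions for illustration, and this is not the released ELECTRA code: the generator is trained with MLM on the masked positions, and the discriminator with a per-token binary loss on the corrupted sequence.

```python
import torch
import torch.nn.functional as F

def electra_losses(gen_logits, disc_logits, input_ids, corrupted_ids, mask_pos, lam=50.0):
    """Joint ELECTRA-style objective (cf. eqs. 107-110), assuming:
      gen_logits  : (B, L, V) generator predictions on the masked input
      disc_logits : (B, L)    discriminator scores on the corrupted input
      input_ids   : (B, L)    original tokens
      corrupted_ids / mask_pos : generator samples and boolean mask of masked slots
      lam         : weight balancing the two terms (assumed value)
    """
    # (107) masked-language-model loss, only over masked positions
    mlm = F.cross_entropy(gen_logits[mask_pos], input_ids[mask_pos])
    # (108) replaced-token detection: target 1 where the corrupted token still
    # matches the original, 0 where it was replaced
    is_original = (corrupted_ids == input_ids).float()
    disc = F.binary_cross_entropy_with_logits(disc_logits, is_original)
    # (110) combined loss
    return mlm + lam * disc

# toy shapes
B, L, V = 2, 16, 100
gen_logits = torch.randn(B, L, V)
disc_logits = torch.randn(B, L)
input_ids = torch.randint(0, V, (B, L))
mask_pos = torch.rand(B, L) < 0.15
corrupted_ids = torch.where(mask_pos, torch.randint(0, V, (B, L)), input_ids)
print(electra_losses(gen_logits, disc_logits, input_ids, corrupted_ids, mask_pos))
```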
IX-E LINFORMER
Linformer demonstrates [95] that attention weights are dominated by a few key entries, hence the sequence length is down-projected to a target output matrix via low-rank self-attention that achieves linear time and space complexity, O(n). During the computation of keys and values, two linear projection matrices E_i, F_i are added, where the (n × d)-dimensional key and value layers K W_i^K and V W_i^V are projected into (k × d)-dimensional key and value layers; thereafter the resulting (n × k)-dimensional context mapping \bar{P} is computed using scaled dot-product attention:

head_i = \text{softmax}\Big( \frac{Q W_i^Q (E_i K W_i^K)^T}{\sqrt{d_k}} \Big) \cdot (F_i V W_i^V)    (111)

where the softmax term \bar{P} is (n × k)-dimensional and F_i V W_i^V is (k × d)-dimensional. If k << n, then a significant reduction of memory and space consumption is achieved. For further efficient optimization, parameter sharing between projections is performed at three levels: (i) Headwise Sharing: for each layer, the two projection matrices E and F are shared such that E_i = E, F_i = F across all heads i. (ii) Key-Value Sharing: in addition to (i), the key and value projections are shared, with a single projection matrix E = E_i = F_i per layer for every key-value projection across all heads i. (iii) Layer-wise Sharing: a single projection matrix E is used for all layers, heads, keys, and values. For a 12-layer, 12-head Transformer, (i), (ii), and (iii) incorporate 24, 12, and 1 distinct linear projection matrices, respectively.
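A minimal NumPy sketch of one head as in (111) follows (shapes and scaling are assumed for illustration, with the projections E and F shared across heads as in headwise sharing; this is not the Linformer library code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_head(X, Wq, Wk, Wv, E, F):
    """One low-rank attention head (eq. 111): keys/values are projected from
    length n down to length k, giving an (n x k) score matrix instead of (n x n)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each n x d
    K_proj = E @ K                            # k x d   (E assumed k x n here)
    V_proj = F @ V                            # k x d   (F assumed k x n here)
    P = softmax(Q @ K_proj.T / np.sqrt(Q.shape[-1]))   # n x k context mapping
    return P @ V_proj                         # n x d

n, d, k = 4096, 64, 256
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
E = rng.normal(size=(k, n)) / np.sqrt(n)
F = rng.normal(size=(k, n)) / np.sqrt(n)
print(linformer_head(X, Wq, Wk, Wv, E, F).shape)       # (4096, 64)
```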
IX-F PERFORMER
The standard attention (Q_{L×d} · K^T_{d×L}) · V_{L×d} results in quadratic time complexity O(L^2 d); the preferable implementation Q_{L×d} · (K^T_{d×L} · V_{L×d}) leads to O(d^2 L), where L ≫ d. However, decomposition of the query-key product into this pristine form is not possible after applying the softmax non-linearity. Pre-softmax decomposition of attention is nevertheless possible via approximation with lower-ranked queries and keys, enabling greater efficiency; specifically Q'K'^T ≅ softmax(QK^T / \sqrt{d}) ≅ exp(QK^T). This is achieved via a kernel approximation function K(x, y) = \varphi(x)^T \varphi(y), the dot product of a high-dimensional feature map \varphi. Contrary to the kernel trick, where the dimensionality is increased, the Performer [96] decomposes the attention matrix A(i, j) = K(q_i, k_j) = exp(q_i k_j^T) into a lower-dimensional feature map \varphi.
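The following simplified NumPy sketch illustrates kernelized attention in the Performer spirit (positive random features for the softmax kernel; the variance-reduction tricks of the actual FAVOR+ mechanism are omitted, and all sizes are assumed). The key point is the reordering \varphi(Q)(\varphi(K)^T V), which is linear in the sequence length L.

```python
import numpy as np

def phi(X, W):
    """Positive random features for the softmax kernel:
    exp(q.k) ~= E_w[ exp(w.q - |q|^2/2) * exp(w.k - |k|^2/2) ], w ~ N(0, I)."""
    m = W.shape[1]
    return np.exp(X @ W - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(m)

def performer_attention(Q, K, V, W):
    Qf, Kf = phi(Q, W), phi(K, W)              # L x m feature maps
    # reorder the matmuls: phi(Q) (phi(K)^T V) costs O(L*m*d) instead of O(L^2*d)
    num = Qf @ (Kf.T @ V)
    den = Qf @ Kf.sum(axis=0)                  # row-wise softmax normalizer
    return num / den[:, None]

rng = np.random.default_rng(0)
L, d, m = 1024, 16, 128
Q = rng.normal(size=(L, d)) / d**0.25          # fold in the 1/sqrt(d) scaling
K = rng.normal(size=(L, d)) / d**0.25
V = rng.normal(size=(L, d))
W = rng.normal(size=(d, m))                    # random projection defining phi

approx = performer_attention(Q, K, V, W)
exact = np.exp(Q @ K.T)
exact = (exact / exact.sum(axis=1, keepdims=True)) @ V
print(np.abs(approx - exact).mean())           # approximation error shrinks as m grows
```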
X. MODELING CLASSIFICATION OF LMs
Transformer-based language models (LMs) can be classified into 3 categories [97] from a modeling perspective:
(i) Autoregressive: These are pre-trained feedforward models that predict future tokens from the token history. Here the output y_t is dependent on the input at time instant x_t and the previous time-step inputs x_{<t}. These are primarily decoder-based Transformers that incorporate causal masking, where attention heads are prevented from attending to future tokens. Such models are generally fine-tuned for text generation purposes and deploy zero-shot learning in the GPT series.
(ii) Auto-Encoded: These encoder-based models have full access to the input array, devoid of any masking. To learn, they are pre-trained via input token masking schemes and then fine-tuned to reproduce the masked tokens as output.
These models (BERT) are generally appropriate for sequence or token classification tasks.
(iii) Sequence to Sequence: These encoder-decoder-based generative models create data post learning from a massive dataset. Unlike the discriminative distribution P(Y|X), they model the joint distribution P(X, Y) of input X and target Y, where the input can be corrupted under several schemes. Decoder-based causal masking is deployed to maximize learning for subsequent target generation. Models like BART and T5 perform best on NMT, summarization, or QA tasks.
A comprehensive overview of the above-mentioned modeling classification is presented in figure 29, and the sketch below contrasts the attention masks that distinguish these three classes.
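The small NumPy sketch below is illustrative only (mask shapes are assumptions, not any library's API): autoregressive decoders use causal masks, auto-encoders see the full input, and sequence-to-sequence models combine a fully visible encoder with a causally masked decoder plus cross-attention.

```python
import numpy as np

def causal_mask(L):
    """Autoregressive (GPT-style): position i may attend only to j <= i."""
    return np.tril(np.ones((L, L), dtype=bool))

def full_mask(L):
    """Auto-encoded (BERT-style): every position sees the whole input."""
    return np.ones((L, L), dtype=bool)

def seq2seq_masks(L_src, L_tgt):
    """Sequence-to-sequence (BART/T5-style): full self-attention on the encoder,
    causal self-attention on the decoder, full cross-attention to the encoder."""
    return {
        "encoder_self": full_mask(L_src),
        "decoder_self": causal_mask(L_tgt),
        "decoder_cross": np.ones((L_tgt, L_src), dtype=bool),
    }

print(causal_mask(4).astype(int))
print(full_mask(4).astype(int))
print({k: v.shape for k, v in seq2seq_masks(6, 4).items()})
```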
XI. LANGUAGE MODEL PERFORMANCE COMPARISON
The quantitative performance of a few major NLP models is shown in figure 28, which is based on the GLUE and SuperGLUE benchmarks. These benchmarks contain a variety of datasets that judge the models on several NLP tasks. With the highest number of trainable parameters, GPT-3 is the largest model in this comparison. Since GPT-3 is the newest model here, it does not participate in the older GLUE benchmark.
From a qualitative perspective, T5 uses, within the same model, the same loss function and hyperparameters spread across a variety of tasks, leading to a multi-task learning environment. It performs the best, as this scalable text-to-text generative (NLG) model couples the denoising objective during its training with massive amounts of unlabelled data. This leads to superior learning and greater generalized performance over NLU models like RoBERTa, which are fine-tuned for individual downstream tasks after pre-training.

FIGURE 28. Graphical Representation of Language Model Performance

The primary motive of several rounds of fine-tuning in NLU models is to achieve strong performance on multiple tasks. The major disadvantage is the requirement for a new and typically large dataset for each task. This amplifies the potential for poor out-of-distribution generalization, leading to unfair comparison with human-level abilities. GPT-3 does not operate on fine-tuning, as its focus is to deliver task-agnostic execution. However, there is scope for minimal fine-tuning in GPT-3, which leads to one- or few-shot learning. The idea is to perform zero or minimal gradient updates after pre-training a huge model on a massive dataset. Though GPT-3 does not rank highly on the SuperGLUE benchmark, the key is that this generative model is the quickest in learning any task at inference time. It matches the performance of SOTA fine-tuned models on several NLP tasks in the zero-, one-, and few-shot settings. It also generates high-quality samples and gives a solid qualitative performance at tasks defined on the fly.

XII. CONCLUSION AND FUTURE DIRECTIONS
We provide a comprehensive and detailed summary of the major language models that have led to the current SOTA in NLP performance. Since the launch of the Attention mechanism and Transformer architecture, NLP has advanced exponentially. We presented a high-level mind map of model classifications via a taxonomy. These classifications are primarily based on Transformer-derivative architectures built for specialized tasks like Language Understanding and Generation, Model Size Reduction via Distillation, Quantization and Pruning, Information Retrieval, Long Sequence Modeling, and other generalized model reduction techniques. Recent language models are primarily driven by attaining higher NLP performance, requiring huge computing resources. Thus, model scaling has been the natural pathway in industry. This exponential scaling, coupled with higher attention complexity, makes these models infeasible to access at a global scale. Subsequently, significant efforts have been made to engineer reasonably sized models and efficient attention computation to speed up model convergence, leading to lower latency.
Incorporating a Mixture of Experts (MoE) [98] methodology is an effective way for large models to achieve computational efficiency, as only a subset of the neural network is activated for every input. Consequently, this leads to sparsity, and although sparsity training is an active research area, current GPUs are better suited for dense matrix computations. While MoE models have demonstrated promise in training sparse matrices, their communication costs and complexity impede wide-scale deployment. Further, larger models are prone to memorizing training data, leading to overfitting and reduced learning [99]. To overcome this, models are trained for only a single epoch on de-duplicated instances of huge datasets, thereby exhibiting minimal overfitting.
Thus, an MoE design coupled with a robust training paradigm might in the future lead to highly scalable and efficient models. These models would possess superior language understanding, as data memorization would be minimized. The current approach in SOTA models relies on supervised learning on huge datasets. A promising area of future enhancement in NLP would be incorporating reinforcement learning in Machine Translation, text summarization, and Q&A tasks.
LANGUAGE MODEL | DESCRIPTION | TASKS | MODELING TYPE
GPT-I, II, III | • Unsupervised pre-training on large datasets • Autoregressive Language Modeling and Causal Masking | Q&A, NMT, Reading Comprehension, Text Summarization, Common Sense Reasoning, Zero-Shot | Autoregressive DECODER based Transformer
XLNET | • Greater Contextual Learning via Factorized Ordering on Input's Sequence Length • Bidirectional Contextual Language Modeling | Reading Comprehension, Natural Language Inference, Sentiment Analysis, Q&A | Autoregressive DECODER based Transformer
REFERENCES
[1] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436- [4] Devlin, J., Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
-444 (2015). “BERT: Pre-training of Deep Bidirectional Transformers for
[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Language Understanding.” NAACL-HLT (2019).
Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin [5] Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer
“Attention is all you need”, NeurIPS (2017). Levy, and Samuel R. Bowman. “GLUE: A Multi-Task Benchmark
[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, and Analysis Platform for Natural Language
“Improving language understanding with unsupervised learning”, Understanding.” BlackboxNLP@EMNLP (2018).
Technical report, OpenAI Blog, 2018 [6] Wang, Alex, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh,
Julian Michael, Felix Hill, Omer Levy and Samuel R.
Bowman.“SuperGLUE: A Stickier Benchmark for General-Purpose [29] Han, Song, Huizi Mao and W. Dally. “Deep Compression:
Language Understanding Systems.” NeurIPS (2019). Compressing Deep Neural Network with Pruning, Trained
[7] S. Bianco, R. Cadene, L. Celona, and P. Napoletano, “Benchmark Quantization, and Huffman Coding.” arXiv: Computer Vision and
analysis of representative deep neural network architectures,” IEEE Pattern Recognition (2016)
Access, vol. 6, pp. 64 270–64 277, 2018. [30] Lee, Kenton, Ming-Wei Chang, and Kristina Toutanova. “Latent
[8] Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Retrieval for Weakly Supervised Open Domain Question
“SQuAD: 100, 000+ Questions for Machine Comprehension of Answering.” ArXiv abs/1906.00300 (2019)
Text.” ArXiv abs/1606.05250 (2016) [31] Guu, Kelvin, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-
[9] Rajpurkar, P., Jia, R., & Liang, P, “Know What You Don't Know: Wei Chang. “REALM: Retrieval-Augmented Language Model Pre-
Unanswerable Questions for SQuAD”. ArXiv, abs/1806.03822, 2018 Training.” ArXiv abs/2002.08909 (2020)
[10] Warstadt, Alex, Amanpreet Singh, and Samuel R. Bowman. “Neural [32] Lewis, Patrick, Ethan Perez, Aleksandara Piktus, F. Petroni, V.
Network Acceptability Judgments.” Transactions of the Karpukhin, Naman Goyal, Heinrich Kuttler, M. Lewis, Wen-tau Yih,
Associationfor Computational Linguistics 7 (2019): 625-641. Tim Rocktäschel, Sebastian Riedel and Douwe Kiela. “Retrieval-
[11] McCoy, R. Thomas, Junghyun Min and Tal Linzen. “BERTs of a Augmented Generation for Knowledge-Intensive NLP
feather do not generalize together: Large variability in generalization Tasks.” ArXiv abs/2005.11401 (2020)
across models with similar test set performance.” [33] Karpukhin, V., Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Yu
ArXiv abs/1911.02969 (2020) Wu, Sergey Edunov, Danqi Chen and Wen-tau Yih. “Dense Passage
[12] ElSahar, Hady and Matthias Gallé. “To Annotate or Not? Predicting Retrieval for Open-Domain Question
Performance Drop under Domain Shift.” EMNLP/IJCNLP (2019). Answering.” ArXiv abs/2010.08191 (2020)
[13] Ruder, Sebastian and Barbara Plank. “Learning to select data for [34] Dai, Zihang, Z. Yang, Yiming Yang, J. Carbonell, Quoc V. Le, and R.
transfer learning with Bayesian Optimization.” EMNLP (2017). Salakhutdinov. “Transformer-XL: Attentive Language Models
[14] Yang, Z., Zihang Dai, Yiming Yang, J. Carbonell, R. Salakhutdinov, Beyond a Fixed-Length Context.” ArXiv abs/1901.02860 (2019)
and Quoc V. Le. “XLNet: Generalized Autoregressive Pretraining for [35] Beltagy, Iz, Matthew E. Peters, and Arman Cohan. “Longformer: The
Language Understanding.” NeurIPS (2019). Long-Document Transformer.” ArXiv abs/2004.05150 (2020)
[15] Liu, Y., Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi [36] Ainslie, Joshua, S. Ontañón, C. Alberti, V. Cvicek, Zachary Kenneth
Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin toyanov. Fisher, Philip Pham, Anirudh Ravula, S. Sanghai, Qifan Wang and L.
“RoBERTa: A Robustly Optimized BERT Pretraining Yang. “ETC: Encoding Long and Structured Inputs in
Approach.” ArXiv abs/1907.11692 (2019). Transformers.” EMNLP (2020).
[16] McCann, B., N. Keskar, Caiming Xiong, and R. Socher. “The Natural [37] Zaheer, Manzil, Guru Guruganesh, Kumar Avinava Dubey, Joshua
Language Decathlon: Multitask Learning as Question Ainslie, C. Alberti, S. Ontañón, Philip Pham, Anirudh Ravula, Qifan
Answering.” ArXiv abs/1806.08730 (2018). Wang, L. Yang and Amr Ahmed. “Big Bird: Transformers for
[17] Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Longer Sequences.” ArXiv abs/2007.14062 (2020)
Narang, M. Matena, Yanqi Zhou, W. Li, and Peter J. Liu. “Exploring [38] Kitaev, Nikita, L. Kaiser and Anselm Levskaya. “Reformer: The
the Limits of Transfer Learning with a Unified Text-to-Text Efficient Transformer.” ArXiv abs/2001.04451 (2020)
Transformer.” J. Mach. Learn. Res. 21 (2020): 140:1-140:67. [39] Child, R., Scott Gray, A. Radford, and Ilya Sutskever. “Generating
[18] Lewis, M., Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, A. Long Sequences with Sparse
Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Transformers.” ArXiv abs/1904.10509 (2019)
“BART: Denoising Sequence-to-Sequence Pre-training for Natural [40] Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel,
Language Generation, Translation, and Piyush Sharma, and Radu Soricut “ALBERT: A Lite BERT for Self-
Comprehension.” ArXiv abs/1910.13461 (2020) supervised Learning of Language
[19] Liu, Yinhan, Jiatao Gu, Naman Goyal, X. Li, Sergey Edunov, Marjan Representations.” ArXiv abs/1909.11942 (2020)
Ghazvininejad, M. Lewis and Luke Zettlemoyer. “Multilingual [41] Clark, K., Minh-Thang Luong, Quoc V. Le, and Christopher D.
Denoising Pre-training for Neural Machine Manning. “ELECTRA: Pre-training Text Encoders as Discriminators
Translation.” Transactions of the Association for Computational Rather Than Generators.” ArXiv abs/2003.10555 (2020)
Linguistics 8 (2020): 726-742. [42] Plummer, Bryan A., Nikoli Dryden, Julius Frost, Torsten Hoefler and
[20] C. Rosset, “Turing-nlg: A 17-billion-parameter language model by Kate Saenko. “Shapeshifter Networks: Cross-layer Parameter Sharing
Microsoft,” Microsoft Blog, 2019 for Scalable and Effective Deep Learning.” ArXiv abs/2006.10598
[21] Xie, Ziang, Guillaume Genthial, S. Xie, A. Ng and Dan Jurafsky. (2020)
“Noising and Denoising Natural Language: Diverse Backtranslation [43] Joshi, Mandar, Danqi Chen, Y. Liu, Daniel S. Weld, Luke Zettlemoyer,
for Grammar Correction.” NAACL-HLT (2018). and Omer Levy. “SpanBERT: Improving Pre-training by
[22] Brown, T., B. Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Representing and Predicting Spans.” Transactions of the Association
Dhariwal, Arvind Neelakantan, A. Radford, Ilya Sutskever, and Dario for Computational Linguistics 8 (2019): 64-77
Amodei. “Language Models are Few-Shot [44] Yang, Z., Peng Qi, Saizheng Zhang, Yoshua Bengio, William W.
Learners.” ArXiv abs/2005.14165 (2020) Cohen, R. Salakhutdinov and Christopher D. Manning. “HotpotQA: A
[23] Lepikhin, Dmitry, H. Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Dataset for Diverse, Explainable Multi-hop Question
Y. Huang, M. Krikun, Noam Shazeer and Z. Chen. “GShard: Scaling Answering.” ArXiv abs/1809.09600 (2018)
Giant Models with Conditional Computation and Automatic [45] Cho, Kyunghyun, B. V. Merrienboer, Çaglar Gülçehre, Dzmitry
Sharding.” ArXiv abs/2006.16668 (2020) Bahdanau, Fethi Bougares, Holger Schwenk and Yoshua Bengio.
[24] Strubell, Emma, Ananya Ganesh, and A. McCallum. “Energy and “Learning Phrase Representations using RNN Encoder-Decoder for
Policy Considerations for Deep Learning in Statistical Machine Translation.” ArXiv abs/1406.1078 (2014)
NLP.” ArXiv abs/1906.02243 (2019) [46] Hochreiter, S. and J. Schmidhuber. “Long Short-Term
[25] Hinton, Geoffrey E., Oriol Vinyals, and J. Dean. “Distilling the Memory.” Neural Computation 9 (1997): 1735-1780.
Knowledge in a Neural Network.” ArXiv abs/1503.02531 (2015) [47] Chung, J., Çaglar Gülçehre, Kyunghyun Cho and Yoshua Bengio.
[26] Sanh, Victor, Lysandre Debut, Julien Chaumond and Thomas Wolf. “Empirical Evaluation of Gated Recurrent Neural Networks on
“DistilBERT, a distilled version of BERT: smaller, faster, cheaper, Sequence Modeling.” ArXiv abs/1412.3555 (2014)
and lighter.” ArXiv abs/1910.01108 (2019) [48] Luong, Thang, Hieu Pham and Christopher D. Manning. “Effective
[27] Jiao, Xiaoqi, Y. Yin, L. Shang, Xin Jiang, X. Chen, Linlin Li, F. Wang, Approaches to Attention-based Neural Machine
and Qun Liu. “TinyBERT: Distilling BERT for Natural Language Translation.” ArXiv abs/1508.04025 (2015)
Understanding.” ArXiv abs/1909.10351 (2020). [49] Bahdanau, Dzmitry, Kyunghyun Cho and Yoshua Bengio. “Neural
[28] Sun, Zhiqing, H. Yu, Xiaodan Song, Renjie Liu, Yiming Yang and Machine Translation by Jointly Learning to Align and
Denny Zhou. “MobileBERT: a Compact Task-Agnostic BERT for Translate.” CoRR abs/1409.0473 (2015)
Resource-Limited Devices.” ACL (2020).
[50] Pascanu, Razvan, Tomas Mikolov and Yoshua Bengio. “On the [76] Narang, Sharan, Greg Diamos, S. Sengupta, and E. Elsen. “Exploring
difficulty of training recurrent neural networks.” ICML (2013). Sparsity in Recurrent Neural Networks.” ArXiv abs/1704.05119
[51] Mikolov, Tomas, Kai Chen, G. S. Corrado and J. Dean. “Efficient (2017)
Estimation of Word Representations in Vector [77] Zhu, M. and S. Gupta. “To prune, or not to prune: exploring the
Space.” CoRR abs/1301.3781 (2013) efficacy of pruning for model compression.” ArXiv abs/1710.01878
[52] Pennington, Jeffrey, R. Socher and Christopher D. Manning. “Glove: (2018)
Global Vectors for Word Representation.” EMNLP (2014). [78] Wang, Ziheng, Jeremy Wohlwend and Tao Lei. “Structured Pruning
[53] Melamud, Oren, J. Goldberger, and I. Dagan. “context2vec: Learning of Large Language Models.” ArXiv abs/1910.04732 (2020)
Generic Context Embedding with Bidirectional [79] Voita, Elena, P. Serdyukov, Rico Sennrich and Ivan Titov. “Context-
LSTM.” CoNLL (2016). Aware Neural Machine Translation Learns Anaphora
[54] McCann, B., James Bradbury, Caiming Xiong, and R. Socher. Resolution.” ArXiv abs/1805.10163 (2018)
“Learned in Translation: Contextualized Word [80] Tang, Raphael, Yao Lu, L. Liu, Lili Mou, Olga Vechtomova and Jimmy
Vectors.” NIPS (2017). Lin. “Distilling Task-Specific Knowledge from BERT into Simple
[55] Ramachandran, Prajit, Peter J. Liu and Quoc V. Le. “Unsupervised Neural Networks.” ArXiv abs/1903.12136 (2019)
Pretraining for Sequence to Sequence [81] Voita, Elena, David Talbot, F. Moiseev, Rico Sennrich, and Ivan
Learning.” ArXiv abs/1611.02683 (2017) Titov. “Analyzing Multi-Head Self-Attention: Specialized Heads Do
[56] Howard, J. and Sebastian Ruder. “Universal Language Model Fine- the Heavy Lifting, the Rest Can Be Pruned.” ACL (2019).
tuning for Text Classification.” ACL (2018). [82] Ding, Yanzhuo, Yang Liu, Huanbo Luan and M. Sun. “Visualizing
[57] Liu, Xiaodong, Pengcheng He, W. Chen and Jianfeng Gao. “Multi- and Understanding Neural Machine Translation.” ACL (2017).
Task Deep Neural Networks for Natural Language [83] Michel, Paul, Omer Levy, and Graham Neubig. “Are Sixteen Heads
Understanding.” ArXiv abs/1901.11504 (2019) Really Better than One?” ArXiv abs/1905.10650 (2019)
[58] Ba, Jimmy, J. Kiros, and Geoffrey E. Hinton. “Layer [84] Zhang, D., J. Yang, Dongqiangzi Ye, and Gang Hua. “LQ-Nets:
Normalization.” ArXiv abs/1607.06450 (2016) Learned Quantization for Highly Accurate and Compact Deep Neural
[59] Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Networks.” ArXiv abs/1807.10029 (2018)
Christopher Clark, Kenton Lee, and Luke Zettlemoyer. “Deep [85] Zhou, Aojun, A. Yao, Yiwen Guo, L. Xu, and Y. Chen. “Incremental
contextualized word representations.” NAACL-HLT (2018). Network Quantization: Towards Lossless CNNs with Low-Precision
[60] Howard, J. and Sebastian Ruder. “Universal Language Model Fine- Weights.” ArXiv abs/1702.03044 (2017)
tuning for Text Classification.” ACL (2018). [86] Lin, Xiaofan, Cong Zhao and W. Pan. “Towards Accurate Binary
[61] Wang, Yuxuan, W. Che, Jiang Guo, Yijia Liu, and Ting Liu. “Cross- Convolutional Neural Network.” ArXiv abs/1711.11294 (2017)
Lingual BERT Transformation for Zero-Shot Dependency [87] Shen, Sheng, Zhen Dong, J. Ye, L. Ma, Zhewei Yao, A. Gholami, M.
Parsing.” ArXiv abs/1909.06775 (2019) Mahoney and K. Keutzer. “Q-BERT: Hessian Based Ultra Low
[62] Radford, A., Jeffrey Wu, R. Child, David Luan, Dario Amodei, and Precision Quantization of BERT.” AAAI (2020).
Ilya Sutskever. “Language Models are Unsupervised Multitask [88] Zafrir, Ofir, Guy Boudoukh, Peter Izsak and M. Wasserblat. “Q8BERT:
Learners.” (2019). Quantized 8Bit BERT.” ArXiv abs/1910.06188 (2019)
[63] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to Sequence [89] Jacob, B., Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang,
Learning with Neural Networks.” NIPS (2014). A. Howard, H. Adam, and D. Kalenichenko. “Quantization and
[64] Lample, Guillaume and Alexis Conneau. “Cross-lingual Language Training of Neural Networks for Efficient Integer-Arithmetic-Only
Model Pretraining.” NeurIPS (2019). Inference.” 2018 IEEE/CVF Conference on Computer Vision and
[65] Song, K., X. Tan, T. Qin, Jianfeng Lu and T. Liu. “MASS: Masked Pattern Recognition (2018)
Sequence to Sequence Pre-training for Language [90] Bengio, Yoshua, N. Léonard and Aaron C. Courville. “Estimating or
Generation.” ICML (2019). Propagating Gradients Through Stochastic Neurons for Conditional
[66] Wenzek, Guillaume, Marie-Anne Lachaux, Alexis Conneau, Vishrav Computation.” ArXiv abs/1308.3432 (2013)
Chaudhary, F. Guzmán, Armand Joulin and E. Grave. “CCNet: [91] Qi, Peng, Xiaowen Lin, L. Mehr, Zijian Wang and Christopher D.
Extracting High Quality Monolingual Datasets from Web Crawl Manning. “Answering Complex Open-domain Questions Through
Data.” ArXiv abs/1911.00359 (2020) Iterative Query Generation.” ArXiv abs/1910.07000 (2019)
[67] Artetxe, M., Gorka Labaka and Eneko Agirre. “Learning bilingual [92] Kumar, A., Ozan Irsoy, Peter Ondruska, Mohit Iyyer, J. Bradbury,
word embeddings with (almost) no bilingual data.” ACL (2017). Ishaan Gulrajani, Victor Zhong, Romain Paulus and R. Socher. “Ask
[68] Lample, Guillaume, Myle Ott, Alexis Conneau, Ludovic Denoyer and Me Anything: Dynamic Memory Networks for Natural Language
Marc'Aurelio Ranzato. “Phrase-Based & Neural Unsupervised Processing.” ICML (2016).
Machine Translation.” ArXiv abs/1804.07755 (2018) [93] Al-Rfou, Rami, Dokook Choe, Noah Constant, Mandy Guo, and Llion
[69] Baldwin, Timothy T., and J. K. Ford. “TRANSFER OF TRAINING: Jones. “Character-Level Language Modeling with Deeper Self-
A REVIEW AND DIRECTIONS FOR FUTURE Attention.” AAAI (2019).
RESEARCH.” Personnel Psychology 41 (1988): 63-105. [94] Gomez, Aidan N., Mengye Ren, R. Urtasun and Roger B. Grosse.
[70] Clark, Kevin, Urvashi Khandelwal, Omer Levy and Christopher D. “The Reversible Residual Network: Backpropagation Without Storing
Manning. “What Does BERT Look At? An Analysis of BERT's Activations.” ArXiv abs/1707.04585 (2017)
Attention.” ArXiv abs/1906.04341 (2019) [95] Wang, Sinong, Belinda Z. Li, Madian Khabsa, Han Fang and Hao Ma.
[71] Liu, Zhuang, M. Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. “Linformer: Self-Attention with Linear
“Rethinking the Value of Network Complexity.” ArXiv abs/2006.04768 (2020)
Pruning.” ArXiv abs/1810.05270 (2019) [96] Choromanski, Krzysztof, Valerii Likhosherstov, David Dohan,
[72] Fan, Angela, E. Grave and Armand Joulin. “Reducing Transformer Xingyou Song, A. Gane, Tamás Sarlós, P. Hawkins, J. Davis, Afroz
Depth on Demand with Structured Mohiuddin, L. Kaiser, D. Belanger, Lucy J. Colwell, and Adrian
Dropout.” ArXiv abs/1909.11556 (2020) Weller. “Rethinking Attention with
[73] Srivastava, Nitish, Geoffrey E. Hinton, A. Krizhevsky, Ilya Sutskever, Performers.” ArXiv abs/2009.14794(2020)
and R. Salakhutdinov. “Dropout: a simple way to prevent neural [97] Hugging Face Modeling Classification of NLP Models
networks from overfitting.” J. Mach. Learn. Res. 15 (2014): 1929- [98] Fedus, W., Barret Zoph, and Noam Shazeer. “Switch Transformers:
1958. Scaling to Trillion Parameter Models with Simple and Efficient
[74] Wan, Li, Matthew D. Zeiler, Sixin Zhang, Y. LeCun, and R. Fergus. Sparsity.” ArXiv abs/2101.03961 (2021): n. pag.
“Regularization of Neural Networks using [99] Carlini, N., Florian Tramèr, Eric Wallace, M. Jagielski, Ariel Herbert-
DropConnect.” ICML (2013). Voss, K. Lee, A. Roberts, Tom Brown, D. Song, Úlfar Erlingsson,
[75] Sajjad, Hassan, F. Dalvi, Nadir Durrani and Preslav Nakov. “Poor Alina Oprea and Colin Raffel. “Extracting Training Data from Large
Man's BERT: Smaller and Faster Transformer Language Models.” ArXiv abs/2012.07805 (2020): n. pag.
Models.” ArXiv abs/2004.03844 (2020)