
GLM: General Language Model Pretraining with Autoregressive Blank Infilling

Zhengxiao Du*1,2  Yujie Qian*3  Xiao Liu1,2  Ming Ding1,2  Jiezhong Qiu1,2  Zhilin Yang†1,4  Jie Tang†1,2
1 Tsinghua University  2 Beijing Academy of Artificial Intelligence (BAAI)  3 MIT CSAIL  4 Shanghai Qi Zhi Institute
zx-du20@mails.tsinghua.edu.cn  yujieq@csail.mit.edu  {zhiliny,jietang}@tsinghua.edu.cn

arXiv:2103.10360v2 [cs.CL] 17 Mar 2022

Abstract

There have been various types of pretraining architectures including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and encoder-decoder models (e.g., T5). However, none of the pretraining frameworks performs the best for all tasks of the three main categories: natural language understanding (NLU), unconditional generation, and conditional generation. We propose a General Language Model (GLM) based on autoregressive blank infilling to address this challenge. GLM improves blank infilling pretraining by adding 2D positional encodings and allowing spans to be predicted in an arbitrary order, which results in performance gains over BERT and T5 on NLU tasks. Meanwhile, GLM can be pretrained for different types of tasks by varying the number and lengths of blanks. On a wide range of tasks across NLU, conditional generation, and unconditional generation, GLM outperforms BERT, T5, and GPT given the same model sizes and data, and achieves the best performance from a single pretrained model with 1.25× the parameters of BERTLarge, demonstrating its generalizability to different downstream tasks.1

* The first two authors contributed equally.
† Corresponding authors.
1 The code and pre-trained models are available at https://github.com/THUDM/GLM

Figure 1: Illustration of GLM. We blank out text spans (green part) and generate them autoregressively. (Some attention edges are omitted; cf. Figure 2.) The illustrated text is "All NLP tasks are generation tasks".

1 Introduction

Language models pretrained on unlabeled texts have substantially advanced the state of the art in various NLP tasks, ranging from natural language understanding (NLU) to text generation (Radford et al., 2018a; Devlin et al., 2019; Yang et al., 2019; Radford et al., 2018b; Raffel et al., 2020; Lewis et al., 2019; Brown et al., 2020). Downstream task performance as well as the scale of the parameters have also constantly increased in the past few years.

In general, existing pretraining frameworks can be categorized into three families: autoregressive, autoencoding, and encoder-decoder models. Autoregressive models, such as GPT (Radford et al., 2018a), learn left-to-right language models. While they succeed in long-text generation and show few-shot learning ability when scaled to billions of parameters (Radford et al., 2018b; Brown et al., 2020), their inherent disadvantage is the unidirectional attention mechanism, which cannot fully capture the dependencies between context words in NLU tasks. Autoencoding models, such as BERT (Devlin et al., 2019), learn bidirectional context encoders via denoising objectives, e.g. the Masked Language Model (MLM). The encoders produce contextualized representations that suit natural language understanding tasks, but cannot be directly applied to text generation. Encoder-decoder models adopt bidirectional attention for the encoder, unidirectional attention for the decoder, and cross attention between them (Song et al., 2019; Bi et al., 2020; Lewis et al., 2019). They are typically deployed in conditional generation tasks, such as text summarization and response generation.2 T5 (Raffel et al., 2020) unifies NLU and conditional generation via encoder-decoder models but requires more parameters to match the performance of BERT-based models such as RoBERTa (Liu et al., 2019) and DeBERTa (He et al., 2021).

2 Unconditional generation refers to generating text as a language model without finetuning, while conditional generation refers to sequence-to-sequence tasks.

None of these pretraining frameworks is flexible enough to perform competitively across all NLP tasks. Previous works have tried to unify different frameworks by combining their objectives via multi-task learning (Dong et al., 2019; Bao et al., 2020). However, since the autoencoding and autoregressive objectives differ by nature, a simple unification cannot fully inherit the advantages of both frameworks.

In this paper, we propose a pretraining framework named GLM (General Language Model), based on autoregressive blank infilling. We randomly blank out continuous spans of tokens from the input text, following the idea of autoencoding, and train the model to sequentially reconstruct the spans, following the idea of autoregressive pretraining (see Figure 1). While blank filling has been used in T5 (Raffel et al., 2020) for text-to-text pretraining, we propose two improvements, namely span shuffling and 2D positional encoding. Empirically, we show that with the same amount of parameters and computational cost, GLM significantly outperforms BERT on the SuperGLUE benchmark by a large margin of 4.6%–5.0% and outperforms RoBERTa and BART when pretrained on a corpus of similar size (158GB). GLM also significantly outperforms T5 on NLU and generation tasks with fewer parameters and data.

Inspired by Pattern-Exploiting Training (PET) (Schick and Schütze, 2020a), we reformulate NLU tasks as manually-crafted cloze questions that mimic human language. Different from the BERT-based models used by PET, GLM can naturally handle multi-token answers to the cloze question via autoregressive blank filling.

Furthermore, we show that by varying the number and lengths of missing spans, the autoregressive blank filling objective can pretrain language models for conditional and unconditional generation. Through multi-task learning of different pretraining objectives, a single GLM can excel in both NLU and (conditional and unconditional) text generation. Empirically, compared with standalone baselines, GLM with multi-task pretraining achieves improvements in NLU, conditional text generation, and language modeling tasks altogether by sharing the parameters.

2 GLM Pretraining Framework

We propose a general pretraining framework GLM based on a novel autoregressive blank infilling objective. GLM formulates NLU tasks as cloze questions that contain task descriptions, which can be answered by autoregressive generation.

2.1 Pretraining Objective

2.1.1 Autoregressive Blank Infilling

GLM is trained by optimizing an autoregressive blank infilling objective. Given an input text x = [x_1, ..., x_n], multiple text spans {s_1, ..., s_m} are sampled, where each span s_i corresponds to a series of consecutive tokens [s_{i,1}, ..., s_{i,l_i}] in x. Each span is replaced with a single [MASK] token, forming a corrupted text x_corrupt. The model predicts the missing tokens in the spans from the corrupted text in an autoregressive manner, which means that when predicting the missing tokens in a span, the model has access to the corrupted text and the previously predicted spans. To fully capture the interdependencies between different spans, we randomly permute the order of the spans, similar to the permutation language model (Yang et al., 2019). Formally, let Z_m be the set of all possible permutations of the length-m index sequence [1, 2, ..., m], and let s_{z_{<i}} be [s_{z_1}, ..., s_{z_{i-1}}]; we define the pretraining objective as

\max_\theta \; \mathbb{E}_{z \sim Z_m} \Big[ \sum_{i=1}^{m} \log p_\theta(s_{z_i} \mid x_{\text{corrupt}}, s_{z_{<i}}) \Big]    (1)

We always generate the tokens in each blank following a left-to-right order, i.e. the probability of generating the span s_i is factorized as

p_\theta(s_i \mid x_{\text{corrupt}}, s_{z_{<i}}) = \prod_{j=1}^{l_i} p(s_{i,j} \mid x_{\text{corrupt}}, s_{z_{<i}}, s_{i,<j})    (2)

We implement the autoregressive blank infilling objective with the following techniques. The input x is divided into two parts: Part A is the corrupted text x_corrupt, and Part B consists of the masked spans. Part A tokens can attend to each other, but cannot attend to any tokens in B. Part B tokens can attend to Part A and to antecedents in B, but cannot attend to any subsequent tokens in B. To enable autoregressive generation, each span is padded with the special tokens [START] and [END], for input and output respectively. In this way, our model automatically learns a bidirectional encoder (for Part A) and a unidirectional decoder (for Part B) in a unified model. The implementation of GLM is illustrated in Figure 2.
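As a concrete illustration of this input construction, the following is a minimal sketch under our own naming (the function and the token strings "[MASK]", "[START]", "[END]" stand in for the tokenizer's special ids); it is not the authors' released implementation.

import random

def build_glm_example(tokens, spans):
    """Assemble Part A / Part B for autoregressive blank infilling.

    tokens: list of tokens, e.g. ["x1", ..., "x6"]
    spans:  list of (start, end) index pairs, end exclusive, non-overlapping
    Returns the packed input sequence and the autoregressive targets for Part B.
    """
    # Part A: replace every sampled span with a single [MASK] token.
    part_a, cursor = [], 0
    for start, end in sorted(spans):
        part_a.extend(tokens[cursor:start])
        part_a.append("[MASK]")
        cursor = end
    part_a.extend(tokens[cursor:])

    # Part B: shuffle the spans to capture inter-span dependencies, then
    # prepend [START] on the input side and append [END] on the target side.
    shuffled = random.sample(spans, len(spans))
    part_b_input, part_b_target = [], []
    for start, end in shuffled:
        span_tokens = tokens[start:end]
        part_b_input += ["[START]"] + span_tokens
        part_b_target += span_tokens + ["[END]"]

    return part_a + part_b_input, part_b_target

tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
spans = [(2, 3), (4, 6)]          # mask x3 and [x5, x6], as in Figure 2
inputs, targets = build_glm_example(tokens, spans)
print(inputs)   # Part A with [MASK]s followed by Part B with [START]s
print(targets)  # teacher-forcing targets aligned with the Part B positions

Each Part B position is supervised to predict the next token of its span, so the span tokens and the closing [END] become the targets for [START] plus the span tokens.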
Figure 2: GLM pretraining. (a) The original text is [x1, x2, x3, x4, x5, x6]. Two spans [x3] and [x5, x6] are sampled. (b) Replace the sampled spans with [M] in Part A, and shuffle the spans in Part B. (c) GLM autoregressively generates Part B. Each span is prepended with [S] as input and appended with [E] as output. 2D positional encoding represents inter- and intra-span positions. (d) Self-attention mask. Grey areas are masked out. Part A tokens can attend to themselves (blue frame) but not to B. Part B tokens can attend to A and to their antecedents in B (yellow and green frames correspond to the two spans). [M] := [MASK], [S] := [START], and [E] := [END].

We randomly sample spans with lengths drawn from a Poisson distribution with λ = 3. We repeatedly sample new spans until at least 15% of the original tokens are masked. Empirically, we have found that the 15% ratio is critical for good performance on downstream NLU tasks.

2.1.2 Multi-Task Pretraining

In the previous section, GLM masks short spans and is suited for NLU tasks. However, we are interested in pretraining a single model that can handle both NLU and text generation. We therefore study a multi-task pretraining setup, in which a second objective of generating longer text is jointly optimized with the blank infilling objective. We consider the following two objectives:

• Document-level. We sample a single span whose length is sampled from a uniform distribution over 50%–100% of the original length. This objective aims for long text generation.

• Sentence-level. We restrict the masked spans to be full sentences. Multiple spans (sentences) are sampled to cover 15% of the original tokens. This objective aims for seq2seq tasks whose predictions are often complete sentences or paragraphs.

Both new objectives are defined in the same way as the original objective, i.e. Eq. (1). The only difference is the number of spans and the span lengths.

2.2 Model Architecture

GLM uses a single Transformer with several modifications to the architecture: (1) we rearrange the order of layer normalization and the residual connection, which has been shown to be critical for large-scale language models to avoid numerical errors (Shoeybi et al., 2019); (2) we use a single linear layer for the output token prediction; (3) we replace ReLU activation functions with GeLUs (Hendrycks and Gimpel, 2016).

2.2.1 2D Positional Encoding

One of the challenges of the autoregressive blank infilling task is how to encode the positional information. Transformers rely on positional encodings to inject the absolute and relative positions of the tokens. We propose 2D positional encodings to address the challenge. Specifically, each token is encoded with two positional ids. The first positional id represents the position in the corrupted text x_corrupt. For the masked spans, it is the position of the corresponding [MASK] token. The second positional id represents the intra-span position. For tokens in Part A, their second positional ids are 0. For tokens in Part B, they range from 1 to the length of the span. The two positional ids are projected into two vectors via learnable embedding tables, which are both added to the input token embeddings.

Our encoding method ensures that the model is not aware of the length of the masked span when reconstructing it. This is an important difference compared to other models. For example, XLNet (Yang et al., 2019) encodes the original position so that it can perceive the number of missing tokens, and SpanBERT (Joshi et al., 2020) replaces the span with multiple [MASK] tokens and keeps the length unchanged. Our design fits downstream tasks, as the length of the generated text is usually unknown beforehand.
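The following sketch (our own illustration using the toy example of Figure 2, not the released code) shows one way to derive the two positional-id sequences and the self-attention mask for a packed [Part A; Part B] sequence. The first id points into the corrupted text (span tokens reuse their [MASK] position), the second id counts within each span, and the mask lets Part A attend bidirectionally to itself while Part B attends to Part A and to its own antecedents.

import numpy as np

def glm_positions_and_mask(len_a, span_lengths, mask_positions):
    """Build 2D position ids and the self-attention mask for one example.

    len_a:          length of Part A (the corrupted text, [MASK]s included)
    span_lengths:   length of each Part B span, in generation (shuffled) order
    mask_positions: 0-indexed position in Part A of each span's [MASK] token
    Part B here includes the [START] token prepended to every span.
    """
    # First dimension: position in the corrupted text (1-indexed as in Figure 2).
    pos1 = list(range(1, len_a + 1))
    # Second dimension: 0 for Part A, counting upward inside each span.
    pos2 = [0] * len_a
    for length, mask_pos in zip(span_lengths, mask_positions):
        pos1 += [mask_pos + 1] * (length + 1)        # [START] + span tokens
        pos2 += list(range(1, length + 2))

    total = len(pos1)
    mask = np.zeros((total, total), dtype=bool)      # True = query may attend to key
    mask[:, :len_a] = True                           # every token sees Part A; A never sees B
    for i in range(len_a, total):                    # Part B: causal over generated tokens
        mask[i, len_a:i + 1] = True
    return pos1, pos2, mask

# Toy example from Figure 2: Part A = [x1, x2, [M], x4, [M]],
# generated in the shuffled order ([x5, x6] first, then [x3]).
pos1, pos2, mask = glm_positions_and_mask(len_a=5, span_lengths=[2, 1],
                                          mask_positions=[4, 2])
print(pos1)  # [1, 2, 3, 4, 5, 5, 5, 5, 3, 3]
print(pos2)  # [0, 0, 0, 0, 0, 1, 2, 3, 1, 2]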
Figure 3: Formulation of the sentiment classification task as blank infilling with GLM. (Example input: "Coronet has the best lines of all day cruisers. It is really [MASK]"; the verbalizer "good" maps to the label Positive, and p(y|x) is computed from p(v(y)|c(x)).)

2.3 Finetuning GLM

Typically, for downstream NLU tasks, a linear classifier takes the representations of sequences or tokens produced by pretrained models as input and predicts the correct labels. This practice differs from the generative pretraining task, leading to inconsistency between pretraining and finetuning. Instead, we reformulate NLU classification tasks as generation tasks of blank infilling, following PET (Schick and Schütze, 2020a). Specifically, given a labeled example (x, y), we convert the input text x to a cloze question c(x) via a pattern containing a single mask token. The pattern is written in natural language to represent the semantics of the task. For example, a sentiment classification task can be formulated as "{SENTENCE}. It's really [MASK]". The candidate labels y ∈ Y are also mapped to answers to the cloze, called the verbalizer v(y). In sentiment classification, the labels "positive" and "negative" are mapped to the words "good" and "bad". The conditional probability of predicting y given x is

p(y \mid x) = \frac{p(v(y) \mid c(x))}{\sum_{y' \in \mathcal{Y}} p(v(y') \mid c(x))}    (3)

where Y is the label set. Therefore the probability of the sentence being positive or negative is proportional to predicting "good" or "bad" in the blank. We then finetune GLM with a cross-entropy loss (see Figure 3).

For text generation tasks, the given context constitutes Part A of the input, with a mask token appended at the end. The model generates the text of Part B autoregressively. We can directly apply the pretrained GLM for unconditional generation, or finetune it on downstream conditional generation tasks.
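A minimal sketch of the cloze-style finetuning loss in Eq. (3): it assumes a hypothetical helper log_prob(cloze, answer) that returns the model's log-probability of generating the answer in the blank, and the pattern and verbalizer of the sentiment example above; none of these names are part of the released API.

import math

def cloze_finetune_loss(log_prob, sentence, label, verbalizer):
    """Cross-entropy over verbalized labels, following Eq. (3).

    log_prob(cloze, answer) -> log p(answer | cloze) under the pretrained GLM
    verbalizer: mapping from each label to its answer word, e.g.
                {"positive": "good", "negative": "bad"}
    """
    cloze = f"{sentence}. It's really [MASK]"           # the pattern c(x)
    scores = {y: log_prob(cloze, v) for y, v in verbalizer.items()}
    # log p(y|x) = log p(v(y)|c(x)) - log sum_y' p(v(y')|c(x))
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return -(scores[label] - log_z)                      # cross-entropy loss

# Example with a stand-in scoring function (a real GLM would supply log_prob).
fake_log_prob = lambda cloze, answer: {"good": -0.3, "bad": -2.0}[answer]
loss = cloze_finetune_loss(fake_log_prob,
                           "Coronet has the best lines of all day cruisers",
                           "positive", {"positive": "good", "negative": "bad"})
print(round(loss, 4))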
2.4 Discussion and Analysis

In this section, we discuss the differences between GLM and other pretraining models. We are mainly concerned with how they can be adapted to downstream blank infilling tasks.

Comparison with BERT (Devlin et al., 2019). As pointed out by Yang et al. (2019), BERT fails to capture the interdependencies of masked tokens due to the independence assumption of MLM. Another disadvantage of BERT is that it cannot fill in blanks of multiple tokens properly. To infer the probability of an answer of length l, BERT needs to perform l consecutive predictions. If the length l is unknown, we may need to enumerate all possible lengths, since BERT needs to change the number of [MASK] tokens according to the length.

Comparison with XLNet (Yang et al., 2019). Both GLM and XLNet are pretrained with autoregressive objectives, but there are two differences between them. First, XLNet uses the original position encodings before corruption. During inference, we need to either know or enumerate the length of the answer, the same problem as BERT. Second, XLNet uses a two-stream self-attention mechanism, instead of the right-shift, to avoid the information leak within the Transformer, which doubles the time cost of pretraining.

Comparison with T5 (Raffel et al., 2020). T5 proposes a similar blank infilling objective to pretrain an encoder-decoder Transformer. T5 uses independent positional encodings for the encoder and decoder, and relies on multiple sentinel tokens to differentiate the masked spans. In downstream tasks, only one of the sentinel tokens is used, leading to a waste of model capacity and inconsistency between pretraining and finetuning. Moreover, T5 always predicts spans in a fixed left-to-right order. As a result, GLM can significantly outperform T5 on NLU and seq2seq tasks with fewer parameters and data, as stated in Sections 3.2 and 3.3.

Comparison with UniLM (Dong et al., 2019). UniLM combines different pretraining objectives under the autoencoding framework by changing the attention mask among bidirectional, unidirectional, and cross attention. However, UniLM always replaces masked spans with [MASK] tokens, which limits its ability to model the dependencies between the masked spans and their context. GLM feeds in the previous token and autoregressively generates the next token. Finetuning UniLM on downstream generation tasks also relies on masked language modeling, which is less efficient. UniLMv2 (Bao et al., 2020) adopts partially autoregressive modeling for generation tasks, along with the autoencoding objective for NLU tasks. Instead, GLM unifies NLU and generation tasks with autoregressive pretraining.

3 Experiments

We now describe our pretraining setup and the evaluation on downstream tasks.

3.1 Pretraining Setup

For a fair comparison with BERT (Devlin et al., 2019), we use BookCorpus (Zhu et al., 2015) and English Wikipedia as our pretraining data. We use the uncased WordPiece tokenizer of BERT with a 30k vocabulary. We train GLMBase and GLMLarge with the same architectures as BERTBase and BERTLarge, containing 110M and 340M parameters respectively.

For multi-task pretraining, we train two Large-sized models with a mixture of the blank infilling objective and the document-level or sentence-level objective, denoted as GLMDoc and GLMSent. Additionally, we train two larger GLM models of 410M (30 layers, hidden size 1024, and 16 attention heads) and 515M (30 layers, hidden size 1152, and 18 attention heads) parameters with document-level multi-task pretraining, denoted as GLM410M and GLM515M.

To compare with SOTA models, we also train a Large-sized model with the same data, tokenization, and hyperparameters as RoBERTa (Liu et al., 2019), denoted as GLMRoBERTa. Due to resource limitations, we only pretrain the model for 250,000 steps, which is half of RoBERTa and BART's training steps and close to T5 in the number of trained tokens. More experiment details can be found in Appendix A.

3.2 SuperGLUE

To evaluate our pretrained GLM models, we conduct experiments on the SuperGLUE benchmark (Wang et al., 2019) and report the standard metrics. SuperGLUE consists of 8 challenging NLU tasks. We reformulate the classification tasks as blank infilling with human-crafted cloze questions, following PET (Schick and Schütze, 2020b). We then finetune the pretrained GLM models on each task as described in Section 2.3. The cloze questions and other details can be found in Appendix B.1.

For a fair comparison with GLMBase and GLMLarge, we choose BERTBase and BERTLarge as our baselines, which are pretrained on the same corpus and for a similar amount of time. We report the performance of standard finetuning (i.e. classification on the [CLS] token representation). The performance of BERT with cloze questions is reported in Section 3.4. To compare with GLMRoBERTa, we choose T5, BARTLarge, and RoBERTaLarge as our baselines. T5 has no direct match in the number of parameters to BERTLarge, so we present the results of both T5Base (220M parameters) and T5Large (770M parameters). All the other baselines are of similar size to BERTLarge.

Table 1 shows the results. With the same amount of training data, GLM consistently outperforms BERT on most tasks with either the base or the large architecture. The only exception is WiC (word sense disambiguation). On average, GLMBase scores 4.6% higher than BERTBase, and GLMLarge scores 5.0% higher than BERTLarge. This clearly demonstrates the advantage of our method on NLU tasks. In the setting of RoBERTaLarge, GLMRoBERTa can still achieve improvements over the baselines, but with a smaller margin. Specifically, GLMRoBERTa outperforms T5Large but is only half its size. We also find that BART does not perform well on the challenging SuperGLUE benchmark. We conjecture this can be attributed to the low parameter efficiency of the encoder-decoder architecture and the denoising sequence-to-sequence objective.
Table 1: Results on the SuperGLUE dev set.

Model | ReCoRD (F1/Acc.) | COPA (Acc.) | WSC (Acc.) | RTE (Acc.) | BoolQ (Acc.) | WiC (Acc.) | CB (F1/Acc.) | MultiRC (F1a/EM) | Avg
Pretrained on BookCorpus and Wikipedia
BERTBase 65.4 / 64.9 66.0 65.4 70.0 74.9 68.8 70.9 / 76.8 68.4 / 21.5 66.1
GLMBase 73.5 / 72.8 71.0 72.1 71.2 77.0 64.7 89.5 / 85.7 72.1 / 26.1 70.7
BERTLarge 76.3 / 75.6 69.0 64.4 73.6 80.1 71.0 94.8 / 92.9 71.9 / 24.1 72.0
UniLMLarge 80.0 / 79.1 72.0 65.4 76.5 80.5 69.7 91.0 / 91.1 77.2 / 38.2 74.1
GLMLarge 81.7 / 81.1 76.0 81.7 74.0 82.1 68.5 96.1 / 94.6 77.1 / 36.3 77.0
GLMDoc 80.2 / 79.6 77.0 78.8 76.2 79.8 63.6 97.3 / 96.4 74.6 / 32.1 75.7
GLMSent 80.7 / 80.2 77.0 79.8 79.1 80.8 70.4 94.6 / 93.7 76.9 / 36.1 76.8
GLM410M 81.5 / 80.9 80.0 81.7 79.4 81.9 69.0 93.2 / 96.4 76.2 / 35.5 78.0
GLM515M 82.3 / 81.7 85.0 81.7 79.1 81.3 69.4 95.0 / 96.4 77.2 / 35.0 78.8
Pretrained on larger corpora
T5Base 76.2 / 75.4 73.0 79.8 78.3 80.8 67.9 94.8 / 92.9 76.4 / 40.0 76.0
T5Large 85.7 / 85.0 78.0 84.6 84.8 84.3 71.6 96.4 / 98.2 80.9 / 46.6 81.2
BARTLarge 88.3 / 87.8 60.0 65.4 84.5 84.3 69.0 90.5 / 92.9 81.8 / 48.0 76.0
RoBERTaLarge 89.0 / 88.4 90.0 63.5 87.0 86.1 72.6 96.1 / 94.6 84.4 / 52.9 81.5
GLMRoBERTa 89.6 / 89.0 82.0 83.7 87.7 84.7 71.2 98.7 / 98.2 82.4 / 50.1 82.9

Table 2: Results of abstractive summarization on the CNN/DailyMail and XSum test sets.

Model | CNN/DailyMail: RG-1 RG-2 RG-L | XSum: RG-1 RG-2 RG-L
BERTSumAbs (Liu and Lapata, 2019) 41.7 19.4 38.8 38.8 16.3 31.2
UniLMv2Base (Bao et al., 2020) 43.2 20.4 40.1 44.0 21.1 36.1
T5Large (Raffel et al., 2020) 42.5 20.7 39.8 40.9 17.3 33.0
BARTLarge (Lewis et al., 2019) 44.2 21.3 40.9 45.1 22.3 37.3
GLMRoBERTa 43.8 21.0 40.5 45.5 23.5 37.3

3.3 Multi-Task Pretraining

We then evaluate GLM's performance in a multi-task setting (Section 2.1). Within one training batch, we sample short spans and longer spans (document-level or sentence-level) with equal chances. We evaluate the multi-task model on NLU, seq2seq, blank infilling, and zero-shot language modeling.

SuperGLUE. For NLU tasks, we evaluate models on the SuperGLUE benchmark. The results are also shown in Table 1. We observe that with multi-task pretraining, GLMDoc and GLMSent perform slightly worse than GLMLarge, but still outperform BERTLarge and UniLMLarge. Among the multi-task models, GLMSent outperforms GLMDoc by 1.1% on average. Increasing GLMDoc's parameters to 410M (1.25× BERTLarge) leads to better performance than GLMLarge. GLM with 515M parameters (1.5× BERTLarge) performs even better.

Sequence-to-Sequence. Considering the available baseline results, we use the Gigaword dataset (Rush et al., 2015) for abstractive summarization and the SQuAD 1.1 dataset (Rajpurkar et al., 2016) for question generation (Du et al., 2017) as the benchmarks for models pretrained on BookCorpus and Wikipedia. Additionally, we use the CNN/DailyMail (See et al., 2017) and XSum (Narayan et al., 2018) datasets for abstractive summarization as the benchmarks for models pretrained on larger corpora.

The results for models trained on BookCorpus and Wikipedia are shown in Tables 3 and 4. We observe that GLMLarge can achieve performance matching the other pretraining models on the two generation tasks. GLMSent performs better than GLMLarge, while GLMDoc performs slightly worse than GLMLarge. This indicates that the document-level objective, which teaches the model to extend the given contexts, is less helpful for conditional generation, which aims to extract useful information from the context. Increasing GLMDoc's parameters to 410M leads to the best performance on both tasks. The results for models trained on larger corpora are shown in Table 2. GLMRoBERTa can achieve performance matching the seq2seq BART model, and outperforms T5 and UniLMv2.
Table 3: Results on Gigaword summarization.

Model RG-1 RG-2 RG-L
MASS 37.7 18.5 34.9
UniLMLarge 38.5 19.5 35.8
GLMLarge 38.6 19.7 36.0
GLMDoc 38.5 19.4 35.8
GLMSent 38.9 20.0 36.3
GLM410M 38.9 20.0 36.2

Table 4: Results on SQuAD question generation.

Model BLEU-4 MTR RG-L
SemQG 18.4 22.7 46.7
UniLMLarge 22.1 25.1 51.1
GLMLarge 22.4 25.2 50.4
GLMDoc 22.3 25.0 50.2
GLMSent 22.6 25.4 50.4
GLM410M 22.9 25.6 50.5

Table 5: BLEU scores on Yahoo text infilling. † indicates results from (Shen et al., 2020).

Mask ratio 10% 20% 30% 40% 50%
BERT† 82.8 66.3 50.3 37.4 26.2
BLM† 86.5 73.2 59.6 46.8 34.8
GLMLarge 87.8 76.7 64.2 48.9 38.7
GLMDoc 87.5 76.0 63.2 47.9 37.6

Figure 4: Zero-shot language modeling results. (Top panel: perplexity on the Books&Wiki test set; bottom panel: LAMBADA accuracy; each under unidirectional and bidirectional context encoding, for GLMDoc, GLMDoc – 2D, GLM410M, GLM515M, and GPTLarge.)

Text Infilling. Text infilling is the task of predicting missing spans of text that are consistent with the surrounding context (Zhu et al., 2019; Donahue et al., 2020; Shen et al., 2020). GLM is trained with an autoregressive blank infilling objective and can thus solve this task straightforwardly. We evaluate GLM on the Yahoo Answers dataset (Yang et al., 2017) and compare it with the Blank Language Model (BLM) (Shen et al., 2020), a model specifically designed for text infilling. From the results in Table 5, GLM outperforms previous methods by large margins (1.3 to 3.9 BLEU) and achieves the state-of-the-art result on this dataset. We notice that GLMDoc slightly underperforms GLMLarge, which is consistent with our observations in the seq2seq experiments.

Language Modeling. Most language modeling datasets such as WikiText103 are constructed from Wikipedia documents, which our pretraining dataset already contains. Therefore, we evaluate the language modeling perplexity on a held-out test set of our pretraining dataset, which contains about 20M tokens, denoted as BookWiki. We also evaluate GLM on the LAMBADA dataset (Paperno et al., 2016), which tests the ability of systems to model long-range dependencies in text. The task is to predict the final word of a passage. As the baseline, we train a GPTLarge model (Radford et al., 2018b; Brown et al., 2020) with the same data and tokenization as GLMLarge.

The results are shown in Figure 4. All the models are evaluated in the zero-shot setting. Since GLM learns bidirectional attention, we also evaluate GLM under the setting in which the contexts are encoded with bidirectional attention. Without a generative objective during pretraining, GLMLarge cannot complete the language modeling tasks, with perplexity larger than 100. With the same amount of parameters, GLMDoc performs worse than GPTLarge. This is expected since GLMDoc also optimizes the blank infilling objective. Increasing the model's parameters to 410M (1.25× of GPTLarge) leads to a performance close to GPTLarge. GLM515M (1.5× of GPTLarge) can further outperform GPTLarge. With the same amount of parameters, encoding the context with bidirectional attention improves the performance of language modeling. Under this setting, GLM410M outperforms GPTLarge. This is the advantage of GLM over unidirectional GPT. We also study the contribution of 2D positional encoding to long text generation. We find that removing the 2D positional encoding leads to lower accuracy and higher perplexity in language modeling.
Table 6: Ablation study on the SuperGLUE dev set. (T5 ≈ GLM – shuffle spans + sentinel tokens.)

Model | ReCoRD (F1/Acc.) | COPA (Acc.) | WSC (Acc.) | RTE (Acc.) | BoolQ (Acc.) | WiC (Acc.) | CB (F1/Acc.) | MultiRC (F1a/EM) | Avg
BERTLarge 76.3 / 75.6 69.0 64.4 73.6 80.1 71.0 94.8 / 92.9 71.9 / 24.1 72.0
BERTLarge (reproduced) 82.1 / 81.5 63.0 63.5 72.2 80.8 68.7 80.9 / 85.7 77.0 / 35.2 71.2
BERTLarge (cloze) 70.0 / 69.4 80.0 76.0 72.6 78.1 70.5 93.5 / 91.1 70.0 / 23.1 73.2
GLMLarge 81.7 / 81.1 76.0 81.7 74.0 82.1 68.5 96.1 / 94.6 77.1 / 36.3 77.0
– cloze finetune 81.3 / 80.6 62.0 63.5 66.8 80.5 65.0 89.2 / 91.1 72.3 / 27.9 70.0
– shuffle spans 82.0 / 81.4 61.0 79.8 54.5 65.8 56.3 90.5 / 92.9 76.7 / 37.6 68.5
+ sentinel tokens 81.8 / 81.3 69.0 78.8 77.3 81.2 68.0 93.7 / 94.6 77.5 / 37.7 76.0

Summary. Above all, we conclude that GLM effectively shares model parameters across natural language understanding and generation tasks, achieving better performance than a standalone BERT, encoder-decoder, or GPT model.

3.4 Ablation Study

Table 6 shows our ablation analysis for GLM. First, to provide an apples-to-apples comparison with BERT, we train a BERTLarge model with our implementation, data, and hyperparameters (row 2). Its performance is slightly worse than the official BERTLarge and significantly worse than GLMLarge, which confirms the superiority of GLM over masked LM pretraining on NLU tasks. Second, we show the SuperGLUE performance of GLM finetuned as a sequence classifier (row 5) and of BERT with cloze-style finetuning (row 3). Compared to BERT with cloze-style finetuning, GLM benefits from the autoregressive pretraining. Especially on ReCoRD and WSC, where the verbalizer consists of multiple tokens, GLM consistently outperforms BERT. This demonstrates GLM's advantage in handling variable-length blanks. Another observation is that the cloze formulation is critical for GLM's performance on NLU tasks. For the large model, cloze-style finetuning improves the performance by 7 points. Finally, we compare GLM variants with different pretraining designs to understand their importance. Row 6 shows that removing the span shuffling (always predicting the masked spans from left to right) leads to a severe performance drop on SuperGLUE. Row 7 uses different sentinel tokens instead of a single [MASK] token to represent different masked spans. The model performs worse than the standard GLM. We hypothesize that it wastes some modeling capacity to learn the different sentinel tokens, which are not used in downstream tasks with only one blank. In Figure 4, we show that removing the second dimension of the 2D positional encoding hurts the performance of long text generation.

We note that T5 is pretrained with a similar blank infilling objective. GLM differs in three aspects: (1) GLM consists of a single encoder, (2) GLM shuffles the masked spans, and (3) GLM uses a single [MASK] instead of multiple sentinel tokens. While we cannot directly compare GLM with T5 due to the differences in training data and the number of parameters, the results in Tables 1 and 6 demonstrate the advantage of GLM.

4 Related Work

Pretrained Language Models. Pretraining large-scale language models significantly improves the performance of downstream tasks. There are three types of pretrained models. First, autoencoding models learn a bidirectional contextualized encoder for natural language understanding via denoising objectives (Devlin et al., 2019; Joshi et al., 2020; Yang et al., 2019; Liu et al., 2019; Lan et al., 2020; Clark et al., 2020). Second, autoregressive models are trained with a left-to-right language modeling objective (Radford et al., 2018a,b; Brown et al., 2020). Third, encoder-decoder models are pretrained for sequence-to-sequence tasks (Song et al., 2019; Lewis et al., 2019; Bi et al., 2020; Zhang et al., 2020).

Among encoder-decoder models, BART (Lewis et al., 2019) conducts NLU tasks by feeding the same input into the encoder and decoder and taking the final hidden states of the decoder. Instead, T5 (Raffel et al., 2020) formulates most language tasks in the text-to-text framework. However, both models require more parameters to outperform autoencoding models such as RoBERTa (Liu et al., 2019). UniLM (Dong et al., 2019; Bao et al., 2020) unifies three pretraining models under the masked language modeling objective with different attention masks.

NLU as Generation. Previously, pretrained language models completed classification tasks for NLU with linear classifiers on the learned representations. GPT-2 (Radford et al., 2018b) and GPT-3 (Brown et al., 2020) show that generative language models can complete NLU tasks such as question answering by directly predicting the correct answers without finetuning, given task instructions or a few labeled examples. However, generative models require many more parameters to work due to the limit of unidirectional attention. Recently, PET (Schick and Schütze, 2020a,b) proposes to reformulate input examples as cloze questions with patterns similar to the pretraining corpus in the few-shot setting. It has been shown that, combined with gradient-based finetuning, PET can achieve better performance in the few-shot setting than GPT-3 while requiring only 0.1% of its parameters. Similarly, Athiwaratkun et al. (2020) and Paolini et al. (2020) convert structured prediction tasks, such as sequence tagging and relation extraction, to sequence generation tasks.

Blank Language Modeling. Donahue et al. (2020) and Shen et al. (2020) also study blank infilling models. Different from their work, we pretrain language models with blank infilling objectives and evaluate their performance on downstream NLU and generation tasks.

5 Conclusions

GLM is a general pretraining framework for natural language understanding and generation. We show that NLU tasks can be formulated as conditional generation tasks, and are therefore solvable by autoregressive models. GLM unifies the pretraining objectives for different tasks as autoregressive blank infilling, with mixed attention masks and the novel 2D positional encodings. Empirically, we show that GLM outperforms previous methods on NLU tasks and can effectively share parameters across different tasks.

Acknowledgements

The work is supported by the NSFC for Distinguished Young Scholar (61825602) and Beijing Academy of Artificial Intelligence (BAAI).

References

Ben Athiwaratkun, Cicero dos Santos, Jason Krone, and Bing Xiang. 2020. Augmented natural language for generative sequence labeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 375–385.

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, and Hsiao-Wuen Hon. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training. In ICML 2020, volume 119, pages 642–652.

Bin Bi, Chenliang Li, Chen Wu, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, and Luo Si. 2020. PALM: Pre-training an Autoencoding & Autoregressive Language Model for Context-conditioned Generation. In EMNLP 2020, pages 8681–8691.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NeurIPS 2020.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In ICLR 2020.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL 2019, pages 4171–4186.

Chris Donahue, Mina Lee, and Percy Liang. 2020. Enabling language models to fill in the blanks. pages 2492–2501.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In NeurIPS 2019, pages 13042–13054.
Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to Ask: Neural Question Generation for Reading Comprehension. In ACL 2017, pages 1342–1352.

Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with disentangled attention. ArXiv, abs/2006.03654.

Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguistics, 8:64–77.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In ICLR 2020.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL 2020, pages 7871–7880.

Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. pages 74–81.

Yang Liu and Mirella Lapata. 2019. Text Summarization with Pretrained Encoders. In EMNLP 2019, pages 3730–3740.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R. Trippas, J. Shane Culpepper, and Alistair Moffat. 2020. CC-News-En: A Large English News Corpus. In CIKM 2020, pages 3077–3084.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In EMNLP 2018, pages 1797–1807.

Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cicero Nogueira dos Santos, Bing Xiang, and Stefano Soatto. 2020. Structured Prediction as Translation between Augmented Natural Languages.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In ACL 2016.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A Method for Automatic Evaluation of Machine Translation. In ACL 2002, pages 311–318.

Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. In ICLR 2017, Workshop Track Proceedings.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018a. Improving Language Understanding by Generative Pre-Training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018b. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In ACL 2018, pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP 2016, pages 2383–2392.

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD 2020, pages 3505–3506.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP 2015, pages 379–389.

Timo Schick and Hinrich Schütze. 2020a. Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference. pages 255–269.

Timo Schick and Hinrich Schütze. 2020b. It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners. pages 2339–2352.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In ACL 2017, pages 1073–1083.
Tianxiao Shen, Victor Quach, Regina Barzilay, and Tommi S. Jaakkola. 2020. Blank language models. pages 5186–5198.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In EMNLP 2013, pages 1631–1642.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked Sequence to Sequence Pre-training for Language Generation. In ICML 2019, volume 97, pages 5926–5936.

Trieu H. Trinh and Quoc V. Le. 2019. A Simple Method for Commonsense Reasoning. arXiv:1806.02847 [cs].

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In NeurIPS 2019, pages 3261–3275.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In ICLR 2019, pages 353–355.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In NAACL 2018, pages 1112–1122.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS 2019, pages 5754–5764.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. In ICML 2017, volume 70, pages 3881–3890.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In ICML 2020, pages 11328–11339.

Wanrong Zhu, Zhiting Hu, and Eric Xing. 2019. Text infilling. arXiv preprint arXiv:1901.00158.

Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV 2015, pages 19–27.

A Pretraining Setting

A.1 Datasets

To train GLMBase and GLMLarge, we use BookCorpus (Zhu et al., 2015) and Wikipedia, as used by BERT (Devlin et al., 2019).

To train GLMRoBERTa, we follow the pretraining datasets of RoBERTa (Liu et al., 2019), which consist of BookCorpus (Zhu et al., 2015) and Wikipedia (16GB), CC-News (the English portion of the CommonCrawl News dataset3, 76GB), OpenWebText (web content extracted from URLs shared on Reddit with at least three upvotes (Gokaslan and Cohen, 2019), 38GB), and Stories (a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas (Trinh and Le, 2019), 31GB). The Stories dataset is no longer publicly available4. Therefore, we remove the Stories dataset and replace OpenWebText with OpenWebText25 (66GB). The CC-News dataset is not publicly available, and we use the CC-News-En corpus published by Mackenzie et al. (2020). All the datasets used total 158GB of uncompressed text, close in size to RoBERTa's 160GB of datasets.

3 https://commoncrawl.org/2016/10/news-dataset-available
4 https://github.com/tensorflow/models/tree/archive/research/lm_commonsense#1-download-data-files
5 https://openwebtext2.readthedocs.io/en/latest
Table 7: Hyperparameters for pretraining

Hyperparameters GLM Base GLM Large GLM RoBERTa


Number of Layers 12 24 24
Hidden size 768 1024 1024
FFN inner hidden size 3072 4096 4096
Attention heads 12 16 16
Attention head size 64 64 64
Dropout 0.1 0.1 0.1
Attention Dropout 0.1 0.1 0.1
Warmup Steps 6k 8k 30K
Peak Learning Rate 4e-4 2e-4 4e-4
Batch Size 1024 1024 8192
Weight Decay 0.1 0.1 0.01
Max Steps 120k 200k 250k
Learning Rate Decay Cosine Cosine Cosine
Adam ε 1e-6 1e-6 1e-6
Adam β1 0.9 0.9 0.9
Adam β2 0.98 0.98 0.98
Gradient Clipping 1.0 1.0 1.0
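For reference, the GLM Large column of Table 7 could be written as a plain configuration dictionary. This is only a convenience sketch with our own key names, not a configuration file shipped with the code.

glm_large_pretraining_config = {
    # Architecture (same as BERT-Large)
    "num_layers": 24, "hidden_size": 1024, "ffn_hidden_size": 4096,
    "attention_heads": 16, "attention_head_size": 64,
    "dropout": 0.1, "attention_dropout": 0.1,
    # Optimization (Table 7, GLM Large column)
    "warmup_steps": 8_000, "peak_learning_rate": 2e-4, "lr_decay": "cosine",
    "batch_size": 1024, "max_steps": 200_000, "weight_decay": 0.1,
    "adam_eps": 1e-6, "adam_beta1": 0.9, "adam_beta2": 0.98,
    "gradient_clipping": 1.0,
}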

A.2 Hyperparameters

The hyperparameters for GLMBase and GLMLarge are similar to those used by BERT. As a trade-off between training speed and a fair comparison with BERT (batch size 256 and 1,000,000 training steps), we use a batch size of 1024 and 200,000 training steps for GLMLarge. Since GLMBase is smaller, we reduce the number of training steps to 120,000 to speed up pre-training. The hyperparameters for GLMDoc and GLMSent are the same as those of GLMLarge. The hyperparameters for GLM410M and GLM515M, except for the Transformer architecture, are also the same as those of GLMLarge. The models are trained on 64 V100 GPUs for 200K steps with a batch size of 1024 and a maximum sequence length of 512, which takes about 2.5 days for GLMLarge.

To train GLMRoBERTa, we follow most of the hyperparameters of RoBERTa. The main differences are: (1) due to resource limits, we only pretrain GLMRoBERTa for 250,000 steps, which is half of RoBERTa and BART's training steps and close to T5 in the number of trained tokens; (2) we use cosine decay instead of linear decay for learning rate scheduling; (3) we additionally apply gradient clipping with value 1.0.

The hyperparameters for all the pretraining settings are summarized in Table 7.

A.3 Implementation

Our pretraining implementation is based on Megatron-LM (Shoeybi et al., 2019) and DeepSpeed (Rasley et al., 2020). We include our code in the supplementary material. Due to the size limit of the supplementary material, we cannot include the pretrained models, but will make them publicly available in the future.

B Downstream Tasks

B.1 SuperGLUE

The SuperGLUE benchmark consists of 8 NLU tasks. We formulate them as blank infilling tasks, following Schick and Schütze (2020b). Table 8 shows the cloze questions and verbalizers we used in our experiments. For 3 tasks (ReCoRD, COPA, and WSC), the answer may consist of multiple tokens, and for the other 5 tasks, the answer is always a single token.

When finetuning GLM on the SuperGLUE tasks, we construct the input using the cloze questions in Table 8 and replace the blank with a [MASK] token. Then we compute the score of generating each answer candidate. For the 5 single-token tasks, the score is defined to be the logit of the verbalizer token. For the 3 multi-token tasks, we use the sum of the log-probabilities of the verbalizer tokens. Thanks to the autoregressive blank infilling mechanism we propose, we can obtain all the log-probabilities in one pass. Then we compute the cross-entropy loss using the ground-truth label and update the model parameters.

For the baseline classifiers, we follow the standard practice of concatenating the input parts of each task (such as the premise and hypothesis for textual entailment, or the passage, question, and answer for ReCoRD and MultiRC) and adding a classification layer on top of the [CLS] token representation. We also implemented cloze-style finetuning for the other pretrained models, but the performance was usually similar to that of the standard classifier, as shown in the ablation study. Models with blank infilling objectives, such as T5 and our GLM, benefit more from converting the NLU tasks into cloze questions. Thus for T5 and GLM, we report the performance after such conversion in our main results.
Table 8: Cloze questions and verbalizers for the 8 SuperGLUE tasks used in our experiments. ∗ denotes that the answer contains multiple tokens. The blank to be filled is written as ___.

Dataset  Task  Cloze Question  Verbalizers
ReCoRD∗  Question answering  [passage p] [cloze question q]  Answer candidates
COPA∗  Causal reasoning  "[choice c1]" or "[choice c2]"? [premise p], so ___.  c1 / c2
WSC∗  Coreference resolution  [sentence s] The pronoun '∗p∗' refers to ___.  Noun n
RTE  Textual entailment  "[hypothesis h]"? | ___, "[premise p]"  "yes" (entailment), "no" (not entailment)
BoolQ  Question answering  [passage p]. Question: q? Answer: ___.  "yes" / "no"
WiC  Word sense disambiguation  "[sentence s1]" / "[sentence s2]" Similar sense of [word w]? ___.  "yes" / "no"
CB  Textual entailment  "[hypothesis h]"? | ___, "[premise p]"  "yes" (entailment), "no" (contradiction), "maybe" (neutral)
MultiRC  Question answering  [passage p]. Question: q? Is it [answer a]? ___.  "yes" / "no"

B.2 Sequence-to-Sequence baselines on seq2seq tasks are obtained from the


Fot the text summarization task, we use the dataset corresponding papers.
Gigaword (Rush et al., 2015) for model fine-tuning
B.3 Text Infilling
and evaluation. We finetune GLMLARGE on the
training set for 4 epochs with AdamW optimizer. We follow (Shen et al., 2020) and evaluate text in-
The learning rate has a peak value of 3e-5, warm- filling performance on the Yahoo Answers dataset
up over the 6% training steps and a linear decay. (Yang et al., 2017), which contains 100K/10K/10K
We also use label smoothing with rate 0.1 (Pereyra documents for train/valid/test respectively. The av-
et al., 2017). The maximum document length is 192 erage document length is 78 words. To construct
and the maximum summary length is 32. During the text infilling task, we randomly mask a given ra-
decoding, we use beam search with beam size of 5 tio r ∈ {10% · · · 50%} of each document’s tokens
and remove repeated trigrams. We tweak the value and the contiguous masked tokens are collapsed
of length penalty on the development set. The into a single blank. We finetune GLMLarge on the
evaluation metrics are the F1 scores of Rouge-1, training set for 5 epochs with dynamic masking, i.e.
Rouge-2, and Rouge-L (Lin, 2004) on the test set. the blanks are randomly generated at training time.
For the question generation task, we use the Similar to the sequence-to-sequence experiments,
SQuAD 1.1 dataset (Rajpurkar et al., 2016) and we use an AdamW optimizer with a peak learning
follow the dataset split of (Du et al., 2017). The rate 1e-5 and 6% warm-up linear scheduler.
optimizer hyperparameters are the same as those of For comparison with previous work, we use the
abstractive summarization. The maximum passage same test set constructed by (Shen et al., 2020).
length is 464 and the maximum question length The evaluation metric is the BLEU score of the in-
is 48. During decoding, we use beam search with filled text against the original document. We com-
beam size 5 and tweak the value of length penalty pare with two baselines: (1) BERT, which learns a
on the development set. The evaluation metrics are left-to-right language model to generate the masked
the scores of BLEU-1, BLEU-2, BLEU-3, BLEU- tokens on top of the blank representation, and (2)
4 (Papineni et al., 2002), METEOR (Denkowski BLM proposed by (Shen et al., 2020), which can
and Lavie, 2014) and Rouge-L (Lin, 2004). fill in the blank with arbitrary trajectories.
Results of T5Large on XSum are obtained by run-
ning the summarization script provided by Hug- B.4 Language Modeling
gingface transformers6 . All the other results of We evaluate the model’s ability of language model-
6
ing with perplexity on BookWiki and accuracy on
https://github.com/huggingface/
transformers/tree/master/examples/ the LAMBDA dataset (Paperno et al., 2016).
pytorch/summarization Perplexity is an evaluation criterion that has been
Perplexity is an evaluation criterion that has been well studied for language modeling. Perplexity is the exponentiation of the average cross entropy of a corpus:

    PPL = \exp\Big( -\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t}) \Big)    (4)

where x_{<t} = [x_0, · · · , x_{t-1}]. Since transformers can only operate on a window of fixed input size w, we cannot fully calculate p(x_t | x_{<t}) and can only calculate p(x_t | x_{t-w:t-1}). Even calculating this value for each token is prohibitively expensive, since we need to conduct T evaluations of w-size contexts. To improve evaluation efficiency, we adopt overlapping evaluation, where we advance the sliding window by some overlap o each time and only compute the cross entropy loss for the last o tokens of the window. In our experiments we set o = 256 for all the models.
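A simplified sketch of this overlapping evaluation is shown below; neg_log_prob is a placeholder for scoring a chunk of target tokens with a causal language model given its preceding context, and the window size w = 512 is only an example value, not a statement about our models.

    import math

    def overlapping_perplexity(token_ids, neg_log_prob, w=512, o=256):
        """Score the corpus in chunks of o tokens, each conditioned on the preceding
        tokens that still fit in a w-size window, then exponentiate the mean NLL.
        neg_log_prob(context, targets) is a placeholder returning per-token NLLs."""
        total_nll, total_tokens = 0.0, 0
        for start in range(0, len(token_ids), o):
            targets = token_ids[start:start + o]
            # The context fills the rest of the window (empty for the very first chunk),
            # so every token is scored exactly once.
            context = token_ids[max(0, start + len(targets) - w):start]
            total_nll += sum(neg_log_prob(context, targets))
            total_tokens += len(targets)
        return math.exp(total_nll / total_tokens)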
LAMBADA is a cloze-style dataset to test the ability of long-range dependency modeling. Each example is a passage consisting of 4-5 sentences with the last word missing, and the model is required to predict the last word of the passage. Since we use WordPiece tokenization, a word can be split into several subword units. We use teacher forcing and consider the prediction correct only when all the predicted tokens are correct.
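This accuracy criterion can be sketched as follows, with predict_next_token standing in for the model's greedy next-token prediction (a placeholder, not an actual interface of our code).

    def lambada_word_correct(context_ids, target_word_ids, predict_next_token):
        """Teacher-forced check: the prediction counts as correct only if every
        subword of the target word is the model's top-1 prediction in turn."""
        ids = list(context_ids)
        for gold in target_word_ids:
            if predict_next_token(ids) != gold:
                return False
            ids.append(gold)  # teacher forcing: feed the gold subword before the next step
        return True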
C Results on Other NLU Benchmarks

GLUE (Wang et al., 2018) is another widely-used NLU benchmark, including single-sentence tasks (e.g. sentiment analysis (Socher et al., 2013)) and sentence-pair tasks (e.g. text similarity (Cer et al., 2017) and natural language inference (Williams et al., 2018; Dagan et al., 2005)). The benchmark is usually considered less challenging than SuperGLUE. SQuAD (Rajpurkar et al., 2016, 2018) is an extractive question answering benchmark. We further compare GLM with BERT on the two benchmarks.

The results on GLUE and SQuAD are shown in Tables 9 and 10. On the two benchmarks, GLM still outperforms BERT with the same amount of parameters, but by a smaller margin.

Table 9: Results on the GLUE dev set.

Model        MNLI   QNLI   QQP    RTE    SST-2   MRPC   CoLA   STS-B   Avg
BERTLarge    86.6   92.3   91.3   73.6   93.2    88.0   60.6   90.0    84.4
GLMLarge     86.7   92.8   91.5   74.0   93.5    90.0   61.4   90.7    85.1

Table 10: Results on the SQuAD v1.1/v2.0 dev sets.

Model        SQuAD v1.1 (EM/F1)   SQuAD v2.0 (EM/F1)
BERTBase     80.8 / 88.5          73.7 / 76.3
GLMBase      81.5 / 88.6          74.7 / 77.8
BERTLarge    84.1 / 90.9          79.0 / 81.8
GLMLarge     85.4 / 91.6          80.3 / 83.3

D Text Generation Samples

We show texts generated by GLMDoc given unseen contexts randomly sampled from the test set. We use top-k random sampling with k = 40 for generation and set the maximum sequence length to 512. Some of the texts are cut short.
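For reference, this sampling setup corresponds to standard top-k decoding; the sketch below uses a generic causal language model checkpoint as a placeholder for GLMDoc rather than our released model.

    # Illustrative top-k sampling setup (k = 40, maximum sequence length 512).
    # The checkpoint below is a generic causal LM placeholder, not GLMDoc.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    context = "The Wyoming State Legislature is the legislative branch of the U.S. State of Wyoming."
    inputs = tokenizer(context, return_tensors="pt")
    sample_ids = model.generate(
        **inputs,
        do_sample=True,   # random sampling rather than greedy/beam decoding
        top_k=40,         # restrict sampling to the 40 most likely tokens
        max_length=512,   # maximum sequence length
    )
    print(tokenizer.decode(sample_ids[0], skip_special_tokens=True))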
Example D.1. The Wyoming State Legislature is the legislative branch of the U.S. State of Wyoming.

GLM: It consists of a state senate and a house of representatives. As of 2019, state senators are Steve Kish, Joe L. Williams, and Linda S. Bollens. Members of the Wyoming State Legislature are elected from single-member districts representing the majority of the state. The current state senate members are: In recent years, there have been four changes to the senate. The most recent is the creation of a six-seat district that includes all or part of the following: In the 2009 elections, the state senate members were elected to six-year terms. The current state house members are: The Wyoming Constitution assigns certain powers to the governor. Most notably, the governor is president of the senate and governor. However, if the governor desires to appoint a member to the Wyoming state senate, a law authorizes the governor to do so. The governor of Wyoming holds no legislative power but has the power to veto lawmakers, which is not limited to the veto of laws. Under the wyoming state constitution, the governor can veto the actions of the other members of the wyoming house of representatives. The governor can also appoint members of the wyoming senate. In addition, the governor can appoint members of the Wyoming house of representatives. Wyoming's constitution provides that the governor can appoint a member of the wyoming state senate to the wyoming supreme court, and the chairman of the wyoming senate.

Example D.2. Jonathan Terry is a television and film actor.

GLM: He was a voice actor for the "X-Men" cartoon series. He was also a voice actor for "the Simpsons" as well as "the marvelous misadventures of superman. He also has voiced characters in "the legend of korra" He has appeared on several television series, including "the simpsons", "the x-files", "heroes" and "the simpsons movie" as the character captain billy higgledy-pig, and the web series "krusty mysteries" as the character Colonel Trungus. He wrote for "the Guardian" newspaper. Jonathan Terry was born in London. Terry grew up in Surrey, England and attended the University of Sussex in the United Kingdom, graduating with a degree in english literature. He was a guest lecturer at King's College London, and then took two years of acting courses at the brit school of acting to prepare for his future career in the entertainment industry. Terry first appeared in the TV series "the Simpsons" as the character captain Billy Higgledy-pig, but his character was only a one-time recurring character in the series' first six seasons. He later appeared as a regular for the show's final six seasons, and has been a frequent guest in the show since. He appeared in the first few episodes of "" as the character major Jack Ryan. He has also appeared as part of the supporting cast of several episodes of "the secret life of pets". He has also worked on "the simpsons" TV show since "the simpsons movie", most notably playing the roles of Captain Skeletor and the ghost of the same name. He plays characters in several films, including "", "", "" and "". He has appeared in music videos for the killers in 1993, the pretenders in 1995, and in the TV shows "the royal" and "the bill".

Example D.3. Corona was a station along the port Washington branch of the long island rail road in the Corona section of queens, New York City. It was one of two stations built by the flushing railroad in Corona, this one having been at Grand Avenue (later called National Avenue, now National Street) and 45th Avenue.

GLM: It had a brick entrance building next to the tracks, and one platform with 2 side platforms. The station burned down and was rebuilt in 1908. The station closed on september 15, 1927, with the train service transferred from Grand Avenue to 45th Avenue. There were plans to build a subway extension to Corona, but it was never built. In the 1980s, two stations were constructed on the line, Corona Road and Corona Park. The new Corona station opened in 1988, and the original Corona station was demolished.

Example D.4. Robert Lee Smith (born July 5, 1938) is a former American football cornerback in the national football league. He played for the Los Angeles Rams (1962–1965) and the Detroit Lions (1965–1966). He was also a kick and punt returner.

GLM: In his four-year NFL career, he played in 33 games and started 14, registering 62 career interceptions. He completed his NFL career with five interceptions. Smith was born in La Canada Flintridge, Michigan, in 1938. He attended Bishop O'Dowd high school in Flintridge. Smith's father was a pharmacist. He graduated from Michigan State University in 1958 with a degree in business administration. Smith played for the Michigan Wolverines football team from 1959 to 1963. In his two years as a reserve cornerback, he led the conference in interceptions with five. In 1962, he set the Wolverines' all-time interception record with 13, and was second overall in the 1962 season's Heisman Trophy voting. Smith also won the Dick Butkus award as the nation's outstanding linebacker. In 1961, the "Los Angeles Times" wrote that Smith "is an outstanding pass rusher, with an average of almost 100 yards per punt return." Smith was inducted into the university of Michigan athletic hall of honor in 1989 and the national football foundation hall of fame in 1991. He was elected to the Michigan sports hall of fame in 1995. Smith earned the honor because of his accomplishments prior to his NFL career. He was one of four Michigan players honored as first-overall selections in the 1964 NFL draft. The others were Joe Namath, Bill Nelsen, and Jerry Kramer. In 1966, the NFL gave players $300,000 a season to play football. After his rookie season, he was not selected to play in the 1966 pro bowl. On January 13, 1966, the Rams traded smith to the Detroit Lions for Paul Hornung, and later that year he was traded to the Lions for Ray "the Lion" Jones in exchange for Linebacker Jim "the Hawk" Johnson. On September 10, 1968, he was traded back to Los Angeles for a second round pick in the 1970 draft. He was also traded to the St. Louis Cardinals for a second round pick in the 1970 draft. On June 2, 1970 he was cut by the Cardinals. On November 15, 1970, the Los Angeles Rams acquired Smith from the Lions in exchange for Linebacker Tony Harris. The Rams waived Smith during the September 1, 1972 offseason. Smith's number at Michigan State was # 7 in 1969.
