UL2: Unifying Language Learning Paradigms
Vinh Q. Tran] Xavier Garcia] Jason Wei] Xuezhi Wang] Hyung Won Chung]
Google Brain
Abstract
Existing pre-trained models are generally geared towards a particular class of problems. To date, there
seems to be still no consensus on what the right architecture and pre-training setup should be. This paper
presents a unified framework for pre-training models that are universally effective across datasets and
setups. We begin by disentangling architectural archetypes with pre-training objectives – two concepts
that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision
in NLP and show how different pre-training objectives can be cast as one another and how interpolating
between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training
objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of
mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We
conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method
pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups.
Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established
supervised NLP tasks ranging from language generation (with automated and human evaluation), language
understanding, text classification, question answering, commonsense reasoning, long text reasoning, struc-
tured knowledge grounding and information retrieval. Our model also achieves strong results at in-context
learning, outperforming 175B GPT-3 (published paper results) on zero-shot SuperGLUE and tripling the
performance of T5-XXL on one-shot summarization. On zero-shot MMLU, UL2 20B outperforms T0 and T5
models. Additionally, we show that UL2 20B works well with chain-of-thought prompting and reasoning,
making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters.
Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores
competitive with FLAN-PaLM 62B. We release Flax-based T5X model checkpoints for the UL2 20B model and
Flan-UL2 20B model at https://github.com/google-research/google-research/tree/master/ul2.
∗ Yi and Mostafa are co-leads of this project and are denoted with ∗ . ] denotes technical research contributors. [ denotes data &
infrastructure contributions. 4 denotes advising contributions. Don is the last author. Full contributions of all authors are listed at the end of the paper. Correspondence to yitay@google.com or dehghani@google.com.
Contents

1 Introduction
4 Ablative Experiments
4.1 Baselines
4.2 Experimental Setup
4.2.1 Datasets and Tasks
4.2.2 Metrics and Holistic Evaluation
4.2.3 Implementation Details
4.3 Overview of Ablative Experimental Results
4.3.1 Decoder Vs Encoder-Decoder
4.3.2 Is GPT and/or T5 the optimal setup?
4.3.3 On the Performance of UniLM and SCLM
4.3.4 On the Performance of the Proposed UL2
4.4 Mode Switching Ablations
4.5 Mixture-of-Denoisers Ablations
4.6 Modestly Scaling Model Size and Pretraining Data
5.2.4 Results on Supervised Finetuning
5.2.5 Tradeoffs between Finetuning and Prompt-based Zero-shot Learning (SuperGLUE)
5.2.6 Generative Few-shot: XSUM Summarization
5.2.7 UL2 for chain-of-thought prompting
5.2.8 Massively Multitask Language Understanding
5.3 Instruction Tuned UL2 20B with FLAN
5.3.1 Few-shot MMLU and Big-Bench Results after Flan training of UL2
5.3.2 Comparisons on using Chain-of-thought vs Direct Prompting
6 Conclusion
7 Acknowledgements
8 Author Contributions
9 Appendix
9.1 Model Release
9.2 Implementation Details and UL2 code
9.3 Details of Supervised Finetuning SOTA runs
9.4 Details of Prompts for few-shot and zero-shot
1 Introduction
There is a wide spectrum of pre-trained model options for NLP researchers and practitioners these days
(Devlin et al., 2018; Brown et al., 2020; Raffel et al., 2019; Radford et al., 2019; Liu et al., 2019; Yang et al.,
2019; Thoppilan et al., 2022; Fedus et al., 2021; Du et al., 2021; Chowdhery et al., 2022). When faced with the
question of what model should one use, the answer is often it depends, followed by on what task?
Answering this can be overwhelming, involving a number of fine-grained follow-up questions like,
‘encoder-only or encoder-decoder?’, ‘span corruption or language model?’. Pressing further, the answer always
seems to depend on the target downstream task. This paper questions and rethinks this thought process,
specifically answering the questions of why should the choice of the pre-trained LM depend on the downstream task?
and how can we pre-train models that work universally well across many tasks?.
This paper proposes a step towards making a universally applicable language model possible. We present a
framework for Unifying Language Learning Paradigms or UL2 in short, that is consistently effective across a
very diverse set of tasks and setups. Figure 1 shows an example of how UL2 can perform universally well,
unlike other models that often have to make a trade-off.
[Figure 1: scatter plot of Finetuned SuperGLUE Score (y-axis) versus 1-Shot GEM XSum, SGD, ToT Avg. Rouge-L (x-axis) for UL2, T5, UniLM, PrefixLM, SpanCorrupt and GPT-like models in both decoder-only (Dec) and encoder-decoder (EncDec) configurations.]
Figure 1: In both decoder-only and encoder-decoder setups, UL2 strikes a significantly improved balance between performance on fine-tuned discriminative tasks and prompt-based 1-shot open-ended text generation compared to previous methods. Note: Dec and EncDec are compute matched but EncDec models have double the
parameters.
The appeal of a universal model is clear: it allows concentrated effort in improving and scaling a single model, instead of diversifying resources across N models. Moreover, under resource constrained
settings where only a few models can be served (e.g., on device), it would be preferable to have a single
pretrained model that can perform well on many types of tasks.
At the core of UL2 is the newly proposed Mixture-of-Denoisers (MoD), a pre-training objective that enables
strong performance across tasks. MoD is a mixture of several well-established denoising objectives along with
new ones; namely X-denoising (extreme denoising) which considers extreme span lengths and corruption
rates, S-denoising (sequential denoising) that strictly follows sequence order, and R-denoising (regular
denoising) that is a standard span corruption objective introduced in (Raffel et al., 2019). We show that MoD
is conceptually simple but highly effective for a diverse set of tasks.
Our approach exploits the realization that most (if not all) well-studied pre-training objectives differ in the
type of context a model is conditioned on. For example, the span corruption objective is akin to invoking
multiple regions of prefix language modeling (PLM) (Liu et al., 2018; Raffel et al., 2019) whereby prefixes are
contiguous segments of non-corrupted tokens and targets have full access to prefixes of all PLM segments. The
setting where the span approaches the full sequence length is approximately a language modeling objective
conditioned on long-range context. Thus, we are able to design a pre-training objective that smoothly
interpolates these different paradigms (span corruption vs language modeling vs prefix language modeling).
It is also easy to see that each denoiser is difficult in different ways. They also differ in the nature of
extrapolation (or interpolation). For example, bounding a model by bidirectional context (or the future) (i.e., span corruption) makes the task easier and more akin to fact completion. Meanwhile, PrefixLM/LM objectives are generally more 'open ended'. These behaviours can be easily observed by monitoring the cross
entropy losses of these different denoising objectives.
Given the MoD formulation, we conjecture that it is beneficial for our model to not only distinguish between
different denoisers during pre-training but also to adaptively switch modes when learning downstream
tasks. We introduce mode switching, a new concept that associates pre-training tasks with dedicated sentinel
tokens and allows dynamic mode switching via discrete prompting. Our model is able to switch modes
between R, S and X denoisers on demand after being pre-trained.
We then disentangle the architecture from the self-supervision scheme. While it might be a common
misconception, as previously noted in Raffel et al. (2019), that a pre-trained model is strongly characterized
by its backbone architecture (e.g., decoder-only vs. encoder-decoder), we find that the choice of the denoiser
has significantly more impact. MoD supports either backbone, similar to how T5’s span corruption may be
trained with a decoder-only model. As such, UL2 is agnostic to architecture. We argue that the choice of
backbone architecture is mainly a trade-off across different efficiency metrics.
We conduct systematic and ablative experiments on a suite of 9 diverse tasks aimed to capture different
problem formulations (supervised and prompt-based in-context few-shot learning). We experiment with the
SuperGLUE suite (Wang et al., 2019), and three tasks from the GEM benchmark (Gehrmann et al., 2021).
In addition, we evaluate on open text generation, as well as prompt-based one-shot settings on all tasks. In
this ablative setup, our experimental results show that UL2 outperforms T5 and GPT-like baselines on all 9
setups. On average, UL2 outperforms a T5 baseline by +43.6% and a language model by +76.1%. Among
all the other competitive baselines considered, UL2 is the only method that outperforms T5 and GPT-like
models on all tasks.
We scale UL2 up to a moderate scale setting of approximately 20B (19.5 to be exact) parameters and run
experiments across a very diverse suite of 50+ NLP tasks ranging from language generation (with automated
and human evaluation), language understanding, text classification, question answering, commonsense
reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our results show
that UL2 achieves SOTA on a vast majority of tasks and setups.
Finally, we conduct zero/few-shot experiments with UL2 and show that UL2 outperforms GPT-3 175B on
zero shot SuperGLUE. When compared with newer state-of-the-art models like GLaM (Du et al., 2021), PaLM
(Chowdhery et al., 2022) and ST-MoE (Zoph et al., 2022), UL2 remains competitive at a compute-matched
setup despite only training on C4 corpus which is known to be less effective than specially curated datasets
used in (Du et al., 2021; Chowdhery et al., 2022). We delve into understanding trade-offs between zero-shot
and finetuning performance and show that UL2 is Pareto-efficient with respect to both learning paradigms.
On one-shot summarization, UL2 triples the performance of an LM adapted T5 XXL model and is competitive
with (or outperforms) PaLM and LaMDA at the same compute cost. We release T5X-based Flax checkpoints
of the trained UL2 model.
2 Background: Pre-trained Language Models
In this section, we discuss background surrounding pretrained language models, pretraining objectives and
other unified pretraining proposals.
Learning pre-trained representations for language is a far-reaching pillar of modern NLP research, dating
back to (Mikolov et al., 2013; Pennington et al., 2014; Neumann et al., 2018; Dai & Le, 2015; Howard & Ruder,
2018). The first pre-trained Transformer, GPT, was proposed by (Radford et al., 2019) and was trained
as a causal language model. Subsequently, BERT (Devlin et al., 2018) demonstrated the importance of
bidirectional modeling for many downstream tasks. BERT introduced masked language modeling (MLM), a
denoising objective that reconstructs the input in-place using bidirectional receptive fields. XLNet (Yang et al., 2019) introduced permutation language modeling to account for dependencies between masked tokens
during training. A number of additional papers (e.g., RoBERTA (Liu et al., 2019), SpanBERT (Joshi et al.,
2020)) suggested further improvements to the pre-training process.
At the same time, two-stack encoder-decoder architectures such as T5 (Raffel et al., 2019) gained popularity
due to their improved performance on classification and sequence-to-sequence (“seq2seq”) tasks. However,
so far, these models have shown limited performance on open-text generation and prompt-based inference
(i.e., in-context learning), which motivates the use of decoder-only models that are trained with different
objectives (e.g., GPT-3 (Brown et al., 2020), GLaM (Du et al., 2021), LaMDA (Thoppilan et al., 2022) and
PaLM (Chowdhery et al., 2022)). In this work, we aim to bridge the performance gap between the two by
means of a general training paradigm that suits both architectures.
Decoder-only vs Encoder-only The key similarity between decoder-only and encoder-decoder architectures is that decoder-only architectures operate with an input-to-target paradigm, or a targets-only paradigm if CausalLM is used instead of PrefixLM. For both architectures, the objective is always to predict the next token (LM), and they are therefore autoregressive models. Notably, this is different from position-wise masked LM denoising (sometimes known as autoencoding), which has been popularized by encoder-only BERT-style models. This class of models is very restricted in its generative capabilities. On top of that, task-specific classification heads are also typically employed for downstream tasks. Because of the cumbersomeness of task-specific classification heads, we strongly do not recommend using this class of autoencoding models moving forward and consider them somewhat deprecated. Caveats do apply. For instance, regression is probably the only reason why one would add a task-specific head (Lees et al., 2022), or to squeeze out some efficiency gains from eliminating a full vocabulary. Either way, one can always start from an encoder-decoder and chop off the decoder later, so there is no good reason to use an encoder-only model. Hence, the only real consideration here is between decoder-only and encoder-decoder architectures.
Decoder-only vs Encoder-Decoder The line between decoder-only and encoder-decoder models is less
clear. PrefixLM models are almost encoder-decoder models with shared parameters (but not quite). From
an inductive bias point of view, there are multiple differences. Encoder-Decoder models process input and
targets independently with different sets of parameters. This is a form of sparsity where different sets of parameters are used for different tokens. Encoder-Decoder models also have a cross-attention component that connects input tokens to target tokens. Meanwhile, decoder-only models process inputs and targets by concatenating them. Hence, the representations of inputs and targets are concurrently built layer by layer as the inputs/targets propagate up the network. Conversely, the decoder in Encoder-Decoder models generally only looks at the fully processed encoder input. Overall, the inductive biases of PrefixLM decoder-only models and Encoder-Decoder models could be pretty similar modulo the subtle differences stated above. The distinct property is that Encoder-Decoder models generally have approximately 2x the parameters of a decoder-only model when compute-matched.
Sparse Models On a side note, there has also been an emerging trend of sparse pretrained models
that achieve state-of-the-art performance. Sparse mixture-of-expert models such as the Switch Transformer
(Fedus et al., 2021), GLaM (Du et al., 2021) and/or GShard (Lepikhin et al., 2020) have also demonstrated a
lot of promise. While orthogonal to the topic of pretraining objectives, sparse models achieve a very different
flop-per-parameter ratio compared to dense models - a core recurring motif in the debate surrounding
encoder-decoder models vs decoder-only models.
While recent research demonstrates the potential of large supervised multi-task pre-training (Aribandi et al.,
2021; Sanh et al., 2021; Wang et al., 2022a), most pre-training objectives rely on the vast availability of
unsupervised data and use self-training techniques. As mentioned above, different architectures typically
leverage different objectives. Decoder-only models are typically trained with causal language model objectives
to mimic auto-regressive generation (Radford et al., 2019). Raffel et al. (2019) explored many objectives
for encoder-decoder models and found span corruption to be effective. Wang et al. (2022a) conduct a systematic study of different architectures combined with three different pretraining objectives (causal LM, prefixLM and span corruption) and analyze their impact on zero-shot generalization. Related to
our proposed X-denoisers, (Wettig et al., 2022) studies the effect of corruption rate in BERT-style masked
language modeling and hypothesizes that this improves sample efficiency along with benefitting larger
models. Notably, the benefits of heightened corruption rates as a standalone denoiser are still unclear, as noted
by (Raffel et al., 2019) and also apparent in our own ablations. Pre-training (or denoising) is generally
applied on the subword level (Raffel et al., 2019; Devlin et al., 2018), but it is worth noting that it has also been applied on the character or byte level (Xue et al., 2021; Tay et al., 2021c). In these setups, the corrupted spans are generally much larger than in subword-based denoising.
UniLM (Dong et al., 2019) proposed to train on multiple language modeling objectives using a single
Transformer model. Specifically, UniLM trains on unidirectional LM, bidirectional LM and seq2seq LM. This
is quite similar to combining auto-regressive LMs with BERT and prefix-LM models. Notably, UniLM trains
using a cloze-type formulation which adds explicit mask tokens to the inputs. Losses are then computed by
comparing the predicted token and the target token in a position-wise fashion. Aside from pretraining unification, there has been a recent trend of thematic unification, i.e., unifying common tasks into one model.
Examples of these include UNICORN (Lourie et al., 2021) for commonsense reasoning, UnifiedQA (Khashabi
et al., 2020, 2022) for question answering and UnifiedSKG (Xie et al., 2022) for Structured Knowledge
Grounding.
3 Unifying Language Learning Paradigms (UL2)

This section describes the UL2 framework and the proposed pre-training objectives which we study for the remainder of the paper.
3.1 Pre-training
[Figure 2 diagram: the R-denoiser (short spans, low corruption), S-denoiser (sequential denoising / prefix language modeling) and X-denoisers (extreme denoising with long spans and/or high corruption) are mixed into a Mixture-of-Denoisers, trained in an inputs-to-targets fashion on either a decoder-only or an encoder-decoder backbone, and transferred to task paradigms such as supervised finetuning, in-context learning, zero-shot learning, language generation, language understanding, structured knowledge grounding and long range reasoning.]
Figure 2: An overview of UL2 pretraining paradigm. UL2 proposes a new pretraining objective that works
well on a diverse suite of downstream tasks.
Figure 3: Mixture of denoisers for training UL2. Greyed out rectangles are masked tokens that are shifted to
‘targets’ for prediction.
Many pre-training tasks can be simply formulated as an ‘input-to-target’ task, wherein the input refers to
any form of memory or context that the model conditions on, and the target is the model’s expected output.
Language models use all previous time-steps as inputs to the model to predict the next token, which is the
target. In span corruption, the model leverages all uncorrupted tokens from the past and future as inputs for
predicting the corrupted span (targets). Prefix-LMs are LMs that use past tokens as inputs, but consume the
inputs bidirectionally: this offers more modelling power than the unidirectional encoding of inputs in a vanilla LM.
Given this perspective, we can approximately reduce one pre-training objective to another. For instance,
in the span corruption objective, when the corrupted span, i.e., target, is equal to the entire sequence, the
problem becomes effectively1 a language modeling problem. With this in mind, using span corruption, by
setting the span length to be large, we can effectively mimic the LM objective in local regions.
We define a notation that covers all of the different denoising tasks that we use in this paper. The inputs and
targets of the denoising tasks are generated by a SpanCorrupt function that is parameterized by three values (µ, r, n), where µ is the mean span length, r is the corruption rate, and n is the number of corrupted spans. Note that n may be a function of the input length L and the span length µ, e.g. L/µ, but in some cases we use a fixed value of n. Given an input text, SpanCorrupt corrupts spans whose lengths are drawn from a (normal or uniform) distribution with mean µ. After corruption, the input text is then fed to the denoising task and the corrupted spans are used as targets to be recovered.
As an example, to construct an objective analogous to causal language modeling using this formulation, one
would simply set (µ = L, r = 1.0, n = 1), i.e. a single span with its span length equal to the length of the
sequence. To express an objective similar to the Prefix-LM objective, one would set (µ = L − P, r = 1.0 − P/L, n = 1)
where P is the length of the prefix, with the additional constraint that the single corrupted span always
reaches the end of the sequence.
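To make this parameterization concrete, below is a minimal sketch of a SpanCorrupt-style function; the sentinel id range, the span-placement strategy and the length sampling are simplified illustrations rather than the exact implementation used in our codebase.

```python
import numpy as np

SENTINEL_START = 32000  # assumed sentinel-token id range; illustrative only


def span_corrupt(tokens, mu, r, n, seed=0):
    """Minimal SpanCorrupt sketch parameterized by (mu, r, n).

    mu: mean span length, r: corruption rate, n: number of corrupted spans.
    Returns (inputs, targets): corrupted spans are replaced by sentinels in
    the inputs and copied, each preceded by its sentinel, into the targets.
    """
    rng = np.random.default_rng(seed)
    length = len(tokens)
    # Draw span lengths around mu, then rescale so they cover ~r of the input.
    lengths = np.maximum(1, rng.normal(mu, mu / 4, size=n).round().astype(int))
    budget = max(1, int(round(r * length)))
    lengths = np.maximum(1, (lengths * budget / lengths.sum()).astype(int))
    # Distribute the remaining (uncorrupted) tokens as gaps between the spans.
    free = max(length - int(lengths.sum()), 0)
    gaps = rng.multinomial(free, [1.0 / (n + 1)] * (n + 1))
    inputs, targets, pos = [], [], 0
    for i in range(n):
        gap_end = pos + int(gaps[i])
        inputs.extend(tokens[pos:gap_end])            # uncorrupted context
        span_end = min(gap_end + int(lengths[i]), length)
        inputs.append(SENTINEL_START + i)             # sentinel marks the gap
        targets.append(SENTINEL_START + i)
        targets.extend(tokens[gap_end:span_end])      # span to be recovered
        pos = span_end
    inputs.extend(tokens[pos:])                       # trailing context
    return inputs, targets
```

Under this sketch, span_corrupt(seq, mu=len(seq), r=1.0, n=1) reduces to the causal-LM-like case above (the model conditions only on a sentinel token), while the Prefix-LM case additionally constrains the single corrupted span to end the sequence.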
We note that this inputs-to-targets formulation can be applied to both encoder-decoder models and single-
stack transformer models (e.g., decoder models). We opt to select models that predict the next target token
instead of those that do so in-place (e.g., predicting the current masked token as in BERT), because the next-target formulation is more general and can subsume more tasks without requiring a special "CLS" token and task-specific projection heads.
We conjecture that a strong universal model has to be exposed to solving a diverse set of problems during pre-training. Given that pre-training is done using self-supervision, we argue that such diversity should be injected into the objective of the model, otherwise the model might lack a certain ability, like coherent long-form text generation.
Motivated by this, as well as current class of objective functions, we define three main paradigms that are
used during pre-training:
• R-Denoiser - The regular denoising is the standard span corruption introduced in Raffel et al. (2019)
that uses a range of 2 to 5 tokens as the span length, which masks about 15% of input tokens. These
spans are short and potentially useful to acquire knowledge instead of learning to generate fluent text.
• S-Denoiser - A specific case of denoising where we observe a strict sequential order when framing the
inputs-to-targets task, i.e., prefix language modeling. To do so, we simply partition the input sequence
into two sub-sequences of tokens as context and target such that the targets do not rely on future
information. This is unlike standard span corruption where there could be a target token with earlier
position than a context token. Note that similar to the Prefix-LM setup, the context (prefix) retains a
bidirectional receptive field. We note that S-Denoising with very short memory or no memory is in
similar spirit to standard causal language modeling.
• X-Denoiser - An extreme version of denoising where the model must recover a large part of the input, given a small to moderate part of it. This simulates a situation where a model needs to generate a long target from a memory with relatively limited information. To do so, we opt to include examples with aggressive denoising where approximately 50% of the input sequence is masked, by increasing the span length and/or corruption rate. We consider a pre-training task to be extreme if it has a long span (e.g., ≥ 12 tokens) or has a large corruption rate (e.g., ≥ 30%). X-denoising is motivated by being an interpolation between regular span corruption and language-model-like objectives.
1 This is roughly approximate since the model still conditions on a sentinel token.
This set of denoisers has strong connections with previously used objective functions: R-Denoising is the
T5 span corruption objective, S-Denoising is connected to causal language models that are GPT-like, and
X-Denoising can expose the model to a combination of objectives from T5 and causal LMs. Notably, X-denoisers are also connected to improved sample efficiency, since more tokens are predicted in each sample, in similar spirit to LMs. We propose blending all these tasks in a uniform fashion to obtain a hybrid self-supervised objective. The final objective is a mixture of 7 denoisers that are configured as follows:
Denoiser    Setting
R           (µ = 3, r = 0.15, n) ∪ (µ = 8, r = 0.15, n)
S           (µ = L/4, r = 0.25, n = 1)
X           (µ = 3, r = 0.5, n) ∪ (µ = 8, r = 0.5, n) ∪ (µ = 64, r = 0.15, n) ∪ (µ = 64, r = 0.5, n)
For X- and R-Denoisers, the span length is sampled from a normal distribution with mean of µ. For S-
Denoisers, we use a uniform distribution, fix the number of corrupted spans to 1, and have an additional
constraint that the corrupted span should end at the end of the original input text, i.e. no uncorrupted token
should appear after the corrupted part. This is roughly equivalent to seq2seq denoising or the Prefix LM
pre-training objective.
Since LM is a special case of Prefix-LM, we did not find it necessary to include a causal LM task in the mixture. All tasks have approximately equal participation in the mixture. We also explore an alternative where we increase the proportion of S-denoisers up to 50% of the mixture, with all other denoisers taking up the remainder. We present detailed ablation studies of various design choices in later sections.
Finally, the mixing in Mixture-of-Denoisers is what makes it universally powerful. Alone, some of the denoiser
types do not perform well. For instance, the original T5 paper explored an option with 50% corruption rate
(X-denoising) and found that to not work well.
UL2's mixture of denoisers is simple to implement using a library like seqio2 (Roberts et al., 2022). See the appendix for more implementation details.
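For illustration, the mixture in the table above can be expressed as a small configuration such as the following sketch; the helper names and the way n is derived from the input length are assumptions, not the exact seqio task definitions.

```python
import random

# (mode_token, mu, r, n) for each denoiser, mirroring the mixture table above.
# n=None means "derived from the input length L", e.g. n ~ r * L / mu.
MIXTURE_OF_DENOISERS = [
    ("[R]", 3,    0.15, None),
    ("[R]", 8,    0.15, None),
    ("[S]", None, 0.25, 1),      # mu = L / 4, single span ending the sequence
    ("[X]", 3,    0.50, None),
    ("[X]", 8,    0.50, None),
    ("[X]", 64,   0.15, None),
    ("[X]", 64,   0.50, None),
]


def sample_denoiser(length, rng=random):
    """Pick one denoiser (approximately uniformly) and resolve (mu, r, n)."""
    mode, mu, r, n = rng.choice(MIXTURE_OF_DENOISERS)
    mu = mu if mu is not None else max(1, length // 4)       # S-denoiser: L / 4
    n = n if n is not None else max(1, round(r * length / mu))
    return mode, mu, r, n
```

Each pre-training example would then be corrupted with the sampled denoiser and prefixed with the corresponding paradigm token, as described next.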
We introduce the notion of paradigm-shifting via mode switching. During pre-training, we feed the model
an extra paradigm token, i.e., {[R], [S], [X]}, that helps the model switch gears and operate in a mode that is
more suitable for the given task. For fine-tuning and downstream few-shot learning, to trigger the model
to learn better solutions, we also add a paradigm token with respect to the setups and requirements of the
downstream task. Mode switching in fact binds downstream behavior to one of the modes we used during
upstream training.
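As a minimal illustration of mode switching (the exact tokenization of the paradigm sentinels is an implementation detail, and the function below is an illustrative sketch rather than our exact pipeline), the paradigm token is simply prepended to the inputs both during pre-training and when fine-tuning or prompting downstream:

```python
def add_paradigm_token(input_token_ids, mode, vocab):
    """Prepend the paradigm sentinel ([R], [S] or [X]) to the input tokens.

    The same sentinel seen during pre-training is prepended at fine-tuning or
    few-shot time to place the model in the matching mode.
    """
    assert mode in ("[R]", "[S]", "[X]")
    return [vocab[mode]] + list(input_token_ids)
```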
UL2 adopts an architecture-agnostic philosophy. We argue that the choice between both architectures
(encoder-decoder vs decoder-only) is more of an efficiency trade-off and that architecture choice should not
be conflated with the pretraining objective. Hence, we have both a UL2 decoder and UL2 encoder-decoder
in similar spirit to how there are multiple sizes per model. We discuss this efficiency trade-off in detail in
our experiment section. UL2 adopts a fairly standard vanilla T5 Transformer that has been enhanced with modifications that have withstood the test of time, i.e., GLU layers (Shazeer, 2020) and T5-style relative
attention. To not further conflate architectural modifications with pretraining contributions, the backbone of the model remains similar to a T5-like model. This is also in light of results such as (Narang et al., 2021).

2 https://github.com/google/seqio

Table 2: Experimental results on a suite of language understanding and generation tasks in both supervised and one-shot setups. Models are pretrained on 32B tokens.
4 Ablative Experiments
This section describes our ablative experimental setup (e.g., baselines, datasets, implementation details) and
results. Our overall findings show that UL2 outperforms T5-like and GPT-like models on 9 out of 9 tasks.
4.1 Baselines
• Causal Language Model (CLM) - This is the standard left-to-right auto-regressive language model
pre-training, used in many standard pre-trained models, like GPT (Radford et al., 2019; Brown et al.,
2020). We refer to this model as GPT-like in our experiments.
• Prefix LM (PLM) - This is a slight variation of causal LM where the prefix (memory) M has bidirectional receptive fields, introduced in (Liu et al., 2018; Raffel et al., 2019). We uniformly sample the length of M and only compute the loss at the auto-regressive targets.
• Span Corruption (SC) - This is the standard denoising objective proposed in T5 (Raffel et al., 2019).
The idea is to blank out certain text portions and replace them with sentinel tokens. The text replaced with sentinel tokens is then copied to the targets and auto-regressively generated by the model. We
use a mean span of 3 and denoising rate of 15% following the default T5 setup.
• Span Corruption + LM (SCLM) - We train on a mixture of CLM and Span Corruption with an equal
mix ratio. We use the same hyper-parameters for SC for the SC component of this objective.
• UniLM (ULM) - This is the objective proposed in Dong et al. (2019). Similar to the original UniLM, we
mix causal language modeling, Prefix LM (sequence-to-sequence LM) and bidirectional i.i.d denoising.
Instead of training UniLM in cloze-style or BERT-style, we opt to generate the masked tokens. This
allows this objective to be applicable to both decoder-only and encoder-decoder architectures and removes the need for task-specific linear heads for fine-tuning.
For all objectives, we explore both single-stack and encoder-decoder architectures. All objectives are framed as inputs-to-targets tasks, implemented in either encoder-decoder or decoder-only model structures, since we consider BERT-style masked language modeling pretraining to have already been effectively subsumed by this style of pretraining, as made empirically evident in (Raffel et al., 2019). Task-specific classification heads are also not recommended, since they clearly go against the principle of having a universal model (and are also very cumbersome).
4.2 Experimental Setup

4.2.1 Datasets and Tasks

We conduct our experiments on a diverse set of supervised and prompt-based few-shot learning tasks. The datasets we use are SuperGLUE (Wang et al., 2019), comprising 8 NLU sub-tasks. We also conduct
experiments on 3 datasets from the GEM benchmark (Gehrmann et al., 2021) that focus on language generation problems. We arbitrarily select XSUM (summarization), ToTTo (table-to-text generation) (Parikh
et al., 2020) and Schema Guided Dialog (SGD) (Rastogi et al., 2019) from the GEM benchmark. For all these
tasks, we evaluate on both supervised fine-tuning and prompt-based one-shot learning. Finally we also
compare our models on their general ability for text generation using perplexity scores on the C4 validation
set. We believe our suite of tasks gives us good coverage across many setups in the literature including
supervised and conditional few-shot learning.
4.2.2 Metrics and Holistic Evaluation

For SuperGLUE, we report well-established metrics such as accuracy, F1 or exact match, whenever appropriate. For the GEM benchmark, we use the Rouge-L metric. For language modeling we report negative log perplexity.
The universality of the models, i.e., their collective performance across the full range of tasks, is a main evaluation criterion here. To enable comparison between models from this perspective, we need an aggregate performance score. However, the metrics of the different tasks we include are widely different in nature – take, for example, F1 and perplexity. To address this, we opt to report and use the normalized relative gain with respect to baselines as an overall metric. For this purpose, we use the standard language model (decoder-only, GPT-like) and the standard span denoising encoder-decoder (T5) as prime baselines and report each method's relative performance against these well-established candidates. We believe this is the most suitable method for comparing these models since it is easy to reason about how much a new model is generally better than a popular setting (e.g., GPT- or T5-like). We also highlight the fact that the overall gain is normalized, so it becomes harder to exploit or be susceptible to benchmark lottery effects (Dehghani et al., 2021b).
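For clarity, the aggregation we use can be sketched as follows; this is an illustrative simplification (e.g., it ignores metrics where lower is better), not the exact script used to produce the tables.

```python
def relative_gain(score, baseline):
    """Relative percentage improvement of `score` over `baseline`."""
    return 100.0 * (score - baseline) / abs(baseline)


def overall_gain(task_scores, baseline_scores):
    """Average the per-task relative gains, weighting every task equally."""
    gains = [relative_gain(s, b) for s, b in zip(task_scores, baseline_scores)]
    return sum(gains) / len(gains)
```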
4.2.3 Implementation Details

Our experiments are all conducted in JAX/Flax (Bradbury et al., 2018) using the open source T5X3 framework (Roberts et al., 2022) and Flaxformer4. We pre-train all models for 500K steps with a batch size of 128 and
a sequence length of 512 inputs and 512 targets using the C4 corpus. The total number of tokens seen during pre-training is approximately 32 billion. Each pre-training run is typically trained using 64
to 128 TPUv4 chips (Jouppi et al., 2020). We optimize our model with the Adafactor (Shazeer & Stern, 2018) optimizer with an inverse square root learning rate schedule. To understand the trade-off of different backbone
architectures, we run all baseline pre-training objectives with both the decoder-only architecture and encoder-
decoder architecture. We report key experiment results using a base architecture of approximately 167M parameters for the decoder model and 335M parameters for the encoder-decoder model. All models use a standard Transformer with SwiGLU layers as described in (Shazeer, 2020).
3 https://github.com/google-research/t5x.
4 https://github.com/google/flaxformer
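The inverse square root schedule mentioned above follows the standard T5 recipe; a minimal sketch is shown below, where the warmup constant of 10k steps is an assumption.

```python
def inverse_sqrt_lr(step, warmup_steps=10_000):
    """T5-style inverse square root learning-rate schedule."""
    return 1.0 / (max(step, warmup_steps) ** 0.5)
```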
Table 3: Relative performance compared to standard encoder-decoder span corruption model (T5). Results in
this table are expressed in terms of relative percentage improvements over a baseline. Model with ? denotes
the main compared baseline. Overall score column is normalized to be weighted equally across tasks.
Supervised One-shot
Obj Arch SG XS SGD TOT SG XS SGD TOT LM All Win
CLM Dec -13.6 -9.2 -0.7 -3.0 +1.8 -91.7 -2.2 -90.5 +208 -31.7 2/9
PLM Dec -13.3 -9.2 -0.5 -2.8 +10.5 -85.6 +158 +205 +185 -11.0 4/9
SC Dec -5.6 -6.2 -0.6 -1.3 +0.05 -84.5 +54 -23.8 +99 -20.6 3/9
SCLM Dec -6.0 -6.5 -0.2 -2.0 +5.9 -59.6 -11.3 -95 +204 -16.1 2/9
UniLM Dec -10.1 -8.2 -0.2 -2.3 -5.3 -69.1 +382 +110 +200 -16.1 3/9
UL2 Dec -9.0 -6.9 0.0 -1.4 +9.8 +6.9 +340 +176 +209 +14.1 5/9
PLM ED -3.7 +2.9 -0.2 -0.6 -0.86 -13.3 +397 +86 +199 +16.7 5/9
SC? ED 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -
SCLM ED +0.7 +2.1 -0.2 -0.5 +3.2 -31.6 +508 +248 +201 +28.3 7/9
UniLM ED -1.2 -0.2 +0.1 -0.4 +3.5 -11.0 +355 +95 +173 +19.8 5/9
UL2 ED +1.5 +2.6 +0.5 +0.4 +7.2 +53.6 +363 +210 +184 +43.6 9/9
We utilize the default T5 English 32K SentencePiece vocabulary for all models. Within the context of decoder-only models, except for the case of the decoder model trained on causal LM, our experiments always use a bidirectional receptive field in the input segment and autoregressive decoding at the targets segment. This is essentially a PrefixLM-type architecture5 (Raffel et al., 2019), which we find to be consistently better than a full causal decoder model.
Table 4: Relative performance compared to standard decoder causal language model (GPT-like). Results in
this table are expressed in terms of relative percentage improvements over a baseline. Model with ? denotes
the main compared baseline. Overall score column is normalized to be weighted equally across tasks.
Supervised One-shot
Obj Arch SG XS SGD TOT SG XS SGD TOT LM All Win
CLM? Dec 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -
PLM Dec +0.3 +0.1 +0.2 +0.2 +8.5 +74.3 +164 +3100 -8.0 +21.4 8/9
UniLM Dec +4.0 +1.1 +0.5 +0.7 -7.0 +274 +393 +2100 -2.5 +21.0 7/9
SC Dec +8.7 +3.4 +0.1 +1.8 -1.8 +87.0 +57.1 +700 -54.2 +13.9 7/9
SCLM Dec +1.8 +3.0 +0.5 +1.0 +4.0 +387 -9.3 -50 -1.3 +15.8 6/9
UL2 Dec +5.2 +2.6 +0.6 +1.7 +7.9 +1190 +350 +2800 +0.3 +45.7 9/9
PLM ED +11.3 +13.4 +0.5 +2.5 -2.6 +946 +408 +1850 -2.9 +48.6 7/9
SC ED +16.5 +10.2 +0.6 +3.1 -1.8 +1107 +2.3 +950 -208 +31.7 7/9
SCLM ED +15.7 +12.5 +0.5 +2.6 +1.3 +726 +522 +3550 -2.2 +60.3 8/9
UniLM ED +14.2 +10.0 +0.7 +2.7 +1.6 +974 +365 +1950 -12.9 +52.6 8/9
UL2 ED +17.4 +13.1 +1.2 +3.5 +5.3 +1754 +373 +3150 -8.3 +76.1 8/9
4.3 Overview of Ablative Experimental Results

Table 2 reports the raw results on all the benchmark tasks and datasets. To facilitate easier comparison across
setups, we also report relative comparisons against well-established baselines such as T5 and GPT models.
This is reported in Tables 3 and 4 respectively.
5 Not to be confused with the PrefixLM pretraining objective.
4.3.1 Decoder Vs Encoder-Decoder
Before we dive into the results of this segment, we would like to remind readers that there is no easy way
to compare decoder-only models with encoder-decoder models. In short, we can either compare them in a
compute-matched setup or a parameter-matched way. Therefore, the encoder-decoder models in this set of results have approximately twice the number of parameters of the decoder models but have similar speeds. We note that this may slightly favor encoder-decoders, since this can be interpreted as a form of model sparsity.
Moving back to the results, when using T5 as the reference baseline, we note that, with the exception of
UL2 Decoder, none of the pre-trained decoder models outperform T5. Additionally, there is a 10% to 30%
degradation in overall relative performance. The best decoder baseline model here is the Prefix-LM decoder
model, which is about 10% worse than the T5 baseline. It is clear from these results that encoder-decoder
models should be preferred over decoder-only models if and only if there is no concern about storage, i.e.,
parameter counts are generally less important than actual throughput (see (Dehghani et al., 2021a) for a
detailed discussion).
When there is a parameter constraint, the Prefix-LM decoder makes for a suitable alternative. Finally, an
interesting data point is how we were able to push the UL2 decoder to outperform the T5 encoder-decoder
setup by +14.6%. That said, this UL2 decoder does not outperform our UL2 encoder-decoder. However,
this reinforces our point that the self-supervision objective may be intrinsically more important than the
backbone architecture and negotiating architectural choices is mainly about efficiency trade-offs that can be
studied independently.
4.3.2 Is GPT and/or T5 the optimal setup?

Based on the relative comparisons against a GPT-like (causal LM + decoder) and T5-like (span corruption +
encoder decoder) setup, we are able to easily identify if the well-established setups are indeed optimal or
already close to optimal. Firstly, the causal LM (GPT-like) setup appears to be the worst configuration, as it
is outperformed by all our baselines. We thus make the straightforward recommendation of always at least
training with Prefix-LM or UniLM whenever possible. The best decoder-only model (with the exception
of UL2) is the Prefix-LM pre-training that keeps a memory prefix for a language model to condition on.
Regarding Prefix-LM pre-training, it is interesting that the Prefix-LM encoder-decoder actually outperforms the T5 span corruption setup by +16.7%. The Prefix-LM encoder-decoder model is indeed less effective than the default T5 model on SuperGLUE, but is, on the whole, stronger, especially when it comes to one-shot or open text generation. Overall, between the Prefix-LM and the span corruption encoder-decoder model (T5), it is unclear which is the universally superior model, as there are gives and takes across the different sub-tasks, although it is worth noting that the Prefix-LM EncDec model only sacrifices minor degradation on certain tasks for a large multifold increase on other tasks.
4.3.3 On the Performance of UniLM and SCLM

On the encoder-decoder setup, both the UniLM and SCLM objectives perform better than the standard span corruption objective in terms of aggregated and normalized overall gain. This shows that, in general, mixing pre-training objectives is helpful. On the decoder setup, there is an overall gain of +9.4% for UniLM and +16.1% for SCLM compared to the baseline causal LM. In terms of individual tasks, UniLM and SCLM both outperform T5 on 6 out of 9 tasks. It is also noteworthy that SCLM performs the best of all models on one-shot generation (SGD and ToTTo).
4.3.4 On the Performance of the Proposed UL2

Finally, we note that UL2 performs the best when compared against both the GPT-like model and the T5-like model. Overall, UL2 outperforms T5 by +43.6% and the GPT-like CLM decoder model by +76.1%.
Table 5: Effect of different paradigm prompts on 1-shot evaluation, using a Encoder-Decoder architecture
pre-trained using UL2 on 7B tokens.
Table 6: Ablation study for Mixture-of-Denoisers. Span, Rate and SD are in percentages (%). We report
SuperGLUE score (SG) and XSUM Rouge-L (XS).
This is the highest relative (overall) gain compared to all other alternatives. We also note that UL2 outperforms T5 on all 9 of the 9 considered individual tasks. Hence, UL2 is a universally better option compared to the span corruption T5 model. While UL2 doesn't always outperform all baselines on all individual tasks, UL2 is very consistent. Even when it loses to another method on a task, the loss is relatively marginal (e.g., 6.5 vs 7.3 on one-shot ToTTo). Conversely, when UL2 outperforms a baseline like T5, the gain can be as large as +363%. UL2 remains the most consistently strong method. This consistent improvement also suggests that it can serve as a more consistent replacement for T5 and GPT-like models.
4.4 Mode Switching Ablations

In order to ascertain that our mode switching capability has an effect on performance, we conduct ablation experiments on one-shot XSum and one-shot SuperGLUE. Table 5 reports the result of varying the paradigm prompt given to the model. Firstly, we observe that the prompt has quite a substantial effect on model performance – i.e., using the right or wrong prompt can lead to a 48% gap in performance (on XSum, Rouge-1). SuperGLUE, on the other hand, is less sensitive to prompting. On SuperGLUE, using prompts is almost always better than not using prompts during one-shot eval. However, for XSum, getting the prompt right seems to be crucial for good performance.
4.5 Mixture-of-Denoisers Ablations

We conduct extensive experiments to verify the effectiveness of the individual objectives within the MoD objective. Table 6 reports results for these ablations. We report results for varying the mean span and corruption rate,
along with the percentage of S-denoising used (denoted by % SD). Note that the total number of denoisers in a mixture is |Span| × |Corruption Rate| + 1 (e.g., two span values and two corruption rates plus the single S-denoiser yield five denoisers). We label these configurations Var-A through Var-J to refer to them easily.
X-Denoising is Complementarily Effective but Does Not Suffice as a Standalone We observe that mixing
Extreme Denoising is effective. Most of the best results across the board come from mixtures with long spans
(e.g., 32 or 64). When compared with variants without long spans (Var-D vs. Var-C), we see that Var-D
is strictly better. We also draw the reader's attention to Var-H, which is a variant that only employs long
spans. In general, Var-H performs poorly, suggesting that extreme denoising complements regular denoising
but does not suffice in isolation. This also corroborates the result from Raffel et al. (2019) that shows that a
50% corruption rate does not perform well. This slightly conflicts with the finding of (Wettig et al., 2022), although our architectures use an inputs-to-targets form of pretraining instead of BERT-style masked language
modeling.
Small Amounts of S-Denoising are Preferred We explore a setting where we scale S-denoising to 50% of the entire MoD mixture. We find that this generally hurts performance. Hence, we conclude that S-denoising is necessary, but only small amounts of S-denoising (≈ 20%) are preferred. Var-K and Var-L also explore the case where there is no S-denoising at all. While performance on one task substantially improves (SuperGLUE), another substantially degrades (one-shot XSUM). Meanwhile, Var-L, which is identical to Var-F but without S-denoising, performs substantially worse on the whole. Hence, we show that S-denoising is crucial.
4.6 Modestly Scaling Model Size and Pretraining Data

We conduct additional experiments by scaling up both 1) the model size and 2) the pre-training dataset size. Concretely, we scale the UL2 Encoder-Decoder model up to approximately 1B parameters and increase the number of pre-training tokens to 0.5 trillion. Our motivation is to sanity check that the proposed formulation also works at different model scales and to observe whether there are differences and implications of operating at a larger scale. Moreover, it has also become a staple of language model research to derive scaling laws (Kaplan et al., 2020; Tay et al., 2021b). Table 7 reports results in this scaled setting. At this larger scale, we find that the proposed UL2 encoder-decoder model is still competitive. A key difference now is that UL2 falls behind T5 (1B) on the SuperGLUE suite. However, this is compensated for by not only outperforming T5 on 7 out of 8 tasks but also improving performance by 2-4 times on one-shot evaluation. The gains on supervised fine-tuning are smaller, but still noticeable across the board on XSUM, SGD and ToTTo.
Table 7: Experiments with moderately scaled up models in terms of model compute (e.g., 1B for EncDec and
0.5B for decoder-only) and dataset size (0.5T tokens).
5 Scaling to 20B Parameters

We are also interested in evaluating UL2 in a scaled-up setting. Following the insights obtained from our ablative experiments, we use an encoder-decoder architecture for this run. While UL2 is architecture agnostic, our soft advice here is to probably default to an encoder-decoder architecture due to its intrinsic sparsity.
We train UL2 at a scale of approximately 20B total parameters. Compared to truly large language models (Du
et al., 2021; Chowdhery et al., 2022), 20B represents a medium scale model that we train as a proof-of-concept
resembling a hint of what UL2 can do at a relatively larger scale than our ablation experiments. Admittedly,
not much thought was put into the exact parameter count of this model, i.e., we were training a 20B model
already for some time and decided to see it to convergence. Additionally, we note that spiking and instabilities
are common when scaling up models due to a potential barrage of reasons (data corruption, intermittent
hardware issues like pre-emption). In this run we did not specifically control or put in place any mitigation
strategies such as occasional restarts as we were not attentively monitoring the job. Hence, we find occasional
loss spikes in the training of this 20B model. However, since many finetuning experiments using these
checkpoints still often result in sota performance, we let it be for now and leave a properly monitored run for
future work. Despite obtaining sota performance on 50+ NLP benchmarks, we expect the currently presented results to still be an underestimate of the true potential of the model. We leave properly scaling UL2 to truly
large scale to future work.
We follow the same training protocol in earlier experiments by pretraining on the C4 corpus but by also scaling
the number of tokens the model sees during pretraining. We use a batch size of 1024 and 512 TPUv4 chips
for pretraining this model. The model is trained on a total of 1 trillion tokens on C4 (2 million steps). The
sequence length is set to 512/512 for inputs and targets. Dropout is set to 0 during pretraining. Pre-training took slightly more than one month for about 1 trillion tokens. We use the same mixture of denoisers as in earlier sections. The model has 32 encoder layers and 32 decoder layers, a model dimension (dmodel) of 4096 and a feed-forward dimension (dff) of 16384. The dimension of each head is 256 for a total of 16 heads. Our model uses a model parallelism of 8. We retain the same 32k-vocabulary SentencePiece tokenizer as T5. Hence, UL20B can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs.
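For reference, the UL20B hyperparameters described above can be summarized as the following configuration sketch; the field names are illustrative and do not correspond to exact Gin/T5X parameter names.

```python
UL2_20B_CONFIG = dict(
    num_encoder_layers=32,
    num_decoder_layers=32,
    d_model=4096,
    d_ff=16384,
    num_heads=16,
    head_dim=256,
    vocab_size=32_000,              # same SentencePiece vocabulary as T5
    model_parallelism=8,
    dropout_rate=0.0,               # dropout disabled during pretraining
    input_length=512,
    target_length=512,
    batch_size=1024,
    pretraining_tokens=int(1e12),   # ~1 trillion tokens of C4 (2M steps)
)
```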
Similar to earlier experiments, UL20B is trained with Jax and T5X infrastructure. We release and open source
T5X-based model checkpoints of this 20B model.
We conduct experiments on both finetuning and in-context learning. For supervised finetuning, our models are continuously finetuned after every N pretraining steps, where N is typically from 50k to 100k. In other words, after each N steps of pretraining, we finetune on each downstream task and record its results. This is generally done in a manual fashion. While some tasks were finetuned on earlier pretrained checkpoints while the model was still pretraining, many were finetuned on checkpoints nearer to convergence that we release. As we continuously finetune, we stop finetuning on a task once it has reached sota to save compute. Finetuning is generally done on a per-task basis and not co-trained. Details of the tasks where co-training is performed are found in the appendix. We leave the combination of massive multi-task training (Aribandi et al., 2021) and UL2 to future work.
For supervised finetuning, we generally adopt a learning rate in the range of {5 × 10−5, 1 × 10−5, 1 × 10−4} using the Adafactor optimizer. The general recipe is that we reset the Adafactor optimizer states and/or adopt a loss normalization based on the number of real target tokens. This is reminiscent of the PaLM finetuning setup (Chowdhery et al., 2022). Batch size generally ranges from 32 to 128, although we did not find much impact of batch size on finetuning performance. Many of the evaluated tasks were not tuned much and we only ran once or twice before performing leaderboard submissions.
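A sketch of the loss normalization we refer to, i.e., dividing the summed cross-entropy by the number of real (non-padding) target tokens; the padding id and function name are assumptions for illustration.

```python
import numpy as np


def token_normalized_loss(token_losses, target_tokens, pad_id=0):
    """Sum per-token losses and normalize by the count of real target tokens.

    `token_losses` and `target_tokens` are arrays of shape [batch, length];
    padding positions contribute neither to the numerator nor the denominator.
    """
    mask = (target_tokens != pad_id).astype(np.float32)
    return float((token_losses * mask).sum() / np.maximum(mask.sum(), 1.0))
```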
5.2.2 Datasets for Supervised Finetuning
To demonstrate the universality of the approach, we consider a total of over 50 NLP tasks. We list our categorization of tasks below. Note that the categorization of tasks is generally soft in nature and some tasks may cross different categorization boundaries.
• Language Generation - We consider summarization and data-to-text generation tasks. We use CN-
N/Dailymail (Hermann et al., 2015), XSUM (Narayan et al., 2018), MultiNews (Fabbri et al., 2019),
SAMSum (Gliwa et al., 2019), WebNLG (Castro Ferreira et al., 2020) (English), E2E (Dušek et al., 2019)
and CommonGen (Lin et al., 2020) to evaluate our models. For WebNLG, E2E and CommonGen, we
use the versions from the GEM benchmark (Gehrmann et al., 2021).
• Language Generation with Human Evaluation - We evaluate on a variety of text generation tasks
using human evaluation, via the GENIE leaderboard (Khashabi et al., 2021). These tasks include aNLG
(Bhagavatula et al., 2019), ARC-DA (Clark et al., 2018), WMT19 (Foundation), and XSUM (Narayan
et al., 2018).
• Long Range Reasoning - We use the Scrolls benchmark (Shaham et al., 2022), which comprises
seven component tasks including GovReport (Huang et al., 2021), SumScr (Chen et al., 2021), QMSUm
(Zhong et al., 2021), QASPER (Dasigi et al., 2021), NarrativeQA (Kočiský et al., 2018), QuaLITY (Pang
et al., 2021), and ContractNLI (Koreeda & Manning, 2021).
• Structured Knowledge Grounding - We use several component tasks from UnifiedSKG (Xie et al.,
2022), namely WikiTQ (Pasupat & Liang, 2015), CompWQ (Talmor & Berant, 2018), FetaQA (Nan
et al., 2021), HybridQA (Chen et al., 2020), WikiSQL (Zhong et al., 2017), TabFat (Chen et al., 2019),
Feverous (Aly et al., 2021), SQA (Iyyer et al., 2017), MTOP (Li et al., 2020) and DART (Nan et al., 2020).
We select datasets that are relatively convenient to evaluate and that use mainstream metrics such as accuracy or exact match instead of obscure ones or those that require significant domain-specific post-processing.
• Information Retrieval - IR is the task of retrieving relevant documents given queries. We use the setup
of the latest next generation IR paradigm, i.e., differentiable search index (Tay et al., 2022) for our
experiments. We use the same NQ (Kwiatkowski et al., 2019) splits in the DSI paper.
For each dataset, we report the best previous sota result. For generation tasks, we generally report ROUGE-2
following the advice of (Gehrmann et al., 2022). For the rest of the datasets, we report the dominant metric
that is reported in prior work. For BLEU scores, we use sacrebleu. For commonsense reasoning tasks, we do
not compare against approaches that use external knowledge bases as they are orthogonal and out of scope
for this paper. For the most part, GLUE is generally considered to be saturated and there are many unpublished results on the GLUE leaderboard. For this reason, we make the reasonable decision of considering (Raffel et al., 2019) to be the state-of-the-art, since we believe that there has not been any real advance on the GLUE benchmark since the T5 model (Raffel et al., 2019). GLUE results, given how saturated it already is, are provided as a reference and should be taken with a pinch of salt.
Generally, we make a best effort to submit scores to any leaderboard (unpublished test set) but refrain from doing so in cases where the labor cost of making such a submission is prohibitive - especially when the existing state-of-the-art approach has made its dev scores available or when reporting on a particular dataset is only for completeness (e.g., GLUE). We advise readers not to overthink the differences between dev and test, since (1) in most academic leaderboards dev and test results align, both in our own experience and as can be empirically observed, and (2) the real test is production anyway. Whenever reporting on a leaderboard, we consider the top-performing published work as SOTA and indicate in our results, using the # symbol, that there might be some anonymous submission that has scored higher. For this purpose we consider arxiv preprints of reasonable quality to count as published work. These results and comparisons are accurate as of 15th April 2022, when we stopped experiments to focus on polishing this paper. We later realized, while preparing to put this paper up on arxiv, that there have been new results on the Scrolls benchmark from a model (Guo et al., 2021) using 16k sequence lengths, as opposed to ours, which we kept at 2k once we had obtained sota. It is expected that increasing the sequence length for UL2 would significantly improve our scores, likely above the current sota, but in the interest of logistics and timeline we leave that to future work.
Table 8: Summary of UL2 20B results compared to state-of-the-art. (l) denotes a leaderboard submission. (])
denotes the best published result we could find on the leaderboard. (e) denotes that the SOTA used an ensembled
approach. Because we evaluate finetuning and in-context trade-offs for SuperGLUE, SuperGLUE scores are given
their own dedicated section below.
5.2.4 Results on Supervised Finetuning
Our experimental results show that UL2 achieves state-of-the-art performance on over 50 NLP tasks and
setups. For many of them, the margins are quite wide, and for those on which UL2 does not achieve SOTA, its
performance is generally quite competitive. It is worth noting that the difficulty of obtaining SOTA varies
vastly across benchmarks. For some, the SOTA model is a 32B dense equivalent (Zoph et al., 2022); for others,
it is a base model. It is also worth noting that many benchmarks have a strong, relatively large model, e.g., a
3B or 11B T5, UnifiedQA (Khashabi et al., 2020) or Unicorn (Lourie et al., 2021), as the existing SOTA model,
so outperforming these models is also not an easy feat. Overall, we urge readers to judge the value of these
SOTA results for themselves. Finally, we note that UL2 20B does well on human evaluation on the GENIE
tasks, outperforming SOTA on several metrics. This confirms that the generation quality of UL2 is reasonably
solid.
In this section, we explore finetuning and in-context learning trade-offs on the SuperGLUE benchmark. We
conduct experiments on SuperGLUE with UL2 20B. While UL2 20B does not achieve SOTA on this benchmark,
it remains competitive and outperforms T5-11B. This section confirms that UL2 indeed scales and matches or
slightly outperforms T5-11B on SuperGLUE (while strongly outperforming T5-XXL on many other in-context
tasks). UL2 20B still lags behind the SOTA model, ST-MoE-32B, for two main reasons. Firstly, ST-MoE-32B has
200B+ parameters and a compute cost equivalent to a 32B dense model. Secondly, ST-MoE-32B is trained
solely on span corruption with an encoder-decoder architecture, a setup known to be very advantageous for
NLU finetuning.
Table 9: Results on SuperGLUE dev set. We compare with T5-11B (Raffel et al., 2019), ST-MoE-32B (Zoph
et al., 2022) and PaLM-8B, PaLM-62B and PaLM-540B (Chowdhery et al., 2022). Scores reported are the peak
validation scores per task.
Finally, we conduct additional one-shot in-context learning experiments on the XSum summarization dataset.
We compare our model with the baseline T5-XXL, T5-XXL with LM adaptation (Lester et al., 2021), LaMDA
137B (Thoppilan et al., 2022), and PaLM (8B, 62B, 540B) (Chowdhery et al., 2022). We run T5-XXL ourselves
in the same experimental setup but report results from Chowdhery et al. (2022) for the other models.
Table 10: Results on zero-shot learning on SuperGLUE dataset. We compare with GPT-3, GLaM and PaLM
(Chowdhery et al., 2022). We also include models that are relatively compute-matched with UL20B such as
T5-XXL with LM adaptation (Lester et al., 2021), GPT-3 13B and GLaM-8B dense. Notably, UL20B outperforms
GPT-3 175B and all other models in a similar compute class on average score.
Table 11 reports results on 1-shot summarization. Our results show that the performance of UL2 20B is about
3x the performance of the LM-adapted T5-XXL model. Moreover, UL2 20B outperforms LaMDA 137B and
performs better than PaLM 8B, which is approximately compute-matched with UL2. The best results, however,
are still obtained by the larger 62B and 540B PaLM models.
It has recently been shown that language models at scale can perform multi-step reasoning tasks such as
math word problems or commonsense reasoning via chain-of-thought prompting, which prompts the model to
generate a step-by-step reasoning path before giving the final answer (Wei et al., 2022b). Notably, chain-of-
thought (CoT) prompting does not require any additional fine-tuning of the model.
A crucial consideration of CoT prompting is that it is an emergent ability of scale (Wei et al., 2022a)—it
requires a sufficiently large language model to improve performance, and actually hurts performance for
small language models. Hence, the successful use cases of chain-of-thought prompting use either LaMDA
137B (Thoppilan et al., 2022), PaLM 540B (Chowdhery et al., 2022), or OpenAI models (Brown et al., 2020;
Ouyang et al., 2022). These models, however, are compute intensive and not available as public checkpoints.
Here we demonstrate that UL2 20B is the first publicly available pre-trained model (without any fine-tuning)
to successfully leverage CoT prompting to solve multi-step arithmetic and commonsense tasks. We use the
same benchmark tasks and prompts from Wei et al. (2022b). In Table 12 below, we see that on five arithmetic
reasoning datasets, CoT prompting outperforms standard prompting (directly outputting the answer without
a chain of thought) for UL2 20B. Similar to Wei et al. (2022b), we also show that CoT prompting can be
augmented with an external calculator (“calc.”) that performs only the arithmetic computations (+, −, ×, /),
which further improves performance by a large margin. In addition, we add self-consistency (Wang et al.,
2022b) (denoted “SC”) on top of CoT prompting and observe significant gains consistently across all
benchmarks, with an average improvement of 22.5% over standard prompting.
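To make the two prompting formats concrete, below is a minimal sketch contrasting standard prompting with chain-of-thought prompting on a toy arithmetic question; the exemplar text is illustrative and is not the exact prompt set of Wei et al. (2022b).

# Illustrative prompt construction for standard vs. chain-of-thought prompting.
# The few-shot exemplar below is a toy example in the spirit of Wei et al.
# (2022b); it is not the exact prompt used in our experiments.

standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: 11\n\n"
    "Q: A baker made 24 muffins and sold 9. How many muffins are left?\n"
    "A:"
)

cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    "Q: A baker made 24 muffins and sold 9. How many muffins are left?\n"
    "A:"
)

# With standard prompting the model is expected to emit the answer directly;
# with CoT prompting it first generates a step-by-step rationale and then the
# final answer, whose arithmetic steps can optionally be handed to an external
# calculator as described above.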
Table 12: Chain-of-thought prompting and self-consistency (SC) results on five arithmetic reasoning bench-
marks. GSM8K: (Cobbe et al., 2021). SVAMP: (Patel et al., 2021). ASDiv: (Miao et al., 2020). AQuA: (Ling
et al., 2017). MAWPS: (Koncel-Kedziorski et al., 2016).
In addition to arithmetic reasoning, Table 13 shows the performance of CoT prompting using UL2 20B
compared to standard prompting on five commonsense reasoning benchmarks. CoT prompting plus self-
consistency outperforms standard prompting in four of the five benchmarks, with an average improvement
of 14.4%.
Table 13: Chain-of-thought prompting and self-consistency (SC) results on five commonsense reasoning
benchmarks. CSQA: (Talmor et al., 2019). StrategyQA: (Geva et al., 2021). Date Understanding and Sports
Understanding: (Srivastava et al., 2022). ARC-easy/challenge: (Clark et al., 2018).
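As a rough sketch of how self-consistency is layered on top of CoT prompting, the snippet below samples several chains of thought and takes a majority vote over the extracted final answers; the sample_with_cot function and the answer-extraction regex are hypothetical placeholders for illustration.

import collections
import re

def self_consistency_answer(question, sample_with_cot, num_samples=20):
  # `sample_with_cot` is assumed to be a user-supplied function that sends a
  # chain-of-thought prompt for `question` to the model with temperature
  # sampling enabled and returns the generated text.
  votes = collections.Counter()
  for _ in range(num_samples):
    generation = sample_with_cot(question)
    # Toy answer extraction: take the last number in the generation.
    numbers = re.findall(r"-?\d+\.?\d*", generation)
    if numbers:
      votes[numbers[-1]] += 1
  # Majority vote over the sampled final answers (Wang et al., 2022b).
  return votes.most_common(1)[0][0] if votes else None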
Overall, we have shown that whereas prior CoT work has required large pre-trained models such as PaLM
540B, UL2 20B is a comparatively small model that can also perform multi-step reasoning. We hypothesize that
the mixture of denoisers may contribute to the ability of UL2 to leverage CoT prompting at 20B parameters,
although we leave further investigation of what unlocks emergent chain-of-thought reasoning to future work.
Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2021) is a collection of 57 tasks
covering a wide range of topics (humanities, social sciences, hard sciences, etc.). Strong performance on
MMLU requires extensive world knowledge as well as problem solving skills.
For MMLU, we compare with T5 model variants, including the language model adapted variant (Lester et al.,
2021) and T0 (Sanh et al., 2021). For the latter, we use “T0 Strawberry” and “T0 Vanilla” as these are the
models recommended for research purposes. We report 0-shot performance. T0 models are specifically
finetuned for 0-shot, and hence we believe this is a conservative setting for testing the efficacy of UL2. Table 14
shows that the LM-adapted T5-XXL model barely gives above-random performance (random chance is 25%).
UL2 outperforms both the T0 and T5 models.
Table 14: MMLU 0-shot performance (accuracy).
Model MMLU
T5-XXL + LM 27.5
T0 Strawberry 36.9
T0 Vanilla 34.5
UL2 20B 39.2
Inspired7 by Chung et al. (2022), we apply Flan instruction tuning to the UL2 20B checkpoint. We largely use
the same settings and Flan mixture as the Flan 2 paper (Chung et al., 2022). Because the Flan mixtures do not
include mode-switching prompts, we first train UL2 for another 100K steps without mode tokens to adapt it.
We increased the length to 1024/1024 during this adaptation phase to accommodate a larger context, and Flan
training itself was done at 2048/512 lengths. We find this ‘mode switching purification’ of the original UL2
checkpoint to be useful, although a more optimal approach would be to add mode tokens to the FLAN tasks.
Rather than modify the FLAN tasks, we simply opt to continue training UL2 for more steps. We release this
Flan-UL2 20B checkpoint at the same URL as the original UL2 checkpoints.
5.3.1 Few-shot MMLU and Big-Bench Results after Flan training of UL2
Model BBH MMLU
OPT 30B 28.0 23.5/25.9
OPT 175B 30.2 27.3/34.2
T5 11B 29.5 -/25.9
OpenAI davinci 33.6 -/32.3
OPT IML-Max 30B 30.0 46.3/43.2
OPT IML-Max 175B 35.7 49.1/47.1
T0pp 11B 13.0 46.7/33.7
FLAN T5 XXL 45.3 54.5/53.7
FLAN-PaLM 62B 47.5 -/59.6
FLAN-PaLM 540B 57.9 -/73.5
FLAN-UL2 20B (Best ckpt for both tasks† ) 45.3 55.6/56.2
FLAN-UL2 20B (Individual task best) 46.0 55.1/58.1
Table 15: Results on MMLU and BBH using FLAN-UL2. † denotes the checkpoint that we release.
Table 15 reports the results on MMLU and BBH (Suzgun et al., 2022). Overall, the performance of FLAN-UL2
20B is quite competitive, outperforming Flan-T5 XXL by +1.8% on the test set and +4.7% on MMLU dev. The
Big-Bench Hard score remains competitive, with the best checkpoint marginally outperforming Flan-T5 XXL.
Notably, the best dev scores of FLAN-UL2 nearly reach the performance of Flan-PaLM 62B on both MMLU and
BBH, suggesting that the results are quite solid.
7 Inspiration is really a stretch, this was an obvious thing to do.
5.3.2 Comparisons on using Chain-of-thought vs Direct Prompting
We compare Flan models in direct and chain-of-thought setups. We finetune Flan-UL2 using the exact same
protocol as T5-XXL and pick the best checkpoint based on the strongest average8 across all four setups
(MMLU/BBH with direct and CoT prompting). We find that Flan-UL2 outperforms Flan-T5-XXL on all four
setups. Notably, the gains are larger on the CoT tasks, especially MMLU-CoT, where the gain is a relative
+7.4%. In general, the CoT variants of these tasks still perform worse than direct prompting, which can also be
observed for Flan-PaLM 62B. Overall, Flan-UL2 comes close to FLAN-PaLM 62B (49.1 vs. 49.9) on average
across all setups. However, it is still strongly outperformed by Flan-PaLM 540B.
We also tried some self-consistency (Wang et al., 2022b) experiments in combination with CoT. In brief
experiments, this raised the CoT score from 53.9 to 57.1 (when the corresponding direct score was 55.4),
showing that at 20B scale, CoT plus self-consistency can outperform direct prompting. We did not experiment
further, since this increases the search space to a point where it was more time-consuming than we would
have liked (or enjoyed). We leave further experiments as an exercise for the reader.
6 Conclusion
We proposed a new paradigm for training universally effective models. UL2 is characterized by two key
ideas. Firstly, we propose a new Mixture-of-Denoisers (MoD) pretraining objective that frames multiple
pretraining tasks as span corruption, diversifies them, and then mixes them. Secondly, we introduce mode
switching, a way of associating downstream task behaviour with upstream pretraining. Extensive ablative
experiments show that UL2 consistently outperforms GPT-like and T5 models on a wide range of supervised
and few-shot tasks, outperforming T5 on 9 out of 9 tasks with a normalized overall gain of +76.1%. Finally, we
scale UL2 up to 20B parameters and conduct experiments on a diverse suite of 50 to 60 NLP tasks and setups.
UL2 achieves SOTA performance on 50 of them. Pretrained checkpoints of UL2 and Flan-UL2 20B are released
at https://github.com/google-research/google-research/tree/master/ul2.
8 These results differ slightly from the earlier setup because we release only the checkpoints that perform best on direct prompting; CoT did worse with these checkpoints.
7 Acknowledgements
The authors would like to specially thank (in alphabetical order): Alexey Gritsenko, Andrew M. Dai, Jacob
Devlin, Jai Gupta, Liam Fedus, Orhan Firat for discussions and feedback that have helped to improve the
paper. We also thank Sebastian Gehrmann for discussions and clarifications on GEM metrics, Nan Du for
clarifications about GLaM’s in-context learning setup, and Dave Uthus for his work on getting the SCROLLS
tasks into seqio format. We thank Slav Petrov and Quoc Le for general advice about UL2. We also thank the T5,
Jax and Flax teams for building such wonderful infrastructure and enabling this research. Finally, we also
thank Tianbao Xie from University of Hong Kong for helping us with UnifiedSKG’s code and dataset.
8 Author Contributions
• Yi Tay proposed the idea, conceived the project, led this effort, and drove the implementation and core
ablation experiments. Yi ran initial ablations and proofs-of-concept and pretrained the 20B model. Yi was
responsible for running most of the finetuning and in-context learning experiments for the 20B model.
• Mostafa Dehghani served as a co-lead of this effort and ran a good portion of the initial experiments and
ablations, especially on SuperGLUE. Mostafa was heavily involved in early brainstorming for this effort
and project, and helped with the open-sourcing process and procedures for UL2.
• Vinh Q. Tran participated substantially in early project discussions and brainstorming and contributed
to the inception of UL2. Vinh implemented and trained UL2 on several tasks/baselines (e.g.,
SAMSum, GENIE human evaluations, CommonsenseQA) for the UL2 20B runs.
• Xavier Garcia helped optimize the UL2 pipeline in seqio and provided many great suggestions for
improving UL2. Xavier also ran machine translation experiments with UL2.
• Jason Wei ran Chain-of-thought experiments on reasoning benchmarks using the UL2 model.
• Xuezhi Wang ran self-consistency experiments on reasoning benchmarks using the UL2 model.
• Hyung Won ran experiments for the MMLU dataset and wrote the section for it.
• Siamak extensively helped with UL2 experiments and infrastructure and with continuously improving the
UL2 algorithm.
• Denny Zhou suggested running chain-of-thought and reasoning experiments with UL2 and helped advise
on the chain-of-thought section.
• Neil and Donald served as technical advisors and sponsors to the project and helped with brainstorming,
feedback, and the writing of the paper.
References
Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, and Sonal Gupta.
Better fine-tuning by reducing representational collapse. arXiv preprint arXiv:2008.03156, 2020.
Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, and Luke Zettlemoyer.
Htlm: Hyper-text pre-training and prompting of language models. arXiv preprint arXiv:2107.06955, 2021.
Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos
Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. The fact extraction and VERification over unstruc-
tured and structured information (FEVEROUS) shared task. In Proceedings of the Fourth Workshop on Fact Ex-
traction and VERification (FEVER), pp. 1–13, Dominican Republic, November 2021. Association for Compu-
tational Linguistics. doi: 10.18653/v1/2021.fever-1.1. URL https://aclanthology.org/2021.fever-1.1.
Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei
Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. Ext5: Towards extreme multi-task scaling for transfer
learning. arXiv preprint arXiv:2111.10952, 2021.
Shreyan Bakshi, Soumya Batra, Peyman Heidari, Ankit Arun, Shashank Jain, and Michael White. Structure-
to-text generation with self-training, acceptability classifiers and context-conditioning for the gem shared
task. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021),
pp. 136–147, 2021.
Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah
Rashkin, Doug Downey, Scott Wen-tau Yih, and Yejin Choi. Abductive commonsense reasoning. arXiv
preprint arXiv:1908.05739, 2019.
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense
in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 7432–7439,
2020.
Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Nuanced metrics
for measuring unintended bias with real data for text classification. CoRR, abs/1903.04561, 2019. URL
http://arxiv.org/abs/1903.04561.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George
Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable
transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.
arXiv preprint arXiv:2005.14165, 2020.
Thiago Castro Ferreira, Claire Gardent, Nikolai Ilinykh, Chris van der Lee, Simon Mille, Diego Moussallem,
and Anastasia Shimorina. The 2020 bilingual, bi-directional webnlg+ shared task overview and evaluation
results (webnlg+ 2020). In Proceedings of the 3rd WebNLG Workshop on Natural Language Generation from the
Semantic Web (WebNLG+ 2020), pp. 55–76, Dublin, Ireland (Virtual), 2020. Association for Computational
Linguistics.
Mingda Chen, Zewei Chu, Sam Wiseman, and Kevin Gimpel. Summscreen: A dataset for abstractive
screenplay summarization. arXiv preprint arXiv:2104.07091, 2021.
Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and
William Yang Wang. Tabfact: A large-scale dataset for table-based fact verification. arXiv preprint
arXiv:1909.02164, 2019.
Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Wang. Hybridqa: A
dataset of multi-hop question answering over tabular and textual data. arXiv preprint arXiv:2004.07347,
2020.
Aakanksha Chowdhery, Sharan Narang, and Jacob Devlin. Palm: Scaling language modeling with pathways.
arXiv preprint, 2022.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang,
Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint
arXiv:2210.11416, 2022.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind
Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR,
abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457.
Jordan Clive, Kris Cao, and Marek Rei. Control prefixes for text generation. CoRR, abs/2110.08329, 2021.
URL https://arxiv.org/abs/2110.08329.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and
John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
URL https://arxiv.org/abs/2110.14168.
Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. Advances in neural information processing
systems, 28:3079–3087, 2015.
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-
seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021.
Mostafa Dehghani, Anurag Arnab, Lucas Beyer, Ashish Vaswani, and Yi Tay. The efficiency misnomer. arXiv
preprint arXiv:2110.12894, 2021a.
Mostafa Dehghani, Yi Tay, Alexey A Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler,
and Oriol Vinyals. The benchmark lottery. 2021b.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-
Wuen Hon. Unified language model pre-training for natural language understanding and generation.
arXiv preprint arXiv:1905.03197, 2019.
Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun,
Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-
experts. arXiv preprint arXiv:2112.06905, 2021.
Ondřej Dušek, David M Howcroft, and Verena Rieser. Semantic Noise Matters for Neural Natural Language
Generation. In Proceedings of the 12th International Conference on Natural Language Generation (INLG 2019),
pp. 421–426, Tokyo, Japan, 2019. URL https://www.aclweb.org/anthology/W19-8652/.
Julian Martin Eisenschlos, Maharshi Gor, Thomas Müller, and William W Cohen. Mate: Multi-view attention
for table transformer efficiency. arXiv preprint arXiv:2109.04312, 2021.
Alexander R Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R Radev. Multi-news: A large-scale
multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749,
2019.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models
with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.
Wikimedia Foundation. Acl 2019 fourth conference on machine translation (wmt19), shared task: Machine
translation of news. URL http://www.statmt.org/wmt19/translation-task.html.
Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu
Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D
Dhole, et al. The gem benchmark: Natural language generation, its evaluation and metrics. arXiv preprint
arXiv:2102.01672, 2021.
Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. Repairing the cracked foundation: A survey of
obstacles in evaluation practices for generated text. arXiv preprint arXiv:2202.06935, 2022.
Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a
laptop? A question answering benchmark with implicit reasoning strategies. TACL, 2021. doi: 10.1162/
tacl_a_00370. URL https://aclanthology.org/2021.tacl-1.21.
Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. Samsum corpus: A human-annotated
dialogue dataset for abstractive summarization. arXiv preprint arXiv:1911.12237, 2019.
Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yinfei Yang.
Longt5: Efficient text-to-text transformer for long sequences. arXiv preprint arXiv:2112.07916, 2021.
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with
disentangled attention. arXiv preprint arXiv:2006.03654, 2020.
Yun He, Huaixiu Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, YaGuang Li, Zhao
Chen, Donald Metzler, et al. Hyperprompt: Prompt-based task-conditioning of transformers. arXiv preprint
arXiv:2203.00759, 2022.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt.
Measuring massive multitask language understanding. In International Conference on Learning Representations,
2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman,
and Phil Blunsom. Teaching machines to read and comprehend. In Advances in neural information processing
systems, pp. 1693–1701, 2015.
Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv
preprint arXiv:1801.06146, 2018.
Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Cosmos qa: Machine reading comprehen-
sion with contextual commonsense reasoning. arXiv preprint arXiv:1909.00277, 2019.
Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long
document summarization. arXiv preprint arXiv:2104.02112, 2021.
Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. Search-based neural structured learning for sequential
question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pp. 1821–1831, 2017.
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. Spanbert:
Improving pre-training by representing and predicting spans. Transactions of the Association for Computational
Linguistics, 8:64–77, 2020.
Norman P Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and
David Patterson. A domain-specific supercomputer for training deep neural networks. Communications of
the ACM, 63(7):67–78, 2020.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray,
Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint
arXiv:2001.08361, 2020.
Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh
Hajishirzi. Unifiedqa: Crossing format boundaries with a single qa system. arXiv preprint arXiv:2005.00700,
2020.
Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A.
Smith, and Daniel S. Weld. GENIE: A leaderboard for human-in-the-loop evaluation of text generation.
CoRR, abs/2101.06561, 2021. URL https://arxiv.org/abs/2101.06561.
Daniel Khashabi, Yeganeh Kordi, and Hannaneh Hajishirzi. Unifiedqa-v2: Stronger generalization via broader
cross-format training. arXiv preprint arXiv:2202.12359, 2022.
Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. Qasc: A dataset for question
answering via sentence composition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34,
pp. 8082–8090, 2020.
Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and
Edward Grefenstette. The NarrativeQA Reading Comprehension Challenge. Transactions of the Association
for Computational Linguistics, 2018.
Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS: A math
word problem repository. NAACL, 2016. doi: 10.18653/v1/N16-1136. URL https://aclanthology.org/
N16-1136.
Yuta Koreeda and Christopher D Manning. Contractnli: A dataset for document-level natural language
inference for contracts. arXiv preprint arXiv:2110.01799, 2021.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti,
Danielle Epstein, Illia Polosukhin, Jacob Devlin Kenton Lee, Kristina Toutanova, Llion Jones Matthew
Kelcey, Ming-Wei Chang, Andrew M Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural Questions: a
Benchmark for Question Answering Research. In Transactions of the ACL, 2019.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding compre-
hension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural
Language Processing, pp. 785–794, Copenhagen, Denmark, September 2017. Association for Computational
Linguistics. doi: 10.18653/v1/D17-1082. URL https://aclanthology.org/D17-1082.
Alyssa Lees, Vinh Q Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, and Lucy Vasserman. A
new generation of perspective api: Efficient multilingual character-level transformers. arXiv preprint
arXiv:2202.11176, 2022.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim
Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation
and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning.
arXiv preprint arXiv:2104.08691, 2021.
Haoran Li, Abhinav Arora, Shuohui Chen, Anchit Gupta, Sonal Gupta, and Yashar Mehdad. Mtop: A
comprehensive multilingual task-oriented semantic parsing benchmark. arXiv preprint arXiv:2008.09335,
2020.
Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren.
CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings
of the Association for Computational Linguistics: EMNLP 2020, pp. 1823–1840, Online, November 2020. Associ-
ation for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.findings-emnlp.
165.
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation:
Learning to solve and explain algebraic word problems. ACL, 2017. doi: 10.18653/v1/P17-1015. URL
https://aclanthology.org/P17-1015.
Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer.
Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke
Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint
arXiv:1907.11692, 2019.
Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Unicorn on rainbow: A universal
commonsense reasoning model on a new multitask benchmark. arXiv preprint arXiv:2103.13009, 2021.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts.
Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011.
Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.
Shen Yun Miao, Chao Chun Liang, and Keh Yih Su. A diverse corpus for evaluating and developing English
math word problem solvers. ACL, 2020. doi: 10.18653/v1/2020.acl-main.92. URL https://aclanthology.
org/2020.acl-main.92.
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a
new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of
words and phrases and their compositionality. In Advances in neural information processing systems, pp.
3111–3119, 2013.
Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru
Tang, Aadit Vyas, Neha Verma, Pranav Krishna, et al. Dart: Open-domain structured data record to text
generation. arXiv preprint arXiv:2007.02871, 2020.
Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kryściński,
Nick Schoelkopf, Riley Kong, Xiangru Tang, et al. Fetaqa: Free-form table question answering. arXiv
preprint arXiv:2104.00369, 2021.
Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma
Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, et al. Do transformer modifications transfer across
implementations and applications? arXiv preprint arXiv:2102.11972, 2021.
Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware
convolutional neural networks for extreme summarization. ArXiv, abs/1808.08745, 2018.
Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Simões, Vitaly Nikolaev, and Ryan McDonald. Planning
with learned entity prompts for abstractive summarization. Transactions of the Association for Computational
Linguistics, 9:1475–1492, 2021. doi: 10.1162/tacl_a_00438. URL https://aclanthology.org/2021.tacl-1.
88.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A
new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599, 2019.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with
human feedback. arXiv preprint arXiv:2203.02155, 2022. URL https://arxiv.org/abs/2203.02155.
Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh
Padmakumar, Johnny Ma, Jana Thompson, He He, et al. Quality: Question answering with long input
texts, yes! arXiv preprint arXiv:2112.08608, 2021.
Ankur P Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and
Dipanjan Das. Totto: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373, 2020.
Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. arXiv
preprint arXiv:1508.00305, 2015.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word
problems? NAACL, 2021. URL https://aclanthology.org/2021.naacl-main.168.pdf.
Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representa-
tion. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp.
1532–1543, 2014.
Guanghui Qin, Yukun Feng, and Benjamin Van Durme. The nlp task effectiveness of long-range transformers.
arXiv preprint arXiv:2202.07856, 2022.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models
are Unsupervised Multitask Learners. 2019.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.
arXiv preprint arXiv:1910.10683, 2019.
Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. Towards scalable
multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855,
2019.
Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan
Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu,
Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko,
Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen
Kenealy, Jonathan H. Clark, Stephan Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer,
Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick,
Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, and Andrea Gesmundo. Scaling up
models and data with t5x and seqio, 2022. URL https://arxiv.org/abs/2203.17189.
Anna Rogers, Olga Kovaleva, Matthew Downey, and Anna Rumshisky. Getting closer to ai complete question
answering: A set of prerequisite real tasks. In Proceedings of the AAAI conference on artificial intelligence,
volume 34, pp. 8722–8731, 2020.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert:
smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin,
Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task
generalization. arXiv preprint arXiv:2110.08207, 2021.
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense
reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
Tal Schuster, Adam Fisch, and Regina Barzilay. Get your vitamin c! robust fact verification with contrastive
evidence. arXiv preprint arXiv:2103.08541, 2021.
Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva,
Jonathan Berant, and Omer Levy. Scrolls: Standardized comparison over long language sequences, 2022.
Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In
International Conference on Machine Learning, pp. 4596–4604. PMLR, 2018.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro.
Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint
arXiv:1909.08053, 2019.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch,
Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game:
Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha
Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-
thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
Alon Talmor and Jonathan Berant. The web as a knowledge-base for answering complex questions. arXiv
preprint arXiv:1803.06643, 2018.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering
challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question
answering challenge targeting commonsense knowledge. NAACL, 2019. doi: 10.18653/v1/N19-1421. URL
https://aclanthology.org/N19-1421.
Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan
Berant. CommonsenseQA 2.0: Exposing the limits of AI through gamification. In Thirty-fifth Conference
on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https://
openreview.net/forum?id=qF7FlUT5dxa.
Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, and Da-Cheng Juan. Hypergrid: Efficient multi-task trans-
formers with grid-wise decomposable hyper projections. arXiv preprint arXiv:2007.05891, 2020.
Yi Tay, Mostafa Dehghani, Jai Gupta, Dara Bahri, Vamsi Aribandi, Zhen Qin, and Donald Metzler. Are
pre-trained convolutions better than pre-trained transformers? arXiv preprint arXiv:2105.03322, 2021a.
Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang,
Dani Yogatama, Ashish Vaswani, and Donald Metzler. Scale efficiently: Insights from pre-training and
fine-tuning transformers. arXiv preprint arXiv:2109.10686, 2021b.
Yi Tay, Vinh Q Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgart-
ner, Cong Yu, and Donald Metzler. Charformer: Fast character transformers via gradient-based subword
tokenization. arXiv preprint arXiv:2106.12672, 2021c.
Yi Tay, Vinh Q Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao,
Jai Gupta, et al. Transformer memory as a differentiable search index. arXiv preprint arXiv:2202.06991, 2022.
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng,
Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri,
Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong
Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will
Rusch, Marc Pickett, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos,
Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson,
Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna,
Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise
Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. Lamda: Language models for dialog
applications, 2022.
Hui Wan. Multi-task learning with multi-head attention for multi-choice reading comprehension. arXiv
preprint arXiv:2003.04992, 2020.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-
task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,
2018.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy,
and Samuel R Bowman. Superglue: A stickier benchmark for general-purpose language understanding
systems. arXiv preprint arXiv:1905.00537, 2019.
Boxin Wang, Shuohang Wang, Yu Cheng, Zhe Gan, Ruoxi Jia, Bo Li, and Jingjing Liu. Infobert: Improving
robustness of language models from an information theoretic perspective. arXiv preprint arXiv:2010.02329,
2020.
Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay,
and Colin Raffel. What language model architecture and pretraining objective work best for zero-shot
generalization? arXiv preprint arXiv:2204.05832, 2022a.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and
Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint
arXiv:2203.11171, 2022b. URL https://arxiv.org/abs/2203.11171.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten
Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint
arXiv:2206.07682, 2022a. URL https://arxiv.org/abs/2206.07682.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of
thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022b. URL
https://arxiv.org/abs/2201.11903.
Alexander Wettig, Tianyu Gao, Zexuan Zhong, and Danqi Chen. Should you mask 15% in masked language
modeling? arXiv preprint arXiv:2202.08005, 2022.
Wen Xiao, Iz Beltagy, Giuseppe Carenini, and Arman Cohan. Primer: Pyramid-based masked sentence
pre-training for multi-document summarization. arXiv preprint arXiv:2110.08499, 2021.
Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng
Wu, Ming Zhong, Pengcheng Yin, Sida I Wang, et al. Unifiedskg: Unifying and multi-tasking structured
knowledge grounding with text-to-text language models. arXiv preprint arXiv:2201.05966, 2022.
Wenhan Xiong, Jiawei Wu, Hong Wang, Vivek Kulkarni, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and
William Yang Wang. Tweetqa: A social media focused question answering dataset. arXiv preprint
arXiv:1907.06292, 2019.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua,
and Colin Raffel. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint
arXiv:2010.11934, 2020.
Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and
Colin Raffel. Byt5: Towards a token-free future with pre-trained byte-to-byte models. arXiv preprint
arXiv:2105.13626, 2021.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet:
Generalized autoregressive pretraining for language understanding. Advances in neural information processing
systems, 32, 2019.
Wenpeng Yin, Dragomir Radev, and Caiming Xiong. Docnli: A large-scale dataset for document-level natural
language inference. arXiv preprint arXiv:2106.09449, 2021.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really
finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification.
Advances in neural information processing systems, 28, 2015.
Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli
Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new benchmark for query-based multi-domain
meeting summarization. arXiv preprint arXiv:2104.05938, 2021.
Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural
language using reinforcement learning. CoRR, abs/1709.00103, 2017.
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William
Fedus. Designing effective sparse expert models, 2022.
9 Appendix
As part of this work, we release the weights of the 20B checkpoint. The weights of the model can be found in
this GCP bucket (gs://scenic-bucket/ul2). These checkpoints were trained with T5X (Roberts et al., 2022),
found at https://github.com/google-research/t5x, and are implemented in JAX/Flax. Because the
finetuning results are generally not from a single checkpoint due to our continuous finetuning setup, we
release three different checkpoints (1.87M, 2.05M, 2.65M) which we found to be consistently good.
A slight disclaimer is that we finetuned and trained this model on TPUv4 chips on our internal systems. Even
so, finetuning would sometimes result in NaNs, which may require some care and manual tuning to resolve.
Therefore, if the checkpoints were ported to another system, we cannot guarantee that they would work as
well. We are overall optimistic, but we do not guarantee stability on external hardware and accelerators such
as GPUs.
For this particular checkpoint, note that the mode tags we used are [NLG] (X-denoiser), [NLU] (R-denoiser)
and [S2S] (S-denoiser), and they should be added at the start of the inputs of your examples.
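As a minimal illustration of this, assuming plain string inputs (the task text below is a made-up placeholder):

# [NLG] corresponds to the X-denoiser, [NLU] to the R-denoiser and [S2S] to
# the S-denoiser, as described above.
def add_mode_tag(text, mode="[S2S]"):
  return f"{mode} {text}"

# Example with a placeholder input:
model_input = add_mode_tag(
    "summarize: The quick brown fox jumped over the lazy dog.", mode="[S2S]")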
This section aims to give more insight into how UL2 pretraining is implemented. Our implementation is
quite simple: it is a mixture of different pretraining objectives implemented in seqio
(https://github.com/google/seqio). Most of our experiments were run by simply mixing different seqio
tasks with seqio's Mixture Registry. However, one could also implement a generalized UL2 objective with a
single function, as in the listing below, which can be cleaner.
import functools

import t5.data.preprocessors
import tensorflow as tf


# NOTE: the published listing omits the function signature; the signature and
# default values below are reconstructed from the argument documentation and
# are illustrative rather than exact.
def ul2_objective(dataset,
                  sequence_length,
                  output_features,
                  use_prefix_lm_task=False,
                  rates=None,
                  mean_noise_span_lengths=(3.0,),
                  noise_densities=(0.15,),
                  shard_ds=True,
                  optional_task_prefixes=None,
                  input_feature_key="inputs",
                  merge_examples_to_reduce_padding=True,
                  reserved_for_packing=None,
                  seed=7):
  """Generalized UL2 (mixture-of-denoisers) pretraining objective.

  Args:
    dataset: A tf.data.Dataset with dictionaries containing the key
      `input_feature_key`.
    sequence_length: dict mapping of feature key to int length for that feature.
    output_features: mapping of keys to features.
    use_prefix_lm_task: <bool> If True, include PrefixLM in the task mix.
    rates: Optional[List[float]] of rates per task. If None, tasks are
      sampled uniformly.
    mean_noise_span_lengths: List of mean number of tokens per masked span per
      example.
    noise_densities: List of what fraction of the tokens to mask.
    shard_ds: <bool> If True, shard dataset per objective.
    optional_task_prefixes: Optional[List[str]] of strings to prepend for each
      corruption scheme. NOTE: If including the prefixLM task, it must be the
      last prefix.
    input_feature_key: which feature to use from the dataset as the input text
      tokens.
    merge_examples_to_reduce_padding: if True, combines multiple input examples
      to reduce padding.
    reserved_for_packing: if specified, reduces the desired inputs length by the
      specified amount to enable multiple examples to be packed together
      downstream.
    seed: tf.int64 for controlling the random choice of spans.

  Returns:
    a dataset
  """
  # NOTE: `seed` is listed in the published arguments but is not used in this
  # simplified sketch.
  ds = dataset
  # Select a long random chunk of text from each document.
  ds = t5.data.preprocessors.select_random_chunk(
      ds,
      output_features=output_features,
      feature_key="targets",
      max_length=65536)
  if merge_examples_to_reduce_padding:
    # Concatenate multiple documents to reduce padding.
    ds = t5.data.preprocessors.reduce_concat_tokens(
        ds, feature_key="targets", batch_size=128)

  # --- Reconstructed lines (omitted in the published listing). As a
  # simplification we use the raw inputs length as the pre-corruption chunk
  # length for every denoiser; the released implementation may adjust this to
  # account for sentinel tokens.
  inputs_length = sequence_length["inputs"]
  if reserved_for_packing:
    inputs_length -= reserved_for_packing
  input_lengths = [inputs_length] * len(noise_densities)
  # One (mean span length, corruption rate) pair per denoising objective.
  hyperparams = list(zip(mean_noise_span_lengths, noise_densities))
  # --- End of reconstructed lines.

  # One shard of the corpus per objective (plus one for PrefixLM if enabled).
  num_shards = len(input_lengths) + int(use_prefix_lm_task)
  if shard_ds:
    ds_shards = [ds.shard(num_shards, i) for i in range(num_shards)]
  else:
    ds_shards = [ds for _ in range(num_shards)]
  processed_ds = []
  hyperparams = zip(input_lengths, hyperparams, range(num_shards))
  # Build one corrupted dataset per (span length, corruption rate) pair.
  for input_length, (noise_span_length, noise_density), i in hyperparams:
    ds = ds_shards[i]
    # Split documents into chunks of the desired pre-corruption length.
    ds = t5.data.preprocessors.split_tokens(
        ds,
        feature_key="targets",
        min_tokens_per_segment=None,
        max_tokens_per_segment=input_length)
    # Apply span corruption with this objective's density and span length.
    ds = t5.data.preprocessors.denoise(
        ds,
        output_features,
        inputs_fn=t5.data.preprocessors.noise_span_to_unique_sentinel,
        targets_fn=t5.data.preprocessors.nonnoise_span_to_unique_sentinel,
        noise_density=noise_density,
        noise_mask_fn=functools.partial(
            t5.data.preprocessors.random_spans_noise_mask,
            mean_noise_span_length=noise_span_length),
        input_feature_key=input_feature_key)
    if optional_task_prefixes:
      # `prepend_prompt` is assumed to be a helper (defined elsewhere) that
      # prepends the mode token (e.g. [NLU]/[NLG]/[S2S]) to the inputs.
      ds = prepend_prompt(
          ds,
          output_features,
          prompt_mode=optional_task_prefixes[i],
          mode=optional_task_prefixes[i])
    processed_ds.append(ds)

  if use_prefix_lm_task:
    ds = ds_shards[-1]
    ds = t5.data.preprocessors.prefix_lm(ds, sequence_length, output_features)
    if optional_task_prefixes:
      ds = prepend_prompt(
          ds,
          output_features,
          prompt_mode=optional_task_prefixes[-1],
          mode=optional_task_prefixes[-1])
    processed_ds.append(ds)

  # Reconstructed ending (the published listing is truncated here): mix the
  # per-objective datasets, sampling uniformly when `rates` is None.
  return tf.data.experimental.sample_from_datasets(processed_ds, weights=rates)
Most of our supervised finetuning runs were finetuned as single tasks. The only exceptions were the following
(a minimal seqio mixture sketch is shown after this list):
• We finetuned GLUE as a single mixture with proportionate sampling. This has become the standard,
de facto setup (Raffel et al., 2019; He et al., 2022; Tay et al., 2020, 2021b).
• We finetuned SuperGLUE as a single mixture, which is also a standard setup these days (Fedus et al.,
2021; Raffel et al., 2019; Chowdhery et al., 2022).
• SIQA, PIQA, AbductiveNLI, Winogrande XL and CosmosQA were co-trained in a proportionate mixture
similar to Lourie et al. (2021) under the Rainbow benchmark.
• For CSQA, CSQA2, OBQA, and ARC-DA, we co-trained with the Rainbow mixture to obtain results on
these datasets.
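As referenced above, here is a minimal sketch of registering such a proportionate mixture with seqio; the task names and rates are hypothetical placeholders, and the snippet assumes the component Tasks have already been registered in seqio's TaskRegistry.

import seqio

# Hypothetical task names and example counts, purely for illustration; using
# rates proportional to dataset size yields proportionate sampling.
rainbow_like_tasks = [
    ("siqa_task", 33410.0),
    ("piqa_task", 16113.0),
    ("abductive_nli_task", 169654.0),
    ("winogrande_xl_task", 40398.0),
    ("cosmos_qa_task", 25262.0),
]

seqio.MixtureRegistry.add(
    "rainbow_proportionate_mixture",
    [(name, rate) for name, rate in rainbow_like_tasks],
)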
Task Prompt
BoolQ S2S
CB NLU
RTE S2S
Record S2S
WiC S2S
WSC S2S
COPA NLU
MultiRC S2S