Table 1: Comparison of mT5 to existing massively multilingual pre-trained language models. Multiple versions of
XLM and mBERT exist; we refer here to the ones that cover the most languages. Note that XLM-R counts five
Romanized variants as separate languages, while we ignore six Romanized variants in the mT5 language count.
of just d_ff in the larger models, and pre-training on unlabeled data only with no dropout. We refer to
Raffel et al. (2020) for further details on T5.

A major factor in pre-training multilingual models is how to sample data from each language.
Ultimately, this choice is a zero-sum game: If low-resource languages are sampled too often, the
model may overfit; if high-resource languages are not trained on enough, the model will underfit. We
therefore take the approach used in (Devlin, 2018; Conneau et al., 2020; Arivazhagan et al., 2019) and
boost lower-resource languages by sampling examples according to the probability p(L) ∝ |L|^α,
where p(L) is the probability of sampling text from a given language during pre-training and |L| is the
number of examples in the language. The hyperparameter α (typically with α < 1) allows us to
control how much to “boost” the probability of training on low-resource languages. Values used
by prior work include α = 0.7 for mBERT (Devlin, 2018), α = 0.3 for XLM-R (Conneau et al., 2020),
and α = 0.2 for MMNMT (Arivazhagan et al., 2019). We tried all three of these values (ablation
results in section 4.2) and found α = 0.3 to give a reasonable compromise between performance on
high- and low-resource languages.

The fact that our model covers over 100 languages necessitates a larger vocabulary. Following
XLM-R (Conneau et al., 2018), we increase the vocabulary size to 250,000 wordpieces. As in T5, we
use SentencePiece (Kudo and Richardson, 2018; Kudo, 2018) models trained with the language
sampling rates used during pre-training. To accommodate languages with large character sets like
Chinese, we use a character coverage of 0.99999 and enable SentencePiece’s “byte-fallback” feature
to ensure that any string can be uniquely encoded.

3.3 Comparison to related models

To contextualize our new model, we provide a brief comparison with existing massively multilingual
pre-trained language models. For brevity, we focus on models that support more than a few dozen
languages. Table 1 gives a high-level comparison of mT5 to the most similar models.

mBERT (Devlin, 2018) is a multilingual version of BERT (Devlin et al., 2019). Similar to our
approach with mT5, mBERT follows the BERT recipe as closely as possible (same architecture,
objective, etc.). The primary difference is the training set: Instead of training on English Wikipedia and
the Toronto Books Corpus, mBERT is trained on up to 104 languages from Wikipedia. XLM (Conneau
and Lample, 2019) is also based on BERT but applies improved methods for pre-training multilingual
language models including explicitly cross-lingual pre-training objectives. Many pre-trained
versions of XLM have been released; the most massively-multilingual variant was trained on 100
languages from Wikipedia. XLM-R (Conneau
Model                    Sentence pair    Structured     Question answering
                         XNLI   PAWS-X    WikiAnn NER    XQuAD     MLQA      TyDi QA
Metrics                  Acc.   Acc.      F1             F1 / EM   F1 / EM   F1 / EM
Cross-lingual zero-shot transfer (models fine-tuned on English data only)
mBERT 65.4 81.9 62.2 64.5 / 49.4 61.4 / 44.2 59.7 / 43.9
XLM 69.1 80.9 61.2 59.8 / 44.3 48.5 / 32.6 43.6 / 29.1
InfoXLM 81.4 - - -/- 73.6 / 55.2 -/-
X-STILTs 80.4 87.7 64.7 77.2 / 61.3 72.3 / 53.5 76.0 / 59.5
XLM-R 79.2 86.4 65.4 76.6 / 60.8 71.6 / 53.2 65.1 / 45.0
VECO 79.9 88.7 65.7 77.3 / 61.8 71.7 / 53.2 67.6 / 49.1
RemBERT 80.8 87.5 70.1 79.6 / 64.0 73.1 / 55.0 77.0 / 63.0
mT5-Small 67.5 82.4 50.5 58.1 / 42.5 54.6 / 37.1 35.2 / 23.2
mT5-Base 75.4 86.4 55.7 67.0 / 49.0 64.6 / 45.0 57.2 / 41.2
mT5-Large 81.1 88.9 58.5 77.8 / 61.5 71.2 / 51.7 69.9 / 52.2
mT5-XL 82.9 89.6 65.5 79.5 / 63.6 73.5 / 54.5 75.9 / 59.4
mT5-XXL 85.0 90.0 69.2 82.5 / 66.8 76.0 / 57.4 80.8 / 65.9
Translate-train (models fine-tuned on English data plus translations in all target languages)
XLM-R 82.6 90.4 - 80.2 / 65.9 72.8 / 54.3 66.5 / 47.7
FILTER + Self-Teaching 83.9 91.4 - 82.4 / 68.0 76.2 / 57.7 68.3 / 50.9
VECO 83.0 91.1 - 79.9 / 66.3 73.1 / 54.9 75.0 / 58.9
mT5-Small 64.7 79.9 - 64.3 / 49.5 56.6 / 38.8 48.2 / 34.0
mT5-Base 75.9 89.3 - 75.3 / 59.7 67.6 / 48.5 64.0 / 47.7
mT5-Large 81.8 91.2 - 81.2 / 65.9 73.9 / 55.2 71.1 / 54.9
mT5-XL 84.8 91.0 - 82.7 / 68.1 75.1 / 56.6 79.9 / 65.3
mT5-XXL 87.8 91.5 - 85.2 / 71.3 76.9 / 58.3 82.8 / 68.8
In-language multitask (models fine-tuned on gold data in all target languages)
mBERT - - 89.1 - - 77.6 / 68.0
mT5-Small - - 83.4 - - 73.0 / 62.0
mT5-Base - - 85.4 - - 80.8 / 70.0
mT5-Large - - 88.4 - - 85.5 / 75.3
mT5-XL - - 90.9 - - 87.5 / 78.1
mT5-XXL - - 91.2 - - 88.5 / 79.1
Table 2: Results on XTREME sentence-pair classification, structured prediction and question answering tasks.
mBERT metrics are from Hu et al. (2020). Metrics for XLM, InfoXLM, X-STILTs and XLM-R are from Fang
et al. (2020), though Conneau et al. (2020) report better performance of XLM-R on XNLI (80.9). All other metrics
are from the original sources: FILTER (Fang et al., 2020), VECO (Luo et al., 2020) and RemBERT (Chung et al.,
2020). For the “translate-train” setting, we include English training data, so as to be comparable with Fang et al.
(2020) and Luo et al. (2020). This differs from the XTREME “translate-train” setup of Hu et al. (2020). For mT5
results on TyDi QA zero-shot, we report the median across five fine-tuning runs, as we observed high variance
across runs. Full results for all languages in all tasks are provided in the appendix.
et al., 2020) is an improved version of XLM based on the RoBERTa model (Liu et al., 2019). XLM-R
is trained with a cross-lingual masked language modeling objective on data in 100 languages from
Common Crawl. To improve the pre-training data quality, pages from Common Crawl were filtered
by an n-gram language model trained on Wikipedia (Wenzek et al., 2020). mBART (Liu et al., 2020a)
is a multilingual encoder-decoder model that is based on BART (Lewis et al., 2020b). mBART is
trained with a combination of span masking and sentence shuffling objectives on a subset of 25
languages from the same data as XLM-R. MARGE (Lewis et al., 2020a) is a multilingual
encoder-decoder model that is trained to reconstruct a document in one language by retrieving
documents in other languages. It uses data in 26 languages from Wikipedia and CC-News (Liu et al., 2019).

4 Experiments

To validate the performance of mT5, we evaluate our models on 6 tasks from the XTREME
multilingual benchmark (Hu et al., 2020): the XNLI (Conneau et al., 2018) entailment task covering 14
languages; the XQuAD (Artetxe et al., 2020), MLQA (Lewis et al., 2019), and TyDi QA (Clark et al.,
2020) reading comprehension benchmarks with 10, 7, and 11 languages respectively; the Named
Entity Recognition (NER) dataset of WikiAnn (Pan et al., 2017) restricted to the 40 languages from
XTREME (Hu et al., 2020), and the PAWS-X (Yang et al., 2019) paraphrase identification dataset with
7 languages. We cast all tasks into the text-to-text format, i.e. generating the label text (XNLI and
PAWS-X), entity tags and labels (WikiAnn NER), or answer (XQuAD, MLQA, and TyDi QA)
directly in a generative fashion. For NER, if there are multiple entities, then they are concatenated
in the order they appear, and if there are no entities then the target text is “None”. We
consider three variants of these tasks: (1) “zero-shot”, where the model is fine-tuned only on English data,
(2) “translate-train”, adding machine translations from English into each target language, and
(3) “in-language multitask”, training on gold data in all target languages. For brevity, we refer to
Hu et al. (2020) for further details on these benchmarks.

Following the original T5 recipe, we consider five model sizes: Small (≈ 300M parameters),
Base (580M), Large (1.2B), XL (3.7B), and XXL (13B). The increase in parameter counts
compared to the corresponding T5 model variants comes from the larger vocabulary used in mT5.
Note that, because mT5 is an encoder-decoder model, it has roughly twice as many parameters as
correspondingly-sized encoder-only models such as XLM-R. For example, the “Large” variant of
XLM-R has 550 million parameters whereas mT5-Large has around 1 billion. However, the
computational cost for text classification is roughly the same: In both cases, the model processes a
length-T input sequence with an encoder of approximately equal size. In an encoder-only model like
XLM-R, the encoder processes one additional "CLS" token, which is used to generate the
representation for classification. In mT5, the decoder typically produces two additional tokens: the
class label and an end-of-sequence token. Since the decoder has the same architecture (ignoring
encoder-decoder attention) as the encoder, the computational cost of classification with mT5
typically amounts to the cost of processing T + 2 tokens compared to T + 1 for an encoder-only
model. However, encoder-decoder architectures have the additional benefit of being applicable to
generative tasks like abstractive summarization or dialog.

We pre-train our mT5 model variants for 1 million steps on batches of 1024 length-1024 input
sequences, corresponding to roughly 1 trillion input tokens total. This is the same amount of
pre-training as T5 and about 1/6 as much as XLM-R.

We use the same inverse square-root learning rate schedule used by T5 during pre-training, with
the learning rate set to 1/√max(n, k) where n is the current training iteration and k = 10^4 is the
number of warm-up steps. Following the T5.1.1 recipe, we do not apply dropout during pre-training.
We use the same self-supervised objective as T5, with 15% of tokens masked and an average noise
span length of 3. We ablate some of these experimental details in section 4.2.

For fine-tuning, we use a constant learning rate of 0.001 and dropout rate of 0.1 for all tasks. We
use batch size 2^17 for most tasks but increased this up to 2^20 in a few cases based on performance
on the validation set. For early stopping, we save checkpoints every 200 steps and choose the
checkpoint with the highest validation performance.

4.1 Results

Table 2 presents our main results, with per-language breakdowns for each task given in the
appendix. Our largest model mT5-XXL exceeds state-of-the-art on all classification and QA tasks
and is near SOTA on NER (69.2 vs. 70.1). Note that unlike our model, InfoXLM (Chi et al., 2020)
and VECO (Luo et al., 2020) benefit from parallel training data, while X-STILTs (Phang et al.,
2020) leverages labeled data from tasks similar to the target task. Overall, our results highlight the
importance of model capacity in cross-lingual representation learning and suggest that scaling up a
simple pre-training recipe can be a viable alternative to more complex techniques relying on LM
filtering, parallel data, or intermediate tasks.

In the “translate-train” setting, we exceed state-of-the-art on all XTREME classification and QA
tasks. For these tasks, we fine-tune on the combination of the labeled English data and machine
translations thereof.[6] This allows direct comparison with both FILTER (Fang et al., 2020) and the
XLM-R baseline of Fang et al. (2020). Note that this setup differs from XTREME “translate-train”
(Hu et al., 2020), which excludes English.

Figure 2 shows that model capacity is key to improving performance on variants of the TyDi
QA GoldP task in the absence of “gold” multilingual data: For the smallest model, training on
gold datasets (in-language multitask) achieves dra-

[6] We use the translation data provided by Hu et al. (2020) throughout. On the PAWS-X task, FILTER
used translation data from the original task instead. Switching to this data would improve our scores
slightly (mT5-XXL 91.5 → 92.0).
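As a rough illustration of the pre-training settings described in this section, the sketch below computes the inverse square-root learning rate, read here as 1/√max(n, k) with k = 10^4 warm-up steps, and the total number of input tokens implied by 1 million steps on batches of 1024 length-1024 sequences. The function names and default arguments are hypothetical and chosen for this example only; this is not the training code used for mT5.

```python
import math


def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Learning rate 1 / sqrt(max(n, k)) for iteration n and k warm-up steps.

    The rate stays constant at 1 / sqrt(k) during warm-up and then decays
    with the inverse square root of the step count.
    """
    return 1.0 / math.sqrt(max(step, warmup_steps))


def pretraining_tokens(steps: int = 1_000_000,
                       batch_size: int = 1024,
                       seq_len: int = 1024) -> int:
    """Total input tokens seen: steps x sequences per batch x tokens per sequence."""
    return steps * batch_size * seq_len


if __name__ == "__main__":
    for n in (1, 10_000, 40_000, 1_000_000):
        print(f"step {n:>9,}: lr = {inverse_sqrt_lr(n):.4f}")
    print(f"input tokens: {pretraining_tokens():,}")
```

Under these assumptions the rate is 0.01 throughout warm-up and falls to 0.001 by step 1 million, and the token count comes to about 1.05 × 10^12, in line with the "roughly 1 trillion" figure quoted in the text.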
        T5            mT5
Small   87.2 / 79.1   84.7 / 76.4
Base    92.1 / 85.4   89.6 / 83.8
Large   93.8 / 86.7   93.0 / 87.0
XL      95.0 / 88.5   94.5 / 88.9
XXL     96.2 / 91.3   95.6 / 90.4
Table 3: Comparison of T5 vs. mT5 on SQuAD question answering (F1/EM).

Model                   Accuracy
Baseline (mT5-Large)    81.1
Dropout 0.1             77.6
Sequence length 512     80.5
Span length 10          78.6
α = 0.7                 80.7
α = 0.2                 80.7
No line length filter   79.1
Add Wikipedia data      80.3
Table 4: Average XNLI zero-shot accuracy of various ablations on our mT5-Large model. Per-language
metrics are shown in the appendix.
model has enough capacity to effectively learn 101

[Figure 3, panel (a) mT5-Small; x-axis languages: el ru th ar de hi zh es tr vi en]
[Figure 3, panel (b) mT5-XXL; x-axis languages: th ru zh hi de es el ar tr vi en]
Figure 3: Per-language error rates on XQuAD zero-shot, sorted by illegal rate. Incorrect: Not matching
the target span. Illegal: Missing from the input context. Illegal after norm: Illegal even after Unicode
NFKC normalization is applied to the prediction and context.

[Figure 4; x-axis model sizes: Small, Base, Large, XL, XXL]
Figure 4: Error rates of mT5 on XQuAD zero-shot. Baseline: Fine-tuning on XQuAD alone. Domain Pre-

ify our inference procedure. As is common practice with encoder-based models, we could devise a
task-specific fine-tuning mechanism that restricts the model to perform ranking over legal spans,
removing the possibility of illegal predictions entirely. While this would likely improve our zero-shot
metrics, it is unsatisfying for two reasons: First, it implies taking a step backward from the general
text-to-text interface, as different tasks would demand different types of inference. Second, this
solution won’t extend to more “open-ended” zero-shot generative tasks like summarization, where
the legal output space can’t be easily delimited.

For these reasons, we consider a more general solution that remains within the text-to-text
framework and can apply to all zero-shot generation tasks. Our motivating intuition is that the reason the
model outputs English when given a non-English test input is that it has never observed a non-English
target during fine-tuning. As English-only fine-tuning proceeds, the model’s assigned likelihood
of non-English tokens presumably decreases, eventually reaching the point where English becomes
the most likely answer to any question.

To prevent the model from “forgetting” how to generate other languages, we use a strategy inspired
by domain/task-adaptive pre-training (Howard and Ruder, 2018; Gururangan et al., 2020): We simply
mix in our unsupervised multilingual pre-training task during fine-tuning. A similar approach was
explored by Liu et al. (2020b). We use the same mC4 task definition as in pre-training, with two
adjustments: First, we remove all “sentinel” tokens (corresponding to non-masked spans in the input
text) from the target sequence, as otherwise we observe occasional sentinels in downstream
predictions. Second, we reduce the language sampling parameter α from 0.3 to 0.1. This produces a
near-uniform distribution of languages, encouraging the model to treat all languages as equally likely.[8]

With these changes, we mix a small amount of our unsupervised task (covering 101 languages)
into XQuAD fine-tuning, at a ratio of just 1:100. Figure 4 shows the results on XQuAD zero-shot
error rates. The addition of even this small amount of multilingual data has a marked effect on the
mT5-Small and mT5-Base models (where accidental translation was most rampant), reducing the illegal
prediction rates by more than 70% (relative), and contributing to an overall reduction in errors.

6 Conclusion

In this paper, we introduced mT5 and mC4: massively multilingual variants of the T5 model and
C4 dataset. We demonstrated that the T5 recipe is straightforwardly applicable to the multilingual
setting, and achieved strong performance on a diverse set of benchmarks. We also characterized illegal
predictions that can occur in zero-shot evaluation of multilingual pre-trained generative models, and
described a simple technique to avoid this issue. We release all code and pre-trained datasets used in
this paper to facilitate future work on multilingual

[8] Alternatively, one could mix in unlabeled data only for a single language at a time. However, we
believe this is contrary to the spirit of multilingual models and zero-shot evaluation.
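The error categories named in the Figure 3 caption (incorrect, illegal, illegal after norm) can be made concrete with a small check. The sketch below is only an illustration under stated assumptions: "incorrect" is taken to mean that the prediction is not an exact string match for the target span, "illegal" that the prediction is not a substring of the context, and the function name and example strings are hypothetical. The actual evaluation scripts may normalize answers differently.

```python
import unicodedata


def classify_prediction(prediction: str, target: str, context: str) -> dict:
    """Flag a span prediction using the three categories of the Figure 3 caption.

    incorrect:          the prediction does not match the target span (assumed exact match)
    illegal:            the prediction does not occur in the input context
    illegal_after_norm: still absent after NFKC-normalizing prediction and context
    """
    incorrect = prediction != target
    illegal = prediction not in context
    norm_prediction = unicodedata.normalize("NFKC", prediction)
    norm_context = unicodedata.normalize("NFKC", context)
    illegal_after_norm = norm_prediction not in norm_context
    return {
        "incorrect": incorrect,
        "illegal": illegal,
        "illegal_after_norm": illegal_after_norm,
    }


if __name__ == "__main__":
    # Hypothetical context with full-width digits; the target is a span of it.
    context = "価格は１００円です"
    target = "１００円"
    # ASCII digits: missing from the raw context, but found after NFKC normalization.
    print(classify_prediction("100円", target, context))
    # An English answer: illegal both before and after normalization.
    print(classify_prediction("100 yen", target, context))
```

In this toy case the first prediction counts as illegal only because of a Unicode width difference, while the second is the kind of English output for a non-English input that the text calls accidental translation.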
Table 6: Statistics of the mC4 corpus, totaling 6.6B pages and 6.3T tokens. The “mT5” column indicates the
percentage of mT5 training data coming from a given language, using the default exponential smoothing value of
α=0.3. We list 107 “languages” as detected by cld3, but note six of these (marked “Latin”) are just Romanized
variants of existing languages.
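The percentages in the “mT5” column follow from the smoothed sampling rule p(L) ∝ |L|^α. The sketch below shows, for illustration only, how such a distribution can be computed; the function name and the example counts are hypothetical and are not mC4 statistics.

```python
from typing import Dict


def sampling_probs(example_counts: Dict[str, int], alpha: float = 0.3) -> Dict[str, float]:
    """Return p(L) proportional to |L| ** alpha, normalized to sum to 1.

    alpha = 1 samples in proportion to corpus size; smaller alpha flattens
    the distribution and boosts low-resource languages.
    """
    weights = {lang: count ** alpha for lang, count in example_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical corpus sizes spanning four orders of magnitude.
    counts = {"high": 1_000_000, "mid": 10_000, "low": 100}
    for alpha in (1.0, 0.7, 0.3, 0.2, 0.1):
        probs = sampling_probs(counts, alpha)
        row = ", ".join(f"{lang}={p:.3f}" for lang, p in probs.items())
        print(f"alpha={alpha}: {row}")
```

With these toy counts, lowering α from 1.0 to 0.3 raises the share of the smallest language from a negligible fraction to a few percent, and α = 0.1 moves the distribution much closer to uniform.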
Model en ar bg de el es fr hi ru sw th tr ur vi zh avg
Cross-lingual zero-shot transfer (models fine-tuned on English data only)
mBERT 80.8 64.3 68.0 70.0 65.3 73.5 73.4 58.9 67.8 49.7 54.1 60.9 57.2 69.3 67.8 65.4
XLM 82.8 66.0 71.9 72.7 70.4 75.5 74.3 62.5 69.9 58.1 65.5 66.4 59.8 70.7 70.2 69.1
XLM-R 88.7 77.2 83.0 82.5 80.8 83.7 82.2 75.6 79.1 71.2 77.4 78.0 71.7 79.3 78.2 79.2
mT5-Small 79.6 65.2 71.3 69.2 68.6 72.7 70.7 62.5 70.1 59.7 66.3 64.4 59.9 66.3 65.8 67.5
mT5-Base 84.7 73.3 78.6 77.4 77.1 80.3 79.1 70.8 77.1 69.4 73.2 72.8 68.3 74.2 74.1 75.4
mT5-Large 89.4 79.8 84.1 83.4 83.2 84.2 84.1 77.6 81.5 75.4 79.4 80.1 73.5 81.0 80.3 81.1
mT5-XL 90.6 82.2 85.4 85.8 85.4 81.3 85.3 80.4 83.7 78.6 80.9 82.0 77.0 81.8 82.7 82.9
mT5-XXL 91.6 84.5 87.7 87.3 87.3 87.8 86.9 83.2 85.1 80.3 81.7 83.8 79.8 84.6 83.6 85.0
Translate-train (models fine-tuned on English training data plus translations in all target languages)
mT5-Small 69.5 63.7 67.5 65.7 66.4 67.5 67.3 61.9 66.4 59.6 63.9 63.5 60.4 63.3 64.5 64.7
mT5-Base 82.0 74.4 78.5 77.7 78.1 79.1 77.9 72.2 76.5 71.5 75.0 74.8 70.4 74.5 76.0 75.9
mT5-Large 88.3 80.3 84.1 84.0 83.7 84.9 83.8 79.8 82.0 76.4 79.9 81.0 75.9 81.3 81.7 81.8
mT5-XL 90.9 84.2 86.8 86.8 86.4 87.4 86.8 83.1 84.9 81.3 82.3 84.4 79.4 83.9 84.0 84.8
mT5-XXL 92.7 87.2 89.4 89.8 89.5 90.0 89.1 86.5 87.6 84.3 85.6 87.1 83.8 87.5 86.5 87.8
Model en de es fr ja ko zh avg
Cross-lingual zero-shot transfer (models fine-tuned on English data only)
mBERT 94.0 85.7 87.4 87.0 73.0 69.6 77.0 81.9
XLM 94.0 85.9 88.3 87.4 69.3 64.8 76.5 80.9
XLM-R 94.7 89.7 90.1 90.4 78.7 79.0 82.3 86.4
mT5-Small 92.2 86.2 86.1 86.6 74.7 73.5 77.9 82.4
mT5-Base 95.4 89.4 89.6 91.2 79.8 78.5 81.1 86.4
mT5-Large 96.1 91.3 92.0 92.7 82.5 82.7 84.7 88.9
mT5-XL 96.0 92.8 92.7 92.4 83.6 83.1 86.5 89.6
mT5-XXL 96.3 92.9 92.6 92.7 84.5 83.9 87.2 90.0
Translate-train (models fine-tuned on English training data plus translations in all target languages)
mT5-Small 87.9 81.4 83.1 84.1 74.2 71.7 76.7 79.9
mT5-Base 95.5 90.9 91.4 92.5 83.6 84.8 86.4 89.3
mT5-Large 96.4 92.7 93.3 93.6 86.5 87.4 88.4 91.2
mT5-XL 96.4 92.5 93.1 93.6 85.5 86.9 89.0 91.0
mT5-XXL 96.1 92.9 93.6 94.2 87.0 87.9 89.0 91.5
Model en ar de el es hi ru th tr vi zh avg
Cross-lingual zero-shot transfer (models fine-tuned on English data only)
mBERT 83.5 / 72.2 61.5 / 45.1 70.6 / 54.0 62.6 / 44.9 75.5 / 56.9 59.2 / 46.0 71.3 / 53.3 42.7 / 33.5 55.4 / 40.1 69.5 / 49.6 58.0 / 48.3 64.5 / 49.4
XLM 74.2 / 62.1 61.4 / 44.7 66.0 / 49.7 57.5 / 39.1 68.2 / 49.8 56.6 / 40.3 65.3 / 48.2 35.4 / 24.5 57.9 / 41.2 65.8 / 47.6 49.7 / 39.7 59.8 / 44.3
XLM-R 86.5 / 75.7 68.6 / 49.0 80.4 / 63.4 79.8 / 61.7 82.0 / 63.9 76.7 / 59.7 80.1 / 64.3 74.2 / 62.8 75.9 / 59.3 79.1 / 59.0 59.3 / 50.0 76.6 / 60.8
mT5-Small 78.5 / 66.1 51.4 / 34.0 63.8 / 45.9 53.8 / 33.4 67.0 / 50.3 47.8 / 34.5 50.5 / 30.1 54.0 / 44.5 55.7 / 38.9 58.1 / 41.3 58.9 / 48.7 58.1 / 42.5
mT5-Base 84.6 / 71.7 63.8 / 44.3 73.8 / 54.5 59.6 / 35.6 74.8 / 56.1 60.3 / 43.4 57.8 / 34.7 57.6 / 45.7 67.9 / 48.2 70.7 / 50.3 66.1 / 54.1 67.0 / 49.0
mT5-Large 88.4 / 77.3 75.2 / 56.7 80.0 / 62.9 77.5 / 57.6 81.8 / 64.2 73.4 / 56.6 74.7 / 56.9 73.4 / 62.0 76.5 / 56.3 79.4 / 60.3 75.9 / 65.5 77.8 / 61.5
mT5-XL 88.8 / 78.1 77.4 / 60.8 80.4 / 63.5 80.4 / 61.2 82.7 / 64.5 76.1 / 60.3 76.2 / 58.8 74.2 / 62.5 77.7 / 58.4 80.5 / 60.8 80.5 / 71.0 79.5 / 63.6
mT5-XXL 90.9 / 80.1 80.3 / 62.6 83.1 / 65.5 83.3 / 65.5 85.1 / 68.1 81.7 / 65.9 79.3 / 63.6 77.8 / 66.1 80.2 / 60.9 83.1 / 63.6 83.1 / 73.4 82.5 / 66.8
Translate-train (models fine-tuned on English training data plus translations in all target languages)
mT5-Small 74.0 / 61.2 61.0 / 45.0 66.0 / 50.2 64.1 / 47.2 67.5 / 50.8 60.2 / 43.7 64.4 / 46.7 58.9 / 52.9 59.0 / 39.4 63.5 / 46.0 68.2 / 61.2 64.3 / 49.5
mT5-Base 83.1 / 70.3 72.4 / 55.2 76.9 / 59.7 76.8 / 58.8 79.0 / 61.2 71.4 / 53.4 76.1 / 58.5 67.9 / 62.0 72.5 / 51.4 75.9 / 56.3 76.9 / 69.7 75.3 / 59.7
mT5-Large 87.3 / 75.5 79.4 / 62.7 82.7 / 66.0 81.8 / 63.5 83.8 / 66.1 78.0 / 59.8 81.9 / 66.3 74.7 / 68.2 80.2 / 59.2 80.4 / 60.8 83.2 / 76.9 81.2 / 65.9
mT5-XL 88.5 / 77.1 80.9 / 65.4 83.4 / 66.7 83.6 / 64.9 84.9 / 68.2 79.6 / 63.1 82.7 / 67.1 78.5 / 72.9 82.4 / 63.8 82.4 / 64.1 83.2 / 75.9 82.7 / 68.1
mT5-XXL 91.3 / 80.3 83.4 / 68.2 85.0 / 68.2 85.9 / 68.9 87.4 / 70.8 83.7 / 68.2 85.2 / 70.4 80.2 / 74.5 84.4 / 67.7 85.3 / 67.1 85.7 / 80.0 85.2 / 71.3
Model ar bg de el en es fr hi ru sw th tr ur vi zh avg
Baseline (mT5-Large) 79.8 84.1 83.4 83.2 89.4 84.2 84.1 77.6 81.5 75.4 79.4 80.1 73.5 81.0 80.3 81.1
Dropout 0.1 76.4 82.1 81.7 81.0 88.0 70.8 80.3 74.4 79.0 72.3 75.8 75.9 70.6 78.6 76.5 77.6
Sequence length 512 78.1 83.4 83.1 82.1 88.8 84.5 82.8 77.3 81.2 75.4 78.2 79.6 73.8 80.0 78.9 80.5
Span length 10 77.6 81.5 80.5 81.2 87.2 83.0 81.2 74.7 79.8 73.6 76.7 75.9 71.3 78.6 76.5 78.6
α = 0.7 79.3 84.1 84.5 83.1 89.4 85.3 84.4 76.4 82.8 70.6 78.7 79.8 71.7 80.3 79.9 80.7
α = 0.2 78.7 83.8 83.3 82.5 89.3 83.4 83.6 77.3 81.2 75.4 78.6 79.4 73.9 79.9 79.7 80.7
No line length filter 78.4 83.3 81.5 81.4 88.9 83.8 82.5 74.4 80.5 69.4 77.6 76.9 71.3 78.8 78.3 79.1
Add Wikipedia data 79.3 83.1 83.1 82.7 88.6 80.1 83.2 77.3 81.4 75.0 78.9 79.3 73.5 80.2 79.2 80.3
Table 13: XNLI zero-shot accuracy of various ablations on our mT5-Large model.
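The “avg” column of Table 13 can be re-derived from the per-language columns, which also makes it easy to see where an ablation hurts most. The sketch below copies the Baseline and Dropout 0.1 rows from the table; the helper names are hypothetical, and rounding to one decimal place is an assumption about how the averages were reported.

```python
LANGS = ["ar", "bg", "de", "el", "en", "es", "fr", "hi",
         "ru", "sw", "th", "tr", "ur", "vi", "zh"]

# Per-language XNLI zero-shot accuracies copied from Table 13.
BASELINE = [79.8, 84.1, 83.4, 83.2, 89.4, 84.2, 84.1, 77.6,
            81.5, 75.4, 79.4, 80.1, 73.5, 81.0, 80.3]
DROPOUT_01 = [76.4, 82.1, 81.7, 81.0, 88.0, 70.8, 80.3, 74.4,
              79.0, 72.3, 75.8, 75.9, 70.6, 78.6, 76.5]


def mean_accuracy(scores):
    """Unweighted mean over languages, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


def largest_drop(baseline, ablation, langs):
    """Return (language, delta) for the biggest accuracy loss relative to the baseline."""
    deltas = {lang: round(a - b, 1) for lang, b, a in zip(langs, baseline, ablation)}
    worst = min(deltas, key=deltas.get)
    return worst, deltas[worst]


if __name__ == "__main__":
    print("baseline avg:", mean_accuracy(BASELINE))
    print("dropout 0.1 avg:", mean_accuracy(DROPOUT_01))
    print("largest drop:", largest_drop(BASELINE, DROPOUT_01, LANGS))
```

Both recomputed averages agree with the reported values (81.1 and 77.6), and the per-language deltas show that most of the loss under dropout 0.1 in this table comes from a single language column.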
Model en ar de el es hi ru th tr vi zh avg
Baseline (mT5-Large) 88.4 / 77.3 75.2 / 56.7 80.0 / 62.9 77.5 / 57.6 81.8 / 64.2 73.4 / 56.6 74.7 / 56.9 73.4 / 62.0 76.5 / 56.3 79.4 / 60.3 75.9 / 65.5 77.8 / 61.5
Span length 10 88.1 / 76.3 70.0 / 50.6 78.1 / 60.2 68.8 / 44.0 79.0 / 60.8 67.3 / 48.4 65.4 / 43.3 68.1 / 57.2 74.4 / 53.6 77.9 / 57.7 76.6 / 66.4 74.0 / 56.2
Dropout 0.1 87.3 / 76.0 54.9 / 33.9 77.6 / 60.2 64.4 / 40.1 79.2 / 60.6 59.1 / 40.4 59.5 / 38.4 65.7 / 51.0 73.6 / 52.8 75.8 / 55.8 77.0 / 64.5 70.4 / 52.1
Sequence length 512 88.0 / 76.9 77.0 / 59.6 80.2 / 62.4 79.8 / 60.0 81.7 / 64.4 75.1 / 57.5 77.4 / 58.5 72.7 / 59.8 75.3 / 53.9 79.4 / 58.9 78.5 / 67.2 78.6 / 61.7
α = 0.7 88.4 / 77.1 76.5 / 58.8 78.5 / 59.8 77.2 / 55.5 78.7 / 59.5 74.6 / 56.8 73.1 / 54.5 72.5 / 60.2 75.7 / 55.0 79.2 / 58.3 78.6 / 66.2 77.5 / 60.2
α = 0.2 87.9 / 76.8 75.5 / 57.3 80.2 / 62.4 76.2 / 54.0 81.6 / 63.7 73.7 / 57.0 70.7 / 50.8 72.2 / 60.4 75.5 / 55.7 79.7 / 59.7 78.3 / 67.5 77.4 / 60.5
No line length filter 88.9 / 77.4 73.8 / 54.0 80.8 / 62.7 74.2 / 51.8 80.9 / 62.8 74.1 / 56.6 75.0 / 56.4 71.7 / 60.3 76.7 / 56.0 78.8 / 58.6 78.5 / 67.1 77.6 / 60.3
Add Wikipedia data 89.3 / 78.4 69.6 / 48.9 79.6 / 61.1 59.5 / 36.0 80.6 / 61.0 73.6 / 55.0 68.7 / 47.0 70.5 / 58.1 76.7 / 56.9 78.6 / 56.4 77.5 / 66.3 74.9 / 56.8
Table 14: XQuAD zero-shot F1/EM of various ablations on our mT5-Large model.