2021 Multilingual-T5 Xue-Google
Table 1: Comparison of mT5 to existing massively multilingual pre-trained language models. Multiple versions of
XLM and mBERT exist; we refer here to the ones that cover the most languages. Note that XLM-R counts five
Romanized variants as separate languages, while we ignore six Romanized variants in the mT5 language count.
of just dff in the larger models, and pre-training on unlabeled data only with no dropout. We refer to Raffel et al. (2020) for further details on T5.

A major factor in pre-training multilingual models is how to sample data from each language. Ultimately, this choice is a zero-sum game: If low-resource languages are sampled too often, the model may overfit; if high-resource languages are not trained on enough, the model will underfit. We therefore take the approach used in (Devlin, 2018; Conneau et al., 2020; Arivazhagan et al., 2019) and boost lower-resource languages by sampling examples according to the probability p(L) ∝ |L|^α, where p(L) is the probability of sampling text from a given language during pre-training and |L| is the number of examples in the language. The hyperparameter α (typically with α < 1) allows us to control how much to "boost" the probability of training on low-resource languages. Values used by prior work include α = 0.7 for mBERT (Devlin, 2018), α = 0.3 for XLM-R (Conneau et al., 2020), and α = 0.2 for MMNMT (Arivazhagan et al., 2019). We tried all three of these values (ablation results in section 4.2) and found α = 0.3 to give a reasonable compromise between performance on high- and low-resource languages.
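As a rough illustration of this sampling scheme (not the implementation used in this work; the example counts below are invented placeholders, not mC4 statistics):

    # Exponential smoothing of language sampling rates: p(L) ∝ |L|^α.
    example_counts = {"en": 3_000_000_000, "sw": 10_000_000, "yo": 500_000}  # placeholder counts
    alpha = 0.3  # α < 1 boosts low-resource languages

    weights = {lang: count ** alpha for lang, count in example_counts.items()}
    total = sum(weights.values())
    p = {lang: w / total for lang, w in weights.items()}
    # With α = 0.3, "yo" receives a far larger share of training batches than its
    # raw fraction of the data; with α = 1.0 the raw proportions are kept.

Sampling pre-training examples proportionally to p then reproduces the boosting behavior described above.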
The fact that our model covers over 100 languages necessitates a larger vocabulary. Following XLM-R (Conneau et al., 2020), we increase the vocabulary size to 250,000 wordpieces. As in T5, we use SentencePiece (Kudo and Richardson, 2018; Kudo, 2018) models trained with the language sampling rates used during pre-training. To accommodate languages with large character sets like Chinese, we use a character coverage of 0.99999 and enable SentencePiece's "byte-fallback" feature to ensure that any string can be uniquely encoded.
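For reference, a SentencePiece model with these settings could be trained roughly as follows (a sketch only; the input path and model prefix are placeholder names, and the exact trainer configuration used for mT5 is not specified here):

    import sentencepiece as spm

    # Train a 250k-wordpiece unigram vocabulary on text that has already been
    # sampled with the α = 0.3 language rates described above.
    spm.SentencePieceTrainer.train(
        input="mc4_language_sampled.txt",  # placeholder path
        model_prefix="mt5_spm",            # placeholder prefix
        model_type="unigram",
        vocab_size=250_000,
        character_coverage=0.99999,        # retain rare characters in large scripts
        byte_fallback=True,                # unknown pieces decompose into bytes
    )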
3.3 Comparison to related models

To contextualize our new model, we provide a brief comparison with existing massively multilingual pre-trained language models. For brevity, we focus on models that support more than a few dozen languages. Table 1 gives a high-level comparison of mT5 to the most similar models.

mBERT (Devlin, 2018) is a multilingual version of BERT (Devlin et al., 2019). Similar to our approach with mT5, mBERT follows the BERT recipe as closely as possible (same architecture, objective, etc.). The primary difference is the training set: Instead of training on English Wikipedia and the Toronto Books Corpus, mBERT is trained on up to 104 languages from Wikipedia. XLM (Conneau and Lample, 2019) is also based on BERT but applies improved methods for pre-training multilingual language models including explicitly cross-lingual pre-training objectives. Many pre-trained versions of XLM have been released; the most massively-multilingual variant was trained on 100 languages from Wikipedia.
Model | Sentence pair: XNLI (Acc.), PAWS-X (Acc.) | Structured: WikiAnn NER (F1) | Question answering: XQuAD (F1/EM), MLQA (F1/EM), TyDiQA-GoldP (F1/EM)
Cross-lingual zero-shot transfer (models fine-tuned on English data only)
mBERT 65.4 81.9 62.2 64.5 / 49.4 61.4 / 44.2 59.7 / 43.9
XLM 69.1 80.9 61.2 59.8 / 44.3 48.5 / 32.6 43.6 / 29.1
InfoXLM 81.4 - - -/- 73.6 / 55.2 -/-
X-STILTs 80.4 87.7 64.7 77.2 / 61.3 72.3 / 53.5 76.0 / 59.5
XLM-R 79.2 86.4 65.4 76.6 / 60.8 71.6 / 53.2 65.1 / 45.0
VECO 79.9 88.7 65.7 77.3 / 61.8 71.7 / 53.2 67.6 / 49.1
RemBERT 80.8 87.5 70.1 79.6 / 64.0 73.1 / 55.0 77.0 / 63.0
mT5-Small 67.5 82.4 50.5 58.1 / 42.5 54.6 / 37.1 35.2 / 23.2
mT5-Base 75.4 86.4 55.7 67.0 / 49.0 64.6 / 45.0 57.2 / 41.2
mT5-Large 81.1 88.9 58.5 77.8 / 61.5 71.2 / 51.7 69.9 / 52.2
mT5-XL 82.9 89.6 65.5 79.5 / 63.6 73.5 / 54.5 75.9 / 59.4
mT5-XXL 85.0 90.0 69.2 82.5 / 66.8 76.0 / 57.4 80.8 / 65.9
Translate-train (models fine-tuned on English data plus translations in all target languages)
XLM-R 82.6 90.4 - 80.2 / 65.9 72.8 / 54.3 66.5 / 47.7
FILTER + Self-Teaching 83.9 91.4 - 82.4 / 68.0 76.2 / 57.7 68.3 / 50.9
VECO 83.0 91.1 - 79.9 / 66.3 73.1 / 54.9 75.0 / 58.9
mT5-Small 64.7 79.9 - 64.3 / 49.5 56.6 / 38.8 48.2 / 34.0
mT5-Base 75.9 89.3 - 75.3 / 59.7 67.6 / 48.5 64.0 / 47.7
mT5-Large 81.8 91.2 - 81.2 / 65.9 73.9 / 55.2 71.1 / 54.9
mT5-XL 84.8 91.0 - 82.7 / 68.1 75.1 / 56.6 79.9 / 65.3
mT5-XXL 87.8 91.5 - 85.2 / 71.3 76.9 / 58.3 82.8 / 68.8
In-language multitask (models fine-tuned on gold data in all target languages)
mBERT - - 89.1 - - 77.6 / 68.0
mT5-Small - - 83.4 - - 73.0 / 62.0
mT5-Base - - 85.4 - - 80.8 / 70.0
mT5-Large - - 88.4 - - 85.5 / 75.3
mT5-XL - - 90.9 - - 87.5 / 78.1
mT5-XXL - - 91.2 - - 88.5 / 79.1
Table 2: Results on XTREME sentence-pair classification, structured prediction and question answering tasks.
mBERT metrics are from Hu et al. (2020). Metrics for XLM, InfoXLM, X-STILTs and XLM-R are from Fang
et al. (2020), though Conneau et al. (2020) report better performance of XLM-R on XNLI (80.9). All other metrics
are from the original sources: FILTER (Fang et al., 2020), VECO (Luo et al., 2020) and RemBERT (Chung et al.,
2020). For the “translate-train” setting, we include English training data, so as to be comparable with Fang et al.
(2020) and Luo et al. (2020). This differs from the XTREME “translate-train” setup of Hu et al. (2020). For mT5
results on TyDi QA zero-shot, we report the median across five fine-tuning runs, as we observed high variance
across runs. Full results for all languages in all tasks are provided in the appendix.
XLM-R (Conneau et al., 2020) is an improved version of XLM based on the RoBERTa model (Liu et al., 2019). XLM-R is trained with a cross-lingual masked language modeling objective on data in 100 languages from Common Crawl. To improve the pre-training data quality, pages from Common Crawl were filtered by an n-gram language model trained on Wikipedia (Wenzek et al., 2020). mBART (Liu et al., 2020a) is a multilingual encoder-decoder model that is based on BART (Lewis et al., 2020b). mBART is trained with a combination of span masking and sentence shuffling objectives on a subset of 25 languages from the same data as XLM-R. MARGE (Lewis et al., 2020a) is a multilingual encoder-decoder model that is trained to reconstruct a document in one language by retrieving documents in other languages. It uses data in 26 languages from Wikipedia and CC-News (Liu et al., 2019).

4 Experiments

To validate the performance of mT5, we evaluate our models on 6 tasks from the XTREME multilingual benchmark (Hu et al., 2020): the XNLI (Conneau et al., 2018) entailment task covering 14 languages; the XQuAD (Artetxe et al., 2020), MLQA (Lewis et al., 2019), and TyDi QA (Clark et al., 2020) reading comprehension benchmarks with 10, 7, and 11 languages respectively; the Named Entity Recognition (NER) dataset of WikiAnn (Pan et al., 2017) restricted to the 40 languages from XTREME (Hu et al., 2020), and the PAWS-X (Yang et al., 2019) paraphrase identification dataset with 7 languages. We cast all tasks into the text-to-text format, i.e. generating the label text (XNLI and PAWS-X), entity tags and labels (WikiAnn NER), or answer (XQuAD, MLQA, and TyDi QA) directly in a generative fashion. For NER, if there are multiple entities, then they are concatenated in the order they appear, and if there are no entities then the target text is "None". We consider three variants of these tasks: (1) "zero-shot", where the model is fine-tuned only on English data, (2) "translate-train", adding machine translations from English into each target language, and (3) "in-language multitask", training on gold data in all target languages. For brevity, we refer to Hu et al. (2020) for further details on these benchmarks.
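To make the NER casting above concrete, a hypothetical rendering of WikiAnn targets might look like the following (a sketch; the tag and separator format is our own illustration, not necessarily the one used in this work):

    def wikiann_target(entities):
        """entities: list of (entity_text, entity_type) pairs in order of appearance."""
        if not entities:
            return "None"
        return " $$ ".join(f"{etype}: {text}" for text, etype in entities)

    wikiann_target([("Albert Einstein", "PER"), ("Ulm", "LOC")])
    # -> "PER: Albert Einstein $$ LOC: Ulm"
    wikiann_target([])
    # -> "None"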
Following the original T5 recipe, we consider five model sizes: Small (≈ 300M parameters), Base (580M), Large (1.2B), XL (3.7B), and XXL (13B). The increase in parameter counts compared to the corresponding T5 model variants comes from the larger vocabulary used in mT5. Note that, because mT5 is an encoder-decoder model, it has roughly twice as many parameters as correspondingly-sized encoder-only models such as XLM-R. For example, the "Large" variant of XLM-R has 550 million parameters whereas mT5-Large has around 1 billion. However, the computational cost for text classification is roughly the same: In both cases, the model processes a length-T input sequence with an encoder of approximately equal size. In an encoder-only model like XLM-R, the encoder processes one additional "CLS" token, which is used to generate the representation for classification. In mT5, the decoder typically produces two additional tokens: the class label and an end-of-sequence token. Since the decoder has the same architecture (ignoring encoder-decoder attention) as the encoder, the computational cost of classification with mT5 typically amounts to the cost of processing T + 2 tokens compared to T + 1 for an encoder-only model. However, encoder-decoder architectures have the additional benefit of being applicable to generative tasks like abstractive summarization or dialog.

We pre-train our mT5 model variants for 1 million steps on batches of 1024 length-1024 input sequences, corresponding to roughly 1 trillion input tokens total. This is the same amount of pre-training as T5 and about 1/6 as much as XLM-R.

We use the same inverse square-root learning rate schedule used by T5 during pre-training, with the learning rate set to 1/√(max(n, k)) where n is the current training iteration and k = 10^4 is the number of warm-up steps. Following the T5.1.1 recipe, we do not apply dropout during pre-training. We use the same self-supervised objective as T5, with 15% of tokens masked and an average noise span length of 3. We ablate some of these experimental details in section 4.2.
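A minimal sketch of this learning rate schedule (steps counted from 1; constant names are ours):

    import math

    WARMUP_STEPS = 10_000  # k

    def learning_rate(step: int) -> float:
        """Inverse square-root schedule: lr = 1 / sqrt(max(step, k))."""
        return 1.0 / math.sqrt(max(step, WARMUP_STEPS))

    learning_rate(1)       # 0.01, constant over the first 10k warm-up steps
    learning_rate(40_000)  # 0.005, decaying as 1/sqrt(step) thereafter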
For fine-tuning, we use a constant learning rate of 0.001 and dropout rate of 0.1 for all tasks. We use batch size 2^17 for most tasks but increased this up to 2^20 in a few cases based on performance on the validation set. For early stopping, we save checkpoints every 200 steps and choose the checkpoint with the highest validation performance.

4.1 Results

Table 2 presents our main results, with per-language breakdowns for each task given in the appendix. Our largest model mT5-XXL exceeds state-of-the-art on all classification and QA tasks and is near SOTA on NER (69.2 vs. 70.1). Note that unlike our model, InfoXLM (Chi et al., 2020) and VECO (Luo et al., 2020) benefit from parallel training data, while X-STILTs (Phang et al., 2020) leverages labeled data from tasks similar to the target task. Overall, our results highlight the importance of model capacity in cross-lingual representation learning and suggest that scaling up a simple pre-training recipe can be a viable alternative to more complex techniques relying on LM filtering, parallel data, or intermediate tasks.

In the "translate-train" setting, we exceed state-of-the-art on all XTREME classification and QA tasks. For these tasks, we fine-tune on the combination of the labeled English data and machine translations thereof.6 This allows direct comparison with both FILTER (Fang et al., 2020) as well as the XLM-R baseline of Fang et al. (2020). Note that this setup differs from XTREME "translate-train" (Hu et al., 2020), which excludes English.

Figure 2 shows that model capacity is key to improving performance on variants of the TyDi QA GoldP task in the absence of "gold" multilingual data: For the smallest model, training on gold datasets (in-language multitask) achieves dra-

6 We use the translation data provided by Hu et al. (2020) throughout. On the PAWS-X task, FILTER used translation data from the original task instead. Switching to this data would improve our scores slightly (mT5-XXL 91.5 → 92.0).
Model   T5            mT5
Small   87.2 / 79.1   84.7 / 76.4
Base    92.1 / 85.4   89.6 / 83.8
Large   93.8 / 86.7   93.0 / 87.0
XL      95.0 / 88.5   94.5 / 88.9
XXL     96.2 / 91.3   95.6 / 90.4

Table 3: Comparison of T5 vs. mT5 on SQuAD question answering (F1/EM).

Model                   Accuracy
Baseline (mT5-Large)    81.1
Dropout 0.1             77.6
Sequence length 512     80.5
Span length 10          78.6
α = 0.7                 80.7
α = 0.2                 80.7
No line length filter   79.1
Add Wikipedia data      80.3

Table 4: Average XNLI zero-shot accuracy of various ablations on our mT5-Large model. Per-language metrics are shown in the appendix.
Figure 3: Per-language error rates on XQuAD zero-shot, sorted by illegal rate. Incorrect: Not matching the target span. Illegal: Missing from the input context. Illegal after norm: Illegal even after Unicode NFKC normalization is applied to the prediction and context. (Panels: (a) mT5-Small, (b) mT5-XXL; y-axis: Percent.)
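For concreteness, the "illegal" and "illegal after norm" categories in the caption above can be checked roughly as follows (an illustrative sketch using Python's standard unicodedata module; exact-match comparison against the gold span is a simplification of the real evaluation):

    import unicodedata

    def classify_prediction(pred: str, context: str, gold: str) -> str:
        def norm(s: str) -> str:
            return unicodedata.normalize("NFKC", s)
        if pred == gold:
            return "correct"
        if pred in context:
            return "incorrect"           # wrong span, but present in the context
        if norm(pred) in norm(context):
            return "illegal"             # absent verbatim, recovered by NFKC normalization
        return "illegal after norm"      # absent even after normalizing both strings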
ify our inference procedure. As is common practice with encoder-based models, we could devise a task-specific fine-tuning mechanism that restricts the model to perform ranking over legal spans, removing the possibility of illegal predictions entirely. While this would likely improve our zero-shot metrics, it is unsatisfying for two reasons: First, it implies taking a step backward from the general text-to-text interface, as different tasks would demand different types of inference. Second, this solution won't extend to more "open-ended" zero-shot generative tasks like summarization, where the legal output space can't be easily delimited.

For these reasons, we consider a more general solution that remains within the text-to-text framework and can apply to all zero-shot generation tasks. Our motivating intuition is that the reason the model outputs English when given a non-English test input is that it has never observed a non-English target during fine-tuning. As English-only fine-tuning proceeds, the model's assigned likelihood of non-English tokens presumably decreases, eventually reaching the point where English becomes the most likely answer to any question.

To prevent the model from "forgetting" how to generate other languages, we use a strategy inspired by domain/task-adaptive pre-training (Howard and Ruder, 2018; Gururangan et al., 2020): We simply mix in our unsupervised multilingual pre-training task during fine-tuning. A similar approach was explored by Liu et al. (2020b). We use the same mC4 task definition as in pre-training, with two adjustments: First, we remove all "sentinel" tokens (corresponding to non-masked spans in the input text) from the target sequence, as otherwise we observe occasional sentinels in downstream predictions. Second, we reduce the language sampling parameter α from 0.3 to 0.1. This produces a near-uniform distribution of languages, encouraging the model to treat all languages as equally likely.8

8 Alternatively, one could mix in unlabeled data only for a single language at a time. However, we believe this is contrary to the spirit of multilingual models and zero-shot evaluation.

With these changes, we mix a small amount of our unsupervised task (covering 101 languages) into XQuAD fine-tuning, at a ratio of just 1:100. Figure 4 shows the results on XQuAD zero-shot error rates. The addition of even this small amount of multilingual data has a marked effect on the mT5-Small and mT5-Base models (where accidental translation was most rampant), reducing the illegal prediction rates by more than 70% (relative), and contributing to an overall reduction in errors.
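A minimal sketch of this kind of task mixing (illustrative only; the example iterators and the mixing mechanism are placeholders for whatever data pipeline is in use):

    import random

    def mix_tasks(xquad_examples, mc4_examples, ratio=100):
        """Yield labeled XQuAD examples, inserting roughly one multilingual
        span-corruption (mC4) example per `ratio` labeled examples."""
        for example in xquad_examples:
            yield ("xquad", example)
            if random.randrange(ratio) == 0:
                yield ("mc4", next(mc4_examples))

Here mc4_examples would draw from the unsupervised task with sentinel-free targets and α = 0.1 language sampling, as described above.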
Figure 4: Error rates of mT5 on XQuAD zero-shot. Baseline: Fine-tuning on XQuAD alone. Domain Pre-…

6 Conclusion

In this paper, we introduced mT5 and mC4: massively multilingual variants of the T5 model and C4 dataset. We demonstrated that the T5 recipe is straightforwardly applicable to the multilingual setting, and achieved strong performance on a diverse set of benchmarks. We also characterized illegal predictions that can occur in zero-shot evaluation of multilingual pre-trained generative models, and described a simple technique to avoid this issue. We release all code and pre-trained datasets used in this paper to facilitate future work on multilingual
language understanding.9

9 https://goo.gle/mt5-code

Acknowledgements

We thank Melvin Johnson for tips on the translate-train procedure for XTREME and Itai Rolnick for help with infrastructure.

References

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics.

Diedre Carmo, Marcos Piau, Israel Campiotti, Rodrigo Nogueira, and Roberto Lotufo. 2020. PTT5: Pre-training and validating the T5 model on Brazilian Portuguese data. arXiv preprint arXiv:2008.09144.

Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2020. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. arXiv preprint arXiv:2007.07834.

Hyung Won Chung, Thibault Févry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. 2020. Rethinking embedding coupling in pre-trained language models. arXiv preprint arXiv:2010.12821.

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, volume 32, pages 7059–7069.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

David Crystal. 2008. Two thousand million? English Today, 24(1):3–6.

Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582.

Pieter Delobelle, Thomas Winters, and Bettina Berendt. 2020. RobBERT: a Dutch RoBERTa-based language model. arXiv preprint arXiv:2001.06286.

Jacob Devlin. 2018. Multilingual BERT README. https://github.com/google-research/bert/blob/master/multilingual.md.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Yuwei Fang, Shuohang Wang, Zhe Gan, Siqi Sun, and Jingjing Liu. 2020. FILTER: An enhanced fusion method for cross-lingual language understanding. arXiv preprint arXiv:2009.05166.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

Gautier Izacard and Edouard Grave. 2020. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282.
Mihir Kale. 2020. Text-to-text pre-training for data-to-text tasks. arXiv preprint arXiv:2005.10433.

Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Unifying question answering and text classification via span extraction. arXiv preprint arXiv:1904.09286.

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907, Online. Association for Computational Linguistics.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoit Crabbé, Laurent Besacier, and Didier Schwab. 2020. FlauBERT: Unsupervised language model pre-training for French. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2479–2490, Marseille, France. European Language Resources Association.

Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, and Luke Zettlemoyer. 2020a. Pre-training via paraphrasing. arXiv preprint arXiv:2006.15020.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020b. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020a. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Zihan Liu, Genta Indra Winata, Andrea Madotto, and Pascale Fung. 2020b. Exploring fine-tuning techniques for pre-trained cross-lingual models via continual learning. arXiv preprint arXiv:2004.14218.

Fuli Luo, Wei Wang, Jiahao Liu, Yijia Liu, Bin Bi, Songfang Huang, Fei Huang, and Luo Si. 2020. VECO: Variable encoder-decoder pre-training for cross-lingual understanding and generation. arXiv preprint arXiv:2010.16046.

Martin Malmsten, Love Börjeson, and Chris Haffenden. 2020. Playing with words at the National Library of Sweden – Making a Swedish BERT. arXiv preprint arXiv:2007.01658.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219, Online. Association for Computational Linguistics.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. 2020. WT5?! Training text-to-text models to explain their predictions. arXiv preprint arXiv:2004.14546.

Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1037–1042, Online. Association for Computational Linguistics.

Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 708–718, Online. Association for Computational Linguistics.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.
Jason Phang, Phu Mon Htut, Yada Pruksachatkun, Haokun Liu, Clara Vania, Katharina Kann, Iacer Calixto, and Samuel R. Bowman. 2020. English intermediate-task training improves zero-shot cross-lingual transfer too. arXiv preprint arXiv:2005.13013.

Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3687–3692, Hong Kong, China. Association for Computational Linguistics.
Table 6: Statistics of the mC4 corpus, totaling 6.6B pages and 6.3T tokens. The “mT5” column indicates the
percentage of mT5 training data coming from a given language, using the default exponential smoothing value of
α=0.3. We list 107 “languages” as detected by cld3, but note six of these (marked “Latin”) are just Romanized
variants of existing languages.
Model en ar bg de el es fr hi ru sw th tr ur vi zh avg
Cross-lingual zero-shot transfer (models fine-tuned on English data only)
mBERT 80.8 64.3 68.0 70.0 65.3 73.5 73.4 58.9 67.8 49.7 54.1 60.9 57.2 69.3 67.8 65.4
XLM 82.8 66.0 71.9 72.7 70.4 75.5 74.3 62.5 69.9 58.1 65.5 66.4 59.8 70.7 70.2 69.1
XLM-R 88.7 77.2 83.0 82.5 80.8 83.7 82.2 75.6 79.1 71.2 77.4 78.0 71.7 79.3 78.2 79.2
mT5-Small 79.6 65.2 71.3 69.2 68.6 72.7 70.7 62.5 70.1 59.7 66.3 64.4 59.9 66.3 65.8 67.5
mT5-Base 84.7 73.3 78.6 77.4 77.1 80.3 79.1 70.8 77.1 69.4 73.2 72.8 68.3 74.2 74.1 75.4
mT5-Large 89.4 79.8 84.1 83.4 83.2 84.2 84.1 77.6 81.5 75.4 79.4 80.1 73.5 81.0 80.3 81.1
mT5-XL 90.6 82.2 85.4 85.8 85.4 81.3 85.3 80.4 83.7 78.6 80.9 82.0 77.0 81.8 82.7 82.9
mT5-XXL 91.6 84.5 87.7 87.3 87.3 87.8 86.9 83.2 85.1 80.3 81.7 83.8 79.8 84.6 83.6 84.5
Translate-train (models fine-tuned on English training data plus translations in all target languages)
mT5-Small 69.5 63.7 67.5 65.7 66.4 67.5 67.3 61.9 66.4 59.6 63.9 63.5 60.4 63.3 64.5 64.7
mT5-Base 82.0 74.4 78.5 77.7 78.1 79.1 77.9 72.2 76.5 71.5 75.0 74.8 70.4 74.5 76.0 75.9
mT5-Large 88.3 80.3 84.1 84.0 83.7 84.9 83.8 79.8 82.0 76.4 79.9 81.0 75.9 81.3 81.7 81.8
mT5-XL 90.9 84.2 86.8 86.8 86.4 87.4 86.8 83.1 84.9 81.3 82.3 84.4 79.4 83.9 84.0 84.8
mT5-XXL 92.7 87.2 89.4 89.8 89.5 90.0 89.1 86.5 87.6 84.3 85.6 87.1 83.8 87.5 86.5 87.8
Model en de es fr ja ko zh avg
Cross-lingual zero-shot transfer (models fine-tuned on English data only)
mBERT 94.0 85.7 87.4 87.0 73.0 69.6 77.0 81.9
XLM 94.0 85.9 88.3 87.4 69.3 64.8 76.5 80.9
XLM-R 94.7 89.7 90.1 90.4 78.7 79.0 82.3 86.4
mT5-Small 92.2 86.2 86.1 86.6 74.7 73.5 77.9 82.4
mT5-Base 95.4 89.4 89.6 91.2 79.8 78.5 81.1 86.4
mT5-Large 96.1 91.3 92.0 92.7 82.5 82.7 84.7 88.9
mT5-XL 96.0 92.8 92.7 92.4 83.6 83.1 86.5 89.6
mT5-XXL 96.3 92.9 92.6 92.7 84.5 83.9 87.2 90.0
Translate-train (models fine-tuned on English training data plus translations in all target languages)
mT5-Small 87.9 81.4 83.1 84.1 74.2 71.7 76.7 79.9
mT5-Base 95.5 90.9 91.4 92.5 83.6 84.8 86.4 89.3
mT5-Large 96.4 92.7 93.3 93.6 86.5 87.4 88.4 91.2
mT5-XL 96.4 92.5 93.1 93.6 85.5 86.9 89.0 91.0
mT5-XXL 96.1 92.9 93.6 94.2 87.0 87.9 89.0 91.5
Model en ar de el es hi ru th tr vi zh avg
Cross-lingual zero-shot transfer (models fine-tuned on English data only)
mBERT 83.5 / 72.2 61.5 / 45.1 70.6 / 54.0 62.6 / 44.9 75.5 / 56.9 59.2 / 46.0 71.3 / 53.3 42.7 / 33.5 55.4 / 40.1 69.5 / 49.6 58.0 / 48.3 64.5 / 49.4
XLM 74.2 / 62.1 61.4 / 44.7 66.0 / 49.7 57.5 / 39.1 68.2 / 49.8 56.6 / 40.3 65.3 / 48.2 35.4 / 24.5 57.9 / 41.2 65.8 / 47.6 49.7 / 39.7 59.8 / 44.3
XLM-R 86.5 / 75.7 68.6 / 49.0 80.4 / 63.4 79.8 / 61.7 82.0 / 63.9 76.7 / 59.7 80.1 / 64.3 74.2 / 62.8 75.9 / 59.3 79.1 / 59.0 59.3 / 50.0 76.6 / 60.8
mT5-Small 78.5 / 66.1 51.4 / 34.0 63.8 / 45.9 53.8 / 33.4 67.0 / 50.3 47.8 / 34.5 50.5 / 30.1 54.0 / 44.5 55.7 / 38.9 58.1 / 41.3 58.9 / 48.7 58.1 / 42.5
mT5-Base 84.6 / 71.7 63.8 / 44.3 73.8 / 54.5 59.6 / 35.6 74.8 / 56.1 60.3 / 43.4 57.8 / 34.7 57.6 / 45.7 67.9 / 48.2 70.7 / 50.3 66.1 / 54.1 67.0 / 49.0
mT5-Large 88.4 / 77.3 75.2 / 56.7 80.0 / 62.9 77.5 / 57.6 81.8 / 64.2 73.4 / 56.6 74.7 / 56.9 73.4 / 62.0 76.5 / 56.3 79.4 / 60.3 75.9 / 65.5 77.8 / 61.5
mT5-XL 88.8 / 78.1 77.4 / 60.8 80.4 / 63.5 80.4 / 61.2 82.7 / 64.5 76.1 / 60.3 76.2 / 58.8 74.2 / 62.5 77.7 / 58.4 80.5 / 60.8 80.5 / 71.0 79.5 / 63.6
mT5-XXL 90.9 / 80.1 80.3 / 62.6 83.1 / 65.5 83.3 / 65.5 85.1 / 68.1 81.7 / 65.9 79.3 / 63.6 77.8 / 66.1 80.2 / 60.9 83.1 / 63.6 83.1 / 73.4 82.5 / 66.8
Translate-train (models fine-tuned on English training data plus translations in all target languages)
mT5-Small 74.0 / 61.2 61.0 / 45.0 66.0 / 50.2 64.1 / 47.2 67.5 / 50.8 60.2 / 43.7 64.4 / 46.7 58.9 / 52.9 59.0 / 39.4 63.5 / 46.0 68.2 / 61.2 64.3 / 49.5
mT5-Base 83.1 / 70.3 72.4 / 55.2 76.9 / 59.7 76.8 / 58.8 79.0 / 61.2 71.4 / 53.4 76.1 / 58.5 67.9 / 62.0 72.5 / 51.4 75.9 / 56.3 76.9 / 69.7 75.3 / 59.7
mT5-Large 87.3 / 75.5 79.4 / 62.7 82.7 / 66.0 81.8 / 63.5 83.8 / 66.1 78.0 / 59.8 81.9 / 66.3 74.7 / 68.2 80.2 / 59.2 80.4 / 60.8 83.2 / 76.9 81.2 / 65.9
mT5-XL 88.5 / 77.1 80.9 / 65.4 83.4 / 66.7 83.6 / 64.9 84.9 / 68.2 79.6 / 63.1 82.7 / 67.1 78.5 / 72.9 82.4 / 63.8 82.4 / 64.1 83.2 / 75.9 82.7 / 68.1
mT5-XXL 91.3 / 80.3 83.4 / 68.2 85.0 / 68.2 85.9 / 68.9 87.4 / 70.8 83.7 / 68.2 85.2 / 70.4 80.2 / 74.5 84.4 / 67.7 85.3 / 67.1 85.7 / 80.0 85.2 / 71.3
Model ar bg de el en es fr hi ru sw th tr ur vi zh avg
Baseline (mT5-Large) 79.8 84.1 83.4 83.2 89.4 84.2 84.1 77.6 81.5 75.4 79.4 80.1 73.5 81.0 80.3 81.1
Dropout 0.1 76.4 82.1 81.7 81.0 88.0 70.8 80.3 74.4 79.0 72.3 75.8 75.9 70.6 78.6 76.5 77.6
Sequence length 512 78.1 83.4 83.1 82.1 88.8 84.5 82.8 77.3 81.2 75.4 78.2 79.6 73.8 80.0 78.9 80.5
Span length 10 77.6 81.5 80.5 81.2 87.2 83.0 81.2 74.7 79.8 73.6 76.7 75.9 71.3 78.6 76.5 78.6
α = 0.7 79.3 84.1 84.5 83.1 89.4 85.3 84.4 76.4 82.8 70.6 78.7 79.8 71.7 80.3 79.9 80.7
α = 0.2 78.7 83.8 83.3 82.5 89.3 83.4 83.6 77.3 81.2 75.4 78.6 79.4 73.9 79.9 79.7 80.7
No line length filter 78.4 83.3 81.5 81.4 88.9 83.8 82.5 74.4 80.5 69.4 77.6 76.9 71.3 78.8 78.3 79.1
Add Wikipedia data 79.3 83.1 83.1 82.7 88.6 80.1 83.2 77.3 81.4 75.0 78.9 79.3 73.5 80.2 79.2 80.3
Table 13: XNLI zero-shot accuracy of various ablations on our mT5-Large model.
Model en ar de el es hi ru th tr vi zh avg
Baseline (mT5-Large) 88.4 / 77.3 75.2 / 56.7 80.0 / 62.9 77.5 / 57.6 81.8 / 64.2 73.4 / 56.6 74.7 / 56.9 73.4 / 62.0 76.5 / 56.3 79.4 / 60.3 75.9 / 65.5 77.8 / 61.5
Span length 10 88.1 / 76.3 70.0 / 50.6 78.1 / 60.2 68.8 / 44.0 79.0 / 60.8 67.3 / 48.4 65.4 / 43.3 68.1 / 57.2 74.4 / 53.6 77.9 / 57.7 76.6 / 66.4 74.0 / 56.2
Dropout 0.1 87.3 / 76.0 54.9 / 33.9 77.6 / 60.2 64.4 / 40.1 79.2 / 60.6 59.1 / 40.4 59.5 / 38.4 65.7 / 51.0 73.6 / 52.8 75.8 / 55.8 77.0 / 64.5 70.4 / 52.1
Sequence length 512 88.0 / 76.9 77.0 / 59.6 80.2 / 62.4 79.8 / 60.0 81.7 / 64.4 75.1 / 57.5 77.4 / 58.5 72.7 / 59.8 75.3 / 53.9 79.4 / 58.9 78.5 / 67.2 78.6 / 61.7
α = 0.7 88.4 / 77.1 76.5 / 58.8 78.5 / 59.8 77.2 / 55.5 78.7 / 59.5 74.6 / 56.8 73.1 / 54.5 72.5 / 60.2 75.7 / 55.0 79.2 / 58.3 78.6 / 66.2 77.5 / 60.2
α = 0.2 87.9 / 76.8 75.5 / 57.3 80.2 / 62.4 76.2 / 54.0 81.6 / 63.7 73.7 / 57.0 70.7 / 50.8 72.2 / 60.4 75.5 / 55.7 79.7 / 59.7 78.3 / 67.5 77.4 / 60.5
No line length filter 88.9 / 77.4 73.8 / 54.0 80.8 / 62.7 74.2 / 51.8 80.9 / 62.8 74.1 / 56.6 75.0 / 56.4 71.7 / 60.3 76.7 / 56.0 78.8 / 58.6 78.5 / 67.1 77.6 / 60.3
Add Wikipedia data 89.3 / 78.4 69.6 / 48.9 79.6 / 61.1 59.5 / 36.0 80.6 / 61.0 73.6 / 55.0 68.7 / 47.0 70.5 / 58.1 76.7 / 56.9 78.6 / 56.4 77.5 / 66.3 74.9 / 56.8
Table 14: XQuAD zero-shot F1/EM of various ablations on our mT5-Large model.