Optimizing Deeper Transformers on Small Datasets

Xu, Peng; Kumar, Dhruv; Yang, Wei; Zi, Wenjie; Tang, Keyi; Huang, Chenyang; Cheung, Jackie Chi Kit; Prince, Simon J. D.; Cao, Yanshuai

arXiv:2012.15355v2 (cs)

[Submitted on 30 Dec 2020 (v1), revised 19 May 2021 (this version, v2), latest version 31 May 2021 (v4)]

Abstract: It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, people usually use shallow and simple additional layers on top of pre-trained models during fine-tuning. This work shows that this does not always need to be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train 48 layers of transformers, comprising 24 fine-tuned layers from pre-trained RoBERTa and 24 relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain state-of-the-art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work. Further error analysis shows that increasing depth can help improve generalization on small datasets for hard cases that require reasoning and structural understanding.

Comments:	Accepted at ACL 2021
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2012.15355 [cs.CL]
	(or arXiv:2012.15355v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2012.15355

Submission history

From: Peng Xu
[v1] Wed, 30 Dec 2020 22:53:49 UTC (417 KB)
[v2] Wed, 19 May 2021 17:12:23 UTC (459 KB)
[v3] Thu, 27 May 2021 16:53:14 UTC (912 KB)
[v4] Mon, 31 May 2021 16:45:47 UTC (913 KB)
