MoEfication: Transformer Feed-forward Layers are Mixtures of Experts

Zhang, Zhengyan; Lin, Yankai; Liu, Zhiyuan; Li, Peng; Sun, Maosong; Zhou, Jie

arXiv:2110.01786 (cs)

[Submitted on 5 Oct 2021 (v1), last revised 5 Apr 2022 (this version, v3)]

Abstract: Recent work has shown that feed-forward networks (FFNs) in pre-trained Transformers are a key component, storing various linguistic and factual knowledge. However, the computational patterns of FFNs are still unclear. In this work, we study the computational patterns of FFNs and observe that most inputs activate only a tiny fraction of the neurons in FFNs. This phenomenon is similar to the sparsity of the human brain, which drives research on functional partitions of the human brain. To verify whether functional partitions also emerge in FFNs, we propose to convert a model into its mixture-of-experts (MoE) version with the same parameters, namely MoEfication. Specifically, MoEfication consists of two phases: (1) splitting the parameters of FFNs into multiple functional partitions as experts, and (2) building expert routers to decide which experts will be used for each input. Experimental results show that MoEfication can conditionally use 10% to 30% of FFN parameters while maintaining over 95% of the original performance for different models on various downstream tasks. Besides, MoEfication brings two advantages: (1) it significantly reduces the FLOPs of inference, i.e., a 2x speedup with 25% of FFN parameters, and (2) it provides a fine-grained perspective to study the inner mechanism of FFNs. The source code of this paper can be obtained from this https URL.
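
The two phases named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the contiguous split in `split_into_experts` and the activation-based `oracle_router` are hypothetical stand-ins chosen for brevity, and all function names are assumptions. The sketch shows only the underlying idea: a ReLU FFN is a sum over neurons, so grouping neurons into experts leaves the parameters unchanged, and skipping experts whose neurons are inactive does not change the output.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neuron_activation(x, w_in, bias):
    # ReLU activation of a single FFN neuron.
    return max(0.0, dot(w_in, x) + bias)


def dense_ffn(x, w1, b1, w2):
    # Standard FFN: y = sum_j relu(w1[j] . x + b1[j]) * w2[j].
    # w1[j] is the input weight vector of neuron j, w2[j] its output vector.
    y = [0.0] * len(x)
    for j in range(len(w1)):
        h = neuron_activation(x, w1[j], b1[j])
        for i in range(len(y)):
            y[i] += h * w2[j][i]
    return y


def split_into_experts(d_ff, num_experts):
    # Phase 1 (assumed placeholder): partition neuron indices into equal,
    # contiguous groups. A functional partition would group neurons by
    # behaviour instead; this split only keeps the example short.
    if d_ff % num_experts != 0:
        raise ValueError("d_ff must be divisible by num_experts")
    size = d_ff // num_experts
    return [list(range(e * size, (e + 1) * size)) for e in range(num_experts)]


def oracle_router(x, w1, b1, experts, k):
    # Phase 2 (assumed placeholder): score each expert by the total
    # activation of its neurons and keep the top k. This computes every
    # activation, so it saves no work; a practical router has to be cheaper.
    scores = [
        sum(neuron_activation(x, w1[j], b1[j]) for j in idx) for idx in experts
    ]
    ranked = sorted(range(len(experts)), key=lambda e: -scores[e])
    return sorted(ranked[:k])


def moe_ffn(x, w1, b1, w2, experts, selected):
    # Same parameters as dense_ffn, but only the selected experts' neurons
    # contribute to the output.
    y = [0.0] * len(x)
    for e in selected:
        for j in experts[e]:
            h = neuron_activation(x, w1[j], b1[j])
            for i in range(len(y)):
                y[i] += h * w2[j][i]
    return y


if __name__ == "__main__":
    # Toy input where only the neurons of expert 1 are active.
    x = [1.0, 2.0]
    w1 = [[-1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    b1 = [0.0, 0.0, 0.0, 0.0]
    w2 = [[1.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    experts = split_into_experts(d_ff=4, num_experts=2)
    selected = oracle_router(x, w1, b1, experts, k=1)
    print(selected, moe_ffn(x, w1, b1, w2, experts, selected))
```

In this toy case half of the FFN neurons are skipped and the result still matches `dense_ffn` exactly, because the skipped neurons have zero activation. With real models the match is only approximate, which is why the abstract reports retained performance rather than exact equivalence.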

Comments:	Accepted to ACL Findings 2022
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2110.01786 [cs.CL]
	(or arXiv:2110.01786v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2110.01786

Submission history

From: Zhengyan Zhang
[v1] Tue, 5 Oct 2021 02:14:38 UTC (182 KB)
[v2] Fri, 15 Oct 2021 13:47:51 UTC (321 KB)
[v3] Tue, 5 Apr 2022 07:35:52 UTC (1,527 KB)
