Unified Normalization for Accelerating and Stabilizing Transformers

Yang, Qiming; Zhang, Kai; Lan, Chaoxiang; Yang, Zhi; Li, Zheyang; Tan, Wenming; Xiao, Jun; Pu, Shiliang

doi:10.1145/3503161.3547860

Abstract: Solid results from Transformers have made them the prevailing architecture in various natural language and vision tasks. As a default component in Transformers, Layer Normalization (LN) normalizes activations within each token to improve robustness. However, LN requires on-the-fly statistics calculation at inference time, as well as division and square root operations, which makes it inefficient on hardware. Moreover, replacing LN with other hardware-efficient normalization schemes (e.g., Batch Normalization) results in inferior performance, or even collapse in training. We find that this dilemma is caused by abnormal behaviors of activation statistics, including large fluctuations over iterations and extreme outliers across layers. To tackle these issues, we propose Unified Normalization (UN), which speeds up inference by being fused with other linear operations and achieves performance on par with LN. UN boosts performance by calibrating the activation and gradient statistics with a tailored fluctuation smoothing strategy. Meanwhile, an adaptive outlier filtration strategy is applied to avoid collapse in training; its effectiveness is theoretically proved and experimentally verified in this paper. We demonstrate that UN can be an efficient drop-in alternative to LN by conducting extensive experiments on language and vision tasks. In addition, we evaluate the efficiency of our method on GPU: Transformers equipped with UN enjoy about 31% inference speedup and nearly 18% memory reduction. Code will be released at this https URL.
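
The fusion mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code and does not implement UN itself. It only assumes a generic normalization whose statistics are fixed constants at inference time (as in Batch Normalization), followed by a linear layer. The function name `fold_norm_into_linear` and all parameter names are hypothetical, chosen for illustration.

```python
import numpy as np

def fold_norm_into_linear(mean, var, gamma, beta, W, b, eps=1e-5):
    """Fold a fixed-statistics normalization into the linear layer that follows it.

    Assumed (illustrative) setup:
        n = gamma * (x - mean) / sqrt(var + eps) + beta
        y = n @ W.T + b
    Because mean and var are constants at inference, the normalization is an
    affine map and can be merged into W and b ahead of time.
    """
    scale = gamma / np.sqrt(var + eps)   # per-feature scale, shape (d_in,)
    shift = beta - mean * scale          # per-feature shift, shape (d_in,)
    W_fused = W * scale                  # scales each input column of W
    b_fused = W @ shift + b              # pushes the shift through the linear layer
    return W_fused, b_fused

# Small random check that the fused layer matches the two-step computation.
rng = np.random.default_rng(0)
d_in, d_out = 8, 4
x = rng.normal(size=(5, d_in))
mean = rng.normal(size=d_in)
var = rng.uniform(0.5, 2.0, size=d_in)
gamma = rng.normal(size=d_in)
beta = rng.normal(size=d_in)
W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)

reference = (gamma * (x - mean) / np.sqrt(var + 1e-5) + beta) @ W.T + b
W_fused, b_fused = fold_norm_into_linear(mean, var, gamma, beta, W, b)
fused = x @ W_fused.T + b_fused
max_abs_diff = float(np.max(np.abs(reference - fused)))
```

After folding, inference needs a single matrix multiply and no per-token mean, variance, division, or square root. LN cannot be folded this way because its statistics depend on each input token.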

Comments:	ACM MM'22
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2208.01313 [cs.CV]
	(or arXiv:2208.01313v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2208.01313
Related DOI:	https://doi.org/10.1145/3503161.3547860
