6. Why can Pre-LN be trained?
• Pre-LN has a gradient path that is not affected by LN
– not affected by LN = the gradient does not decay
• Let us check the Post-LN and Pre-LN formulas
– Let the input be x, and let the attention / FFN sub-layer be F(·)
Forward computation and gradient (see the reconstruction below)
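In LaTeX form, the forward computations and their gradients, as given by Equations (1)-(4) in the paper excerpt quoted later in this section:

\[
\begin{aligned}
\text{Forward:}\quad
  & \mathrm{PostLN}(x) = \mathrm{LN}(x + F(x)),
  && \mathrm{PreLN}(x) = x + F(\mathrm{LN}(x)), \\[4pt]
\text{Gradient:}\quad
  & \frac{\partial\,\mathrm{PostLN}(x)}{\partial x}
      = \frac{\partial\,\mathrm{LN}(x + F(x))}{\partial (x + F(x))}
        \left( I + \frac{\partial F(x)}{\partial x} \right),
  && \frac{\partial\,\mathrm{PreLN}(x)}{\partial x}
      = I + \frac{\partial F(\mathrm{LN}(x))}{\partial\,\mathrm{LN}(x)}\,
            \frac{\partial\,\mathrm{LN}(x)}{\partial x}.
\end{aligned}
\]

In the Pre-LN gradient, the identity term I comes from the residual connection and never passes through an LN derivative, which is exactly the slide's point.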
... Pre-LN prevents it, as shown in Figure 1. In particular, we verify that the layer normalization is a significant cause of the vanishing gradient problem by comparing the gradient vector norms of gradient flows for each layer location during back-propagation. These analyses lead to a novel idea that can satisfy higher stability without layer normalizations and provide better performance than Pre-LN regardless of their layer sizes. Specifically, we propose a method that is based on Post-LN Transformers with two different components: 1. additional residual connections, and 2. simple layers without model parameters (constant values) as the replacement of the layer normalizations. We conduct experiments on a wide range of text generation tasks, namely machine translation, summarization, language modeling, and automatic speech recognition. We obtain the following three new major findings from our experiments:
2. Our modifications enable Post-LN Transformers to stack many layers.
3. Our method can maintain the advantage of Post-LN in performance and remove the unstable training property, and thus provide better performance than Pre-LN regardless of their layer sizes.
2. Post-LN and Pre-LN Transformers
We briefly describe Post-LN and Pre-LN Transformers. The original Transformer (Vaswani et al., 2017) uses Post-LN, in which layer normalizations are located after each residual connection. Let x be an input of a sub-layer, and F(·) be a sub-layer of the Transformer such as a feed-forward network or multi-head attention. Post-LN is defined as follows:
PostLN(x) = LN(x + F(x)),   (1)
where LN(·) is the layer normalization function.
In contrast, Pre-LN places the layer normalization before the input of each sub-layer:
PreLN(x) = x + F(LN(x)).   (2)
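As an illustrative sketch only (not the paper's code), the two placements in PyTorch; the sub-layer F is stubbed with a single linear layer rather than real attention or FFN blocks:

import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN, Eq. (1): LayerNorm is applied after the residual addition."""
    def __init__(self, d_model: int):
        super().__init__()
        self.F = nn.Linear(d_model, d_model)   # stand-in for attention / FFN
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.ln(x + self.F(x))          # PostLN(x) = LN(x + F(x))

class PreLNBlock(nn.Module):
    """Pre-LN, Eq. (2): LayerNorm is applied only to the sub-layer input."""
    def __init__(self, d_model: int):
        super().__init__()
        self.F = nn.Linear(d_model, d_model)   # stand-in for attention / FFN
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.F(self.ln(x))          # PreLN(x) = x + F(LN(x))

x = torch.randn(2, 8, 16)
print(PostLNBlock(16)(x).shape, PreLNBlock(16)(x).shape)

Note that in PreLNBlock the residual path x + ... never goes through self.ln; this is the property analyzed in the gradient discussion below.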
Figure 5. Cosine similarities among outputs of each layer.
Figure 4. Gradient norms of each location in the 18th decoder layer of a Post-LN Transformer encoder-decoder trained on WMT English-to-German translation data.
Figure 2 (a) and (b) illustrate the Post-LN and Pre-LN Transformer architectures, respectively.
Gradients of Transformer Layers
The gradient norms decay exponentially as they are back-propagated to shallower layers. This result is consistent with the previous study (Liu et al., 2020). We consider that this vanishing gradient causes the difficulty in stacking many layers with the Post-LN setting, as shown in Figure 1.
To explore the vanishing gradient in more detail empirically, we check the gradient norms at parts (1)-(5) in Figure 2 (a). Figure 4 shows the gradient norms of each part at the 18th layer. This figure indicates that the gradient norms from (4) to (3) and from (2) to (1) drastically decrease. These parts correspond to layer normalizations, as in Figure 4. Thus, the layer normalizations in Post-LN Transformers probably cause the vanishing gradient problem.
To investigate the difference in gradient flows between Post-LN and Pre-LN theoretically, we calculate the derivatives of Equations (1) and (2). The derivatives are as follows:
\[
\frac{\partial\,\mathrm{PostLN}(x)}{\partial x}
  = \frac{\partial\,\mathrm{LN}(x + F(x))}{\partial (x + F(x))}
    \left( I + \frac{\partial F(x)}{\partial x} \right), \qquad (3)
\]
\[
\frac{\partial\,\mathrm{PreLN}(x)}{\partial x}
  = I + \frac{\partial F(\mathrm{LN}(x))}{\partial\,\mathrm{LN}(x)}\,
        \frac{\partial\,\mathrm{LN}(x)}{\partial x}, \qquad (4)
\]
where I is the identity matrix. As shown in Equation (3), the derivative of Post-LN is equal to the product of two derivatives: one is the derivative of the layer normalization, and the other consists of the residual connection and the sub-layer F. In contrast, in Pre-LN, the derivative of the residual connection is isolated from the term related to the derivative of the layer normalization.
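Following directly from Equations (3) and (4), the same difference compounds over a stack of N sub-layers; writing x_{n+1} = PostLN(x_n) and x_{n+1} = PreLN(x_n) respectively and applying the chain rule:

\[
\text{Post-LN:}\quad
\frac{\partial x_{N}}{\partial x_{1}}
  = \prod_{n=1}^{N-1}
    \frac{\partial\,\mathrm{LN}(x_n + F(x_n))}{\partial (x_n + F(x_n))}
    \left( I + \frac{\partial F(x_n)}{\partial x_n} \right),
\]
\[
\text{Pre-LN:}\quad
\frac{\partial x_{N}}{\partial x_{1}}
  = \prod_{n=1}^{N-1}
    \left( I + \frac{\partial F(\mathrm{LN}(x_n))}{\partial\,\mathrm{LN}(x_n)}
               \frac{\partial\,\mathrm{LN}(x_n)}{\partial x_n} \right)
  = I + (\text{terms involving the sub-layers}).
\]

In Post-LN every factor of the product contains an LN derivative, so if these derivatives shrink the gradient (as the norms measured in Figure 4 indicate), the attenuation multiplies across layers. In Pre-LN, expanding the product always leaves a bare identity term, so one gradient path reaches the shallowest layers without passing through any LN derivative.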
Post-LN vs. Pre-LN (slide annotations on the two gradient equations):
– Post-LN: the whole gradient is a product with the derivative of LN → the gradient decays.
– Pre-LN: there is a term that is independent of the derivative of LN → it helps preserve the gradient; the residual connection bypasses LN.
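A minimal numerical check of the "product with the LN derivative" point (an illustration, not the paper's experiment): back-propagating a fixed gradient through a single LayerNorm with its default unit gain scales the gradient norm by roughly one over the standard deviation of the LN input.

import torch
import torch.nn as nn

torch.manual_seed(0)
d = 512
ln = nn.LayerNorm(d)              # default affine parameters: weight = 1, bias = 0
g_out = torch.randn(d)            # a gradient arriving from the layers above

for s in [1.0, 2.0, 4.0, 8.0]:
    z = (s * torch.randn(d)).requires_grad_()   # LN input with std ~ s
    ln(z).backward(g_out)                        # back-propagate g_out through LN
    ratio = (z.grad.norm() / g_out.norm()).item()
    print(f"LN input std ~ {s}: ||grad_in|| / ||grad_out|| = {ratio:.3f}")

On a typical run the printed ratios fall roughly as 1/s. This is the attenuation that the LN-derivative factor in Equation (3) contributes at every Post-LN layer whenever the LN input x + F(x) has standard deviation larger than 1, whereas the identity term in Equation (4) never incurs this factor.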