Batch Normalization
1. Introduction
Training Deep Neural Networks is a difficult task that involves several
problems to tackle. Despite their huge potential, they can be slow to train and
prone to overfitting. Thus, methods to solve these problems are a constant
topic in Deep Learning research.
Batch Normalization – commonly abbreviated as Batch Norm – is one of these
methods. Currently, it is a widely used technique in the field of Deep Learning.
It improves the learning speed of Neural Networks and provides
regularization, avoiding overfitting.
But why is it so important? How does it work? Furthermore, how can it be
applied to non-regular networks such as Convolutional Neural Networks?
2. Normalization
To fully understand how Batch Norm works and why it is important, let’s start
by talking about normalization.
Normalization is a pre-processing technique used to standardize data. In
other words, it puts data coming from different sources onto the same scale. Not
normalizing the data before training can cause problems in our network,
making it drastically harder to train and decreasing its learning speed.
For example, imagine we have a car rental service. Firstly, we want to predict
a fair price for each car based on competitors’ data. We have two features per
car: the age in years and the total number of kilometers it has been driven.
These features have very different ranges: age might go from 0 to 30 years, while
distance could go from 0 up to hundreds of thousands of kilometers. We don’t
want features to have such different ranges, as the feature with the larger
range might bias our model into giving it inflated importance.
There are two main methods to normalize our data. The most straightforward
method is to scale it to a range from 0 to 1:
$$x_{norm} = \frac{x - m}{x_{max} - x_{min}}$$

where $x$ is the data point to normalize, $m$ the mean of the data set, $x_{max}$ the highest
value, and $x_{min}$ the lowest value. This technique is generally applied to the
input data. Non-normalized data points with wide ranges can
cause instability in Neural Networks: relatively large inputs can cascade
down through the layers, causing problems such as exploding gradients.
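As a quick illustration, here is a minimal NumPy sketch of this scaling applied to the car-rental example; the feature values are made up for the illustration.

import numpy as np

# Hypothetical car data: column 0 = age in years, column 1 = kilometers driven
X = np.array([
    [ 2.0,  35_000.0],
    [10.0, 180_000.0],
    [25.0, 320_000.0],
    [ 5.0,  60_000.0],
])

# Range scaling, per feature, as in the formula above: (x - mean) / (max - min)
X_scaled = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)   # both features now live on a comparable, unit-range scale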
The other technique used to normalize data is forcing the data points to have
a mean of 0 and a standard deviation of 1, using the following formula:

$$z = \frac{x - m}{s}$$

where $x$ is the data point to normalize, $m$ the mean of the data set, and $s$ the
standard deviation of the data set. Now, the distribution of each feature mimics a standard
normal distribution. With all the features on this scale, none of them will
bias the model, and therefore, our models will learn better.
In Batch Norm, we use this last technique to normalize batches of data inside
the network itself.
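Here is a minimal NumPy sketch of this standardization on the same made-up car features; Batch Norm applies the same per-feature operation, but to activations inside the network and per mini-batch.

import numpy as np

# Hypothetical car data: age in years, kilometers driven
X = np.array([
    [ 2.0,  35_000.0],
    [10.0, 180_000.0],
    [25.0, 320_000.0],
    [ 5.0,  60_000.0],
])

# Standardize each feature: subtract its mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))   # ~0 for every feature
print(X_std.std(axis=0))    # ~1 for every feature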
3. Batch Normalization
Batch Norm is a normalization technique applied between the layers of a
Neural Network instead of to the raw data. It is computed along mini-batches
instead of the full data set. It speeds up training and allows the use of higher
learning rates, making learning easier.
Following the technique explained in the previous section, we can define the
normalization formula of Batch Norm as:
$$z^{norm} = \frac{z - m_z}{s_z}$$

where $m_z$ is the mean of the neurons’ output and $s_z$ the standard deviation of the
neurons’ output.
Batch Norm is applied to the neurons’ output just before applying the activation
function. Usually, a neuron without Batch Norm is computed as follows:

$$z = g(w, x) + b, \quad a = f(z)$$

where $g()$ is the linear transformation of the neuron, $w$ the weights of the neuron, $b$ its
bias, and $f()$ the activation function. With Batch Norm, the neuron is instead computed as:

$$z = g(w, x), \quad z^{N} = \frac{z - m_z}{s_z} \cdot \gamma + \beta, \quad a = f(z^{N})$$

where $z^{N}$ is the output of Batch Norm, $m_z$ the mean of the neurons’ output, $s_z$ the
standard deviation of the neurons’ output, and $\gamma$ and $\beta$ learning parameters
of Batch Norm. Note that the bias of the neuron ($b$) is removed. This is
because, as we subtract the mean $m_z$, any constant added to the values of $z$ –
such as $b$ – can be ignored, since it is subtracted right back out.
The parameters $\beta$ and $\gamma$ shift the mean and the standard deviation,
respectively. Thus, the output of Batch Norm over a layer results in a
distribution with a mean of $\beta$ and a standard deviation of $\gamma$. These values are
learned over the epochs, together with the other learning parameters such as the weights
of the neurons, with the aim of decreasing the loss of the model.
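To make the formulas concrete, here is a minimal NumPy sketch of the transformation applied to a batch of pre-activations; the batch values, 𝛾, 𝛽 and the epsilon constant are illustrative choices, not values from the article.

import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    # z has shape (batch_size, n_neurons); statistics are per neuron, over the batch
    m_z = z.mean(axis=0)
    var_z = z.var(axis=0)
    z_hat = (z - m_z) / np.sqrt(var_z + eps)   # standardize: mean 0, std 1 per neuron
    return gamma * z_hat + beta                # learnable re-scale (gamma) and re-shift (beta)

# Toy batch of 4 examples for a layer of 3 neurons
z = np.random.randn(4, 3) * 5.0 + 2.0
gamma, beta = np.ones(3), np.zeros(3)          # in a real network, learned by gradient descent
z_bn = batch_norm(z, gamma, beta)
print(z_bn.mean(axis=0), z_bn.std(axis=0))     # ~0 and ~1 per neuron, since gamma=1 and beta=0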
# Batch Norm between fully connected layers (Keras)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential([
    Dense(16, input_shape=(1, 5), activation='relu'),
    BatchNormalization(),   # normalizes the outputs of the previous Dense layer
    Dense(32, activation='relu'),
    BatchNormalization(),
    Dense(2, activation='softmax')
])
# Batch Norm between convolutional layers (Keras)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, BatchNormalization

model = Sequential([
    Conv2D(32, (3, 3), input_shape=(28, 28, 3), activation='relu'),
    BatchNormalization(),   # normalizes per filter, across the batch
    Conv2D(32, (3, 3), activation='relu'),
    BatchNormalization(),
    MaxPooling2D(),
    Flatten(),              # flatten the feature maps before the classifier head
    Dense(2, activation='softmax')
])
Source: https://towardsdatascience.com/batch-normalization-in-3-levels-of-understanding-14c2da90a338
Multilayer Perceptron (MLP) with batch normalization (BN) | Credit : author - Design : Lou HD
B.1) Principle
B.1.1) Training
During training, BN first computes the mean and the standard deviation of the
activations over the current batch, with (1) and (2). It then normalizes the activation
vector Z^(i) with (3). That way, each neuron’s output follows a standard normal
distribution across the batch. (𝜀 is a constant used for numerical stability)
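For reference, here is the standard Batch Norm formulation behind these equation numbers, assuming b is the batch size and Z^(i) the activation vector of the i-th example:

$$\mu = \frac{1}{b}\sum_{i=1}^{b} Z^{(i)} \qquad (1)$$

$$\sigma^{2} = \frac{1}{b}\sum_{i=1}^{b} \left(Z^{(i)} - \mu\right)^{2} \qquad (2)$$

$$\hat{Z}^{(i)} = \frac{Z^{(i)} - \mu}{\sqrt{\sigma^{2} + \varepsilon}} \qquad (3)$$

followed by the re-scaling and re-shifting with the learnable parameters:

$$\tilde{Z}^{(i)} = \gamma \odot \hat{Z}^{(i)} + \beta$$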
Batch Normalization first step. Example of a 3-neuron hidden layer, with a batch of size b.
Each neuron follows a standard normal distribution.
𝛾 allows adjusting the standard deviation of the distribution; 𝛽 allows adjusting the bias,
shifting the curve to the right or to the left.
Benefits of the 𝛾 and 𝛽 parameters. Modifying the distribution (top) allows us to use
different regimes of the nonlinear functions (bottom).
Remark: The reasons explaining the effectiveness of BN layers are subject to misunderstanding and
errors (even in the original article). A recent paper [2] disproved some erroneous hypotheses,
improving the community’s understanding of this method. We will discuss that matter in section C.3.
During training, Batch Norm computes the statistics (𝜇, σ) on the current batch. It then trains 𝛾 and 𝛽
through gradient descent, using an Exponential Moving Average (EMA).
B.1.2) Evaluation
Unlike the training phase, we may not have a full batch to feed into the model during the
evaluation phase.
To tackle this issue, we compute (𝜇_pop, σ_pop): the estimated mean and standard deviation of the
studied population. Those values are estimated from the batch statistics (𝜇, σ) observed during
training, and are directly fed into equation (3) during evaluation (instead of calling (1) and (2)).
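As a minimal sketch of one common way to estimate those population statistics, here is a NumPy example that blends each batch’s statistics into running estimates with an exponential moving average; the momentum value and the toy activations are illustrative choices.

import numpy as np

momentum = 0.99                 # illustrative; frameworks expose this as a hyperparameter
mu_pop = np.zeros(3)            # running estimates, one per neuron
var_pop = np.ones(3)

# During training: after computing each batch's statistics, blend them in
for _ in range(200):
    z_batch = np.random.randn(32, 3) * 2.0 + 1.0      # stand-in for real activations
    mu_pop = momentum * mu_pop + (1 - momentum) * z_batch.mean(axis=0)
    var_pop = momentum * var_pop + (1 - momentum) * z_batch.var(axis=0)

# During evaluation: normalize even a single example with the population statistics
z = np.random.randn(1, 3) * 2.0 + 1.0
z_hat = (z - mu_pop) / np.sqrt(var_pop + 1e-5)        # equation (3), using (mu_pop, var_pop)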
Remark: We will discuss that matter in depth in section C.2.3: “Normalization during
evaluation”.
B.2) In practice
Each of the popular frameworks already has an implemented Batch Normalization layer. For
example, Keras/TensorFlow provides BatchNormalization, and PyTorch provides BatchNorm1d and BatchNorm2d.
All of the BN implementations allow you to set each parameter independently. However, the most
important one is the size of the input to normalize, which corresponds to:
- how many neurons are in the current hidden layer (for MLPs);
- how many filters are in the current hidden layer (for convolutional networks).
Take a look at the online docs of your favorite framework and read the BN
layer page: is there anything specific to their implementation?
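For instance, here is a minimal Keras sketch of the parameters you will typically find on that page; the values shown are the library defaults at the time of writing, so check them against your framework’s documentation.

from tensorflow.keras.layers import BatchNormalization

# axis: which axis holds the features/channels to normalize (-1 = last axis,
#       i.e. the neurons of a Dense layer or the filters of a channels-last Conv2D)
# momentum: controls the moving average used for the population statistics
# epsilon: small constant added to the variance for numerical stability
bn = BatchNormalization(axis=-1, momentum=0.99, epsilon=1e-3)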
To get a first insight, let’s take a look at the official article’s results [1]:
Figure 1: How BN affects training. Accuracy on the ImageNet (2012) validation set, w.r.t. the
number of training iterations. Five networks are compared: “Inception” is the vanilla Inception network
[3], “BN-X” are Inception networks with BN layers (for 3 different learning rates: x1, x5, x30 the
Inception optimum one), and “BN-X-Sigmoid” is an Inception network with BN layers where all ReLU
nonlinearities are replaced by sigmoids. | Source: [1]
You can easily reproduce those results without a GPU; it’s a great way to
get more familiar with this concept!
Note that this example does not show all the benefits of Batch
Normalization.
From these results, two things stand out:
- Adding BN layers leads to faster and better convergence (where better means higher
accuracy). On such a large dataset, the improvement is much more significant than what is
observed on a smaller dataset.
- Adding BN layers allows us to use higher learning rates (LR) without compromising
convergence. The authors successfully trained their Inception-with-BN network with a learning
rate 30 times higher than the original one; notice that a learning rate only 5 times larger already
makes the vanilla network diverge! That way, it is much easier to find an “acceptable” learning
rate, since the interval of LR values that lead to convergence is wider. Also, a higher learning rate
helps the optimizer avoid poor local minima: encouraged by the larger steps, the optimization can
escape suboptimal solutions and converge toward better ones.
We need to take a step back and look at the bigger picture. We can clearly see that we get
slightly better performance with ReLU-based models than with sigmoid ones, but that’s not what
matters here.
To show why this result is meaningful, let me rephrase what Ian Goodfellow (inventor of GANs)
said about it:
Before BN, we thought that it was almost impossible to efficiently train deep models using
sigmoid in the hidden layers. We considered several approaches to tackle training instability,
such as looking for better initialization methods. Those partial solutions were heavily heuristic,
and way too fragile to be satisfactory. Batch Normalization makes those unstable networks trainable.
Now we understand why BN had such an important impact on the deep learning field.
However, there are some side effects you should keep in mind to get the most out of BN.
C.2.2) Regularization, a BN side effect
BN relies on the batch’s first and second statistical moments (mean and variance) to normalize hidden
layer activations. The output values are therefore strongly tied to the current batch statistics. Such a
transformation adds some noise, depending on which input examples are used in the current batch.
Adding some noise to avoid overfitting… sounds like a regularization process, doesn’t it? ;)
However, it is best not to rely on this side effect to fight overfitting, because of orthogonality
matters. Put simply, we should always make sure that one module addresses one issue; splitting
different problems across dedicated modules keeps the development process easier to manage.
Still, it is interesting to be aware of this regularization effect, as it could explain some unexpected
behavior in your experiments.
While BN uses the current batch to normalize every single value, LN (Layer Normalization) uses the
whole current layer to do so. In other words, the normalization is performed using the other features
of a single example instead of the same feature across all the examples of the current batch. This
solution seems more efficient for recurrent networks. Note that it is quite difficult to define a
consistent batch-wise strategy for those kinds of neurons, as they rely on repeated multiplications by
the same weight matrix. Should we normalize each step independently? Or should we compute the
mean across all steps, and then apply the normalization recursively? (source of the intuition argument: here)
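To make the difference concrete, here is a minimal NumPy sketch contrasting the two reduction axes on a batch of feature vectors; the shapes and values are illustrative.

import numpy as np

eps = 1e-5
X = np.random.randn(8, 4)    # a batch of 8 examples, 4 features each

# Batch Norm: statistics per feature, computed across the batch (axis 0)
bn = (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0) + eps)

# Layer Norm: statistics per example, computed across its own features (axis 1)
ln = (X - X.mean(axis=1, keepdims=True)) / np.sqrt(X.var(axis=1, keepdims=True) + eps)

# bn depends on which other examples happen to be in the batch; ln does not.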
I will not go into further detail on that matter, as it’s not the purpose of this article.
C.2.6) Before or after the nonlinearity ?
Historically, the BN layer is positioned right before the nonlinear function, which was consistent
with the hypotheses and goals of the original paper [1].
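As a minimal Keras sketch of the two placements (the layer sizes are arbitrary), the first stack applies BN to the pre-activation, the second applies it after the nonlinearity:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Activation

# BN before the nonlinearity (the historical placement)
bn_before = Sequential([
    Dense(64, input_shape=(10,)),   # linear part only, no activation here
    BatchNormalization(),
    Activation('relu'),
])

# BN after the nonlinearity (the alternative placement)
bn_after = Sequential([
    Dense(64, input_shape=(10,), activation='relu'),
    BatchNormalization(),
])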
Before diving into the discussion, here is what we’ll see:
- The original paper [1] assumes that BN’s effectiveness is due to the reduction of what the
authors call the internal covariate shift (ICS). A recent paper [2] refutes this hypothesis. (see C.3.1)
- Another hypothesis has replaced the first one, with much more caution: BN mitigates the
interdependency between layers during training.
- A recent paper from MIT [2] stresses the impact of BN on the smoothness of the optimization landscape.
I bet that exploring those hypotheses will help you to build a strong intuition about Batch
Normalization.
C.3.1) Hypothesis n°1 — BN reduces the internal covariate shift (ICS)
Despite its fundamental impact on DNN performances, Batch Normalization is still subject to
misunderstanding.
Internal covariate shift deteriorates training: the original paper’s hypothesis
Diagram 7 : Internal covariate shift (ICS) principle in the distributional stability perspective
(ICS_distrib). | Credit : author - Design : Lou HD
In our car classifier, we can see hidden layers as units which are activated when they identify
some “conceptual” features associated with cars: it could be a wheel, a tire, or a door. We can
suppose that the previously described effect happens inside those hidden units as well. A wheel with a
certain orientation angle will activate neurons with a specific distribution. Ideally, we want
some neurons to react with a comparable distribution for any wheel orientation, so that
the model can efficiently conclude on the probability that the input image contains a car.
If there is a huge covariate shift in the input signal, the optimizer will have trouble generalizing
well. On the contrary, if the input signal always follows a standard normal distribution, the
optimizer will generalize more easily. With this in mind, the authors of [1] applied this strategy
to the signal inside the hidden layers. They assumed that forcing the intermediate signal
distributions to (𝜇 = 0, σ = 1) would help the network generalize at the “conceptual”
levels of features.
However, we do not always want a standard normal distribution in the hidden units: it would reduce
the representational power of the model, which is precisely why Batch Norm introduces the learnable
parameters 𝛾 and 𝛽.