Batch Normalization

Batch Normalization (BN):

 Batch Normalization is a technique used to normalize the activations of a layer
within a mini-batch during training.
 It is commonly applied to intermediate layers of a neural network to help
stabilize training, reduce internal covariate shift, and improve the convergence
of the model.
 BN normalizes the activations by subtracting the mini-batch mean and dividing
by the mini-batch standard deviation, then applies two learnable parameters
(gamma and beta) that let the model rescale and shift the normalized values.
 In essence, BN acts on the batch dimension, hence the name “Batch
Normalization.”

1. Introduction
Training Deep Neural Networks is a difficult task that involves several
problems to tackle. Despite their huge potential, they can be slow to train and
prone to overfitting. Thus, studies on methods to solve these problems are a
constant in Deep Learning research.
Batch Normalization – commonly abbreviated as Batch Norm – is one of these
methods. Currently, it is a widely used technique in the field of Deep Learning.
It improves the learning speed of Neural Networks and provides
regularization, avoiding overfitting.
But why is it so important? How does it work? Furthermore, how can it be
applied to non-regular networks such as Convolutional Neural Networks?

2. Normalization
To fully understand how Batch Norm works and why it is important, let’s start
by talking about normalization.
Normalization is a pre-processing technique used to standardize data; in
other words, to bring data from different sources into the same range. Not
normalizing the data before training can cause problems in our network,
making it drastically harder to train and decreasing its learning speed.
For example, imagine we have a car rental service and we want to predict
a fair price for each car based on competitors’ data. We have two features per
car: its age in years and the total number of kilometers it has been driven.
These can have very different ranges: age goes from 0 to 30 years, while
distance can go from 0 up to hundreds of thousands of kilometers. We don’t
want such differences in range, as the feature with the larger range might bias
our model into giving it inflated importance.
There are two main methods to normalize our data. The most straightforward
method is min-max scaling, which rescales the data to a range from 0 to 1:

x_normalized = (x - x_min) / (x_max - x_min)

x being the data point to normalize, x_max the highest value in the data set,
and x_min the lowest value. This technique is generally applied to the
inputs of the network. Non-normalized data points with wide ranges can
cause instability in Neural Networks: relatively large inputs cascade
down through the layers, causing problems such as exploding gradients.
The other technique used to normalize data is forcing the data points to have
a mean of 0 and a standard deviation of 1, using the following formula:

x_standardized = (x - m) / s

x being the data point to normalize, m the mean of the data set, and s the
standard deviation of the data set. Now, each data point mimics a standard
normal distribution. Having all the features on this scale, none of them will
have a bias, and therefore, our models will learn better.

In Batch Norm, we use this last technique to normalize batches of data inside
the network itself.
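
As a quick illustration of these two techniques, here is a small NumPy sketch (the feature values are made up for the car rental example above):

import numpy as np

# Hypothetical car-rental features: age in years and kilometers driven.
age = np.array([1.0, 4.0, 9.0, 15.0, 30.0])
km = np.array([12_000.0, 45_000.0, 90_000.0, 160_000.0, 310_000.0])

def min_max_scale(x):
    # Rescales a feature so that its values lie between 0 and 1.
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    # Forces a feature to have mean 0 and standard deviation 1.
    return (x - x.mean()) / x.std()

print(min_max_scale(km))   # every value now lies in [0, 1]
print(standardize(age))    # mean ~0, standard deviation ~1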

3. Batch Normalization
Batch Norm is a normalization technique applied between the layers of a
Neural Network instead of to the raw data. It is performed along mini-batches
instead of the full data set. It speeds up training and allows the use of higher
learning rates, making learning easier.
Following the technique explained in the previous section, we can define the
normalization formula of Batch Norm as:

z_normalized = (z - m_z) / s_z

m_z being the mean of the neurons’ output and s_z the standard deviation of the
neurons’ output.

3.1. How Is It Applied?


In the following image, we can see a regular feed-forward Neural Network:
x are the inputs, z the outputs of the neurons, a the outputs of the activation
functions, and ŷ the output of the network.

Batch Norm – represented in the image with a red line – is applied to the
neurons’ output just before applying the activation function. Usually, a neuron
without Batch Norm is computed as follows:

z = g(w, x) + b ;  a = f(z)

g being the linear transformation of the neuron, w the weights of the
neuron, b the bias of the neuron, and f the activation function. The model
learns the parameters w and b. Adding Batch Norm, it looks like:

z = g(w, x) ;  z_N = ((z - m_z) / s_z) * γ + β ;  a = f(z_N)

z_N being the output of Batch Norm, m_z the mean of the neurons’ output, s_z the
standard deviation of the neurons’ output, and γ and β the learning parameters
of Batch Norm. Note that the bias of the neuron (b) is removed. This is
because, as we subtract the mean m_z, any constant added to the values of z –
such as b – cancels out: it is subtracted back with the mean.
The parameters β and γ shift the mean and the standard deviation,
respectively. Thus, the outputs of Batch Norm over a layer form a
distribution with a mean of β and a standard deviation of γ. These values are
learned over the epochs, together with the other learning parameters (such as
the weights of the neurons), with the aim of decreasing the loss of the model.
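
To make the formulas above concrete, here is a minimal NumPy sketch of Batch Norm applied to the pre-activations of a single dense layer (illustrative only, not how any framework implements it; the layer sizes are arbitrary). Note how adding a constant bias b leaves the normalized output unchanged, which is why the bias can be dropped:

import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(32, 5))     # mini-batch of 32 examples with 5 inputs each
W = rng.normal(size=(5, 16))     # weights of a 16-neuron layer
b = rng.normal(size=(16,))       # per-neuron bias
gamma = np.ones(16)              # learnable scale, initialized to 1
beta = np.zeros(16)              # learnable shift, initialized to 0

def batch_norm(z, gamma, beta, eps=1e-5):
    m_z = z.mean(axis=0)                     # mean of each neuron's output over the batch
    s_z = np.sqrt(z.var(axis=0) + eps)       # standard deviation over the batch
    return gamma * (z - m_z) / s_z + beta    # normalize, then rescale and shift

z_with_bias = X @ W + b
z_without_bias = X @ W

# The constant bias is cancelled by the mean subtraction:
print(np.allclose(batch_norm(z_with_bias, gamma, beta),
                  batch_norm(z_without_bias, gamma, beta)))   # True

a = np.maximum(0.0, batch_norm(z_without_bias, gamma, beta))  # ReLU applied after Batch Norm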

3.2. Implementation in Python


Implementing Batch Norm is quite straightforward when using modern
Machine Learning frameworks such as Keras, TensorFlow, or PyTorch. They
come with the most commonly used methods and are generally kept up to date
with the state of the art.
With Keras, we can implement a really simple feed-forward Neural Network
with Batch Norm as:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential([
    Dense(16, input_shape=(1, 5), activation='relu'),
    BatchNormalization(),   # normalizes the previous layer's activations over the batch
    Dense(32, activation='relu'),
    BatchNormalization(),
    Dense(2, activation='softmax')
])
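
For comparison, here is a rough PyTorch counterpart of the Keras model above (a sketch only; it assumes a flat input of 5 features, and torch.nn.BatchNorm1d plays the role of Keras’ BatchNormalization):

import torch.nn as nn

# Sketch of an equivalent feed-forward network in PyTorch.
model = nn.Sequential(
    nn.Linear(5, 16),
    nn.ReLU(),
    nn.BatchNorm1d(16),   # normalizes the 16 activations across the batch
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.BatchNorm1d(32),
    nn.Linear(32, 2),
    nn.Softmax(dim=1),
)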

3.3. Why Does Batch Normalization Work?


Now that we know how to apply and implement Batch Norm, why does it
work? How can it speed up training and make learning easier?
There are several reasons why Batch Norm is believed to have these effects.
Here we will present the intuitions behind the most important ones.
Firstly, we can see how normalizing the inputs to take on a similar range of
values can speed up learning. One simple intuition is that Batch Norm is doing
a similar thing with the values in the layers of the network, not only in the
inputs.
Secondly, in the original paper, Ioffe and Szegedy claim that Batch Norm reduces
the internal covariate shift of the network. The covariate shift is a change
in the data distribution. For example, going back to our car rental
service, imagine we now want to include motorbikes as well. If we only look at our
previous data set, which contains only cars, our model will likely fail to predict
motorbikes’ prices. This change in the data (now containing motorbikes) is called
covariate shift, and it is gaining attention as it is a common issue in real-world
problems.
The internal covariate shift is a change in the input distribution of an internal
layer of a Neural Network. For the neurons in an internal layer, the inputs
received from the previous layer are constantly changing. This is due to the
many computations done before that layer and to its incoming weights changing
over the training process.
Applying Batch Norm ensures that the mean and standard deviation of the
layer inputs always remain the same: β and γ, respectively. Thus, the
amount of change in the distribution of the inputs of layers is reduced. The
deeper layers have more stable ground regarding what their input values are
going to be, which helps during the learning process.
Lastly, it seems that Batch Norm has a regularization effect. Because it is
computed over mini-batches and not the entire data set, the data distribution
the model sees each time has some noise. This can act as a regularizer,
helping to overcome overfitting and to learn better. However, the noise
added is quite small, so it is generally not enough to regularize properly on
its own and is normally used along with Dropout.

4. Batch Normalization in Convolutional Neural Networks
Batch Norm works in a very similar way in Convolutional Neural Networks.
Although we could do it in the same way as before, we have to follow the
convolutional property.
In convolutions, we have shared filters that slide along the feature maps of the
input (in images, a feature map generally spans the height and the width). These
filters are the same at every position of the feature map. It is then reasonable to
normalize the output in the same way, sharing the normalization over the feature
maps.
In other words, this means that the parameters used to normalize are
calculated along each entire feature map. In a regular Batch Norm, each
feature would have a different mean and standard deviation; here, each
feature map has a single mean and standard deviation, used for all
the positions it contains.
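
To see what this sharing means in practice, here is a small NumPy sketch (not framework code; the shapes are arbitrary) that computes one mean and one standard deviation per feature map for a batch of images in channels-last layout:

import numpy as np

rng = np.random.default_rng(0)

# A batch of 8 feature maps of size 28x28 with 32 channels (channels-last).
feature_maps = rng.normal(size=(8, 28, 28, 32))

# For convolutions, the statistics are shared over the spatial dimensions,
# so the mean and variance are computed over the batch, height, and width axes:
mean = feature_maps.mean(axis=(0, 1, 2))   # shape (32,): one value per feature map
var = feature_maps.var(axis=(0, 1, 2))     # shape (32,)

eps = 1e-5
normalized = (feature_maps - mean) / np.sqrt(var + eps)   # broadcast over each channel
print(normalized.shape)   # (8, 28, 28, 32)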
4.1. Implementation in Python
Again, implementing Batch Norm in Convolutional Neural Networks is really
easy using modern frameworks. They perform the convolutional variant of the
operation in the background, using the same layer we used before:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Dense, BatchNormalization, Conv2D,
                                     MaxPooling2D, Flatten)

model = Sequential([
    Conv2D(32, (3, 3), input_shape=(28, 28, 3), activation='relu'),
    BatchNormalization(),   # one mean/std per feature map (channel)
    Conv2D(32, (3, 3), activation='relu'),
    BatchNormalization(),
    MaxPooling2D(),
    Flatten(),              # flatten the feature maps before the classifier head
    Dense(2, activation='softmax')
])

Source: https://towardsdatascience.com/batch-normalization-in-3-levels-of-understanding-14c2da90a338

Batch Normalization (BN) is an algorithmic method which makes the training of
Deep Neural Networks (DNN) faster and more stable. It consists of normalizing
the activation vectors of hidden layers using the first and the second
statistical moments (mean and variance) of the current batch. This
normalization step is applied right before (or right after) the nonlinear function.
Figures: a Multilayer Perceptron (MLP) without batch normalization (BN), and the same MLP with batch normalization (BN). | Credit: author - Design: Lou HD

All the current deep learning frameworks have already implemented methods
which apply batch normalization. It is usually used as a module which can be
inserted as a standard layer in a DNN.

B.1) Principle

Batch normalization is computed differently during the training and the testing phases.

B.1.1) Training

At each hidden layer, Batch Normalization transforms the signal as follows:

The BN layer first determines the mean 𝜇 and the variance σ² of the
activation values across the batch, using (1) and (2):

𝜇 = (1/b) * Σ_i Z^(i)              (1)
σ² = (1/b) * Σ_i (Z^(i) - 𝜇)²      (2)

It then normalizes the activation vectors Z^(i) with (3). That way,
each neuron’s output follows a standard normal distribution across the
batch (𝜀 is a small constant used for numerical stability):

Z^(i)_norm = (Z^(i) - 𝜇) / √(σ² + 𝜀)     (3)

Batch Normalization, first step. Example of a 3-neuron hidden layer, with a batch of size b. Each neuron now follows a standard normal distribution across the batch. |
It finally calculates the layer’s output Ẑ^(i) by applying a linear
transformation with 𝛾 and 𝛽, two trainable parameters (4):

Ẑ^(i) = 𝛾 * Z^(i)_norm + 𝛽     (4)

This step allows the model to choose the optimum distribution for each hidden
layer, by adjusting those two parameters:

 𝛾 allows us to adjust the standard deviation;
 𝛽 allows us to adjust the bias, shifting the curve to the right or to the left.

Benefits of the 𝛾 and 𝛽 parameters. Modifying the distribution (top) allows us to use
different regimes of the nonlinear function (bottom). |

Remark: The reasons explaining the effectiveness of BN layers are subject to misunderstanding and errors (even in the original article). A recent paper [2] disproved some erroneous hypotheses, improving the community’s understanding of this method. We will discuss that matter in section C.3: “Why does BN work?”.


At each iteration, the network computes the mean 𝜇 and the standard deviation σ corresponding to the current batch, and trains 𝛾 and 𝛽 through gradient descent. It also keeps an Exponential Moving Average (EMA) of the batch statistics, giving more importance to the latest iterations; these averages will be reused during evaluation (see B.1.2).
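
A minimal NumPy sketch of one training-time pass may help here; it assumes, as most frameworks do, that the population statistics are tracked with an EMA of the batch statistics (the function name and the momentum value are illustrative):

import numpy as np

def bn_train_step(Z, gamma, beta, running_mean, running_var,
                  momentum=0.9, eps=1e-5):
    # Z has shape (batch_size, num_neurons): the pre-activations of one hidden layer.
    mu = Z.mean(axis=0)                         # (1) batch mean
    var = Z.var(axis=0)                         # (2) batch variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)      # (3) standardized activations
    out = gamma * Z_norm + beta                 # (4) trainable rescale and shift

    # Exponential moving averages of the batch statistics, giving more weight
    # to the latest iterations; they estimate (mu_pop, sigma_pop) for evaluation.
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return out, running_mean, running_var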

B.1.2) Evaluation

Unlike the training phase, we may not have a full batch to feed into the model during the
evaluation phase.
To tackle this issue, we compute (𝜇_pop , σ_pop), defined as:

 𝜇_pop : estimated mean of the studied population ;

 σ_pop : estimated standard-deviation of the studied population.


Those values are computed using all the (𝜇_batch , σ_batch) pairs determined during training, and are directly fed into equation (3) during evaluation (instead of computing (1) and (2) on the current batch).

Remark: We will discuss that matter in depth in section C.2.3: “Normalization during evaluation”.
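
Continuing the sketch above, evaluation simply reuses the tracked population estimates instead of recomputing the statistics from the current (possibly single-example) batch:

import numpy as np

def bn_eval(Z, gamma, beta, running_mean, running_var, eps=1e-5):
    # Equation (3) applied with the population estimates gathered during training.
    Z_norm = (Z - running_mean) / np.sqrt(running_var + eps)
    return gamma * Z_norm + beta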
B.2) In practice

In practice, we consider batch normalization as a standard layer, just like a perceptron, a convolutional layer, an activation function, or a dropout layer.

Each of the popular frameworks already has an implemented Batch Normalization layer. For example:

Pytorch : torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d

Tensorflow / Keras : tf.nn.batch_normalization, tf.keras.layers.BatchNormalization

All of the BN implementations allow you to set each parameter independently. However, the input vector size is the most important one. It should be set to:

 the number of neurons in the current hidden layer (for MLPs);

 the number of filters in the current hidden layer (for convolutional networks).

Take a look at the online docs of your favorite framework and read the BN
layer page : is there anything specific to their implementation ?
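
For instance, with the PyTorch layers mentioned above (a sketch; the layer sizes are arbitrary), the size parameter matches the number of neurons of the hidden layer in an MLP and the number of filters in a convolutional network:

import torch.nn as nn

# MLP: the BN layer size matches the number of neurons of the previous layer.
fc = nn.Linear(128, 100)
bn_fc = nn.BatchNorm1d(100)    # 100 = neurons in the current hidden layer

# CNN: the BN layer size matches the number of filters (channels).
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
bn_conv = nn.BatchNorm2d(64)   # 64 = filters in the current hidden layer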

B.3) Overview of results

Even if we don’t understand all the underlying mechanisms of Batch Normalization yet (see C.3), there’s something everyone agrees on: it works.

To get a first insight, let’s take a look at the official article’s results [1]:
Figure 1: How BN affects training. Accuracy on the ImageNet (2012) validation set, w.r.t. the
number of trained iterations. Five networks are compared: “Inception” is the vanilla Inception network
[3], “BN-X” are Inception networks with BN layers (for 3 different learning rates: x1, x5, x30 the
Inception optimum one), “BN-X-Sigmoid” is an Inception network with BN layers, where all ReLU
nonlinearities are replaced by sigmoid. | Source: [1]

The results are clear-cut: BN layers make the training faster, and allow a wider range of learning rates without compromising the training convergence.

Remark: At this point, you should know enough about BN to follow the experiments below.
At first, they trained a classifier on the MNIST dataset (handwritten digits).
The model consists of 3 fully connected layers of 100 neurons each, all
activated by the sigmoid function. They trained this model twice (with and
without BN layers) for 50,000 iterations, using stochastic gradient descent
(SGD) and the same learning rate (0.01). Notice that BN layers were put right
before the activation function.

You can easily reproduce those results without a GPU; it’s a great way to get more familiar with this concept!
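
As a starting point for such a reproduction, here is a Keras sketch: it follows the setup described above (3 fully connected layers of 100 sigmoid neurons, SGD with a learning rate of 0.01, BN right before the activation), while the batch size and the number of epochs are my own choices rather than the article’s:

import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

# MNIST digits, flattened to 784 features and scaled to [0, 1].
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

def build_model(use_bn):
    model = models.Sequential([layers.Input(shape=(784,))])
    for _ in range(3):
        model.add(layers.Dense(100))
        if use_bn:
            model.add(layers.BatchNormalization())   # BN right before the activation
        model.add(layers.Activation("sigmoid"))
    model.add(layers.Dense(10, activation="softmax"))
    model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Train both variants and compare their accuracy and loss curves.
for use_bn in (False, True):
    model = build_model(use_bn)
    model.fit(x_train, y_train, batch_size=60, epochs=3, verbose=2)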

Figure 2: How BN affects the training of a simple Multi-Layer Perceptron (MLP) | Left: training accuracy w.r.t. iterations | Right: training loss w.r.t. iterations | Credit: author

Looks good! Batch Normalization improves our network’s performance, regarding both the loss and the accuracy.

The 2nd experiment consists of taking a look at the activation values in the hidden layers. Here are the plots corresponding to the last hidden layer (right before the nonlinearity):
Figure 3 : Batch Normalization impact on the activation values of the last
hidden layer | Credit : author

Without Batch Normalization, the activation values fluctuate significantly during the first iterations. On the contrary, the activation curves are smoother when BN is used.

Figure 4: Batch Normalization impact on hidden layer activations | The model with BN has a smoother activation curve than the one without BN | Credit: author

Also, the signal is less noisy when adding BN layers. It seems to make the convergence of the model easier.

This example does not show all the benefits of the Batch
Normalization.

The official article carried out a third experiment. They wanted to compare
the model’s performance when adding BN layers on a larger dataset: ImageNet
(2012). To do so, they trained a powerful neural network (at that time) called
Inception [3]. Originally, this network doesn’t have any BN layers. They added
some and trained the model with different learning rates (x1, x5, x30 the
former optimum). They also tried to replace every ReLU activation function with
sigmoid in another network. Then, they compared the performances of those new
networks with the original one.

Here is what they got:

Figure 5: Batch normalization impact on training (ImageNet) | “Inception”: original network [3]; “BN-Baseline”: same as Inception with BN, same learning rate (LR); “BN-x5”: same as Inception with BN, LR x5; “BN-x30”: same as Inception with BN, LR x30; “BN-x5-Sigmoid”: same as Inception with BN, LR x5 and sigmoid instead of ReLU | Source: [1]

What we can conclude from those curves:

 Adding BN layers leads to faster and better convergence (where better means higher accuracy).

On such a large dataset, the improvement is much more significant than the one observed on the small MNIST dataset.

 Adding BN layers allows us to use higher learning rates (LR) without compromising convergence.

The authors successfully trained their Inception-with-BN network using a 30 times higher learning rate than the original one. Notice that a learning rate merely 5 times larger already makes the vanilla network diverge!

That way, it is much easier to find an “acceptable” learning rate: the interval of LR which lies between underfitting and gradient explosion is much larger.

Also, a higher learning rate helps the optimizer avoid converging on local minima. Encouraged to explore, the optimizer will more easily converge on better solutions.

 The sigmoid-based model reached results competitive with ReLU-based models.

We need to take a step back and look at the bigger picture. We can clearly see that we get slightly better performance with ReLU-based models than with sigmoid ones, but that’s not what matters here.

To show why this result is meaningful, let me rephrase what Ian Goodfellow (inventor of GANs [6], author of the famous “Deep Learning” handbook) said about it:

Before BN, we thought that it was almost impossible to efficiently train deep models using sigmoid in the hidden layers. We considered several approaches to tackle training instability, such as looking for better initialization methods. Those pieces of solution were heavily heuristic, and way too fragile to be satisfactory. Batch Normalization makes those unstable networks trainable; that’s what this example shows.

— Ian Goodfellow (rephrased from: source)

Now we understand why BN had such an important impact on the deep learning field.

Those results give an overview of Batch Normalization’s benefits on network performance. However, there are some side effects you should keep in mind to get the most out of BN.
C.2.2) Regularization, a BN side effect

BN relies on the first and second statistical moments (mean and variance) of the batch to normalize hidden layer activations. The output values are therefore strongly tied to the current batch statistics. Such a transformation adds some noise, depending on the input examples used in the current batch.

Adding some noise to avoid overfitting... sounds like a regularization process, doesn’t it? ;)

In practice, we shouldn’t rely on batch normalization to avoid overfitting, because of orthogonality concerns. Put simply, we should always make sure that one module addresses one issue; relying on one module to deal with several different problems makes the development process much more difficult than needed.

Still, it is interesting to be aware of this regularization effect, as it can explain some unexpected behavior of a network (especially during sanity checks).


C.2.5) Recurrent network and Layer normalization

In practice, it is widely admitted that:

 For convolutional networks (CNN): Batch Normalization (BN) is better;

 For recurrent networks (RNN): Layer Normalization (LN) is better.

While BN uses the current batch to normalize every single value, LN uses the whole current layer to do so. In other words, the normalization is performed using other features of a single example instead of using the same feature across all the examples of the current batch. This solution seems more efficient for recurrent networks. Note that it is quite difficult to define a consistent normalization strategy for those kinds of neurons, as they rely on multiplying the same weight matrix several times. Should we normalize each step independently? Or should we compute the mean across all steps and then apply the normalization recursively? (source of the intuition argument: here)

I will not go into further detail on that matter, as it’s not the purpose of this article.
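
Still, a tiny NumPy sketch (omitting the trainable scale and shift) is enough to contrast the two normalization axes:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 10))   # 8 examples in the batch, 10 features each
eps = 1e-5

# Batch Normalization: statistics per feature, computed across the batch.
bn = (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0) + eps)

# Layer Normalization: statistics per example, computed across its features.
ln = (X - X.mean(axis=1, keepdims=True)) / np.sqrt(X.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0).round(6))   # ~0 for every feature (column)
print(ln.mean(axis=1).round(6))   # ~0 for every example (row)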
C.2.6) Before or after the nonlinearity ?

Historically, the BN layer is positioned right before the nonlinear function, which is consistent with the hypotheses of the original paper [1].
Before diving into the discussion, here is what we’ll see:

 The original paper [1] assumes that BN’s effectiveness is due to the reduction of what its authors call the internal covariate shift (ICS). A recent paper [2] refutes this hypothesis (see C.3.1).

 Another hypothesis has replaced the first one, stated with much more caution: BN mitigates the interdependency between layers during training (see C.3.2).

 A recent paper from MIT [2] stresses the impact of BN on the smoothness of the optimization landscape, which makes the training easier (see C.3.3).

I bet that exploring those hypotheses will help you build a strong intuition about Batch Normalization.
Extra
C.3.1) Hypothesis n°1 — BN reduces the internal covariate shift (ICS)

Despite its fundamental impact on DNN performance, Batch Normalization is still subject to misunderstanding.

Internal covariate shift deteriorates training: the original paper’s hypothesis

Diagram 7: Internal covariate shift (ICS) principle from the distributional-stability perspective (ICS_distrib). | Credit: author - Design: Lou HD

In our car classifier, we can see hidden layers as units which are activated when they identify some “conceptual” features associated with cars: it could be a wheel, a tire, or a door. We can suppose that the previously described effect could happen inside hidden units. A wheel with a certain orientation angle will activate neurons with a specific distribution. Ideally, we want some neurons to react with a comparable distribution for any wheel orientation, so the model can efficiently conclude on the probability that the input image contains a car.

If there is a huge covariate shift in the input signal, the optimizer will have trouble generalizing well. On the contrary, if the input signal always follows a standard normal distribution, the optimizer will generalize more easily. With this in mind, the authors of [1] applied the strategy of normalizing the signal in the hidden layers. They assumed that forcing the intermediate signal distributions to (𝜇 = 0, σ = 1) would help the network generalize at the “conceptual” levels of features.

Though, we do not always want a standard normal distribution in the hidden units. It would reduce the model’s representativity:

Diagram 8: Why we don’t always want a standard normal distribution in the hidden units. Here, the sigmoid function only works in its linear regime. | Credit: author - Design: Lou HD

The original article takes the sigmoid function as an example to show why normalization alone is an issue. If the standardized values are concentrated in a small region around 0, the nonlinear function would only work... in its quasi-linear regime. Sounds problematic.

To tackle this issue, they added two trainable parameters 𝛽 and 𝛾, allowing the optimizer to define the optimum mean (using 𝛽) and standard deviation (using 𝛾) for a specific task.
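
A small NumPy sketch of that intuition (the 𝛾 and 𝛽 values are arbitrary): with 𝛾 = 1 and 𝛽 = 0 a standardized signal stays almost entirely in the near-linear part of the sigmoid, whereas a larger 𝛾 and a nonzero 𝛽 push a sizeable fraction of the values into its saturated regions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)                # standardized pre-activations (mean 0, std 1)

linear_regime = sigmoid(1.0 * z + 0.0)     # gamma = 1, beta = 0
saturated_regime = sigmoid(4.0 * z + 2.0)  # larger gamma, shifted by beta

# Fraction of outputs close to 0 or 1 (i.e. in the saturated regions):
print((np.abs(linear_regime - 0.5) > 0.45).mean())     # ~0.003
print((np.abs(saturated_regime - 0.5) > 0.45).mean())  # ~0.5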
