CNN Course-Notes 365
Example kernels:
Blur: $\frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$, Edge detection: $\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$, Sharpen: $\begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix}$
Convolution
The kernels are applied to the image through the mathematical operation of convolution.
(Okay, actually, this is cross-correlation, but convolution is closely related, and the two are
conflated often enough)
In practice, convolution is the operation of calculating the new value of each pixel in the
image, by “sliding” the kernel over it.
Convolution equation
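In a standard form (the exact indexing convention here is an assumption), the cross-correlation at each output pixel is:

$$ O(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} K(i, j)\, I(x + i,\ y + j) $$

where $I$ is the input image, $K$ is a $(2k+1) \times (2k+1)$ kernel and $O$ is the transformed output.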
Edge handling
There is ambiguity about what to do when the kernel “sticks out” of the image near the edges.
In this case, there are a couple of solutions:
A simple, out-of-the-box solution is to ignore the pixels for which the kernel sticks out. Essentially, that trims the border of the image. This is not a big deal when dealing with big images: with a 256x256 image and a 5x5 kernel, the result would be a 252x252 transformed image (see the sketch below).
Another common solution is to pad the border of the image (for example with zeros) so that the output keeps the original size.
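Here is a minimal sketch of the “trim the border” behaviour (using SciPy is an assumption, not necessarily the course's tooling):

```python
import numpy as np
from scipy.signal import correlate2d

image = np.random.rand(256, 256)        # a random stand-in for a 256x256 image
kernel = np.ones((5, 5)) / 25.0         # a 5x5 blur kernel

# mode="valid" keeps only the positions where the kernel fits entirely
# inside the image, i.e. it trims the border as described above.
result = correlate2d(image, kernel, mode="valid")
print(result.shape)                     # (252, 252)
```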
From convolution to CNN
In a CNN, however, we don’t manually set the kernel matrices. We let the network learn which kernels do the job best.
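For instance, here is a minimal tf.keras sketch (the framework and the layer sizes are my assumptions); the kernels are simply trainable weights that the optimizer adjusts:

```python
import tensorflow as tf

# A convolutional layer with 8 learnable 3x3 kernels.
conv = tf.keras.layers.Conv2D(filters=8, kernel_size=3, activation="relu")
conv.build(input_shape=(None, 256, 256, 1))   # assumed grayscale 256x256 input

# The kernel values are not set by hand; they are trained like any other weight.
print(conv.kernel.shape)   # (3, 3, 1, 8): height, width, input channels, number of kernels
```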
CNN motivation
Why use CNNs, instead of normal feedforward neural networks?
CNNs are specialized for structured data. They preserve the spatial structure of the image, since they transform a 2D input into a 2D output (the transformed image).
A feedforward network, in contrast, requires the image to be flattened into a 1D vector, which destroys that structure (see the sketch below):
Some pixels that were close to each other in the original image end up far apart in the vector.
Others that were separated appear next to each other.
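A toy sketch of what flattening does to neighbouring pixels:

```python
import numpy as np

img = np.arange(16).reshape(4, 4)   # a tiny 4x4 "image" with unique pixel values
flat = img.flatten()                # what a feedforward network would receive

# Pixels (0, 0) and (1, 0) are vertical neighbours in the image,
# but end up 4 positions apart in the flattened vector.
print(img[0, 0], img[1, 0])                        # 0 4
print(list(flat).index(0), list(flat).index(4))    # positions 0 and 4
```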
Feature maps
In CNNs, the kernels that transform the image are usually most helpful when they act as detectors. For instance, the edge detection kernel shown earlier is one such example.
These kernels are trained to search for specific patterns or features – circles, arcs,
edges, vertical lines, horizontal lines etc. Thus, the resulting output is not an image,
but a feature map
(Figure: detecting trees and bushes.)
Feature maps
A single such feature map is not very useful. That’s why a convolutional layer contains many kernels (their number is a hyperparameter), each producing a different feature map.
Pooling
The main purpose of pooling layers is to condense the feature maps to a smaller size. Thus, pooling layers are usually situated right after convolutional layers.
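A minimal NumPy sketch of max pooling with 2x2 regions and a stride of 2 (the sizes are assumptions):

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling with a 2x2 window and stride 2.
    A minimal sketch; assumes the input height and width are even."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(36, dtype=float).reshape(6, 6)   # a toy 6x6 feature map
print(max_pool_2x2(fmap).shape)                   # (3, 3) -- condensed to half the size
```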
Stride
Stride refers to the number of pixels the kernel moves at each step.
For example, in the previous max-pooling sketch the stride was 2, as the regions were computed every 2 pixels.
Example dimension transformation
How the input dimensions are transformed by a single convolutional layer.
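The general formula for the output size (padding $P$ and stride $S$ are included here for completeness) is:

$$ W_{\text{out}} = \left\lfloor \frac{W_{\text{in}} - K + 2P}{S} \right\rfloor + 1 $$

where $W_{\text{in}}$ is the input width (or height) and $K$ is the kernel size. For the earlier example: $(256 - 5 + 0)/1 + 1 = 252$.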
Common techniques
To improve the network’s performance
When considering the performance of our model, there are some techniques we can employ in order to prevent overfitting or simply to increase the accuracy. These are not intended only for CNNs, but for all kinds of networks.
L2 regularization
Weight decay
Dropout
Data augmentation
Common techniques
L2 regularization
Regularization, in general, is the technique of adding factors to the loss function, in order to
control what the network is learning.
L2 regularization specifically adds the following factor to the loss: $\lambda \sum_i w_i^2$, where $\lambda$ is a hyperparameter that controls the scale of the effect.
Common techniques
L2 regularization
This discourages the model from learning a very complex solution, thus limiting overfitting.
Due to this factor, the gradient descent update rule becomes $w \leftarrow w - \eta \left( \frac{\partial L}{\partial w} + 2\lambda w \right)$, where $\eta$ is the learning rate.
Common techniques
Weight decay
Weight decay is similar to L2 regularization; however, it changes the update rule directly, instead of doing so indirectly through the loss function.
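One way to write it (a sketch, keeping the factor of 2 from the L2 rule above so the two are directly comparable):

$$ w \leftarrow w - \eta \frac{\partial L}{\partial w} - 2\lambda w $$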
The only difference from the L2 regularization update rule is that the decay term is not multiplied by the learning rate.
Thus, for optimizers with a static learning rate (e.g. SGD), weight decay and L2 regularization are equivalent, up to a rescaling of $\lambda$.
However, for optimizers with an adaptive learning rate, the effect of L2 regularization changes as the learning rate adapts over the course of training. In contrast, weight decay has a constant effect no matter the optimizer.
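A toy NumPy sketch of one SGD step under both schemes (the values of $\eta$ and $\lambda$ are arbitrary):

```python
import numpy as np

w = np.array([0.5, -1.0])
grad = np.array([0.2, 0.1])      # dL/dw of the unregularized loss
eta, lam = 0.1, 0.01

w_l2 = w - eta * (grad + 2 * lam * w)   # L2: the penalty enters through the loss gradient
w_wd = w - eta * grad - 2 * lam * w     # weight decay: applied directly, without eta

# With a static learning rate the two coincide once lambda is rescaled by eta:
w_l2_rescaled = w - eta * (grad + 2 * (lam / eta) * w)
print(np.allclose(w_l2_rescaled, w_wd))  # True
```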
Common techniques
Dropout
Dropout consists of randomly setting a portion of the neurons in a layer to zero. This
creates some form of redundancy in the layer and helps with the generalization of
the network
Common techniques
Dropout
Dropout is present only during training. During testing and operational use of the
model, all neurons are present
To work properly, the remaining outputs of the given layer should be scaled up. If the portion of neurons to be dropped is $p$ ($0 < p < 1$), then the scaling factor is $\frac{1}{1-p}$.
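A minimal NumPy sketch of this “inverted dropout” scaling:

```python
import numpy as np

def dropout(activations, p, training=True):
    """Drop a fraction p of the neurons and scale the survivors by 1 / (1 - p),
    so the expected output stays the same. A sketch, not a framework implementation."""
    if not training:
        return activations                       # at test time all neurons are present
    keep_mask = np.random.rand(*activations.shape) >= p
    return activations * keep_mask / (1.0 - p)

layer_out = np.ones(10)
print(dropout(layer_out, p=0.2))   # roughly 80% of the values become 1.25, the rest 0
```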
Common techniques
Data augmentation
Data augmentation is used when the available data does not cover all the variations of the examples we would like our model to learn.
For example, if we want to classify images of cats, ideally our dataset should include pictures of cats in different poses. If our dataset contains only cats facing to the right, we can correct that with data augmentation.
Common techniques
Data augmentation
Data augmentation itself is the technique of transforming the data to artificially create
more examples. This includes mirroring the image, translating, stretching, scaling etc.
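A minimal NumPy sketch of such transformations on a single image array of shape (height, width, channels); the exact transforms and ranges are my assumptions:

```python
import numpy as np

def augment(img, rng=np.random.default_rng()):
    out = img
    if rng.random() < 0.5:
        out = out[:, ::-1, :]              # horizontal mirror (a right-facing cat now faces left)
    shift = int(rng.integers(-10, 11))     # small random horizontal translation
    out = np.roll(out, shift, axis=1)
    return out

dummy = np.zeros((100, 100, 3), dtype=np.uint8)   # a placeholder image
print(augment(dummy).shape)                       # (100, 100, 3)
```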
Popular CNN architectures
As a final note, let’s take a look at some of the popular CNN architectures created by the
professionals in this field.
The ones we will discuss are AlexNet, VGG, GoogleNet and ResNet.
Popular CNN architectures
AlexNet
AlexNet consists of 5 convolutional layers, 3 max-pooling layers and 3 dense (fully connected) layers.
Popular CNN architectures
VGG
VGG added more layers than AlexNet, which improved the results drastically. The trick VGG employed was to use kernels of the minimum useful size, 3x3, in all convolutional layers. This allowed more layers to be stacked with fewer overall parameters.
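As a rough back-of-the-envelope count (ignoring biases): two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, but with $2 \cdot 3 \cdot 3 \cdot C^2 = 18C^2$ weights instead of $5 \cdot 5 \cdot C^2 = 25C^2$, where $C$ is the number of channels.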
Popular CNN architectures
GoogleNet
This architecture was all about computational efficiency. The team at Google designed the so-called Inception module, and the whole network consisted of stacked Inception modules. The Inception module incorporated parallel layers and a 1x1 convolution bottleneck layer to reduce the number of parameters and operations.
Popular CNN architectures
ResNet
The number of layers in the ResNet architecture skyrocketed from 22 (GoogleNet) to 152! This was achieved thanks to the residual blocks. These consist of 2 convolutional layers whose output is summed with the block’s input. Thus, the network only needs to learn how to change the input (the residual), not the full output itself.
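A simplified tf.keras sketch of such a residual block (the framework, filter count and input size are my assumptions; real ResNet blocks also include batch normalization):

```python
import tensorflow as tf

def residual_block(x, filters=64):
    # Two 3x3 convolutions; "same" padding keeps the spatial size unchanged.
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    out = tf.keras.layers.Add()([x, y])   # the skip connection: input + learned change
    return tf.keras.layers.ReLU()(out)

inputs = tf.keras.Input(shape=(56, 56, 64))   # channel count must match `filters`
model = tf.keras.Model(inputs, residual_block(inputs))
model.summary()
```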