The Math Behind Convolutional Neural Networks
Image by DALL-E
Index
· 1: Introduction
· 5: Conclusion
· Additional Resources
1: Introduction
Convolutional Neural Networks, or CNNs for short, are a big deal when it comes to
working with images, like in photo recognition or sorting. They’re super good at
picking up on the patterns and details in pictures automatically, which is why
they’re a go-to for any project that deals with a bunch of images.
The cool thing about CNNs is that they don’t just mash all the image data into one
big pile. Instead, they keep the layout of the image intact, which means they’re great
at noticing the specific patterns and where they’re located. This approach is a game-
changer because it lets CNNs handle the tricky parts of working with images much
more smoothly.
One of the secret sauces of CNNs is something called convolutional layers. These
layers move across the image and are able to spot different visual features, like
lines, textures, and shapes. This beats the old-school way where people had to
manually pick out these features, which was slow and often a bottleneck for getting
things done. By having the network figure out these features on its own, CNNs not
only get more accurate, they’re also simpler and can be used for a wider range of
image-related tasks without much hassle.
CNNs are composed of several types of layers, each serving a specific function in the
image recognition process. The main layers include convolutional layers, activation
functions, pooling layers, and fully connected layers. Together, these layers allow
CNNs to detect features, reduce complexity, and make predictions.
At its core, the convolution operation involves sliding a filter (or kernel) over the
input image and computing the dot product of the filter values and the original pixel
values at each position. The filter is a small matrix of weights, typically of size 3x3 or
5x5, which is trained to detect specific features in the image.
Mathematically, the convolution at each position is:

S(i, j) = (I * K)(i, j) = Σ_m Σ_n I(i + m, j + n) · K(m, n)

where:
S(i, j) is the value of the output feature map at position (i, j).
I is the input image and K is the kernel (filter).
m and n index the rows and columns of the kernel.
This equation tells us that each element S(i,j) of the output feature map is the sum of
the element-wise product of the kernel K and the portion of the input image I over
which the kernel is currently positioned.
Now, consider a matrix of pixel values which will serve as input image. If it’s a
grayscale image (image above), the matrix will have a single layer; for color images,
there are typically three layers (RGB), but the operation is often performed
separately on each layer.
The convolution operation applies a kernel (filter) to the matrix. Here the kernel is
another matrix, smaller than the input image, with predefined dimensions (e.g.,
3x3). The values in this matrix are the weights, which are learned during the
training process. The kernel is designed to detect specific types of features, such as
edges, textures, or patterns, from the input image. The kernel then strides (we will
cover this operation in a moment) over the entire input image, performing
element-wise multiplication followed by a sum at each position.
From the convolution operation, we will get the output feature map. It’s a new
matrix where each element represents the presence and intensity of a feature
detected by the kernel at a specific location in the input image.
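To make the operation concrete, here is a minimal NumPy sketch of a stride-1, no-padding convolution as described above. The function name and the edge-detection kernel are illustrative choices, not taken from the article's notebook.

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over every valid position (stride 1, no padding)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise product of the kernel and the patch it covers, then a sum
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "grayscale image"
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])              # simple vertical-edge detector
feature_map = convolve2d(image, kernel)
print(feature_map.shape)                           # (4, 4)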
2.2: Stride
Stride on Input Image — Animation by Author
The stride specifies the number of pixels by which we move the filter across the
input image or feature map in each step. It is applied both horizontally and
vertically. A stride of 1 means the filter moves one pixel at a time, ensuring a
detailed and dense scanning of the input. Larger strides result in the filter skipping
pixels, leading to broader and less dense coverage.
The stride plays a direct role in determining the dimensions of the output feature
map:
With a Stride of 1: The filter moves across every pixel, often resulting in an
output feature map that is relatively large or similar in size to the input,
depending on padding, which we will talk about in the next section.
With a Larger Stride: The filter skips over pixels, which means it covers the
input in fewer steps. This leads to a smaller output feature map since each step
covers a larger area of the input with less overlap between positions where the
filter is applied.
Mathematical Representation
The size of the output feature map (W_out, H_out) can be calculated from the input
size (W_in, H_in), filter size (F), stride (S), and padding (P) using the formula:

W_out = (W_in - F + 2P) / S + 1
H_out = (H_in - F + 2P) / S + 1

where:
W_out and H_out are the width and height of the output feature map, respectively.
W_in and H_in are the width and height of the input.
F is the filter size.
S is the stride.
P is the padding.
A larger stride increases the field of view of each application of the filter, allowing
the network to capture more global features of the input with fewer parameters.
Using a larger stride reduces the computational load and memory usage since it
decreases the size of the output feature map and, consequently, the number of
operations required for convolution.
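As a quick sanity check of this formula, here is a tiny helper (written for this walkthrough, not part of any library) that computes the output dimensions:

def conv_output_size(w_in, h_in, f, s, p):
    # W_out = (W_in - F + 2P) / S + 1, and the same for the height
    w_out = (w_in - f + 2 * p) // s + 1
    h_out = (h_in - f + 2 * p) // s + 1
    return w_out, h_out

print(conv_output_size(8, 8, f=3, s=1, p=0))  # (6, 6): stride 1 shrinks the map slightly
print(conv_output_size(8, 8, f=3, s=2, p=0))  # (3, 3): a larger stride shrinks it further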
2.3: Padding
Padding plays a critical role in shaping the network’s architecture by influencing the
spatial dimensions of the output feature maps.
It involves adding layers of zeros (or other values, but zeros are most common)
around the border of the input image or feature map before applying the
convolution operation. This technique can be applied for various reasons, the most
prominent being to control the size of the output feature maps and to allow the
convolutional filters to have access to the edge pixels of the input.
You can notice how our previous 8x8 matrix is now a 10x10 matrix, as we added a
layer of 0s around it.
Without padding, each convolution operation reduces the size of the feature map.
Padding allows us to apply filters to the input without shrinking its spatial
dimensions, preserving more information, especially for deeper networks where
many convolutional layers are applied sequentially.
By padding the input, filters can properly process the edge pixels of the image,
ensuring that features located at the borders are adequately captured and utilized in
the network’s learning process.
There are two main types of padding:
Valid Padding
With valid padding, no zeros are added at all: the filter is applied only where it fits
entirely within the input, so each convolution shrinks the spatial dimensions of the
feature map.
Same Padding
With same padding, enough zeros are added to the edges of the input to ensure
that the output feature map has the same dimensions as the input (when the stride
is 1). This is particularly useful for designing networks where the input and output
sizes need to be consistent.
The effect of padding on the output feature map size is captured by the same
formula used above:

W_out = (W_in - F + 2P) / S + 1
H_out = (H_in - F + 2P) / S + 1

where:
W_out and H_out are the width and height of the output feature map, respectively.
F is the filter size, S is the stride, and P is the padding.

For same padding with a stride of 1, the padding is set to P = (F - 1) / 2, which makes
W_out = W_in and H_out = H_in.
While padding helps in maintaining the spatial dimensions of the input through the
layers, excessive padding might lead to computational inefficiency and an increase
in the model’s complexity by adding more non-informative inputs (zeros) into the
computation.
The choice between valid and same padding typically depends on the specific
requirements of the application, such as the importance of preserving the spatial
dimensions of the input or the need to minimize computational overhead.
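To see the two padding modes side by side, here is a small PyTorch sketch; the tensor and layer sizes are arbitrary, chosen only to mirror the 8x8 example above.

import torch
from torch import nn

x = torch.randn(1, 1, 8, 8)                             # one 8x8 single-channel image
valid_conv = nn.Conv2d(1, 1, kernel_size=3, padding=0)  # valid padding: no zeros added
same_conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # same padding for a 3x3 kernel, stride 1

print(valid_conv(x).shape)  # torch.Size([1, 1, 6, 6]) -- the feature map shrinks
print(same_conv(x).shape)   # torch.Size([1, 1, 8, 8]) -- the spatial size is preserved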
The output of a convolutional layer with multiple filters is a stack of feature maps,
one for each filter. This stack forms a three-dimensional volume where the depth
corresponds to the number of filters used. This depth is crucial for building a
hierarchical representation of the data, allowing subsequent layers to detect
increasingly abstract features by combining the outputs of previous layers.
The individual feature maps generated by each filter are stacked along the depth
dimension, forming a 3D volume. This volume encapsulates the diverse features
detected by the filters, providing a rich, multi-faceted representation of the input.
Implications of Depth
More filters mean a deeper network with a higher capacity to learn complex
features. However, this also increases the network’s computational complexity and
the amount of training data needed to learn effectively.
Each filter adds parameters to the model (the weights that define the filter). While
more filters increase the network’s expressive power, they also raise the total
number of parameters, which can impact training efficiency and the risk of
overfitting.
The allocation of filters across layers is strategic. Layers closer to the input might
have fewer, more general filters, while deeper layers may use more filters to capture
the complexity and variability of higher-order features within the data.
In the context of CNNs, weight sharing refers to using the same filter (and thus the
same set of weights) across the entire input image or feature map. Instead of
learning a unique set of weights for every possible location, a single filter scans the
entire image, applying the same weights at each position. This operation is repeated
for each filter in the convolutional layer.
By reusing the same set of weights across different parts of the input image, weight
sharing dramatically reduces the number of parameters in the model. This makes
CNNs much more parameter-efficient compared to fully connected networks,
especially when dealing with large input sizes.
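To make the savings concrete, here is a small comparison (the sizes are illustrative, roughly matching a 28x28 input): a 3x3 convolutional layer with 32 filters versus a fully connected layer producing the same number of outputs.

import torch
from torch import nn

conv = nn.Conv2d(1, 32, kernel_size=3)        # 32 filters of size 3x3, shared across the whole image
fc = nn.Linear(28 * 28, 32 * 26 * 26)         # dense layer producing the same number of outputs

conv_params = sum(p.numel() for p in conv.parameters())
fc_params = sum(p.numel() for p in fc.parameters())
print(conv_params)  # 320 parameters (32 * 3*3 weights + 32 biases)
print(fc_params)    # roughly 17 million parameters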
Weight sharing enables the network to detect features regardless of their position
in the input image. If a filter learns to recognize an edge or a specific pattern, it can
detect this feature anywhere in the image, making CNNs inherently translation
invariant.
With fewer parameters to learn, CNNs are less likely to overfit the training data.
This improves the model’s ability to generalize from the training data to unseen
data, enhancing its performance on real-world tasks.
Despite the extensive reuse of weights across the spatial domain, each weight is
updated based on the aggregate gradient from all positions where it was applied.
This ensures that the filter weights are optimized to detect features that are most
relevant for the task, based on the entire dataset.
At the core of feature map creation is the convolution operation, where a filter with
learned weights slides (or convolves) across the input image or feature map from a
previous layer. At each position, the filter performs an element-wise multiplication
with the part of the image it covers, and the results are summed up to produce a
single output pixel in the new feature map.
The weights in the filter determine the type of feature it detects, such as edges,
textures, or more complex patterns in deeper layers. During training, these weights
are adjusted through backpropagation, allowing the network to learn which features
are most important for the task at hand.
The size of the stride and the use of padding directly affect the spatial dimensions of
the feature map. A larger stride results in broader coverage with less overlap
between filter applications, reducing the feature map size. Padding can be used to
preserve the spatial dimensions of the input, ensuring that features at the edges of
the image are not lost.
After a feature map is created through the convolution operation, it is often passed
through an activation function, such as ReLU. This introduces non-linearity,
enabling the network to learn and represent more complex patterns.
If you want to learn more about ReLU and other activation functions, take a look at
my article on the topic.
The activated feature map then proceeds to the next layer or a pooling operation.
Pooling layers reduce the size of the feature maps, thereby decreasing the number
of parameters and computations required in the network. This simplification helps
to focus on the most important features.
There are a few types of pooling techniques you should know about when playing
with CNNs:
Max Pooling
This is the most common form of pooling, where the maximum value from a set of
values in the feature map is selected and forwarded to the next layer. Max pooling
effectively captures the most pronounced feature in each patch of the feature map.
If we denote the feature map by F and the max pooling operation by P_max, the result of
max pooling at position (i,j) for a window size of n×n can be expressed as:

P_max(i, j) = max_{0 ≤ a, b < n} F(s·i + a, s·j + b)

Here, s is the stride of the pooling window, and a, b iterate over the window
dimensions. This operation is applied independently for each window position
across the feature map.
Average Pooling
Unlike max pooling, average pooling takes the average of the values in each patch of
the feature map. This method provides a more generalized feature representation
but might dilute the presence of smaller, yet significant features.
For a feature map F and an n×n pooling window, the average pooling operation at
position (i,j) can be mathematically represented as:

P_avg(i, j) = (1 / n²) · Σ_{0 ≤ a, b < n} F(s·i + a, s·j + b)

Similar to max pooling, s represents the stride, and a, b iterate over the window, but
here the operation computes the mean of the values within each window.
Global Pooling
In global pooling, the entire feature map is reduced to a single value by taking the
max (global max pooling) or average (global average pooling) of all values in the
feature map. This approach is often used to reduce each feature map to a single
value before a fully connected layer.
For a feature map F of size M×N, global max pooling (P_gmax) and global average
pooling (P_gavg) can be defined as:

P_gmax = max_{i, j} F(i, j)
P_gavg = (1 / (M·N)) · Σ_{i=1}^{M} Σ_{j=1}^{N} F(i, j)
Global pooling operations compress the entire feature map into a single summary
statistic, which is particularly useful for reducing model parameters before a fully
connected layer for classification.
The size of the window and the stride (how far the window moves each time)
determine how much the feature map is reduced. A common choice is a 2x2 window
with a stride of 2, which reduces the size of the feature map by half.
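The PyTorch functional API exposes all three pooling variants; a quick sketch (with an arbitrary 4x4 feature map) shows how they reduce the spatial dimensions:

import torch
from torch.nn import functional as F

fmap = torch.randn(1, 1, 4, 4)                              # one 4x4 feature map

print(F.max_pool2d(fmap, kernel_size=2, stride=2).shape)    # torch.Size([1, 1, 2, 2])
print(F.avg_pool2d(fmap, kernel_size=2, stride=2).shape)    # torch.Size([1, 1, 2, 2])

# Global pooling: collapse each feature map to a single value
print(F.adaptive_max_pool2d(fmap, 1).shape)                 # torch.Size([1, 1, 1, 1])
print(F.adaptive_avg_pool2d(fmap, 1).shape)                 # torch.Size([1, 1, 1, 1])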
Fully connected layers are often positioned towards the end of CNNs. These layers
are where the high-level reasoning based on the learned features takes place,
ultimately leading to classification or prediction.
In a fully connected layer, every neuron is connected to every activation from the
previous layer. This dense connectivity ensures that the layer has the complete
context of the extracted features, allowing it to learn complex patterns that are
distributed across the feature map.
The neurons in fully connected layers can learn high-level patterns in the data by
considering the global information presented by the flattened feature map. This
ability is fundamental to making predictions or classifications based on the entire
input image.
Role in CNNs
In many CNN architectures, the final fully connected layer serves as the
classification layer, where each neuron represents a specific class. The network’s
prediction is determined by the activation of these neurons, typically through a
softmax function that converts the activations into probabilities.
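As an illustration (the logit values below are made-up numbers), softmax turns the raw activations of the 10 output neurons into a probability distribution over the classes:

import torch
from torch.nn import functional as F

logits = torch.tensor([[1.2, -0.3, 3.1, 0.0, 0.5, -1.0, 2.2, 0.1, -0.7, 0.9]])  # one value per class
probs = F.softmax(logits, dim=1)

print(probs.sum())          # close to 1.0 -- the probabilities sum to one
print(probs.argmax(dim=1))  # tensor([2]) -- the index of the predicted class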
Fully connected layers synthesize the localized, abstract features extracted by the
convolutional layers into a cohesive understanding of the input data. This synthesis
is essential for the network to reason about the input as a whole and make informed
decisions.
Feel free to have this Jupyter Notebook on the side, which contains all the code we
will cover today:
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms, models
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

mnist_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
A sample image from the dataset is displayed using matplotlib , illustrating the type
of data the network will be trained on.
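A minimal way to do this, building on the mnist_dataset object defined above (the exact plotting code in the notebook may differ):

image, label = mnist_dataset[0]            # a (1, 28, 28) tensor and its integer label
plt.imshow(image.squeeze(), cmap='gray')   # drop the channel dimension for plotting
plt.title(f'Label: {label}')
plt.show()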
The dataset is divided into training and validation sets to enable model evaluation
during training. DataLoader instances handle batching, shuffling, and preparing the
dataset for efficient processing by the neural network.
train_size = int(0.8 * len(mnist_dataset))
val_size = len(mnist_dataset) - train_size
train_dataset, val_dataset = random_split(mnist_dataset, [train_size, val_size])
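The corresponding DataLoader instances can then be created along these lines (the batch size is an assumption; the notebook may use a different value):

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)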
The __init__ function is the constructor of the MyCNN class. It's where the layers of
the neural network are defined. The super(MyCNN, self).__init__() line calls the
constructor of the base nn.Module class, which is necessary for PyTorch to initialize
everything correctly.
class MyCNN(nn.Module):
    def __init__(self):
        super(MyCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(7*7*64, 128)
        self.fc2 = nn.Linear(128, 10)
As you can notice from the code above, the network includes two convolutional
layers, conv1 and conv2 .
conv1 takes a single-channel image (like a grayscale image) as input and produces
32 feature maps using a filter (or kernel) size of 3x3, with a stride of 1 and padding
of 1. Padding is added to ensure the output feature maps are the same size as the
input.
conv2 takes the 32 feature maps from conv1 as input and produces 64 feature maps,
also with a 3x3 kernel, stride of 1, and padding of 1. This layer further extracts
features from the input provided by conv1 .
After the convolutional layers, there are two fully connected (fc) layers.
fc1 is the first fully connected layer that transforms the output from the
convolutional layers into a vector of size 128. The input size is 7*7*64 , which implies
that before reaching this layer, the feature maps are flattened into a single vector
and that the dimensionality of the feature maps before flattening is 7x7 with 64
channels. This step is crucial for transitioning from spatial feature extraction to
making decisions (classifications) based on those features.
fc2 is the second fully connected layer, which takes the 128-dimensional vector
from fc1 and outputs a 10-dimensional vector. This output size typically
corresponds to the number of classes in a classification problem, suggesting this
network is designed to classify images into one of 10 categories.
def _initialize_weights(self):
    for m in self.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.normal_(m.weight, 0, 0.01)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)
Weight initialization is applied to ensure the network starts with weights in a range
that neither vanishes nor explodes the gradients. Convolutional layers are initialized
with normal distribution, while fully connected layers use Xavier uniform
initialization.
To learn more about Xavier initialization and other types of initialization, consider
diving into my previous article on the topic.
Let’s dissect the forward method step by step, focusing on each operation to understand
how input images are transformed into output predictions.
x = F.relu(self.conv1(x))
The input tensor x, representing the batch of images, is passed through the first
convolutional layer ( conv1 ). This layer applies learned filters to the input, capturing
basic visual features like edges and textures. The convolution is immediately
followed by a ReLU activation, which sets all negative values in the output tensor to
zero, introducing the non-linearity that helps the network distinguish features.
x = F.max_pool2d(x, 2, 2)
Following the first convolution and activation, a max pooling operation is applied.
This operation reduces the spatial dimensions of the feature map by half (due to the
pool size and stride of 2), summarizing the most significant features within 2x2
patches of the feature map. Max pooling helps to make the representation
somewhat invariant to small shifts and distortions.
x = F.relu(self.conv2(x))
The process repeats with a second convolutional layer ( conv2 ), which applies
another set of learned filters to the now-reduced feature map. This layer typically
captures more complex features, building upon the basic patterns identified by the
first layer. Again, ReLU activation follows to maintain non-linearity in the learning
process.
x = F.max_pool2d(x, 2, 2)
Another max pooling step further reduces the spatial dimensions of the resulting
feature map, compacting the feature representation and reducing computational
complexity for subsequent layers.
Flattening
x = x.view(x.size(0), -1)
This reshapes the 3D output of the convolutional part (64 feature maps of size 7x7
after two pooling steps) into a single 7*7*64-dimensional vector per image, matching
the input size expected by fc1.
x = F.relu(self.fc1(x))
The flattened tensor is passed through the first fully connected layer (fc1), where
neurons can learn complex patterns from the entire feature set. The ReLU function
is applied once more to introduce non-linearity, enabling the network to learn and
represent more complex functions.
x = self.fc2(x)
Finally, the tensor passes through a second fully connected layer ( fc2 ), which acts
as the output layer. This layer has as many neurons as there are classes to predict (10
for MNIST digits). The output of this layer represents the network's predictions for
each class.
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5, amsgrad=True)
The Adam optimizer is a popular algorithm for training deep learning models,
combining the best properties of the AdaGrad and RMSProp algorithms to
efficiently handle sparse gradients on noisy problems. It adjusts the learning rate on
a per-parameter basis, making it highly effective and well-suited for a wide range of
tasks and models. If you want to learn more about Adam, take a look at my article
where I go through its math and build it from scratch.
class Trainer:
    def __init__(self, model, criterion, optimizer, device, patience=7):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer
        self.device = device
        self.early_stopping = EarlyStopping(patience=patience)
        self.scheduler = ReduceLROnPlateau(self.optimizer, 'min', patience=3, verbose=True)
        self.train_losses = []
        self.val_losses = []
        self.gradient_norms = []
In the initialization method __init__ , the Trainer class takes the CNN model, the
loss function ( criterion ), and the optimizer as arguments, alongside the device on
which to run the training (CPU or GPU) and the patience for early stopping. An
EarlyStopping instance is created to monitor validation loss and halt training if the
model ceases to improve, preventing overfitting. A learning rate scheduler
( ReduceLROnPlateau ) is also initialized to dynamically adjust the learning rate based
on the validation loss, helping to find the optimal learning rate during training. Lists
to track training and validation losses, as well as gradient norms, are initialized for
analysis and debugging purposes.
The train method orchestrates the training process over a specified number of
epochs . For each epoch, it sets the model to training mode and iterates over the
training dataset using the train_loader . Input images and labels are moved to the
specified device . The optimizer's gradients are zeroed before each forward pass to
prevent accumulation from previous iterations. The model's predictions are
obtained, and the loss is calculated using the specified criterion . The loss value is
appended to the train_losses list for tracking. Backpropagation is performed by
calling loss.backward() , and the optimizer updates the model weights with
optimizer.step() .
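Put together, the loop described above looks roughly like this (a sketch consistent with the description, not necessarily the notebook's exact code):

def train(self, train_loader, val_loader, epochs):
    for epoch in range(epochs):
        self.model.train()                      # training mode
        for images, labels in train_loader:
            images, labels = images.to(self.device), labels.to(self.device)
            self.optimizer.zero_grad()          # clear gradients from the previous batch
            outputs = self.model(images)
            loss = self.criterion(outputs, labels)
            self.train_losses.append(loss.item())
            loss.backward()                     # backpropagation
            self.optimizer.step()               # weight update

At the end of each epoch, the validation step shown below follows.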
val_loss = self.evaluate(val_loader)
self.val_losses.append(val_loss)
self.scheduler.step(val_loss)
self.early_stopping(val_loss)
After processing the training data, the model is evaluated on the validation dataset
using the evaluate method, which calculates the average validation loss. This loss is
used to adjust the learning rate with the scheduler and to determine if early
stopping conditions are met. Validation loss is tracked for analysis.
if self.early_stopping.early_stop:
    print("Early stopping")
    break
The evaluate method calculates the average loss over the validation or test dataset
without updating the model's weights. It sets the model to evaluation mode and
disables gradient computations for efficiency.
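A sketch of such an evaluate method, consistent with that description (not necessarily the notebook's exact code):

def evaluate(self, loader):
    self.model.eval()                  # evaluation mode: disables dropout, fixes batch norm statistics
    total_loss = 0.0
    with torch.no_grad():              # no gradients needed, saving memory and compute
        for images, labels in loader:
            images, labels = images.to(self.device), labels.to(self.device)
            outputs = self.model(images)
            total_loss += self.criterion(outputs, labels).item()
    return total_loss / len(loader)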
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
Adding random flips and rotations diversifies the training data, encouraging the
model to learn more robust features.
4.2: Dropout
Dropout is a regularization technique that randomly sets a fraction of input units to
0 during training, preventing units from co-adapting too much. This randomness
forces the network to learn more robust features that are useful in conjunction with
various random subsets of the other neurons.
In PyTorch, dropout can be added to the CNN model by including nn.Dropout layers:
class MyCNN(nn.Module):
    def __init__(self):
        super(MyCNN, self).__init__()
        # Convolutional layers
        self.fc1 = nn.Linear(7*7*64, 128)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # Convolutional and pooling operations
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x
Adding a dropout layer before the final fully connected layer helps mitigate
overfitting by encouraging the model to distribute the learned representation across
multiple neurons.
class MyCNN(nn.Module):
    def __init__(self):
        super(MyCNN, self).__init__()
        # Convolutional layers
        self.conv1_bn = nn.BatchNorm2d(32)
        # Fully connected layers

    def forward(self, x):
        x = F.relu(self.conv1_bn(self.conv1(x)))
        # Continue through model
Applying batch normalization after convolutional layers but before the activation
function helps in normalizing the output, contributing to faster convergence and
improved overall performance.
model = models.resnet18(pretrained=True)

# Replace the final fully connected layer
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 10)  # Assuming 10 classes for the new task

# Freeze all layers but the last fully connected layer
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
Here, a pre-trained ResNet-18 model is adapted for a new task with 10 classes by
replacing its final layer. Freezing the weights of all layers except the last one allows
us to fine-tune only the classifier layer, leveraging the feature extraction capabilities
learned from the original dataset.
Incorporating these strategies into the CNN training process not only combats
overfitting but also enhances model performance by ensuring robust feature
learning and leveraging knowledge from pre-trained models.
5: Conclusion
Wrapping up our deep dive into Convolutional Neural Networks, we’ve covered a lot.
From setting up and preparing data to dissecting CNN architecture and its layers,
we’ve seen what makes these models tick. We’ve looked into how tweaking things
like weight initialization and using techniques like data augmentation and transfer
learning can seriously boost a model’s performance. These methods help make our
models smarter, avoiding common pitfalls like overfitting and making them more
versatile.
CNNs are pretty much everywhere in AI now, helping with everything from spotting
faces in photos to diagnosing diseases from medical images. Their knack for picking
up on visual cues makes them super valuable for a whole range of tasks.
Additional Resources
1. LeCun et al., “Gradient-Based Learning Applied to Document Recognition”
This seminal paper by Yann LeCun and colleagues introduces LeNet-5, one of
the first convolutional neural networks, and demonstrates its application to
document recognition tasks.
If you liked this article consider leaving a like, and follow me to be updated on my
latest posts. My goal is to recreate all the most popular algorithms from scratch and
make machine learning accessible to everyone.