VGG-16 | CNN model

Last Updated : 21 Mar, 2024

A Convolutional Neural Network (CNN) architecture is a deep learning model designed for processing structured grid-like data, such as images. It consists of multiple layers, including convolutional, pooling, and fully connected layers. CNNs are highly effective for tasks like image classification, object detection, and image segmentation due to their hierarchical feature extraction capabilities.

VGG-16

The VGG-16 model is a convolutional neural network (CNN) architecture that was proposed by the Visual Geometry Group (VGG) at the University of Oxford. It is characterized by its depth, consisting of 16 layers, including 13 convolutional layers and 3 fully connected layers. VGG-16 is renowned for its simplicity and effectiveness, as well as its ability to achieve strong performance on various computer vision tasks, including image classification and object recognition. The model’s architecture features a stack of convolutional layers followed by max-pooling layers, with progressively increasing depth. This design enables the model to learn intricate hierarchical representations of visual features, leading to robust and accurate predictions. Despite its simplicity compared to more recent architectures, VGG-16 remains a popular choice for many deep learning applications due to its versatility and excellent performance.

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual competition in computer vision where teams tackle tasks including object localization and image classification. VGG16, proposed by Karen Simonyan and Andrew Zisserman in 2014, achieved top ranks in both tasks, detecting objects from 200 classes and classifying images into 1000 categories.


VGG-16 architecture

This model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million hand-annotated images; the ILSVRC classification subset used here covers 1000 classes.

VGG-16 Model Objective:

The input to the network is a fixed-size 224×224 RGB image, so we have a tensor of shape (224, 224, 3) as our input. The model processes the input image and outputs a vector of 1000 values:

[Tex]\hat{y} =\begin{bmatrix} \hat{y}_0\\ \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_{999} \end{bmatrix}[/Tex]

This vector represents the classification probability for the corresponding class. Suppose we have a model that predicts that the image belongs to class 0 with probability 0.1, class 1 with probability 0.05, class 2 with probability 0.05, class 3 with probability 0.03, class 780 with probability 0.72, class 999 with probability 0.05, and all other classes with probability 0.

So the classification vector for this example will be:

[Tex]\hat{y}=\begin{bmatrix} \hat{y}_{0}=0.1\\ 0.05\\ 0.05\\ 0.03\\ \vdots\\ \hat{y}_{780} = 0.72\\ \vdots\\ \hat{y}_{999} = 0.05 \end{bmatrix}[/Tex]

To make sure these probabilities sum to 1, we use the softmax function, which is defined as follows:

[Tex]\hat{y}_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}[/Tex]
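As a quick illustration (not part of the original model code), here is a minimal NumPy sketch of the softmax function; the logits below are made-up values:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

# Hypothetical raw scores (logits) for a 5-class toy example
logits = np.array([2.0, 1.0, 0.1, -1.2, 0.5])
probs = softmax(logits)

print(probs)          # each entry lies in (0, 1)
print(probs.sum())    # the entries sum to 1.0
```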

After this, we collect the 5 most probable candidates into a vector:

[Tex]C =\begin{bmatrix} 780\\ 0\\ 1\\ 2\\ 999 \end{bmatrix}[/Tex]

and our ground truth vector is defined as follows:

[Tex]G = \begin{bmatrix} G_{0}\\ G_{1}\\ G_{2} \end{bmatrix}=\begin{bmatrix} 780\\ 2\\ 999 \end{bmatrix}   [/Tex]

Then we define our Error function as follows:

[Tex]E = \frac{1}{n}\sum_{k}\min_{i} d(c_{i}, G_{k})[/Tex]

It calculates the minimum distance between each ground truth class and the predicted candidates, where the distance function d is defined as:

  • d=0 if [Tex]c_i=G_k[/Tex]
  • d=1 otherwise

So, the loss for this example is:

[Tex]\begin{aligned} E &=\frac{1}{3}\left ( \min_{i}d(c_{i}, G_{0}) +\min_{i}d(c_{i}, G_{1})+\min_{i}d(c_{i}, G_{2}) \right ) \\ &= \frac{1}{3}(0 + 0 + 0) \\ &= 0 \end{aligned}[/Tex]

Since all the ground-truth categories appear among the predicted top-5 candidates, the loss becomes 0.
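A minimal NumPy sketch of this top-5 error computation, using the hypothetical probability vector and ground-truth labels from the example above:

```python
import numpy as np

def top5_error(probs, ground_truth):
    """Top-5 error as defined above: for each ground-truth class,
    d = 0 if it appears among the 5 most probable candidates, else 1."""
    top5 = np.argsort(probs)[::-1][:5]                 # indices of the 5 largest probabilities
    distances = [0 if g in top5 else 1 for g in ground_truth]
    return sum(distances) / len(ground_truth)

# Toy 1000-class probability vector matching the example in the text
probs = np.zeros(1000)
probs[0], probs[1], probs[2], probs[3] = 0.10, 0.05, 0.05, 0.03
probs[780], probs[999] = 0.72, 0.05

ground_truth = [780, 2, 999]
print(top5_error(probs, ground_truth))   # 0.0 -- all ground-truth labels are in the top 5
```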

VGG Architecture:

The VGG-16 architecture is a deep convolutional neural network (CNN) designed for image classification tasks. It was introduced by the Visual Geometry Group at the University of Oxford. VGG-16 is characterized by its simplicity and uniform architecture, making it easy to understand and implement.

The VGG-16 configuration typically consists of 16 layers, including 13 convolutional layers and 3 fully connected layers. These layers are organized into blocks, with each block containing multiple convolutional layers followed by a max-pooling layer for downsampling.

VGG-16 architecture Map

Here’s a breakdown of the VGG-16 architecture based on the provided details:

  1. Input Layer:
    • Input dimensions: (224, 224, 3)
  2. Convolutional Layers (64 filters, 3×3 filters, same padding):
    • Two consecutive convolutional layers with 64 filters each and a filter size of 3×3.
    • Same padding is applied to maintain spatial dimensions.
  3. Max Pooling Layer (2×2, stride 2):
    • Max-pooling layer with a pool size of 2×2 and a stride of 2.
  4. Convolutional Layers (128 filters, 3×3 filters, same padding):
    • Two consecutive convolutional layers with 128 filters each and a filter size of 3×3.
  5. Max Pooling Layer (2×2, stride 2):
    • Max-pooling layer with a pool size of 2×2 and a stride of 2.
  6. Convolutional Layers (256 filters, 3×3 filters, same padding):
    • Three consecutive convolutional layers with 256 filters each and a filter size of 3×3.
  7. Max Pooling Layer (2×2, stride 2):
    • Max-pooling layer with a pool size of 2×2 and a stride of 2.
  8. Convolutional Layers (512 filters, 3×3 filters, same padding):
    • Three consecutive convolutional layers with 512 filters each and a filter size of 3×3, followed by a max-pooling layer (2×2, stride 2).
  9. Second Stack of Convolutional Layers (512 filters) and Max Pooling:
    • Three more convolutional layers with 512 filters each and a filter size of 3×3.
    • A final max-pooling layer (2×2, stride 2), reducing the feature map to 7×7×512.
  10. Flattening:
    • Flatten the output feature map (7×7×512) into a vector of size 25088.
  11. Fully Connected Layers:
    • Three fully connected layers with ReLU activation.
    • First layer with input size 25088 and output size 4096.
    • Second layer with input size 4096 and output size 4096.
    • Third layer with input size 4096 and output size 1000, corresponding to the 1000 classes in the ILSVRC challenge.
    • Softmax activation is applied to the output of the third fully connected layer for classification.

This architecture follows the specifications above, using the ReLU activation function throughout and a final fully connected layer that outputs probabilities for the 1000 classes via softmax activation.
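For illustration, a minimal Keras sketch of this layer stack might look as follows (assuming TensorFlow/Keras; weights are untrained and training details such as dropout and weight initialization are omitted):

```python
from tensorflow.keras import layers, models

def build_vgg16(num_classes=1000):
    model = models.Sequential(name="vgg16_sketch")
    model.add(layers.Input(shape=(224, 224, 3)))

    # Block 1: 2 x conv3-64, then maxpool
    for _ in range(2):
        model.add(layers.Conv2D(64, (3, 3), padding="same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

    # Block 2: 2 x conv3-128, then maxpool
    for _ in range(2):
        model.add(layers.Conv2D(128, (3, 3), padding="same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

    # Blocks 3-5: 3 x conv3-256, then two stacks of 3 x conv3-512, each followed by maxpool
    for filters in (256, 512, 512):
        for _ in range(3):
            model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

    # Classifier: flatten the 7x7x512 feature map and apply three fully connected layers
    model.add(layers.Flatten())
    model.add(layers.Dense(4096, activation="relu"))
    model.add(layers.Dense(4096, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

model = build_vgg16()
model.summary()   # total parameter count comes out to roughly 138 million
```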

VGG-16 Configuration:

The main difference between VGG configurations C and D lies in the filter sizes used in a few of the convolutional layers. While both versions predominantly use 3×3 filters, configuration C replaces some of them with 1×1 filters, whereas configuration D uses 3×3 filters throughout. This slight variation results in a small difference in the number of parameters, with configuration D having slightly more parameters than configuration C. Both versions otherwise follow the same overall architecture and principles of the VGG-16 model.

Different VGG Configuration
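To see where this small parameter gap comes from, here is a back-of-the-envelope comparison of a 1×1 versus a 3×3 convolution on a 512-channel feature map (illustrative layer sizes, not a full enumeration of either configuration):

```python
# Parameters of a conv layer = kernel_h * kernel_w * in_channels * out_channels + out_channels (bias)
in_ch, out_ch = 512, 512

params_1x1 = 1 * 1 * in_ch * out_ch + out_ch   # a 1x1 layer, as used in a few places in configuration C
params_3x3 = 3 * 3 * in_ch * out_ch + out_ch   # a 3x3 layer, as used throughout configuration D

print(params_1x1)                  # 262,656
print(params_3x3)                  # 2,359,808
print(params_3x3 - params_1x1)     # ~2.1 million extra parameters per replaced layer
```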

Object Localization In Image:

To perform localization, we replace the class scores with bounding-box location coordinates. A bounding-box location is represented by a 4-D vector (center coordinates (x, y), height, width). There are two versions of the localization architecture: in one, the bounding box is shared among all classes (the output is a 4-parameter vector); in the other, the bounding box is class-specific (the output is a 4000-parameter vector, 4 values for each of the 1000 classes). The paper experimented with both approaches on the VGG-16 (D) architecture. We also need to change the loss from a classification loss to a regression loss (such as MSE) that penalizes the deviation of the predicted bounding box from the ground truth.
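As a rough sketch of the class-agnostic variant (a single shared 4-value box), the classification head can be replaced by a 4-unit regression output trained with MSE. This assumes the Keras VGG16 backbone; the head sizes mirror the original fully connected layers, but the exact training setup is not taken from the paper:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Convolutional backbone of VGG-16 without the 1000-way classification head
backbone = VGG16(include_top=False, weights=None, input_shape=(224, 224, 3))

# Replace the classifier with a 4-value bounding-box regressor (x, y, height, width)
x = layers.Flatten()(backbone.output)
x = layers.Dense(4096, activation="relu")(x)
x = layers.Dense(4096, activation="relu")(x)
bbox = layers.Dense(4, activation="linear", name="bbox")(x)   # shared (class-agnostic) box

localizer = models.Model(inputs=backbone.input, outputs=bbox)
localizer.compile(optimizer="sgd", loss="mse")   # regression loss in place of cross-entropy
```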

Results: VGG-16 was one of the best-performing architectures in the ILSVRC 2014 challenge. It was the runner-up in the classification task with a top-5 classification error of 7.32% (behind only GoogLeNet, with an error of 6.66%), and it won the localization task with a 25.32% localization error.

Limitations of VGG-16:

  • It is very slow to train (the original VGG model was trained on Nvidia Titan GPUs for 2-3 weeks).
  • The ImageNet-trained VGG-16 weights are about 528 MB, so the model consumes considerable disk space and bandwidth, which makes it inefficient to store and deploy.
  • With roughly 138 million parameters, the network is very large, and its depth makes training susceptible to vanishing and exploding gradients (see the parameter-count sketch below).
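For reference, the parameter count can be checked directly with the Keras implementation of VGG-16 (building the architecture with weights=None avoids downloading the ~528 MB ImageNet weight file):

```python
from tensorflow.keras.applications import VGG16

# weights=None builds the architecture without downloading the pretrained weights
model = VGG16(weights=None)
print(f"{model.count_params():,} parameters")   # roughly 138 million
```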

Further advancements: ResNets introduce skip (residual) connections to mitigate the gradient problems that make very deep networks such as VGG-16 hard to train.



