DL Unit 4
UNIT 4
COURSE B.TECH
SEMESTER 5
Version V-1
1|D L - U N I T - I V
BTECH_CSE-SEM 31
TABLE OF CONTENTS – UNIT 4
1 COURSE OBJECTIVES
2 PREREQUISITES
3 SYLLABUS
4 COURSE OUTCOMES
5 CO - PO/PSO MAPPING
6 LESSON PLAN
7 ACTIVITY BASED LEARNING
8 LECTURE NOTES
4.1 INTRODUCTION TO CONVOLUTIONAL NETWORKS
4.2 CONVOLUTIONAL OPERATION
4.3 POOLING
4.4 CONVOLUTION
4.5 BASIC CONVOLUTION FUNCTIONS
4.6 STRUCTURED OUTPUTS
4.7 DATA TYPES
4.8 EFFICIENT CONVOLUTION ALGORITHMS
4.9 RANDOM OR UNSUPERVISED FEATURES
4.10 BASIS FOR CONVOLUTIONAL NETWORKS
1. Course Objectives
The objectives of this course are:
1. To demonstrate the major technology trends driving Deep Learning.
2. To build, train and apply fully connected neural networks.
3. To implement efficient neural networks.
4. To analyze the key parameters and hyperparameters in a neural network's
architecture.
5. To apply concepts of Deep Learning to solve real-world problems.
2. Prerequisites
This course is intended for senior undergraduate and junior graduate students who
have a proper understanding of
Python Programming Language
Calculus
Linear Algebra
Probability Theory
Although it would be helpful, knowledge about classical machine learning is NOT
required.
3. Syllabus
UNIT 4
Introduction to Convolutional Networks: The convolution operation, Pooling,
Convolution, Basic convolution functions, Structured outputs, Data types, Efficient
convolution algorithms, Random or unsupervised features, Basis for convolutional
networks.
4. Course outcomes
1. Demonstrate the mathematical foundations of neural networks.
2. Describe the machine learning basics.
3. Differentiate the architectures of deep neural networks.
4. Build a convolutional neural network.
5. Build and train RNNs and LSTMs.
CO1 3 2
CO2 3 2
CO3 3 3 2 2 3 2 2
CO4 3 3 2 2 3 2 2
CO5
6. Lesson Plan
4.1. INTRODUCTION TO CONVOLUTIONAL NETWORKS:
A convolutional neural network (CNN) is a neural network that has one or more
convolutional layers. CNNs are used mainly for image processing, classification and
segmentation, and also for other autocorrelated data.
A convolution is essentially sliding a filter over the input. One helpful way to think
about convolutions is this quote from Dr Prasad Samarakoon: “A convolution can be
thought of as looking at a function’s surroundings to make better/accurate predictions
of its outcome.”
Rather than looking at an entire image at once to find certain features it can be more
effective to look at smaller portions of the image.
Common uses for CNNs
The most common use for CNNs is image classification, for example identifying
satellite images that contain roads or classifying handwritten letters and digits. There
are other quite mainstream tasks, such as image segmentation and signal processing,
at which CNNs perform well.
CNNs have also been used for language understanding in Natural Language Processing
(NLP) and for speech recognition, although Recurrent Neural Nets (RNNs) are often
used for NLP.
A CNN can also be implemented as a U-Net architecture, which is essentially two
almost mirrored CNNs, resulting in a CNN whose architecture can be presented in a U
shape. U-Nets are used where the output needs to be of similar size to the input, such
as in segmentation and image improvement.
4.2. CONVOLUTIONAL OPERATION:
The convolution operates on the input with a kernel (weights) to produce
an output map. For two continuous functions f and g it is given by:
$$ s(t) = (f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau $$
The computation can be read as the following steps:
Flip g and add a time offset, i.e. g(τ) → g(t-τ). The offset shifts the flipped g to
the right by t units (by convention, a negative offset shifts it to the left).
Multiply f and g point-wise and accumulate the results to get the output at instant
t. Basically, we are calculating the area of overlap between f and the shifted g.
For our application, we are interested in the discrete domain formulation:
$$ s[t] = (f * g)[t] = \sum_{\tau=-\infty}^{\infty} f[\tau]\, g[t - \tau] $$
Although these equations imply that the domains of both f and g are infinite, in
practice these two functions are non-zero only in a finite region. As a result, the
output is non-zero only in a finite region (where the non-zero regions
of f and g overlap).
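The discrete formulation can be sketched in a few lines of plain Python. This is only a minimal illustration, not an efficient implementation; the function name `conv1d_full` and the sample sequences are assumptions made for the example. Both sequences are treated as zero outside the indices that are stored.

```python
def conv1d_full(f, g):
    """Discrete convolution s[t] = sum over tau of f[tau] * g[t - tau].

    f and g are finite lists, assumed to be zero everywhere else.
    """
    n_out = len(f) + len(g) - 1  # the output is non-zero only where f and g overlap
    s = [0] * n_out
    for t in range(n_out):
        for tau in range(len(f)):
            if 0 <= t - tau < len(g):  # skip terms where g is zero
                s[t] += f[tau] * g[t - tau]
    return s


f = [1, 2, 3]
g = [0, 1, 2]
print(conv1d_full(f, g))  # [0, 1, 4, 7, 6]
```

The result has length len(f) + len(g) - 1, which is the finite region in which the two non-zero regions can overlap.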
The intuition for convolution in 1-D can be extended to n dimensions by nesting the
convolution operations. Vincent Dumoulin and Francesco Visin provide an in-depth
analysis of how input shapes, output shapes and computations are related. Below is their
visualization of a 2-D convolution operation:
Fig: Using the Toeplitz matrix of the kernel for matrix-vector implementation of convolution
To extend this principle to a 2D input, we first need to unroll the 2D input into a 1D
vector. Once this is done, the kernel needs to be modified as before, but this time
it results in a block-circulant matrix. What’s that?
A circulant matrix is a special case of a Toeplitz matrix where each row is a circular
shift of the previous row. It is easy to see that it is a special case of the Toeplitz
matrix, because every diagonal of a circulant matrix is constant.
Now, given a 2D kernel, we can create the block-circulant matrix that will allow a
matrix-vector implementation of convolution as below:
Convince yourself that convolving a 3x3 kernel with a 4x4 input (a 16x1
unrolled vector) results in a 2x2 output (a 4x1 vector) (refer to the visualization
above), and hence the required kernel matrix must be of shape 4x16.
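The 4x16 shape can be checked with a small pure-Python sketch. It builds the kernel matrix row by row and compares the matrix-vector product against a direct sliding-window computation. The helper names `conv_matrix` and `sliding_window`, the sample image and the sample kernel are assumptions for illustration. The kernel is not flipped here (as is common in CNN implementations), and no padding or stride is used.

```python
def conv_matrix(kernel, in_h, in_w):
    """Matrix of shape (out_h*out_w) x (in_h*in_w) for a no-padding, stride-1 kernel."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    rows = []
    for i in range(out_h):
        for j in range(out_w):
            row = [0] * (in_h * in_w)
            for a in range(k_h):
                for b in range(k_w):
                    # position of input pixel (i+a, j+b) in the unrolled vector
                    row[(i + a) * in_w + (j + b)] = kernel[a][b]
            rows.append(row)
    return rows


def sliding_window(image, kernel):
    """Direct computation: slide the kernel over every valid position."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k_h) for b in range(k_w))
             for j in range(out_w)]
            for i in range(out_h)]


image = [[4 * r + c for c in range(4)] for r in range(4)]   # 4x4 input
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]               # 3x3 kernel

M = conv_matrix(kernel, 4, 4)
x = [v for row in image for v in row]                       # 16x1 unrolled input
y = [sum(m * v for m, v in zip(row, x)) for row in M]       # 4x1 output
direct = [v for row in sliding_window(image, kernel) for v in row]

print(len(M), len(M[0]))  # 4 16
print(y == direct)        # True
```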
4.3. POOLING:
Pooling is nothing other than downsampling of an image. The most common pooling
layer filter is of size 2x2 (applied with a stride of 2), which discards three-fourths
of the activations. The role of the pooling layer is to reduce the resolution of the
feature map while retaining the features of the map required for classification, which
gives some translational and rotational invariance. In addition to robustness to
spatial variation, pooling reduces the computation cost by a great deal.
The pooling operation itself has no weights to learn; during training, backpropagation
simply passes gradients through it. The smaller feature maps also help the processor
to process things faster.
There are many pooling techniques. They are as follows:
i) Max pooling, where we take the largest of the pixel values of a segment.
ii) Mean pooling, where we take the mean of the pixel values of a segment.
iii) Avg pooling, which is another name for mean pooling.
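The two basic pooling operations can be sketched in plain Python as follows. The function name `pool2d`, its parameters and the sample feature map are assumptions chosen for the example; a 2x2 window with stride 2 is used, so three of every four activations are discarded.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Pool a 2-D feature map (list of lists) with a size x size window."""
    if mode == "max":
        reduce_fn = max
    else:  # "mean" (average) pooling
        reduce_fn = lambda values: sum(values) / len(values)
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out


fm = [[1, 3, 2, 4],
      [5, 6, 1, 2],
      [7, 2, 9, 1],
      [3, 4, 5, 6]]

print(pool2d(fm, mode="max"))   # [[6, 4], [7, 9]]
print(pool2d(fm, mode="mean"))  # [[3.75, 2.25], [4.0, 5.25]]
```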
Three variants of the pooling operation are motivated by regularization:
Stochastic pooling:
A randomly picked activation within each pooling region is used instead of a
deterministic pooling operation, which regularizes the network. Stochastic pooling
reduces the feature size but gives up the role of selecting features judiciously for
the sake of regularization, although the clipping of negative outputs by the ReLU
activation helps to carry some of the selection responsibility.
Overlapping pooling:
An overlapping pooling operation shares responsibility for local connections beyond the
size of the previous convolutional filter, which breaks the orthogonal responsibility
between the pooling layer and the convolutional layer. So, no information is gained if
pooling windows overlap.
Fractional pooling:
The reduction ratio of the feature size due to pooling can be controlled by the
fractional pooling concept, which helps to increase the depth of the network. Unlike
stochastic pooling, the randomness is related to the choice of pooling regions, not the
way pooling is performed inside each of the pooling regions.
There are other variants of pooling as follows:
- Min pooling
- wavelet pooling
- tree pooling
- max-avg pooling
- spatial pyramid pooling
Pooling makes the network more robust to small translations and to small changes in
shape, size and scale. Max pooling is predominantly used in object recognition.
4.4. CONVOLUTION:
adjacent layers.
CNNs make use of filters (also known as kernels) to detect which features, such as
edges, are present throughout an image.
There are four main operations in a CNN:
Convolution
Non Linearity (ReLU)
Pooling or Sub Sampling
Classification (Fully Connected Layer)
The first layer of a Convolutional Neural Network is always a Convolutional
Layer. Convolutional layers apply a convolution operation to the input, passing the
result to the next layer. A convolution converts all the pixels in its receptive field into
a single value.
For example, if you apply a convolution (without padding) to an image, you decrease
the image size and bring all the information in the receptive field together into a
single pixel. The output of the convolutional layer is a feature map, which becomes a
vector only after flattening. Based on the type of problem we need to solve and on the
kind of features we are looking to learn, we can use different kinds of convolutions.
Dilated or Atrous Convolutions
This operation expands the window size without increasing the number of weights by
inserting zero-values into the convolution kernels. Dilated or atrous convolutions can
be used in real-time applications and in applications where processing power is
limited, as the RAM requirements are less intensive.
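A minimal sketch of how zeros are inserted into a kernel is shown below. The function name `dilate_kernel` and the parameter name `rate` are assumptions for the example, and a square kernel is assumed. With dilation rate r, a k x k kernel covers a window of size k + (k - 1)(r - 1) while keeping the same k x k weights.

```python
def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring weights of a square kernel."""
    k = len(kernel)
    size = k + (k - 1) * (rate - 1)  # effective window size
    out = [[0] * size for _ in range(size)]
    for a in range(k):
        for b in range(k):
            out[a * rate][b * rate] = kernel[a][b]
    return out


kernel = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]

for row in dilate_kernel(kernel, rate=2):
    print(row)
# [1, 0, 2, 0, 3]
# [0, 0, 0, 0, 0]
# [4, 0, 5, 0, 6]
# [0, 0, 0, 0, 0]
# [7, 0, 8, 0, 9]
```

The dilated kernel above covers a 5x5 window but still has only 9 weights; a rate of 1 leaves the kernel unchanged.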
Separable Convolutions
There are two main types of separable convolutions: spatial separable convolutions
and depthwise separable convolutions.
The spatial separable convolution deals primarily with the spatial dimensions of an
image and kernel: the width and the height. Unlike spatial separable
convolutions, depthwise separable convolutions work with kernels that cannot be
“factored” into two smaller kernels. As a result, they are more frequently used.
Transposed Convolutions
These types of convolutions are also known as deconvolutions or fractionally strided
convolutions. A transposed convolutional layer carries out a regular convolution but
reverts its spatial transformation.
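The shape behaviour of a transposed convolution can be illustrated with a 1-D sketch in plain Python. The function names `conv1d_valid` and `conv1d_transposed` and the sample values are assumptions for the example. Note that the transposed operation restores the spatial size of the input, not its values.

```python
def conv1d_valid(x, w, stride=1):
    """Regular strided convolution without padding (kernel not flipped)."""
    n_out = (len(x) - len(w)) // stride + 1
    return [sum(x[i * stride + a] * w[a] for a in range(len(w)))
            for i in range(n_out)]


def conv1d_transposed(y, w, stride=1):
    """Transposed convolution: each input value scatters a scaled copy of w."""
    n_out = (len(y) - 1) * stride + len(w)
    out = [0] * n_out
    for i, value in enumerate(y):
        for a, weight in enumerate(w):
            out[i * stride + a] += value * weight
    return out


x = [1, 2, 3, 4, 5]
w = [1, 0, -1]

y = conv1d_valid(x, w, stride=2)          # length 5 -> length 2
x_up = conv1d_transposed(y, w, stride=2)  # length 2 -> length 5

print(len(x), len(y), len(x_up))  # 5 2 5
```

The two functions correspond to multiplying by a convolution matrix and by its transpose, which is where the name comes from.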
followed by a downsampling stage. This can be used to reduce the representation
size.
Zero padding helps to make the output dimensions and the kernel size independent.
Three common zero padding strategies are:
i) valid: The output is computed only at places where the entire kernel lies inside the
input. Essentially, no zero padding is performed. For a kernel of size k in any dimension,
an input of size m in that dimension becomes m-k+1 in the output. This shrinkage
restricts architecture depth.
ii) same: The input is zero padded such that the spatial size of the input and output
is the same. Essentially, for a dimension where the kernel size is k, the input is padded
by k-1 zeros in that dimension. Since the number of output units connected to border
pixels is less than that for centre pixels, it may under-represent border pixels.
iii) full: The input is padded by enough zeros such that each input pixel is
connected to the same number of output units, which gives an output of size m+k-1.
In terms of test set accuracy, the optimal padding usually lies somewhere between
same and valid.
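The output sizes for the three strategies can be checked with a short 1-D sketch. The names `output_size` and `conv1d_padded` are assumptions for the example, stride 1 is assumed, and for "same" padding the k-1 zeros are split between the two ends of the input.

```python
def output_size(m, k, padding):
    """Output length for an input of length m and a kernel of length k (stride 1)."""
    if padding == "valid":
        return m - k + 1
    if padding == "same":
        return m
    if padding == "full":
        return m + k - 1
    raise ValueError("unknown padding: " + padding)


def conv1d_padded(x, w, padding):
    """Zero pad x according to the strategy, then slide w over it."""
    k = len(w)
    total = {"valid": 0, "same": k - 1, "full": 2 * (k - 1)}[padding]
    left = total // 2
    right = total - left
    padded = [0] * left + list(x) + [0] * right
    return [sum(padded[i + a] * w[a] for a in range(k))
            for i in range(len(padded) - k + 1)]


x = [1, 2, 3, 4, 5, 6]   # m = 6
w = [1, 1, 1]            # k = 3

for mode in ("valid", "same", "full"):
    print(mode, len(conv1d_padded(x, w, mode)), output_size(len(x), len(w), mode))
# valid 4 4
# same 6 6
# full 8 8
```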
Fig: valid (left), same (middle) and full (right) padding. The extreme left one is for
stride=2.
Besides locally-connected layers and tiled convolution, another extension can be to
restrict the kernels to operate on certain input channels. One way to implement this
is to connect the first m input channels to the first n output channels, the next m
input channels to the next n output channels and so on. This method decreases the
number of parameters in the model without decreasing the number of output units.
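The parameter saving from restricting kernels to groups of channels can be estimated with a small helper. The function name `conv_weight_count` and the channel counts are assumptions chosen for the example; square k x k kernels are assumed and bias terms are not counted. With g groups, each group connects m = c_in / g input channels to n = c_out / g output channels.

```python
def conv_weight_count(c_in, c_out, k, groups=1):
    """Number of kernel weights when channels are split into equal groups."""
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("channel counts must be divisible by groups")
    m = c_in // groups    # input channels seen by each group
    n = c_out // groups   # output channels produced by each group
    return groups * m * n * k * k


full = conv_weight_count(64, 128, 3)               # every input feeds every output
grouped = conv_weight_count(64, 128, 3, groups=4)  # 16 inputs -> 32 outputs per group

print(full)     # 73728
print(grouped)  # 18432
```

In this example the number of output channels stays at 128 while the weight count drops by a factor equal to the number of groups.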
Bias terms can be used in different ways in the convolution stage. For locally
connected layer and tiled convolution, we can use a bias per output unit and kernel
respectively. In case of traditional convolution, a single bias term per output channel
is used. If the input size is fixed, a bias per output unit may be used to counter the
effect of regional image statistics and smaller activations at the boundary due to zero
padding.
4.6. STRUCTURED OUTPUTS:
Fig: Recursive refinement of the segmentation map
The output can be further processed under the assumption that contiguous regions
of pixels will tend to belong to the same label. Graphical models can describe this
relationship. Alternatively, CNNs can learn to optimize the graphical model's training
objective.
Another model that has gained popularity for segmentation tasks (especially in the
medical imaging community) is the U-Net. The up-convolution mentioned is just a
direct upsampling by repetition followed by a convolution with same padding.
Fig: U-Net architecture for medical image segmentation
4.7. DATA TYPES:
The data used with a convolutional network usually consist of several channels, each
channel being the observation of a different quantity at some point in space or time.
One advantage to convolutional networks is that they can also process inputs with
varying spatial extents.
When the output is accordingly variable-sized, no extra design change needs to be
made. If, however, the output is fixed-sized, as in the classification task, a pooling
stage with kernel size proportional to the input size needs to be used.
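One way to picture a pooling stage whose window grows with the input is the 1-D sketch below, which always produces the same number of outputs regardless of the input length. The function name `adaptive_max_pool1d` and the way window boundaries are chosen are assumptions made for the example, not a prescribed method.

```python
def adaptive_max_pool1d(x, n_out):
    """Max pool x into exactly n_out values; windows scale with len(x)."""
    n = len(x)
    out = []
    for i in range(n_out):
        start = (i * n) // n_out                    # floor(i * n / n_out)
        end = ((i + 1) * n + n_out - 1) // n_out    # ceil((i + 1) * n / n_out)
        out.append(max(x[start:end]))
    return out


short_input = list(range(7))
long_input = list(range(10))

print(adaptive_max_pool1d(short_input, 4))  # [1, 3, 5, 6]
print(adaptive_max_pool1d(long_input, 4))   # [2, 4, 7, 9]
```

Both inputs yield four values, so a fixed-size classification layer can follow even though the inputs differ in length.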
Fig: Different data types based on the number of spatial dimensions and channels
4.8. EFFICIENT CONVOLUTION ALGORITHMS:
When a d-dimensional kernel can be broken into the outer product of d vectors, the
kernel is said to be separable. The corresponding convolution operations are more
efficient when implemented as d 1-dimensional convolutions rather than a direct d-
dimensional convolution. Note however, it may not always be possible to express a
kernel as an outer product of lower dimensional kernels.
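The equivalence for a separable 2-D kernel can be verified with a small sketch. The helper names `outer` and `conv2d_valid`, the vectors `u` and `v` and the sample image are assumptions for the example; no padding is used and the kernel is not flipped. The 3x3 kernel is the outer product of a column vector `u` and a row vector `v`, so one pass with `v` along the rows followed by one pass with `u` along the columns gives the same result as the direct 2-D pass.

```python
def outer(u, v):
    """Outer product of two vectors: K[a][b] = u[a] * v[b]."""
    return [[a * b for b in v] for a in u]


def conv2d_valid(image, kernel):
    """Slide a kernel (list of lists) over every valid position of the image."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k_h) for b in range(k_w))
             for j in range(out_w)]
            for i in range(out_h)]


u = [1, 2, 1]    # vertical factor
v = [1, 0, -1]   # horizontal factor
image = [[(r * r + 2 * c) % 7 for c in range(5)] for r in range(5)]

direct = conv2d_valid(image, outer(u, v))       # one 3x3 pass
row_pass = conv2d_valid(image, [v])             # 1x3 kernel along rows
separable = conv2d_valid(row_pass, [[a] for a in u])  # 3x1 kernel along columns

print(direct == separable)  # True
```

For a k x k kernel the direct pass uses k*k multiplications per output value, while the two 1-D passes use 2k, which is the source of the efficiency gain.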
4.9. RANDOM OR UNSUPERVISED FEATURES:
To reduce the computational cost of training the CNN, we can use features that are
not learned by supervised training.
Random initialization has been shown to create filters that are frequency selective
and translation invariant. This can be used to inexpensively select the model
architecture. Randomly initialize several CNN architectures and just train the last
classification layer. Once a winner is determined, that model can be fully trained in a
supervised manner.
Hand designed kernels may be used; e.g. to detect edges at different orientations
and intensities.
Unsupervised training of kernels may be performed; e.g. applying k-means clustering
to image patches and using the centroids as convolutional kernels. Unsupervised pre-
training may offer regularization effect (not well established). It may also allow for
training of larger CNNs because of reduced computation cost.
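The k-means idea can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: the helper names `extract_patches`, `kmeans` and `as_kernels` are hypothetical, the image is a tiny synthetic one with a vertical edge, and the centroids are initialized from the first k patches only to keep the example deterministic.

```python
def extract_patches(image, size):
    """All size x size patches of the image, each flattened into a list."""
    h, w = len(image), len(image[0])
    return [[image[i + a][j + b] for a in range(size) for b in range(size)]
            for i in range(h - size + 1) for j in range(w - size + 1)]


def kmeans(points, k, iters=10):
    """Plain k-means with squared Euclidean distance."""
    centroids = [list(p) for p in points[:k]]  # deterministic init (assumption)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        for idx, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[idx] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def as_kernels(centroids, size):
    """Reshape each flat centroid back into a size x size kernel."""
    return [[c[r * size:(r + 1) * size] for r in range(size)] for c in centroids]


# Synthetic 6x6 image: dark left half, bright right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

patches = extract_patches(image, size=2)
kernels = as_kernels(kmeans(patches, k=3), size=2)

print(len(patches))  # 25
print(len(kernels))  # 3
```

Each resulting kernel could then be used as a fixed first-layer filter, with only the later layers trained in a supervised manner.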
Another approach for CNN training is greedy layer-wise pretraining, most notably
used in convolutional deep belief networks. For example, in the case of multi-layer
perceptrons, starting with the first layer, each layer is trained in isolation. Once the
first layer is trained, its output is stored and used as input for training the next layer,
and so on.
4.10. BASIS FOR CONVOLUTIONAL NETWORKS:
In the medial temporal lobe, we find grandmother cells. These cells respond to
specific concepts and are invariant to several transforms of the input. Researchers
have found neurons in the medial temporal lobe that spike on a particular concept, e.g.
the Halle Berry neuron fires when looking at a photo/drawing of Halle Berry or even
reading the text Halle Berry. Of course, there are neurons which spike at other
concepts like Bill Clinton, Jennifer Aniston, etc.
The medial temporal neurons are more generic than CNN in that they respond even
to specific ideas. A closer match to the function of the last layers of a CNN is the IT
(inferotemporal cortex). When viewing an object, information flows from the retina,
through LGN, V1, V2, V4 and reaches IT. This happens within 100ms. When a person
continues to look at an object, the brain sends top-down feedback signals to affect
lower level activation.
Some of the major differences between the human visual system (HVS) and the CNN
model are:
- The human eye is low resolution except in a region called the fovea. Essentially,
the eye does not receive the whole image at high resolution but stitches several
patches together through eye movements called saccades. This attention-based gazing
of the input image is an active research problem. Note: attention mechanisms have
been shown to work on natural language tasks.
- The HVS integrates several senses, while CNNs are only visual.
- The HVS processes rich 3D information, and can also determine relations between
objects. CNNs for such tasks are in their early stages.
- The feedback from higher levels to V1 has not been incorporated into CNNs with
substantial improvement.
- While the CNN can capture firing rates in the IT, the similarity between intermediate
computations is not established. The brain probably uses different activation and
pooling functions. Even the linearity of filter response is doubtful, as recent models
for V1 involve quadratic filters.
- Neuroscience tells us very little about the training procedure. Backpropagation,
which is a standard training mechanism today, is not inspired by neuroscience and is
sometimes considered biologically implausible.
Fig: (Left) Gabor functions with different values of the parameters that control the
coordinate system. (Middle) Weights learned by an unsupervised learning algorithm.
(Right) Convolution kernels learned by the first layer of a fully supervised
convolutional maxout network.
9. Practice Quiz
1. Supervised learning and unsupervised clustering both require at least one
a) hidden attribute
b) output attribute
c) input attribute
d) categorical attribute
a) 1989
b) 1943
c) 1978
d) 1962
c) 4
d) 7
6. Which of the following is a subset of machine learning?
a) SciPy
b) NumPy
c) deep learning
d) none
7. The first layer of a deep learning network is the
a) hidden layer
b) outer layer
c) none
d) inner layer
8. RNN stands for
a) report neural networks
b) recurrent neural networks
c) receives neural networks
d) recording neural networks
9. Which of the following is/are Common uses of RNNs?
A. Help securities traders to generate analytic reports
B. Detect fraudulent credit-card transaction
C. Provide a caption for images
D. All of the above
10. Which of the following is well suited for perceptual tasks?
A. Feed-forward neural networks
B. Recurrent neural networks
C. Convolutional neural networks
D. Reinforcement Learning
11. CNN is mostly used when there is:
A. structured data
B. unstructured data
C. Both A and B
D. None of the above
12. Which neural network has only one hidden layer between the input and output?
A. Shallow neural network
B. Deep neural network
C. Feed-forward neural networks
D. Recurrent neural networks
13. Which of the following is/are Limitations of deep learning?
A. Data labeling
B. Obtain huge training datasets
C. Both A and B
D. None of the above
14. Deep learning algorithms are ____ more accurate than machine
learning algorithms in image classification.
A. 33%
B. 37%
C. 40%
D. 41%