Images, Neural Networks, CNNs

Images and convolutional neural networks
By: DIVAKAR KESHRI

PhD NIT TRICHY
Computer vision
Computer vision = giving computers the ability to understand visual information
Examples:
○ A robot that can move around obstacles by analysing the input of its camera(s)
○ A computer system finding images of cats among millions of images on the Internet
From picture to pixels
● An image has to be digitized for computer processing
● It is turned into millions of “pixel” elements
● Each pixel is a set of numbers quantifying the color of that element
0.49411765 0.49411765 0.4745098 0.49019608 0.4745098
0.49411765 0.49411765 0.5058824 0.49411765 0.49803922
0.49803922 0.49411765 0.4862745 0.47058824 0.49411765
0.5019608 0.49803922 0.49803922 0.49019608 0.50980395
0.50980395 0.5058824 0.52156866 0.50980395 0.5058824
Picture source: https://pixabay.com/en/kitty-cat-kid-cat-domestic-cat-2948404/
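The values in the grid are consistent with 8-bit intensities (0 to 255) divided by 255, for example 126 / 255 ≈ 0.49411765. The slide does not say how they were produced, so treat that as an assumption. A minimal Python sketch of that scaling step, with a hypothetical helper name and a made-up patch of raw values:

```python
def normalize_pixels(rows):
    """Scale 8-bit intensities (0-255) to floats in the range [0, 1]."""
    return [[value / 255 for value in row] for row in rows]


# Hypothetical 2x3 grayscale patch of raw 8-bit values (not taken from the picture).
raw_patch = [
    [126, 126, 121],
    [126, 129, 127],
]

pixels = normalize_pixels(raw_patch)
for row in pixels:
    print(" ".join(f"{value:.8f}" for value in row))
```

A color image would carry one such number per channel (red, green, blue) for every pixel.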
From pixels to … understanding?
0.49411765 0.49411765 0.4745098 0.49019608 0.4745098
0.49411765 0.49411765 0.5058824 0.49411765 0.49803922
0.49803922 0.49411765 0.4862745 0.47058824 0.49411765
0.5019608 0.49803922 0.49803922 0.49019608 0.50980395
0.50980395 0.5058824 0.52156866 0.50980395 0.5058824
⇒ “There’s a cat among some flowers in the grass”
● This is easy for humans
● But for AI it’s actually one of the harder problems!
● How do you transform that grid of numbers into understanding… or even something useful?
Image understanding
• Humans are so good at vision that it’s not even considered intelligence
Convolutional neural networks
Convolutional neural network (CNN, ConvNet)
● Dense or fully-connected: each neuron connected to all neurons in previous layer
● CNN: only connected to a small “local” set of neurons
● Radically reduces number of network connections
[Figure: dense layer vs. convolutional layer]
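To get a feel for the reduction, the connections can simply be counted. The sketch below assumes a hypothetical single-channel 256✕256 input, a 254✕254 grid of output neurons and a 3✕3 local patch; none of these sizes come from the slide.

```python
def dense_connections(n_inputs, n_outputs):
    """Dense layer: every output neuron is connected to every input neuron."""
    return n_inputs * n_outputs


def local_connections(n_outputs, patch_size):
    """Convolutional layer: every output neuron sees one small local patch."""
    return n_outputs * patch_size


n_inputs = 256 * 256   # hypothetical single-channel image
n_outputs = 254 * 254  # hypothetical grid of output neurons

dense = dense_connections(n_inputs, n_outputs)
local = local_connections(n_outputs, 3 * 3)

print(f"dense layer:         {dense:,} connections")
print(f"convolutional layer: {local:,} connections")
print(f"reduction factor:    about {dense / local:.0f}x")
```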
Convolution for image data
[Figure: a 3✕3 image area is combined with 3✕3 weights (conv. kernel) to give one output value]
● Image represented as 2D grid of neuron values
● Each output neuron connected to small 2D area in the image
● Output value = weighted sum of inputs
● Idea: nearby pixels are related ⇒ we can learn local relationships of pixels
Image source: https://mlnotebook.github.io/post/CNN1/
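The weighted sum for a single output neuron can be sketched in a few lines of plain Python. The patch and kernel values below are made up for illustration; in a real network the kernel weights are learned.

```python
def weighted_sum(patch, kernel):
    """Multiply each pixel by the weight at the same position and add everything up."""
    return sum(
        pixel * weight
        for patch_row, kernel_row in zip(patch, kernel)
        for pixel, weight in zip(patch_row, kernel_row)
    )


# Hypothetical 3x3 image area: dark on the left, bright on the right.
patch = [
    [0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0],
    [0.0, 0.5, 1.0],
]

# Hypothetical 3x3 kernel that responds to left-to-right brightness increases.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

output = weighted_sum(patch, kernel)
print(output)
```

A patch with uniform brightness would give 0 with this kernel, which is why such a kernel acts as a simple vertical-edge detector.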
[Figure: image input, 3✕3 weights (conv. kernel), feature map]
● We repeat for each output neuron
● Weights stay the same (shared weights)
● Border effect: without padding output area is smaller
● Outputs form a “feature map”
Image source: https://mlnotebook.github.io/post/CNN1/
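Repeating the weighted sum at every position gives the whole feature map. The sketch below is a plain-Python illustration with a hypothetical function name, stride 1 and no padding; like most deep learning libraries it does not flip the kernel. The 5✕5 input shrinks to a 3✕3 feature map, which is the border effect mentioned above.

```python
def convolve2d_valid(image, kernel):
    """Slide one kernel over a 2D image without padding (stride 1).

    The same kernel weights are reused at every position (shared weights).
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Weighted sum of the patch whose top-left corner is (top, left).
            row.append(sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            ))
        feature_map.append(row)
    return feature_map


# Hypothetical 5x5 image with a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 1, 1] for _ in range(5)]
kernel = [[-1, 0, 1] for _ in range(3)]

feature_map = convolve2d_valid(image, kernel)
for row in feature_map:
    print(row)
```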
A real example
Image from: http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/fergus_dl_tutorial_final.pptx

Side note: color images
● Example: 256 ✕ 256 color image with 3 color channels (red, green, and blue) ⇒ single image is a 3D tensor: 256 ✕ 256 ✕ 3
● Example: 5 ✕ 5 convolution is actually also a 3D tensor: 5 ✕ 5 ✕ 3
● Slides over width and height, but covers the full color depth
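For a color image, one output value is therefore a weighted sum over 5 ✕ 5 ✕ 3 = 75 inputs. A minimal sketch with made-up patch and kernel values (the helper name is hypothetical):

```python
def weighted_sum_3d(patch, kernel):
    """Weighted sum over height, width and all color channels of one patch."""
    return sum(
        patch[i][j][c] * kernel[i][j][c]
        for i in range(len(kernel))
        for j in range(len(kernel[0]))
        for c in range(len(kernel[0][0]))
    )


SIZE, CHANNELS = 5, 3

# Hypothetical 5x5 patch that is pure red everywhere: (R, G, B) = (1, 0, 0).
patch = [[[1.0, 0.0, 0.0] for _ in range(SIZE)] for _ in range(SIZE)]

# Hypothetical kernel that averages the red channel and ignores green and blue.
kernel = [[[1 / 25, 0.0, 0.0] for _ in range(SIZE)] for _ in range(SIZE)]

n_weights = SIZE * SIZE * CHANNELS
output = weighted_sum_3d(patch, kernel)
print(n_weights, round(output, 6))
```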
Convolution for image data
[Figure: image 256✕256✕3 → K kernels, each 5✕5(✕3) → K feature maps, each 252✕252✕1]
● We can repeat for different sets of weights (kernels)
● Each learns a different “feature”
● Typically: edges, corners, etc.
● Each outputs a feature map
[Figure: image 256✕256✕3 → K kernels, each 5✕5(✕3) → output tensor 252✕252✕K]
● We stack the feature maps into a single tensor
● Depth of output tensor = number of kernels K
● Tensor is the output of the entire convolutional layer
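The 252 in the figure follows from the border effect: without padding and with stride 1, each spatial dimension shrinks by the kernel size minus one (256 - 5 + 1 = 252). A small sketch of that arithmetic, where the function name and the choice K = 16 are only for illustration:

```python
def conv_output_shape(image_shape, kernel_size, n_kernels):
    """Output shape of an unpadded, stride-1 convolutional layer.

    image_shape is (width, height, channels). Each kernel spans all input
    channels, so the channel count does not appear in the output shape.
    """
    width, height, _channels = image_shape
    kernel_w, kernel_h = kernel_size
    return (width - kernel_w + 1, height - kernel_h + 1, n_kernels)


K = 16  # hypothetical number of kernels
output_shape = conv_output_shape((256, 256, 3), (5, 5), K)
print(output_shape)  # (252, 252, 16)
```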
Convolution in layers: intuition
● We can then add another convolutional layer
● This operates on the previous layer’s output tensor (feature maps)
● Features layered from simple to more complex
[Figure: stacked convolutional layers leading to the label “cat”]
[Figure: learned low-level features → learned mid-level features → learned high-level features → learned classifier → “cat”]
Image from lecture by Yann Le Cun, original from Zeiler & Fergus (2013)
Image datasets
• Color image mini-batches are 4D tensors: width ✕ height ✕ color channels ✕ samples (tf.keras puts the sample axis first by default: samples ✕ height ✕ width ✕ color channels)
• Plenty of big datasets for training exist, e.g., ImageNet with 1.2 million images in 1000 classes
• Data augmentation for small datasets: generate more training data by transforming existing data
• E.g., shifting, rotation, cropping, scaling, adding noise, etc.
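A few of these transformations can be sketched on a tiny single-channel image stored as nested lists. The helper names are hypothetical; real pipelines typically use library routines and apply the transformations with random parameters.

```python
def shift_right(image, pixels, fill=0):
    """Shift every row right by `pixels` (0 <= pixels <= width), padding with `fill`."""
    width = len(image[0])
    return [[fill] * pixels + row[:width - pixels] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def crop(image, top, left, height, width):
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

# One original image yields several additional training samples.
augmented = [
    shift_right(image, 1),
    rotate_90_clockwise(image),
    crop(image, 0, 1, 2, 2),
]
for sample in augmented:
    print(sample)
```

A cropped sample is smaller than the original, so in practice it is usually scaled back to the input size the network expects.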
Convolutional layers
• Input: tensor of size N × Wi × Hi × Ci
• Hyperparameters:
  • K: number of filters
  • w, h: kernel size
  • padding: how to handle image borders
  • activation function
• Output: tensor of size N × Wo × Ho × K
• In tf.keras:
  layers.Conv2D(filters, kernel_size, padding=padding, activation=activation)
  (there are also Conv1D and Conv3D)
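How the hyperparameters determine the output size and the number of trainable weights can be sketched as follows. The helpers are hypothetical, assume stride 1, and follow the N × W × H × C order used above; "valid" (no padding) and "same" (pad so the spatial size is kept) are the two padding modes that Conv2D accepts.

```python
def conv2d_output_shape(input_shape, filters, kernel_size, padding="valid"):
    """Output shape (N, Wo, Ho, K) of a stride-1 convolutional layer."""
    n, w_in, h_in, _c_in = input_shape
    kernel_w, kernel_h = kernel_size
    if padding == "same":
        # Borders are zero-padded so that the spatial size is preserved.
        return (n, w_in, h_in, filters)
    if padding == "valid":
        # No padding: the output shrinks by (kernel size - 1) per dimension.
        return (n, w_in - kernel_w + 1, h_in - kernel_h + 1, filters)
    raise ValueError(f"unknown padding mode: {padding!r}")


def conv2d_param_count(input_channels, filters, kernel_size):
    """Trainable weights: one w x h x Ci kernel plus one bias per filter."""
    kernel_w, kernel_h = kernel_size
    return (kernel_w * kernel_h * input_channels + 1) * filters


batch = (64, 256, 256, 3)  # hypothetical mini-batch of 64 color images
print(conv2d_output_shape(batch, filters=32, kernel_size=(5, 5), padding="valid"))
print(conv2d_output_shape(batch, filters=32, kernel_size=(5, 5), padding="same"))
print(conv2d_param_count(input_channels=3, filters=32, kernel_size=(5, 5)))
```

Because the weights are shared across positions, the parameter count does not depend on the image size.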
Pooling layers
• Used to reduce the spatial resolution
  • independently on each channel
  • reduces complexity and number of parameters
• MAX operator most common
  • sometimes also AVERAGE
• In tf.keras:
  layers.MaxPooling2D(pool_size)
  layers.AveragePooling2D(pool_size)
Image from http://cs231n.github.io/convolutional-networks/
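A plain-Python sketch of pooling on a single channel, assuming non-overlapping windows (stride equal to the pool size). The function name and the example values are only for illustration.

```python
def pool_2d(feature_map, pool_size=2, reduce=max):
    """Apply `reduce` to each non-overlapping pool_size x pool_size window."""
    out_h = len(feature_map) // pool_size
    out_w = len(feature_map[0]) // pool_size
    pooled = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            window = [
                feature_map[r * pool_size + i][c * pool_size + j]
                for i in range(pool_size)
                for j in range(pool_size)
            ]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

max_pooled = pool_2d(feature_map, 2, max)
avg_pooled = pool_2d(feature_map, 2, average)
print(max_pooled)  # [[6, 8], [3, 4]]
print(avg_pooled)  # [[3.25, 5.25], [2.0, 2.0]]
```

A 4✕4 map becomes 2✕2, so each pooling step with size 2 keeps a quarter of the values.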
Other layers
• Flatten
  • flattens the input into a vector (typically before dense layers)
• Dropout
  • works the same way as with dense layers
• In tf.keras:
  layers.Flatten()
  layers.Dropout(rate)
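Both operations are simple enough to sketch in plain Python. The helpers below are hypothetical; the dropout variant shown rescales the surviving values by 1 / (1 - rate), which is what Keras does during training, and dropout is switched off at inference time.

```python
import random


def flatten(tensor):
    """Recursively flatten nested lists into one flat vector."""
    if not isinstance(tensor, list):
        return [tensor]
    return [value for item in tensor for value in flatten(item)]


def dropout(values, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors."""
    keep = 1.0 - rate
    return [value / keep if rng.random() >= rate else 0.0 for value in values]


# Hypothetical 2x2x2 stack of feature maps.
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]

vector = flatten(feature_maps)
dropped = dropout(vector, rate=0.5, rng=random.Random(0))
print(vector)
print(dropped)
```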
Typical architecture
1. Input layer = image pixels
2. Convolution
3. ReLU
4. Pooling
   (steps 2 to 4 are repeated one or more times)
5. One or more fully connected layers (+ReLU)
6. Final fully connected layer to get to the number of classes we want
7. Softmax to get probability distribution over classes
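The steps above can be sketched as a tf.keras model. This is only an illustration: the input size (32✕32✕3), the filter counts, the dense layer width and the number of classes (10) are assumptions, not values from the slides.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_cnn(input_shape=(32, 32, 3), n_classes=10):
    """Small CNN following the typical conv -> ReLU -> pooling pattern."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),                              # 1. image pixels
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),   # 2-3. convolution + ReLU
        layers.MaxPooling2D((2, 2)),                                    # 4. pooling
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),   # steps 2-4 repeated
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),                                               # feature maps -> vector
        layers.Dense(128, activation="relu"),                           # 5. fully connected + ReLU
        layers.Dense(n_classes, activation="softmax"),                  # 6-7. class probabilities
    ])


model = build_cnn()
model.summary()
```

The final layer combines steps 6 and 7 by using softmax as the activation of the last dense layer, so each output row sums to 1.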
CNN architectures and applications
● AlexNet
● VGG
● Inception / GoogLeNet
● ResNet
● DenseNet
Large-scale CNNs with pre-trained weights
[Figure: pre-trained network with extracted features; replace output layer; retrain]
• For many applications, an existing CNN can be re-used instead of training a new model from scratch: extract features from a suitable layer, or retrain the top layers with new data
• Keras contains several models trained with ImageNet:
  • Xception, VGG16, VGG19, ResNet50, InceptionV3, InceptionResNetV2, MobileNet, DenseNet, NASNet
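One way to re-use a pre-trained network in tf.keras is sketched below, under several assumptions: MobileNet is picked from the list above only as an example, the function name and the 224✕224✕3 input size are illustrative choices, and the number of classes depends on the new task. With weights="imagenet" Keras downloads the pre-trained weights on first use; weights=None builds the same architecture with random weights.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_transfer_model(n_classes, weights="imagenet"):
    """Pre-trained MobileNet as a frozen feature extractor plus a new output layer."""
    base = tf.keras.applications.MobileNet(
        input_shape=(224, 224, 3),
        include_top=False,   # drop the original ImageNet output layer
        weights=weights,
        pooling="avg",       # one feature vector per image
    )
    base.trainable = False   # keep the pre-trained weights fixed
    return tf.keras.Sequential([
        tf.keras.Input(shape=(224, 224, 3)),
        base,                                            # extracted features
        layers.Dense(n_classes, activation="softmax"),   # replaced output layer
    ])
```

Calling build_transfer_model(5), for a hypothetical 5-class task, and training the result updates only the new output layer. Setting base.trainable = True for some of the top layers afterwards corresponds to retraining the top layers with new data.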
Computer vision applications
Image credit: Li Fei-Fei et al
Image credit: Noh et al, Learning Deconvolution Network for Semantic Segmentation,
Some selected applications
• Object detection: https://pjreddie.com/darknet/yolo/
• Semantic segmentation: https://www.youtube.com/watch?v=qWl9idsCuLQ
• Human pose estimation: https://www.youtube.com/watch?v=pW6nZXeWlGM
• Video recognition: https://valossa.com/
• Digital pathology: https://www.aiforia.com/