Convolutional Neural Networks: CMSC 733 Fall 2015 Angjoo Kanazawa

Convolutional Neural
Networks
CMSC 733 Fall 2015
Angjoo Kanazawa
Overview
Goal: Understand what Convolutional Neural
Networks (ConvNets) are and the intuition behind them.
1. Brief Motivation for Deep Learning
2. What are ConvNets?
3. ConvNets for Object Detection
First of all, what is Deep Learning?
● Composition of non-linear transformations of
the data.
● Goal: Learn useful representations, aka
features, directly from data.
● Many varieties, can be unsupervised or supervised.
● Today is about ConvNets, which are a supervised deep
learning method.
Recap: Supervised Learning
Slide: M. Ranzato
Supervised Learning: Examples
Slide: M. Ranzato
Supervised Deep Learning
So deep learning is about learning
feature representations in a
compositional manner.
But wait,
why learn features?
The Black Box in a
Traditional Recognition Approach
Preprocessing → Feature Extraction (HOG, SIFT, etc.) → Post-processing (feature selection, MKL, etc.) → Classifier (SVM, boosting, etc.)
The Black Box in a
Traditional Recognition Approach
Hand-engineered:
Preprocessing → Feature Extraction (HOG, SIFT, etc.) → Post-processing (feature selection, MKL, etc.) → Classifier (SVM, boosting, etc.)
Preprocessing → Feature Extraction (HOG, SIFT, etc.) → Post-processing (feature selection, MKL, etc.):
● Most critical for accuracy

● Most time-consuming in development
● What is the best feature???
● What is next?? Keep on crafting better features?
⇒ Let’s learn feature representations directly
from data.
Learn features and classifier
together
⇒ Learn an end-to-end recognition system.
A non-linear map that takes raw pixels directly
to labels.
Slide: M. Ranzato
Building a complicated function
Each box is a simple nonlinear function

Slide: M. Ranzato
● Composition is at the core of deep learning methods

● Each “simple function” will have parameters subject to
learning
Slide: M. Ranzato
Intuition behind Deep Neural Nets
Slide: M. Ranzato
Slide: M. Ranzato
Layer 1 Layer 2 Layer 3 Layer 4
The final layer outputs a probability distribution over categories.

Slide: M. Ranzato
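One common way for a final layer to produce a probability distribution over categories is a softmax over raw class scores. The slides do not name the function, so treat this as an assumption; `softmax` and the example scores are illustrative only.

```python
import numpy as np

def softmax(scores):
    """Map raw class scores to a probability distribution (assumed final-layer function)."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical scores for three categories.
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())
```

The outputs are non-negative and sum to 1, and the largest score gets the largest probability.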
A simple single layer Neural Network
Consists of a linear combination of the input passed
through a nonlinear function:
h = f(Wx)
W is the weight parameter to be learned.
x is the output of the previous layer.
f is a simple nonlinear function. A popular choice is max(x, 0),
called ReLU (Rectified Linear Unit).
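A minimal sketch of one such layer, assuming it computes h = f(Wx) with f = ReLU and no bias term (the bias-free form and the example values of `W` and `x` are assumptions for illustration):

```python
import numpy as np

def relu(z):
    """ReLU nonlinearity: max(z, 0) applied elementwise."""
    return np.maximum(z, 0)

def layer(W, x):
    """One layer: linear combination of the input followed by the nonlinearity."""
    return relu(W @ x)

# Illustrative weights and input; in practice W is learned.
W = np.array([[1.0, -1.0],
              [-2.0, 0.5]])
x = np.array([3.0, 1.0])

h = layer(W, x)  # W @ x = [2.0, -5.5]; ReLU zeroes the negative entry
print(h)         # [2. 0.]
```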
1 layer: Graphical Representation
h is called a neuron, hidden unit or feature.
Joint training architecture overview
Neural Net Training
Slide: M. Ranzato
When the input data is an image...
Slide: M. Ranzato
When the input data is an image...
Reduce connections to local regions.
Reuse the same kernel everywhere,
because interesting features (edges) can
occur anywhere in the image.
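A rough parameter count shows why local connections and kernel reuse matter. All sizes below are hypothetical, chosen only to make the arithmetic concrete.

```python
# Hypothetical sizes for illustration only.
image_h, image_w = 200, 200   # single-channel input image
num_hidden = 1000             # number of hidden units
k = 10                        # side of the local region each unit looks at

# Fully connected: every hidden unit has a weight for every pixel.
fully_connected = image_h * image_w * num_hidden

# Locally connected: every hidden unit sees one k x k patch, with its own weights.
locally_connected = k * k * num_hidden

# Shared kernel: one k x k kernel is reused at every location.
shared_kernel = k * k

print(fully_connected, locally_connected, shared_kernel)  # 40000000 100000 100
```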
Convolutional Neural Nets
Detail
If the input has 3 channels (R, G, B),
3 separate k by k filters are applied,
one to each channel.
The output of convolving with one filter is
called a feature map.
This is just a sliding window; e.g., the output of
one part filter of DPM is a feature map.
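The sliding window can be sketched as below. Two assumptions are not stated on the slide: the per-channel responses are summed into a single feature map (the usual convention), and the kernel is not flipped, so this is strictly cross-correlation, which is what ConvNet libraries typically compute. `feature_map` is an illustrative name.

```python
import numpy as np

def feature_map(image, kernel):
    """Slide a (k, k, C) kernel over an (H, W, C) image, 'valid' positions only.

    Each output value is the sum, over all channels, of the elementwise
    product between the kernel and the patch under it.
    """
    H, W, C = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k, :]
            out[i, j] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))    # toy RGB image
kernel = rng.random((3, 3, 3))   # one 3x3 filter per channel
fmap = feature_map(image, kernel)
print(fmap.shape)  # (6, 6)
```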
Using multiple filters
Each filter detects features in
the output of the previous layer.
So to capture different features,
learn multiple filters.
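A small sketch of several filters producing one feature map each. The two filters here are hand-made edge detectors for illustration; in a ConvNet they would be learned. `conv_layer` is a hypothetical helper name.

```python
import numpy as np

def conv_layer(image, filters):
    """Apply F filters of shape (k, k, C) to an (H, W, C) image.

    Returns an array of shape (H-k+1, W-k+1, F): one feature map per filter.
    """
    H, W, C = image.shape
    F, k = filters.shape[0], filters.shape[1]
    out = np.zeros((H - k + 1, W - k + 1, F))
    for f in range(F):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, f] = np.sum(image[i:i + k, j:j + k, :] * filters[f])
    return out

# Hand-made 3x3 single-channel filters (illustrative, not learned).
vertical = np.array([[1, 0, -1]] * 3, dtype=float)[..., None]  # responds to vertical edges
horizontal = vertical.transpose(1, 0, 2)                       # responds to horizontal edges
filters = np.stack([vertical, horizontal])                     # shape (2, 3, 3, 1)

image = np.zeros((6, 6, 1))
image[:, 3:, 0] = 1.0  # left half dark, right half bright: a vertical edge

maps = conv_layer(image, filters)
print(maps.shape)                 # (4, 4, 2)
print(np.abs(maps[..., 0]).max()) # vertical-edge filter fires on the edge
print(np.abs(maps[..., 1]).max()) # horizontal-edge filter stays at zero
```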
Example of filtering
Slide: R. Fergus
Building Translation Invariance
Building Translation Invariance via
Spatial Pooling
Pooling also subsamples the image,
allowing the next layer to look at larger
spatial regions.
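A minimal sketch of spatial pooling, assuming non-overlapping 2x2 max pooling (the slides do not fix the pooling type or size). It shows both effects: the map is subsampled, and a small shift of the response leaves the output unchanged as long as it stays inside the same pooling block.

```python
import numpy as np

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size blocks of a 2-D feature map."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size  # drop rows/cols that do not fill a block
    blocks = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

a = np.zeros((4, 4))
a[0, 0] = 1.0  # a single strong response
b = np.zeros((4, 4))
b[1, 1] = 1.0  # the same response shifted by one pixel

print(max_pool(a))                                # 4x4 map reduced to 2x2
print(np.array_equal(max_pool(a), max_pool(b)))   # True: same 2x2 block
```

The invariance is only local: a shift that crosses a block boundary does change the pooled output.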
Summary of a typical convolutional
layer
All of this together constitutes one
layer.
○ Pooling and normalization are
optional.
Stack them up and train just like multi-layer
neural nets.
The final layer is usually a fully connected
neural net with output size == number of
classes.
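Putting the pieces together, a forward pass through two stacked layers and a fully connected output could look like the sketch below. All sizes are hypothetical and the weights are random (untrained), so only the shapes are meaningful; normalization is omitted.

```python
import numpy as np

def conv(x, filters):
    """'Valid' sliding-window filtering: x is (H, W, C), filters is (F, k, k, C)."""
    H, W, _ = x.shape
    F, k = filters.shape[:2]
    out = np.zeros((H - k + 1, W - k + 1, F))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k, :]
            # Sum over the k, k and C axes for every filter at once.
            out[i, j, :] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Non-overlapping max pooling applied to each feature map of an (H, W, F) array."""
    H, W, F = x.shape
    H, W = H - H % size, W - W % size
    blocks = x[:H, :W, :].reshape(H // size, size, W // size, size, F)
    return blocks.max(axis=(1, 3))

def conv_layer(x, filters):
    """One layer: filtering, nonlinearity, then pooling."""
    return max_pool(relu(conv(x, filters)))

rng = np.random.default_rng(0)
num_classes = 10                                   # hypothetical
image = rng.random((32, 32, 3))                    # toy RGB input
filters1 = rng.standard_normal((8, 5, 5, 3)) * 0.1
filters2 = rng.standard_normal((16, 5, 5, 8)) * 0.1

h1 = conv_layer(image, filters1)  # 32 -> 28 after filtering -> 14 after pooling
h2 = conv_layer(h1, filters2)     # 14 -> 10 after filtering -> 5 after pooling

# Fully connected final layer: one output per class.
W_fc = rng.standard_normal((num_classes, h2.size)) * 0.1
scores = W_fc @ h2.ravel()
print(h1.shape, h2.shape, scores.shape)  # (14, 14, 8) (5, 5, 16) (10,)
```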
Compare this to SIFT
Revisiting the composition idea
Every layer learns a feature detector by
combining the output of the layer before.
⇒ More and more abstract features are learned
as we stack layers.
Keep this in mind and let’s look at what kind of
things ConvNets learn.
Architecture of Alex Krizhevsky et al.
● 8 layers total.
● Trained on the ImageNet dataset
(1000 categories, 1.2M
training images, 150k test
images)
● 18.2% top-5 error
○ Winner of the ILSVRC-2012
challenge.
Slide: R. Fergus
Architecture of Alex Krizhevsky et al.
First layer filters
Showing 81 filters of size
11x11x3.
They capture low-level
features like oriented
edges and blobs.
Note these oriented edges are
analogous to what SIFT uses to
compute the gradients.
Top 9 patches that activate each filter
in layer 1
Each 3x3 block shows
the top 9 patches for
one filter.
Second Layer
Note how the previous
low-level features are
combined to detect
slightly more abstract
features like textures.
ConvNets as a generic feature extractor
● A well-trained ConvNet is an excellent feature
extractor.
● Chop the network at the desired layer and use the output as
a feature representation to train an SVM on some other
vision dataset.
● Improve further by taking a pre-trained ConvNet and re-training it on a
different dataset. This is called fine-tuning.
One way to do detection with
ConvNets
Since ConvNets extract discriminative features,
one can crop images at the object bounding
box and train a good SVM on each category.
⇒ Extract regions that are likely to have
objects, then apply ConvNet + SVM on each
and use the confidence to do non-maximum
suppression.
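The suppression step can be sketched as greedy non-maximum suppression: keep the most confident box, drop boxes that overlap it too much, and repeat. The `[x1, y1, x2, y2]` box format, the 0.5 overlap threshold, and the example boxes are assumptions for illustration, not values from the slides.

```python
import numpy as np

def iou(box, boxes):
    """Intersection-over-union between one box and an array of boxes ([x1, y1, x2, y2])."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the boxes to keep."""
    order = np.argsort(scores)[::-1]  # most confident first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # discard boxes overlapping the kept one
    return keep

# Hypothetical detections: two near-duplicates and one separate object.
boxes = np.array([[10, 10, 50, 50],
                  [12, 12, 52, 52],
                  [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])  # e.g. per-region SVM confidences
print(nms(boxes, scores))  # [0, 2]
```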
R-CNN: Regions with CNN features
Best performing method on PASCAL 2012
detection, improving on previous methods by 30%
Slide: R. Girshick
ConvNet Libraries
• Cuda-convnet (Alex Krizhevsky, Google)
• Caffe (Y. Jia, Berkeley, now Google)
● Pre-trained Krizhevsky model available.
• Torch7 (Idiap, NYU, NEC)
• Many more are available.
