AlexNet
Tugce & Kyunghee
Goal
DataSet
Architecture of the Network
Reducing overfitting
Learning
Results
Discussion
Goal
Classification
ImageNet
http://image-net.org/
ILSVRC
EntleBucher
Appenzeller
Convolutional Neural Networks
Rectified Linear Units, overlapping pooling, dropout trick
Randomly extracted 224x224 patches for more data
http://image-net.org/challenges/LSVRC/2012/supervision.pdf
Architecture
5 Convolutional Layers
1000-way softmax
Layer 1 (Convolutional)
Images: 227x227x3
F (receptive field size): 11
S (stride) = 4
Conv layer output: 55x55x96
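The 55x55 spatial size follows from the standard convolution arithmetic (W - F)/S + 1; a minimal check in Python (variable names are my own):

W, F, S = 227, 11, 4            # input width, receptive field size, stride (no zero-padding)
out = (W - F) // S + 1          # (227 - 11) / 4 + 1 = 55
print(f"{out} x {out} x 96")    # 96 kernels in layer 1 -> 55 x 55 x 96 output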
Non-saturating nonlinearity (ReLU)
f(x) = max(0, x)
Trains several times faster than equivalent saturating units (e.g. tanh).
Architecture
ReLU Nonlinearity
Top-1 and top-5 error rates decrease by 1.7% and 1.2% respectively, compared to a net trained on one GPU with half as many neurons.
Architecture
Overlapping Pooling
Pooling units spaced s pixels apart each summarize a z x z neighborhood; with s < z the pooling windows overlap.
Top-1 and top-5 error rates decrease by 0.4% and 0.3% respectively, compared to the non-overlapping scheme s = 2, z = 2.
Architecture
Local Response Normalization
ReLUs do not require input normalization, but the following local normalization scheme still helps generalization.
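A minimal numpy sketch of that scheme, assuming a (channels, height, width) activation layout (the function name is my own; k = 2, n = 5, alpha = 1e-4, beta = 0.75 are the values used in the paper):

import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # b[i] = a[i] / (k + alpha * sum of a[j]**2 over the n neighboring channels) ** beta
    C = a.shape[0]
    b = np.empty(a.shape, dtype=float)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        b[i] = a[i] / (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
    return b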
Architecture Overview
Outline
Introduction
DataSet
Architecture of the Network
Reducing overfitting
Learning
Results
Discussion
Reducing Overfitting
Data Augmentation!
60 million parameters, 650,000 neurons: the network overfits a lot.
Crop random 224x224 patches from the 256x256 images (and their horizontal reflections).
Reducing Overfitting
Data Augmentation!
At test time, average the predictions over the 10 patches.
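A minimal sketch of this ten-patch scheme (four corner crops, the center crop, and their horizontal reflections); function names, including predict, are hypothetical placeholders:

import numpy as np

def ten_crops(img, size=224):
    # img: (H, W, 3) array, e.g. a 256x256x3 test image.
    H, W, _ = img.shape
    offsets = [(0, 0), (0, W - size), (H - size, 0), (H - size, W - size),
               ((H - size) // 2, (W - size) // 2)]
    crops = [img[y:y + size, x:x + size] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]      # horizontal reflections
    return np.stack(crops)                    # shape (10, 224, 224, 3)

# At test time, e.g.: probs = np.mean([predict(c) for c in ten_crops(image)], axis=0)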
Reducing Overfitting
Softmax
Softmax loss (cross-entropy over the 1000 classes, with L2 regularization):

L = \frac{1}{N} \sum_i -\log\left( \frac{e^{f_{y_i}}}{\sum_{j=1}^{1000} e^{f_j}} \right) + \lambda \sum_k \sum_l W_{k,l}^2

The softmax output is the likelihood P(y_i | x_i; W), so there is no need to calibrate the scores when averaging the predictions over the 10 patches.

cf. the SVM (hinge) loss:

L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \max\left(0,\ f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta\right) + \lambda \sum_k \sum_l W_{k,l}^2

Slide credit: Stanford CS231N, Lecture 3.
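A minimal numpy sketch of the softmax loss above (names are my own; scores is assumed to hold one row of class scores f per example):

import numpy as np

def softmax_loss(scores, y, W, reg=1e-3):
    # scores: (N, C) class scores, y: (N,) correct class indices, W: weight matrix.
    shifted = scores - scores.max(axis=1, keepdims=True)            # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    data_loss = -log_probs[np.arange(len(y)), y].mean()             # -log P(y_i | x_i; W)
    return data_loss + reg * np.sum(W * W)                          # + L2 regularization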
Reducing Overfitting
Data Augmentation!
Change the intensities of the RGB channels.
For each pixel I_{xy} = [I_{xy}^R, I_{xy}^G, I_{xy}^B]^T, add multiples of the principal components of the RGB covariance matrix, with coefficients alpha_i ~ N(0, 0.1).
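A minimal sketch of this augmentation, assuming the eigenvalues and eigenvectors of the training set's RGB covariance matrix have been precomputed (names are my own):

import numpy as np

def pca_color_jitter(img, eigvals, eigvecs, sigma=0.1):
    # img: (H, W, 3) RGB image; eigvals: (3,); eigvecs: (3, 3), columns are principal components.
    alphas = np.random.normal(0.0, sigma, size=3)   # alpha_i ~ N(0, 0.1)
    shift = eigvecs @ (alphas * eigvals)            # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return img + shift                              # the same 3-vector is added to every pixel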
Reducing Overfitting
Dropout
Set the output of each hidden neuron to zero with probability 0.5.
Applied in the two 4096-unit fully-connected layers.
Figure credit: Srivastava et al.
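A minimal sketch of this scheme, zeroing activations with probability 0.5 during training and halving the outputs at test time as in the paper (function names are my own):

import numpy as np

def dropout_train(x, p=0.5):
    # Zero each activation independently with probability p.
    return x * (np.random.rand(*x.shape) >= p)

def dropout_test(x, p=0.5):
    # Use all neurons, scaled by (1 - p) to match the training-time expectation.
    return x * (1.0 - p)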
Stochastic Gradient Descent Learning
Momentum update:

v_{i+1} = 0.9\, v_i - 0.0005\, \epsilon\, w_i - \epsilon \left\langle \left. \frac{\partial L}{\partial w} \right|_{w_i} \right\rangle_{D_i}

w_{i+1} = w_i + v_{i+1}

where 0.9 is the momentum (damping parameter), \epsilon is the learning rate (initialized at 0.01), 0.0005 \epsilon w_i is the weight decay term, and the last term is the gradient of the loss w.r.t. the weights, averaged over the batch D_i.
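A minimal Python sketch of one such update step (names are my own; the defaults are the hyperparameter values above):

def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    # v <- 0.9 * v - 0.0005 * lr * w - lr * grad   (grad is averaged over the batch)
    v = momentum * v - weight_decay * lr * w - lr * grad
    # w <- w + v
    return w + v, v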
The output of the last 4096-unit fully-connected layer can be used as a 4096-dimensional feature vector.
Discussion
Depth is really important: removing a single convolutional layer degrades performance.
K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. Technical report, 2014.
16-layer and 19-layer models; 7.3% top-5 test error on ILSVRC-2012.
Discussion
We still have many orders of magnitude to go in order to match the infero-temporal (IT) pathway of the human visual system.
Convolutional Neural Networks? vs. Convolutional Networks?