AlexNet
Tugce & Kyunghee
Goal
DataSet
Architecture of the Network
Reducing overfitting
Learning
Results
Discussion
Goal
Classification
ImageNet
http://image-net.org/
ILSVRC
EntleBucher
Appenzeller
Convolutional Neural Networks
Rectified Linear Units, overlapping pooling, dropout trick
Randomly extracted 224x224 patches for more data
http://image-net.org/challenges/LSVRC/2012/supervision.pdf
Architecture
5 Convolutional Layers
1000-way softmax
Layer 1 (Convolutional)
Images: 227x227x3
F (receptive field size): 11
S (stride) = 4
Conv layer output: 55x55x96
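The 55x55 spatial size follows from the standard convolution arithmetic (W - F)/S + 1; a minimal check in Python (variable names are my own):

W, F, S = 227, 11, 4            # input width, receptive field size, stride (no zero-padding)
out = (W - F) // S + 1          # (227 - 11) / 4 + 1 = 55
print(f"{out} x {out} x 96")    # 96 kernels in layer 1 -> 55 x 55 x 96 output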
Non-saturating nonlinearity (ReLU)
f(x) = max(0, x)
Trains several times faster than equivalent saturating units (e.g. tanh).
Architecture
ReLU Nonlinearity
Top-1 and top-5 error rates decrease by 1.7% and 1.2% respectively, compared to a net trained on one GPU with half as many neurons.
Architecture
Overlapping Pooling
Pooling units spaced s pixels apart each summarize a z x z neighborhood; with s < z the pooling windows overlap.
Top-1 and top-5 error rates decrease by 0.4% and 0.3% respectively, compared to the non-overlapping scheme s = 2, z = 2.
Architecture
Local Response Normalization
ReLUs do not require input normalization, but the following local normalization scheme still helps generalization.
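A minimal numpy sketch of that scheme, assuming a (channels, height, width) activation layout (the function name is my own; k = 2, n = 5, alpha = 1e-4, beta = 0.75 are the values used in the paper):

import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # b[i] = a[i] / (k + alpha * sum of a[j]**2 over the n neighboring channels) ** beta
    C = a.shape[0]
    b = np.empty(a.shape, dtype=float)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        b[i] = a[i] / (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
    return b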
Architecture Overview
Outline
Introduction
DataSet
Architecture of the Network
Reducing overfitting
Learning
Results
Discussion
Reducing Overfitting
Data Augmentation!
60 million parameters, 650,000 neurons: the network overfits a lot.
Crop random 224x224 patches from the 256x256 images (and their horizontal reflections).
Reducing Overfitting
Data Augmentation!
At test time, average the predictions over the 10 patches.
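A minimal sketch of this ten-patch scheme (four corner crops, the center crop, and their horizontal reflections); function names, including predict, are hypothetical placeholders:

import numpy as np

def ten_crops(img, size=224):
    # img: (H, W, 3) array, e.g. a 256x256x3 test image.
    H, W, _ = img.shape
    offsets = [(0, 0), (0, W - size), (H - size, 0), (H - size, W - size),
               ((H - size) // 2, (W - size) // 2)]
    crops = [img[y:y + size, x:x + size] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]      # horizontal reflections
    return np.stack(crops)                    # shape (10, 224, 224, 3)

# At test time, e.g.: probs = np.mean([predict(c) for c in ten_crops(image)], axis=0)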
Reducing Overfitting
Softmax
Softmax loss (cross-entropy over the 1000 classes, with L2 regularization):

L = \frac{1}{N} \sum_i -\log\left( \frac{e^{f_{y_i}}}{\sum_{j=1}^{1000} e^{f_j}} \right) + \lambda \sum_k \sum_l W_{k,l}^2

The softmax output is the likelihood P(y_i | x_i; W), so there is no need to calibrate the scores when averaging the predictions over the 10 patches.

cf. the SVM (hinge) loss:

L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \max\left(0,\ f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta\right) + \lambda \sum_k \sum_l W_{k,l}^2

Slide credit: Stanford CS231N, Lecture 3.
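A minimal numpy sketch of the softmax loss above (names are my own; scores is assumed to hold one row of class scores f per example):

import numpy as np

def softmax_loss(scores, y, W, reg=1e-3):
    # scores: (N, C) class scores, y: (N,) correct class indices, W: weight matrix.
    shifted = scores - scores.max(axis=1, keepdims=True)            # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    data_loss = -log_probs[np.arange(len(y)), y].mean()             # -log P(y_i | x_i; W)
    return data_loss + reg * np.sum(W * W)                          # + L2 regularization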
Reducing Overfitting
Data Augmentation!
Change the intensities of the RGB channels.
For each pixel I_{xy} = [I_{xy}^R, I_{xy}^G, I_{xy}^B]^T, add multiples of the principal components of the RGB covariance matrix, with coefficients alpha_i ~ N(0, 0.1).
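A minimal sketch of this augmentation, assuming the eigenvalues and eigenvectors of the training set's RGB covariance matrix have been precomputed (names are my own):

import numpy as np

def pca_color_jitter(img, eigvals, eigvecs, sigma=0.1):
    # img: (H, W, 3) RGB image; eigvals: (3,); eigvecs: (3, 3), columns are principal components.
    alphas = np.random.normal(0.0, sigma, size=3)   # alpha_i ~ N(0, 0.1)
    shift = eigvecs @ (alphas * eigvals)            # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return img + shift                              # the same 3-vector is added to every pixel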
Reducing Overfitting
Dropout
Set the output of each hidden neuron to zero with probability 0.5.
Applied in the two 4096-unit fully-connected layers.
Figure credit: Srivastava et al.
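A minimal sketch of this scheme, zeroing activations with probability 0.5 during training and halving the outputs at test time as in the paper (function names are my own):

import numpy as np

def dropout_train(x, p=0.5):
    # Zero each activation independently with probability p.
    return x * (np.random.rand(*x.shape) >= p)

def dropout_test(x, p=0.5):
    # Use all neurons, scaled by (1 - p) to match the training-time expectation.
    return x * (1.0 - p)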
Stochastic Gradient Descent Learning
Momentum update:

v_{i+1} = 0.9\, v_i - 0.0005\, \epsilon\, w_i - \epsilon \left\langle \left. \frac{\partial L}{\partial w} \right|_{w_i} \right\rangle_{D_i}

w_{i+1} = w_i + v_{i+1}

where 0.9 is the momentum (damping parameter), \epsilon is the learning rate (initialized at 0.01), 0.0005 \epsilon w_i is the weight decay term, and the last term is the gradient of the loss w.r.t. the weights, averaged over the batch D_i.
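A minimal Python sketch of one such update step (names are my own; the defaults are the hyperparameter values above):

def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    # v <- 0.9 * v - 0.0005 * lr * w - lr * grad   (grad is averaged over the batch)
    v = momentum * v - weight_decay * lr * w - lr * grad
    # w <- w + v
    return w + v, v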
The output of the last 4096-unit fully-connected layer can be used as a 4096-dimensional feature vector.
Discussion
Depth is really important: removing a single convolutional layer degrades performance.
K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. Technical report, 2014.
16-layer and 19-layer models; 7.3% top-5 test error on ILSVRC-2012.
Discussion
We still have many orders of magnitude to go in order to match the infero-temporal (IT) pathway of the human visual system.
Convolutional Neural Networks? vs. Convolutional Networks?