Convolutional Neural Network For Image Classification
Abstract— This paper describes a learning approach based on training convolutional neural networks (CNN) for a traffic sign classification system. In addition, it presents preliminary classification results of applying this CNN to learn features and classify RGB-D images. To determine the appropriate architecture, we explore the transfer learning technique called "fine tuning", which reuses layers trained on the ImageNet dataset in order to provide a solution for a four-class classification task on a new set of data.

Keywords— Convolutional neural network, Deep Learning, Transfer Learning, ImageNet.

I. INTRODUCTION
Image representation for classification tasks has often relied on feature extraction methods which have proven effective for different visual recognition tasks, [1]. The local binary patterns method is used for extracting texture features, and histograms of oriented gradients are applied for image processing. Usually these types of methods have been used to transform images and describe them for many tasks, [2]. Most of the applied features need to be identified by an expert and then manually coded as per the data type and domain. This process is difficult and expensive in terms of expertise and time.

As a solution, deep learning reduces the task of developing new feature extractors, [3], by automating the phase of extracting and learning features. The proposed traffic sign classification system is able to recognize the traffic sign images placed on the road and classify them by exploiting this technology.

There exist many different architectures of deep learning. The model presented in this paper is a classifier system developed using the convolutional neural network category, [4], which is the most efficient and useful type of deep neural network for this kind of data, [5]. CNNs applied to learn image representations on large-scale datasets for recognition tasks can therefore be exploited by transferring these learned representations to other tasks with a limited amount of training data.

To address this problem, we propose using the convolutional neural network AlexNet, trained on the large-scale ImageNet dataset, [6], [7], transferring its learned image representations and reusing them for a classification task with limited training data. The main idea is to design a method which reuses a part of the trained layers of AlexNet.

In the following, the problem statement is presented in section II. Section III introduces the method and the CNN architecture exploited. Initial experimental results using the appropriate CNN architecture, which demonstrate that the developed deep neural network achieves a satisfactory success rate, are described in the first part of section IV. In the second part, the effect of the MiniBatchSize parameter is discussed, [8].
II. PROBLEM STATEMENT

Usually, it is not easy for a driver to keep his eyes everywhere at once while driving. Concentrating on the road, checking it, watching oncoming traffic and what is behind him, all while trying to control his speed, can become difficult and tiring. To avoid road accidents, traffic signs need to be rigid, unique and clear for the driver.

With a traffic sign classification system, the risk arising from a potential hazard ahead can be vastly reduced. Moreover, automatically classifying traffic signs addresses a mandatory problem of self-driving cars.

The present paper aims to build a classifier system that can determine the type of the traffic sign displayed in an image, and that is robust to different real-life conditions such as poor lighting or obstructions, by designing an image processing algorithm. As an initial work, four types of traffic sign are used: non-stop signs, stop signs, green lights and red lights.

Digit image and face classification tasks, [9], exhibit limited variation in appearance and pose. These two domains are therefore close to our task, and the applied methods can be efficiently reused for the traffic sign classification task. Applications and research on image classification, transfer learning, and deep learning are the references to which our method relates; they are discussed below.

Recent methods for image classification tasks use the bag-of-features pipeline. SIFT descriptors, [10], are used for clustering. Spatial pooling, [11], histogram encoding, [12], and the more recent Fisher Vector encoding, [13], are used for feature collection. Although these representations have given acceptable results, it is not obvious whether they are optimal for the task, since they require a lot of time and effort from experts in the specific domain. This process is difficult and expensive in terms of expertise and time.

Deep learning, or deep neural networks, reduces the task of developing a new feature extractor for every visual recognition problem. This optimization is realized by automating the phase of learning the image's representation and by using graphics processing units (GPUs), [14], suited to the application's problem.
III. CONVOLUTIONAL NEURAL NETWORK

A. Architecture

Convolutional Networks (ConvNets) are currently the most efficient deep models for classifying image data. Their multi-stage architectures are inspired by biology. Through these models, invariant features are learned hierarchically and automatically, [15]. They first identify low-level features and then learn to recognize and combine these features to learn more complicated patterns.

These different levels of features come from different layers of the network. Each layer has a specific number of neurons and is presented in 3 dimensions: height, width and depth, [16].

To understand the convolutional neural network structure, [17], we can observe it as two distinct parts. At the input, images are presented as a matrix of pixels. This matrix has 2 dimensions for a grayscale image. Color is represented by a third dimension, of depth 3, for the fundamental colors (Red, Green, Blue), [18].

The first part of a CNN is the convolutive part. It functions as a feature extractor of images. In this part, an image is passed through a succession of filters, or convolution kernels, creating new images called convolution maps. Some intermediate filters reduce the resolution of the image by a local maximum operation. Three types of layers are involved:

• CONV layer: accepts a volume of size [W1×H1×D1], where W1 is the width, H1 the height and D1 the depth. The outputs of the neurons in this type of layer are calculated as the dot product between their weights and the local region they are connected to in the input volume. The obtained output volume [W2×H2×D2], called convolution maps, where W2 is the width, H2 the height and D2 the depth if we decide to use D2 filters or convolution kernels, [19], has size given by equations (1), (2), (3):

W2 = (W1 - F + 2P) / S + 1        (1)
H2 = (H1 - F + 2P) / S + 1        (2)
D2 = K                            (3)

with:
F : spatial extent of the filter;
K : number of filters;
P : zero padding (hyperparameter controlling the output volume);
S : stride (hyperparameter with which we slide the filter).

• RELU layer: applies an activation function such as max(0, x) to produce elementwise non-linearity. This operation does not affect or change the size of the volume, [20].

• POOL layer: inserted between successive CONV layers, it applies a downsampling operation along the spatial dimensions width and height. It uses the MAX operation to optimize the spatial size of the representation as well as reducing the number of parameters, [21]. A POOL layer produces a volume [W2×H2×D2] where W2, H2, D2 are given by equations (4), (5) and (6); equations (1)-(6) are illustrated by the sketch after this list:

W2 = (W1 - F) / S + 1             (4)
H2 = (H1 - F) / S + 1             (5)
D2 = D1                           (6)
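As a quick check of these formulas, the following minimal Python helper (ours, not from the paper) computes the output volume of CONV and POOL layers and reproduces the AlexNet first-layer arithmetic detailed in section IV.B:

    def conv_output_size(w1, h1, f, k, p, s):
        # Equations (1)-(3): W2 = (W1 - F + 2P)/S + 1, H2 likewise, D2 = K.
        return (w1 - f + 2 * p) // s + 1, (h1 - f + 2 * p) // s + 1, k

    def pool_output_size(w1, h1, d1, f, s):
        # Equations (4)-(6): W2 = (W1 - F)/S + 1, H2 likewise, D2 = D1.
        return (w1 - f) // s + 1, (h1 - f) // s + 1, d1

    # AlexNet's first CONV layer (section IV.B): a [227x227x3] input,
    # K=96 kernels of spatial extent F=11, padding P=0, stride S=4.
    print(conv_output_size(227, 227, f=11, k=96, p=0, s=4))  # (55, 55, 96)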
In the end, a feature extraction vector, or CNN code, concatenates the output information into a single vector.

This code is then connected to the input of a second part, consisting of fully connected layers (a multilayer perceptron), [22]. The role of this part is to combine the characteristics of the CNN code in order to classify the image. It determines the class scores, presented in an output volume of size [1×1×k]. The architecture of this part is a usual multilayer perceptron, and each of the k output neurons, connected to all the numbers of the previous layer, corresponds to a category of the classification.
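For illustration, such a second part could be written as follows; a minimal sketch with layer sizes borrowed from AlexNet's fully connected part, not a definitive implementation:

    import torch.nn as nn

    # MLP "second part": maps a CNN code of 9216 values to k class
    # scores, i.e. an output volume of size [1x1xk].
    k = 4  # one output neuron per category
    mlp_head = nn.Sequential(
        nn.Linear(9216, 4096), nn.ReLU(),
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, k),
    )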
B. CNN training

Creating a CNN is expensive in terms of expertise, equipment and the amount of needed data. The first step is to fix the architecture, by choosing the number of layers, their sizes and the matrix operations that connect them, [23]. The training then consists of optimizing the network's coefficients to minimize the output classification error, as in the sketch below.
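For illustration, a single such optimization step could look as follows in PyTorch; a minimal sketch, independent of the paper's own toolchain:

    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.alexnet()  # an untrained AlexNet as a stand-in network
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    def train_step(images, labels):
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # output classification error
        loss.backward()                          # backpropagate its gradient
        optimizer.step()                         # update the coefficients
        return loss.item()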
This training can take several weeks for the best CNNs, with many GPUs working on hundreds of thousands of annotated images. Research teams specialized in improving CNNs publish their technical innovations, so the complexity of creating a CNN can be avoided by adapting publicly available pre-trained networks. These techniques are called transfer learning, [24]; they consist of transferring knowledge from a related source to the target domain. A pre-trained neural network can be used in two ways:
• Automatic feature extractor of images: exploits only the convolutive part of a pre-trained network, using it as an automatic feature extractor of images to feed a classifier of our choice (see the sketch after this list). Only the convolutive part is kept. This part is called frozen, to express the absence of training. The network takes an image in the proper format and outputs the CNN code. Each image in the dataset is thus transformed into a feature vector, which is used to train a new classifier, [25]. This method has many practical interests:

  - The image is transformed into a small vector which extracts features that are usually very relevant. This reduces the size of the problem, [26].
  - Feature extraction is performed only once per image, so it can be performed quickly on a CPU. Machine learning libraries are usually sequential and also run on CPUs.
  - This method makes it possible to exploit the power of CNNs without investing in GPUs, [27].
• Fine tuning: an initialization of the target model, which is then retrained more finely to deal with the new classification problem, [28]. Here we use an architecture carefully optimized by specialists, and we take advantage of feature extraction capabilities learned on a large, high-quality dataset. Fine tuning on images consists of taking a visual system already well trained on a classification task and refining it on a similar task. The only necessary change to the network is the adaptation of the last layer, [29]. For training, it is possible to freeze the initial layers of the neural network and to adapt only the final layers to the new classification problem. Freezing all convolutional layers corresponds to the first method presented, with a pre-initialized multilayer perceptron as the final classifier.
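The first of these two uses, the frozen feature extractor, might look as follows; a minimal sketch assuming torchvision's pre-trained AlexNet (which expects 224x224 inputs), not the paper's own implementation:

    import torch
    from torchvision import models

    # Load a pretrained AlexNet (torchvision >= 0.13 weights API) and
    # freeze it: the "frozen" convolutive part receives no further training.
    alexnet = models.alexnet(weights="IMAGENET1K_V1")
    alexnet.eval()
    for param in alexnet.parameters():
        param.requires_grad = False

    def cnn_code(image_batch):
        # Return the CNN code for a batch of [3x224x224] images,
        # bypassing the 1000-way ImageNet classifier.
        with torch.no_grad():
            x = alexnet.features(image_batch)  # convolutive part
            x = alexnet.avgpool(x)
            return torch.flatten(x, 1)         # one 9216-value vector per image

    codes = cnn_code(torch.randn(4, 3, 224, 224))  # feed any classic classifier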
To learn features for our traffic sign classification task, we apply ConvNets combined with the fine tuning technique. We use the pre-trained convolutional neural network AlexNet, trained on 1000 possible categories over the large ImageNet dataset of the Large Scale Visual Recognition Challenge (ILSVRC-2012), [30], containing over 1.2 million images. Its results were remarkable, achieving a top-5 error of 15.3%.

IV. EXPERIMENTS

In this section, we first detail the architecture of the proposed model. Next, we present experimental results of our method, based on the transfer learning technique, on the traffic sign classification dataset. Finally, we discuss the effect of one of the hyperparameters of a deep neural network on our model.
A. Dataset

The traffic sign dataset contains more than 360 images in total, divided into different classes. To avoid using the testing data, we leave 180 images from the training set for validation; the 180 test images are spread among the four classes "stop sign", "non stop sign", "green light" and "red light". Both training and testing data are distributed over these categories. A hypothetical loading sketch is shown below.
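For concreteness, such a dataset could be loaded and split as follows; the folder layout and file organization here are hypothetical, not the paper's:

    import torch
    from torchvision import datasets, transforms

    # Hypothetical layout, one sub-folder per class:
    #   traffic_signs/stop sign/..., traffic_signs/non stop sign/...,
    #   traffic_signs/green light/..., traffic_signs/red light/...
    preprocess = transforms.Compose([
        transforms.Resize((227, 227)),  # AlexNet input size used in the paper
        transforms.ToTensor(),
    ])
    full_set = datasets.ImageFolder("traffic_signs", transform=preprocess)

    # Hold out 180 images for validation, as described above.
    n_val = 180
    train_set, val_set = torch.utils.data.random_split(
        full_set, [len(full_set) - n_val, n_val])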
B. CNN developed architecture

The network adapted in our method is the AlexNet deep neural network. AlexNet was among the first well-performing convolutional neural networks in the computer vision community. This CNN showed successful results when trained on the difficult ImageNet dataset. The training of the model was run on two GTX 580 GPUs for five to six days, [31], using the batch stochastic gradient descent algorithm.

The network is made up of 5 internal convolutional layers, C1, C2, C3, C4, C5, pooling layers, dropout layers, and 3 fully connected layers, FC6, FC7, FC8. It was used for classification with 1000 possible categories, [32].

The input of the architecture takes images of size [227×227×3] with a zero padding P=0. In the first convolutional layer, AlexNet applies 96 convolution kernels of size F=11, sliding them over the input volume with a stride S=4. The output volume has size [55×55×96], where height and width are W = H = (227-11)/4+1 = 55 and depth K=96. The total number of neurons in this layer is 55×55×96 = 290400.

Each of these 290400 neurons is connected to a local region of [11×11×3] in the input, and each of the 96 neurons is connected, with different weight values, to the same region of size [11×11×3] in the input volume. The rest of the successive layers and filters applied are presented in Fig. 1.
Fig. 1. AlexNet architecture, [33].
Since our classification task has only four categories, it is therefore necessary to change the last layer of AlexNet, replacing it by a layer of four neurons.
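In torchvision's AlexNet, for instance, this amounts to swapping the final fully connected module; a minimal PyTorch sketch, again independent of the paper's own toolchain:

    import torch.nn as nn
    from torchvision import models

    model = models.alexnet(weights="IMAGENET1K_V1")

    # Optionally freeze the initial (convolutional) layers, as in the
    # fine-tuning method of section III.B.
    for param in model.features.parameters():
        param.requires_grad = False

    # Replace the 1000-way final layer FC8 by a layer of four neurons,
    # one per class: stop sign, non-stop sign, green light, red light.
    model.classifier[6] = nn.Linear(4096, 4)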
References
[1] J. Redmon and A. Angelova, "Real-time grasp detection using convolutional neural networks," IEEE International Conference on Robotics and Automation, pp. 1316-1322, 2015.
[2] H. Chang, C. Zhong, J. Han, and J.-H. Mao, "Unsupervised transfer learning via multi-scale convolutional sparse coding for biomedical application," IEEE Transactions on Pattern Analysis and Machine Intelligence, 23 January 2017.
[3] X. Zhou, K. Yu, T. Zhang, and T. Huang, "Image classification using super-vector coding of local image descriptors," ECCV, 2010.
[4] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek, "Evaluating color descriptors for object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1582-1596, 2010.
[5] A. Howard, "Some improvements on deep convolutional neural network based image classification," ICLR, 2014.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," CVPR, 2009.
[7] T. Ahonen, A. Hadid, and M. Pietikäinen, "Face description with local binary patterns: Application to face recognition," Pattern Analysis and Machine Intelligence, pp. 2037-2041, 2006.
[8] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, pp. 359-366, 1989.
[9] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Contr. Signals Syst., pp. 303-314, 1989.
[10] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," arXiv:1312.6229, 2013.
[11] H. B. Burke, "Artificial neural networks for cancer research: Outcome prediction," Sem. Surg. Oncol., vol. 10, pp. 73-79, 1994.
[12] H. B. Burke, P. H. Goodman, D. B. Rosen, D. E. Henson, J. N. Weinstein, F. E. Harrell, J. R. Marks, D. P. Winchester, and D. G. Bostwick, "Artificial neural networks improve the accuracy of cancer survival prediction," Cancer, vol. 79, pp. 857-862, 1997.
[13] J. Lampinen, S. Smolander, and M. Korhonen, "Wood surface inspection system based on generic visual features," Industrial Applications of Neural Networks, F. F. Soulié and P. Gallinari, Eds., Singapore: World Scientific, pp. 35-42, 1998.
[14] T. Petsche, A. Marcantonio, C. Darken, S. J. Hanson, G. M. Huhn, and I. Santoso, "An autoassociator for on-line motor monitoring," Industrial Applications of Neural Networks, F. F. Soulié and P. Gallinari, Eds., Singapore: World Scientific, pp. 91-97, 1998.
[15] A. Sifaoui, A. Abdelkrim, and M. Benrejeb, "On RBF neural network classifier design for iris plants," The 37th International Conference on Computers and Industrial Engineering, pp. 113-118, Alexandria, October 2007.
[16] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, October 2010.
[17] M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman, "Blocks that shout: Distinctive parts for scene classification," CVPR, 2013.
[18] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CVPR, 2014.
[19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, 1(4):541-551, 1989.
[20] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce, "Learning mid-level features for recognition," CVPR, 2010.
[21] M. Parizeau, "Le perceptron multicouche et son algorithme de rétropropagation de l'erreur," Département de génie électrique et de génie informatique, Université Laval, 10 September 2014.
[22] A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. Xing, "Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks," ECCV, 2008.
[23] D. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, 60(2):91-110, 2004.
[24] Y. LeCun, L. Bottou, and J. HuangFu, "Learning methods for generic object recognition with invariance to pose and lighting," CVPR, 2004.
[25] F. Perronnin, J. Sánchez, and T. Mensink, "Improving the Fisher kernel for large-scale image classification," ECCV, 2010.
[26] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," arXiv:1312.6229, 2013.
[27] J. Schmidhuber, "Multi-column deep neural networks for image classification," CVPR, 2012.
[28] M. A. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, "Efficient learning of sparse representations with an energy-based model," Advances in Neural Information Processing Systems (NIPS), 2006.
[29] Y. LeCun, F.-J. Huang, and L. Bottou, "Learning methods for generic object recognition with invariance to pose and lighting," Computer Vision and Pattern Recognition, 2004.
[30] S. Behnke, "Hierarchical Neural Networks for Image Interpretation," Lecture Notes in Computer Science, Springer, 2003.
[31] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," Neural Information Processing Systems, 2007.