Improvement of Learning for CNN with ReLU Activation by Sparse Regularization
Abstract—This paper introduces sparse regularization for the convolutional neural network (CNN) with rectified linear units (ReLU) in the hidden layers. By introducing sparseness for the inputs of the ReLU, there is an effect that pushes the inputs of the ReLU toward zero in the learning process. Thus it is expected that the unnecessary increase of the outputs of the ReLU can be prevented. This is similar to the effect of Batch Normalization. Also, the unnecessary negative values of the inputs of the ReLU can be reduced by introducing the sparseness. This can improve the generalization of the trained network. The relations between the proposed approach and Batch Normalization or modifications of the activation function such as the Exponential Linear Unit (ELU) are also discussed. The effectiveness of the proposed method was confirmed through detailed experiments.

I. INTRODUCTION

Recently, deep neural networks have attracted many researchers in computer vision, speech recognition, natural language processing, etc. In particular, the convolutional neural network (CNN) has achieved great success in image recognition [1]. After the deep CNN by Krizhevsky et al. won ILSVRC 2012 with a higher score than the conventional methods, it became very popular as a fundamental technique for image classification. To improve the recognition accuracy further, deeper and more complex network architectures have been proposed [2]–[6].

Several methods have been proposed to improve or speed up the training of deep neural networks. LeCun et al. [9] and Wiesler et al. [10] pointed out that the learning of a neural network converges faster if its inputs are whitened. Whitening the inputs of each layer makes the changes of the inputs of each layer uniform and can remove the bad effects of the internal covariate shift. However, the computation needed to whiten the inputs of each layer is expensive, because the whitening requires calculating the covariance matrix of the full training samples and solving the eigenvalue problem of that covariance matrix.

To simplify the computation of the whitening, Ioffe et al. [11] proposed the method called Batch Normalization, which uses only a dimension-wise normalization of the training samples in the mini-batch. By this method, the mean of the inputs of each layer is standardized to zero and the variance to one. Batch Normalization is now considered the standard method to improve the learning of deep neural networks.

Another approach to improve the learning of deep neural networks is to modify the activation function of the hidden neurons. Historically, the sigmoid function has often been used as the activation function of the neurons in artificial neural networks. However, the standard activation functions such as the sigmoid function or the hyperbolic tangent function are contractive almost everywhere, and their gradients at large values become almost zero. This makes the updates by stochastic gradient descent very small. This problem is known as the vanishing gradient problem.

To improve the restricted Boltzmann machine, Nair et al. [12] introduced the rectified linear unit (ReLU). Glorot et al. [13] showed that the ReLU activation function in the hidden layers could improve the learning speed of various deep neural networks. The gradient of the ReLU activation function at positive values is constant and does not vanish anymore. This means that the vanishing gradient problem can be avoided by using the ReLU activation function. This is the reason why the ReLU activation function can improve the learning speed of deep neural networks. Now the ReLU is used as the standard activation function for deep neural networks.

Further improvements of the activation function have been proposed in the literature. Recently, Clevert et al. [14] proposed to use an activation function with negative outputs, named the Exponential Linear Unit (ELU). The ELU activation function has the same shape as the ReLU for positive inputs, but the outputs of the ELU activation function for negative inputs become negative values, while the ReLU always outputs zero for negative inputs. These negative outputs can push the mean of the outputs of the activation function toward zero and can reduce the undesired bias.

From the early work on the cat's visual cortex by Hubel and Wiesel [15], it is well known that the neurons in the visual cortex are sensitive to small regions of the visual field (called receptive fields). The small regions are tiled to cover the entire visual field and each neuron acts as a local filter over the input visual stimulus. The deep CNN inherits this structure of the visual processing in the brain, but usually the parameters (weights) of the receptive fields are trained by supervised learning.

Olshausen and Field showed the importance of sparse coding for the self-organization of the receptive fields of simple cells in the primary visual cortex (V1) [16], [17]. They demonstrated that a learning algorithm that attempts to find sparse linear codes for natural scenes develops a family of localized, oriented, bandpass receptive fields similar to those found in the primary visual cortex.
on the stochastic steepest descent method. The general form of the update rule is given as

$$ \mathbf{w} \leftarrow \mathbf{w} - \mu \frac{\partial L}{\partial \mathbf{w}}, \qquad (4) $$

$$ b \leftarrow b - \mu \frac{\partial L}{\partial b}, \qquad (5) $$

where L is the objective function and μ is the learning rate. Usually these parameters are updated by using the mini-batch approach, in which subsets of the training samples are used to calculate the partial derivatives.
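As a minimal illustration of the update rules of Eqs. (4) and (5) combined with the mini-batch approach, the sketch below runs mini-batch gradient descent on a toy squared-error objective; the toy loss and all variable names are our own assumptions, not taken from the paper.

```python
import numpy as np

def grad_L(w, b, X, t):
    """Gradients of a toy squared-error loss L = 0.5 * sum((X @ w + b - t)^2)."""
    err = X @ w + b - t
    return X.T @ err, err.sum()

def sgd_epoch(w, b, X, t, lr=0.01, batch_size=32):
    """One pass over the data, applying the updates of Eqs. (4) and (5) per mini-batch."""
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        gw, gb = grad_L(w, b, X[batch], t[batch])  # derivatives on the subset only
        w = w - lr * gw    # Eq. (4)
        b = b - lr * gb    # Eq. (5)
    return w, b

# Toy usage: recover w_true = [1, -2, 0.5] and b_true = 1 from noisy samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 1.0 + 0.01 * rng.normal(size=256)
w, b = np.zeros(3), 0.0
for _ in range(50):
    w, b = sgd_epoch(w, b, X, t)
print(w, b)   # approaches [1.0, -2.0, 0.5] and 1.0
```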
B. Batch Normalization

Ioffe et al. [11] proposed Batch Normalization. Batch Normalization enables us to use much higher learning rates and to be less careful about initialization. It is also pointed out that Batch Normalization has a regularization effect and makes Dropout unnecessary in some cases.

The distribution of the inputs of each layer changes during the learning process, because the outputs of the previous layer are influenced by the changes of the parameters in the previous layer. This is called internal covariate shift. It is known that the learning speed is slowed down by the internal covariate shift, because we have to set a lower learning rate when such a shift becomes large. Also, careful parameter initialization is necessary. This makes it notoriously hard to train models with saturating nonlinearities.

To improve the learning we have to reduce the internal covariate shift. We can improve the learning speed by keeping the distribution of the inputs of each layer the same during the learning process. LeCun et al. [9] and Wiesler et al. [10] showed that the learning of a neural network converges faster if its inputs are whitened. Whitening the inputs of each layer makes the changes of the inputs of each layer uniform and can remove the bad effects of the internal covariate shift. However, whitening the inputs of each layer is expensive, because it requires calculating the covariance matrix of the inputs and solving the eigenvalue problem of that covariance matrix.

Instead of the whitening using the covariance matrix of the full training samples, Batch Normalization performs a dimension-wise normalization of the training samples in the mini-batch. Let the inputs of a neuron in a layer for the mini-batch samples be {x_i | i = 1, ..., m}, where m is the number of samples in the mini-batch. The mean of the inputs of the neuron for the mini-batch samples is defined as

$$ \mu_x = \frac{1}{m} \sum_{i=1}^{m} x_i. \qquad (6) $$

Similarly, the variance of the inputs of the neuron for the mini-batch samples is defined as

$$ \sigma_x^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_x)^2. \qquad (7) $$

Then we can normalize the inputs by

$$ \hat{x}_i = \frac{x_i - \mu_x}{\sqrt{\sigma_x^2 + \epsilon}}, \qquad (8) $$

where ε is a small constant added for numerical stability. This process is known as standardization in statistics, and the mean and the variance of the normalized values x̂_i become 0 and 1, respectively.

From these standardized inputs {x̂_i | i = 1, ..., m} we can further transform to obtain the values {y_i | i = 1, ..., m} with a specified mean β and variance γ² as

$$ y_i = \gamma \hat{x}_i + \beta. \qquad (9) $$

In Batch Normalization, the parameters γ and β are also learned from the training samples in the mini-batch.

Then the mean of the output of the filter h_k for the training samples in the mini-batch after Batch Normalization can be obtained as

$$ \mu_{h_k} = \frac{1}{m} \sum_{i=1}^{m} \left\{ \mathbf{w}_k^{T} \mathbf{y}_i + b_k \right\} = \mathbf{w}_k^{T} \frac{1}{m} \sum_{i=1}^{m} \mathbf{y}_i + b_k = \beta \mathbf{w}_k^{T} \mathbf{1} + b_k, \qquad (10) $$

where 1 = (1 1 · · · 1)^T. This equation shows that the mean of the output of the filter for the training samples in the mini-batch is not influenced by the training samples. This probably contributes to improving the learning.
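As a concrete illustration of the per-dimension normalization of Eqs. (6)–(9), the following minimal NumPy sketch applies the Batch Normalization transform to one mini-batch; the function name, the tensor shapes and the value of ε are illustrative choices, not taken from the paper.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Apply the mini-batch transform of Eqs. (6)-(9) dimension-wise.

    x     : (m, d) mini-batch of layer inputs (m samples, d dimensions)
    gamma : (d,) learned scale (target standard deviation)
    beta  : (d,) learned shift (target mean)
    """
    mu = x.mean(axis=0)                    # Eq. (6): per-dimension mean
    var = x.var(axis=0)                    # Eq. (7): per-dimension variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # Eq. (8): standardization
    return gamma * x_hat + beta            # Eq. (9): scale and shift

# Example: after the transform the mini-batch mean is beta and the
# standard deviation is (approximately) gamma in every dimension.
x = np.random.randn(64, 8) * 3.0 + 5.0
y = batch_norm(x, gamma=np.full(8, 2.0), beta=np.full(8, 0.5))
print(y.mean(axis=0))  # close to 0.5
print(y.std(axis=0))   # close to 2.0
```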
C. Activation Functions

Historically, the sigmoid function has often been used as the activation function of the neurons in artificial neural networks such as the multi-layered perceptron and the Boltzmann machine.

It is known that the standard activation functions such as the sigmoid function or the hyperbolic tangent function are contractive almost everywhere and their gradients at large values become almost zero. Thus the updates by stochastic gradient descent become very small. This problem is known as the vanishing gradient problem. To improve the restricted Boltzmann machine, Nair et al. [12] introduced the rectified linear unit (ReLU), defined as

$$ f_{\mathrm{ReLU}}(h_{i,k}) = \max(0, h_{i,k}). \qquad (11) $$

The graph of the rectified linear unit is shown in Fig. 2 in red. The derivative of the ReLU activation function is also shown in Fig. 3 in red.

Glorot et al. [13] showed that the ReLU activation function in the hidden layers could improve the learning speed of various deep neural networks. Now the rectified linear unit is used as the standard activation function for deep neural networks.

From Fig. 3, it is noticed that the gradient at positive values is constant and does not vanish anymore. This means that the vanishing gradient problem can be avoided by using the ReLU activation function. This is the reason why the ReLU activation function can improve the learning speed of deep neural networks.
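To make the vanishing-gradient argument concrete, the following small sketch (added here for illustration only) compares the derivative of the sigmoid with the derivative of the ReLU of Eq. (11) at a few input values.

```python
import math

def sigmoid_grad(h):
    s = 1.0 / (1.0 + math.exp(-h))
    return s * (1.0 - s)              # tends to zero as |h| grows

def relu_grad(h):
    return 1.0 if h > 0 else 0.0      # constant 1 for every positive input

for h in [0.5, 2.0, 10.0, 30.0]:
    print(f"h = {h:5.1f}   sigmoid' = {sigmoid_grad(h):.2e}   ReLU' = {relu_grad(h):.0f}")
# The sigmoid derivative shrinks toward zero for large inputs, while the ReLU
# derivative stays at 1, so gradients passed through active ReLU units do not vanish.
```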
Fig. 2. The graph of the activation functions ReLU and ELU. ReLU is shown by the red line and ELU by the green line.

Fig. 3. The derivatives of the activation functions ReLU and ELU with respect to the input of the activation function. ReLU is shown by the red line and ELU by the green line. Also shown (blue line) is the derivative of the sparseness term S(h_k) with respect to the input of the activation function.

Further improvements of the activation function have been proposed in the literature. As explained in the previous subsection, it is known that the learning of neural networks can be improved when their input and hidden unit activities are centered about zero [9], [18]. Schraudolph et al. extended this to encompass the centering of error signals [19]. Raiko et al. [20] proposed a method of centering the activation of each neuron in order to keep the off-diagonal entries of the Fisher information matrix small. In the Projected Natural Gradient Descent algorithm (PRONG) proposed by Desjardins et al. [21], the activation of each neuron is implicitly centered about zero by the whitening.

Recently, Clevert et al. [14] proposed to use an activation function with negative outputs, named the Exponential Linear Unit (ELU). The ELU activation function is defined as

$$ f_{\mathrm{ELU}}(h_k) = \begin{cases} h_k & (h_k > 0) \\ \alpha(\exp(h_k) - 1) & (h_k \le 0) \end{cases} \qquad (12) $$

where h_k is the output of the filter in the k-th channel and α is a parameter that controls the value to which an ELU saturates for negative inputs. The graph of this activation function is also shown in Fig. 2 in green. It is noticed that ReLU and ELU are the same for positive h_k, but the output values for negative h_k become negative in the case of the ELU. On the other hand, the output values of the ReLU for negative inputs are always zero.

In the learning algorithm based on the stochastic gradient, we have to calculate the partial derivatives of the output of the activation function with respect to the weight vectors w_k. This can be derived as

$$ \frac{\partial f_{\mathrm{ELU}}(h_k)}{\partial \mathbf{w}_k} = \frac{\partial f_{\mathrm{ELU}}(h_k)}{\partial h_k} \frac{\partial h_k}{\partial \mathbf{w}_k}. \qquad (13) $$

The derivative of the ELU activation function with respect to the input h_k is obtained as

$$ \frac{\partial f_{\mathrm{ELU}}(h_k)}{\partial h_k} = \begin{cases} 1 & (h_k > 0) \\ f_{\mathrm{ELU}}(h_k) + \alpha & (h_k \le 0) \end{cases} \qquad (14) $$

The graph of the derivative of the ELU activation function is shown in Fig. 3 in green. The graph of the derivative of the ReLU activation function is also shown in Fig. 3 in red. By comparing these two derivatives, it is noticed that the derivatives for negative h_k are positive for the ELU but zero for the ReLU.

Although the ReLU activation function is effective to avoid the vanishing gradient problem, the outputs of the ReLU activation function at negative values are always zero, and the mean of the outputs of the ReLU activation function becomes positive. This means that the ReLU activation function introduces the effect of the internal covariate shift.

From Fig. 2, it is noticed that the ELU outputs negative values for negative inputs. Therefore the mean of the outputs of the ELU activation function becomes closer to zero than in the case of the ReLU. This means that the ELU activation function can soften the effect of the internal covariate shift. We think that this is one of the main reasons why the ELU activation function can improve the learning of deep neural networks.

From Fig. 3, the derivative of the ELU activation function with respect to the input h_k is non-negative. This means that the outputs of the ELU activation function are pushed in the negative direction by the gradient descent algorithm.
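For reference, the following short sketch (added for illustration, not part of the original paper) transcribes the ELU of Eq. (12) and its derivative of Eq. (14); the value α = 1.0 is an arbitrary choice.

```python
import math

def elu(h, alpha=1.0):
    """Eq. (12): identity for positive inputs, saturates to -alpha for negative ones."""
    return h if h > 0 else alpha * (math.exp(h) - 1.0)

def elu_grad(h, alpha=1.0):
    """Eq. (14): 1 for positive inputs, f_ELU(h) + alpha for negative ones."""
    return 1.0 if h > 0 else elu(h, alpha) + alpha

for h in [-3.0, -1.0, -0.1, 0.5, 2.0]:
    print(f"h = {h:5.1f}   elu = {elu(h):7.3f}   d(elu)/dh = {elu_grad(h):.3f}")
# Unlike the ReLU, the ELU returns negative values for negative inputs,
# and its derivative stays positive there (it equals alpha * exp(h) > 0).
```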
D. Sparseness

It is well known that the neurons in the cat's visual cortex are sensitive to small regions of the visual field (receptive fields) [15]. The receptive fields are tiled to cover the entire visual field and each neuron acts as a local filter over the input visual stimulus. The receptive fields of simple cells in the mammalian primary visual cortex are characterized as being spatially localized and oriented. They are also selective to structure at different spatial scales. This is comparable to the basis functions of a Gabor wavelet.

To understand the response properties of visual neurons from the point of view of the statistical structure of natural images, researchers have attempted to train unsupervised learning algorithms on natural images. Olshausen and Field demonstrated that a learning algorithm that attempts to find sparse linear codes for natural scenes will develop a complete family of localized, oriented, bandpass receptive fields, similar to those found in the primary visual cortex [16], [17].
They also showed that the resulting sparse code possesses a higher degree of statistical independence and can provide a more efficient representation for the later stages of visual processing.

They assumed that a local region I(x, y) of the given image can be represented in terms of a linear combination of basis functions φ_i(x, y) as

$$ I(x, y) = \sum_i a_i \phi_i(x, y). \qquad (15) $$

This corresponds to the reconstruction mapping in the auto-encoder, and the a_i can be considered as the outputs of the hidden units in the auto-encoder. The authors formulated the unsupervised learning of the receptive fields as an optimization problem by constructing the cost function

$$ Q = L + \lambda S, \qquad (16) $$

where L is a measure that evaluates how well the code describes the original local region and S assesses the sparseness of the code.

III. CONVOLUTIONAL NEURAL NETWORK WITH SPARSE REGULARIZATION

From the discussion of the related works described in Section II, the key to improving the learning of deep neural networks is to avoid both the vanishing gradient problem and the internal covariate shift. In this paper we propose a method that introduces sparseness regularization for the convolutional neural network with ReLU activation functions, and show that the sparseness regularization can directly soften the effect of the internal covariate shift.

Fig. 4. Example of the architecture of CNN with two convolution layers and one fully-connected layer.

A. Proposed Architecture

Fig. 4 shows an example of the architecture of a CNN with two convolution layers and one fully-connected layer. As explained in subsection II-A, each neuron of the convolution layer has a small receptive field in the input image, and the outputs of the receptive field are calculated by a linear filter. This is expressed by equation (1). In the pooling layer, the feature maps are down-sampled. Usually the soft-max function is used in the classification layer, and the objective function of the learning is given by the negative log-likelihood of the outputs of the network, as shown in equation (3).
B. Sparse Regularization

Usually, sparseness is introduced to prevent an unnecessary increase of the parameters of the model. For example, sparseness of the weights is introduced by weight decay. It is known that this can prevent overfitting to the training samples and can improve the generalization ability of the trained model.

Olshausen et al. [16] introduced the sparseness to the outputs of the hidden units. Similarly, in this paper we propose to introduce the sparseness to the inputs of the ReLU, namely the outputs of the linear filter. By introducing the sparseness to the inputs of the ReLU, the unnecessary increase of the outputs of the ReLU can be prevented and the unnecessary negative inputs of the ReLU can be reduced. This means that the proposed method has a similar effect to Batch Normalization and also can improve the generalization ability.

There are many ways to evaluate the sparseness. In this paper, we use the sparseness of the input h_k of the k-th ReLU in a neural network defined as

$$ S(h_k) = \log(1 + h_k^2). \qquad (17) $$

This is one of the sparse terms introduced by Olshausen et al. [16].

Here we define the objective function of the optimization to determine the parameters of the network as

$$ E = L + \lambda \sum_k S(h_k), \qquad (18) $$

where λ is a tuning parameter that controls the sparseness.

To understand the effect of the sparse term in the stochastic gradient method, it is necessary to consider the shape of the derivative of the sparse term. The derivative of the sparse term S(h_k) with respect to the input of the ReLU h_k is derived as

$$ \frac{\partial S(h_k)}{\partial h_k} = \frac{2 h_k}{1 + h_k^2}. \qquad (19) $$

The graph of the derivative of the function S(h_k) is shown in Fig. 3 by the blue line.

From this figure, it is noticed that the derivative at positive inputs is positive, but it is negative at negative inputs. This means that introducing the sparseness gives an effect that pushes the inputs of the ReLU toward zero in the learning process. Thus it is expected that the unnecessary increase of the outputs of the ReLU can be prevented. This is the same effect as Batch Normalization. Also, the unnecessary negative values of the inputs of the ReLU can be reduced by introducing the sparseness regularization. This can improve the generalization ability of the trained network.
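The following minimal NumPy sketch (our own illustration) transcribes the penalty of Eq. (17) and its derivative of Eq. (19), and shows how the penalty contributes to the gradient at a pre-activation during back-propagation; the helper names are hypothetical.

```python
import numpy as np

def sparse_penalty(h):
    """Eq. (17): S(h) = log(1 + h^2), applied to the ReLU pre-activations h."""
    return np.log(1.0 + h ** 2)

def sparse_penalty_grad(h):
    """Eq. (19): dS/dh = 2h / (1 + h^2); positive for h > 0, negative for h < 0."""
    return 2.0 * h / (1.0 + h ** 2)

def total_grad_at_preactivation(grad_from_loss, h, lam):
    # During back-propagation the penalty of Eq. (18) simply adds lam * dS/dh
    # to whatever gradient the task loss L produces at the pre-activation h.
    return grad_from_loss + lam * sparse_penalty_grad(h)

h = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(sparse_penalty_grad(h))   # [-0.6 -0.8  0.   0.8  0.6] -> pushes h toward zero
```

In a framework with automatic differentiation one would instead add λ Σ_k S(h_k) directly to the loss of Eq. (18), so that this gradient contribution is generated automatically.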
Compared with the case of the ELU activation function, the derivative of the ELU has the opposite sign in the negative region. This means that the negative inputs of the ELU are pushed toward large negative values. Since the ELU can output negative values for negative inputs, it is useful for reducing the mean of the outputs toward zero, but as a side effect the inputs of the ELU can have unnecessary negative values.
This is not good for the generalization. On the other hand, the sparse regularization for the inputs of the ReLU introduces both the effect of reducing the mean of the outputs of the ReLU and the effect of reducing the unnecessary negative values of the inputs of the ReLU.

We can also introduce the sparse regularization for the outputs of the ReLU. For positive values this has the same effect as the sparse regularization for the inputs of the ReLU. However, the effect of reducing the unnecessary negative values of the inputs of the ReLU is lost, because the ReLU activation function maps all negative inputs to zero.

IV. EXPERIMENT

To confirm the effectiveness of the proposed method, we have performed experiments on the learning of a CNN using the CIFAR-10 dataset.

A. Data sets

In the experiments, we used the standard benchmark dataset CIFAR-10. CIFAR-10 is a dataset of RGB color images for a 10-class object classification task. Each image in the dataset is normalized to 32 × 32 pixels. Table I shows the number of training samples and test samples of this data set.

TABLE I
THE NUMBER OF SAMPLES.

  data set              train   test
  CIFAR-10 (full set)   50000   10000

B. CNN architecture

To confirm the effectiveness of the sparse regularization for the CNN with ReLU activation functions in the hidden layers, we have performed the experiments using the CNN shown in Fig. 4. This network consists of four layers, namely two convolution layers including max pooling, one fully-connected layer and one classifier layer. The size of the convolution filters is set to 5 × 5. Max pooling is used in the pooling layers, with a pooling size of 2 × 2. The number of filters in the first convolution layer is set to 32 and the number of filters in the second convolution layer is set to 64. The number of neurons in the fully-connected layer is set to 256. In the convolution layers and the fully-connected layer, the ReLU is used as the activation function.

In the following experiments, the parameter λ which controls the sparseness is defined by λ = 1/n_f, where n_f is the number of elements in the feature vector. Stochastic gradient descent (SGD) is used for optimization. The learning rate is set to 0.001. In these experiments, Batch Normalization is applied to the first and the second convolution layers, because this gives a better recognition rate for the test samples than the case in which Batch Normalization is applied to all the layers.
C. Effectiveness of the sparse term for the inputs of ReLU

We have performed experiments to evaluate the effectiveness of the sparse regularization for the inputs of the ReLU. Table II shows the recognition rates for the test samples. In these experiments, the 50000 training samples of the data set shown in Table I were used for training, and the sparse regularization term is introduced to the input of the ReLU activation function. The recognition rates obtained by the proposed method are shown as "with sparse term". We have performed the experiments while changing the layers to which the sparseness is introduced. In Table II, "conv & fc", "pool & fc", and "conv & pool & fc" denote the layers to which the sparse regularization is introduced. The label "conv" means that the sparse regularization is introduced in both convolution layers. The label "conv1 & pool1" means that the sparse regularization is introduced in the first convolution layer and the first pooling layer. For comparison, the recognition rates obtained by the standard error back-propagation learning algorithm with and without Batch Normalization are shown as "Batch Normalization" and "original" in Table II.

TABLE II
THE ACCURACY ON CIFAR-10 (FULL DATASET) AT 1000 EPOCHS. THE SPARSE TERM IS INTRODUCED TO THE INPUT OF THE RELU UNITS.

  methods                                        ReLU with SGD
  without sparse term   original                 66.04%
                        Batch Normalization      68.72%
  with sparse term      conv & pool              75.68%
                        fc                       69.32%
                        conv & fc                75.98%
                        pool & fc                75.79%
                        conv1 & pool1 & fc       75.56%
                        conv1 & pool2 & fc       75.79%
                        conv2 & pool1 & fc       75.57%
                        conv & pool & fc         76.40%

From Table II, it is noticed that the proposed approach gives better recognition rates than the error back-propagation learning algorithm both with and without Batch Normalization.

D. Detailed comparison of the sparse regularization

To understand the effectiveness of the sparse regularization in detail, we have performed experiments in which the layers with the sparse term are changed. In these experiments the sparseness was introduced to the outputs of the ReLU.

Figure 5 and Figure 6 show the learning curves of the recognition rates for the training samples and the test samples. In Figure 5, the sparse regularization term is introduced to the input of the ReLU activation function. On the other hand, the sparse term is introduced to the output of the ReLU in Figure 6. For comparison, the learning curves obtained by the original error back-propagation algorithm (shown as "origin") and by the error back-propagation algorithm with Batch Normalization (shown as "Batch Normalization") are also shown in these figures.
Fig. 5. Learning curves of the CNN with ReLU activation function. In this experiment, the full dataset is used as the training samples. The sparse regularization is introduced to the input of the ReLU activation functions.

Fig. 6. Learning curves of the CNN with ReLU activation function. In this experiment, the full dataset is used as the training samples. The sparse regularization is introduced to the output of the ReLU activation functions.

From these figures, it is noticed that the learning speed with Batch Normalization is faster than in the other cases, but the recognition rates for the test samples become the best when the sparse regularization is introduced in the convolution, pooling, and fully-connected layers.

Table III shows the recognition rates for the test samples when the sparse term is introduced to the outputs of the ReLU activation function. The recognition rates obtained by the trained CNN are shown in the third column. The upper two rows are the results obtained by the original error back-propagation learning with and without Batch Normalization. The other rows are the results obtained by the error back-propagation with the sparse regularization. The layers where the sparse regularization is introduced are denoted such as "conv" or "conv1 & pool1".

TABLE III
THE ACCURACY ON CIFAR-10 AT 1000 EPOCHS. THE SPARSE REGULARIZATION IS INTRODUCED TO THE OUTPUT OF THE RELU UNITS.

  methods                                        ReLU with SGD
  without sparse term   original                 66.04%
                        Batch Normalization      68.72%
  with sparse term      conv & pool              75.08%
                        fc                       69.26%
                        conv & fc                74.39%
                        pool & fc                75.67%
                        conv1 & pool1 & fc       75.10%
                        conv1 & pool2 & fc       75.15%
                        conv2 & pool1 & fc       74.76%
                        conv & pool & fc         75.87%

From this table, it is noticed that the recognition rates for all cases using the sparse regularization are better than the results of the original error back-propagation with and without Batch Normalization. The top recognition rate is achieved when the sparse regularization is introduced in all layers. It is also about 7% and 9% better than the original error back-propagation learning with and without Batch Normalization, respectively. Other high recognition rates are also achieved when the sparse regularization is introduced in all pooling layers and the fully-connected layer, or in the first convolution layer, the second pooling layer and the fully-connected layer.

However, the case in which the sparse regularization is introduced only in the fully-connected layer is not as good as the other cases. This means that the sparse regularization should be introduced both in the feature extraction layers, such as the convolution layers or the pooling layers, and in the classifier (the fully-connected layer).

By comparing Table II and Table III, it is noticed that the recognition accuracies obtained by the sparse regularization on the input of the ReLU activation function are better than the cases in which the sparse term is introduced to the output of the ReLU activation function. We think that the reason is that the gradient of the sparse term for negative values always becomes zero when the sparse term is introduced to the outputs of the ReLU activation functions. On the other hand, the gradient of the sparse term for negative values is non-zero when the sparse term is introduced to the inputs of the ReLU activation functions. This suggests that the sparse term should be introduced to the input of the ReLU.
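As a quick numerical check of this explanation (our own illustration, not from the paper), the sketch below evaluates the gradient that the sparse term produces at a negative pre-activation when the penalty is placed on the ReLU input versus on the ReLU output.

```python
def sparse_grad(h):
    # dS/dh = 2h / (1 + h^2), Eq. (19)
    return 2.0 * h / (1.0 + h * h)

def relu(h):
    return max(0.0, h)

def grad_input_placement(h):
    # Penalty on the ReLU *input*: the gradient is simply dS(h)/dh.
    return sparse_grad(h)

def grad_output_placement(h):
    # Penalty on the ReLU *output*: chain rule gives dS(relu(h))/dh = S'(relu(h)) * relu'(h).
    return sparse_grad(relu(h)) * (1.0 if h > 0 else 0.0)

h = -2.0  # a negative pre-activation
print(grad_input_placement(h))    # -0.8: the negative pre-activation is pulled toward zero
print(grad_output_placement(h))   #  0.0: no correction reaches negative pre-activations
```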
V. CONCLUSION

In this paper we proposed to introduce sparse regularization to the inputs of the ReLU units of a CNN. It is shown that the sparse regularization has the effect of pushing the inputs of the ReLU toward zero in the learning process, so that the unnecessary increase of the outputs of the ReLU can be prevented. This is similar to the effect of Batch Normalization. Also, the unnecessary growth of the negative values of the inputs of the ReLU can be reduced by introducing the sparse regularization to the inputs of the ReLU. Thus it is expected that the generalization ability of the trained CNN can be improved.
Through the detailed experiments using the CIFAR-10 dataset, the effectiveness of the proposed approach is confirmed.

We think that the key to improving the learning of the CNN is to solve three problems: the vanishing gradient, the unnecessary growth of the inputs, and the bias shift. The ReLU solved the vanishing gradient problem but introduced the problem of unnecessary growth of the inputs. The sparse regularization proposed in this paper solves the problem of unnecessary growth of the inputs. This is probably the reason why the combination of the ReLU and the sparse regularization gives the better results. On the other hand, the ELU can solve the problems of the vanishing gradient and the bias shift. The effectiveness of the combination of the ELU and the sparse regularization should be investigated.

ACKNOWLEDGMENT

This work was partly supported by JSPS KAKENHI Grant Number 16K00239.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks", Proc. Conf. Neural Information Processing Systems, pp. 1097-1105, 2012.
[2] M. Lin, Q. Chen, and S. Yan, "Network in network", arXiv:1312.4400, 2013.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition", arXiv:1409.1556, 2014.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions", arXiv:1409.4842, 2014.
[5] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception architecture for computer vision", arXiv:1512.00567, 2015.
[6] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition", arXiv:1512.03385, 2015.
[7] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position", Biological Cybernetics, vol. 36, no. 4, pp. 193-202, 1980.
[8] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition", Neural Computation, vol. 1, no. 4, pp. 541-551, 1989.
[9] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient backprop", Neural Networks: Tricks of the Trade, vol. 1524 of Lecture Notes in Computer Science, pp. 9-50, Springer, 1998.
[10] S. Wiesler and H. Ney, "A convergence analysis of log-linear training", NIPS, 2011.
[11] S. Ioffe and C. Szegedy, "Batch normalization: accelerating deep network training by reducing internal covariate shift", arXiv:1502.03167, 2015.
[12] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines", ICML-10, 2010.
[13] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks", AISTATS, 2011.
[14] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)", ICLR, 2016.
[15] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex", The Journal of Physiology, vol. 160, pp. 106-154, 1962.
[16] B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images", Nature, 1996.
[17] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?", Vision Research, vol. 37, no. 23, pp. 3311-3325, 1997.
[18] Y. LeCun, I. Kanter, and S. A. Solla, "Eigenvalues of covariance matrices: Application to neural-network learning", Physical Review Letters, vol. 66, no. 18, pp. 2396-2399, 1991.
[19] N. N. Schraudolph, "Centering neural network gradient factors", Neural Networks: Tricks of the Trade, vol. 1524 of Lecture Notes in Computer Science, pp. 207-226, Springer, 1998.
[20] T. Raiko, H. Valpola, and Y. LeCun, "Deep learning made easier by linear transformations in perceptrons", AISTATS, 2012.
[21] G. Desjardins, K. Simonyan, R. Pascanu, and K. Kavukcuoglu, "Natural neural networks", arXiv:1507.00210, 2015.