Deep Convolutional Neural Networks: Structure, Feature Extraction and Training

Abstract – Deep convolutional neural networks (CNNs) are aimed at processing data that have a known grid-like topology. They are widely used to recognise objects in images and diagnose patterns in time series data as well as in sensor data classification. The aim of the paper is to present theoretical and practical aspects of deep CNNs in terms of the convolution operation, typical layers and basic methods to be used for training and learning. Some practical applications are included for signal and image classification. Finally, the present paper describes the proposed block structure of a CNN for classifying crucial features from 3D sensor data.

Keywords – Convolution layers, convolution operation, deep convolutional neural networks, feature extraction.

I. INTRODUCTION

Deep learning has recently given new power that allows building artificial intelligence (AI) systems that were not possible a few years ago.

Today, AI is a rapidly expanding technology that solves tasks requiring an amount of computation only computers can deliver. On the other hand, there are problems that people solve intuitively, such as recognising a drawing in an image, understanding spoken words or judging the direction and obstacles of traffic on roads, based on their intuition and experience. This means that with massively parallel processing systems and the introduction of complex algorithms these solutions can be figured out much faster and with higher accuracy. The computing infrastructure is based on a hierarchy of perceptions. Each computing layer is characterised in terms of its relation to concepts, where the essential layer consists of simple concepts. If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach deep learning; it covers several aspects of machine learning [1].

There are two fundamental approaches in the field of AI. The first approach is based on knowledge engineering systems, logic programming and logical reasoning. The second approach covers microscopic biological models [2]. Artificial neural networks (ANNs) and genetic algorithms are the prime examples of this latter approach. The field of ANNs was initially configured as an attempt to emulate the way that the brain performs a particular task, by regarding the brain as a highly complex, nonlinear, parallel information processing system [3], [4]. The most notable features of ANNs are an extensive interconnection grid of simple processing units and the adjustment of the grid parameters, or weights, in order to carry out tasks or adapt to the environment through the learning process. In mathematical terms, an ANN can be seen as a directed graph where each node implements a neuron model [3].

Deep convolutional neural networks (CNNs) are a specialised kind of ANNs that use convolution in place of general matrix multiplication in at least one of their layers [1]. In contrast to simple neural networks that have one or several hidden layers, CNNs consist of many layers. Such a feature allows them to compactly represent highly nonlinear and varying functions [5]. CNNs involve many connections, and the architecture is typically comprised of different types of layers, including convolution, pooling and fully-connected layers, and realise a form of regularisation [6]. In order to learn complicated features and functions that can represent high-level abstractions (e.g., in vision, language, and other AI-level tasks), CNNs need deep architectures. Deep architectures, including CNNs, consist of a large number of neurons and multiple levels of latent non-linear computations. According to Bengio [7], each level of a CNN architecture represents features at a different level of abstraction, defined as a composition of lower-level features.

CNNs have recently shown remarkable success in image recognition [8], [9], sentence and text classification [10], multivariate time series data analysis [11], medicine [12], time series physiological signals [13], electric machine fault diagnosis [14], ultrasonic signal classification [15] and biological image classification [16]. Deep learning techniques have recently been used by many companies, such as Adobe, Apple, Baidu, Facebook, Google, IBM, Microsoft, NEC, Netflix, and NVIDIA [17].

The rest of the article is organised as follows. In Section II, a short historical overview is given referring to the evolution of convolutional networks. In Section III, the definition of the convolution operation is provided. Section IV covers the basic principles of CNNs, which are necessary to understand in order to elaborate a deep neural network approach. In Section V, the main principles of CNN learning and training are given. In Section VI, a comprehensible deep CNN system architecture consisting of the proposed convolutional layers for signal data classification is brought out. Conclusions and proposals for future research are formulated in Section VII.

II. BRIEF HISTORY OF CNNS

CNNs are biologically inspired by the structure of mammals' visual cortices and the operation of their vision system as presented in David Hubel and Torsten Wiesel's model [18]. Based on their seminal model, computational deep learning can be recognised as part of the computational intelligence paradigm. The basic idea comes out of their model: using computing infrastructure, algorithms and labelled data for learning, one can simulate the neocortex's large array of neurons in an artificial "neural network".

Their model helped shape the understanding of many aspects of brain function, especially the primary visual cortex (PVC). Goodfellow et al. [1], [19] outline three properties of PVC used to design a convolutional network. First, PVC has a 2D structure and is organised as a spatial map; CNNs capture this property by having features defined in terms of 2D maps. Second, PVC contains many simple cells that are characterised by a linear function of the image in a small, spatially localised receptive field; the detector units of a convolutional network are designed to copy these properties of simple cells. Third, PVC also contains many complex cells that respond to features; complex cells are invariant to small shifts in the position of their feature and remain unchanged under changes in the conditions of measurement that would affect simple cells. This inspires the pooling stages of CNNs.

The first network we regard as a contemporary convolutional network was proposed by Fukushima in 1980 and is called the Neocognitron [20]. It was perhaps the first ANN that deserved the attribute "deep" and the first to incorporate the neurophysiological model of PVC. The Neocognitron is very similar to the architecture of modern, contest-winning, purely supervised, feedforward, gradient-based deep networks with alternating convolutional and downsampling layers [21]. In 1986, Rumelhart et al. [22] proposed a backpropagation network to train a neural network with one or two hidden layers. In 1989, LeCun [23] demonstrated techniques using a hierarchy of shift-invariant local feature detectors for image recognition. In the same year, backpropagation was applied [24] to Neocognitron-like, weight-sharing, convolutional neural layers with adaptive connections. In 1991, Robinson and Fallside [25] introduced a recurrent neural network (RNN) for speech recognition, and in the same year Bengio et al. [26] suggested a multilayer perceptron (MLP) for speech recognition. However, this wave of using deep neural networks, which had started in the 1980s as the connectionist approach, ended around 1995.

The current deep learning renaissance began in 2006, when Hinton [27] demonstrated that a neural network could outperform the Gaussian or radial basis function (RBF) kernel on the MNIST benchmark. The year 2006 also saw an early graphics processing unit (GPU) based CNN implementation up to 4 times faster than a central processing unit (CPU) implementation [28]; an earlier GPU implementation of standard feedforward neural networks (FNNs) had already reported a speed-up factor of 20 over CPUs [29]. GPUs, digital signal processors (DSP), field-programmable gate arrays (FPGA) and other silicon architectures have since become critical components of computational resources for training and evaluation when executing the idiosyncratic patterns of deep CNNs.

Today, deep CNNs are used for many practical applications due to the seminal ideas of Yann LeCun on convolutional networks, Geoffrey Hinton's exploration of deep learning methods, and Jeff Hawkins' memory prediction framework theory of the brain and the hierarchical temporal memory (HTM) he described. Maximilian Riesenhuber and Tomaso Poggio worked on the hierarchical model of object recognition (HMAX), based on Poggio's computational model of brain function for building intelligent machine vision that could mimic human performance. Finally, Sepp Hochreiter and Jürgen Schmidhuber worked on recurrent neural networks (RNN) and Long Short-Term Memory (LSTM) RNNs.

It is worth mentioning some global conferences that have contributed to CNN development. The most noticeable are:
Annual Conference on Neural Information Processing Systems (NIPS);
The world's biggest GPU developer conference, the GPU Technology Conference (GTC);
International Conference on Machine Learning (ICML);
The Conference on Computer Vision and Pattern Recognition (CVPR);
International Conference on Computer Vision (ICCV).

III. THE CONVOLUTION OPERATION

In its most general form, convolution is a mathematical operation on two functions that produces a new function. This new function reflects to which extent the original functions match if their graphs are aligned with each other.

The convolution theorem states that under certain conditions the Fourier transform of a convolution is the point-wise product of Fourier transforms. In other words, convolution in one domain (e.g., the time domain) equals point-wise multiplication in the other domain (e.g., the frequency domain).

In one dimension, the convolution of two functions is defined as follows:

(f \odot g)(x) = \int_{-\infty}^{\infty} f(s)\, g(x - s)\, \mathrm{d}s,   (1)

where f(x) and g(x) are two functions and s is a dummy variable of integration.

In two dimensions, the convolution of two functions is defined as follows:

(f \odot g)(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(s, t)\, g(x - s, y - t)\, \mathrm{d}s\, \mathrm{d}t.   (2)

The convolution operation is typically denoted with an asterisk * and must not be confused with multiplication. For example, in one-dimensional applications, with a time-domain signal x(t) and a weighting function (kernel) w(a), the convolution operation is [1]:

s(t) = (x \ast w)(t),   (3)

where x, the first argument, is referred to as the input; w, the second argument, is referred to as the kernel; and the output s(t) is referred to as the feature map or kernel map.
In computer applications, the time series data are discretized, and the time index t then takes only integer values. Thus, the discrete convolution can be defined as follows:

s(t) = (x \ast w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a).   (4)

In machine learning applications, the input is usually a multidimensional array of data and the kernel is usually a multidimensional array of parameters that are adapted by the learning algorithm [1]. These multidimensional arrays are referred to as tensors.

If the input is two-dimensional, for example an image I, a two-dimensional kernel K has to be used. The convolution in two dimensions is as follows:

S(i, j) = (I \ast K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n).   (5)

Because convolution is commutative, we can equivalently write (5) as follows, which is often easier to implement since there is less variation in the range of valid values of m and n:

S(i, j) = (K \ast I)(i, j) = \sum_{m} \sum_{n} I(i - m, j - n)\, K(m, n).   (6)

As m increases, the index into the input increases, but the index into the kernel decreases; this means the kernel is flipped relative to the input. If the kernel is not flipped, we obtain the related function called cross-correlation:

S(i, j) = (I \ast K)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n).   (7)

In the context of machine learning, the algorithm will learn the appropriate values of the kernel in the appropriate place [1]. In machine learning, convolution is not used alone but in combination with other functions; CNNs operate on these principles of convolution.

IV. CONVOLUTIONAL NEURAL NETWORKS

From the above-mentioned considerations, we can clearly recognise that the topology of a CNN differs from that of other traditional neural networks. The latter use matrix multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit [1].

The main benefit of using CNNs with respect to traditional fully-connected neural networks is the reduced number of parameters to be learned [30]. The CNN topology is based on three main concepts, namely, local receptive fields, shared weights and spatial or temporal sampling [31]. This means that CNNs are typically comprised of different types of layers called convolutional layers, where each convolutional layer is made of small kernels that allow extracting high-level features in an effective way. The last convolutional layer is fed to fully-connected layers. As stated before, because CNNs reduce the number of parameters to be learned, they have much fewer connections and are easier to train [8]. Consequently, this particular kind of neural network assumes that we wish to learn filters in a data-driven fashion, as a means to extract features describing the inputs [6].

The standard model of a CNN has a structure consisting of the input layer, alternating convolutional layers, pooling or subsampling layers and non-linear layers. The latter part consists of a small number of fully-connected layers, and the final layer is often a softmax classifier [32]. In the complex-layer terminology, one convolutional net or convolutional layer is composed of a convolutional stage (e.g., affine transformation), a detector stage (e.g., rectified linear) and a pooling stage [1]. This means that each convolutional layer has more than one stage, so each stage of the convolutional layer can be set apart and every step of its processing can be treated in its own right. Typically, convolutional layers are interspersed with sub-sampling layers to reduce computational time and gradually build up further spatial and configural invariance [7]. The basic layers of a CNN are listed below.

Input layer. The input is usually a multidimensional array of data that is fed to the network [6]. Input data can be, e.g., image pixels or their transformations, patterns, time series or video signals.

Convolutional layers or convolutional stage. This is the main building block of a CNN. The prime purpose of convolution is to extract distinct features from the input. Krig [33] outlines that these layers are comprised of a series of filters, or learnable kernels, that aim at extracting local features from the input, and each kernel is used to calculate a feature map or kernel map. The first convolutional layer extracts low-level meaningful features such as edges, corners, textures and lines; subsequent convolutional layers extract higher-level features, and the highest-level features are extracted in the last convolutional layer [34]. Kernel size refers to the size of the filter that convolves around the feature map, while the amount by which the filter slides is the stride. The stride controls how the filter convolves around the feature map; with a stride of one, the filter convolves around the input feature map by sliding one unit each time [1].

Another essential feature of CNNs is padding, which gives an option to make the input data wider with, e.g., elements V_{i,j,k}. For example, if there is a need to control the size of the output and the kernel width W independently, zero padding of the input is used.

Non-linear layers or detector stage. The detector stage passes each linear activation through a nonlinear activation function. In other words, the activation function introduces non-linearity into neural networks and allows learning more complex models [11].

There are several nonlinear activation functions. The standard ways to model a neuron's output f as a function of its input x are f(x) = tanh(x), the sigmoid function, or the Rectified Linear Unit (ReLU) [32]. The last one is preferable because it makes training several times faster than its equivalents. Some authors adopt the sigmoid(·) function at all activation stages due to its simplicity [11]. ReLU applies the function y = max(x, 0). It increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolutional layer.
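As a minimal illustration of the activation functions named above (our own sketch; the sample inputs are arbitrary):

```python
import numpy as np

# ReLU applies y = max(x, 0); sigmoid and tanh squash to (0, 1) and (-1, 1).
def relu(x):    return np.maximum(x, 0.0)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

z = np.linspace(-3.0, 3.0, 7)
print(relu(z))
print(sigmoid(z))
print(np.tanh(z))
```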
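Returning to the convolution operation of Section III, the flipped-kernel distinction between convolution (5) and cross-correlation (7) can be made concrete with the following naive Python sketch (a toy image and kernel are assumed; this is not an optimised implementation):

```python
import numpy as np

def cross_correlate2d(I, K):
    # Cross-correlation, eq. (7): sliding dot product, kernel not flipped.
    h, w = K.shape
    H, W = I.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + h, j:j + w] * K)
    return out

def conv2d(I, K):
    # Discrete 2D convolution, eq. (5): flip the kernel on both axes,
    # then apply the same sliding dot product.
    return cross_correlate2d(I, np.flip(K))

I = np.arange(16.0).reshape(4, 4)          # toy "image"
K = np.array([[1.0, 0.0], [0.0, -1.0]])    # toy kernel
print(cross_correlate2d(I, K))             # what most CNN libraries compute
print(conv2d(I, K))                        # true (flipped-kernel) convolution
```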
According to [8], using ReLU it is possible to speed up the training of CNNs by keeping the gradient more or less constant at all network layers.

The pooling, downsampling or subsampling layer. It reduces the resolution of the previous feature maps, compressing the features and the computational complexity of the network [35]. It makes the features robust to noise and distortion. Another purpose of the pooling layer is to achieve robustness to small variations of previously learned features; as a result, pooling ensures that the network focuses on the most important patterns.

In general, a pooling layer produces downsampled versions of the input map and reduces the dimensionality of the feature maps used by the following layers [6], [7].

Pooling splits the inputs into regions of size R × R to produce one output from each region. If an input of size W × W is fed to the pooling layer, the output size P is obtained by [36]:

P = W / R.   (8)

Pooling can be max pooling, averaging over a rectangular neighbourhood, or pooling by downsampling.

The max pooling operation takes the maximum output within a rectangular neighbourhood, outputting only the maximum number in each region and thus reducing the feature map size; max pooling introduces invariance. For example, an input feature map of size 44 × 44 is divided into 22 × 22 regions of size 2 × 2, i.e., we apply pooling of size 2 and obtain a 22 × 22 output. For max pooling, the maximum of the four values in each region is selected; for average pooling, the average of the four values is selected, and the result of averaging is a fraction rounded to the nearest integer.

Each output map may combine convolutions with multiple input maps. In general, we can write [37]:

x_j^{\ell} = f\left( \sum_{i \in M_j} x_i^{\ell-1} \ast k_{ij}^{\ell} + b_j^{\ell} \right),   (9)

where M_j denotes a selection of input maps, k_{ij}^{\ell} is the kernel connecting input map i to output map j, b_j^{\ell} is an additive bias, and f is the activation function.

During convolution, the kernel slides from the top-left corner of the input feature map to the top-right corner, one element at a time. Then the kernel shifts one element downward, returns to the left side and again moves towards the right side. This process finishes when the kernel reaches the bottom-right position.

For example, when the input is H × H = 44 × 44 and the kernel is k × k = 5 × 5, there are H − k + 1 = 40 unique positions from left to right and 40 unique positions from top to bottom that the kernel can take. Each feature map in the convolution output will therefore contain (H − k + 1) × (H − k + 1) = 40 × 40 elements. To create one element of one output feature map, k × k × W operations are required, where W is the depth of the input.

From the above-mentioned considerations, it can be concluded that a new feature map is typically generated by sliding a filter over the input and computing the dot product (which is similar to the convolution operation), followed by a non-linear activation function to introduce non-linearity into the model [32]. For instance, one convolutional layer consists of the input feature map, the kernel and the convolution output, and all units of a feature map share the same weights (filters).

Fig. 1. One convolutional layer: input feature map of depth W, kernel (filter) of size k × k × W, ReLU stage, convolution output, and pooling stage (a) max pooling; b) average pooling). [Figure not reproduced.]
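As an illustration of the size arithmetic above, i.e., H − k + 1 valid kernel positions per side and the pooling output size (8), the following Python sketch (toy sizes assumed) performs non-overlapping 2 × 2 max and average pooling:

```python
import numpy as np

H, k, R = 44, 5, 2
conv_side = H - k + 1                 # 40 unique kernel positions per side
pool_side = conv_side // R            # P = W / R for non-overlapping pooling

fmap = np.random.default_rng(0).random((conv_side, conv_side))
# Reshape into R x R blocks, then reduce each block to a single output.
blocks = fmap.reshape(pool_side, R, pool_side, R)
max_pooled = blocks.max(axis=(1, 3))   # max pooling
avg_pooled = blocks.mean(axis=(1, 3))  # average pooling
print(max_pooled.shape, avg_pooled.shape)  # (20, 20) each
```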
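Equation (9), as reconstructed above, can likewise be sketched in Python (map sizes, kernels and bias are arbitrary; SciPy's valid-mode cross-correlation stands in for the convolution stage):

```python
import numpy as np
from scipy.signal import correlate2d

def feature_map(prev_maps, kernels, b, f=lambda z: np.maximum(z, 0.0)):
    # Eq. (9): output map j sums the responses of each selected input
    # map i with its kernel k_ij, adds a bias b_j, applies nonlinearity f.
    z = sum(correlate2d(x_i, k_ij, mode="valid")
            for x_i, k_ij in zip(prev_maps, kernels))
    return f(z + b)

rng = np.random.default_rng(1)
prev = [rng.random((8, 8)) for _ in range(3)]   # M_j: three input maps
kers = [rng.random((3, 3)) for _ in range(3)]   # one kernel per input map
print(feature_map(prev, kers, b=0.1).shape)      # (6, 6)
```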
Fully-connected layers. The last layers are fully connected 1D layers, connected to all activations in the previous layer [7]. From these layers it is possible to extract features to train another classifier. To specify how the network training penalises the deviation between the predicted and the true labels, various loss functions can be used, e.g., softmax, sigmoid cross-entropy or Euclidean loss [6].

V. TRAINING AND LEARNING CNNS

Training deep architectures is a challenging task, and traditional methods that have proved effective when applied to uncomplicated neural network architectures are not as effective when applied to deep architectures. Training means using an overall algorithm to train a neural network to recognise a certain input and map it to a certain output. The most expensive parts of CNN training are learning the features and obtaining access to labelled data.

Learning in deep neural networks requires computing the gradients of complicated functions and deciding how to manipulate them [1]. CNNs are usually trained by backpropagation (BP) and Stochastic Gradient Descent (SGD) to find weights and biases that minimise a certain loss function in order to map the arbitrary inputs to the targeted outputs as closely as possible [3]. The BP algorithm refers only to the method for computing the gradient, while the SGD algorithm is used to perform learning using this gradient [1]. The BP technique can make computing the training gradient around ten million times faster relative to naive implementation techniques [22].

There are two central challenges in machine learning: underfitting and overfitting [38]. Overfitting occurs when the gap between the training error and the test error is too large. Underfitting occurs when the model is not able to obtain a sufficiently low error on the training set. In CNNs, there are two primary ways to combat overfitting: dropout and data augmentation. Dropout is a cheap but powerful regularisation strategy that can be seen as a process of constructing new inputs by multiplication by noise [8]. It is a method of adaptive reparametrisation, motivated by the difficulty of training very deep neural network models. Data augmentation artificially enlarges the dataset using label-preserving transformations [12].

Another pair of challenges in machine learning is regularisation and optimisation. The optimisation perspective suggests that the weights should be large enough to propagate information successfully, but regularisation concerns encourage making them smaller. Unless the training set contains tens of millions of examples or more, some mild form of regularisation should be included from the start [1].

According to [7], the parameter update includes the feedforward pass, the backpropagation pass and the application of the gradient. The aim of the feedforward pass is to determine the predicted output of the CNN on an input vector; specifically, it computes the feature maps from layer to layer and stage to stage until obtaining the output. In the backpropagation pass, backpropagation starts at the last layer of the neural network and recursively applies the chain rule to compute the gradients, going backwards to the inputs of the neural network.

Finally, to obtain optimal performance of CNNs, we can tune three options: regularisation, momentum and learning rate. Regularisation serves to prevent overfitting of the data; it can be improved by adjusting a weight decay coefficient or by adding a regularisation strategy such as dropout or data augmentation [12]. Momentum controls how fast or slowly the network learns during training. The learning rate helps the training converge.

VI. ARCHITECTURE

In this section, some architectures and methodologies are considered for image and signal detection.

The architecture of a CNN contains several layers. Krizhevsky et al. [8] offer a CNN that contains eight learned layers with weights. The first five are convolutional and the remaining three are fully-connected layers [39]. For example, the first convolutional layer consists of a 224 × 224 × 3 pixel input map and is convolved with 96 kernels of size 11 × 11 × 3, where the stride is 4 pixels. The second convolutional layer takes as input the (response-normalised and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. Each fully-connected layer has 4096 neurons. The training of the model was done using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9 and weight decay of 0.0005. An equal learning rate, adjusted manually through training, was used for all layers; it was initialised at 0.01 and reduced three times prior to termination. NVIDIA GTX 580 3 GB GPUs were used for training the network.

Acharya et al. [12] report using a CNN for automated detection of myocardial infarction using the electrocardiogram (ECG). In the pre-processing step, all ECG signals are segmented using R-peak detection with the Pan–Tompkins algorithm. To address the problem of amplitude scaling before feeding the ECG segments into the deep CNN for training and testing, each segment is normalised with the z-score [40]. The architecture of the proposed CNN consists of 11 layers, where the last three are fully-connected layers. For example [41], the input layer (layer 0) is convolved with a kernel of size 102 to form the first layer (layer 1), after which max pooling of size 2 is applied to every feature map. After performing the max pooling operation, the number of neurons reduces from 550 × 3 to 275 × 3. Finally, layer 10 is connected to the last layer with 2 output neurons. The training was executed using standard backpropagation with a batch size of 10 [40]. The regularisation, momentum and learning rate parameters were set to 0.2, 3 × 10−4 and 0.7, respectively. The ECG data were split into 70 % for training, 20 % for validation and 10 % for testing of the CNN. The final layer of the fully-connected network is a softmax layer with an output of an X-dimensional vector, where X is the number of classes that we desire to have.
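The training procedure and the hyperparameters quoted above from Krizhevsky et al. [8] (momentum 0.9, weight decay 0.0005, learning rate 0.01, batch size 128) can be sketched with a modern framework as follows; this is a hedged illustration only, and the tiny classifier head and the random batch are placeholders rather than the original architecture:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),  # 96 kernels of 11 x 11 x 3
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                     # placeholder, not AlexNet
    nn.Flatten(),
    nn.Linear(96, 1000),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=0.0005)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(128, 3, 224, 224)                # batch size of 128 examples
y = torch.randint(0, 1000, (128,))
opt.zero_grad()
loss = loss_fn(model(x), y)                      # feedforward pass
loss.backward()                                  # backpropagation pass
opt.step()                                       # the gradient applied
```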
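The z-score normalisation of ECG segments described above can be sketched as follows (the segment count and length are toy values, not the actual ECG data):

```python
import numpy as np

def z_score(segment):
    # Centre and scale one segment to zero mean and unit standard
    # deviation, addressing amplitude scaling before feeding the CNN.
    return (segment - segment.mean()) / segment.std()

segments = np.random.default_rng(2).random((10, 550))  # 10 toy segments
normalised = np.stack([z_score(s) for s in segments])
print(normalised.mean(axis=1).round(6))  # ~0 per segment
print(normalised.std(axis=1).round(6))   # ~1 per segment
```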
Ferreira and Giraldi [6] apply CNNs to granite image classification, learning discriminative features instead of relying on feature engineering. Each image is converted to R, G and B colour channels and also converted to grey level; these are used as input to the network. After patching, the images were divided into tiles of interest, i.e., 28 × 28 grey-level images and 32 × 32 colour images. In the next step, different CNNs were used to recognise image patterns.

Four networks were used, with the following architectures. The MNIST1 network consisted of one input layer, four convolutional layers, two pooling layers, one ReLU layer and one fully-connected layer. The MNIST2 network consisted of one input layer, five convolutional layers, three pooling layers, one ReLU layer and a final fully-connected layer. The MNIST3 network consisted of one input layer, six convolutional layers, four pooling layers, one ReLU layer and a final fully-connected layer. The CIFAR network consisted of one input layer, five convolutional layers, three pooling layers, four ReLU layers and one fully-connected layer. These networks generated feature vectors with 500, 256, 256 and 64 dimensions, respectively. Finally, the image patch classification was done by applying the 1st Nearest Neighbour (1NN) classifier to the tiny image blocks from the input images.

The MNIST-based networks were trained using a learning rate fixed at 0.001 and SGD with momentum of 0.9 and weight decay of 0.0005, without dropout. The CIFAR network was trained using SGD with momentum of 0.9 and weight decay of 0.0001. Experiments were performed on a cluster node with 11 processors, 131 GB RAM and an NVIDIA GeForce 210 graphics card.

Zheng et al. [11] modify a CNN and apply it to the multivariate time series classification task (the input of multivariate time series classification is multiple 1D subsequences rather than 2D image pixels) by separating multivariate time series into univariate ones. Feature learning is carried out individually on each univariate time series. The architecture consists of three input channels, two filter layers, two pooling layers and two fully-connected layers. To update the parameters, the stochastic gradient descent method was used, because SGD converges faster than the full-batch version for large-scale data.

Wang et al. [13] focus on evaluating the efficacy of using a CNN to construct a model of physiological signal anomaly detection, testing the algorithm on eight physiological signals. The DEAP dataset, a dataset for emotion analysis using EEG, physiological and video signals, was used [42]. The CNN transforms the raw unlabelled time series signals into a reduced set of features. Before the signals enter the model, the time series physiological signals were normalised. The deep CNN architecture contains two convolutional layers, two pooling layers and a multivariate Gaussian anomaly detection model.

The proposed block structure of a deep CNN for sensor data classification consists of the following stages: input (sensor data), pre-processing (filtering and normalisation), feature extraction (pattern recognition), results (classification). The algorithm performs the following steps to learn features:
1) Classifying sensor signals as a data array into a number of segments as labelled training data.
2) Filtering and normalising segments of the raw sensor data (pre-processing step).
3) Extracting features using the CNN.
4) Detecting patterns and creating the classification.

Data are often corrupted by interfering noise. In order to decrease that noise, we use reduction, labelling, scaling and normalisation of the data in the pre-processing stage. In our case, we normalise the sensor data by subtracting the mean sensor data value and dividing by the standard deviation.

For feature extraction we use a CNN. We take the sensor data as a second-order tensor (a matrix) with c channels. The convolution operation is represented by C(*) throughout the network.

The convolution layer 1 (CL_1). The labelled input sensor data are convolved with K1 kernel maps of dimension k1 × k1 × W1, one kernel map at a time, with kernel stride s1 and zero padding. A bias b is added to each convolution between the sensor data and a kernel map. The result then goes through the non-linear activation function ReLU1 to generate CL_1. The dimensions of the resulting CL_1 are [(W/s1) × (H/s1) × K1]. The number of kernel maps represents the depth d_n of the convolution operation, i.e., the number of features that the network has extracted after the convolution.

The convolution layer 2 (CL_2). The operation is exactly similar to the previous one, with the input labelled sensor data replaced by CL_1. CL_1 is convolved with kernel maps k2, with kernel stride s2 and the same zero padding as for the first convolutional layer. A bias is added to each convolution operation, and the result goes through the non-linear activation function ReLU2 to generate CL_2. The dimensions of the resulting CL_2 are [(W/(s1 × s2)) × (H/(s1 × s2)) × K2]. Note that the total depth of the convolved layer is K2 at this point, which is the number of features extracted from the original sensor data so far. In general, our convolution process goes through five convolutional layers, since typical deep CNNs use 5 to 25 distinct layers of pattern recognition.

Finally, CL_5 is fully connected to a hidden layer of r weight arrays, or neurons. The dimension of each weight array is equal to [W/(s1 × s2 × s3 × s4 × s5) × H/(s1 × s2 × s3 × s4 × s5) × K5]. A bias is added to each product of CL_5 with a weight array, which then goes through the activation function ReLU6. A softmax layer with m classifiers is used to yield the resulting final output.
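The dimension bookkeeping of the proposed structure, CL_n of size [(W/(s1 × … × sn)) × (H/(s1 × … × sn)) × Kn], can be sketched as follows, assuming the stated rule that zero padding preserves W × H so that each layer only divides the spatial size by its stride; the strides, depths and input size below are hypothetical values for illustration only:

```python
def cl_dimensions(W, H, strides, depths):
    # Propagate [(W / prod(s)) x (H / prod(s)) x K_n] through the layers.
    dims = []
    for s_n, K_n in zip(strides, depths):
        W, H = W // s_n, H // s_n
        dims.append((W, H, K_n))
    return dims

# Hypothetical sensor-data sizes for illustration only.
for n, d in enumerate(cl_dimensions(64, 64,
                                    strides=[2, 2, 2, 2, 2],
                                    depths=[16, 32, 64, 128, 256]), 1):
    print(f"CL_{n}: {d[0]} x {d[1]} x {d[2]}")
```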
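The final softmax layer with m classifiers maps m real-valued scores to class probabilities; a minimal sketch with placeholder scores:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([1.2, -0.3, 0.8])  # m = 3 hypothetical class scores
probs = softmax(scores)
print(probs, probs.sum())            # probabilities sum to 1
```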
VII. CONCLUSION

The study of the theoretical assumptions and some practical applications of CNNs has shown that the main benefit of using CNNs with respect to standard fully-connected neural networks is the reduced number of parameters to be learned. Decreasing the number of parameters leads to less noise during the training process. The reason is that the number of parameters depends on the kernel width: the wider the kernel, the larger the number of parameters in the model.

On the other hand, CNNs normally need thousands or even millions of labelled data samples. The model parameters become larger if the weight decay parameters are decreased. The dropout rate should be decreased to avoid an increase in the number of iterations needed to converge.

The learning rate should be tuned optimally; a very high or very low rate will lead to optimisation problems and a decrease in the effective capacity of the network. Increasing the number of hidden units increases the representation capacity of the model. Using zero padding before convolution keeps the representation size large.

Finally, the proposed algorithm diagram should be further adapted using real sensor data for performance modelling and identification of classes.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (Adaptive Computation and Machine Learning). The MIT Press, p. 779, 2016.
[2] T. Munakata, Fundamentals of the New Artificial Intelligence: Neural, Evolutionary, Fuzzy and More, 2nd ed. Springer-Verlag, London, p. 225, 2008.
[3] D. Floreano, P. Dürr, and C. Mattiussi, "Neuroevolution: From Architectures to Learning," Evolutionary Intelligence, vol. 1, no. 1, pp. 47–62, Jan. 2008. https://doi.org/10.1007/s12065-007-0002-4
[4] A. Prieto, M. Atencia, and F. Sandoval, "Advances in Artificial Neural Networks and Machine Learning," Neurocomputing, vol. 121, pp. 1–4, Dec. 2013. https://doi.org/10.1016/j.neucom.2013.01.008
[5] M. Dalto, "Deep Neural Networks for Time Series Prediction with Application in Ultra-Short-Term Wind Forecasting," IEEE, pp. 1657–1663, 2015.
[6] A. Ferreira and G. Giraldi, "Convolutional Neural Network Approaches to Granite Tiles Classification," Expert Systems with Applications, vol. 84, pp. 1–11, Oct. 2017. https://doi.org/10.1016/j.eswa.2017.04.053
[7] Y. Bengio, "Learning Deep Architectures for AI," Foundations and Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009. https://doi.org/10.1561/2200000006
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems (NIPS), Nevada, USA, pp. 1106–1114, 2012.
[9] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," published as a conference paper at ICLR, Cornell University Library, 2015.
[10] A. Conneau, H. Schwenk, L. Barrault, and Y. LeCun, "Very Deep Convolutional Networks for Text Classification," Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol. 1, Long Papers, 2017. https://doi.org/10.18653/v1/e17-1104
[11] Y. Zheng, Q. Liu, E. Chen, Y. Ge, and J. L. Zhao, "Time Series Classification Using Multi-Channels Deep Convolutional Neural Networks," Lecture Notes in Computer Science, pp. 298–310, 2014. https://doi.org/10.1007/978-3-319-08010-9_33
[12] U. R. Acharya, H. Fujita, S. L. Oh, Y. Hagiwara, J. H. Tan, and M. Adam, "Application of Deep Convolutional Neural Network for Automated Detection of Myocardial Infarction Using ECG Signals," Information Sciences, vol. 415–416, pp. 190–198, Nov. 2017. https://doi.org/10.1016/j.ins.2017.06.027
[13] K. Wang, Y. Zhao, Q. Xiong, M. Fan, G. Sun, L. Ma, and T. Liu, "Research on Healthy Anomaly Detection Model Based on Deep Learning from Multiple Time-Series Physiological Signals," Scientific Programming, vol. 2016, pp. 1–9, 2016. http://dx.doi.org/10.1155/2016/5642856
[14] R. Liu, G. Meng, B. Yang, C. Sun, and X. Chen, "Dislocated Time Series Convolutional Neural Architecture: An Intelligent Fault Diagnosis Approach for Electric Machine," IEEE Transactions on Industrial Informatics, vol. 13, no. 3, pp. 1310–1320, Jun. 2017. https://doi.org/10.1109/tii.2016.2645238
[15] M. Meng, Y. J. Chua, E. Wouterson, and C. P. K. Ong, "Ultrasonic Signal Classification and Imaging System for Composite Materials via Deep Convolutional Neural Networks," Neurocomputing, vol. 257, pp. 128–135, Sep. 2017. https://doi.org/10.1016/j.neucom.2016.11.066
[16] C. Affonso, A. L. D. Rossi, F. H. A. Vieira, and A. C. P. de L. F. de Carvalho, "Deep Learning for Biological Image Classification," Expert Systems with Applications, vol. 85, pp. 114–122, Nov. 2017. https://doi.org/10.1016/j.eswa.2017.05.039
[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding," Cornell University Library, Jun. 2014.
[18] D. H. Hubel and T. N. Wiesel, "Receptive Fields and Functional Architecture of Monkey Striate Cortex," The Journal of Physiology, vol. 195, no. 1, pp. 215–243, Mar. 1968. https://doi.org/10.1113/jphysiol.1968.sp008455
[19] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout Networks," S. Dasgupta and D. McAllester, eds., ICML'13, pp. 1319–1327, 2013.
[20] K. Fukushima, "Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position," Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
[21] J. Schmidhuber, "Deep Learning in Neural Networks: An Overview," Neural Networks, vol. 61, pp. 85–117, Jan. 2015. https://doi.org/10.1016/j.neunet.2014.09.003
[22] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Representations by Back-Propagating Errors," Nature, vol. 323, no. 6088, pp. 533–536, Oct. 1986. https://doi.org/10.1038/323533a0
[23] Y. LeCun, "Generalization and Network Design Strategies," Technical Report, University of Toronto, 1989.
[24] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, vol. 1, no. 4, pp. 541–551, Dec. 1989. https://doi.org/10.1162/neco.1989.1.4.541
[25] T. Robinson and F. Fallside, "A Recurrent Error Propagation Network Speech Recognition System," Computer Speech & Language, vol. 5, no. 3, pp. 259–274, Jul. 1991. https://doi.org/10.1016/0885-2308(91)90010-n
[26] Y. Bengio, R. De Mori, G. Flammia, and R. Kompe, "Phonetically Motivated Acoustic Parameters for Continuous Speech Recognition Using Artificial Neural Networks," Speech Communication, vol. 11, no. 2–3, pp. 261–271, Jun. 1992. https://doi.org/10.1016/0167-6393(92)90020-8
[27] G. E. Hinton, "To Recognize Shapes, First Learn to Generate Images," Technical Report UTML TR 2006-003, University of Toronto, 2006.
[28] K. Chellapilla, S. Puri, and P. Simard, "High Performance Convolutional Neural Networks for Document Processing," Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule (France), Université de Rennes 1, Suvisoft, 2006.
[29] K.-S. Oh and K. Jung, "GPU Implementation of Neural Networks," Pattern Recognition, vol. 37, no. 6, pp. 1311–1314, Jun. 2004. https://doi.org/10.1016/j.patcog.2004.01.013
[30] S.-H. Zhong, J. Wu, Y. Zhu, P. Liu, J. Jiang, and Y. Liu, "Visual Orientation Inhomogeneity Based Convolutional Neural Networks," 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), Nov. 2016. https://doi.org/10.1109/ictai.2016.0079
[31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. https://doi.org/10.1109/5.726791
[32] S. Albelwi and A. Mahmood, "A Framework for Designing the Architectures of Deep Convolutional Neural Networks," Entropy, vol. 19, no. 6, p. 242, May 2017. https://doi.org/10.3390/e19060242
[33] S. Krig, Computer Vision Metrics. Survey, Taxonomy and Analysis of Computer Vision, Visual Neuroscience, and Deep Learning. Springer, p. 637, 2016. https://doi.org/10.1007/978-3-319-33762-3
[34] J. S. Ren, W. Wang, J. Wang, and S. Liao, "An Unsupervised Feature Learning Approach to Improve Automatic Incident Detection," 2012 15th International IEEE Conference on Intelligent Transportation Systems, Sep. 2012. https://doi.org/10.1109/itsc.2012.6338621
[35] C. Affonso, A. D. Rossi, F. H. A. Vieira, and A. C. P. de L. F. de Carvalho, "Deep Learning for Biological Image Classification," Expert Systems with Applications, vol. 85, pp. 114–122, 2017.
[36] T. Chen, R. Y. He, and X. Wang, "A Gloss Composition and Context Clustering Based Distributed Word Sense Representation Model," Entropy, vol. 17, no. 9, pp. 6007–6024, Aug. 2015. https://doi.org/10.3390/e17096007
[37] J. Bouvrie, Notes on Convolutional Neural Networks, Nov. 2006 [Online]. Available: http://cogprints.org/5869/1/cnn_tutorial.pdf
[38] I. Song, H.-J. Kim, and P. B. Jeon, "Deep Learning for Real-Time Robust Facial Expression Recognition on a Smartphone," 2014 IEEE International Conference on Consumer Electronics (ICCE), Jan. 2014. https://doi.org/10.1109/icce.2014.6776135
[39] H. Tabia and H. Laga, "Learning Shape Retrieval from Different Modalities," Neurocomputing, vol. 253, pp. 24–33, Aug. 2017. https://doi.org/10.1016/j.neucom.2017.01.101
[40] U. R. Acharya, H. Fujita, O. S. Lih, Y. Hagiwara, J. H. Tan, and M. Adam, "Automated Detection of Arrhythmias Using Different Intervals of Tachycardia ECG Segments with Convolutional Neural Network," Information Sciences, vol. 405, pp. 81–90, Sep. 2017. https://doi.org/10.1016/j.ins.2017.04.012
[41] U. R. Acharya, S. L. Oh, Y. Hagiwara, J. H. Tan, and H. Adeli, "Deep Convolutional Neural Network for the Automated Detection and Diagnosis of Seizure Using EEG Signals," Computers in Biology and Medicine, Sep. 2017. https://doi.org/10.1016/j.compbiomed.2017.09.017
[42] S. Guzel Aydin, T. Kaya, and H. Guler, "Wavelet-Based Study of Valence–Arousal Model of Emotions on EEG Signals with LabVIEW," Brain Informatics, vol. 3, no. 2, pp. 109–117, Jan. 2016. https://doi.org/10.1007/s40708-016-0031-9

Ivars Namatēvs holds the Mg. sc. ing. degree from Riga Technical University and an MBA degree from Riga Business School. His research interests include deep artificial intelligence, especially deep convolutional networks and data mining methods and their application, as well as genetic algorithms. The most important publications: I. Namatēvs, "Concept Analysis of Complex Adaptive Systems," International Scientific Forum: Proceedings of XVI International Scientific Conference: Towards Smart, Sustainable and Inclusive Europe: Challenges for Future Development, Riga, Latvia, 28–30 May 2015; Namatēvs, I., Aleksejeva, L., Poļaka, I., "Neural Network Modelling for Sports Performance Classification as a Complex Socio-Technical System," Information Technology and Management Science, vol. 19, pp. 45–52, 2016, doi: 10.1515/itms-2016-0010; Namatēvs, I., "Exploring Model-Driven Domain Analysis for Software Engineering," Proceedings of XVI International Scientific Conference, Turība University, Riga, Latvia, 18 May 2017; Namatēvs, I., Aleksejeva, L., "Decision Algorithm for Heuristic Donor-Recipient Matching," Mendel Soft Computing Journal, vol. 23, no. 1, pp. 33–40, June 2017, ISSN 1803-3814.
E-mail: ivars@turiba.lv