Time Series Classification with Deep Learning
Marco Del Pra
May 5, 2020

Outline: Introduction, Time Series Classification, Convolutional Neural Networks, Inception Time, Echo State Networks, Conclusions, Bibliography
Motivation
In recent years, Time Series Classification (TSC) has become one of the
most challenging problems in data mining
Many classification problems can be treated as Time Series Classification
problems
Time series are present in many real-world applications:
health care,
human activity recognition,
cyber-security,
finance.
Many areas are strongly increasing their interest in applications based on time
series
Non-Deep-Learning algorithms require some kind of feature engineering before
the classification
Deep Learning algorithms already incorporate this kind of feature engineering
internally
Examples of Time Series Classification Problems
Electrocardiogram analysis
Electrocardiogram records are saved in time series form
Distinguishing a disease is a TSC problem
Gesture recognition
Many devices record series of images to interpret the user’s gestures
Identifying the correct gesture is a TSC problem
Anomaly detection
Anomaly detection is the identification of unusual events
Often the data in anomaly detection are time series
Detecting and recognizing an anomaly is a TSC problem
Problem definition
Given a set of objects with the same structure and a fixed set of different classes,
a dataset is a collection of pairs (object, class)
Given a dataset, the goal of a Classification algorithm is to build a model that
assigns to an object the probability of belonging to each of the possible classes,
according to the features of the objects associated with each class
Univariate time series: ordered set of real values
M-dimensional multivariate time series: M different univariate time series with
the same length
Time Series Classification problem: Classification problem where the objects of
the dataset are univariate or multivariate time series
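To make these objects concrete, a TSC dataset can be pictured as an array of series plus a vector of class labels. A minimal NumPy sketch, with purely illustrative shapes:

```python
import numpy as np

N, T, M = 100, 128, 3   # illustrative: 100 objects, length 128, M = 3 dimensions

# Univariate case: each object is an ordered set of T real values
X_uni = np.random.randn(N, T)

# Multivariate case: each object is M univariate time series of the same length
X_multi = np.random.randn(N, T, M)

# One class label per object, drawn from a fixed set of 4 classes
y = np.random.randint(0, 4, size=N)
```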
Perceptron (Neuron)
The Perceptron (Neuron) is the basic element of many machine learning
algorithms
The goal of a Perceptron is to compute the weighted sum of the input values and
then apply an activation function to the result
Most common activation functions: sigmoid, hyperbolic tangent, rectifier (ReLU)
The result of the activation function is referred to as the activation of the
Perceptron and represents its output value
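In code, a Perceptron is a one-liner. A minimal NumPy sketch (the input values, weights, and the tanh choice are illustrative):

```python
import numpy as np

def perceptron(x, w, b, activation=np.tanh):
    """Weighted sum of the inputs followed by an activation function."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # input values
w = np.array([0.3, 0.8, -0.5])   # trainable weights
b = 0.1                          # bias term
print(perceptron(x, w, b))       # the activation (output value)
```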
Multi Layer Perceptron Architecture
A Multi Layer Perceptron (MLP) is a class of feedforward neural networks, with
one Input Layer, one or more Hidden Layers, and one Output Layer
A Multi Layer Perceptron is fully connected: every neuron of a layer is connected to every neuron of the next layer
Each node of the hidden layers and of the output layer is a Perceptron
The output of the Multi Layer Perceptron is obtained by computing in sequence the
activations of its Perceptrons
The function that connects the input to the output depends on the values of the
weights.
Classification with Multi Layer Perceptron
The Multi Layer Perceptron is commonly used for Classification problems
It’s necessary to represent the pairs (object, class) in the dataset in a more
suitable way:
Every object must be represented with a vector, called input vector
Every class must be represented with its one-hot label vector, called target
For training, the MLP uses the Backpropagation technique, which iterates over the input
vectors
Iteration steps:
Computation of the output for the current input vector
Computation of the prediction error with a cost function
Update of the weights with gradient descent
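These three steps can be written out for the simplest possible case. A minimal NumPy sketch of one training iteration for a single-layer softmax classifier (a stand-in for one Backpropagation step; all values are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.array([0.2, -0.4, 1.5])         # input vector
t = np.array([0.0, 1.0, 0.0])          # one-hot target
W = np.zeros((3, 3)); b = np.zeros(3)  # trainable weights
lr = 0.1                               # learning rate

y = softmax(W @ x + b)                 # 1. output for the current input vector
loss = -np.sum(t * np.log(y))          # 2. prediction error (cross-entropy cost)
grad_z = y - t                         # gradient of the loss w.r.t. the pre-activations
W -= lr * np.outer(grad_z, x)          # 3. update of the weights ...
b -= lr * grad_z                       #    ... with gradient descent
```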
Classification with Multi Layer Perceptron
Backpropagation minimizes the loss on the training data
After the training, the model is able to predict the estimated probabilities of an
object belonging to each class
Why not use the MLP for TSC, taking the whole multivariate time series as
input?
MLPs don’t work well for TSC problems, because the length of the time series severely
hurts the computational speed
It’s necessary to extract the relevant features of the input time series
The big advantage of Deep Learning algorithms is that these relevant features are
learned during the training
After many layers used for the extraction of the relevant features, Deep Learning
architectures use algorithms like the MLP to obtain the classification
Deep Learning for Time Series Classification
A Deep Learning algorithm is a composition of several layers that implement
non-linear functions
Every layer takes as input the output of the previous layer and applies its
non-linear transformation to compute its own output
The behavior of the non-linear transformations is controlled by trainable
parameters
Often, the last layer is a Multi Layer Perceptron or a Ridge regressor
We consider 3 different Deep Learning Architectures:
Convolutional Neural Network
Inception Time
Echo State Network
Convolutional Neural Networks Architecture
A Convolutional Neural Network (CNN) is able to successfully capture spatial
and temporal patterns through the application of trainable filters
The pre-processing required by a Convolutional Neural Network is much lower than
that of other classification algorithms
A Convolutional Neural Network is composed of three different layers:
1. Convolutional Layer
2. Pooling Layer
3. Fully-Connected Layer
Several Convolutional Layers and Pooling Layers are alternated before the
Fully-Connected Layer
Convolutional Layer
The Convolutional Layer performs a convolution of an input series of feature maps with a
filter matrix to obtain as output a different series of feature maps
The convolution is defined by a set of filters, which are fixed-size matrices.
(Figures: a single convolution step, and the convolution between one input feature
map and a filter)
Convolutional Layer executes the convolution between every filter and every input
feature map
The values of the filters are treated as trainable weights and are learned
during training.
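To make the operation concrete, a minimal NumPy sketch of a 1-D convolution (strictly speaking a cross-correlation, as in most deep learning libraries; input and filter values are illustrative):

```python
import numpy as np

def conv1d(x, f, stride=1, padding=0):
    """Slide the filter f over the input x and take a dot product at each step."""
    x = np.pad(x, padding)
    out_len = (len(x) - len(f)) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + len(f)], f)
                     for i in range(out_len)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # one input feature map
f = np.array([1.0, 0.0, -1.0])           # one filter (trainable in a real CNN)
print(conv1d(x, f))                      # -> [-2. -2. -2.]
```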
Stride
Stride controls how the filter convolves around one input feature map.
The value of stride indicates by how many units the filter is shifted at a time.
Padding
Padding indicates how many extra columns and rows to add outside an input
feature map, before applying a convolution filter
All the cells of the new columns and rows have a dummy value, usually 0.
Padding is used to preserve the original size of the input feature map after the
Convolutional Layer, or to make it decrease more slowly
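The interaction of stride and padding with the output size follows a standard formula (stated here for the 1-D case; it holds per axis in 2-D). For input length $n_{in}$, filter size $k$, stride $s$ and padding $p$:

$$n_{out} = \left\lfloor \frac{n_{in} + 2p - k}{s} \right\rfloor + 1$$

For example, with $n_{in} = 100$, $k = 3$, $s = 1$ and $p = 1$: $n_{out} = \lfloor (100 + 2 - 3)/1 \rfloor + 1 = 100$, so the original size is preserved.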
Pooling Layer
The purpose of Pooling is to achieve a dimension reduction of feature maps
Pooling is applied to sliding windows of fixed size across the width and height of
every input feature map
There are two types of pooling: Max Pooling and Average Pooling.
For every sliding window the result of the pooling is the maximum or the average
value
Max Pooling works as a noise suppressant, discarding noisy activations.
Stride and padding must also be specified for the Pooling Layer.
The advantage of the pooling operation is that it down-samples the convolutional output
bands, thus reducing variability in the hidden activations.
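A minimal NumPy sketch of both pooling types over sliding windows (window size, stride, and input values are illustrative):

```python
import numpy as np

def pool1d(x, size, stride, reduce=np.max):
    """Apply max or average pooling over sliding windows of a 1-D feature map."""
    out_len = (len(x) - size) // stride + 1
    return np.array([reduce(x[i * stride : i * stride + size])
                     for i in range(out_len)])

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
print(pool1d(x, size=2, stride=2))                  # Max Pooling:     [3. 5. 6.]
print(pool1d(x, size=2, stride=2, reduce=np.mean))  # Average Pooling: [2. 3.5 5.]
```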
Fully-Connected Layer
The goal of the Fully-Connected Layer is to learn non-linear combinations of the
high-level features
Usually the Fully Connected Layer is implemented with a Multi Layer Perceptron.
After several convolution and pooling operations, the output series of feature
maps are flattened into a vector
The flattened vector is the input of the Multi Layer Perceptron
The output layer has a number of neurons equal to the number of possible classes
Backpropagation is applied at every iteration of training, and finally the model is
able to classify the time series
Hyperparameters
Number of convolution filters
Too few filters cannot extract enough features to achieve classification
Too many filters add little benefit and are computationally expensive
Convolution filter size and initial values
Smaller filters collect as much local information as possible
Bigger filters represent more global, high-level and representative information
The filters are usually initialized with random values.
Pooling method and size
Method: Max or Average
Size: as it increases, the dimension reduction is greater, but more information is lost
Weight initialization
The weights are usually initialized with small random numbers
Activation function
Rectifier, sigmoid or hyperbolic tangent are usually chosen
Number of epochs
Number of times the entire training set passes through the model
Implementation
Building a Convolutional Neural Network is very easy using the Python library Keras
To build a CNN in Keras, it is sufficient to:
declare a Sequential model
add the desired Convolutional, MaxPooling and Dense Keras Layers to the Sequential
model
specify number of filters and filter size for Convolutional Layer
specify pooling size for Pooling Layer
To build and compile the model, Keras requires:
the input shape (passed to the first layer, not to compile())
the optimizer
the loss function
a list of metrics
To train a model in Keras it’s sufficient to call the function fit() specifying the
needed parameters:
the training data (input data and targets),
the number of epochs
the validation data
To use the model, pass an array of inputs to the function predict(); it returns
the array of outputs
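Putting the pieces together, a minimal end-to-end sketch (shapes, layer sizes, and the synthetic data are illustrative assumptions, not settings from a real experiment):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_timesteps, n_features, n_classes = 100, 1, 4   # illustrative shapes

model = keras.Sequential([
    layers.Conv1D(32, kernel_size=7, activation="relu",
                  input_shape=(n_timesteps, n_features)),  # input shape on the first layer
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Synthetic data, just to make the sketch runnable
x_train = np.random.randn(64, n_timesteps, n_features)
y_train = keras.utils.to_categorical(np.random.randint(n_classes, size=64))

model.fit(x_train, y_train, epochs=2, validation_split=0.2, verbose=0)
probs = model.predict(x_train[:5])   # array of class probabilities
```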
Inception Time Architecture
Recently, a deep Convolutional Neural Network called Inception
Time was introduced.
This kind of network shows high accuracy and very good scalability.
The Inception Network consists of a series of Inception Modules followed by a
Global Average Pooling Layer and a Fully Connected Layer
A residual connection is added every third Inception Module
Inception Module
The Inception Module consists of 4 layers:
Bottleneck Layer
A set of parallel Convolutional Layers with different filter sizes
MaxPooling Layer
Depth Concatenation Layer
The network is able to extract relevant features at multiple resolutions thanks to
the use of filters with different sizes
The internal layers choose which filter size is relevant for learning the relevant features
This is very helpful to identify a high-level feature that can have different sizes on
different input feature maps.
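For reference, a sketch of such a module written with the Keras functional API. The bottleneck width of 32 and the filter sizes 10/20/40 are in the spirit of the defaults reported in the InceptionTime paper; the input shape is an illustrative assumption:

```python
from tensorflow import keras
from tensorflow.keras import layers

def inception_module(inputs, n_filters=32, kernel_sizes=(10, 20, 40)):
    # Bottleneck Layer: reduce the depth before the expensive convolutions
    bottleneck = layers.Conv1D(n_filters, 1, padding="same", use_bias=False)(inputs)
    # Parallel Convolutional Layers with different filter sizes
    branches = [layers.Conv1D(n_filters, k, padding="same", use_bias=False)(bottleneck)
                for k in kernel_sizes]
    # MaxPooling Layer branch, followed by a 1x1 convolution
    pool = layers.MaxPooling1D(pool_size=3, strides=1, padding="same")(inputs)
    branches.append(layers.Conv1D(n_filters, 1, padding="same", use_bias=False)(pool))
    # Depth Concatenation Layer
    x = layers.Concatenate(axis=-1)(branches)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

inputs = keras.Input(shape=(100, 1))       # illustrative input shape
model = keras.Model(inputs, inception_module(inputs))
```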
Receptive Field and results
A neuron in an Inception Network depends only on a region of the input feature
maps, which is called the Receptive Field of the neuron
For time series data, the total Receptive Field of an Inception Network of depth $d$,
with filter length $k_i$ at layer $i$, is given by

$$RF = 1 + \sum_{i=1}^{d} (k_i - 1)$$
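As a quick illustrative computation (not a result from the paper): a network of depth $d = 6$ whose layers all use filters of length $k_i = 40$ has a Receptive Field of $1 + 6 \cdot 39 = 235$ time steps, so it can only combine information from at most 235 consecutive points of the input series.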
It’s very interesting to investigate how the accuracy of an Inception Network
changes as the Receptive Field varies
The Figure shows Inception Network’s accuracy over a simulation dataset, with
respect to the filter length as well as the input time series length
It is evident that a longer filter is required to produce more accurate results
Receptive Field and results
The Figure shows Inception Network’s accuracy over a simulation dataset, with
respect to the network’s depth as well as the length of the input time series.
It turns out that adding more layers does not necessarily improve
the network’s performance, particularly for datasets with a small training set
A single Inception Network sometimes exhibits high variance in accuracy
For this reason Inception Time is implemented as an ensemble of many Inception
Networks
In this way the algorithm improves its stability, and shows high accuracy and very
good scalability
Different experiments have shown that its time complexity grows linearly with
both the training set size and the time series length
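The ensembling step itself is simple: the class probabilities predicted by the individual networks are averaged. A minimal sketch (the networks are assumed to be already trained Keras models):

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the class probabilities predicted by several Inception Networks."""
    probs = np.stack([m.predict(x) for m in models])  # (n_models, n_samples, n_classes)
    return probs.mean(axis=0)

# predicted_probs = ensemble_predict(trained_networks, x_test)
# predicted_labels = predicted_probs.argmax(axis=1)
```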
Implementation
On GitHub you can find a full implementation of Inception Time, written in
Python using the Keras library, at this link:
https://github.com/hfawaz/InceptionTime
This implementation is based on 3 main files:
File main.py contains the necessary code to run an experiment
File inception.py contains the Inception Network implementation
File nne.py contains the code that ensembles a set of Inception Networks
The implementation uses the Keras Model class, since some layers of
InceptionTime work in parallel
The code that implements the Inception Module building block is very similar to
that described for CNNs, and can be easily included in Keras-based code in
order to implement customized architectures
The structure of the code that implements compilation, training and use of the
model is very similar to that described for Convolutional Neural Networks
Recurrent Neural Networks
Echo State Networks are a type of Recurrent Neural Networks
Recurrent Neural Networks are networks of neuron-like nodes organized into
successive layers
Like in standard Neural Networks, neurons are divided into an Input Layer, a Hidden
Layer, and an Output Layer
Each connection between neurons has a corresponding trainable weight
Every neuron is assigned to a fixed time step
The neurons in the hidden layer are also connected forward along the time direction
The input and output neurons are connected only to the hidden neurons with the
same assigned time step
The activation of the neurons is computed in time order
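In formulas, the standard RNN update at time step $t$ is

$$h_t = f(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h), \qquad y_t = g(W_{hy}\, h_t + b_y)$$

where $x_t$, $h_t$ and $y_t$ are the input, hidden activation and output, $f$ and $g$ are activation functions, and the $W$ matrices and biases are the trainable weights (textbook notation, not taken from the slides).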
Motivation of Echo State Networks
Recurrent Neural Networks (RNNs) are rarely applied for Time Series
Classification mainly due to three factors:
1. This type of architecture is designed mainly to predict an output for each element of
the time series
2. Recurrent Neural Networks typically suffer from the vanishing gradient problem
3. The training of an RNN is hard to parallelize and computationally expensive
Echo State Networks were designed to mitigate the problems of Recurrent Neural
Networks by eliminating the need to compute the gradient for the hidden layers
This reduces the training time and avoids the vanishing gradient problem
Many results show that Echo State Networks are really helpful to handle chaotic
time series
Echo State Networks Architecture
The Architecture of an Echo State Network consists of an Input Layer, a
Reservoir, a Dimension Reduction Layer, a Readout, and an Output Layer
The Reservoir is organized like a sparsely connected random RNN
The Dimension Reduction algorithm is usually implemented with PCA
The Readout is usually implemented as an MLP or a Ridge regressor
The weights between the Input layer and the Reservoir and those in the Reservoir
are randomly assigned and not trainable
The weights in the Readout are trainable
Reservoir
The Reservoir is connected to the Input Layer, and consists of a set of internal
sparsely-connected neurons and of its own output neurons.
In the Reservoir there are 4 types of weights:
the input weights
the internal weights
the output weights
the feedback weights (from the Reservoir output back to the internal neurons)
All these weights are randomly initialized, time independent, and not trainable
The output of the Reservoir is computed separately for every time step
At every time step, the activation of every internal and output neuron is computed
This output is added to the total Reservoir output, but also acts as input for the
next time step through the feedback weights
The Reservoir creates a recurrent non-linear embedding of the input into a
higher-dimensional representation
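A minimal NumPy sketch of the Reservoir state update (sizes are illustrative; the sparse connectivity and the spectral-radius rescaling follow the usual reservoir-computing conventions, and the feedback term is omitted for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, sparsity, spectral_radius = 1, 100, 0.1, 0.9

# Random, untrained weights
W_in = rng.uniform(-1, 1, size=(n_res, n_in))                # input weights
W = rng.uniform(-1, 1, size=(n_res, n_res))                  # internal weights
W[rng.random((n_res, n_res)) > sparsity] = 0.0               # sparse connectivity
W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # rescale the largest eigenvalue

def run_reservoir(series):
    """Embed a univariate series into the sequence of reservoir states."""
    h = np.zeros(n_res)
    states = []
    for x_t in series:
        h = np.tanh(W_in @ np.atleast_1d(x_t) + W @ h)       # state update at one time step
        states.append(h)
    return np.array(states)                                  # shape: (time steps, n_res)

states = run_reservoir(np.sin(np.linspace(0, 10, 200)))
```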
Dimension Reduction
By choosing the correct dimension reduction, it is possible to reduce the execution
time without lowering the accuracy
The Figure shows how training time and average classification accuracy vary with
respect to the subspace dimension D after dimension reduction, for a particular
experiment
Training time increases approximately linearly with D
Accuracy stops growing when D = 75
In this case the best value for the subspace dimension is therefore 75
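Since the reduction is usually PCA, this layer can be sketched with scikit-learn (the value D = 75 mirrors the experiment above; states is assumed to be the matrix of reservoir states, e.g. from the previous sketch):

```python
from sklearn.decomposition import PCA

# Project the reservoir states onto a D-dimensional subspace
pca = PCA(n_components=75)
reduced_states = pca.fit_transform(states)   # shape: (time steps, 75)
```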
Implementation and Hyperparameters
A full implementation in Python of Echo State Networks is available on GitHub at
this link:
https://github.com/FilippoMB/Reservoir-Computing-framework-for-multivariate-time-series-classification/blob/master/README.md
The code uses the libraries Scikit-learn and SciPy.
The main class RC_classifier, contained in the file modules.py, makes it possible to build,
train, and test an Echo State Network classifier
The most important hyperparameters in the Reservoir are:
the number of neurons in the Reservoir
the percentage of nonzero connection weights
the largest eigenvalue of the reservoir matrix of connection weights
The most important hyperparameters in other layers are:
the algorithm for the Dimension Reduction Layer
the subspace dimension after the Dimension Reduction Layer
the type of Readout used for classification
the number of epochs
The structure of the code that implements training and use of the model is very
similar to that described for Convolutional Neural Networks
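A purely hypothetical usage sketch follows; the argument and method names are illustrative placeholders for the hyperparameters listed above, not the actual signature of RC_classifier (check the repository's README for that):

```python
# Hypothetical usage sketch; names are illustrative, not the real API.
from modules import RC_classifier   # from the repository above

clf = RC_classifier(
    n_internal_units=500,   # number of neurons in the Reservoir
    connectivity=0.3,       # percentage of nonzero connection weights
    spectral_radius=0.9,    # largest eigenvalue of the reservoir weight matrix
    dimred_method="pca",    # algorithm for the Dimension Reduction Layer
    n_dim=75,               # subspace dimension after the reduction
    readout_type="mlp",     # type of Readout used for classification
)
clf.train(X_train, y_train)
predictions = clf.test(X_test, y_test)
```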
Conclusions
Convolutional Neural Networks are the most popular Deep Learning technique for
Time Series Classification
The main difficulties in using Convolutional Neural Networks:
The length of the time series can slow down training
Results can be less accurate than expected with chaotic input time series
Results can be less accurate than expected with input time series in which the same
relevant feature can have different sizes
To solve these problems, InceptionTime and Echo State Networks perform better
than the other proposed architectures
InceptionTime:
speeds up the training process using an efficient dimension reduction (the Bottleneck Layer)
performs really well in handling input time series in which the same relevant feature can
have different sizes
Echo State Networks:
Speed up the training process since they are very sparsely connected with most of their
weights fixed a priori
Really helpful to handle chaotic input time series
In conclusion, high accuracy and high scalability make these new architectures
perfect candidates for product development
Time Series Classification with Deep Learning
Introduction Time Series Classification Convolutional Neural Networks Inception Time Echo State Networks Conclusions Bibliography
Filippo Maria Bianchi, Simone Scardapane, Sigurd Løkse, Robert Jenssen.
Reservoir computing approaches for representation and classification of
multivariate time series.
Hassan Ismail Fawaz, Benjamin Lucas, Germain Forestier, Charlotte Pelletier,
Daniel F. Schmidt, Jonathan Weber, Geoffrey I. Webb, Lhassane Idoumghar,
Pierre-Alain Muller, François Petitjean.
InceptionTime: Finding AlexNet for Time Series Classification.