Improving Position Encoding of Transformers For Multivariate Time Series Classification
Abstract
Transformers have demonstrated outstanding performance in many
applications of deep learning. When applied to time series data, trans-
formers require effective position encoding to capture the ordering of
the time series data. The efficacy of position encoding in time series
analysis is not well-studied and remains controversial, e.g., whether
it is better to inject absolute position encoding or relative position
encoding, or a combination of them. In order to clarify this, we first
review existing absolute and relative position encoding methods when
applied in time series classification. We then propose a new absolute position encoding method dedicated to time series data, called time Absolute Position Encoding (tAPE). Our new method incorporates the series length and input embedding dimension in absolute position encoding. Additionally, we propose a computationally efficient implementation of relative position encoding (eRPE) to improve generalisability for time series. We then propose a novel multivariate time series classification (MTSC) model, named ConvTran, that combines tAPE/eRPE with convolution-based input encoding to improve the position and data embedding of time series data. The proposed absolute and relative position encoding methods are simple and efficient, and they can easily be integrated into transformer blocks and used for downstream tasks such as forecasting, extrinsic regression, and anomaly detection. Extensive experiments on 32 multivariate time-series datasets show that ConvTran is significantly more accurate than previous state-of-the-art deep learning models for time series classification.
1 Introduction
A time series is an ordered sequence of observations recorded over time. Time series
data can be univariate, where only a sequence of values for one variable is col-
lected; or multivariate, where data are collected on multiple variables. There
are many applications that require time series analysis, such as human activity
recognition [1], diagnosis based on electrocardiogram (ECG), electroencephalo-
gram (EEG), and systems monitoring problems [2]. Many of these applications
are inherently multivariate in nature – various sensors are used to measure
human activities; EEGs use a set of electrodes (channels) to measure brain
signals at different locations of the brain. Hence, multivariate time-series analy-
sis methods such as classification and segmentation are of great current interest
[3–5].
Convolutional neural networks (CNNs) have been widely employed in time
series classification [4, 5]. Many studies have shown that convolution layers
tend to have strong generalization with fast convergence due to their strong
inductive bias [6]. While CNN-based models are excellent for capturing local
temporal/spatial correlations, these models cannot effectively capture and
utilize long-range dependencies. Also, they only consider the local order of
data points in a time series rather than the order of all data points globally.
Due to this, many recent studies have used recurrent neural networks (RNN)
such as LSTMs to capture this information [7]. However, RNN-based models
are computationally expensive, and their capability to capture long-range
dependencies is limited [8, 9].
On the other hand, attention models can capture long-range dependencies,
and their broader receptive fields provide more contextual information, which
can improve the models’ learning capacity. Not surprisingly, with the success of
attention models in natural language processing [8, 10], many previous studies
have attempted to bring the power of attention models into other domains
such as computer vision [11] and time series analysis [9, 12, 13].
The transformer’s core is self-attention [8], which is capable of modeling
the relationships between the elements of an input time series. Self-attention, however, has a limitation
— it cannot capture the ordering of input series. Hence, adding explicit rep-
resentations of position information is especially important for the attention
since the model is otherwise entirely invariant to input order, which is undesir-
able for modeling sequential data. This limitation is even worse in time series
data since, unlike image and text data, which use Word2Vec-like embeddings, time series data have a less informative data context.
There are two main methods for encoding positional information in trans-
formers: absolute and relative. Absolute methods, such as those used in [8, 10],
assign a unique encoding vector to each position in the input sequence based
on its absolute position in the sequence. These encoding vectors are combined
with the input encoding to provide positional information to the model. On
the other hand, relative methods [14, 15] encode the relative distance between
two elements in the sequence, rather than their absolute positions. The model
learns to compute the relative distances between any two positions during
training and looks up the corresponding embedding vectors in a pre-defined
table to obtain the relative position embeddings. These embeddings are used
to directly modify the attention matrix. Position encoding has been verified to
be effective in natural language processing and computer vision [16]. However,
in time series classification, the efficacy is still unclear.
The original absolute position encoding was proposed for language modeling, where high embedding dimensions such as 512 or 1024 are usually used for position embedding of inputs with a length of 512 [8]. For time series tasks, however, embedding dimensions are relatively low, and the series can have a wide variety of lengths (from very short to very long). In this paper, for the first time, we
study the efficiency (i.e. how well resources are utilized) and the effectiveness
(i.e. how well the encodings achieve their intended purpose) of existing abso-
lute and relative position encodings for time series data. We then show that the
existing absolute position encodings are ineffective with time series data. We
introduce a novel time series-specific absolute position encoding method that
takes into account the series embedding dimension and length. We show that
our new absolute position encoding outperforms the existing absolute position
encodings in time series classification tasks.
Additionally, since the existing relative position encodings have a large memory overhead and require a large number of parameters to be trained, they are very likely to overfit on time series data. We propose a novel compu-
tationally efficient implementation of relative position encoding to improve
their generalisability for time series. We show that our new relative position
encoding outperforms the existing relative position encodings in time series
classification tasks. We then propose a novel time series classification model
based on the combination of our proposed absolute/relative position encod-
ings named ConvTran to improve the position embedding of time series data.
We further enrich the data embedding of time series using a CNN rather than
linear encoding. Our extensive experiments on 32 benchmark datasets show
ConvTran is significantly more accurate than the previous state-of-the-art in
deep learning models for time series classification (TSC). We believe our novel
position encodings can boost the performance of other transformer-based TSC
models.
2 Related Work
In this section, we briefly discuss the state-of-the-art multivariate time series
classification (MTSC) algorithms, as well as CNN and attention-based mod-
els that have been applied to MTSC tasks. We refer interested readers to the
corresponding papers or the recent survey on deep learning for time series
classification [17] for a more detailed description of these algorithms and
models.
Many algorithms have been developed for univariate time series classification, several of them being the state of the art in their respective domains. Since these algorithms were designed for univariate time series, adapting them to multivariate time series is not straightforward. Hence, they were adapted for multivariate time series through ensembling over models built on each dimension independently.
This means that they are computationally very expensive, especially when the number of channels is large. Recently, the latest HIVE-COTE version, HIVE-COTE v2.0 (HC2), was proposed [24]. It is currently the most accurate classifier for both univariate and multivariate TSC tasks [24]. However, despite being the most accurate on 26 benchmark MTSC datasets, which are relatively small, HC2 does not scale to large datasets with long time series or to datasets with many channels.
Several attention-based MTSC models expand the hidden state and then project the expanded hidden state back to the original size to capture the temporal and spatial interaction.
3 Background
This section provides a basic definition of self-attention and an overview of
current position encoding models. Note that position encoding refers to the
method that integrates position information, e.g., absolute or relative. Position
embedding refers to a numerical vector associated with position encoding.
3.1 Problem Definition
Given a set of multivariate time series X = {x1 , ..., xn }, where each xi ∈ R^{dx ×L} has dx channels and length L, and the corresponding labels Y = {y1 , ..., yn }, where yi ∈ {1, ..., c} and c is the number of classes, the aim is to train a neural network classifier to map the set X to Y .
3.2 Self-Attention
The first attention mechanisms were proposed in the context of natural lan-
guage processing [30]. While these early mechanisms still relied on a recurrent neural network at their core, Vaswani et al. [8] proposed the transformer model, which relies on atten-
tion only. Transformers map a query and a set of key-value pairs to an output.
More specifically, for an input series x = {x1 , x2 , ..., xL }, self-attention computes an output series z = {z1 , z2 , ..., zL }, where zi ∈ R^{dz} is computed as a weighted sum of the input elements:
$$z_i = \sum_{j=1}^{L} \alpha_{i,j} \, (x_j W^V) \quad (1)$$
$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{L} \exp(e_{i,k})} \quad (2)$$
$$e_{i,j} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_z}} \quad (3)$$
where W^Q , W^K and W^V are learnable projection matrices and dz is the projection dimension.
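For concreteness, the following is a minimal PyTorch sketch of single-head self-attention as defined in Equations 1-3. The class name, argument names and the choice of dz are ours, and the sketch omits masking, dropout and multiple heads.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention (Equations 1-3), with no position information."""
    def __init__(self, d_model: int, d_z: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_z, bias=False)
        self.W_k = nn.Linear(d_model, d_z, bias=False)
        self.W_v = nn.Linear(d_model, d_z, bias=False)

    def forward(self, x):                                   # x: (batch, L, d_model)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        e = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5     # Equation 3, (batch, L, L)
        alpha = torch.softmax(e, dim=-1)                     # Equation 2
        return alpha @ v                                     # Equation 1, (batch, L, d_z)
```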
Absolute position encoding injects order information by adding a position embedding pi to the input embedding xi at each time step i:
$$x_i = x_i + p_i \quad (4)$$
where the position embedding pi ∈ R^{dmodel} . There are several options for absolute position encoding, including fixed encodings generated by sine and cosine functions with different frequencies (vanilla APE) and learnable encodings through trainable parameters (we refer to this as the Learn method) [8, 10]. Using sine and cosine functions for the fixed position encoding, the dmodel -dimensional embedding of the i-th time step is given by:
$$p_i^{(2k)} = \sin(\omega_k \, i), \qquad p_i^{(2k+1)} = \cos(\omega_k \, i), \qquad \omega_k = 10000^{-2k/d_{model}} \quad (5)$$
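As an illustration only, the fixed sinusoidal encoding of Equation 5 can be generated as follows (the function name is ours):

```python
import torch

def vanilla_ape(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal absolute position embeddings (Equation 5); d_model must be even."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (L, 1)
    k = torch.arange(0, d_model, 2, dtype=torch.float32)                  # even dimension indices
    omega = 10000.0 ** (-k / d_model)                                     # frequencies omega_k
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * omega)
    pe[:, 1::2] = torch.cos(positions * omega)
    return pe   # added to the input embeddings as in Equation 4: x_i = x_i + p_i
```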
Relative position encoding, in contrast, incorporates the pairwise distance between positions i and j into self-attention, typically by adding learnable relative position embeddings to the values and keys:
$$z_i = \sum_{j=1}^{L} \alpha_{i,j} \, (x_j W^V + p^V_{i,j}) \quad (6)$$
$$e_{i,j} = \frac{(x_i W^Q + p^Q_{i,j})(x_j W^K + p^K_{i,j})^T}{\sqrt{d_z}} \quad (7)$$
By doing so, the pairwise positional relation is trained during transformer
training.
Shaw et al. [14] proposed the first relative position encoding for self-
attention. Relative positional information is supplied to the model on two
levels: values and keys. First, relative positional information is included in
the model as an additional component to the keys. The softmax operation in Equation 3 remains unchanged from vanilla self-attention. Lastly, relative positional information is resupplied as a sub-component of the values matrix. In addition, the authors assume that relative position information is not useful beyond a certain distance, so they introduce a clipping function, clip(x, k) = max(−k, min(k, x)), which limits the relative distance j − i used to index the position embeddings and thereby reduces the number of parameters.
Fig. 1 Sinusoidal absolute position encoding. a) The dot product of two sinusoidal position embeddings at distance K, for various embedding dimensions. b) 128-dimensional sinusoidal position encoding vectors for positions 1 and 30 in a series of length 30.
As shown in Fig. 1a, for lower embedding dimensions (thin blue and orange lines), the dot product does not always decrease as the distance between two positions increases, whereas for higher embedding dimensions it decreases monotonically with distance. We call the latter behaviour the distance awareness property; it disappears when lower embedding dimensions, such as 64, are used for position encoding.
While high embedding dimensions show a desirable monotonous decrease
trend when the distance between two positions increases (see red line in
Fig.1a), they are not suitable for encoding time series datasets. The reason is
that most time series datasets have relatively low input dimensionality (e.g.,
28 out of 32 datasets have fewer than 64 input dimensions), and higher embed-
ding dimensions may yield inferior model throughput due to extra parameters
(increasing the chances of overfitting the model).
On the other hand, in low embedding dimensions, the similarity value between two random embedding vectors is high, making the embedding vectors very similar to each other. In other words, we cannot fully utilise the embedding vector space to differentiate between two positions. Fig. 1b depicts the embedding vectors of the first and last positions for an embedding dimension of 128 and a series length of 30. In this figure, the two embedding vectors coincide in almost half of their dimensions. This is called the anisotropic phenomenon [33]. Anisotropy makes position encoding ineffective in low embedding dimensions, as the embedding vectors become similar to each other, as shown in Fig. 1a (the blue line).
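The effect described above can be checked numerically by reusing the vanilla_ape sketch from the background section (a quick illustration, not part of the model):

```python
import torch  # assumes vanilla_ape from the earlier sketch is in scope

# Dot product between position 0 and position K for a low and a high embedding dimension.
for d_model in (64, 512):
    pe = vanilla_ape(1000, d_model)
    sims = [torch.dot(pe[0], pe[k]).item() for k in (1, 10, 100, 500)]
    print(d_model, [round(s, 2) for s in sims])
# With d_model = 512 the similarity shrinks steadily with distance (distance awareness);
# with d_model = 64 the trend is far less regular and the vectors stay relatively similar.
```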
Hence, we require a position embedding for time series that has distance
awareness while simultaneously being isotropic. In order to incorporate dis-
tance awareness, we propose to use the time series length in Equation 5. In
this equation, ωk refers to the frequency of the sine and cosine functions from
which the embedding vectors are generated. Without our modification, as the series length L increases, the dot product between positions becomes ever less regular, weakening the distance awareness property. In vanilla APE, the frequencies are ωk = 10000^{−2k/dmodel} . We instead scale each frequency by the ratio of the embedding dimension to the series length:
$$\omega_k^{new} = \frac{\omega_k \times d_{model}}{L} \quad (13)$$
where L is the series length and dmodel is the embedding dimension.
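A minimal sketch of tAPE under Equation 13, assuming the same sinusoidal construction as vanilla APE but with the rescaled frequencies (the function name is ours; this is an illustration, not the authors' reference implementation):

```python
import torch

def tape(seq_len: int, d_model: int) -> torch.Tensor:
    """time Absolute Position Encoding: sinusoidal APE with length-scaled frequencies."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    k = torch.arange(0, d_model, 2, dtype=torch.float32)
    omega = 10000.0 ** (-k / d_model)
    omega_new = omega * d_model / seq_len      # Equation 13
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * omega_new)
    pe[:, 1::2] = torch.cos(positions * omega_new)
    return pe   # added to the input embedding before the transformer block
```

When d_model equals the series length, omega_new reduces to omega and tAPE coincides with vanilla APE, as noted below.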
Our new tAPE position encoding is compared with the vanilla sinusoidal position encoding to provide further illustration. Using a dmodel = 128 dimensional vector, Figs 2a-b show the dot product (similarity) of two positions at distance K for series of length L = 1000 and L = 30, respectively. As depicted in Fig 2a, in vanilla APE only the closest positions in the series show a monotonically decreasing trend; approximately from distance 50 onwards (|K| > 50) on both sides, the decreasing similarity trend becomes less apparent as the distance between two positions in the time series increases.
However, tAPE has a more stable decreasing trend and more steadily reflects
the distance between two positions. Meanwhile, Fig 2b shows the embedding
vectors of tAPE are less similar to each other compared to vanilla APE. This
is due to better utilising the embedding vector space to differentiate between
two positions as we discussed earlier.
Note that in Equation 13 our ωk^new will obviously be equal to the ωk in vanilla APE when dmodel = L, and the encodings of tAPE and vanilla APE will be the same. However, if dmodel ≠ L, tAPE will encode the positions in the series more effectively than vanilla APE due to the two properties we discussed earlier.
Fig 2a shows a case in which dmodel < L and Fig 2b shows a case in which
dmodel > L and in both cases tAPE utilises embedding space to provide an
isotropic encoding, while holding the distance awareness property. In other
words, tAPE provides a balance between these two properties in its encodings.
The superiority of tAPE compared to vanilla APE and learned APE on various
length time series datasets is shown in the experimental results section.
Fig. 2 Comparing the dot product between two positions at distance K in a time series using tAPE and vanilla APE with a dmodel = 128 dimensional vector, for series of length a) L = 1000 and b) L = 30.
Fig. 3 Self-attention modules with relative position encoding using scalar and vector
parameters. Newly added parts are depicted in grey.
We integrate relative position information by adding a learnable scalar wi−j for each relative position directly to the post-softmax attention weights:
$$z_i = \sum_{j=1}^{L} \left( A_{i,j} + w_{i-j} \right) x_j W^V \quad (14)$$
where L is the series length, Ai,j is the (post-softmax) attention weight, and wi−j is a learnable scalar (i.e., w ∈ R^{O(L)} ) representing the relative position weight between positions i and j.
It is worth comparing the strengths and weaknesses of relative position
encodings and attention to determine what properties are more desirable for
relative position encoding of time series data. Firstly, the relative position
embedding wi−j is an input-independent parameter with static values, whereas
an attention weight Ai,j is dynamically determined by the representation of
the input series. In other words, attention adapts to input series via a weight-
ing strategy (input-adaptive weighting [8]). Input-adaptive weighting enables models to capture the complicated relationships between different time points, a property that we desire most when we want to extract high-level concepts from a time series, for instance, its seasonality component. However, with limited training data, we are at a greater risk of overfitting when using attention.
Secondly, relative position embedding wi−j takes into account the relative
shift between positions i and j, not their actual values. This is similar to the translation equivariance property of convolution, which has been shown to enhance generalization [6]. We propose to represent wi−j as a scalar rather than a vector, enabling translation equivariance without blowing up the number of parameters. In addition, the scalar representation of w provides the benefit that the value of wi−j for all (i, j) can be subsumed within the pairwise dot-product attention function, resulting in minimal additional computation (see subsection 4.2.1). We call our proposed efficient relative position encoding eRPE.
Theoretically, there are many ways to integrate relative position information into the attention matrix, but we empirically found that attention models perform better when the relative position is added to the model after applying the softmax to the attention matrix, as shown in Equation 14. We presume this is because the position values are sharper without the softmax, and sharper position embeddings seem to be beneficial in TSC tasks, as they place more emphasis on informative relative positions for classification than existing models in which the softmax is applied to the relative position embeddings.
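The following is a sketch of this idea for a single attention head (the class and tensor names are ours, and the way the 2L−1 relative scalars are indexed is one straightforward implementation, not necessarily the exact one used in ConvTran):

```python
import torch
import torch.nn as nn

class RelativeScalarAttention(nn.Module):
    """Single-head attention with one learnable scalar per relative position,
    added after the softmax (Equation 14). Requires L <= max_len."""
    def __init__(self, d_model: int, d_z: int, max_len: int):
        super().__init__()
        self.max_len = max_len
        self.qkv = nn.Linear(d_model, 3 * d_z, bias=False)
        # One scalar w_{i-j} for every relative shift in [-(max_len-1), max_len-1].
        self.rel_w = nn.Parameter(torch.zeros(2 * max_len - 1))

    def forward(self, x):                                    # x: (batch, L, d_model)
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        # Gather w_{i-j} into an (L, L) matrix: entry (i, j) holds rel_w[i - j + max_len - 1].
        idx = torch.arange(L)
        rel = self.rel_w[idx[:, None] - idx[None, :] + self.max_len - 1]
        return (attn + rel) @ v                              # Equation 14, (batch, L, d_z)
```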
Fig. 4 The overall ConvTran architecture: convolution-based input embedding with tAPE, followed by a transformer block with eRPE-based multi-head attention (eRPE-MHA) and a feed-forward network (FFN), then global average pooling (GAP) and a fully connected (FC) output layer.
4.3 ConvTran
Now we look at how we can utilize our new position encoding methods to build
a time series classification network. According to the earlier discussion, global
attention has a quadratic complexity w.r.t. the series length. This means that if
we directly apply the proposed attention in Equation 14 to the raw time series,
the computation will be excessively slow for long time series. Hence, we first use
convolutions to reduce the series length and then apply our proposed position
encodings once the feature map has been reduced to a less computationally
intense size. See Fig. 4, where the convolution blocks come as the first component, followed by the attention blocks.
Another benefit of using convolutions is that convolution operations are well-suited to capturing local patterns. By using convolutions as the
first component in our architecture we can capture any discriminative local
information that exists in raw time series.
As shown in Fig. 4, as the first step in the convolution layers, M tem-
poral filters are applied to the input data. In this step, the model extracts
temporal patterns in the input series. Next, the output of temporal filtering
is convolved with dmodel spatial filters of shape dx × M to capture the correla-
tions between variables in multivariate time series and construct dmodel size
input embeddings. Such disjoint temporal and spatial convolution is similar to
“Inverted Bottleneck” in [27]. It first expands the number of input channels
and then squeezes them. A key reason for this choice is that the Feed Forward
Network (FFN) in transformers [8] also expands the input size and later
projects the expanded hidden state back to the original size to capture the
spatial interactions.
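A rough PyTorch sketch of this disjoint convolutional embedding (the kernel size, stride, activations and batch normalisation are illustrative choices, not the exact hyperparameters of ConvTran):

```python
import torch
import torch.nn as nn

class ConvEmbedding(nn.Module):
    """Disjoint temporal then spatial convolution, producing a (batch, L', d_model) embedding."""
    def __init__(self, d_x: int, M: int, d_model: int):
        super().__init__()
        # Temporal filters: convolve each channel along time; the stride of 2 illustrates
        # how the series length can be reduced before attention is applied.
        self.temporal = nn.Sequential(
            nn.Conv2d(1, M, kernel_size=(1, 7), stride=(1, 2), padding=(0, 3)),
            nn.BatchNorm2d(M), nn.GELU())
        # Spatial filters of shape d_x x 1 mix the channels into d_model-dimensional embeddings.
        self.spatial = nn.Sequential(
            nn.Conv2d(M, d_model, kernel_size=(d_x, 1)),
            nn.BatchNorm2d(d_model), nn.GELU())

    def forward(self, x):                      # x: (batch, d_x, L)
        h = self.temporal(x.unsqueeze(1))      # (batch, M, d_x, L')
        h = self.spatial(h)                    # (batch, d_model, 1, L')
        return h.squeeze(2).transpose(1, 2)    # (batch, L', d_model)
```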
Before feeding the input embedding to the transformer block, we add the tAPE-generated position embedding to the input embedding vector so that the model can capture the temporal order of the time series. The size of the position embedding vector is dmodel , the same as the input embedding. Inside the multi-head attention, the inputs of dimension L × dmodel are first converted to an L × dz × 3 shape using a linear layer to obtain the qkv matrix, in which dz is the attention dimension defined by the user. Each of the three matrices of shape L × dz represents the Query (q), Key (k) and Value (v) matrices. These q, k, and v matrices are reshaped to h × L × dz /h to represent
the h attention heads. Each of these attention heads can be responsible for
capturing different patterns in time series. For instance, one attention head
can attend to the non-noisy data, another head can attend to the seasonal
component and another to the trend. Once we have the q, k, and v matrices, we
finally perform the attention operation inside the Multi-Head attention block
using Equation 14.
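The projection and head reshaping described above can be sketched as follows (sizes are illustrative; the relative position term of Equation 14 would be added to attn per head, as in the earlier sketch):

```python
import torch
import torch.nn as nn

B, L, d_model, d_z, h = 16, 128, 64, 64, 8            # illustrative sizes, d_z divisible by h
x = torch.randn(B, L, d_model)                         # input embeddings with tAPE added
qkv = nn.Linear(d_model, 3 * d_z)(x)                   # (B, L, 3*d_z)
q, k, v = qkv.reshape(B, L, 3, h, d_z // h).permute(2, 0, 3, 1, 4)   # each (B, h, L, d_z/h)
attn = torch.softmax(q @ k.transpose(-2, -1) / (d_z // h) ** 0.5, dim=-1)
out = (attn @ v).transpose(1, 2).reshape(B, L, d_z)    # heads concatenated back to (B, L, d_z)
```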
According to Equation 14, the relative position weights, which form an L × L matrix with the same shape as the attention matrix, are added to the post-softmax attention weights. We represent wi−j as a scalar (i.e., w ∈ R^{O(L)} ), which acts as a global convolution kernel without increasing
the number of parameters. The relative position embedding enables the model
to learn not only the order of time points, but also the relative position of
pairs of time points, which can capture richer information than other position
embedding strategies.
The FFN is a multi-layer perceptron block consisting of two linear layers with a Gaussian Error Linear Unit (GELU) activation in between. The outputs of the FFN block are again added to its inputs (via a skip connection) to obtain the final output of the transformer block. Finally, just before the fully connected layer, max pooling and global average pooling (GAP) are applied to the output of the last layer's ELU activation function, which gives a more translation-equivariant model.
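A sketch of these remaining blocks under the description above (the FFN expansion factor and the way the two pooled vectors are combined are assumptions):

```python
import torch
import torch.nn as nn

class FFNBlock(nn.Module):
    """Two linear layers with GELU and a skip connection."""
    def __init__(self, d_model: int, expansion: int = 4):   # expansion factor is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model), nn.GELU(),
            nn.Linear(expansion * d_model, d_model))

    def forward(self, x):                       # x: (batch, L, d_model)
        return x + self.net(x)                  # skip connection

class ClassificationHead(nn.Module):
    """ELU, then max pooling and global average pooling over time, then a fully connected layer."""
    def __init__(self, d_model: int, n_classes: int):
        super().__init__()
        self.act = nn.ELU()
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, x):                       # x: (batch, L, d_model)
        h = self.act(x)
        # Combining the two pooled vectors by summation is an assumption (they could be concatenated).
        pooled = torch.amax(h, dim=1) + h.mean(dim=1)
        return self.fc(pooled)
```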
5 Experimental Results
In this section, we evaluate the performance of our ConvTran model on the
UEA time series repository [2] and two large multivariate time series datasets
and compare it with the state-of-the-art models. All of our experiments were
conducted using the PyTorch framework in Python on a computing system
consisting of a single Nvidia A5000 GPU with 24GB of memory and an Intel(R)
Core(TM) i9-10900K CPU. To promote reproducibility, our source code and additional experimental results are available online at https://github.com/Navidfoumani/ConvTran.
We have divided our experiments into four parts. First, we present an ablation study on various position encodings. We then compare ConvTran with state-of-the-art deep learning models for MTSC, followed by a comparison with state-of-the-art non-deep-learning models. Finally, we compare the runtime and scalability of ConvTran and ROCKET.
5.1 Datasets
UEA Repository The archive consists of 30 real-world multivariate time series datasets from a wide range of applications such as human activity recognition, motion classification, and ECG/EEG classification [2]. The number of dimensions ranges from 2 to 1,345, the length of the time series ranges from 8 to 17,984, and the training set sizes range from 12 to 25,000.
Ford Challenge This dataset is obtained from the Kaggle challenge website (https://www.kaggle.com/c/stayalert). It includes measurements from a total of 600 real-time driving sessions, where each session lasts 2 minutes and is sampled at a 100 ms rate. The trials are sampled from 100 drivers of both genders and of different ages. The training file consists of 604,329 data points, each belonging to one of 500 trials, and the test file contains 120,840 data points belonging to 100 trials. Each data point has a binary label (0 or 1) and contains 8 physiological, 12 environmental, and 10 vehicular features acquired while driving.
Actitracker Human Activity Recognition This dataset describes six daily activities collected in a controlled laboratory environment. The activities, “Walking”, “Jogging”, “Stairs”, “Sitting”, “Standing”, and “Lying Down”, were recorded from 36 users carrying a cell phone in their pocket at a sampling rate of 20 Hz. The data comprise 2,980,765 samples with 3 dimensions and are split subject-wise into train and test sets [1].
Fig. 5 The average rank of a single-layer transformer with (a) different absolute position encodings and (b) different relative position encodings.
We compare models using critical difference diagrams [34], where models in the same clique (the black bar in the diagram) are not statistically significantly different. For the statistical test, we used the Wilcoxon signed-rank test with Holm correction as the post hoc test to the Friedman test [34].
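For reference, this statistical procedure can be reproduced with standard Python libraries; the accuracy arrays below are random placeholders, not results from the paper:

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# Accuracies of three hypothetical classifiers on the same 32 datasets (placeholder values).
rng = np.random.default_rng(0)
acc = {"A": rng.random(32), "B": rng.random(32), "C": rng.random(32)}

print(friedmanchisquare(*acc.values()))          # omnibus Friedman test

# Pairwise Wilcoxon signed-rank tests with Holm correction as the post hoc procedure.
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
pvals = [wilcoxon(acc[m1], acc[m2]).pvalue for m1, m2 in pairs]
reject, corrected, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for pair, p, r in zip(pairs, corrected, reject):
    print(pair, round(p, 4), "significant" if r else "not significant")
```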
Fig. 6 The average rank of various combinations of absolute and relative position encodings.
Fig. 5a shows the critical difference diagram of a single-layer transformer with different absolute position encodings. tAPE has the highest average rank, while the model without position encoding has the least accurate results, highlighting the importance of absolute position encoding in time series classification. The vanilla APE also improves overall performance; although it is not significantly more accurate than the Learn APE, it achieves this with fewer parameters.
Fig. 5b shows the critical difference diagram of a single-layer transformer with different relative position encodings. As shown in this figure, eRPE has the highest rank and is significantly more accurate than the other encodings, as it has fewer parameters and is therefore less likely to overfit. It is not surprising that the model without position encoding has the least accurate results, highlighting the importance of relative position encoding and the translation equivariance property in time series classification. The input-dependent Vector encoding also improves overall performance and is significantly better than the model with no position encoding (None). Fig. 6 shows the critical difference diagram for the various
combinations of absolute and relative position encodings. As depicted in this
figure, the combination of our proposed tAPE and eRPE is significantly more
accurate than all other combinations. This shows the high potential of our
encoding methods to incorporate position information into transformers. The
combination of Learn and Vector has the least accurate results, most likely
due to the high number of parameters.
Fig. 7 The average rank of ConvTran against all deep learning based methods on all 32
MTS datasets.
Table 2 Average accuracy of six deep learning based models over 32 multivariate time
series datasets. Datasets are sorted based on the number of training samples per-class. The
highest accuracy for each dataset is highlighted in bold.
DataSets Avg Train ConvTran TST IT Disjoint-CNN FCN ResNet
Ford 17300 0.7805 0.7655 0.7628 0.7422 0.6353 0.687
HAR 8400 0.9098 0.8831 0.8775 0.8807 0.8445 0.8711
FaceDetection 2945 0.6722 0.6542 0.5885 0.5665 0.5037 0.5948
Insectwingbeat 2500 0.7132 0.6748 0.6956 0.6308 0.6004 0.65
PenDigits 750 0.9871 0.9694 0.9797 0.9708 0.9857 0.9771
ArabicDigits 660 0.9945 0.9749 0.9872 0.9859 0.9836 0.9832
LSST 176 0.6156 0.2846 0.4456 0.5559 0.5616 0.5725
FingerMovement 158 0.56 0.58 0.56 0.54 0.53 0.54
MotorImagery 139 0.56 0.48 0.53 0.49 0.55 0.52
SelfRegSCP1 134 0.918 0.86 0.8634 0.8839 0.7816 0.8362
Heartbeat 102 0.7853 0.6975 0.6248 0.717 0.678 0.7268
SelfRegSCP2 100 0.5833 0.5333 0.4722 0.5166 0.4667 0.5
PhonemeSpectra 85 0.3062 0.089 0.1586 0.2821 0.1599 0.1596
CharacterTraject 72 0.9922 0.9825 0.9881 0.9945 0.9868 0.9945
EthanolConcen 66 0.3612 0.151 0.3489 0.2775 0.3232 0.3155
HandMovement 40 0.4054 0.5405 0.3783 0.5405 0.2973 0.2838
PEMS-SF 39 0.8284 0.7572 0.8901 0.8901 0.8324 0.7399
RacketSports 38 0.8618 0.8815 0.8223 0.8355 0.8223 0.8223
Epilepsy 35 0.9855 0.9492 0.9928 0.8898 0.9928 0.9928
JapaneseVowels 30 0.9891 0.9837 0.9702 0.9756 0.973 0.9135
NATOPS 30 0.9444 0.95 0.9166 0.9277 0.8778 0.8944
EigenWorms 26 0.5934 0.4503 0.5267 0.5934 0.4198 0.4198
UWaveGesture 15 0.8906 0.8906 0.9093 0.8906 0.85 0.85
Libras 12 0.9277 0.8222 0.8722 0.8577 0.85 0.8389
ArticularyWord 11 0.9833 0.9833 0.9866 0.9866 0.98 0.98
BasicMotions 10 1 0.975 1 1 1 1
DuckDuckGeese 10 0.62 0.5 0.36 0.5 0.36 0.24
Cricket 9 1 1 0.9861 0.9772 0.9306 0.9722
Handwriting 6 0.3752 0.2752 0.3011 0.2372 0.376 0.18
ERing 6 0.9629 0.9296 0.9296 0.9111 0.9037 0.9296
AtrialFibrillation 5 0.4 0.2 0.2 0.4 0.3333 0.3333
StandWalkJump 4 0.3333 0.3333 0.4 0.3333 0.4 0.4
Fig. 8 Pairwise accuracy comparison of ConvTran against HC2 (11/2/13), ROCKET (14/2/10), CIF (15/1/10) and InceptionTime (19/2/5). In each scatter plot, points falling in the region labelled “ConvTran is better here” correspond to datasets on which ConvTran is more accurate; datasets such as EigenWorms, EthanolConcentration, StandWalkJump and DuckDuckGeese are annotated with their training set sizes.
Fig. 9 Comparison of runtime and accuracy between ConvTran and ROCKET on the largest UEA dataset, InsectWingBeat, with 25,000 training samples. The figure shows the runtime of the two models on training subsets of different sizes and their corresponding classification accuracy.
6 Conclusion
This paper studies the importance of position encoding for time series for the first time and reviews existing absolute and relative position encoding methods in time series classification. Based on the limitations of the current position encodings for time series, we proposed two novel position encodings specifically for time series, an absolute one called tAPE and a relative one called eRPE. We then integrated the two proposed position encodings into a transformer block, combined them with a convolution layer, and presented a novel deep learning framework for multivariate time series classification (ConvTran). Extensive experiments show that ConvTran benefits from the position information, achieving state-of-the-art performance among deep learning models for multivariate time series classification. In future work, we will study the effectiveness of our new transformer block in other transformer-based TSC models and in other downstream tasks such as anomaly detection.
7 Declarations
Conflict of interest statement: The authors have no competing interests
to declare that are relevant to the content of this article.
References
[1] Lockhart, J.W., Weiss, G.M., Xue, J.C., Gallagher, S.T., Grosner, A.B.,
Pulickal, T.T.: Design considerations for the WISDM smartphone-based
sensor mining architecture. In: International Workshop on Knowledge
Discovery from Sensor Data, pp. 25–33 (2011)
[2] Bagnall, A., Dau, H.A., Lines, J., Flynn, M., Large, J., Bostrom, A.,
Southam, P., Keogh, E.: The UEA multivariate time series classification
archive, 2018. arXiv preprint arXiv:1811.00075 (2018)
[3] Bagnall, A., Lines, J., Bostrom, A., Large, J., Keogh, E.: The great
time series classification bake off: a review and experimental evaluation
of recent algorithmic advances. Data Mining and Knowledge Discovery
31(3), 606–660 (2017)
[4] Fawaz, H.I., Forestier, G., Weber, J., Idoumghar, L., Muller, P.-A.:
Deep learning for time series classification: a review. Data Mining and
Knowledge Discovery 33(4), 917–963 (2019)
[5] Ruiz, A.P., Flynn, M., Large, J., Middlehurst, M., Bagnall, A.: The great
multivariate time series classification bake off: a review and experimental
evaluation of recent algorithmic advances. Data Mining and Knowledge
Discovery, 1–49 (2020)
[6] Dai, Z., Liu, H., Le, Q.V., Tan, M.: Coatnet: Marrying convolution and
attention for all data sizes. Advances in Neural Information Processing
Systems 34, 3965–3977 (2021)
[7] Karim, F., Majumdar, S., Darabi, H., Harford, S.: Multivariate lstm-fcns
for time series classification. Neural Networks 116, 237–245 (2019)
[8] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in
Neural Information Processing Systems 30 (2017)
[9] Hao, Y., Cao, H.: A new attention mechanism to classify multivariate time
series. In: International Joint Conference on Artificial Intelligence (2020)
[10] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training
of deep bidirectional transformers for language understanding. arXiv
preprint arXiv:1810.04805 (2018)
[11] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X.,
Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et
al.: An image is worth 16x16 words: Transformers for image recognition
at scale. arXiv preprint arXiv:2010.11929 (2020)
[12] Zerveas, G., Jayaraman, S., Patel, D., Bhamidipaty, A., Eickhoff, C.:
A transformer-based framework for multivariate time series representa-
tion learning. In: SIGKDD Conference on Knowledge Discovery & Data
Mining, pp. 2114–2124 (2021)
[13] Kostas, D., Aroca-Ouellette, S., Rudzicz, F.: BENDR: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data. Frontiers in Human Neuroscience 15 (2021)
[14] Shaw, P., Uszkoreit, J., Vaswani, A.: Self-attention with relative position
representations. arXiv preprint arXiv:1803.02155 (2018)
[15] Huang, C.-Z.A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I.,
Hawthorne, C., Dai, A.M., Hoffman, M.D., Dinculescu, M., Eck, D.: Music
transformer. arXiv preprint arXiv:1809.04281 (2018)
[16] Dufter, P., Schmitt, M., Schütze, H.: Position information in transformers:
An overview. Computational Linguistics 48(3), 733–763 (2022)
[17] Foumani, N.M., Miller, L., Tan, C.W., Webb, G.I., Forestier, G., Salehi,
M.: Deep learning for time series classification and extrinsic regression: A
current survey. arXiv preprint arXiv:2302.02515 (2023)
[18] Dempster, A., Petitjean, F., Webb, G.I.: Rocket: exceptionally fast and
accurate time series classification using random convolutional kernels.
Data Mining and Knowledge Discovery 34(5), 1454–1495 (2020)
[19] Bagnall, A., Flynn, M., Large, J., Lines, J., Middlehurst, M.: On the usage
and performance of the hierarchical vote collective of transformation-
based ensembles version 1.0 (hive-cote v1. 0). In: International Workshop
on Advanced Analytics and Learning on Temporal Data, pp. 3–18 (2020)
[20] Middlehurst, M., Large, J., Bagnall, A.: The canonical interval forest
(cif) classifier for time series classification. In: 2020 IEEE International
Conference on Big Data, pp. 188–195 (2020)
[21] Fawaz, H.I., Lucas, B., Forestier, G., Pelletier, C., Schmidt, D.F., Weber,
J., Webb, G.I., Idoumghar, L., Muller, P.-A., Petitjean, F.: Inceptiontime:
Finding alexnet for time series classification. Data Mining and Knowledge
Discovery 34(6), 1936–1962 (2020)
[22] Dempster, A., Schmidt, D.F., Webb, G.I.: Minirocket: A very fast (almost)
deterministic transform for time series classification. In: SIGKDD Con-
ference on Knowledge Discovery & Data Mining, pp. 248–257 (2021)
[23] Tan, C.W., Dempster, A., Bergmeir, C., Webb, G.I.: Multirocket: Effective
summary statistics for convolutional outputs in time series classification.
[24] Middlehurst, M., Large, J., Flynn, M., Lines, J., Bostrom, A., Bagnall, A.:
HIVE-COTE 2.0: A new meta ensemble for time series classification. Machine
Learning 110(11), 3211–3243 (2021)
[25] Wang, Z., Yan, W., Oates, T.: Time series classification from scratch
with deep neural networks: A strong baseline. In: 2017 International Joint
Conference on Neural Networks, pp. 1578–1585 (2017)
[26] Foumani, S.N.M., Tan, C.W., Salehi, M.: Disjoint-CNN for multivari-
ate time series classification. In: 2021 International Conference on Data
Mining Workshops, pp. 760–769 (2021)
[27] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C.:
Mobilenetv2: Inverted residuals and linear bottlenecks. In: IEEE Con-
ference on Computer Vision and Pattern Recognition, pp. 4510–4520
(2018)
[28] Liu, M., Ren, S., Ma, S., Jiao, J., Chen, Y., Wang, Z., Song, W.: Gated
transformer networks for multivariate time series classification. arXiv
preprint arXiv:2103.14438 (2021)
[31] Huang, Z., Liang, D., Xu, P., Xiang, B.: Improve transformer models
with better relative position embeddings. arXiv preprint arXiv:2009.13658
(2020)
[32] Wu, K., Peng, H., Chen, M., Fu, J., Chao, H.: Rethinking and improv-
ing relative position encoding for vision transformer. In: IEEE/CVF
International Conference on Computer Vision, pp. 10033–10041 (2021)
[33] Liang, Y., Cao, R., Zheng, J., Ren, J., Gao, L.: Learning to remove:
Towards isotropic pre-trained bert embedding. In: International Confer-
ence on Artificial Neural Networks, pp. 448–459 (2021)
[34] Demšar, J.: Statistical comparisons of classifiers over multiple data sets.
The Journal of Machine Learning Research 7, 1–30 (2006)
Table A1 Accuracy, training time and testing time of ROCKET and ConvTran on the 32 MTS datasets, sorted by training set size.
Datasets Train size ROCKET-Accuracy ROCKET-TrainTime ROCKET-TestTime ConvTran-Accuracy ConvTran-TrainTime ConvTran-TestTime
HAR 41546 0.8293 5366.34 11.51 0.9098 2367.77 1.82
Ford 28839 0.6051 6863.81 11.91 0.7805 1619.42 0.95
InsectWingbeat 25000 0.4182 5721.04 41.5 0.7132 1617.82 5.47
PenDigits 7494 0.984 65.26 0.99 0.9871 401.1 0.59
ArabicDigits 6599 0.9932 75.59 10.38 0.9945 376.7 0.37
FaceDetection 5890 0.5624 53.23 11.99 0.6722 413.39 0.83
PhonemeSpectra 3315 0.1894 42 37.22 0.3062 202.27 0.89
LSST 2459 0.5251 5.84 3.52 0.6156 148.07 0.48
CharacterTrajec 1422 0.9916 8.4 7.72 0.9922 89.61 0.28
FingerMovement 316 0.55 0.96 0.35 0.56 21.33 0.02
MotorImagery 278 0.56 45.39 16.26 0.56 386 0.81
ArticularyWord 275 0.9933 2.09 2.19 0.9833 19.76 0.08
JapaneseVowels 270 0.9568 0.57 0.67 0.9891 20.6 0.13
SelfRegSCP1 268 0.8601 10.2 11.25 0.918 45.54 0.27
PEMS-SF 267 0.8266 3.53 2.13 0.8284 28.08 0.09
EthanolConcen 261 0.4448 14.59 14.32 0.3612 131.58 0.69
Heartbeat 204 0.7414 4.57 4.59 0.7853 17.13 0.09
SelfRegSCP2 200 0.5833 10.78 9.65 0.5833 50.05 0.22
NATOPS 180 0.8944 0.6 0.58 0.9444 14.61 0.04
Libras 180 0.8667 0.36 0.29 0.9277 11.51 0.04
HandMovement 160 0.4189 3.31 1.7 0.4054 11.29 0.03
RacketSports 151 0.9078 0.29 0.32 0.8618 11.86 0.03
Handwriting 150 0.5376 0.81 3.92 0.3752 11.85 0.23
Epilepsy 137 0.971 0.91 0.93 0.9855 10.52 0.03
EigenWorms 128 0.8702 107.48 111.42 0.5934 225.71 0.7
UWaveGesture 120 0.9188 1.21 3.07 0.8906 10.2 0.09
Cricket 108 1 5.79 4.07 1 32.1 0.1
DuckDuckGeese 50 0.5 1.59 1.76 0.62 9.46 0.05
BasicMotions 40 1 0.25 0.27 1 4.45 0.01
ERing 30 0.9851 0.17 0.7 0.9629 3.17 0.06
AtrialFibrillation 15 0.2 0.39 0.41 0.4 1.99 0.01
StandWalkJump 12 0.5333 1.52 1.65 0.3333 14.56 0.09
Appendix A
A.1 Empirical Evaluation of Efficiency and Effectiveness
The results presented in Table A1 demonstrate that ConvTran outperforms
ROCKET in terms of both train time and test accuracy on larger datasets with
more than 10k samples. However, ROCKET has a better train time on smaller
datasets. Nevertheless, even on small datasets, ConvTran achieves acceptable
accuracy within a reasonable train time. It is worth noting that the perfor-
mance of ConvTran improves as the dataset size increases, indicating that our
model is suitable for scaling to larger datasets.