
1 Introduction

Time series can be seen as series of ordered measurements. They contain temporal information that needs to be taken into account when dealing with such data. Time series classification (TSC) can be defined as follows: given a collection of unlabeled time series, assign each time series to one of a predefined set of classes. TSC has recently been receiving more and more attention due to its diverse applications in real-life problems involving, for example, data mining, statistics, machine learning and image processing.

An extensive comparison of TSC approaches is performed in [1]. Two particular methods stand out from other core classifiers for their accuracy: COTE [2] and BOSS [12]. BOSS is a dictionary-based approach based on the extraction of Fourier coefficients from time series windows. Many other dictionary-based approaches have been proposed recently [3, 4]. These methods share the same overall steps: (i) extraction of feature vectors from time series; (ii) creation of a codebook (composed of codewords) from extracted feature vectors; and (iii) representation of time series as a histogram of codeword appearances, called a Bag-of-Words (BoW).

Dictionary-based approaches are well suited to TSC. Nevertheless, two drawbacks of such methods can be pointed out: (i) global temporal information is lost when representing a time series as a BoW; and (ii) extracted features are quantized using a dictionary, which inherently induces a loss in the precision of the time series representation. In this paper, we tackle this second issue. In particular, we study the impact of mid-level representations (widely used for image and video analysis) on time series classification. We focus on mid-level features that aim at enhancing the BoW representation by a more accurate description of the distribution of feature vectors related to an object. To the best of our knowledge, such representations have never been used for time series. Vector of Locally Aggregated Descriptors [8] and Locality-constrained Linear Coding [14] are examples of such mid-level representations.

This paper is organized as follows. Section 2 reviews related work on time series classification. Section 3 provides background on SIFT-based descriptors extracted from time series. In Sect. 4, we present a methodology for time series classification using powerful mid-level representations built on SIFT-based descriptors. Section 5 details the experimental setup and the results that validate the method, and finally, some conclusions are drawn in Sect. 6.

2 Related Work

In this section, we give an overview of related work on TSC. One of the earliest methods for this task is the combination of the 1-nearest-neighbor classifier and the Dynamic Time Warping distance. It has been a baseline for TSC for many years thanks to its good performance. Recently, more sophisticated approaches have been designed for TSC.

Shapelets, for instance, were introduced in [15]. They are subsequences of time series that are able to discriminate between classes. Hills et al. proposed the shapelet transform [6], which consists in transforming a time series into a vector whose components are the distances between the time series and different shapelets extracted beforehand. Classifiers, such as SVMs, can then be applied to these vectorial representations of time series.

Numerous approaches have been designed based on the Bag-of-Words framework. This framework consists in extracting feature vectors from time series, creating a dictionary of words using these extracted features, and then representing each time series as a histogram of word occurrences. The approaches proposed in the literature differ mainly in the kind of features that are extracted. Local features such as mean, variance and extrema are considered in [4], and Fourier coefficients in [12]. SAX coefficients are used in [9]. Recently, SIFT-based descriptors adapted to time series have been considered as feature vectors in [3].

All the methods based on the BoW framework create a dictionary of words by quantizing the set (or a subset) of extracted features. This quantization step induces a loss when representing time series as histograms of word occurrences. In this paper, we aim at improving the accuracy of time series representations such as BoW through the adoption of specially designed mid-level representations.

3 Background on SIFT-Based Feature Extraction

The work proposed in this paper aims at improving the classical BoW representation for time series. More precisely, the idea is to build an accurate vectorial representation that models a set of feature vectors extracted from time series. The mid-level representations used in this paper can be applied to any kind of feature vectors extracted from time series. In this paper, we choose to use the SIFT-based descriptors proposed in [3], illustrated in Fig. 1. We briefly explain in this section how such descriptors are computed.

Fig. 1. SIFT-based descriptors for time series proposed in [3]: a time series and its extracted key-points. A key-point is described by vectors representing the gradients in its neighborhood, at different scales.

First, key-points are extracted regularly every \(\tau _{step}\) instants. Then, each key-point is described by different feature vectors representing its neighborhood at different scales. More precisely, let \(L(\mathcal {S},\sigma )\) be the convolution of a time series \(\mathcal {S}\) with a Gaussian function \(G(t,\sigma )\) of width \(\sigma \) computed by \(L(\mathcal {S},\sigma )= \mathcal {S} * G(t,\sigma )\) in which

$$\begin{aligned} G(t,\sigma )=\frac{1}{\sqrt{2\pi }\sigma }e^{-t^2/2\sigma ^2} \end{aligned}$$
(1)

For description, \(n_b\) blocks of size a are selected in \(L(\mathcal {S},\sigma )\) around each key-point. Each of these blocks is described by the gradient magnitudes of the points in the block. More precisely, the sum of positive gradients and the sum of negative gradients are computed in each block. Hence, the size of a key-point feature vector is \(2\times n_b\). The key-points are described at many different scales, thereby transforming a time series into a set of feature vectors. More details about SIFT-based descriptors can be found in [3].
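To make this step concrete, the following Python sketch extracts such dense SIFT-like descriptors from a univariate time series. The function name extract_descriptors and the default parameter values are ours for illustration; the actual implementation of [3] may differ in its handling of scales, block layout and normalization.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def extract_descriptors(series, tau_step=8, n_blocks=4, block_size=8,
                        scales=(1.0, 2.0, 4.0)):
    """Dense SIFT-like descriptors for a 1-D time series (illustrative sketch).

    Key-points are taken every `tau_step` instants; at each scale, a key-point
    is described by the sums of positive and negative gradients inside
    `n_blocks` blocks of `block_size` points around it (2 * n_blocks values)."""
    descriptors = []
    half = n_blocks * block_size // 2
    for sigma in scales:
        smoothed = gaussian_filter1d(np.asarray(series, dtype=float), sigma)  # L(S, sigma)
        grad = np.gradient(smoothed)
        for t in range(half, len(smoothed) - half, tau_step):
            feat = []
            for b in range(n_blocks):
                block = grad[t - half + b * block_size:
                             t - half + (b + 1) * block_size]
                feat.append(block[block > 0].sum())   # sum of positive gradients
                feat.append(-block[block < 0].sum())  # magnitude of negative gradients
            descriptors.append(feat)
    return np.array(descriptors)  # shape: (#key-points x #scales, 2 * n_blocks)
```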

4 Classification of Time Series by Using Their Mid-Level Representation

The methodology for classifying time series using mid-level representations can be divided into four main steps: (i) dense extraction and description of key-points in time series; (ii) coding of the key-point descriptors, using clustering methods; (iii) feature encoding followed by vector concatenation in order to create the final mid-level representation; and (iv) classification of time series using their mid-level representation. Here, it is important to note that each time series is represented by a single mid-level description. In the following, we briefly discuss some of the representations used here to describe a set of low-level features.

Let \({\mathbb X}=\left\{ { {{\mathbf x}_{j}}}\in {\mathbb R}^{ d}\right\} _{j=1}^{N}\) be an unordered set of d-dimensional descriptors \({ {{\mathbf x}_{j}}}\) extracted from the data. Let also \(\left\{ \mathbf {d}_m\right\} _{m=1}^{M}\) be the codebook learned by an unsupervised clustering algorithm, composed of a set of M codewords \(\mathbf {d}_m\), also called prototypes or representatives. Consider \(\mathbb {Z}\in {\mathbb R}^{M}\) as the final vector mid-level representation. As formalized in [5], the mapping from \(\mathbb X\) to \(\mathbb {Z}\) can be decomposed into three successive steps: (i) coding; (ii) pooling; and (iii) concatenation, as follows:

$$\begin{aligned} \alpha _j&=f({ {{\mathbf x}_{j}}}), j\in [1,N] &\text{(coding) } \end{aligned}$$
(2)
$$\begin{aligned} h_m&=g(\alpha _m=\{\alpha _{m,j}\}_{j=1}^{N}),m\in [1,M]&\text{(pooling) } \end{aligned}$$
(3)
$$\begin{aligned} z&=[h_1^T,\dots ,h_M^T] &\text{(concatenation) } \end{aligned}$$
(4)

In vector quantization (VQ) [13], the coding function f assigns each descriptor to its closest codeword, and the pooling function averages the resulting codes, as follows:

$$\begin{aligned} \alpha _{m,j}&={\left\{ \begin{array}{ll} 1 &{} \text {if } m = \mathop {\mathrm {arg\,min}}\nolimits _{m'\in [1,M]} \Vert { {{\mathbf x}_{j}}}-\mathbf {d}_{m'}\Vert _2^2\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(5)
$$\begin{aligned} h_m&=\frac{1}{N}\sum _{j=1}^{N}\alpha _{m,j} \end{aligned}$$
(6)

in which \(\Vert { {{\mathbf x}_{j}}}-\mathbf {d}_m\Vert _2\) is the Euclidean distance between the j-th descriptor and the m-th codeword. A soft version of this approach, so-called soft-assignment (SA), attributes \({ {{\mathbf x}_{j}}}\) to its n nearest codewords and usually presents better results than the hard version. The Fisher vector (FV), derived from the Fisher kernel [7], is another mid-level representation; it is a generic framework that combines the benefits of generative and discriminative approaches. FV was applied to large-scale image categorization in [11] by using an improved Fisher vector.
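Before detailing FV and VLAD, the following Python sketch gives a concrete instance of the hard-assignment coding and average pooling of Eqs. (5) and (6), using a k-means codebook. The function name vq_bow and the toy data are ours; in our experiments, the descriptors are the SIFT-based features described in Sect. 3.

```python
import numpy as np
from sklearn.cluster import KMeans

def vq_bow(descriptors, codebook):
    """Hard vector quantization followed by average pooling (Eqs. 5-6)."""
    # coding: alpha[j, m] = 1 iff codeword m is the closest one to descriptor j
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    alpha = np.zeros_like(dists)
    alpha[np.arange(len(descriptors)), dists.argmin(axis=1)] = 1.0
    # pooling (Eq. 6): h_m is the mean activation of codeword m over all descriptors
    return alpha.mean(axis=0)

# toy usage: random vectors stand in for the SIFT-based descriptors of one series
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 8))                              # N=200, d=8
codebook = KMeans(n_clusters=16, n_init=10,
                  random_state=0).fit(descriptors).cluster_centers_  # M=16 codewords
z = vq_bow(descriptors, codebook)                                    # M-dimensional BoW
```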

Vector of Locally Aggregated Descriptors (VLAD) [8] can be viewed as a simplification of the Fisher kernel representation that keeps only the first-order statistics. The Fisher kernel is based on the computation of two gradients, as follows [11],

$$\begin{aligned} \mathcal {G}_{\mu ,k}^{{ {{\mathbf x}_{j}}}}&=\frac{1}{\sqrt{\pi _k}}\gamma _k\bigg (\frac{{ {{\mathbf x}_{j}}}-\mu _k}{\sigma _{k}}\bigg ),\end{aligned}$$
(7)
$$\begin{aligned} \mathcal {G}_{\sigma ,k}^{{ {{\mathbf x}_{j}}}}&=\frac{1}{\sqrt{2\pi _k}}\gamma _k\Bigg [\frac{{({ {{\mathbf x}_{j}}}-\mu _k)}^2}{\sigma _{k}^{2}}-1\Bigg ] , \end{aligned}$$
(8)

where \(\gamma _k\), \(\mu _k\) and \(\sigma _{k}\) are the weight of the local descriptor, the mean and the standard deviation, respectively, related to the \(k^{th}\) Gaussian component of the mixture. The final Fisher vector is the concatenation of these gradients over the K components and is defined by

$$\begin{aligned} \mathrm {FV:}~~~\mathcal {S}=[\mathcal {G}_{\mu ,1}^{{ {{\mathbf x}_{j}}}},\mathcal {G}_{\sigma ,1}^{{ {{\mathbf x}_{j}}}},\ldots ,\mathcal {G}_{\mu ,K}^{{ {{\mathbf x}_{j}}}},\mathcal {G}_{\sigma ,K}^{{ {{\mathbf x}_{j}}}}] \end{aligned}$$
(9)
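As an illustration, the sketch below computes such a Fisher vector from a diagonal-covariance GMM fitted with scikit-learn: the per-descriptor gradients of Eqs. (7) and (8) are averaged over the set of descriptors and concatenated as in Eq. (9). The function name and the final L2 normalization are our choices and may differ from the exact formulation of [11].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Fisher vector (Eqs. 7-9): gradients w.r.t. the means and standard
    deviations of a diagonal-covariance GMM, averaged over the descriptors
    and concatenated into a vector of size 2 * K * d."""
    gamma = gmm.predict_proba(descriptors)          # soft assignments, shape (N, K)
    mu, w = gmm.means_, gmm.weights_                # (K, d), (K,)
    sigma = np.sqrt(gmm.covariances_)               # (K, d) with covariance_type='diag'
    parts = []
    for k in range(gmm.n_components):
        diff = (descriptors - mu[k]) / sigma[k]     # (x_j - mu_k) / sigma_k
        g_mu = (gamma[:, k, None] * diff).mean(axis=0) / np.sqrt(w[k])
        g_sigma = (gamma[:, k, None] * (diff ** 2 - 1)).mean(axis=0) / np.sqrt(2 * w[k])
        parts.extend([g_mu, g_sigma])
    fv = np.concatenate(parts)
    return fv / (np.linalg.norm(fv) + 1e-12)        # L2 normalization

# toy usage: random vectors stand in for the SIFT-based descriptors of one series
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 8))
gmm = GaussianMixture(n_components=16, covariance_type='diag',
                      random_state=0).fit(descriptors)
fv = fisher_vector(descriptors, gmm)                # length 2 * 16 * 8 = 256
```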

Furthermore, VLAD can be defined as follows:

$$\begin{aligned} \mathrm {VLAD:}~~~\mathcal {S}=[\mathbf {0},\ldots ,\mathbf {s}(i)({{ {{\mathbf x}_{j}}}-\mathbf {d}_i}),\ldots ,\mathbf {0}] \end{aligned}$$
(10)

in which \(\mathbf {s}(i)\) is the \(i^{th}\) element of the VQ assignment vector and is equal to 1, and \(\mathbf {d}_i\) is the closest codeword to \({{ {{\mathbf x}_{j}}}}\). Locality-constrained Linear Coding (LLC) [14] is a mid-level strategy that incorporates a linear reconstruction term during the coding step. It iteratively optimizes the produced code using the coordinate descent method. At each iteration, each descriptor is weighted and projected into the coordinate system given by its locality constraint. In the end, the basis vectors that most reduce the distance between the descriptor and the codebook are selected, and all other coefficients are set to zero.
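Both encoders can be sketched compactly in Python. The vlad function below aggregates the residuals of Eq. (10) over all descriptors assigned to each codeword, while llc_code uses the analytical k-nearest-codeword approximation of LLC from [14] instead of the iterative coordinate-descent variant mentioned above; function names, the regularization constant and the final normalization are our choices.

```python
import numpy as np

def vlad(descriptors, codebook):
    """VLAD (Eq. 10): accumulate, for every codeword, the residuals of the
    descriptors assigned to it, then concatenate and L2-normalize."""
    M, d = codebook.shape
    nearest = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :],
                             axis=2).argmin(axis=1)       # hard assignment s(i)
    v = np.zeros((M, d))
    for i in range(M):
        assigned = descriptors[nearest == i]
        if len(assigned):
            v[i] = (assigned - codebook[i]).sum(axis=0)   # residuals x_j - d_i
    v = v.ravel()                                         # concatenation, size M * d
    return v / (np.linalg.norm(v) + 1e-12)                # L2 normalization

def llc_code(x, codebook, k=5, eps=1e-4):
    """Analytical LLC approximation for one descriptor: reconstruct x from its
    k nearest codewords; all other coefficients are set to zero."""
    M, _ = codebook.shape
    idx = np.argsort(np.linalg.norm(codebook - x, axis=1))[:k]  # k nearest codewords
    B = codebook[idx] - x                                       # shifted basis, (k, d)
    C = B @ B.T
    C += eps * np.trace(C) * np.eye(k)                          # regularization
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                                                # sum-to-one constraint
    alpha = np.zeros(M)
    alpha[idx] = w
    return alpha                                                # sparse LLC code of x
```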

5 Experimental Analysis

In this section, we describe the experiments conducted to investigate the impact, in terms of classification performance, of more powerful encoding methods applied to densely extracted features for TSC.

5.1 Experimental Setup

Experiments are conducted on the 84 currently available datasets from the UCR repository, the largest on-line database for time series classification. We ignored 2 datasets: (i) StarLightCurves, due to the large number of instances; and (ii) ItalyPowerDemand, due to the small length of the series. All datasets are split into a training and a test set, whose sizes vary between less than 20 and more than 8,000 time series. For a given dataset, all time series have the same length, ranging from 24 to more than 2,500 points. For computing the mid-level representations, we extracted the SIFT-based descriptors proposed in [3] using dense sampling. For computing the codebook and the GMM, we used the following numbers of clusters, \(\{16, 32, 64, 128, 256, 512, 1024\}\), considering a sampling of \(30\%\) of the descriptors. The representations are normalized by the L2-norm. The best sets of parameters are obtained by 5-fold cross-validation and used in an SVM with RBF kernel for the classification. In the following, we present specific details of the setup for these representations. It is important to note that we adapted the framework proposed in [10], which was used for evaluating video action recognition.

Here, “D” denotes the SIFT-based descriptors obtained by dense sampling as low-level features. For D-VQ, we followed VQ [16] and the final representation is obtained by sum pooling. For D-LLC, the final representation is obtained by max and sum pooling. For D-SA, we set the number of nearest codewords to 5 and used max and sum pooling for the final representation. For D-VLAD, we used \(n=5\) nearest codewords and sum pooling.
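For clarity, the overall evaluation pipeline can be sketched as follows, reusing the extract_descriptors and vlad functions from the previous sketches. The synthetic train_series, train_labels and test_series variables merely stand in for a real UCR train/test split, and the SVM parameter grid is illustrative rather than the exact grid used in our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def encode_dataset(series_list, codebook, encoder):
    """Encode every time series with a mid-level encoder (e.g. vq_bow or vlad)."""
    return np.vstack([encoder(extract_descriptors(s), codebook) for s in series_list])

# synthetic stand-ins for a UCR train/test split (replace with real data loading)
rng = np.random.default_rng(0)
train_series = [rng.normal(size=128).cumsum() for _ in range(40)]
train_labels = rng.integers(0, 2, size=40)
test_series = [rng.normal(size=128).cumsum() for _ in range(40)]

# codebook learned on a 30% sample of the training descriptors
all_desc = np.vstack([extract_descriptors(s) for s in train_series])
sample = all_desc[rng.choice(len(all_desc), size=int(0.3 * len(all_desc)), replace=False)]
codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(sample).cluster_centers_

# D-VLAD representation + SVM with RBF kernel tuned by 5-fold cross-validation
X_train = encode_dataset(train_series, codebook, vlad)
grid = GridSearchCV(SVC(kernel='rbf'),
                    {'C': [1, 10, 100], 'gamma': ['scale', 0.01, 0.1]}, cv=5)
grid.fit(X_train, train_labels)
predictions = grid.predict(encode_dataset(test_series, codebook, vlad))
```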

Table 1. Comparison to the state-of-the-art in terms of classification rates and ranking.

5.2 Comparison to the State-of-the-Art Methods

In order to study the impact of specially designed mid-level representations on TSC, we focus on two different analyses. In the first one, we compare the studied representations, namely D-VLAD and D-LLC, to D-VQ and D-BoTSW, which are our baselines. In the second one, we present a comparative analysis between the state-of-the-art, namely D-BoTSW [3], BoP [9], BOSS [12] and COTE [2], and the proposed use of mid-level descriptions. In both cases, we report the average accuracy rate and rank, summarized in Table 1. As illustrated, D-VLAD obtained the best results among the studied mid-level representations, and it is very competitive with the state-of-the-art, being better than D-BoTSW, BoP and BOSS. When compared to our baseline D-BoTSW, D-VLAD statistically outperforms it according to a paired t-test at \(70\%\) confidence; for confidence levels greater than 70%, D-BoTSW and D-VLAD are statistically equivalent. Concerning the comparison of D-LLC and D-VLAD to D-VQ, we observed that: (i) the studied specially designed mid-level representations performed better than D-VQ, which confirms our initial assumptions; and (ii) D-VLAD presented the best results in terms of classification rates and average rank among single mid-level representations. Moreover, as illustrated in Fig. 2, the pairwise comparisons involving the tested representations showed that D-VLAD outperformed all SIFT-based representations and BOSS. Furthermore, COTE statistically outperformed all tested representations.

Fig. 2. Pairwise comparison of classification rates between SIFT-based mid-level representations and the state-of-the-art.

6 Conclusions

Time series classification is a challenging task due to its diverse applications in real life. Among the several kinds of approaches, dictionary-based ones have received much attention in recent years. In general, they are based on the extraction of feature vectors from time series, the creation of a codebook from the extracted feature vectors and, finally, the representation of time series as traditional Bags-of-Words. In this work, we studied the impact of more discriminative and accurate mid-level representations for describing time series, taking into account SIFT-based descriptors [3].

According to our experiments, D-VLAD and D-LLC outperform the classical BoW representation, namely vector quantization (VQ). Moreover, we achieved competitive results when compared to state-of-the-art methods, mainly with D-VLAD. Using the average rank as a comparative criterion, D-VLAD is very competitive with COTE, which represents the state-of-the-art. However, as shown by the pairwise comparison (Fig. 2) involving both methods, COTE remains slightly better than D-VLAD in terms of average rank. Thus, the use of more accurate mid-level representations in conjunction with SIFT-based descriptors seems to be a very interesting approach to cope with time series classification. From our results and observations, we believe that a future study of the normalization and distance functions would be interesting in order to understand their impact on our method, since, according to [8], reducing the influence of frequent codewords could be profitable.