
1 Introduction

Time series can be seen as series of ordered measurements. They contain temporal information that needs to be taken into account when dealing with such data. Time series classification (TSC) can be defined as follows: given a collection of unlabeled time series, assign each time series to one of a predefined set of classes. TSC has recently been receiving more and more attention due to its diverse applications in real-life problems involving, for example, data mining, statistics, machine learning and image processing.

An extensive comparison of TSC approaches is performed in [1]. Two particular methods stand out from other core classifiers for their accuracy: COTE [2] and BOSS [12]. BOSS is a dictionary-based approach based on the extraction of Fourier coefficients from time series windows. Many other dictionary-based approaches have been proposed recently [3, 4]. These methods share the same overall steps: (i) extraction of feature vectors from time series; (ii) creation of a codebook (composed of codewords) from extracted feature vectors; and (iii) representation of time series as a histogram of codeword appearances, called a Bag-of-Words (BoW).

Dictionary-based approaches are well suited to TSC. Nevertheless, two drawbacks of such methods can be pointed out: (i) global temporal information is lost when representing a time series as a BoW; and (ii) extracted features are quantized using a dictionary, which inherently induces a loss in the precision of the time series representation. In this paper, we tackle this second issue. In particular, we study the impact of mid-level representations (widely used for image and video analysis) on time series classification. We focus on mid-level features that aim at enhancing the BoW representation by a more accurate description of the distribution of feature vectors related to an object. To the best of our knowledge, such representations have never been used for time series. Vector of Locally Aggregated Descriptors [8] and Locality-constrained Linear Coding [14] are examples of such mid-level representations.

This paper is organized as follows. Section 2 reviews related work on time series classification. Section 3 provides background on SIFT-based descriptors extracted from time series. In Sect. 4, we present a methodology for time series classification using powerful mid-level representations built on SIFT-based descriptors. Section 5 details the experimental setup and the results that validate the method, and finally, some conclusions are drawn in Sect. 6.

2 Related Work

In this section, we give an overview of related work on TSC. One of the earliest methods for this task is the combination of the 1-nearest-neighbor classifier and the Dynamic Time Warping distance. It has been a baseline for TSC for many years thanks to its good performance. Recently, more sophisticated approaches have been designed for TSC.

Shapelets, for instance, were introduced in [15]. They are subsequences of time series that are able to discriminate between classes. Hills et al. proposed the shapelet transform [6], which consists in transforming a time series into a vector whose components are the distances between the time series and different shapelets extracted beforehand. Classifiers, such as SVMs, can then be applied to these vectorial representations of time series.

Numerous approaches have been designed based on the Bag-of-Words framework. This framework consists in extracting feature vectors from time series, creating a dictionary of words using these extracted features, and then representing each time series as a histogram of word occurrences. The approaches proposed in the literature differ mainly in the kind of features that are extracted. Local features such as mean, variance and extrema are considered in [4], and Fourier coefficients in [12]. SAX coefficients are used in [9]. Recently, SIFT-based descriptors adapted to time series have been considered as feature vectors in [3].

All the methods based on the BoW framework create a dictionary of words by quantizing the set (or a subset) of extracted features. This quantization step induces a loss when representing time series as histograms of word occurrences. In this paper, we aim at improving the accuracy of time series representations such as BoW through the adoption of specially designed mid-level representations.

3 Background on SIFT-Based Feature Extraction

The work proposed in this paper aims at improving the classical BoW representation for time series. More precisely, the idea is to build an accurate vectorial representation that models a set of feature vectors extracted from time series. The mid-level representations used in this paper can be applied to any kind of feature vectors extracted from time series. In this paper, we choose to use the SIFT-based descriptors proposed in [3], illustrated in Fig. 1. We briefly explain in this section how such descriptors are computed.

Fig. 1. SIFT-based descriptors for time series proposed in [3]: a time series and its extracted key-points. A key-point is described by vectors representing the gradients in its neighborhood, at different scales.

First, key-points are extracted regularly every \(\tau _{step}\) instants. Then, each key-point is described by different feature vectors representing its neighborhood at different scales. More precisely, let \(L(\mathcal {S},\sigma )\) be the convolution of a time series \(\mathcal {S}\) with a Gaussian function \(G(t,\sigma )\) of width \(\sigma \) computed by \(L(\mathcal {S},\sigma )= \mathcal {S} * G(t,\sigma )\) in which

$$\begin{aligned} G(t,\sigma )=\frac{1}{\sqrt{2\pi }\sigma }e^{-t^2/2\sigma ^2} \end{aligned}$$
(1)

For description, \(n_b\) blocks of size a are selected in \(L(\mathcal {S},\sigma )\) around each key-point. Each of these blocks is described by the gradient magnitudes of the points in the block. More precisely, the sum of positive gradients and the sum of negative gradients are computed in each block. Hence, the size of a key-point feature vector is \(2\times n_b\). The key-points are described at many different scales, thereby transforming a time series into a set of feature vectors. More details about SIFT-based descriptors can be found in [3].
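To make this step concrete, the following Python sketch extracts such dense SIFT-like descriptors from a univariate time series. The function name extract_descriptors and the default parameter values are ours for illustration; the actual implementation of [3] may differ in its handling of scales, block layout and normalization.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def extract_descriptors(series, tau_step=8, n_blocks=4, block_size=8,
                        scales=(1.0, 2.0, 4.0)):
    """Dense SIFT-like descriptors for a 1-D time series (illustrative sketch).

    Key-points are taken every `tau_step` instants; at each scale, a key-point
    is described by the sums of positive and negative gradients inside
    `n_blocks` blocks of `block_size` points around it (2 * n_blocks values)."""
    descriptors = []
    half = n_blocks * block_size // 2
    for sigma in scales:
        smoothed = gaussian_filter1d(np.asarray(series, dtype=float), sigma)  # L(S, sigma)
        grad = np.gradient(smoothed)
        for t in range(half, len(smoothed) - half, tau_step):
            feat = []
            for b in range(n_blocks):
                block = grad[t - half + b * block_size:
                             t - half + (b + 1) * block_size]
                feat.append(block[block > 0].sum())   # sum of positive gradients
                feat.append(-block[block < 0].sum())  # magnitude of negative gradients
            descriptors.append(feat)
    return np.array(descriptors)  # shape: (#key-points x #scales, 2 * n_blocks)
```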

4 Classification of Time Series by Using Their Mid-Level Representation

The methodology for classifying time series using mid-level representations can be divided into four main steps: (i) dense extraction and description of key-points in time series; (ii) coding of the key-point descriptors, using clustering methods; (iii) feature encoding followed by vector concatenation in order to create the final mid-level representation; and (iv) classification of time series using their mid-level representation. Here, it is important to note that each time series is represented by a single mid-level description. In the following, we briefly discuss some of the representations used here to describe a set of low-level features.

Let \({\mathbb X}=\left\{ { {{\mathbf x}_{j}}}\in {\mathbb R}^{ d}\right\} _{j=1}^{N}\) be an unordered set of d-dimensional descriptors \({ {{\mathbf x}_{j}}}\) extracted from the data. Let also \(\left\{ \mathbf {d}_m\right\} _{m=1}^{M}\) be the codebook learned by an unsupervised clustering algorithm, composed of a set of M codewords \(\mathbf {d}_m\), also called prototypes or representatives. Consider \(\mathbb {Z}\in {\mathbb R}^{M}\) as the final vector mid-level representation. As formalized in [5], the mapping from \(\mathbb X\) to \(\mathbb {Z}\) can be decomposed into three successive steps: (i) coding; (ii) pooling; and (iii) concatenation, as follows:

$$\begin{aligned} \alpha _j&=f({ {{\mathbf x}_{j}}}), j\in [1,N] &\text{(coding) } \end{aligned}$$
(2)
$$\begin{aligned} h_m&=g(\alpha _m=\{\alpha _{m,j}\}_{j=1}^{N}),m\in [1,M]&\text{(pooling) } \end{aligned}$$
(3)
$$\begin{aligned} z&=[h_1^T,\dots ,h_M^T] &\text{(concatenation) } \end{aligned}$$
(4)

In vector quantization (VQ) [13], the coding function f assigns each descriptor to its closest codeword, and the pooling function averages the resulting codes, as follows:

$$\begin{aligned} \alpha _{m,j}&={\left\{ \begin{array}{ll} 1 &{} \text {if } m = \mathop {\mathrm {arg\,min}}\nolimits _{m'\in [1,M]} \Vert { {{\mathbf x}_{j}}}-\mathbf {d}_{m'}\Vert _2^2\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(5)
$$\begin{aligned} h_m&=\frac{1}{N}\sum _{j=1}^{N}\alpha _{m,j} \end{aligned}$$
(6)

in which \(\Vert { {{\mathbf x}_{j}}}-\mathbf {d}_m\Vert _2\) is the Euclidean distance between the j-th descriptor and the m-th codeword. A soft version of this approach, so-called soft-assignment (SA), attributes \({ {{\mathbf x}_{j}}}\) to its n nearest codewords and usually presents better results than the hard version. The Fisher vector (FV), derived from the Fisher kernel [7], is another mid-level representation; it is a generic framework that combines the benefits of generative and discriminative approaches. FV was applied to large-scale image categorization in [11] by using an improved Fisher vector.
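Before detailing FV and VLAD, the following Python sketch gives a concrete instance of the hard-assignment coding and average pooling of Eqs. (5) and (6), using a k-means codebook. The function name vq_bow and the toy data are ours; in our experiments, the descriptors are the SIFT-based features described in Sect. 3.

```python
import numpy as np
from sklearn.cluster import KMeans

def vq_bow(descriptors, codebook):
    """Hard vector quantization followed by average pooling (Eqs. 5-6)."""
    # coding: alpha[j, m] = 1 iff codeword m is the closest one to descriptor j
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    alpha = np.zeros_like(dists)
    alpha[np.arange(len(descriptors)), dists.argmin(axis=1)] = 1.0
    # pooling (Eq. 6): h_m is the mean activation of codeword m over all descriptors
    return alpha.mean(axis=0)

# toy usage: random vectors stand in for the SIFT-based descriptors of one series
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 8))                              # N=200, d=8
codebook = KMeans(n_clusters=16, n_init=10,
                  random_state=0).fit(descriptors).cluster_centers_  # M=16 codewords
z = vq_bow(descriptors, codebook)                                    # M-dimensional BoW
```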

Vector of Locally Aggregated Descriptors (VLAD) [8] can be viewed as a simplification of the Fisher kernel representation that keeps only the first-order statistics. The Fisher kernel is based on the computation of two gradients, as follows [11],

$$\begin{aligned} \mathcal {G}_{\mu ,k}^{{ {{\mathbf x}_{j}}}}&=\frac{1}{\sqrt{\pi _k}}\gamma _k\bigg (\frac{{ {{\mathbf x}_{j}}}-\mu _k}{\sigma _{k}}\bigg ),\end{aligned}$$
(7)
$$\begin{aligned} \mathcal {G}_{\sigma ,k}^{{ {{\mathbf x}_{j}}}}&=\frac{1}{\sqrt{2\pi _k}}\gamma _k\Bigg [\frac{{({ {{\mathbf x}_{j}}}-\mu _k)}^2}{\sigma _{k}^{2}}-1\Bigg ] , \end{aligned}$$
(8)

where \(\gamma _k\), \(\mu _k\) and \(\sigma _{k}\) are the weight of the local descriptor, the mean and the standard deviation, respectively, related to the \(k^{th}\) Gaussian component of the mixture. The final Fisher vector is the concatenation of these gradients over the K components and is defined by

$$\begin{aligned} \mathrm {FV:}~~~\mathcal {S}=[\mathcal {G}_{\mu ,1}^{{ {{\mathbf x}_{j}}}},\mathcal {G}_{\sigma ,1}^{{ {{\mathbf x}_{j}}}},\ldots ,\mathcal {G}_{\mu ,K}^{{ {{\mathbf x}_{j}}}},\mathcal {G}_{\sigma ,K}^{{ {{\mathbf x}_{j}}}}] \end{aligned}$$
(9)
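As an illustration, the sketch below computes such a Fisher vector from a diagonal-covariance GMM fitted with scikit-learn: the per-descriptor gradients of Eqs. (7) and (8) are averaged over the set of descriptors and concatenated as in Eq. (9). The function name and the final L2 normalization are our choices and may differ from the exact formulation of [11].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Fisher vector (Eqs. 7-9): gradients w.r.t. the means and standard
    deviations of a diagonal-covariance GMM, averaged over the descriptors
    and concatenated into a vector of size 2 * K * d."""
    gamma = gmm.predict_proba(descriptors)          # soft assignments, shape (N, K)
    mu, w = gmm.means_, gmm.weights_                # (K, d), (K,)
    sigma = np.sqrt(gmm.covariances_)               # (K, d) with covariance_type='diag'
    parts = []
    for k in range(gmm.n_components):
        diff = (descriptors - mu[k]) / sigma[k]     # (x_j - mu_k) / sigma_k
        g_mu = (gamma[:, k, None] * diff).mean(axis=0) / np.sqrt(w[k])
        g_sigma = (gamma[:, k, None] * (diff ** 2 - 1)).mean(axis=0) / np.sqrt(2 * w[k])
        parts.extend([g_mu, g_sigma])
    fv = np.concatenate(parts)
    return fv / (np.linalg.norm(fv) + 1e-12)        # L2 normalization

# toy usage: random vectors stand in for the SIFT-based descriptors of one series
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 8))
gmm = GaussianMixture(n_components=16, covariance_type='diag',
                      random_state=0).fit(descriptors)
fv = fisher_vector(descriptors, gmm)                # length 2 * 16 * 8 = 256
```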

Furthermore, VLAD can be defined as follows:

$$\begin{aligned} \mathrm {VLAD:}~~~\mathcal {S}=[\mathbf {0},\ldots ,\mathbf {s}(i)({{ {{\mathbf x}_{j}}}-\mathbf {d}_i}),\ldots ,\mathbf {0}] \end{aligned}$$
(10)

in which \(\mathbf {s}(i)\) is the \(i^{th}\) element of the VQ assignment vector and is equal to 1, and \(\mathbf {d}_i\) is the closest codeword to \({{ {{\mathbf x}_{j}}}}\). Locality-constrained Linear Coding (LLC) [14] is a mid-level strategy that incorporates a linear reconstruction term during the coding step. It iteratively optimizes the produced code using the coordinate descent method. At each iteration, each descriptor is weighted and projected into the coordinate system given by its locality constraint. In the end, the basis vectors that most reduce the distance between the descriptor and the codebook are selected, and all other coefficients are set to zero.
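Both encoders can be sketched compactly in Python. The vlad function below aggregates the residuals of Eq. (10) over all descriptors assigned to each codeword, while llc_code uses the analytical k-nearest-codeword approximation of LLC from [14] instead of the iterative coordinate-descent variant mentioned above; function names, the regularization constant and the final normalization are our choices.

```python
import numpy as np

def vlad(descriptors, codebook):
    """VLAD (Eq. 10): accumulate, for every codeword, the residuals of the
    descriptors assigned to it, then concatenate and L2-normalize."""
    M, d = codebook.shape
    nearest = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :],
                             axis=2).argmin(axis=1)       # hard assignment s(i)
    v = np.zeros((M, d))
    for i in range(M):
        assigned = descriptors[nearest == i]
        if len(assigned):
            v[i] = (assigned - codebook[i]).sum(axis=0)   # residuals x_j - d_i
    v = v.ravel()                                         # concatenation, size M * d
    return v / (np.linalg.norm(v) + 1e-12)                # L2 normalization

def llc_code(x, codebook, k=5, eps=1e-4):
    """Analytical LLC approximation for one descriptor: reconstruct x from its
    k nearest codewords; all other coefficients are set to zero."""
    M, _ = codebook.shape
    idx = np.argsort(np.linalg.norm(codebook - x, axis=1))[:k]  # k nearest codewords
    B = codebook[idx] - x                                       # shifted basis, (k, d)
    C = B @ B.T
    C += eps * np.trace(C) * np.eye(k)                          # regularization
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                                                # sum-to-one constraint
    alpha = np.zeros(M)
    alpha[idx] = w
    return alpha                                                # sparse LLC code of x
```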

5 Experimental Analysis

In this section, we describe the experiments conducted to investigate the impact, in terms of classification performance, of more powerful encoding methods applied to densely extracted features for TSC.

5.1 Experimental Setup

Experiments are conducted on the 84 currently available datasets from the UCR repository, the largest on-line database for time series classification. We ignored 2 datasets: (i) StarLightCurves, due to the large number of instances; and (ii) ItalyPowerDemand, due to the small length of the series. All datasets are split into a training and a test set, whose sizes vary between less than 20 and more than 8,000 time series. For a given dataset, all time series have the same length, ranging from 24 to more than 2,500 points. For computing the mid-level representations, we extracted the SIFT-based descriptors proposed in [3] using dense sampling. For computing the codebook and the GMM, we used the following numbers of clusters, \(\{16, 32, 64, 128, 256, 512, 1024\}\), considering a sampling of \(30\%\) of the descriptors. The representations are normalized by the L2-norm. The best sets of parameters are obtained by 5-fold cross-validation and used in an SVM with RBF kernel for the classification. In the following, we present specific details of the setup for these representations. It is important to note that we adapted the framework proposed in [10], which was used for evaluating video action recognition.

Here, “D” denotes the SIFT-based descriptors obtained by dense sampling as low-level features. For D-VQ, we followed VQ [16] and the final representation is obtained by sum pooling. For D-LLC, the final representation is obtained by max and sum pooling. For D-SA, we set the number of nearest codewords to 5 and used max and sum pooling for the final representation. For D-VLAD, we used \(n=5\) nearest codewords and sum pooling.
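For clarity, the overall evaluation pipeline can be sketched as follows, reusing the extract_descriptors and vlad functions from the previous sketches. The synthetic train_series, train_labels and test_series variables merely stand in for a real UCR train/test split, and the SVM parameter grid is illustrative rather than the exact grid used in our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def encode_dataset(series_list, codebook, encoder):
    """Encode every time series with a mid-level encoder (e.g. vq_bow or vlad)."""
    return np.vstack([encoder(extract_descriptors(s), codebook) for s in series_list])

# synthetic stand-ins for a UCR train/test split (replace with real data loading)
rng = np.random.default_rng(0)
train_series = [rng.normal(size=128).cumsum() for _ in range(40)]
train_labels = rng.integers(0, 2, size=40)
test_series = [rng.normal(size=128).cumsum() for _ in range(40)]

# codebook learned on a 30% sample of the training descriptors
all_desc = np.vstack([extract_descriptors(s) for s in train_series])
sample = all_desc[rng.choice(len(all_desc), size=int(0.3 * len(all_desc)), replace=False)]
codebook = KMeans(n_clusters=64, n_init=10, random_state=0).fit(sample).cluster_centers_

# D-VLAD representation + SVM with RBF kernel tuned by 5-fold cross-validation
X_train = encode_dataset(train_series, codebook, vlad)
grid = GridSearchCV(SVC(kernel='rbf'),
                    {'C': [1, 10, 100], 'gamma': ['scale', 0.01, 0.1]}, cv=5)
grid.fit(X_train, train_labels)
predictions = grid.predict(encode_dataset(test_series, codebook, vlad))
```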

Table 1. Comparison to the state-of-the-art in terms of classification rates and ranking.

5.2 Comparison to the State-of-the-Art Methods

In order to study the impact of specially designed mid-level representations on TSC, we focus on two different analyses. In the first one, we compare the studied representations, namely D-VLAD and D-LLC, to D-VQ and D-BoTSW, which are our baselines. In the second one, we present a comparative analysis between the state-of-the-art, namely D-BoTSW [3], BoP [9], BOSS [12] and COTE [2], and the proposed use of mid-level descriptions. In both cases, we report the average accuracy rate and rank, summarized in Table 1. As illustrated, D-VLAD obtained the best results among the studied mid-level representations, and it is very competitive with the state-of-the-art, being better than D-BoTSW, BoP and BOSS. When compared to our baseline D-BoTSW, D-VLAD statistically outperforms it according to a paired t-test at \(70\%\) confidence; for confidence levels greater than 70%, D-BoTSW and D-VLAD are statistically equivalent. Concerning the comparison of D-LLC and D-VLAD to D-VQ, we observed that: (i) the studied specially designed mid-level representations performed better than D-VQ, which confirms our initial assumptions; and (ii) D-VLAD presented the best results in terms of classification rates and average rank among single mid-level representations. Moreover, as illustrated in Fig. 2, the pairwise comparisons involving the tested representations showed that D-VLAD outperformed all SIFT-based representations and BOSS. Furthermore, COTE statistically outperformed all tested representations.

Fig. 2. Pairwise comparison of classification rates between SIFT-based mid-level representations and the state-of-the-art.

6 Conclusions

Time series classification is a challenging task due to its diverse applications in real life. Among the several kinds of approaches, dictionary-based ones have received much attention in recent years. In general, they are based on the extraction of feature vectors from time series, the creation of a codebook from the extracted feature vectors and, finally, the representation of time series as traditional Bags-of-Words. In this work, we studied the impact of more discriminative and accurate mid-level representations for describing time series, taking into account SIFT-based descriptors [3].

According to our experiments, D-VLAD and D-LLC outperform the classical BoW representation, namely vector quantization (VQ). Moreover, we achieved competitive results when compared to state-of-the-art methods, mainly with D-VLAD. Using the average rank as a comparative criterion, D-VLAD is very competitive with COTE, which represents the state-of-the-art. However, as shown by the pairwise comparison (Fig. 2) involving both methods, COTE remains slightly better than D-VLAD in terms of average rank. Thus, the use of more accurate mid-level representations in conjunction with SIFT-based descriptors seems to be a very interesting approach to cope with time series classification. From our results and observations, we believe that a future study of the normalization and distance functions would be interesting in order to understand their impact on our method, since, according to [8], reducing the influence of frequent codewords could be profitable.