Abstract
Time series classification has been widely explored in recent years. Among the best approaches for this task, many are based on the Bag-of-Words framework, in which time series are transformed into a histogram of word occurrences. These words represent quantized features that are extracted beforehand. In this paper, we evaluate the use of accurate mid-level representations to enhance the Bag-of-Words representation. More precisely, this kind of representation reduces the loss induced by feature quantization. Experiments show that these representations are likely to improve time series classification accuracy compared to Bag-of-Words, and that some of them are very competitive with the state-of-the-art.
This work received funding from CAPES (STIC-AmSUD TRANSFORM 88881.143258/2017-01), FAPEMIG (PPM 00006-16), and CNPq (Universal 421521/2016-3 and PQ 307062/2016-3).
1 Introduction
Time series can be seen as series of ordered measurements. They contain temporal information that needs to be taken into account when dealing with such data. Time series classification (TSC) can be defined as follows: given a collection of unlabeled time series, assign each time series to one of a predefined set of classes. TSC has been receiving more and more attention recently due to its diverse applications in real-life problems involving, for example, data mining, statistics, machine learning and image processing.
An extensive comparison of TSC approaches is performed in [1]. Two particular methods stand out from other core classifiers for their accuracy: COTE [2] and BOSS [12]. BOSS is a dictionary-based approach based on the extraction of Fourier coefficients from time series windows. Many other dictionary-based approaches have been proposed recently [3, 4]. These methods share the same overall steps: (i) extraction of feature vectors from time series; (ii) creation of a codebook (composed of codewords) from extracted feature vectors; and (iii) representation of time series as a histogram of codeword appearances, called a Bag-of-Words (BoW).
Dictionary-based approaches are well adapted to TSC. Nevertheless, two drawbacks of such methods can be pointed out: (i) global temporal information is lost when representing a time series as a BoW; and (ii) extracted features are quantized using a dictionary, which inherently induces a loss in the precision of the time series representation. In this paper, we tackle this second issue. In particular, we study the impact of mid-level representations (widely used for image and video analysis) on time series classification. We focus on mid-level features that aim at enhancing the BoW representation by describing more accurately the distribution of feature vectors related to an object. To the best of our knowledge, such representations have never been used for time series. Vector of Locally Aggregated Descriptors [8] and Locality-constrained Linear Coding [14] are examples of such mid-level representations.
This paper is organized as follows. Section 2 reviews related work on time series classification. Section 3 provides background on SIFT-based descriptors extracted from time series. In Sect. 4, we present a methodology for time series classification that applies powerful mid-level representations to SIFT-based descriptors. Section 5 details the experimental setup and the results that validate the method, and finally, some conclusions are drawn in Sect. 6.
2 Related Work
In this section, we give an overview of related work on TSC. One of the earliest methods for this task is the combination of a 1-nearest-neighbor classifier with the Dynamic Time Warping distance. Thanks to its good performance, it has been a baseline for TSC for many years. Recently, more sophisticated approaches have been designed for TSC.
Shapelets, for instance, were introduced in [15]. They are subsequences of time series that are able to discriminate between classes. Hills et al. proposed the shapelet transform [6], which consists in transforming a time series into a vector whose components represent the distance between the time series and different shapelets, extracted beforehand; a minimal sketch of this transform is given below. Classifiers, such as SVM, can then be used with the vectorial representations of time series.
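The following Python sketch illustrates the shapelet transform (it is not the authors' code): it assumes shapelets have already been extracted, and uses the common definition of the distance between a time series and a shapelet as the minimum Euclidean distance over all same-length subsequences; the toy series and shapelets are hypothetical.

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between a shapelet and every
    same-length subsequence of the series."""
    m = len(shapelet)
    return min(np.linalg.norm(series[i:i + m] - shapelet)
               for i in range(len(series) - m + 1))

def shapelet_transform(series_list, shapelets):
    """Each time series becomes a vector of its distances to the shapelets."""
    return np.array([[shapelet_distance(s, sh) for sh in shapelets]
                     for s in series_list])

# Toy example: two series, two (hypothetical) pre-extracted shapelets.
series_list = [np.sin(np.linspace(0, 6, 100)), np.cos(np.linspace(0, 6, 100))]
shapelets = [np.sin(np.linspace(0, 1, 15)), np.ones(15)]
X = shapelet_transform(series_list, shapelets)  # shape (2, 2), ready for an SVM
```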
Numerous approaches have been designed based on the Bag-of-Words framework. This framework consists in extracting feature vectors from time series, creating a dictionary of words using these extracted features, and then representing each time series as a histogram of word occurrences. The different approaches proposed in the literature differ mainly in the kind of features that are extracted. Local features such as mean, variance and extrema are considered in [4], and Fourier coefficients in [12]. SAX coefficients are used in [9]. Recently, SIFT-based descriptors adapted to time series have been considered as feature vectors in [3].
All the methods based on the BoW framework create a dictionary of words by quantizing the set (or a subset) of extracted features. This quantization step induces a loss when representing time series as histograms of word occurrences. In this paper, we aim at improving the accuracy of time series representations such as BoW through the adoption of specially designed mid-level representations.
3 Background on SIFT-Based Feature Extraction
The work proposed in this paper aims at improving the classical BoW representation for time series. More precisely, the idea is to build an accurate vectorial representation that models a set of feature vectors extracted from time series. The mid-level representations used in this paper can be applied to any kind of feature vectors extracted from time series. We choose to use the SIFT-based descriptors proposed in [3], illustrated in Fig. 1. We briefly explain in this section how such descriptors are computed.
First, key-points are extracted regularly every \(\tau _{step}\) instants. Then, each key-point is described by different feature vectors representing its neighborhood at different scales. More precisely, let \(L(\mathcal {S},\sigma )\) be the convolution of a time series \(\mathcal {S}\) with a Gaussian function \(G(t,\sigma )\) of width \(\sigma \), computed by \(L(\mathcal {S},\sigma )= \mathcal {S} * G(t,\sigma )\), in which

$$G(t,\sigma ) = \frac{1}{\sigma \sqrt{2\pi }}\exp \left( -\frac{t^{2}}{2\sigma ^{2}}\right) .$$
For description, \(n_b\) blocks of size a are selected in \(L(\mathcal {S},\sigma )\) around each key-point. Each of these blocks is described by the gradient magnitudes of the points in the block. More precisely, the sum of positive gradients and the sum of negative gradients are computed in each block. Hence, the size of a key-point feature vector is \(2\times n_b\). The key-points are described at many different scales, thereby transforming a time series into a set of feature vectors. More details about SIFT-based descriptors can be found in [3].
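As a rough illustration of this extraction step, the following sketch computes dense SIFT-based descriptors along the lines of [3]; the default values of \(\tau _{step}\), the scales, \(n_b\) and a, as well as the exact handling of gradient signs, are our assumptions rather than the settings of [3].

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def dense_sift_descriptors(series, tau_step=8, scales=(1.0, 2.0, 4.0),
                           n_b=4, a=8):
    """Dense key-points every tau_step instants; at each scale, n_b blocks
    of size a around a key-point are each summarized by the sum of positive
    gradients and the (magnitude of the) sum of negative gradients."""
    descriptors = []
    half = n_b * a // 2
    for sigma in scales:
        smoothed = gaussian_filter1d(series.astype(float), sigma)  # L(S, sigma)
        grad = np.gradient(smoothed)
        for t in range(half, len(series) - half, tau_step):
            feats = []
            for b in range(n_b):
                block = grad[t - half + b * a : t - half + (b + 1) * a]
                feats.append(block[block > 0].sum())          # positive gradients
                feats.append(np.abs(block[block < 0].sum()))  # negative gradients
            descriptors.append(feats)  # one 2 * n_b vector per key-point/scale
    return np.array(descriptors)
```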
4 Classification of Time Series by Using Their Mid-Level Representation
The methodology for classifying time series using mid-level representations can be divided into four main steps: (i) dense extraction and description of key-points in time series; (ii) coding of the key-point descriptors, using clustering methods; (iii) feature encoding followed by vector concatenation in order to create the final mid-level representation; and (iv) classification of time series using their mid-level representation. Here, it is important to note that each time series is represented by a single mid-level description. In the following, we briefly discuss some of the representations used here for describing a set of low-level features.
Let \({\mathbb X}=\left\{ { {{\mathbf x}_{j}}}\in {\mathbb R}^{ d}\right\} _{j=1}^{N}\) be an unordered set of d-dimensional descriptors \({ {{\mathbf x}_{j}}}\) extracted from the data. Let also \({\mathbb C}=\left\{ {\mathbf c}_{m}\in {\mathbb R}^{d}\right\} _{m=1}^{M}\) be the codebook learned by an unsupervised clustering algorithm, composed of a set of M codewords, also called prototypes or representatives. Consider \(\mathbb {Z}\in {\mathbb R}^{M}\) as the final mid-level vector representation. As formalized in [5], the mapping from \(\mathbb X\) to \(\mathbb {Z}\) can be decomposed into three successive steps: (i) coding; (ii) pooling; and (iii) concatenation, as follows:

$$\varvec{\alpha }_{j} = f({ {{\mathbf x}_{j}}}), \; j=1,\dots ,N \qquad \text {(coding)}$$
$$h_{m} = g\left( \{\alpha _{jm}\}_{j=1}^{N}\right) , \; m=1,\dots ,M \qquad \text {(pooling)}$$
$$\mathbb {Z} = \left[ h_{1},\dots ,h_{M}\right] \qquad \text {(concatenation)}$$
In vector quantization (VQ) [13], the coding function f assigns each descriptor to its closest codeword and the pooling function accumulates these assignments, as follows:

$$\alpha _{jm} = {\left\{ \begin{array}{ll} 1 &{} \text {if } m = \mathop {\mathrm {arg\,min}}\limits _{m'} \Vert { {{\mathbf x}_{j}}}-{\mathbf c}_{m'}\Vert _{2}\\ 0 &{} \text {otherwise,} \end{array}\right. } \qquad h_{m} = \sum _{j=1}^{N}\alpha _{jm},$$

in which \(\Vert { {{\mathbf x}_{j}}}-{\mathbf c}_{m}\Vert _{2}\) is the Euclidean distance between the j-th descriptor and the m-th codeword. A soft version of this approach, the so-called soft-assignment (SA), attributes \({ {{\mathbf x}_{j}}}\) to its n nearest codewords, and usually yields better results than the hard version. The Fisher vector (FV), derived from the Fisher kernel [7], is another mid-level representation; it is a generic framework that combines the benefits of generative and discriminative approaches, and it was introduced for large-scale image categorization in [11] in an improved version.
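A minimal sketch of these two coding schemes, assuming a learned codebook; the Gaussian weighting and the beta parameter of the soft assignment are our assumptions, not a prescription of the papers above.

```python
import numpy as np

def vq_bow(X, codebook):
    """Hard vector quantization: each descriptor votes for its nearest
    codeword; sum pooling yields the Bag-of-Words histogram."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, M) distances
    hist = np.bincount(d2.argmin(axis=1), minlength=len(codebook)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)  # L2-normalized (cf. Sect. 5.1)

def soft_assignment(X, codebook, n=5, beta=1.0):
    """Soft assignment: each descriptor is spread over its n nearest
    codewords with (assumed) Gaussian-kernel weights, then sum-pooled."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    z = np.zeros(len(codebook))
    for j in range(len(X)):
        idx = np.argsort(d2[j])[:n]          # n nearest codewords
        w = np.exp(-beta * d2[j, idx])
        z[idx] += w / w.sum()
    return z / (np.linalg.norm(z) + 1e-12)
```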
Vector of Locally Aggregated Descriptors (VLAD) [8] can be viewed as a simplification of the Fisher kernel representation that keeps only the first-order statistics. The Fisher kernel combines the benefits of generative and discriminative approaches and is based on the computation of two gradients, as follows [11]:

$$\mathcal {G}_{\mu ,k} = \frac{1}{N\sqrt{w_{k}}}\sum _{j=1}^{N}\gamma _{j}(k)\left( \frac{{ {{\mathbf x}_{j}}}-\mu _{k}}{\sigma _{k}}\right) , \qquad \mathcal {G}_{\sigma ,k} = \frac{1}{N\sqrt{2w_{k}}}\sum _{j=1}^{N}\gamma _{j}(k)\left[ \frac{({ {{\mathbf x}_{j}}}-\mu _{k})^{2}}{\sigma _{k}^{2}}-1\right] ,$$

where \(\gamma _{j}(k)\), \(\mu _{k}\) and \(\sigma _{k}\) are, respectively, the soft-assignment weight of the local descriptor \({ {{\mathbf x}_{j}}}\), the mean and the standard deviation related to the \(k^{th}\) Gaussian of the mixture, and \(w_{k}\) is its mixture weight. The final Fisher vector is the concatenation of these gradients for the K models and is defined by

$$\mathbb {Z} = \left[ \mathcal {G}_{\mu ,1},\mathcal {G}_{\sigma ,1},\dots ,\mathcal {G}_{\mu ,K},\mathcal {G}_{\sigma ,K}\right] .$$
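A minimal sketch of this computation, assuming a diagonal-covariance GMM fitted with scikit-learn; the power and L2 normalizations follow the improved FV of [11].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Fisher vector for a GaussianMixture(covariance_type='diag'):
    per-component gradients w.r.t. means and standard deviations,
    concatenated into a 2 * K * d vector."""
    N, _ = X.shape
    gamma = gmm.predict_proba(X)          # (N, K) soft-assignment weights
    sigma = np.sqrt(gmm.covariances_)     # (K, d) std devs (diagonal case)
    parts = []
    for k in range(gmm.n_components):
        diff = (X - gmm.means_[k]) / sigma[k]
        g_mu = (gamma[:, k:k + 1] * diff).sum(0) / (N * np.sqrt(gmm.weights_[k]))
        g_sig = ((gamma[:, k:k + 1] * (diff ** 2 - 1)).sum(0)
                 / (N * np.sqrt(2 * gmm.weights_[k])))
        parts += [g_mu, g_sig]
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))    # power normalization [11]
    return fv / (np.linalg.norm(fv) + 1e-12)  # L2 normalization
```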
Furthermore, the Vector of Locally Aggregated Descriptors (VLAD) can be defined as follows:

$$\mathbf {v}_{i} = \sum _{j=1}^{N} \mathbf {s}_{j}(i)\left( { {{\mathbf x}_{j}}}-\mathbf {d}_{i}\right) ,$$

in which \(\mathbf {s}_{j}(i)\), the \(i^{th}\) element of the VQ code of \({ {{\mathbf x}_{j}}}\), is equal to 1 when \(\mathbf {d}_{i}\) is the closest visual word to \({ {{\mathbf x}_{j}}}\) and 0 otherwise; the final representation is the concatenation of the vectors \(\mathbf {v}_{i}\). Locality-constrained Linear Coding (LLC) [14] is a mid-level strategy that incorporates a linear reconstruction term during the coding step. It iteratively optimizes the produced code using the coordinate descent method. At each iteration, each descriptor is weighted and projected into the coordinate system defined by its locality constraint. At the end, the basis vectors that best minimize the distance between the descriptor and the codewords are selected, and all other coefficients are set to zero.
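A minimal sketch of the standard single-nearest-neighbor VLAD of [8]; note that the D-VLAD setup of Sect. 5.1 uses the n = 5 nearest codewords instead, a variant not shown here.

```python
import numpy as np

def vlad(X, codebook):
    """VLAD: residuals between descriptors and their nearest codeword
    are accumulated per codeword and concatenated."""
    M, d = codebook.shape
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)           # hard assignment, as in VQ
    v = np.zeros((M, d))
    for j, i in enumerate(nearest):
        v[i] += X[j] - codebook[i]        # aggregate residual x_j - d_i
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)  # L2 normalization (Sect. 5.1)
```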
5 Experimental Analysis
In this section, we describe our experiments, which investigate the impact, in terms of classification performance, of more powerful encoding methods applied to densely extracted features for TSC.
5.1 Experimental Setup
Experiments are conducted on the 84 currently available datasets from the UCR repository, the largest on-line database for time series classification. We ignored 2 datasets: (i) StarLightCurves, due to its large number of instances; and (ii) ItalyPowerDemand, due to the short length of its series. All datasets are split into a training and a test set, whose sizes vary between less than 20 and more than 8,000 time series. For a given dataset, all time series have the same length, ranging from 24 to more than 2,500 points. For computing the mid-level representations, we extracted the SIFT-based descriptors proposed in [3] using dense sampling. For computing the codebook and the GMM, we used the following numbers of clusters, \(\{16, 32, 64, 128, 256, 512, 1024\}\), considering a sample of \(30\%\) of the descriptors. The representations are normalized with the L2-norm. The best sets of parameters are obtained by 5-fold cross-validation and used in an SVM with RBF kernel for the classification. In the following, we present specific details of the setup for these representations. It is important to note that we adapted the framework proposed in [10], which was originally used for evaluating video action recognition.
Here, the prefix “D” indicates that SIFT-based descriptors extracted with dense sampling are used as low-level features. For D-VQ, we followed VQ [16], and the final representation is obtained by sum pooling. For D-LLC, the final representation is obtained by max and sum pooling. For D-SA, we set the number of nearest codewords to 5, and we used max and sum pooling for the final representation. For D-VLAD, we used the \(n=5\) nearest codewords and sum pooling.
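For concreteness, the sketch below assembles these pieces into the overall pipeline of Sect. 5.1 (codebook learned on a 30% sample of the descriptors, one encoding per series, RBF-SVM tuned by 5-fold cross-validation); the SVM hyper-parameter grid is our assumption, as the paper does not specify it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fit_pipeline(train_descr, train_labels, encode, n_words=64):
    """train_descr: list with one (N_i, d) descriptor array per series
    (e.g. from dense_sift_descriptors); encode: e.g. the vlad() sketch."""
    pool = np.vstack(train_descr)
    idx = np.random.choice(len(pool), size=int(0.3 * len(pool)), replace=False)
    codebook = KMeans(n_clusters=n_words).fit(pool[idx]).cluster_centers_
    Z = np.array([encode(X, codebook) for X in train_descr])  # one vector/series
    grid = {"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.1]}  # assumed grid
    clf = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(Z, train_labels)
    return codebook, clf
```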
5.2 Comparison to the State-of-the-Art Methods
In order to study the impact of specially designed mid-level representations on TSC, we focus on two different analyses. In the first one, we compare the studied representations, namely D-VLAD and D-LLC, to our baselines D-VQ and D-BoTSW. In the second one, we present a comparative analysis between the state-of-the-art, namely D-BoTSW [3], BoP [9], BOSS [12] and COTE [2], and the newly proposed use of mid-level descriptions. In both cases, we report the average accuracy rate and the average rank, summarized in Table 1. As illustrated, D-VLAD obtained the best results among the studied mid-level representations, and it is very competitive with the state-of-the-art, performing better than D-BoTSW, BoP and BOSS. According to a paired t-test, D-VLAD statistically outperforms our baseline D-BoTSW at a \(70\%\) confidence level; for confidence levels greater than 70%, D-BoTSW and D-VLAD are statistically equivalent. Concerning the comparison of D-LLC and D-VLAD to D-VQ, we observed that: (i) the studied specially designed mid-level representations performed better than D-VQ, which confirms our initial assumptions; and (ii) D-VLAD presented the best results of any single mid-level representation in terms of classification rates and average rank. Moreover, as illustrated in Fig. 2, the pairwise comparisons involving the tested representations show that D-VLAD outperformed all SIFT-based representations and BOSS. Furthermore, COTE statistically outperformed all tested representations.
6 Conclusions
Time series classification is a challenging task due to its diverse applications in real life. Among the several kinds of approaches, dictionary-based ones have received much attention in recent years. In general, these approaches are based on the extraction of feature vectors from time series, the creation of a codebook from the extracted feature vectors, and finally the representation of each time series as a traditional Bag-of-Words. In this work, we studied the impact of more discriminative and accurate mid-level representations for describing time series through SIFT-based descriptors [3].
According to our experiments, D-VLAD and D-LLC outperform the classical BoW representation, namely vector quantization (VQ). Moreover, we achieve competitive results when compared to the state-of-the-art methods, mainly with D-VLAD. Using the average rank as a comparative criterion, D-VLAD is very competitive with COTE, which represents the state-of-the-art; nonetheless, in the pairwise comparison (Fig. 2) involving both methods, COTE is slightly better than D-VLAD in terms of average rank. Thus, the use of more accurate mid-level representations in conjunction with SIFT-based descriptors seems to be a very interesting approach to cope with time series classification. From our results and observations, we believe that a future study of the normalization and distance functions would be interesting in order to understand their impact on our method, since, according to [8], reducing the influence of frequent codewords could be profitable.
References
Bagnall, A., Lines, J., Bostrom, A., Large, J., Keogh, E.: The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min. Knowl. Discov. 31(3), 606–660 (2017)
Bagnall, A., Lines, J., Hills, J., Bostrom, A.: Time-series classification with COTE: the collective of transformation-based ensembles. IEEE Trans. Knowl. Data Eng. 27(9), 2522–2535 (2015)
Bailly, A., Malinowski, S., Tavenard, R., Chapel, L., Guyet, T.: Dense bag-of-temporal-SIFT-words for time series classification. In: Douzal-Chouakria, A., Vilar, J.A., Marteau, P.-F. (eds.) AALTD 2015. LNCS (LNAI), vol. 9785, pp. 17–30. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44412-3_2
Baydogan, M.G., Runger, G., Tuv, E.: A bag-of-features framework to classify time series. IEEE PAMI 35(11), 2796–2802 (2013)
Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In: Proceedings of the CVPR 2010, pp. 2559–2566 (2010)
Hills, J., Lines, J., Baranauskas, E., Mapp, J., Bagnall, A.: Classification of time series by shapelet transformation. Data Min. Knowl. Discov. 28(4), 851–881 (2014)
Jaakkola, T., Haussler, D.: Exploiting generative models in discriminative classifiers. In: Advances in Neural Information Processing Systems, pp. 487–493 (1999)
Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., Schmid, C.: Aggregating local image descriptors into compact codes. IEEE PAMI 34(9), 1704–1716 (2012)
Lin, J., Khade, R., Li, Y.: Rotation-invariant similarity in time series using bag-of-patterns representation. J. Intell. Inf. Syst. 39(2), 287–315 (2012)
Peng, X., Wang, L., Wang, X., Qiao, Y.: Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. Comput. Vis. Image Underst. 150, 109–125 (2016)
Perronnin, F., Sánchez, J., Mensink, T.: Improving the fisher kernel for large-scale image classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 143–156. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_11
Schäfer, P.: The BOSS is concerned with time series classification in the presence of noise. Data Min. Knowl. Discov. 29(6), 1505–1530 (2015)
Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: Proceedings of the ICCV 2003, Nice, France, pp. 1470–1477 (2003)
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: Proceedings of the CVPR 2010, pp. 3360–3367 (2010)
Ye, L., Keogh, E.: Time series shapelets: a new primitive for data mining. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 947–956. ACM (2009)
Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: a comprehensive study. In: Proceedings of the CVPR 2006, p. 13. IEEE (2006)