Accelerating the discovery of unsupervised-shapelets

Zakaria, Jesin; Mueen, Abdullah; Keogh, Eamonn; Young, Neal

doi:10.1007/s10618-015-0411-4

Published: 07 May 2015

Data Mining and Knowledge Discovery, Volume 30, pages 243–281 (2016)

Abstract

Over the past decade, time series clustering has become an increasingly important research topic in the data mining community. Most existing methods for time series clustering rely on distances calculated from the entire raw data, using the Euclidean distance or Dynamic Time Warping distance as the distance measure. However, the presence of significant noise, dropouts, or extraneous data can greatly limit the accuracy of clustering in this domain. Moreover, for most real-world problems, we cannot expect objects from the same class to be equal in length. As a consequence, most work on time series clustering only considers the clustering of individual time series “behaviors,” e.g., individual heartbeats or individual gait cycles, and contrives the time series in some way to make them all equal in length. However, automatically formatting the data in such a way is often a harder problem than the clustering itself. In this work, we show that by using only some local patterns and deliberately ignoring the rest of the data, we can mitigate the above problems and cluster time series of different lengths, e.g., cluster one heartbeat with multiple heartbeats. To achieve this, we exploit and extend a recently introduced concept in time series data mining called shapelets. Unlike existing work, our work demonstrates the unintuitive fact that shapelets can be learned from unlabeled time series. We show, with extensive empirical evaluation in diverse domains, that our method is more accurate than existing methods. Moreover, in addition to accurate clustering results, we show that our work also has the potential to give insight into the domains to which it is applied. While a brute-force algorithm to discover shapelets in an unsupervised way could be untenably slow, we introduce two novel optimization procedures to significantly speed up the unsupervised-shapelet discovery process and allow it to be cast as an anytime algorithm.

Acknowledgments

Thanks to all the donors of the datasets. This work was funded by NSF IIS-1161997 and a gift from Siemens.

Author information

Authors and Affiliations

Department of Computer Science & Engineering, University of California, Riverside, Riverside, CA, 92521, USA
Jesin Zakaria, Abdullah Mueen, Eamonn Keogh & Neal Young

Corresponding author

Correspondence to Jesin Zakaria.

Additional information

Responsible editor: Ian Davidson

Appendix: Proof of Theorem 1

Theorem 1

$$\begin{aligned} \sqrt{{\sum }_{i=1}^{n-1} \left( {A_i -A_{i+1} } \right) ^{2}}-\sqrt{{\sum }_{i=1}^{n-1} \left( {B_i -B_{i+1} } \right) ^{2}}\le 2\sqrt{{\sum }_{i=1}^n \left( {A_i -B_i } \right) ^{2}} \end{aligned}$$

(10)

We will prove (10).

Proof

Without loss of generality, assume

$$\begin{aligned} \left( {\hbox {A}_1 -\hbox {B}_1 } \right) ^{2}\le \left( {\hbox {A}_\mathrm{n} -\hbox {B}_\mathrm{n} } \right) ^{2} \end{aligned}$$

(11)

(If not, reverse both $A$ and $B$. That is, replace $A$ with ($A_{n}, A_{n-1},{\ldots }, A_{1}$), and likewise for $B$. Note that (10) holds for $A$ and $B$ iff it holds for $A$ and $B$ reversed.)

Note that the inequality (10) is equivalent to

$$\begin{aligned} \hbox {d}\left( {\hbox {A},\hbox {A}^{\prime }} \right) -\hbox {d}(\hbox {B},\hbox {B}^{\prime })\le 2\hbox {d}(\hbox {A},\hbox {B}) \end{aligned}$$

(12)

where $d(X, Y)$ is the Euclidean distance, i.e., $d\left( {X,Y} \right) =\sqrt{{\sum }_{i=1}^n (X_i -Y_i )^{2}}$, and

$$\begin{aligned} A^{{\prime }}= & {} \left( {A_1 ,A_1 ,A_2 ,A_3 ,\ldots A_{n-1} } \right) ,\\ B^{{\prime }}= & {} \left( {B_1 ,B_1 ,B_2 ,B_3 ,\ldots B_{n-1} } \right) . \end{aligned}$$
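The equivalence of (10) and (12) rests on the identity $d(A,A^{\prime })=\sqrt{{\sum }_{i=1}^{n-1} (A_i -A_{i+1})^{2}}$: the first coordinate of $A - A^{\prime}$ is zero and the remaining ones are the successive differences of $A$. The following is a minimal numerical sketch of that identity, not part of the original proof; the function names `euclidean`, `successive_diff_norm`, and `shift_right`, the random test vector, and the seed are illustrative assumptions.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance d(X, Y) between two equal-length sequences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def successive_diff_norm(x):
    """Square root of the summed squared differences of consecutive points,
    i.e. one of the two terms on the left-hand side of (10)."""
    return math.sqrt(sum((x[i] - x[i + 1]) ** 2 for i in range(len(x) - 1)))


def shift_right(x):
    """Build X' = (X_1, X_1, X_2, ..., X_{n-1}) from X (hypothetical helper)."""
    return [x[0]] + list(x[:-1])


# Arbitrary test vector; the seed is fixed only for reproducibility.
rng = random.Random(0)
A = [rng.uniform(-1.0, 1.0) for _ in range(8)]

# d(A, A') should match the successive-difference norm of A.
identity_holds = math.isclose(euclidean(A, shift_right(A)),
                              successive_diff_norm(A))
print(identity_holds)
```

The same check applied to $B$ and $B^{\prime}$ gives the second term, which is why the left-hand sides of (10) and (12) coincide.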

So, to prove (10), it suffices to prove (12). Here is a proof of (12).

The triangle inequality holds for Euclidean distance. Applying it twice shows

$$\begin{aligned} \hbox {d}\left( {\hbox {A},\hbox {A}^{\prime }} \right) \le \hbox {d}\left( {\hbox {A},\hbox {B}} \right) +\hbox {d}\left( {\hbox {B},\hbox {A}^{\prime }} \right) \le \hbox {d}\left( {\hbox {A},\hbox {B}} \right) +\hbox {d}\left( {\hbox {B},\hbox {B}^{{\prime }}} \right) +\hbox {d}(\hbox {B}^{ {\prime }},\hbox {A}^{\prime }). \end{aligned}$$

Rearranging gives

$$\begin{aligned} \hbox {d}\left( {\hbox {A},\hbox {A}^{\prime }} \right) -\hbox {d}\left( {\hbox {B},\hbox {B}^{\prime }} \right) \le \hbox {d}\left( {\hbox {A},\hbox {B}} \right) +\hbox {d}(\hbox {A}^{\prime },\hbox {B}^{\prime }) \end{aligned}$$

(13)

By inspection, $d(A^{{\prime }},B^{\prime })^{2}=d(A,B)^{2}-\left( {A_n -B_n } \right) ^{2}+(A_1 -B_1 )^{2}$, so

$$\begin{aligned} \hbox {d}\left( {\hbox {A}^{\prime },\hbox {B}^{\prime }} \right) =\sqrt{\hbox {d}(\hbox {A},\hbox {B})^{2}-\left( {\hbox {A}_\mathrm{n} -\hbox {B}_\mathrm{n} } \right) ^{2}+\left( {\hbox {A}_1 -\hbox {B}_1 } \right) ^{2}} \end{aligned}$$

(14)

Together with (11), this gives

$$\begin{aligned} \hbox {d}\left( {\hbox {A}^{\prime },\hbox {B}^{\prime }} \right) \le \sqrt{\hbox {d}\left( {\hbox {A},\hbox {B}} \right) ^{2}}=\hbox {d}(\hbox {A},\hbox {B}) \end{aligned}$$

(15)

Substituting this into (13) gives

$$\begin{aligned} \hbox {d}\left( {\hbox {A},\hbox {A}^{\prime }} \right) -\hbox {d}\left( {\hbox {B},\hbox {B}^{\prime }} \right) \le \hbox {d}\left( {\hbox {A},\hbox {B}} \right) +\hbox {d}\left( {\hbox {A},\hbox {B}} \right) =2\hbox {d}(\hbox {A},\hbox {B}) \end{aligned}$$

(16)

proving (12) which proves (10). $\square $
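As a sanity check on the argument, the intermediate inequalities (13) and (15) and the final bound (12) can be tested on random inputs. The sketch below is an illustration only and proves nothing; the helper names, the Gaussian test data, the seed, and the floating-point tolerance `tol` are assumptions introduced here. It applies the same reversal as the proof whenever assumption (11) fails.

```python
import math
import random


def d(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def shift_right(x):
    """Build X' = (X_1, X_1, X_2, ..., X_{n-1}) from X (hypothetical helper)."""
    return [x[0]] + list(x[:-1])


def check_proof_steps(a, b, tol=1e-9):
    """Return True if (13), (15) and (12) all hold for a and b.

    tol absorbs floating-point rounding; it is not part of the theorem.
    """
    # Enforce assumption (11) by reversing both sequences if needed.
    if (a[0] - b[0]) ** 2 > (a[-1] - b[-1]) ** 2:
        a, b = a[::-1], b[::-1]
    a_s, b_s = shift_right(a), shift_right(b)
    lhs = d(a, a_s) - d(b, b_s)
    step13 = lhs <= d(a, b) + d(a_s, b_s) + tol   # inequality (13)
    step15 = d(a_s, b_s) <= d(a, b) + tol         # inequality (15)
    step12 = lhs <= 2 * d(a, b) + tol             # inequality (12)
    return step13 and step15 and step12


# Random pairs of equal-length sequences, lengths 2 through 49.
rng = random.Random(42)
trials = [
    ([rng.gauss(0, 1) for _ in range(n)], [rng.gauss(0, 1) for _ in range(n)])
    for n in range(2, 50)
]
all_ok = all(check_proof_steps(a, b) for a, b in trials)
print(all_ok)
```

Swapping the roles of `a` and `b` in the call checks the bound in the other direction, which is what bounds the absolute difference of the two left-hand terms.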

About this article

Cite this article

Zakaria, J., Mueen, A., Keogh, E. et al. Accelerating the discovery of unsupervised-shapelets. Data Min Knowl Disc 30, 243–281 (2016). https://doi.org/10.1007/s10618-015-0411-4

Received: 11 September 2013
Accepted: 26 February 2015
Published: 07 May 2015
Issue Date: January 2016
DOI: https://doi.org/10.1007/s10618-015-0411-4
