research-article

Addressing Big Data Time Series: Mining Trillions of Time Series Subsequences Under Dynamic Time Warping

Published: 01 September 2013

Abstract

Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms, including classification, clustering, motif discovery, anomaly detection, and so on. The difficulty of scaling a search to large datasets explains to a great extent why most academic work on time series data mining has plateaued at considering a few millions of time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine massive time series for the first time. We demonstrate the following unintuitive fact: in large datasets we can exactly search under Dynamic Time Warping (DTW) much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We explain how our ideas allow us to solve higher-level time series data mining problems such as motif discovery and clustering at scales that would otherwise be untenable. Moreover, we show how our ideas allow us to efficiently support the uniform scaling distance measure, a measure whose utility seems to be underappreciated, but which we demonstrate here. In addition to mining massive datasets with up to one trillion datapoints, we will show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower powered devices than are currently possible.
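The abstract does not spell out the four ideas, so the following is not the authors' algorithm. It is only a minimal sketch of the general setting the abstract describes: an exact best-match subsequence search under DTW, where a cheap lower bound lets most candidates be discarded before the full DTW computation. It assumes a Sakoe-Chiba warping band of half-width `r`, z-normalized windows, squared distances, and an envelope-style lower bound in the spirit of LB_Keogh. All function names (`znorm`, `envelope`, `lb_keogh`, `dtw`, `best_match`) are hypothetical.

```python
import math


def znorm(x):
    """Z-normalize a window; a constant window maps to all zeros."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in x]


def envelope(q, r):
    """Running min/max of q over a window of half-width r."""
    n = len(q)
    lower = [min(q[max(0, i - r): i + r + 1]) for i in range(n)]
    upper = [max(q[max(0, i - r): i + r + 1]) for i in range(n)]
    return lower, upper


def lb_keogh(c, lower, upper, best_so_far=math.inf):
    """Envelope lower bound on squared DTW, with early abandoning.

    If the running sum reaches best_so_far, the loop stops; the partial
    sum returned is still a valid lower bound.
    """
    total = 0.0
    for v, lo, hi in zip(c, lower, upper):
        if v > hi:
            total += (v - hi) ** 2
        elif v < lo:
            total += (lo - v) ** 2
        if total >= best_so_far:
            break
    return total


def dtw(q, c, r):
    """Squared DTW distance between equal-length q and c, band half-width r."""
    n = len(q)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (n + 1)
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return prev[n]


def best_match(series, query, r):
    """Return (position, squared DTW distance, number of pruned windows)."""
    m = len(query)
    q = znorm(query)
    lower, upper = envelope(q, r)
    best_dist, best_pos, pruned = math.inf, -1, 0
    for start in range(len(series) - m + 1):
        c = znorm(series[start:start + m])
        # Skip the quadratic DTW step when the bound already rules c out.
        if lb_keogh(c, lower, upper, best_dist) >= best_dist:
            pruned += 1
            continue
        d = dtw(q, c, r)
        if d < best_dist:
            best_dist, best_pos = d, start
    return best_pos, best_dist, pruned


if __name__ == "__main__":
    pattern = [0, 1, 2, 3, 2, 1, 0, -1]
    series = ([math.sin(0.7 * i) for i in range(30)]
              + [5 + 2 * v for v in pattern]
              + [math.cos(0.3 * i) for i in range(30)])
    print(best_match(series, pattern, r=2))
```

Because the bound never exceeds the true banded DTW distance, pruning on it cannot discard the true best match, so the search stays exact. This sketch recomputes the normalization for every window, which a serious implementation would avoid.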

    Published In

    ACM Transactions on Knowledge Discovery from Data  Volume 7, Issue 3
    Special Issue on ACM SIGKDD 2012
    September 2013
    156 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/2513092
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 September 2013
    Accepted: 01 February 2013
    Revised: 01 December 2012
    Received: 01 August 2012
    Published in TKDD Volume 7, Issue 3

    Author Tags

    1. Time series
    2. Lower bounds
    3. Similarity search

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics & Citations

    Article Metrics

    • Downloads (last 12 months): 157
    • Downloads (last 6 weeks): 14
    Reflects downloads up to 17 Oct 2024

    Cited By

    • (2024) Efficient Discovery of Time Series Motifs under both Length Differences and Warping. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1188-1198. DOI: 10.1145/3637528.3671726. Online publication date: 25-Aug-2024.
    • (2024) Pattern sequence-based algorithm for multivariate big data time series forecasting. Future Generation Computer Systems 154:C, 397-412. DOI: 10.1016/j.future.2023.12.021. Online publication date: 1-May-2024.
    • (2023) Bidirectional piecewise linear representation of time series with application to collective anomaly detection. Advanced Engineering Informatics 58:C. DOI: 10.1016/j.aei.2023.102155. Online publication date: 1-Oct-2023.
    • (2023) Discovering time series motifs of all lengths using dynamic time warping. World Wide Web 26:6, 3815-3836. DOI: 10.1007/s11280-023-01207-6. Online publication date: 1-Nov-2023.
    • (2023) MERLIN++: parameter-free discovery of time series anomalies. Data Mining and Knowledge Discovery 37:2, 670-709. DOI: 10.1007/s10618-022-00876-7. Online publication date: 16-Jan-2023.
    • (2023) Turbo Scan: Fast Sequential Nearest Neighbor Search in High Dimensions. Similarity Search and Applications, 103-110. DOI: 10.1007/978-3-031-46994-7_9. Online publication date: 9-Oct-2023.
    • (2022) PhytoNodes for Environmental Monitoring: Stimulus Classification based on Natural Plant Signals in an Interactive Energy-efficient Bio-hybrid System. Proceedings of the 2022 ACM Conference on Information Technology for Social Good, 258-264. DOI: 10.1145/3524458.3547266. Online publication date: 7-Sep-2022.
    • (2022) Towards scalable and reusable predictive models for cyber twins in manufacturing systems. Journal of Intelligent Manufacturing 33:2, 441-455. DOI: 10.1007/s10845-021-01804-0. Online publication date: 1-Feb-2022.
    • (2022) Functional classwise principal component analysis: a classification framework for functional data analysis. Data Mining and Knowledge Discovery 37:2, 552-594. DOI: 10.1007/s10618-022-00898-1. Online publication date: 2-Dec-2022.
    • (2021) Influence Maximization in Multi-Relational Social Networks. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 4193-4202. DOI: 10.1145/3459637.3481928. Online publication date: 26-Oct-2021.
