Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3085504.3085515acmotherconferencesArticle/Chapter ViewAbstractPublication PagesssdbmConference Proceedingsconference-collections
research-article

Data Series Similarity Using Correlation-Aware Measures

Published: 27 June 2017 Publication History

Abstract

The increased availability of unprecedented amounts of sequential data (generated by Internet-of-Things, as well as scientific applications) has led in the past few years to a renewed interest and attention to the field of data series processing and analysis. Data series collections are processed and analyzed using a large variety of techniques, most of which are based on the computation of some distance function. In this study, we revisit this basic operation of data series distance calculation. We observe that the popular distance measures are oblivious to the correlations inherent in neighboring values in a data series. Therefore, we evaluate the plausibility and benefit of incorporating into the distance function measures of correlation, which enable us to capture the associations among neighboring values in the sequence. We propose four such measures, inspired by statistical and probabilistic approaches, which can effectively model these correlations. We analytically and experimentally demonstrate the benefits of the new measures using the 1NN classification task, and discuss the lessons learned. Finally, we propose future research directions for enabling the proposed measures to be used in practice.

References

[1]
Adhd-200. http://fcon_1000.projects.nitrc.org/indi/adhd200/, 2011.
[2]
Sloan digital sky survey. https://www.sdss3.org/drl0/data_access/volume.php, 2015.
[3]
R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. pages 69--84. Springer Verlag, 1993.
[4]
J. Aßfalg, H. Kriegel, P. Kröger, P. Kunath, A. Pryakhin, and M. Renz. Similarity search on time series based on threshold queries. Advances in Database Technology-EDBT 2006, pages 276--294, 2006.
[5]
D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In AAAIWS, pages 359--370, 1994.
[6]
G. Beskales, M. Soliman, and I. Ilyas. Efficient search for the top-k probable nearest neighbors in uncertain databases. Proceedings of the VLDB Endowment, 1(1):326--339, 2008.
[7]
A. Camerra, T. Palpanas, J. Shieh, and E. Keogh. isax 2.0: Indexing and mining one billion time series. In ICDM, 2010.
[8]
A. Camerra, J. Shieh, T. Palpanas, T. Rakthanmanon, and E. Keogh. Beyond one billion time series: indexing and mining very large time series collections with isax2+. KAIS, 39(1):123--151, 2014.
[9]
L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In Proceedings of the ACM SIGMOD international conference on Management of data, pages 491--502, 2005.
[10]
Y. Chen, M. A. Nascimento, B. C. Ooi, and A. K. Tung. Spade: On shape-based pattern detection in streaming time series. In International Conference on Data Engineering (ICDE)., pages 786--795, 2007.
[11]
M. Dallachiesa, B. Nushi, K. Mirylenka, and T. Palpanas. Similarity matching for uncertain time series: Analytical and experimental comparison. QUeST '11, pages 8--15. ACM, 2011.
[12]
M. Dallachiesa, B. Nushi, K. Mirylenka, and T. Palpanas. Uncertain time-series similarity: return to the basics. Proceedings of the VLDB Endowment, 5(11):1662--1673, 2012.
[13]
M. Dallachiesa, T. Palpanas, and I. F. Ilyas. Top-k nearest neighbor search in uncertain data series. PVLDB, 8(1):13--24, 2014.
[14]
G. Das, D. Gunopulos, and H. Mannila. Pkdd. Principles of Data Mining and Knowledge Discovery, pages 88--100, 1997.
[15]
B. Dasarathy. Nearest Unlike Neighbor (NUN): An Aid to Decision Confidence Estimation. In Optical Engineering 34, 1995.
[16]
H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment, 1(2):1542--1552, 2008.
[17]
M.-P. Dubuisson and A. Jain. A modified hausdorff distance for object matching. In Pattern Recognition, 1994. Vol. 1 - Conference A: Computer Vision amp; Image Processing., Proceedings of the 12th IAPR International Conference on, volume 1, pages 566--568 vol.1, 1994.
[18]
R. P. W. Duin and P. Paclik. Prototype selection for dissimilarity-based classifiers. Pattern Recognition, 39:189--208, 2006.
[19]
M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani. Mining the stock market (extended abstract): which measure is best? In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '00, pages 487--496. ACM, 2000.
[20]
X. Ge and P. Smyth. Deformable markov model templates for time-series pattern matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '00, pages 81--90. ACM, 2000.
[21]
P. Huijse, P. A. Estévez, P. Protopapas, J. C. Principe, and P. Zegers. Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE Comp. Int. Mag., 9(3):27--39, 2014.
[22]
D. W. Jacobs, D. Weinshall, and Y. Gdalyahu. Classification with non-metric distances: Image retrieval and class representation. IEEE Trans. Pattern Anal. Mach. Intell., 22(6):583600, 2000.
[23]
K. Kalpakis, D. Gada, and V. Puttagunta. Distance measures for effective clustering of arima time-series. In ICDM, pages 273--280, 2001.
[24]
K. Kashino, G. Smith, and H. Murase. Time-series active search for quick retrieval of audio and video. In ICASSP, 1999.
[25]
E. Keogh, X. Xi, L. Wei, and C. Ratanamahatana. The UCR Time Series Classification/Clustering Homepage, 2011.
[26]
A. Kotsifakos, V. Athitsos, and P. Papapetrou. Query-sensitive distance measure selection for time series nearest neighbor classification. IDA, 20(1):5--27, 2016.
[27]
L. Li, B. A. Prakash, and C. Faloutsos. Parsimonious linear fingerprinting for time series. PVLDB, 3(1):385--396, 2010.
[28]
J. Lin, R. Khade, and Y. Li. Rotation-invariant similarity in time series using bag-of-patterns representation. J. Intell. Inf. Syst., 39(2), 2012.
[29]
K. Mirylenka, V. Christophides, T. Palpanas, I. Pefkianakis, and M. May. Characterizing home device usage from wireless traffic time series. In EDBT, pages 539--550, 2016.
[30]
K. Mirylenka, G. Cormode, T. Palpanas, and D. Srivastava. Conditional heavy hitters: detecting interesting correlations in data streams. The VLDB Journal, 24(3):395--414, 2015.
[31]
K. Mirylenka, M. Dallachiesa, and T. Palpanas. Correlation-aware distance measures for data series. In Proceedings of the 20th International Conference on Extending Database Technology, EDBT 2017, Venice, Italy, March 21-24, 2017., pages 502--505, 2017.
[32]
K. Mirylenka, C. Miksovic, and P. Scotton. Recurrent neural networks for modeling company-product time series. In Proceedings of AALTD 2016: Second ECML/PKDD International Workshop on Advanced Analytics and Learning on Temporal Data, pages 29--36, 2016.
[33]
K. Mirylenka, T. Palpanas, G. Cormode, and D. Srivastava. Finding interesting correlations with conditional heavy hitters. In ICDE, pages 1069--1080, 2013.
[34]
G. Moody and R. Mark. The impact of the mit-bih arrhythmia database. Engineering in Medicine and Biology Magazine, IEEE, 20(3):45--50, may-june 2001.
[35]
T. Palpanas. Data series management: The road to big sequence analytics. SIGMOD Record, 44(2):47--52, 2015.
[36]
T. Palpanas. Big sequence management: A glimpse of the past, the present, and the future. In International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), pages 63--80, 2016.
[37]
S. Papadimitriou, J. Sun, and P. S. Yu. Local correlation tracking in time series. In ICDM, pages 456--465, 2006.
[38]
P. Paraskevopoulos, T.-C. Dinh, Z. Dashdorj, T. Palpanas, and L. Serafini. Identification and characterization of human behavior patterns from mobile phone data. In D4D Challenge session, NetMob, 2013.
[39]
T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In KDD, pages 262--270, 2012.
[40]
U. Raza, A. Camerra, A. L. Murphy, T. Palpanas, and G. P. Picco. Practical data prediction for real-world wireless sensor networks. IEEE Trans. Knowl. Data Eng., accepted for publication, 2015.
[41]
D. Shasha. Tuning time series queries in finance: Case studies and recommendations. IEEE Data Eng. Bull., 22(2):40--46, 1999.
[42]
J. Shieh and E. J. Keogh. isax: indexing and mining terabyte sized time series. In KDD, pages 623--631, 2008.
[43]
Z. R. Struzik and A. Siebes. Measuring time series' similarity through large singular features revealed with wavelet transformation. In International Workshop on Database & Expert Systems Applications (DEXA), 1999.
[44]
T. Warren Liao. Clustering of time series data: a survey. Pattern Recognition, 38(11):1857--1874, 2005.
[45]
L. Ye and E. J. Keogh. Time series shapelets: a new primitive for data mining. In KDD, 2009.
[46]
K. Zoumpatianos, S. Idreos, and T. Palpanas. ADS: the adaptive data series index. VLDB J., 25(6):843--866, 2016.
[47]
K. Zoumpatianos, Y. Lou, T. Palpanas, and J. Gehrke. Query workloads for data series indexes. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, pages 1603--1612, 2015.

Cited By

View all
  • (2022)ProS: data series progressive k-NN similarity search and classification with probabilistic quality guaranteesThe VLDB Journal10.1007/s00778-022-00771-z32:4(763-789)Online publication date: 30-Nov-2022
  • (2021)Presentation of Coupling Analysis Techniques of Maximum and Minimum Values Between N Sets of Data Using Matrix [µ][MKN]International Journal of Mathematical, Engineering and Management Sciences10.33889/IJMEMS.2021.6.4.0676:4(1127-1136)Online publication date: 18-Jul-2021
  • (2020)Debunking Four Long-Standing Misconceptions of Time-Series Distance MeasuresProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3389760(1887-1905)Online publication date: 11-Jun-2020
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Other conferences
SSDBM '17: Proceedings of the 29th International Conference on Scientific and Statistical Database Management
June 2017
373 pages
ISBN:9781450352826
DOI:10.1145/3085504
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

  • Northwestern University: Northwestern University

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 June 2017

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. data series
  2. distance measure
  3. sequences
  4. similarity search
  5. time series

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SSDBM '17

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)16
  • Downloads (Last 6 weeks)1
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2022)ProS: data series progressive k-NN similarity search and classification with probabilistic quality guaranteesThe VLDB Journal10.1007/s00778-022-00771-z32:4(763-789)Online publication date: 30-Nov-2022
  • (2021)Presentation of Coupling Analysis Techniques of Maximum and Minimum Values Between N Sets of Data Using Matrix [µ][MKN]International Journal of Mathematical, Engineering and Management Sciences10.33889/IJMEMS.2021.6.4.0676:4(1127-1136)Online publication date: 18-Jul-2021
  • (2020)Debunking Four Long-Standing Misconceptions of Time-Series Distance MeasuresProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3389760(1887-1905)Online publication date: 11-Jun-2020
  • (2020)Correlation-Based Analytics of Time Series Data2020 IEEE International Conference on Big Data (Big Data)10.1109/BigData50022.2020.9378155(4482-4491)Online publication date: 10-Dec-2020
  • (2020)Evolution of a Data Series IndexInformation Search, Integration, and Personalization10.1007/978-3-030-44900-1_5(68-83)Online publication date: 27-Mar-2020
  • (2020)Linking IT Product RecordsMachine Learning and Knowledge Discovery in Databases10.1007/978-3-030-43887-6_9(101-111)Online publication date: 28-Mar-2020
  • (2019)Return of the Lernaean HydraProceedings of the VLDB Endowment10.14778/3368289.336830313:3(403-420)Online publication date: 1-Nov-2019
  • (2019)DCS: A Policy Framework for the Detection of Correlated Data StreamsReal-Time Business Intelligence and Analytics10.1007/978-3-030-24124-7_12(191-210)Online publication date: 11-Oct-2019
  • (2018)The lernaean hydra of data series similarity searchProceedings of the VLDB Endowment10.14778/3282495.328249812:2(112-127)Online publication date: 1-Oct-2018
  • (2018)Strategies for Detection of Correlated Data StreamsProceedings of the 5th International Workshop on Exploratory Search in Databases and the Web10.1145/3214708.3214714(1-6)Online publication date: 15-Jun-2018

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media