research-article

Query Workloads for Data Series Indexes

Authors: Kostas Zoumpatianos, Themis Palpanas, Johannes Gehrke

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 1603 - 1612

https://doi.org/10.1145/2783258.2783382

Published: 10 August 2015

Abstract

Data series are a prevalent data type that has attracted considerable interest in recent years. Most research has focused on how to efficiently support similarity or nearest neighbor queries over large data series collections (an important data mining task), and several data series summarization and indexing methods have been proposed to solve this problem. Nevertheless, very little attention has so far been paid to properly evaluating such index structures, with most previous work relying solely on randomly selected data series as queries (with or without added noise). In this work, we show that random workloads are inherently unsuitable for the task at hand, and we argue that query workloads need to be generated carefully. We define measures that capture the characteristics of queries, and we propose a method for generating workloads with the desired properties, that is, workloads that allow data series summarizations and indexes to be evaluated and compared effectively. In our experimental evaluation, with carefully controlled query workloads, we shed light on key factors affecting the performance of nearest neighbor search in large data series collections.
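
To make the baseline criticized in the abstract concrete, the following is a minimal sketch of a random query workload: queries are drawn at random from the collection, optionally perturbed with noise, and answered by an exact linear-scan nearest neighbor search. It is not the workload generation method proposed in the paper. The function names, the random-walk data, the Gaussian noise model, and the use of Euclidean distance are all assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length series (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def random_walk(length, rng):
    """Synthetic data series: cumulative sum of standard Gaussian steps (illustrative data)."""
    value, series = 0.0, []
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)
        series.append(value)
    return series


def random_workload(collection, num_queries, noise_std, rng):
    """Baseline workload: sample series from the collection, optionally adding Gaussian noise."""
    queries = []
    for _ in range(num_queries):
        base = rng.choice(collection)
        if noise_std > 0:
            queries.append([x + rng.gauss(0.0, noise_std) for x in base])
        else:
            queries.append(list(base))
    return queries


def nearest_neighbor(collection, query):
    """Exact 1-NN by linear scan; returns (index, distance)."""
    best_idx, best_dist = -1, math.inf
    for i, series in enumerate(collection):
        d = euclidean(series, query)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist


rng = random.Random(0)
collection = [random_walk(64, rng) for _ in range(200)]

# Workload without noise: every query is an exact copy of an indexed series.
exact_queries = random_workload(collection, 5, noise_std=0.0, rng=rng)
# Workload with noise: every query is a perturbed copy of an indexed series.
noisy_queries = random_workload(collection, 5, noise_std=0.5, rng=rng)

exact_dists = [nearest_neighbor(collection, q)[1] for q in exact_queries]
noisy_dists = [nearest_neighbor(collection, q)[1] for q in noisy_queries]
```

In this sketch, queries copied without noise always have a nearest neighbor at distance zero, and the noise level is the only knob controlling how far a query lies from the collection. This is one way to see why such workloads give little control over query characteristics.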

Index Terms

Query Workloads for Data Series Indexes
1. Information systems
  1. Data management systems
  2. Information retrieval
    1. Document representation

Published In

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 2015

2378 pages

ISBN:9781450336642

DOI:10.1145/2783258

General Chairs: Longbing Cao (University of Technology, Sydney), Chengqi Zhang (University of Technology, Sydney)

Program Chairs: Thorsten Joachims (Cornell University), Geoff Webb (Monash University), Dragos D. Margineantu (Boeing Research), Graham Williams (Australian Taxation Office)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSF

Conference

KDD '15

KDD '15: The 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 10 - 13, 2015

Sydney, NSW, Australia

Acceptance Rates

KDD '15 Paper Acceptance Rate 160 of 819 submissions, 20%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics

Total Citations: 29
Total Downloads: 384
Downloads (Last 12 months): 13
Downloads (Last 6 weeks): 3

Reflects downloads up to 27 Jan 2025

Cited By

Wang Z, Wang Q, Wang P, Palpanas T, Wang W (2024). DumpyOS: A data-adaptive multi-ary index for scalable data series similarity search. The VLDB Journal 33:6 (1887-1911). Online publication date: 21-Aug-2024.
https://doi.org/10.1007/s00778-024-00874-9

Chatzakis M, Fatourou P, Kosmas E, Palpanas T, Peng B (2023). Odyssey: A Journey in the Land of Distributed Data Series Similarity Search. Proceedings of the VLDB Endowment 16:5 (1140-1153). Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.14778/3579075.3579087

Wang Z, Wang Q, Wang P, Palpanas T, Wang W (2023). Dumpy: A Compact and Adaptive Index for Large Data Series Collections. Proceedings of the ACM on Management of Data 1:1 (1-27). Online publication date: 30-May-2023.
https://dl.acm.org/doi/10.1145/3588965

Fatourou P, Kosmas E, Palpanas T, Paterakis G (2023). FreSh: A Lock-Free Data Series Index. 2023 42nd International Symposium on Reliable Distributed Systems (SRDS) (209-220). Online publication date: 25-Sep-2023.
https://doi.org/10.1109/SRDS60354.2023.00029

Echihabi K, Fatourou P, Zoumpatianos K, Palpanas T, Benbrahim H (2022). Hercules against data series similarity search. Proceedings of the VLDB Endowment 15:10 (2005-2018). Online publication date: 7-Sep-2022.
https://dl.acm.org/doi/10.14778/3547305.3547308

Echihabi K, Palpanas T (2022). Scalable Analytics on Large Sequence Collections. 2022 23rd IEEE International Conference on Mobile Data Management (MDM) (5-8). Online publication date: Jun-2022.
https://doi.org/10.1109/MDM55031.2022.00022

Echihabi K, Tsandilas T, Gogolou A, Bezerianos A, Palpanas T (2022). ProS: data series progressive k-NN similarity search and classification with probabilistic quality guarantees. The VLDB Journal 32:4 (763-789). Online publication date: 30-Nov-2022.
https://doi.org/10.1007/s00778-022-00771-z

Gogolou A, Tsandilas T, Echihabi K, Bezerianos A, Palpanas T (2020). Data Series Progressive Similarity Search with Probabilistic Quality Guarantees. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, edited by Maier D, Pottinger R, Doan A, Tan W, Alawini A, Ngo H (1857-1873). Online publication date: 11-Jun-2020.
https://dl.acm.org/doi/10.1145/3318464.3389751

Linardi M, Zhu Y, Palpanas T, Keogh E (2020). Matrix profile goes MAD: variable-length motif and discord discovery in data series. Data Mining and Knowledge Discovery 34:4 (1022-1071). Online publication date: 7-May-2020.
https://doi.org/10.1007/s10618-020-00685-w

Linardi M, Palpanas T (2020). Scalable data series subsequence matching with ULISSE. The VLDB Journal. Online publication date: 4-Jul-2020.
https://doi.org/10.1007/s00778-020-00619-4
