DOI: 10.1145/2783258.2783382
research-article

Query Workloads for Data Series Indexes

Published: 10 August 2015

Abstract

Data series are a prevalent data type that has attracted considerable interest in recent years. Most of the research has focused on how to efficiently support similarity or nearest neighbor queries over large data series collections (an important data mining task), and several data series summarization and indexing methods have been proposed to solve this problem. Nevertheless, very little attention has so far been paid to properly evaluating such index structures: most previous work relies solely on randomly selected data series as queries (with or without added noise). In this work, we show that random workloads are inherently unsuitable for the task at hand, and we argue that query workloads need to be generated carefully. We define measures that capture the characteristics of queries, and we propose a method for generating workloads with the desired properties, that is, workloads that can effectively evaluate and compare data series summarizations and indexes. In our experimental evaluation, with carefully controlled query workloads, we shed light on key factors affecting the performance of nearest neighbor search in large data series collections.
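
For reference, the baseline that the abstract criticizes can be sketched as follows: a query workload is built by sampling series from the collection and optionally adding Gaussian noise, and each query is answered by an exact brute-force 1-NN scan under Euclidean distance. This is only a minimal illustration. The function names (`random_walk`, `random_workload`, `nearest_neighbor`), the synthetic random-walk data, and the noise model are assumptions made for this sketch, not the paper's method or code.

```python
import math
import random


def random_walk(length, rng):
    """Generate one synthetic data series as a Gaussian random walk (assumed data)."""
    value, series = 0.0, []
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)
        series.append(value)
    return series


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def random_workload(collection, num_queries, noise_std, rng):
    """Sample series from the collection and perturb each point with Gaussian noise."""
    picks = rng.sample(range(len(collection)), num_queries)
    return [[x + rng.gauss(0.0, noise_std) for x in collection[i]] for i in picks]


def nearest_neighbor(collection, query):
    """Exact 1-NN by scanning the whole collection; returns (index, distance)."""
    return min(
        ((i, euclidean(s, query)) for i, s in enumerate(collection)),
        key=lambda pair: pair[1],
    )


# Hypothetical setup: 200 series of length 64, 5 noise-free random queries.
rng = random.Random(0)
collection = [random_walk(64, rng) for _ in range(200)]
queries = random_workload(collection, num_queries=5, noise_std=0.0, rng=rng)
answers = [nearest_neighbor(collection, q) for q in queries]
```

With `noise_std=0.0`, every query is an exact copy of a series in the collection, so its nearest neighbor lies at distance zero. Under these assumptions, the sketch hints at why a randomly drawn workload may cover only a narrow range of query difficulty.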

Supplementary Material

MP4 File (p1603.mp4)

Information & Contributors

Information

Published In

KDD '15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2015
2378 pages
ISBN: 9781450336642
DOI: 10.1145/2783258
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data series
  2. indexing
  3. similarity search
  4. workloads

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

KDD '15

Acceptance Rates

KDD '15 Paper Acceptance Rate 160 of 819 submissions, 20%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 13
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 27 Jan 2025

Cited By

  • (2024) DumpyOS: A data-adaptive multi-ary index for scalable data series similarity search. The VLDB Journal 33:6 (1887-1911). DOI: 10.1007/s00778-024-00874-9. Online publication date: 21-Aug-2024
  • (2023) Odyssey: A Journey in the Land of Distributed Data Series Similarity Search. Proceedings of the VLDB Endowment 16:5 (1140-1153). DOI: 10.14778/3579075.3579087. Online publication date: 1-Jan-2023
  • (2023) Dumpy: A Compact and Adaptive Index for Large Data Series Collections. Proceedings of the ACM on Management of Data 1:1 (1-27). DOI: 10.1145/3588965. Online publication date: 30-May-2023
  • (2023) FreSh: A Lock-Free Data Series Index. 2023 42nd International Symposium on Reliable Distributed Systems (SRDS) (209-220). DOI: 10.1109/SRDS60354.2023.00029. Online publication date: 25-Sep-2023
  • (2022) Hercules against data series similarity search. Proceedings of the VLDB Endowment 15:10 (2005-2018). DOI: 10.14778/3547305.3547308. Online publication date: 7-Sep-2022
  • (2022) Scalable Analytics on Large Sequence Collections. 2022 23rd IEEE International Conference on Mobile Data Management (MDM) (5-8). DOI: 10.1109/MDM55031.2022.00022. Online publication date: Jun-2022
  • (2022) ProS: data series progressive k-NN similarity search and classification with probabilistic quality guarantees. The VLDB Journal 32:4 (763-789). DOI: 10.1007/s00778-022-00771-z. Online publication date: 30-Nov-2022
  • (2020) Data Series Progressive Similarity Search with Probabilistic Quality Guarantees. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (1857-1873). DOI: 10.1145/3318464.3389751. Online publication date: 11-Jun-2020
  • (2020) Matrix profile goes MAD: variable-length motif and discord discovery in data series. Data Mining and Knowledge Discovery 34:4 (1022-1071). DOI: 10.1007/s10618-020-00685-w. Online publication date: 7-May-2020
  • (2020) Scalable data series subsequence matching with ULISSE. The VLDB Journal. DOI: 10.1007/s00778-020-00619-4. Online publication date: 4-Jul-2020
