Dumpy: A Compact and Adaptive Index for Large Data Series Collections

Published: 30 May 2023

Abstract

Data series indexes are necessary for managing and analyzing the growing number of data series collections now available. These indexes support both exact and approximate similarity search; approximate search returns high-quality results within milliseconds, which makes it very attractive for certain modern applications. Reducing the pre-processing (i.e., index building) time and improving the accuracy of search results are two major challenges. DSTree and the iSAX index family are state-of-the-art solutions for this problem. However, DSTree suffers from long index building times, while iSAX suffers from low search accuracy. In this paper, we identify two problems of the iSAX index family that adversely affect its overall performance. First, we observe a proximity-compactness trade-off tied to the index structure design (i.e., the node fanout degree), which significantly limits the efficiency and accuracy of the resulting index. Second, a skewed data distribution degrades the performance of iSAX. To overcome these problems, we propose Dumpy, an index that employs a novel multi-ary data structure with an adaptive node splitting algorithm and an efficient building workflow. Furthermore, we devise Dumpy-Fuzzy, a variant of Dumpy that further improves search accuracy through proper duplication of series. Experiments with a variety of large, real datasets demonstrate that the Dumpy solutions achieve considerably better efficiency, scalability, and search accuracy than their competitors.
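
As a rough illustration of the summarization that the iSAX family builds on, and of why the node fanout degree matters, the following is a minimal sketch and not the authors' implementation. It assumes z-normalized input, equal-length segments, and breakpoints taken from the standard normal distribution; the function names (paa, sax_word, fanout) and all parameter values are hypothetical choices made for this example.

```python
import bisect
from statistics import NormalDist, fmean


def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of the segment count")
    step = len(series) // segments
    return [fmean(series[i:i + step]) for i in range(0, len(series), step)]


def sax_word(series, segments, bits):
    """Quantize each PAA mean into one of 2**bits symbols.

    Breakpoints split the standard normal distribution into equiprobable
    regions, which assumes the series has been z-normalized.
    """
    cardinality = 2 ** bits
    normal = NormalDist()
    breakpoints = [normal.inv_cdf(k / cardinality) for k in range(1, cardinality)]
    return [bisect.bisect_right(breakpoints, mean) for mean in paa(series, segments)]


def fanout(refined_segments):
    """Number of children when a node split adds one bit to each refined segment."""
    return 2 ** refined_segments


# Two segments, one bit each: the low half maps to symbol 0, the high half to 1.
word = sax_word([-1.0, -1.0, 1.0, 1.0], segments=2, bits=1)  # -> [0, 1]

# Illustrative fanout arithmetic for a 16-segment summary (assumed value):
binary_children = fanout(1)     # refine a single segment: 2 children
full_children = fanout(16)      # refine every segment: 65536 children
partial_children = fanout(4)    # refine a chosen subset of 4 segments: 16 children
```

Under these assumptions, refining one segment per split gives a deep tree with few children per node, while refining every segment gives a very wide node whose children may be nearly empty. Refining a chosen subset of segments sits between the two extremes, which is one way to picture what a multi-ary structure with adaptive splitting has to balance.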

Supplemental Material

MP4 File: Presentation video for the paper in SIGMOD '23 titled "Dumpy: A Compact and Adaptive Index for Large Data Series Collections"
PDF File: Read me
ZIP File: Source Code

Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 1
PACMMOD
May 2023
2807 pages
EISSN: 2836-6573
DOI: 10.1145/3603164

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data series indexing
  2. similarity search

Qualifiers

  • Research-article

Funding Sources

  • Ministry of Science and Technology of China
