An Efficient Aggregation Method for the Symbolic Representation of Temporal Data

Authors:

Stefan Güttel

ACM Transactions on Knowledge Discovery from Data, Volume 17, Issue 1

Article No.: 5, Pages 1 - 22

https://doi.org/10.1145/3532622

Published: 20 February 2023

Abstract

Symbolic representations are a useful tool for the dimension reduction of temporal data, allowing for the efficient storage of and information retrieval from time series. They can also enhance the training of machine learning algorithms on time series data through noise reduction and reduced sensitivity to hyperparameters. The adaptive Brownian bridge-based aggregation (ABBA) method is one such effective and robust symbolic representation, demonstrated to accurately capture important trends and shapes in time series. However, in its current form, the method struggles to process very large time series. Here, we present a new variant of the ABBA method, called fABBA. This variant utilizes a new aggregation approach tailored to the piecewise representation of time series. By replacing the k-means clustering used in ABBA with a sorting-based aggregation technique, and thereby avoiding repeated sum-of-squares error computations, the computational complexity is significantly reduced. In contrast to the original method, the new approach does not require the number of time series symbols to be specified in advance. Through extensive tests, we demonstrate that the new method significantly outperforms ABBA with a considerable reduction in runtime while also outperforming the popular SAX and 1d-SAX representations in terms of reconstruction accuracy. We further demonstrate that fABBA can compress other data types such as images.
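
The abstract does not spell out the aggregation procedure, so the following is only a minimal sketch of what a sorting-based greedy aggregation could look like, not the authors' implementation. It assumes each linear piece is given as a small numeric vector (for example a length and an increment), that pieces are sorted once by Euclidean norm, and that a hypothetical tolerance parameter `alpha` decides whether a piece joins an existing group. The function name `aggregate_sorted` and the sample data are illustrative assumptions.

```python
import numpy as np


def aggregate_sorted(points, alpha):
    """Greedy sorting-based aggregation (illustrative sketch).

    points : (n, d) array, e.g. one (length, increment) pair per linear piece.
    alpha  : hypothetical tolerance; a point joins a group when it lies
             within distance alpha of that group's starting point.
    Returns one integer label per point and the indices of the starting points.
    """
    points = np.asarray(points, dtype=float)
    norms = np.linalg.norm(points, axis=1)
    order = np.argsort(norms, kind="stable")  # sort once by Euclidean norm
    labels = np.full(len(points), -1, dtype=int)
    starts = []

    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue  # already absorbed by an earlier group
        group = len(starts)
        starts.append(int(i))
        labels[i] = group
        for j in order[pos + 1:]:
            # Norms are sorted, so once the norm gap exceeds alpha no later
            # point can lie within alpha (reverse triangle inequality).
            if norms[j] - norms[i] > alpha:
                break
            if labels[j] == -1 and np.linalg.norm(points[j] - points[i]) <= alpha:
                labels[j] = group
    return labels, starts


# Made-up pieces: two near (1, 0), two near (5, 2), and one isolated piece.
pieces = np.array([[1.0, 0.1], [1.1, 0.0], [5.0, 2.0], [5.1, 2.1], [1.0, 3.0]])
labels, starts = aggregate_sorted(pieces, alpha=0.5)
symbols = "".join(chr(ord("a") + k) for k in labels)
print(symbols)  # aaccb
```

In this sketch the number of groups, and hence the number of symbols, follows from `alpha` rather than being fixed beforehand, which matches the abstract's remark that the symbol count need not be specified in advance. The early `break` is what avoids comparing every pair of pieces.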

Cited By

X. Chen and S. Güttel. 2024. fABBA: A Python library for the fast symbolic approximation of time series. Journal of Open Source Software 9, 95 (2024), 6294. Online publication date: Mar-2024. https://doi.org/10.21105/joss.06294

Y. Mao, Q. Ye, H. Hu, Q. Wang, and K. Huang. 2024. PrivShape: Extracting Shapes in Time Series Under User-Level Local Differential Privacy. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), 1739–1751. Online publication date: 13-May-2024. https://doi.org/10.1109/ICDE60146.2024.00141

Z. Tao, W. Jiang, and C. Wu. 2023. Power Consumption Data Compression via Shift-Invariant Dictionary Learning. In 2023 IEEE 7th Conference on Energy Internet and Energy System Integration (EI2), 2033–2038. Online publication date: 15-Dec-2023. https://doi.org/10.1109/EI259745.2023.10513063

K. Liang, H. Liu, M. Shan, J. Zhao, X. Li, and L. Zhou. 2023. Enhancing scenic recommendation and tour route personalization in tourism using UGC text mining. Applied Intelligence 54, 1 (2023), 1063–1098. Online publication date: 29-Dec-2023. https://dl.acm.org/doi/10.1007/s10489-023-05244-6

Index Terms

1. Computing methodologies
  1. Machine learning

Published In

ACM Transactions on Knowledge Discovery from Data Volume 17, Issue 1

January 2023

375 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3572846

Editor:
Charu Aggarwal
IBM T. J. Watson Research, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 February 2023

Online AM: 27 April 2022

Accepted: 16 April 2022

Revised: 31 March 2022

Received: 11 September 2021

Published in TKDD Volume 17, Issue 1

Qualifiers

Research-article

Funding Sources

The Alan Turing Institute under the EPSRC
