research-article

An Efficient Aggregation Method for the Symbolic Representation of Temporal Data

Published: 20 February 2023

Abstract

Symbolic representations are a useful tool for the dimension reduction of temporal data, allowing for the efficient storage of and information retrieval from time series. They can also enhance the training of machine learning algorithms on time series data through noise reduction and reduced sensitivity to hyperparameters. The adaptive Brownian bridge-based aggregation (ABBA) method is one such effective and robust symbolic representation, demonstrated to accurately capture important trends and shapes in time series. However, in its current form, the method struggles to process very large time series. Here, we present a new variant of the ABBA method, called fABBA. This variant utilizes a new aggregation approach tailored to the piecewise representation of time series. By replacing the k-means clustering used in ABBA with a sorting-based aggregation technique, and thereby avoiding repeated sum-of-squares error computations, the computational complexity is significantly reduced. In contrast to the original method, the new approach does not require the number of time series symbols to be specified in advance. Through extensive tests, we demonstrate that the new method significantly outperforms ABBA with a considerable reduction in runtime while also outperforming the popular SAX and 1d-SAX representations in terms of reconstruction accuracy. We further demonstrate that fABBA can compress other data types such as images.
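To make the idea of a sorting-based aggregation concrete, the following is a minimal sketch and not the authors' algorithm: the abstract does not spell out the procedure, so the function name `greedy_sorted_aggregation`, the tolerance parameter `alpha`, and the choice of sorting by Euclidean norm are all assumptions made for illustration. Points (for example, the length and increment of each piece in a piecewise representation) are sorted once, and each still unassigned point then starts a new group that absorbs every unassigned point within distance `alpha`. Because the points are sorted by norm, the inner scan can stop early, and the number of groups follows from the data and the tolerance rather than being fixed in advance.

```python
import numpy as np


def greedy_sorted_aggregation(points, alpha):
    """Greedily group points after sorting them by Euclidean norm.

    `alpha` is a hypothetical distance tolerance introduced for this
    sketch; it is not a parameter taken from the abstract.
    Returns (labels, n_groups), where labels[i] is the group of points[i].
    """
    points = np.asarray(points, dtype=float)
    norms = np.linalg.norm(points, axis=1)
    order = np.argsort(norms, kind="stable")  # one sort, no k-means iterations
    labels = np.full(len(points), -1, dtype=int)  # -1 means "unassigned"
    n_groups = 0

    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue  # already absorbed by an earlier group
        labels[i] = n_groups  # point i starts a new group
        for j in order[pos + 1:]:
            # Reverse triangle inequality: the gap between two norms never
            # exceeds the distance between the points, so once the gap is
            # larger than alpha no later point can be within alpha of i.
            if norms[j] - norms[i] > alpha:
                break
            if labels[j] == -1 and np.linalg.norm(points[j] - points[i]) <= alpha:
                labels[j] = n_groups
        n_groups += 1

    return labels, n_groups
```

Calling `greedy_sorted_aggregation([[0, 0], [0.1, 0], [5, 5], [5.1, 5], [0, 0.1]], alpha=0.5)` yields two groups, one near the origin and one near (5, 5); a smaller `alpha` produces more groups (more symbols), a larger one fewer.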

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 17, Issue 1
January 2023
375 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3572846

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 February 2023
Online AM: 27 April 2022
Accepted: 16 April 2022
Revised: 31 March 2022
Received: 11 September 2021
Published in TKDD Volume 17, Issue 1

Author Tags

  1. Knowledge representation
  2. symbolic aggregation
  3. time series mining
  4. data compression

Qualifiers

  • Research-article

Funding Sources

  • The Alan Turing Institute under the EPSRC

Article Metrics

  • Downloads (Last 12 months): 174
  • Downloads (Last 6 weeks): 14
Reflects downloads up to 09 Nov 2024

Cited By

  • (2024) fABBA: A Python library for the fast symbolic approximation of time series. Journal of Open Source Software 9, 95, 6294. DOI: 10.21105/joss.06294. Online publication date: Mar-2024
  • (2024) PrivShape: Extracting Shapes in Time Series Under User-Level Local Differential Privacy. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 1739–1751. DOI: 10.1109/ICDE60146.2024.00141. Online publication date: 13-May-2024
  • (2023) Power Consumption Data Compression via Shift-Invariant Dictionary Learning. 2023 IEEE 7th Conference on Energy Internet and Energy System Integration (EI2), 2033–2038. DOI: 10.1109/EI259745.2023.10513063. Online publication date: 15-Dec-2023
  • (2023) Enhancing scenic recommendation and tour route personalization in tourism using UGC text mining. Applied Intelligence 54, 1, 1063–1098. DOI: 10.1007/s10489-023-05244-6. Online publication date: 29-Dec-2023
