Article
DOI: 10.1007/978-3-031-70341-6_1

Adaptive Sparsity Level During Training for Efficient Time Series Forecasting with Transformers

Published: 08 September 2024

Abstract

Efficient time series forecasting has become critical for real-world applications, particularly with deep neural networks (DNNs). Efficiency in DNNs can be achieved through sparse connectivity and reducing the model size. However, finding the sparsity level automatically during training remains challenging due to the heterogeneity in the loss-sparsity tradeoffs across datasets. In this paper, we propose “Pruning with Adaptive Sparsity Level” (PALS) to automatically seek a good balance between loss and sparsity, without requiring a predefined sparsity level. PALS draws inspiration from sparse training and during-training pruning methods. It introduces a novel “expand” mechanism for training sparse neural networks, allowing the model to dynamically shrink, expand, or remain stable in order to find a proper sparsity level. In this paper, we focus on achieving efficiency in transformers, which are known for their excellent time series forecasting performance but high computational cost. Nevertheless, PALS can be applied directly to any DNN; to demonstrate this, we also evaluate its effectiveness on the DLinear model. Experimental results on six benchmark datasets and five state-of-the-art (SOTA) transformer variants show that PALS substantially reduces model size while maintaining performance comparable to the dense model. More interestingly, PALS even outperforms the dense model in [inline-graphic not available: see fulltext] and [inline-graphic not available: see fulltext] of the 30 cases in terms of MSE and MAE loss, respectively, while reducing the parameter count by [inline-graphic not available: see fulltext] and FLOPs by [inline-graphic not available: see fulltext] on average. Our code and supplementary material are available on GitHub (https://github.com/zahraatashgahi/PALS).
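The abstract describes PALS as dynamically shrinking, expanding, or keeping the sparsity level stable during training, guided by the loss-sparsity tradeoff. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: it couples per-layer magnitude pruning with a loss-driven controller. The class name AdaptiveSparsityController, its step size and tolerance, and the loss heuristic are assumptions made purely for exposition.

import torch
import torch.nn as nn


def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask keeping the largest-magnitude (1 - sparsity) fraction of weights."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()


class AdaptiveSparsityController:
    """Illustrative heuristic (an assumption, not the PALS algorithm): decide whether
    to shrink, expand, or keep the sparsity level based on how the loss evolves
    between pruning rounds."""

    def __init__(self, sparsity: float = 0.5, step: float = 0.05, tol: float = 0.01):
        self.sparsity = sparsity   # current fraction of pruned weights
        self.step = step           # how far to move the sparsity level per round
        self.tol = tol             # relative loss change treated as significant
        self.prev_loss = None

    def adjust(self, loss: float) -> float:
        if self.prev_loss is not None:
            rel = (loss - self.prev_loss) / max(abs(self.prev_loss), 1e-12)
            if rel > self.tol:            # loss worsened: expand (lower sparsity)
                self.sparsity = max(0.0, self.sparsity - self.step)
            elif rel < -self.tol:         # loss improved: shrink (higher sparsity)
                self.sparsity = min(0.99, self.sparsity + self.step)
            # otherwise: remain stable at the current sparsity level
        self.prev_loss = loss
        return self.sparsity


# Usage sketch: prune one linear layer to the controller's level after each round,
# using dummy per-round losses for illustration.
layer = nn.Linear(64, 64)
controller = AdaptiveSparsityController()
for round_loss in [1.00, 0.80, 0.78, 0.85]:
    s = controller.adjust(round_loss)
    mask = magnitude_mask(layer.weight.data, s)
    layer.weight.data.mul_(mask)
    print(f"sparsity={s:.2f}, nonzeros={int(mask.sum())}")

In the paper itself, the shrink/expand/stable decision is made while training transformer (and DLinear) forecasting models; the loop above only shows how such a controller could plug into a pruning round.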



Published In

Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2024, Vilnius, Lithuania, September 9–13, 2024, Proceedings, Part I
Sep 2024
513 pages
ISBN:978-3-031-70340-9
DOI:10.1007/978-3-031-70341-6
Editors: Albert Bifet, Jesse Davis, Tomas Krilavičius, Meelis Kull, Eirini Ntoutsi, Indrė Žliobaitė

Publisher

Springer-Verlag

Berlin, Heidelberg
