Abstract
Tucker decomposition is one of the most popular models for analyzing and compressing large-scale tensor data. Existing Tucker decomposition algorithms usually rely on a single solver to compute the factor matrices and the intermediate tensor in a predetermined order, and are not flexible enough to adapt to the diversity of input data and hardware. Moreover, to exploit highly efficient matrix multiplication kernels, most Tucker decomposition implementations rely on explicit matricizations, which can introduce the extra cost of data conversion. In this paper, we present a-Tucker, a new framework for input-adaptive and matricization-free Tucker decomposition of higher-order tensors on GPUs. A two-level flexible Tucker decomposition algorithm is proposed to enable switching among different calculation orders and different factor solvers, and a machine-learning-based adaptive order-solver selector is applied to automatically cope with changes in application scenarios. To further improve performance, we implement a-Tucker in a fully matricization-free manner, without any conversion between tensors and matrices. Experiments show that a-Tucker substantially outperforms existing works while maintaining comparable accuracy on a variety of synthetic and real-world tensors.
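To make the setting concrete, the sketch below shows a plain NumPy implementation of Tucker decomposition via higher-order orthogonal iteration (HOOI), the standard alternating scheme the abstract's "calculation orders" and "factor solvers" refer to. This is an illustrative reference implementation only, not the a-Tucker GPU code: the function names (`unfold`, `tucker_hooi`, `tucker_to_tensor`) are my own, and for simplicity it uses explicit mode-n matricization via `reshape`, which is exactly the data-conversion cost that a matricization-free implementation avoids.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization: bring `mode` to the front, flatten the rest."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def ttm(X, U, mode):
    """Tensor-times-matrix along `mode`: contract X's mode with U's rows."""
    return np.moveaxis(np.tensordot(X, U, axes=(mode, 0)), -1, mode)

def tucker_hooi(X, ranks, n_iter=10):
    """Tucker decomposition via HOOI; returns (core, factor matrices)."""
    N = X.ndim
    # Initialize each factor from a truncated SVD of the unfolding (HOSVD).
    factors = []
    for n in range(N):
        U, _, _ = np.linalg.svd(unfold(X, n), full_matrices=False)
        factors.append(U[:, :ranks[n]])
    # Alternate over modes: contract with all other factors, then update U_n.
    for _ in range(n_iter):
        for n in range(N):
            Y = X
            for m in range(N):
                if m != n:
                    Y = ttm(Y, factors[m], m)
            U, _, _ = np.linalg.svd(unfold(Y, n), full_matrices=False)
            factors[n] = U[:, :ranks[n]]
    # Core tensor: contract X with every factor matrix.
    G = X
    for n in range(N):
        G = ttm(G, factors[n], n)
    return G, factors

def tucker_to_tensor(G, factors):
    """Reconstruct the full tensor from the core and factor matrices."""
    X = G
    for n, U in enumerate(factors):
        X = np.moveaxis(np.tensordot(X, U, axes=(n, 1)), -1, n)
    return X
```

The chain of tensor-times-matrix (TTM) products inside the inner loop is where the choice of calculation order and the conversion between tensor and matrix layouts dominate the runtime, which is what motivates both the adaptive ordering and the matricization-free kernels in the paper.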
References
Ahmad, N., Yilmaz, B., Unat, D.: A prediction framework for fast sparse triangular solves. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020: Parallel Processing. Lecture Notes in Computer Science, vol. 12247 (2020)
Austin, W., Ballard, G., Kolda, T.G.: Parallel tensor compression for large-scale scientific data. In: International Parallel and Distributed Processing Symposium, pp. 912–922 (2016)
Bader, B.W., Kolda, T.G., et al.: MATLAB Tensor Toolbox Version 3.1. Available online (2019). https://www.tensortoolbox.org
Baglama, J., Reichel, L.: Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM J. Sci. Comput. 27(1), 19–42 (2005)
Ballard, G., Klinvex, A., Kolda, T.G.: TuckerMPI: a parallel C++/MPI software package for large-scale data compression via the Tucker tensor decomposition. ACM Trans. Math. Softw. 46(2), 1–13 (2020)
Ballester-Ripoll, R., Pajarola, R.: Lossy volume compression using tucker truncation and thresholding. Vis. Comput. 32(11), 1433–1446 (2016). https://doi.org/10.1007/s00371-015-1130-y
Benatia, A., Ji, W., Wang, Y., Shi, F.: Sparse matrix format selection with multiclass SVM for SpMV on GPU. In: International Conference on Parallel Processing, pp. 496–505 (2016)
Burggraf, R.: Analytical and numerical studies of the structure of steady separated flows. J. Fluid Mech. 24(1), 113–151 (1966)
Chakaravarthy, V.T., Choi, J.W., Joseph, D.J., Liu, X., Murali, P., Sabharwal, Y., Sreedhar, D.: On optimizing distributed Tucker decomposition for dense tensors. In: International Parallel and Distributed Processing Symposium, pp. 1038–1047 (2017)
Chen, Y., Li, K., Yang, W., Xiao, G., Xie, X., Li, T.: Performance-aware model for sparse matrix-matrix multiplication on the Sunway TaihuLight supercomputer. IEEE Trans. Parallel Distrib. Syst. 30(4), 923–938 (2018)
Chen, Y., Xiao, G., Özsu, M.T., Liu, C., Zomaya, A.Y., Li, T.: aeSpTV: an adaptive and efficient framework for sparse tensor-vector product kernel on a high-performance computing platform. IEEE Trans. Parallel Distrib. Syst. 31(10), 2329–2345 (2020)
Choi, J.W., Liu, X., Chakaravarthy, V.T.: High-performance dense Tucker decomposition on GPU clusters. In: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 543–553 (2018)
Cui, H., Hirasawa, S., Takizawa, H., Kobayashi, H.: A code selection mechanism using deep learning. In: International Symposium on Embedded Multicore/Many-core Systems-on-Chip, pp. 385–392 (2016)
De Lathauwer, L., De Moor, B., Vandewalle, J.: On the best rank-1 and rank-\((r_{1}, r_{2}, \cdots, r_{N})\) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl. 21, 1324–1342 (2000a)
De Lathauwer, L., De Moor, B., Vandewalle, J.: A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl. 21(4), 1253–1278 (2000b)
Dongarra, J., Duff, I., Gates, M., Haidar, A., Hammarling, S., Higham, N.J., Hogg, J., Valero-Lara, P., Relton, S.D., Tomov, S., Zounon, M.: A proposed API for batched basic linear algebra subprograms. Technical report, Manchester Institute for Mathematical Sciences, University of Manchester (2016)
Foster, D., Amano, K., Nascimento, S., Foster, M.: Frequency of metamerism in natural scenes. J. Opt. Soc. Am. A 23(10), 2359–2372 (2006). https://doi.org/10.1364/JOSAA.23.002359
Gu, M., Eisenstat, S.C.: A divide-and-conquer algorithm for the bidiagonal SVD. SIAM J. Matrix Anal. Appl. 16(1), 79–92 (1995)
Hitchcock, F.L.: Multiple invariants and generalized rank of a \(p\)-way matrix or tensor. J. Math. Phys. 7(1–4), 39–79 (1928)
Hynninen, A.-P., Lyakh, D.I.: cuTT: A high-performance tensor transpose library for CUDA compatible GPUs. arXiv preprint arXiv:1705.01598 (2017)
Jang, J., Kang, U.: D-Tucker: Fast and memory-efficient Tucker decomposition for dense tensors. In: International Conference on Data Engineering, pp. 1850–1853 (2020)
Karami, A., Yazdi, M., Mercier, G.: Compression of hyperspectral images using discrete wavelet transform and Tucker decomposition. J. Sel. Topics Appl. Earth Obs. Remote Sens. 5(2), 444–450 (2012)
Kim, Y.-D., Park, E., Yoo, S., Choi, T., Yang, L., Shin, D.: Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530 (2015)
Kim, J., Sukumaran-Rajam, A., Thumma, V., Krishnamoorthy, S., Panyala, A., Pouchet, L., Rountev, A., Sadayappan, P.: A code generator for high-performance tensor contractions on GPUs. In: International Symposium on Code Generation and Optimization, pp. 85–95 (2019)
Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
Larsen, R.M.: Lanczos bidiagonalization with partial reorthogonalization. DAIMI Report Series (537) (1998)
LeCun, Y., Cortes, C., Burges, C.J.C.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998). Accessed 25 Nov 2021
Levin, J.: Three-mode factor analysis. PhD thesis, University of Illinois, Urbana-Champaign (1963)
Li, J., Tan, G., Chen, M., Sun, N.: SMAT: An input adaptive auto-tuner for sparse matrix-vector multiplication. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 117–126 (2013)
Li, K., Yang, W., Li, K.: Performance analysis and optimization for SpMV on GPU using probabilistic modeling. IEEE Trans. Parallel Distrib. Syst. 26(1), 196–205 (2014)
Li, J., Battaglino, C., Perros, I., Sun, J., Vuduc, R.: An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2015)
Li, J., Choi, J., Perros, I., Sun, J., Vuduc, R.: Model-driven sparse CP decomposition for higher-order tensors. In: International Parallel and Distributed Processing Symposium, pp. 1048–1057 (2017)
Li, J., Sun, J., Vuduc, R.: HiCOO: Hierarchical storage of sparse tensors. In: International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 238–252 (2018)
Li, J., Ma, Y., Wu, X., Li, A., Barker, K.: PASTA: A parallel sparse tensor algorithm benchmark suite. CCF Transactions on High Performance Computing, 111–130 (2019)
Li, M., Ao, Y., Yang, C.: Adaptive SpMV/SpMSpV on GPUs for input vectors of varied sparsity. IEEE Trans. Parallel Distrib. Syst. 32(7), 1842–1853 (2020)
Ma, W., Krishamoorthy, S., Villa, O., Kowalski, K., Agrawal, G.: Optimizing tensor contraction expressions for hybrid CPU-GPU execution. Clust. Comput. 16, 1–25 (2013)
Ma, Y., Li, J., Wu, X., Yan, C., Sun, J., Vuduc, R.: Optimizing sparse tensor times matrix on GPUs. J. Parallel Distrib. Comput. 129, 99–109 (2019)
Matthews, D.A.: High-performance tensor contraction without transposition. SIAM J. Sci. Comput. 40(1), 1–24 (2018)
Nisa, I., Li, J., Sukumaran Rajam, A., Vuduc, R., Sadayappan, P.: Load-balanced sparse MTTKRP on GPUs. In: International Parallel and Distributed Processing Symposium, pp. 123–133 (2019a)
Nisa, I., Li, J., Sukumaran-Rajam, A., Rawat, P.S., Krishnamoorthy, S., Sadayappan, P.: An efficient mixed-mode representation of sparse tensors. In: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2019b)
Nisa, I., Siegel, C., Rajam, A.S., Vishnu, A., Sadayappan, P.: Effective machine learning based format selection and performance modeling for SpMV on GPUs. In: International Parallel and Distributed Processing Symposium Workshops, pp. 1056–1065 (2018)
NVIDIA: The API reference guide for cuBLAS, the CUDA Basic Linear Algebra Subroutine library (2019a). https://docs.nvidia.com/cuda/cublas/. Accessed 25 Nov 2021
NVIDIA: The API reference guide for cuSOLVER, the CUDA dense and sparse direct solver library (2019b). https://docs.nvidia.com/cuda/cusolver/. Accessed 25 Nov 2021
Oh, J., Shin, K., Papalexakis, E.E., Faloutsos, C., Yu, H.: S-HOT: Scalable high-order Tucker decomposition. In: ACM International Conference on Web Search and Data Mining, pp. 761–770 (2017)
Oh, S., Park, N., Sael, L., Kang, U.: Scalable Tucker factorization for sparse tensors - algorithms and discoveries. In: International Conference on Data Engineering, pp. 1120–1131 (2018)
Oseledets, I.V.: Tensor-train decomposition. SIAM J. Sci. Comput. 33(5), 2295–2317 (2011)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Perros, I., Chen, R., Vuduc, R., Sun, J.: Sparse hierarchical Tucker factorization and its application to healthcare. In: International Conference on Data Mining, pp. 943–948 (2015)
Smith, S., Karypis, G.: Tensor-matrix products with a compressed sparse tensor. In: Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, pp. 1–7 (2015)
Smith, S., Karypis, G.: Accelerating the Tucker decomposition with compressed sparse tensors. In: International Conference on Parallel and Distributed Computing, Euro-Par 2017, pp. 653–668 (2017)
Springer, P., Su, T., Bientinesi, P.: HPTT: A high-performance tensor transposition C++ library. In: ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, pp. 56–62 (2017)
Sun, J., Tao, D., Faloutsos, C.: Beyond streams and graphs: Dynamic tensor analysis. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 374–383 (2006)
Szlam, A., Tulloch, A., Tygert, M.: Accurate low-rank approximations via a few iterations of alternating least squares. SIAM J. Matrix Anal. Appl. 38(2), 425–433 (2017)
Tucker, L.R.: Some mathematical notes on three-mode factor analysis. Psychometrika 31, 279–311 (1966)
Vannieuwenhoven, N., Vandebril, R., Meerbergen, K.: On the truncated multilinear singular value decomposition. Technical Report TW589, Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium (2011)
Vannieuwenhoven, N., Vandebril, R., Meerbergen, K.: A new truncation strategy for the higher-order singular value decomposition. SIAM J. Sci. Comput. 34(2), 1027–1052 (2012)
Vedurada, J., Suresh, A., Rajam, A.S., Kim, J., Hong, C., Panyala, A., Krishnamoorthy, S., Nandivada, V.K., Srivastava, R.K., Sadayappan, P.: TTLG-an efficient tensor transposition library for GPUs. In: International Parallel and Distributed Processing Symposium, pp. 578–588 (2018)
Vervliet, N., Debals, O., Sorber, L., Barel, M.V., De Lathauwer, L.: MATLAB Tensorlab 3.0. Available online (2016). http://www.tensorlab.net. Accessed 13 Nov 2021
Wang, Y., Jodoin, P.-M., Porikli, F., Konrad, J., Benezeth, Y., Ishwar, P.: CDnet 2014: An expanded change detection benchmark dataset. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 393–400 (2014a). https://doi.org/10.1109/CVPRW.2014.126
Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu, Q., Wang, Y.: Intel Math Kernel Library. Springer, New York (2014b)
Xiao, C., Yang, C., Li, M.: Efficient alternating least squares algorithms for low multilinear rank approximation of tensors. J. Sci. Comput. 87(3), 1–25 (2021)
Xie, Z., Tan, G., Liu, W., Sun, N.: IA-SpGEMM: An input-aware auto-tuning framework for parallel sparse matrix-matrix multiplication. In: International Conference on Supercomputing, pp. 94–105 (2019)
Zhao, Y., Zhou, W., Shen, X., Yiu, G.: Overhead-conscious format selection for SpMV-based applications. In: International Parallel and Distributed Processing Symposium, pp. 950–959 (2018a)
Zhao, Y., Li, J., Liao, C., Shen, X.: Bridging the gap between deep learning and sparse matrix format selection. ACM SIGPLAN Notices 53(1), 94–108 (2018b)
Zhou, Z.-H.: Machine Learning. Tsinghua University Press, Beijing (2016)
Acknowledgements
This work was partially supported by Huawei Technologies.
Cite this article
Duan, L., Xiao, C., Li, M. et al. a-Tucker: fast input-adaptive and matricization-free Tucker decomposition of higher-order tensors on GPUs. CCF Trans. HPC 5, 12–25 (2023). https://doi.org/10.1007/s42514-022-00119-7