Abstract
Tucker decomposition is one of the most popular models for analyzing and compressing large-scale tensor data. Existing Tucker decomposition algorithms usually rely on a single solver to compute the factor matrices and the intermediate tensor in a predetermined order, and are not flexible enough to adapt to the diversity of input data and hardware. Moreover, to exploit highly efficient matrix multiplication kernels, most Tucker decomposition implementations rely on explicit matricizations, which can introduce the extra cost of data conversion. In this paper, we present a-Tucker, a new framework for input-adaptive and matricization-free Tucker decomposition of higher-order tensors on GPUs. A two-level flexible Tucker decomposition algorithm is proposed to enable switching among different calculation orders and different factor solvers, and a machine-learning-based adaptive order-solver selector is applied to automatically cope with changes in application scenarios. To further improve performance, we implement a-Tucker in a fully matricization-free manner, without any conversion between tensors and matrices. Experiments show that a-Tucker substantially outperforms existing works while maintaining comparable accuracy on a variety of synthetic and real-world tensors.
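To make the setting concrete, the sketch below shows a plain NumPy implementation of Tucker decomposition via higher-order orthogonal iteration (HOOI), the standard alternating scheme the abstract's "calculation orders" and "factor solvers" refer to. This is an illustrative reference implementation only, not the a-Tucker GPU code: the function names (`unfold`, `tucker_hooi`, `tucker_to_tensor`) are my own, and for simplicity it uses explicit mode-n matricization via `reshape`, which is exactly the data-conversion cost that a matricization-free implementation avoids.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization: bring `mode` to the front, flatten the rest."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def ttm(X, U, mode):
    """Tensor-times-matrix along `mode`: contract X's mode with U's rows."""
    return np.moveaxis(np.tensordot(X, U, axes=(mode, 0)), -1, mode)

def tucker_hooi(X, ranks, n_iter=10):
    """Tucker decomposition via HOOI; returns (core, factor matrices)."""
    N = X.ndim
    # Initialize each factor from a truncated SVD of the unfolding (HOSVD).
    factors = []
    for n in range(N):
        U, _, _ = np.linalg.svd(unfold(X, n), full_matrices=False)
        factors.append(U[:, :ranks[n]])
    # Alternate over modes: contract with all other factors, then update U_n.
    for _ in range(n_iter):
        for n in range(N):
            Y = X
            for m in range(N):
                if m != n:
                    Y = ttm(Y, factors[m], m)
            U, _, _ = np.linalg.svd(unfold(Y, n), full_matrices=False)
            factors[n] = U[:, :ranks[n]]
    # Core tensor: contract X with every factor matrix.
    G = X
    for n in range(N):
        G = ttm(G, factors[n], n)
    return G, factors

def tucker_to_tensor(G, factors):
    """Reconstruct the full tensor from the core and factor matrices."""
    X = G
    for n, U in enumerate(factors):
        X = np.moveaxis(np.tensordot(X, U, axes=(n, 1)), -1, n)
    return X
```

The chain of tensor-times-matrix (TTM) products inside the inner loop is where the choice of calculation order and the conversion between tensor and matrix layouts dominate the runtime, which is what motivates both the adaptive ordering and the matricization-free kernels in the paper.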
References
Ahmad, N., Yilmaz, B., Unat, D.: A prediction framework for fast sparse triangular solves. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020: Parallel Processing. Lecture Notes in Computer Science, vol. 12247 (2020)
Austin, W., Ballard, G., Kolda, T.G.: Parallel tensor compression for large-scale scientific data. In: International Parallel and Distributed Processing Symposium, pp. 912–922 (2016)
Bader, B.W., Kolda, T.G., et al.: MATLAB Tensor Toolbox Version 3.1. Available online (2019). https://www.tensortoolbox.org
Baglama, J., Reichel, L.: Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM J. Sci. Comput. 27(1), 19–42 (2005)
Ballard, G., Klinvex, A., Kolda, T.G.: TuckerMPI: a parallel C++/MPI software package for large-scale data compression via the Tucker tensor decomposition. ACM Trans. Math. Softw. 46(2), 1–13 (2020)
Ballester-Ripoll, R., Pajarola, R.: Lossy volume compression using tucker truncation and thresholding. Vis. Comput. 32(11), 1433–1446 (2016). https://doi.org/10.1007/s00371-015-1130-y
Benatia, A., Ji, W., Wang, Y., Shi, F.: Sparse matrix format selection with multiclass SVM for SpMV on GPU. In: International Conference on Parallel Processing, pp. 496–505 (2016)
Burggraf, R.: Analytical and numerical studies of the structure of steady separated flows. J. Fluid Mech. 24(1), 113–151 (1966)
Chakaravarthy, V.T., Choi, J.W., Joseph, D.J., Liu, X., Murali, P., Sabharwal, Y., Sreedhar, D.: On optimizing distributed Tucker decomposition for dense tensors. In: International Parallel and Distributed Processing Symposium, pp. 1038–1047 (2017)
Chen, Y., Li, K., Yang, W., Xiao, G., Xie, X., Li, T.: Performance-aware model for sparse matrix-matrix multiplication on the Sunway TaihuLight supercomputer. IEEE Trans. Parallel Distrib. Syst. 30(4), 923–938 (2018)
Chen, Y., Xiao, G., Özsu, M.T., Liu, C., Zomaya, A.Y., Li, T.: aeSpTV: an adaptive and efficient framework for sparse tensor-vector product kernel on a high-performance computing platform. IEEE Trans. Parallel Distrib. Syst. 31(10), 2329–2345 (2020)
Choi, J.W., Liu, X., Chakaravarthy, V.T.: High-performance dense Tucker decomposition on GPU clusters. In: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 543–553 (2018)
Cui, H., Hirasawa, S., Takizawa, H., Kobayashi, H.: A code selection mechanism using deep learning. In: International Symposium on Embedded Multicore/Many-core Systems-on-Chip, pp. 385–392 (2016)
De Lathauwer, L., De Moor, B., Vandewalle, J.: On the best rank-1 and rank-\((r_{1}, r_{2}, \cdots, r_{N})\) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl. 21, 1324–1342 (2000a)
De Lathauwer, L., De Moor, B., Vandewalle, J.: A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl. 21(4), 1253–1278 (2000b)
Dongarra, J., Duff, I., Gates, M., Haidar, A., Hammarling, S., Higham, N.J., Hogg, J., Valero-Lara, P., Relton, S.D., Tomov, S., Zounon, M.: A proposed API for batched basic linear algebra subprograms. Technical report, Manchester Institute for Mathematical Sciences, University of Manchester (2016)
Foster, D., Amano, K., Nascimento, S., Foster, M.: Frequency of metamerism in natural scenes. J. Opt. Soc. Am. A 23(10), 2359–2372 (2006). https://doi.org/10.1364/JOSAA.23.002359
Gu, M., Eisenstat, S.C.: A divide-and-conquer algorithm for the bidiagonal SVD. SIAM J. Matrix Anal. Appl. 16(1), 79–92 (1995)
Hitchcock, F.L.: Multiple invariants and generalized rank of a \(p\)-way matrix or tensor. J. Math. Phys. 7(1–4), 39–79 (1928)
Hynninen, A.-P., Lyakh, D.I.: cuTT: A high-performance tensor transpose library for CUDA compatible GPUs. arXiv preprint arXiv:1705.01598 (2017)
Jang, J., Kang, U.: D-Tucker: Fast and memory-efficient Tucker decomposition for dense tensors. In: International Conference on Data Engineering, pp. 1850–1853 (2020)
Karami, A., Yazdi, M., Mercier, G.: Compression of hyperspectral images using discrete wavelet transform and Tucker decomposition. J. Sel. Topics Appl. Earth Obs. Remote Sens. 5(2), 444–450 (2012)
Kim, Y.-D., Park, E., Yoo, S., Choi, T., Yang, L., Shin, D.: Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530 (2015)
Kim, J., Sukumaran-Rajam, A., Thumma, V., Krishnamoorthy, S., Panyala, A., Pouchet, L., Rountev, A., Sadayappan, P.: A code generator for high-performance tensor contractions on GPUs. In: International Symposium on Code Generation and Optimization, pp. 85–95 (2019)
Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
Larsen, R.M.: Lanczos bidiagonalization with partial reorthogonalization. DAIMI Report Series (537) (1998)
LeCun, Y., Cortes, C., Burges, C.J.C.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998). Accessed 25 Nov 2021
Levin, J.: Three-mode factor analysis. PhD thesis, University of Illinois, Urbana-Champaign (1963)
Li, J., Tan, G., Chen, M., Sun, N.: SMAT: An input adaptive auto-tuner for sparse matrix-vector multiplication. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 117–126 (2013)
Li, K., Yang, W., Li, K.: Performance analysis and optimization for SpMV on GPU using probabilistic modeling. IEEE Trans. Parallel Distrib. Syst. 26(1), 196–205 (2014)
Li, J., Battaglino, C., Perros, I., Sun, J., Vuduc, R.: An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2015)
Li, J., Choi, J., Perros, I., Sun, J., Vuduc, R.: Model-driven sparse CP decomposition for higher-order tensors. In: International Parallel and Distributed Processing Symposium, pp. 1048–1057 (2017)
Li, J., Sun, J., Vuduc, R.: HiCOO: Hierarchical storage of sparse tensors. In: International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 238–252 (2018)
Li, J., Ma, Y., Wu, X., Li, A., Barker, K.: PASTA: A parallel sparse tensor algorithm benchmark suite. CCF Transactions on High Performance Computing, 111–130 (2019)
Li, M., Ao, Y., Yang, C.: Adaptive SpMV/SpMSpV on GPUs for input vectors of varied sparsity. IEEE Trans. Parallel Distrib. Syst. 32(7), 1842–1853 (2020)
Ma, W., Krishamoorthy, S., Villa, O., Kowalski, K., Agrawal, G.: Optimizing tensor contraction expressions for hybrid CPU-GPU execution. Clust. Comput. 16, 1–25 (2013)
Ma, Y., Li, J., Wu, X., Yan, C., Sun, J., Vuduc, R.: Optimizing sparse tensor times matrix on GPUs. J. Parallel Distrib. Comput. 129, 99–109 (2019)
Matthews, D.A.: High-performance tensor contraction without transposition. SIAM J. Sci. Comput. 40(1), 1–24 (2018)
Nisa, I., Li, J., Sukumaran Rajam, A., Vuduc, R., Sadayappan, P.: Load-balanced sparse MTTKRP on GPUs. In: International Parallel and Distributed Processing Symposium, pp. 123–133 (2019a)
Nisa, I., Li, J., Sukumaran-Rajam, A., Rawat, P.S., Krishnamoorthy, S., Sadayappan, P.: An efficient mixed-mode representation of sparse tensors. In: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2019b)
Nisa, I., Siegel, C., Rajam, A.S., Vishnu, A., Sadayappan, P.: Effective machine learning based format selection and performance modeling for SpMV on GPUs. In: International Parallel and Distributed Processing Symposium Workshops, pp. 1056–1065 (2018)
NVIDIA: The API reference guide for cuBLAS, the CUDA Basic Linear Algebra Subroutine library (2019a). https://docs.nvidia.com/cuda/cublas/. Accessed 25 Nov 2021
NVIDIA: The API reference guide for cuSOLVER, the CUDA dense and sparse direct solver library (2019b). https://docs.nvidia.com/cuda/cusolver/. Accessed 25 Nov 2021
Oh, J., Shin, K., Papalexakis, E.E., Faloutsos, C., Yu, H.: S-HOT: Scalable high-order Tucker decomposition. In: ACM International Conference on Web Search and Data Mining, pp. 761–770 (2017)
Oh, S., Park, N., Sael, L., Kang, U.: Scalable Tucker factorization for sparse tensors - algorithms and discoveries. In: International Conference on Data Engineering, pp. 1120–1131 (2018)
Oseledets, I.V.: Tensor-train decomposition. SIAM J. Sci. Comput. 33(5), 2295–2317 (2011)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Perros, I., Chen, R., Vuduc, R., Sun, J.: Sparse hierarchical Tucker factorization and its application to healthcare. In: International Conference on Data Mining, pp. 943–948 (2015)
Smith, S., Karypis, G.: Tensor-matrix products with a compressed sparse tensor. In: Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, pp. 1–7 (2015)
Smith, S., Karypis, G.: Accelerating the Tucker decomposition with compressed sparse tensors. In: International Conference on Parallel and Distributed Computing, Euro-Par 2017, pp. 653–668 (2017)
Springer, P., Su, T., Bientinesi, P.: HPTT: A high-performance tensor transposition C++ library. In: ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, pp. 56–62 (2017)
Sun, J., Tao, D., Faloutsos, C.: Beyond streams and graphs: Dynamic tensor analysis. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 374–383 (2006)
Szlam, A., Tulloch, A., Tygert, M.: Accurate low-rank approximations via a few iterations of alternating least squares. SIAM J. Matrix Anal. Appl. 38(2), 425–433 (2017)
Tucker, L.R.: Some mathematical notes on three-mode factor analysis. Psychometrika 31, 279–311 (1966)
Vannieuwenhoven, N., Vandebril, R., Meerbergen, K.: On the truncated multilinear singular value decomposition. Technical Report TW589, Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium (2011)
Vannieuwenhoven, N., Vandebril, R., Meerbergen, K.: A new truncation strategy for the higher-order singular value decomposition. SIAM J. Sci. Comput. 34(2), 1027–1052 (2012)
Vedurada, J., Suresh, A., Rajam, A.S., Kim, J., Hong, C., Panyala, A., Krishnamoorthy, S., Nandivada, V.K., Srivastava, R.K., Sadayappan, P.: TTLG-an efficient tensor transposition library for GPUs. In: International Parallel and Distributed Processing Symposium, pp. 578–588 (2018)
Vervliet, N., Debals, O., Sorber, L., Barel, M.V., De Lathauwer, L.: MATLAB Tensorlab 3.0. Available online (2016). http://www.tensorlab.net. Accessed 13 Nov 2021
Wang, Y., Jodoin, P.-M., Porikli, F., Konrad, J., Benezeth, Y., Ishwar, P.: CDnet 2014: An expanded change detection benchmark dataset. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 393–400 (2014a). https://doi.org/10.1109/CVPRW.2014.126
Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu, Q., Wang, Y.: Intel Math Kernel Library. Springer, New York (2014b)
Xiao, C., Yang, C., Li, M.: Efficient alternating least squares algorithms for low multilinear rank approximation of tensors. J. Sci. Comput. 87(3), 1–25 (2021)
Xie, Z., Tan, G., Liu, W., Sun, N.: IA-SpGEMM: An input-aware auto-tuning framework for parallel sparse matrix-matrix multiplication. In: International Conference on Supercomputing, pp. 94–105 (2019)
Zhao, Y., Zhou, W., Shen, X., Yiu, G.: Overhead-conscious format selection for SpMV-based applications. In: International Parallel and Distributed Processing Symposium, pp. 950–959 (2018a)
Zhao, Y., Li, J., Liao, C., Shen, X.: Bridging the gap between deep learning and sparse matrix format selection. ACM SIGPLAN Notices 53(1), 94–108 (2018b)
Zhou, Z.-H.: Machine Learning. Tsinghua University Press, Beijing (2016)
Acknowledgements
This work was partially supported by Huawei Technologies.
Cite this article
Duan, L., Xiao, C., Li, M. et al. a-Tucker: fast input-adaptive and matricization-free Tucker decomposition of higher-order tensors on GPUs. CCF Trans. HPC 5, 12–25 (2023). https://doi.org/10.1007/s42514-022-00119-7