a-Tucker: fast input-adaptive and matricization-free Tucker decomposition of higher-order tensors on GPUs

  • Regular Paper
CCF Transactions on High Performance Computing

Abstract

Tucker decomposition is one of the most popular models for analyzing and compressing large-scale tensorial data. Existing Tucker decomposition algorithms usually rely on a single solver to compute the factor matrices and the intermediate tensor in a predetermined order, and are not flexible enough to adapt to the diversity of the input data and the hardware. Moreover, to exploit highly efficient matrix multiplication kernels, most Tucker decomposition implementations rely on explicit matricizations, which can introduce extra data conversion costs. In this paper, we present a-Tucker, a new framework for input-adaptive and matricization-free Tucker decomposition of higher-order tensors on GPUs. A two-level flexible Tucker decomposition algorithm is proposed to enable switching between different calculation orders and different factor solvers, and a machine-learning adaptive order-solver selector is applied to cope automatically with changes in application scenarios. To further improve performance, we implement a-Tucker in a fully matricization-free manner, without any conversion between tensors and matrices. Experiments show that a-Tucker can substantially outperform existing works while maintaining similar accuracy on a variety of synthetic and real-world tensors.
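
To make the terms in the abstract concrete, the following is a minimal NumPy sketch of a sequentially truncated Tucker decomposition with a selectable mode order. It is not the a-Tucker implementation: the function names (`mode_gram`, `ttm_transpose`, `st_hosvd`, `reconstruct`) are hypothetical, the only factor solver shown is an eigendecomposition of the mode Gram matrix, and the code runs on the CPU. The tensor is never unfolded explicitly in user code; each step is written as a tensor contraction. Note that `np.tensordot` may reshape data internally, so this illustrates the formulation only, not a truly matricization-free kernel.

```python
import numpy as np

def mode_gram(x, mode):
    """Gram matrix of the mode-`mode` unfolding, obtained by contracting all other modes."""
    other = [ax for ax in range(x.ndim) if ax != mode]
    return np.tensordot(x, x, axes=(other, other))

def ttm_transpose(x, u, mode):
    """Multiply tensor x by u.T along `mode` (u has shape (x.shape[mode], r))."""
    y = np.tensordot(x, u, axes=([mode], [0]))  # contracted mode reappears as the last axis
    return np.moveaxis(y, -1, mode)             # move it back to its original position

def st_hosvd(x, ranks, order=None):
    """Sequentially truncated HOSVD; `order` is the sequence in which modes are processed."""
    order = list(range(x.ndim)) if order is None else list(order)
    core, factors = x, [None] * x.ndim
    for mode in order:
        # Assumed solver: eigenvectors of the Gram matrix (eigh returns ascending eigenvalues).
        _, vecs = np.linalg.eigh(mode_gram(core, mode))
        u = vecs[:, ::-1][:, :ranks[mode]]      # keep the leading ranks[mode] eigenvectors
        factors[mode] = u
        core = ttm_transpose(core, u, mode)     # shrink the tensor before the next mode
    return core, factors

def reconstruct(core, factors):
    """Expand a core tensor back to full size by multiplying each mode by its factor."""
    x = core
    for mode, u in enumerate(factors):
        x = ttm_transpose(x, u.T, mode)
    return x

# Usage: a synthetic tensor of exact multilinear rank (2, 3, 2), processed in mode order 2, 0, 1.
rng = np.random.default_rng(0)
true_factors = [rng.standard_normal((n, r)) for n, r in zip((6, 7, 5), (2, 3, 2))]
x = reconstruct(rng.standard_normal((2, 3, 2)), true_factors)
core, factors = st_hosvd(x, ranks=(2, 3, 2), order=(2, 0, 1))
rel_err = np.linalg.norm(reconstruct(core, factors) - x) / np.linalg.norm(x)
print(core.shape, rel_err)
```

In this sketch the `order` argument and the choice of solver inside the loop are the two knobs that an adaptive selector, as described in the abstract, would set per input; here they are fixed by hand.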

Acknowledgements

This work was partially supported by Huawei Technologies.

Author information

Corresponding author

Correspondence to Chao Yang.

About this article

Cite this article

Duan, L., Xiao, C., Li, M. et al. a-Tucker: fast input-adaptive and matricization-free Tucker decomposition of higher-order tensors on GPUs. CCF Trans. HPC 5, 12–25 (2023). https://doi.org/10.1007/s42514-022-00119-7

  • DOI: https://doi.org/10.1007/s42514-022-00119-7
