Research Article | Open Access

High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach

Published: 04 September 2018

Abstract

The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches the performance of optimized BLAS libraries without the need for an external implementation or automatic tuning. Our approach provides competitive performance across hardware architectures and can be generalized to deliver the same benefits for algebraic path problems. By making fast linear algebra kernels available to everyone, we expect productivity increases when optimized libraries are not available.
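To illustrate the kind of operation the abstract refers to (this is an illustrative sketch, not the paper's implementation), consider a generalized tensor contraction such as C[a,b,i,j] = Σ_c A[a,c,i]·B[c,b,j]. Written as naive loops, this is the form a compiler sees; optimized BLAS libraries instead execute it as a matrix-matrix multiplication (GEMM) over flattened index groups, which is where the large performance gap comes from. The array shapes below are arbitrary example values.

```python
import numpy as np

def contract_naive(A, B):
    """Naive loop nest for C[a,b,i,j] = sum_c A[a,c,i] * B[c,b,j].

    This is the straightforward code a compiler must optimize; without
    tiling, packing, and vectorization it runs far below peak.
    """
    a_dim, c_dim, i_dim = A.shape
    _, b_dim, j_dim = B.shape
    C = np.zeros((a_dim, b_dim, i_dim, j_dim))
    for a in range(a_dim):
        for b in range(b_dim):
            for i in range(i_dim):
                for j in range(j_dim):
                    for c in range(c_dim):
                        C[a, b, i, j] += A[a, c, i] * B[c, b, j]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 2))  # indices (a, c, i)
B = rng.standard_normal((4, 3, 2))  # indices (c, b, j)

# The same contraction expressed declaratively; libraries map such
# expressions onto GEMM by folding (a,i) into rows and (b,j) into
# columns, contracting over c.
C_gemm = np.einsum('aci,cbj->abij', A, B)

assert np.allclose(contract_naive(A, B), C_gemm)
```

Both formulations compute the same result; the paper's point is that a compiler can close the performance gap between the loop-nest form and the library-backed form without calling external code.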




Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 3 (September 2018), 322 pages
ISSN: 1544-3566 · EISSN: 1544-3973 · DOI: 10.1145/3274266
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 04 September 2018
Accepted: 01 June 2018
Revised: 01 May 2018
Received: 01 October 2017
Published in TACO Volume 15, Issue 3


Author Tags

  1. Tensor contractions
  2. high-performance computing
  3. matrix-matrix multiplication

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2023) Compiling Structured Tensor Algebra. Proceedings of the ACM on Programming Languages 7, OOPSLA2 (Oct. 2023), 204--233. DOI: 10.1145/3622804
  • (2023) Rank-Polymorphism for Shape-Guided Blocking. In Proceedings of the 11th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing (Aug. 2023), 1--14. DOI: 10.1145/3609024.3609410
  • (2023) Towards Structured Algebraic Programming. In Proceedings of the 9th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming (June 2023), 50--61. DOI: 10.1145/3589246.3595373
  • (2023) Torchy: A Tracing JIT Compiler for PyTorch. In Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction (Feb. 2023), 98--109. DOI: 10.1145/3578360.3580266
  • (2023) Source Matching and Rewriting for MLIR Using String-Based Automata. ACM Transactions on Architecture and Code Optimization 20, 2 (Mar. 2023), 1--26. DOI: 10.1145/3571283
  • (2023) Fast matrix multiplication via compiler-only layered data reorganization and intrinsic lowering. Software: Practice and Experience 53, 9 (May 2023), 1793--1814. DOI: 10.1002/spe.3214
  • (2022) Modeling Matrix Engines for Portability and Performance. In Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS'22) (May 2022), 1173--1183. DOI: 10.1109/IPDPS53621.2022.00117
  • (2021) Program transformations as the base for optimizing parallelizing compilers. Program Systems: Theory and Applications 12, 1 (2021), 21--113. DOI: 10.25209/2079-3316-2021-12-1-21-113
  • (2021) Toward a Lingua Franca for Deterministic Concurrent Systems. ACM Transactions on Embedded Computing Systems 20, 4 (May 2021), 1--27. DOI: 10.1145/3448128
  • (2021) LPWAN in the TV White Spaces. ACM Transactions on Embedded Computing Systems 20, 4 (May 2021), 1--26. DOI: 10.1145/3447877
