Research Article | Open Access

High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach

Published: 04 September 2018

Abstract

The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches the performance of optimized BLAS libraries without the need for an external implementation or automatic tuning. Our approach provides competitive performance across hardware architectures and can be generalized to deliver the same benefits for algebraic path problems. By making fast linear algebra kernels available to everyone, we expect productivity increases when optimized libraries are not available.
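To illustrate the kind of operation the abstract refers to (this is an illustrative sketch, not the paper's implementation), consider a generalized tensor contraction such as C[a,b,i,j] = Σ_c A[a,c,i]·B[c,b,j]. Written as naive loops, this is the form a compiler sees; optimized BLAS libraries instead execute it as a matrix-matrix multiplication (GEMM) over flattened index groups, which is where the large performance gap comes from. The array shapes below are arbitrary example values.

```python
import numpy as np

def contract_naive(A, B):
    """Naive loop nest for C[a,b,i,j] = sum_c A[a,c,i] * B[c,b,j].

    This is the straightforward code a compiler must optimize; without
    tiling, packing, and vectorization it runs far below peak.
    """
    a_dim, c_dim, i_dim = A.shape
    _, b_dim, j_dim = B.shape
    C = np.zeros((a_dim, b_dim, i_dim, j_dim))
    for a in range(a_dim):
        for b in range(b_dim):
            for i in range(i_dim):
                for j in range(j_dim):
                    for c in range(c_dim):
                        C[a, b, i, j] += A[a, c, i] * B[c, b, j]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 2))  # indices (a, c, i)
B = rng.standard_normal((4, 3, 2))  # indices (c, b, j)

# The same contraction expressed declaratively; libraries map such
# expressions onto GEMM by folding (a,i) into rows and (b,j) into
# columns, contracting over c.
C_gemm = np.einsum('aci,cbj->abij', A, B)

assert np.allclose(contract_naive(A, B), C_gemm)
```

Both formulations compute the same result; the paper's point is that a compiler can close the performance gap between the loop-nest form and the library-backed form without calling external code.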




Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 3 (September 2018), 322 pages
ISSN: 1544-3566 · EISSN: 1544-3973 · DOI: 10.1145/3274266
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 04 September 2018
Accepted: 01 June 2018
Revised: 01 May 2018
Received: 01 October 2017
Published in TACO Volume 15, Issue 3


Author Tags

  1. Tensor contractions
  2. high-performance computing
  3. matrix-matrix multiplication

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2023) Compiling Structured Tensor Algebra. Proceedings of the ACM on Programming Languages 7, OOPSLA2 (Oct. 2023), 204--233. DOI: 10.1145/3622804
  • (2023) Rank-Polymorphism for Shape-Guided Blocking. In Proceedings of the 11th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing (Aug. 2023), 1--14. DOI: 10.1145/3609024.3609410
  • (2023) Towards Structured Algebraic Programming. In Proceedings of the 9th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming (June 2023), 50--61. DOI: 10.1145/3589246.3595373
  • (2023) Torchy: A Tracing JIT Compiler for PyTorch. In Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction (Feb. 2023), 98--109. DOI: 10.1145/3578360.3580266
  • (2023) Source Matching and Rewriting for MLIR Using String-Based Automata. ACM Transactions on Architecture and Code Optimization 20, 2 (Mar. 2023), 1--26. DOI: 10.1145/3571283
  • (2023) Fast matrix multiplication via compiler-only layered data reorganization and intrinsic lowering. Software: Practice and Experience 53, 9 (May 2023), 1793--1814. DOI: 10.1002/spe.3214
  • (2022) Modeling Matrix Engines for Portability and Performance. In Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS'22) (May 2022), 1173--1183. DOI: 10.1109/IPDPS53621.2022.00117
  • (2021) Program transformations as the base for optimizing parallelizing compilers. Program Systems: Theory and Applications 12, 1 (2021), 21--113. DOI: 10.25209/2079-3316-2021-12-1-21-113
  • (2021) Toward a Lingua Franca for Deterministic Concurrent Systems. ACM Transactions on Embedded Computing Systems 20, 4 (May 2021), 1--27. DOI: 10.1145/3448128
  • (2021) LPWAN in the TV White Spaces. ACM Transactions on Embedded Computing Systems 20, 4 (May 2021), 1--26. DOI: 10.1145/3447877
