Analytical cache modeling and tilesize optimization for tensor contractions

Authors: Aravind Sukumaran-Rajam, Fabrice Rastello, Atanas Rountev, P. Sadayappan

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 74, Pages 1 - 13

https://doi.org/10.1145/3295500.3356218

Published: 17 November 2019

Abstract

Data movement between processor and memory hierarchy is a fundamental bottleneck that limits the performance of many applications on modern computer architectures. Tiling and loop permutation are key techniques for improving data locality. However, selecting effective tile sizes and loop permutations is particularly challenging for tensor contractions due to the large number of loops. Even state-of-the-art compilers usually produce sub-optimal tile sizes and loop permutations, as they rely on naïve cost models. In this paper we provide an analytical-model-based approach to multi-level tile size optimization and permutation selection for tensor contractions. Our experimental results show that this approach achieves comparable or better performance than state-of-the-art frameworks and libraries for tensor contractions.
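To make the two transformations named in the abstract concrete, here is a minimal Python sketch of one-level tiling and tile-loop permutation for a small contraction, C[a][b][c] = sum over d of A[a][d][b] * B[d][c]. It is an illustration under stated assumptions, not the paper's method: the contraction, the function names (`contract_reference`, `contract_tiled`, `tile_footprint`), and the footprint count are all hypothetical, and the footprint is only the kind of quantity an analytical cache model might compare against a cache capacity.

```python
from itertools import product


def contract_reference(A, B, Na, Nb, Nc, Nd):
    """Untiled reference: C[a][b][c] = sum_d A[a][d][b] * B[d][c]."""
    C = [[[0.0] * Nc for _ in range(Nb)] for _ in range(Na)]
    for a, b, c, d in product(range(Na), range(Nb), range(Nc), range(Nd)):
        C[a][b][c] += A[a][d][b] * B[d][c]
    return C


def contract_tiled(A, B, Na, Nb, Nc, Nd, tiles, order="abcd"):
    """Same contraction with one level of tiling.

    tiles maps each index name to its tile size (hypothetical interface);
    order is the permutation of the tile loops, outermost first.
    """
    extent = {"a": Na, "b": Nb, "c": Nc, "d": Nd}
    C = [[[0.0] * Nc for _ in range(Nb)] for _ in range(Na)]
    # Tile loops: visit tile origins in the chosen permutation.
    tile_ranges = [range(0, extent[x], tiles[x]) for x in order]
    for origin in product(*tile_ranges):
        lo = dict(zip(order, origin))
        # Clamp the upper bound so partial tiles at the edges are handled.
        hi = {x: min(lo[x] + tiles[x], extent[x]) for x in order}
        # Intra-tile loops over the iteration space of one tile.
        for a in range(lo["a"], hi["a"]):
            for b in range(lo["b"], hi["b"]):
                for c in range(lo["c"], hi["c"]):
                    for d in range(lo["d"], hi["d"]):
                        C[a][b][c] += A[a][d][b] * B[d][c]
    return C


def tile_footprint(tiles):
    """Distinct elements of A, B and C touched by one full tile."""
    ta, tb, tc, td = (tiles[x] for x in "abcd")
    return ta * td * tb + td * tc + ta * tb * tc
```

Any choice of `tiles` and `order` computes the same result as the reference; what changes is the order in which data is touched, which is why a cost model is needed to choose among the many candidates. With four loops there are already 24 tile-loop permutations per tiling level, and the count grows quickly for contractions with more indices.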

Index Terms

1. Software and its engineering
  1. Software notations and tools

Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2019

1921 pages

ISBN: 9781450362290

DOI: 10.1145/3295500

General Chair: Michela Taufer
Program Chairs: Pavan Balaji, Antonio J. Peña

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

SC '19

Sponsor:

SIGHPC

SC '19: The International Conference for High Performance Computing, Networking, Storage, and Analysis

November 17 - 19, 2019

Denver, Colorado

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics

Article Metrics

Total Citations: 17
Total Downloads: 715
Downloads (last 12 months): 96
Downloads (last 6 weeks): 7

Reflects downloads up to 12 Sep 2024

Cited By

Maeng K, Lucia B, Rodríguez G, Sadayappan P, Sukumaran-Rajam A. 2024. Compiler-Based Memory Encryption for Machine Learning on Commodity Low-Power Devices. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, 198-211. Online publication date: 17-Feb-2024.
https://dl.acm.org/doi/10.1145/3640537.3641564
Xiao G, Yin C, Chen Y, Duan M, Li K. 2024. Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction. IEEE Transactions on Parallel and Distributed Systems 35:6, 1044-1055. Online publication date: Jun-2024.
https://doi.org/10.1109/TPDS.2024.3391254
Ahmed N, Alwan E, Fanfakh A. 2024. Enhancing Programs Efficiency through a Machine Learning-Based Model for Tile Size Selection. BIO Web of Conferences 97, 00021. Online publication date: 5-Apr-2024.
https://doi.org/10.1051/bioconf/20249700021
Hutter E, Solomonik E, Mohror K, Arnold D, Badia R. 2023. Application Performance Modeling via Tensor Completion. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3581784.3607069
Tollenaere N, Iooss G, Pouget S, Brunie H, Guillon C, Cohen A, Sadayappan P, Rastello F. 2023. Autotuning Convolutions Is Easier Than You Think. ACM Transactions on Architecture and Code Optimization 20:2, 1-24. Online publication date: 1-Mar-2023.
https://dl.acm.org/doi/10.1145/3570641
Yu F, Zhao J, Cui H, Feng X, Xue J. 2023. VTensor: Using Virtual Tensors to Build a Layout-Oblivious AI Programming Framework. Journal of Computer Science and Technology 38:5, 1074-1097. Online publication date: 1-Sep-2023.
https://dl.acm.org/doi/10.1007/s11390-022-1457-6
Wang Q, Peng Z, Ren B, Chen J, Edwards R. 2022. MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation. ACM Transactions on Architecture and Code Optimization 19:2, 1-26. Online publication date: 24-Mar-2022.
https://dl.acm.org/doi/10.1145/3506705
Tukanov N, Srinivasaraghavan R, Moreira J, Low T. 2022. Modeling Matrix Engines for Portability and Performance. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 1173-1183. Online publication date: May-2022.
https://doi.org/10.1109/IPDPS53621.2022.00117
Xiao G, Yin C, Chen Y, Duan M, Li K. 2022. GSpTC: High-Performance Sparse Tensor Contraction on CPU-GPU Heterogeneous Systems. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 380-387. Online publication date: Dec-2022.
https://doi.org/10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00080
Kelefouras V, Djemame K, Keramidas G, Voros N. 2022. A Methodology for Efficient Tile Size Selection for Affine Loop Kernels. International Journal of Parallel Programming 50:3-4, 405-432. Online publication date: 23-May-2022.
https://doi.org/10.1007/s10766-022-00734-5
