DOI: 10.1145/3295500.3356218
research-article

Analytical cache modeling and tilesize optimization for tensor contractions

Published: 17 November 2019

Abstract

Data movement between the processor and the memory hierarchy is a fundamental bottleneck that limits the performance of many applications on modern computer architectures. Tiling and loop permutation are key techniques for improving data locality. However, selecting effective tile sizes and loop permutations is particularly challenging for tensor contractions because of the large number of loops involved. Even state-of-the-art compilers usually produce suboptimal tile sizes and loop permutations, as they rely on naïve cost models. In this paper, we present an analytical-model-based approach to multi-level tile-size optimization and permutation selection for tensor contractions. Our experimental results show that this approach achieves performance comparable to or better than that of state-of-the-art frameworks and libraries for tensor contractions.
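To make the terms "tiling" and "loop permutation" concrete, the following is a minimal sketch in plain Python. It is not the paper's cost model or code generator: the contraction C[i][j][k] = Σ_l A[i][l][k] · B[l][j], the function names, the tile sizes, and the chosen loop order are all hypothetical and picked only for illustration.

```python
from itertools import product

def contract_reference(A, B, I, J, K, L):
    """Untiled contraction C[i][j][k] = sum over l of A[i][l][k] * B[l][j]."""
    C = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for i, j, k, l in product(range(I), range(J), range(K), range(L)):
        C[i][j][k] += A[i][l][k] * B[l][j]
    return C

def contract_tiled(A, B, I, J, K, L, tiles=(2, 2, 2, 2)):
    """Same contraction with every loop tiled; `tiles` is (Ti, Tj, Tk, Tl).

    The tile sizes and the loop order are arbitrary illustrative choices,
    not values produced by any analytical model.
    """
    Ti, Tj, Tk, Tl = tiles
    C = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    # Tile loops visit block origins; (i0, j0, l0, k0) is one possible permutation.
    for i0, j0, l0, k0 in product(range(0, I, Ti), range(0, J, Tj),
                                  range(0, L, Tl), range(0, K, Tk)):
        # Intra-tile loops touch only one block of A, B, and C at a time.
        for i in range(i0, min(i0 + Ti, I)):
            for j in range(j0, min(j0 + Tj, J)):
                for l in range(l0, min(l0 + Tl, L)):
                    b = B[l][j]  # reused across the innermost k loop
                    for k in range(k0, min(k0 + Tk, K)):
                        C[i][j][k] += A[i][l][k] * b
    return C

def make_inputs(I, J, K, L):
    """Small integer-valued inputs so both versions agree exactly."""
    A = [[[float(i + 2 * l + 3 * k) for k in range(K)]
          for l in range(L)] for i in range(I)]
    B = [[float(l - j) for j in range(J)] for l in range(L)]
    return A, B

I, J, K, L = 3, 4, 5, 6
A, B = make_inputs(I, J, K, L)
assert contract_tiled(A, B, I, J, K, L, tiles=(2, 3, 4, 5)) == \
    contract_reference(A, B, I, J, K, L)
```

Both versions compute the same result; what changes is the order in which elements are visited. With four loops there are already 24 possible tile-loop orders and a four-dimensional space of tile sizes per cache level, which gives a sense of why the abstract describes the selection problem as challenging.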

    Published In

    SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2019
    1921 pages
    ISBN: 9781450362290
    DOI: 10.1145/3295500
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. domain-specific compiler optimization
    2. model-driven design-space exploration
    3. performance modeling
    4. tensor contraction

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Foundation

    Conference

    SC '19

    Acceptance Rates

    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Article Metrics

    • Downloads (Last 12 months): 96
    • Downloads (Last 6 weeks): 7
    Reflects downloads up to 12 Sep 2024

    Cited By

    • (2024) Compiler-Based Memory Encryption for Machine Learning on Commodity Low-Power Devices. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, 198-211. DOI: 10.1145/3640537.3641564. Online publication date: 17-Feb-2024.
    • (2024) Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction. IEEE Transactions on Parallel and Distributed Systems 35:6, 1044-1055. DOI: 10.1109/TPDS.2024.3391254. Online publication date: Jun-2024.
    • (2024) Enhancing Programs Efficiency through a Machine Learning-Based Model for Tile Size Selection. BIO Web of Conferences 97, 00021. DOI: 10.1051/bioconf/20249700021. Online publication date: 5-Apr-2024.
    • (2023) Application Performance Modeling via Tensor Completion. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1145/3581784.3607069. Online publication date: 12-Nov-2023.
    • (2023) Autotuning Convolutions Is Easier Than You Think. ACM Transactions on Architecture and Code Optimization 20:2, 1-24. DOI: 10.1145/3570641. Online publication date: 1-Mar-2023.
    • (2023) VTensor: Using Virtual Tensors to Build a Layout-Oblivious AI Programming Framework. Journal of Computer Science and Technology 38:5, 1074-1097. DOI: 10.1007/s11390-022-1457-6. Online publication date: 1-Sep-2023.
    • (2022) MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation. ACM Transactions on Architecture and Code Optimization 19:2, 1-26. DOI: 10.1145/3506705. Online publication date: 24-Mar-2022.
    • (2022) Modeling Matrix Engines for Portability and Performance. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 1173-1183. DOI: 10.1109/IPDPS53621.2022.00117. Online publication date: May-2022.
    • (2022) GSpTC: High-Performance Sparse Tensor Contraction on CPU-GPU Heterogeneous Systems. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 380-387. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00080. Online publication date: Dec-2022.
    • (2022) A Methodology for Efficient Tile Size Selection for Affine Loop Kernels. International Journal of Parallel Programming 50:3-4, 405-432. DOI: 10.1007/s10766-022-00734-5. Online publication date: 23-May-2022.
