Analytical Modeling Is Enough for High-Performance BLIS

Published: 16 August 2016

Abstract

We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of matrix-matrix multiplication. This is of both practical and scientific importance: it greatly reduces the development effort required to implement the level-3 BLAS, and it advances our understanding of how hierarchically layered memories interact with high-performance software. The community can thus move on from valuable engineering solutions (empirical autotuning) to scientific understanding (analytical insight).
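
As an illustration of what such tuning parameters look like, GotoBLAS-style implementations organize matrix-matrix multiplication as nested loops over cache-sized and register-sized blocks. The following is a minimal Python sketch of that loop structure, not the authors' code. The function name `blocked_gemm` and the default block sizes (`nc`, `kc`, `mc`, `nr`, `mr`) are hypothetical placeholders chosen for illustration; they are not derived from any cache model, and the packing of A and B into contiguous buffers that real implementations perform is omitted.

```python
def blocked_gemm(A, B, C, m, n, k, nc=8, kc=4, mc=4, nr=2, mr=2):
    """Compute C += A * B on row-major nested lists using five blocked loops.

    A is m x k, B is k x n, C is m x n. The block sizes are illustrative
    placeholders (an assumption), not analytically derived values.
    """
    for jc in range(0, n, nc):              # loop 5: column panels of B and C
        for pc in range(0, k, kc):          # loop 4: slices of the k dimension
            for ic in range(0, m, mc):      # loop 3: row blocks of A and C
                j_end = min(jc + nc, n)
                i_end = min(ic + mc, m)
                p_end = min(pc + kc, k)
                for jr in range(jc, j_end, nr):      # loop 2: nr-wide slivers
                    for ir in range(ic, i_end, mr):  # loop 1: mr-tall slivers
                        # Micro-kernel: update a tile of C of at most mr x nr.
                        for p in range(pc, p_end):
                            for i in range(ir, min(ir + mr, i_end)):
                                a = A[i][p]
                                for j in range(jr, min(jr + nr, j_end)):
                                    C[i][j] += a * B[p][j]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    C = [[0, 0], [0, 0]]
    print(blocked_gemm(A, B, C, 2, 2, 3))  # prints [[58, 64], [139, 154]]
```

In this sketch the result is independent of the block sizes; only the order in which the data is visited changes. That is why the block sizes can be treated as performance tuning parameters, whether they are chosen by empirical search or, as the abstract argues, analytically.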


    Published In

    ACM Transactions on Mathematical Software, Volume 43, Issue 2
    June 2017
    200 pages
    ISSN: 0098-3500
    EISSN: 1557-7295
    DOI: 10.1145/2988256
    Editor: Michael A. Heroux
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 August 2016
    Accepted: 01 April 2016
    Revised: 01 April 2016
    Received: 01 February 2015
    Published in TOMS Volume 43, Issue 2

    Author Tags

    1. Linear algebra
    2. analytical modeling
    3. high performance
    4. libraries
    5. matrix multiplication

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Ministerio de Economía y Competitividad
    • NSF
    • FEDER
