Analytical Modeling Is Enough for High-Performance BLIS

Authors:

Francisco D. Igual,

Tyler M. Smith,

Enrique S. Quintana-Orti

ACM Transactions on Mathematical Software (TOMS), Volume 43, Issue 2

Article No.: 12, Pages 1 - 18

https://doi.org/10.1145/2925987

Published: 16 August 2016

Abstract

We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required to implement the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with high-performance software. The approach lets the community move on from valuable engineering solutions (empirical autotuning) to scientific understanding (analytical insight).
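To make "analytically determine tuning parameters" concrete, the following is a minimal sketch of one cache-capacity calculation of this kind. It is an illustration under stated assumptions, not a reproduction of the article's model: the function name `l1_kc`, the rule of reserving one cache way for the output tile, and the example cache geometry (32 KiB, 8-way, 64-byte lines, an 8 x 4 double-precision register tile) are all assumed here for the example. The idea is to pick the largest inner blocking dimension k_c such that an m_r x k_c micro-panel of A and a k_c x n_r micro-panel of B fit together in a set-associative L1 cache.

```python
def l1_kc(ways, sets, line_bytes, m_r, n_r, elem_bytes):
    """Hypothetical helper: largest k_c for which an m_r x k_c micro-panel
    of A and a k_c x n_r micro-panel of B share the L1 cache.

    Assumption (not from the abstract): one way per set is left free for
    the m_r x n_r tile of C, and the remaining ways are split between A
    and B in the ratio m_r : n_r.
    """
    # Ways per set granted to the A micro-panel (integer arithmetic,
    # equivalent to floor((ways - 1) / (1 + n_r / m_r))).
    ways_a = (ways - 1) * m_r // (m_r + n_r)
    # Bytes available to A, divided by the bytes in one column of the
    # A micro-panel, gives the number of columns, i.e. k_c.
    return ways_a * sets * line_bytes // (m_r * elem_bytes)


# Assumed example geometry: 8 ways x 64 sets x 64-byte lines = 32 KiB L1.
kc = l1_kc(ways=8, sets=64, line_bytes=64, m_r=8, n_r=4, elem_bytes=8)
footprint = (8 * kc + kc * 4) * 8  # bytes used by the A and B micro-panels
print(kc, footprint)
```

With these assumed inputs the calculation yields k_c = 256, and the two micro-panels occupy 24 KiB of the 32 KiB cache. The point of the sketch is only that such a parameter follows from a few hardware constants by arithmetic, with no empirical search.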

Index Terms

Analytical Modeling Is Enough for High-Performance BLIS
1. Mathematics of computing
  1. Mathematical software
    1. Mathematical software performance

Published In

ACM Transactions on Mathematical Software Volume 43, Issue 2

June 2017

200 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/2988256

Editor:
Michael A. Heroux
Sandia National Laboratories, USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 August 2016

Accepted: 01 April 2016

Revised: 01 April 2016

Received: 01 February 2015

Published in TOMS Volume 43, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Ministerio de Economía y Competitividad
NSF
FEDER

Bibliometrics

Total Citations: 104
Total Downloads: 2,425
Downloads (Last 12 months): 642
Downloads (Last 6 weeks): 114

Reflects downloads up to 16 Oct 2024

Cited By

Egunov V, Surin V, Stupnitskiy P, Akhmetova R (2024). Research on the effectiveness of data compression methods in relational and NoSQL DBMS. Herald of Dagestan State Technical University. Technical Sciences 51:1 (87-94). DOI: 10.21822/2073-6185-2024-51-1-87-94. Online publication date: 16-Apr-2024
https://doi.org/10.21822/2073-6185-2024-51-1-87-94

Lei J, Quintana-Ortí E, di Sanzo P, Marotta R (2024). Mapping Parallel Matrix Multiplication in GotoBLAS2 to the AMD Versal ACAP for Deep Learning. Proceedings of the 4th Workshop on Performance and Energy Efficiency in Concurrent and Distributed Systems (1-8). DOI: 10.1145/3659997.3660032. Online publication date: 3-Jun-2024
https://dl.acm.org/doi/10.1145/3659997.3660032

Fu X, Yang W, Dong D, Su X (2024). Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs. Proceedings of the 38th ACM International Conference on Supercomputing (137-149). DOI: 10.1145/3650200.3656620. Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3650200.3656620

Wei C, Jia H, Zhang Y, Yao J, Li C, Cao W (2024). IrGEMM: An Input-Aware Tuning Framework for Irregular GEMM on ARM and X86 CPUs. IEEE Transactions on Parallel and Distributed Systems 35:9 (1672-1689). DOI: 10.1109/TPDS.2024.3432579. Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.1109/TPDS.2024.3432579

Yang W, Fang J, Dong D, Su X, Wang Z (2024). Optimizing Full-Spectrum Matrix Multiplications on ARMv8 Multi-Core CPUs. IEEE Transactions on Parallel and Distributed Systems 35:3 (439-454). DOI: 10.1109/TPDS.2024.3350368. Online publication date: 1-Mar-2024
https://dl.acm.org/doi/10.1109/TPDS.2024.3350368

De Albuquerque Silva I, Carle T, Gauffriau A, Jegu V, Pagetti C (2024). A Predictable SIMD Library for GEMM Routines. 2024 IEEE 30th Real-Time and Embedded Technology and Applications Symposium (RTAS) (55-67). DOI: 10.1109/RTAS61025.2024.00013. Online publication date: 13-May-2024
https://doi.org/10.1109/RTAS61025.2024.00013

Castelló A, Bellavita J, Dinh G, Ikarashi Y, Martínez H, Grosser T, Dubach C, Steuwer M, Xue J, Ottoni G, Quintão Pereira F (2024). Tackling the Matrix Multiplication Micro-kernel Generation with Exo. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization (182-192). DOI: 10.1109/CGO57630.2024.10444883. Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444883

Li J, Qin Z, Mei Y, Cui J, Song Y, Chen C, Zhang Y, Du L, Cheng X, Jin B, Zhang Y, Ye J, Lin E, Lavery D, Grosser T, Dubach C, Steuwer M, Xue J, Ottoni G, Quintão Pereira F (2024). oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization (460-470). DOI: 10.1109/CGO57630.2024.10444871. Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444871

Alaejos G, Martínez H, Castelló A, Dolz M, Igual F, Alonso-Jordá P, Quintana-Ortí E (2024). Automatic generation of ARM NEON micro-kernels for matrix multiplication. The Journal of Supercomputing 80:10 (13873-13899). DOI: 10.1007/s11227-024-05955-8. Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.1007/s11227-024-05955-8

Błażejowski M (2024). Which C compiler and BLAS/LAPACK library should I use: gretl’s numerical efficiency in different configurations. Computational Statistics. DOI: 10.1007/s00180-024-01461-w. Online publication date: 22-Mar-2024
https://doi.org/10.1007/s00180-024-01461-w
