Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues

Authors:

Bo Kågström,

Charles van Loan

ACM Transactions on Mathematical Software (TOMS), Volume 24, Issue 3

Pages 303 - 316

https://doi.org/10.1145/292395.292426

Published: 01 September 1998

Abstract

This companion article discusses portability and optimization issues of the GEMM-based level 3 BLAS model implementations and the performance evaluation benchmark. All software is provided in all four data types (single- and double-precision, real and complex) and is designed to be easy to implement and use on different platforms. Each of the GEMM-based routines has a few machine-dependent parameters that specify internal block sizes, cache characteristics, and branch points for alternative code sections. These parameters provide a means of adjusting the routines to the characteristics of a memory hierarchy.
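As an illustration of what an internal block-size parameter can look like, the following is a minimal Python sketch of a GEMM-based symmetric matrix multiply. It is not the authors' Fortran 77 code: the function names `gemm`, `block`, and `symm_lower`, and the single tunable parameter `nb`, are assumptions made for this example. The parameter `nb` stands in for the machine-dependent block sizes the abstract mentions, and changing it alters only how the work is partitioned, not the result.

```python
def gemm(A, B, C, transa=False):
    """C += op(A) * B for dense matrices stored as lists of rows.

    op(A) is A, or the transpose of A when transa is True.
    """
    m, n, k = len(C), len(C[0]), len(B)
    for i in range(m):
        for p in range(k):
            a = A[p][i] if transa else A[i][p]
            for j in range(n):
                C[i][j] += a * B[p][j]


def block(M, r0, r1, c0, c1):
    """Copy the submatrix M[r0:r1, c0:c1]."""
    return [row[c0:c1] for row in M[r0:r1]]


def symm_lower(A, B, nb):
    """Return A * B, where A is symmetric and only its lower triangle is read.

    nb is a hypothetical tunable block size (an assumption of this sketch).
    """
    n, ncols = len(A), len(B[0])
    C = [[0.0] * ncols for _ in range(n)]
    for i0 in range(0, n, nb):
        i1 = min(i0 + nb, n)
        Ci = [[0.0] * ncols for _ in range(i1 - i0)]
        for j0 in range(0, n, nb):
            j1 = min(j0 + nb, n)
            Bj = block(B, j0, j1, 0, ncols)
            if j0 < i0:
                # Block lies below the diagonal: use it directly.
                gemm(block(A, i0, i1, j0, j1), Bj, Ci)
            elif j0 > i0:
                # Block lies above the diagonal: use the transpose of its mirror.
                gemm(block(A, j0, j1, i0, i1), Bj, Ci, transa=True)
            else:
                # Diagonal block: expand the lower triangle into a full dense
                # block so that the same GEMM kernel can be applied.
                w = i1 - i0
                T = [[A[i0 + max(r, c)][i0 + min(r, c)] for c in range(w)]
                     for r in range(w)]
                gemm(T, Bj, Ci)
        C[i0:i1] = Ci
    return C
```

In this sketch all floating-point work goes through the one `gemm` kernel, so tuning `nb` (and the kernel itself) is the only platform-specific adjustment needed.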

Supplementary Material

GZ File (784.gz)

Software for "GEMM-Based Level 3 BLAS: Portability and Optimization Issues"

Size: 329.78 KB

Reviews

Reviewer: Timothy R. Hopkins

The Basic Linear Algebra Subprograms (BLAS) consist of three libraries (known as Levels 1, 2, and 3) and form an integral part of much of the important numerical software developed over the last two decades. Efficient implementations of these libraries often lead in turn to large gains in the efficiency of higher-level routines, such as the LAPACK library of linear algebra software. Many vendors supply versions of the BLAS tuned to a particular platform, and these are often hand coded to extract the best performance from the target hardware. The hierarchical memory organization common to many current systems makes the development of efficient, platform-specific Level 3 BLAS (which perform matrix-matrix operations) especially challenging and expensive. The authors of this pair of papers show how it is possible to produce an efficient BLAS Level 3 library based on highly optimized versions of the single Level 3 routine that performs a general matrix-matrix multiply (GEMM) and a small number of simpler Level 1 and Level 2 routines. They also provide a model implementation of their GEMM-based routines along with comprehensive benchmarking software that allows users to measure the quality of vendor implementations against an efficient, portable, Fortran 77 version of the library. The papers provide a detailed account of the strategies required to squeeze the last drop of efficiency from today's processors. Although targeted at the matrix-matrix multiply, the lessons and techniques employed will be valuable to anyone wishing to obtain similar performance from other higher-level numerical operations.
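The combination the reviewer describes, one GEMM kernel plus a few simpler lower-level routines, can be sketched for a blocked triangular solve. The Python below is an illustrative sketch only, not the published Fortran 77 software: the names `gemm_sub`, `trsv_lower`, and `trsm_lower`, and the block size `nb`, are assumptions of this example. Off-diagonal updates go through the GEMM-style kernel, and each small diagonal block is solved column by column with a Level 2 style forward substitution.

```python
def gemm_sub(A, B, C):
    """C -= A * B, the only matrix-matrix kernel used in this sketch."""
    for i in range(len(C)):
        for p in range(len(B)):
            a = A[i][p]
            for j in range(len(C[0])):
                C[i][j] -= a * B[p][j]


def trsv_lower(L, x):
    """Solve L * y = x in place by forward substitution (Level 2 style)."""
    for i in range(len(x)):
        s = x[i]
        for p in range(i):
            s -= L[i][p] * x[p]
        x[i] = s / L[i][i]


def trsm_lower(L, B, nb):
    """Return X such that L * X = B, with L lower triangular.

    nb is a hypothetical tunable block size (an assumption of this sketch).
    """
    n, m = len(L), len(B[0])
    X = [row[:] for row in B]
    for i0 in range(0, n, nb):
        i1 = min(i0 + nb, n)
        Xi = X[i0:i1]  # shares row objects with X, so updates land in X
        if i0 > 0:
            # Subtract the contribution of the block rows already solved.
            gemm_sub([row[:i0] for row in L[i0:i1]], X[:i0], Xi)
        # Solve with the small diagonal block, one right-hand side at a time.
        Lii = [row[i0:i1] for row in L[i0:i1]]
        for j in range(m):
            col = [Xi[r][j] for r in range(i1 - i0)]
            trsv_lower(Lii, col)
            for r in range(i1 - i0):
                Xi[r][j] = col[r]
    return X
```

With a large `nb` most of the work moves into the diagonal solves, while a small `nb` pushes it into the GEMM updates; a tuned implementation would pick the balance per platform.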

Published In

ACM Transactions on Mathematical Software Volume 24, Issue 3

Sept. 1998

95 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/292395

Publisher

Association for Computing Machinery

New York, NY, United States
