
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

Published: 01 September 1998

Abstract

    The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures, the development of optimal level 3 BLAS code is costly and time-consuming. However, it is possible to develop a portable and high-performance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS are structured to effectively reduce data traffic in a memory hierarchy. Second, the GEMM-based level 3 BLAS performance evaluation benchmark is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations.
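
    As a concrete illustration of the partitioning, consider the symmetric rank-k update SYRK. The sketch below is a minimal example in the spirit of the model implementations, not the authors' code; the routine name BSYRK and the explicit block-size argument NB are invented for the illustration. It computes the lower triangular part of C := alpha*A*A' + beta*C by calling DGEMM for every block below the diagonal and falling back on a small level 2 style loop only for the NB-by-NB diagonal blocks.

      SUBROUTINE BSYRK( N, K, ALPHA, A, LDA, BETA, C, LDC, NB )
*     Illustrative GEMM-based SYRK (lower triangular case):
*        C := ALPHA*A*A' + BETA*C,  A is N-by-K, C is N-by-N.
*     Hypothetical sketch; BSYRK and NB are not part of the BLAS
*     interface. All off-diagonal work is done by DGEMM.
      INTEGER            N, K, LDA, LDC, NB
      DOUBLE PRECISION   ALPHA, BETA
      DOUBLE PRECISION   A( LDA, * ), C( LDC, * )
      EXTERNAL           DGEMM
      INTEGER            I, II, J, JB, L
      DOUBLE PRECISION   TEMP
      DO 40 II = 1, N, NB
         JB = MIN( NB, N-II+1 )
*        Diagonal block: small level 2 style rank-K update.
         DO 30 J = II, II+JB-1
            DO 20 I = J, II+JB-1
               TEMP = 0.0D+0
               DO 10 L = 1, K
                  TEMP = TEMP + A( I, L )*A( J, L )
   10          CONTINUE
               C( I, J ) = BETA*C( I, J ) + ALPHA*TEMP
   20       CONTINUE
   30    CONTINUE
*        All rows below the diagonal block: one DGEMM call.
         IF( II+JB.LE.N )
     $      CALL DGEMM( 'N', 'T', N-II-JB+1, JB, K, ALPHA,
     $                  A( II+JB, 1 ), LDA, A( II, 1 ), LDA,
     $                  BETA, C( II+JB, II ), LDC )
   40 CONTINUE
      RETURN
      END

    With this partitioning, all but O(N*NB*K) of the roughly N*N*K flops run at DGEMM speed, which is the essence of the GEMM-based approach.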


    Reviews

    Timothy R. Hopkins

    The Basic Linear Algebra Subprograms (BLAS) consist of three libraries (known as Levels 1, 2, and 3) and form an integral part of much of the important numerical software developed over the last two decades. Efficient implementations of these libraries often lead in turn to large gains in the efficiency of higher-level routines, such as the LAPACK library of linear algebra software. Many vendors supply versions of the BLAS tuned to a particular platform, and these are often hand coded to extract the best performance from the target hardware. The hierarchical memory organization common to many current systems makes the development of efficient, platform-specific Level 3 BLAS (which perform matrix-matrix operations) especially challenging and expensive.

    The authors of this pair of papers show how it is possible to produce an efficient Level 3 BLAS library based on highly optimized versions of the single Level 3 routine that performs a general matrix-matrix multiply (GEMM) and a small number of simpler Level 1 and Level 2 routines. They also provide a model implementation of their GEMM-based routines, along with comprehensive benchmarking software that allows users to measure the quality of vendor implementations against an efficient, portable, Fortran 77 version of the library. The papers provide a detailed account of the strategies required to squeeze the last drop of efficiency from today's processors. Although targeted at the matrix-matrix multiply, the lessons and techniques employed will be valuable to anyone wishing to obtain similar performance from other higher-level numerical operations.
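
    A benchmark of this kind essentially times kernel calls and converts the elapsed time to a flop rate. The program below is a minimal sketch of that measurement, not the published benchmark (which supplies its own drivers, timers, and problem configurations): it times a single DGEMM call and reports Mflop/s, counting about 2*N**3 flops for an N-by-N multiply-and-add. CPU_TIME is a Fortran 90 intrinsic, used here for brevity.

      PROGRAM TGEMM
*     Hypothetical timing sketch: measure the flop rate of one
*     DGEMM call, C := C + A*B, with square N-by-N operands.
      INTEGER            N
      PARAMETER        ( N = 500 )
      DOUBLE PRECISION   A( N, N ), B( N, N ), C( N, N )
      DOUBLE PRECISION   T1, T2, MFLOPS
      INTEGER            I, J
      EXTERNAL           DGEMM
*     Fill the operands with arbitrary nonzero data.
      DO 20 J = 1, N
         DO 10 I = 1, N
            A( I, J ) = 1.0D+0/DBLE( I+J )
            B( I, J ) = DBLE( I-J )
            C( I, J ) = 0.0D+0
   10    CONTINUE
   20 CONTINUE
      CALL CPU_TIME( T1 )
      CALL DGEMM( 'N', 'N', N, N, N, 1.0D+0, A, N, B, N,
     $            1.0D+0, C, N )
      CALL CPU_TIME( T2 )
*     A general matrix multiply-and-add performs about 2*N**3
*     flops; larger N (or repeated calls) may be needed where the
*     timer resolution is coarse.
      MFLOPS = 2.0D+0*DBLE( N )**3/( ( T2-T1 )*1.0D+6 )
      WRITE( *, * ) 'DGEMM: ', MFLOPS, ' Mflop/s'
      END

    Run against a vendor DGEMM and against the reference Fortran 77 BLAS, the same measurement exposes the performance gap that the GEMM-based approach is designed to exploit.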

    Published In

    ACM Transactions on Mathematical Software  Volume 24, Issue 3
    Sept. 1998
    95 pages
    ISSN:0098-3500
    EISSN:1557-7295
    DOI:10.1145/292395

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 September 1998
    Published in TOMS Volume 24, Issue 3

    Author Tags

    1. GEMM-based level 3 BLAS
    2. blocked algorithms
    3. matrix-matrix kernels
    4. memory hierarchy
    5. parallelization
    6. vectorization

    Qualifiers

    • Article

Cited By

    • (2024) An Automated Approach for Improving the Inference Latency and Energy Efficiency of Pretrained CNNs by Removing Irrelevant Pixels with Focused Convolutions. Proceedings of the 29th Asia and South Pacific Design Automation Conference, 890-895. DOI: 10.1109/ASP-DAC58780.2024.10473884. Online publication date: 22-Jan-2024.
    • (2024) Automatic generation of ARM NEON micro-kernels for matrix multiplication. The Journal of Supercomputing 80, 10, 13873-13899. DOI: 10.1007/s11227-024-05955-8. Online publication date: 1-Jul-2024.
    • (2022) Hardware-friendly User-specific Machine Learning for Edge Devices. ACM Transactions on Embedded Computing Systems 21, 5, 1-29. DOI: 10.1145/3524125. Online publication date: 8-Oct-2022.
    • (2022) Contention Grading and Adaptive Model Selection for Machine Vision in Embedded Systems. ACM Transactions on Embedded Computing Systems 21, 5, 1-29. DOI: 10.1145/3520134. Online publication date: 8-Oct-2022.
    • (2022) DynO: Dynamic Onloading of Deep Neural Networks from Cloud to Device. ACM Transactions on Embedded Computing Systems 21, 6, 1-24. DOI: 10.1145/3510831. Online publication date: 18-Oct-2022.
    • (2022) Accelerated Fire Detection and Localization at Edge. ACM Transactions on Embedded Computing Systems 21, 6, 1-27. DOI: 10.1145/3510027. Online publication date: 18-Oct-2022.
    • (2022) TensorRT-Based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards. ACM Transactions on Embedded Computing Systems 21, 5, 1-26. DOI: 10.1145/3508391. Online publication date: 8-Oct-2022.
    • (2022) TAB: Unified and Optimized Ternary, Binary, and Mixed-precision Neural Network Inference on the Edge. ACM Transactions on Embedded Computing Systems 21, 5, 1-26. DOI: 10.1145/3508390. Online publication date: 8-Oct-2022.
    • (2022) Cache Interference-aware Task Partitioning for Non-preemptive Real-time Multi-core Systems. ACM Transactions on Embedded Computing Systems 21, 3, 1-28. DOI: 10.1145/3487581. Online publication date: 28-May-2022.
    • (2022) Edge Intelligence: Concepts, Architectures, Applications, and Future Directions. ACM Transactions on Embedded Computing Systems 21, 5, 1-41. DOI: 10.1145/3486674. Online publication date: 8-Oct-2022.
