
GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

Published: 01 September 1998

Abstract

    The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures, the development of optimal level 3 BLAS code is costly and time-consuming. However, it is possible to develop a portable and high-performance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS are structured to effectively reduce data traffic in a memory hierarchy. Second, the GEMM-based level 3 BLAS performance evaluation benchmark is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations.
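
    As a concrete illustration of the partitioning, consider the symmetric rank-k update SYRK. The sketch below is a minimal example in the spirit of the model implementations, not the authors' code; the routine name BSYRK and the explicit block-size argument NB are invented for the illustration. It computes the lower triangular part of C := alpha*A*A' + beta*C by calling DGEMM for every block below the diagonal and falling back on a small level 2 style loop only for the NB-by-NB diagonal blocks.

      SUBROUTINE BSYRK( N, K, ALPHA, A, LDA, BETA, C, LDC, NB )
*     Illustrative GEMM-based SYRK (lower triangular case):
*        C := ALPHA*A*A' + BETA*C,  A is N-by-K, C is N-by-N.
*     Hypothetical sketch; BSYRK and NB are not part of the BLAS
*     interface. All off-diagonal work is done by DGEMM.
      INTEGER            N, K, LDA, LDC, NB
      DOUBLE PRECISION   ALPHA, BETA
      DOUBLE PRECISION   A( LDA, * ), C( LDC, * )
      EXTERNAL           DGEMM
      INTEGER            I, II, J, JB, L
      DOUBLE PRECISION   TEMP
      DO 40 II = 1, N, NB
         JB = MIN( NB, N-II+1 )
*        Diagonal block: small level 2 style rank-K update.
         DO 30 J = II, II+JB-1
            DO 20 I = J, II+JB-1
               TEMP = 0.0D+0
               DO 10 L = 1, K
                  TEMP = TEMP + A( I, L )*A( J, L )
   10          CONTINUE
               C( I, J ) = BETA*C( I, J ) + ALPHA*TEMP
   20       CONTINUE
   30    CONTINUE
*        All rows below the diagonal block: one DGEMM call.
         IF( II+JB.LE.N )
     $      CALL DGEMM( 'N', 'T', N-II-JB+1, JB, K, ALPHA,
     $                  A( II+JB, 1 ), LDA, A( II, 1 ), LDA,
     $                  BETA, C( II+JB, II ), LDC )
   40 CONTINUE
      RETURN
      END

    With this partitioning, all but O(N*NB*K) of the roughly N*N*K flops run at DGEMM speed, which is the essence of the GEMM-based approach.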


    Reviews

    Timothy R. Hopkins

    The Basic Linear Algebra Subprograms (BLAS) consist of three libraries (known as Levels 1, 2, and 3) and form an integral part of much of the important numerical software developed over the last two decades. Efficient implementations of these libraries often lead in turn to large gains in the efficiency of higher-level routines, such as the LAPACK library of linear algebra software. Many vendors supply versions of the BLAS tuned to a particular platform, and these are often hand coded to extract the best performance from the target hardware. The hierarchical memory organization common to many current systems makes the development of efficient, platform-specific Level 3 BLAS (which perform matrix-matrix operations) especially challenging and expensive.

    The authors of this pair of papers show how it is possible to produce an efficient Level 3 BLAS library based on highly optimized versions of the single Level 3 routine that performs a general matrix-matrix multiply (GEMM) and a small number of simpler Level 1 and Level 2 routines. They also provide a model implementation of their GEMM-based routines, along with comprehensive benchmarking software that allows users to measure the quality of vendor implementations against an efficient, portable, Fortran 77 version of the library. The papers provide a detailed account of the strategies required to squeeze the last drop of efficiency from today's processors. Although targeted at the matrix-matrix multiply, the lessons and techniques employed will be valuable to anyone wishing to obtain similar performance from other higher-level numerical operations.
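
    A benchmark of this kind essentially times kernel calls and converts the elapsed time to a flop rate. The program below is a minimal sketch of that measurement, not the published benchmark (which supplies its own drivers, timers, and problem configurations): it times a single DGEMM call and reports Mflop/s, counting about 2*N**3 flops for an N-by-N multiply-and-add. CPU_TIME is a Fortran 90 intrinsic, used here for brevity.

      PROGRAM TGEMM
*     Hypothetical timing sketch: measure the flop rate of one
*     DGEMM call, C := C + A*B, with square N-by-N operands.
      INTEGER            N
      PARAMETER        ( N = 500 )
      DOUBLE PRECISION   A( N, N ), B( N, N ), C( N, N )
      DOUBLE PRECISION   T1, T2, MFLOPS
      INTEGER            I, J
      EXTERNAL           DGEMM
*     Fill the operands with arbitrary nonzero data.
      DO 20 J = 1, N
         DO 10 I = 1, N
            A( I, J ) = 1.0D+0/DBLE( I+J )
            B( I, J ) = DBLE( I-J )
            C( I, J ) = 0.0D+0
   10    CONTINUE
   20 CONTINUE
      CALL CPU_TIME( T1 )
      CALL DGEMM( 'N', 'N', N, N, N, 1.0D+0, A, N, B, N,
     $            1.0D+0, C, N )
      CALL CPU_TIME( T2 )
*     A general matrix multiply-and-add performs about 2*N**3
*     flops; larger N (or repeated calls) may be needed where the
*     timer resolution is coarse.
      MFLOPS = 2.0D+0*DBLE( N )**3/( ( T2-T1 )*1.0D+6 )
      WRITE( *, * ) 'DGEMM: ', MFLOPS, ' Mflop/s'
      END

    Run against a vendor DGEMM and against the reference Fortran 77 BLAS, the same measurement exposes the performance gap that the GEMM-based approach is designed to exploit.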

    Published In

    ACM Transactions on Mathematical Software  Volume 24, Issue 3
    Sept. 1998
    95 pages
    ISSN:0098-3500
    EISSN:1557-7295
    DOI:10.1145/292395

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 September 1998
    Published in TOMS Volume 24, Issue 3

    Author Tags

    1. GEMM-based level 3 BLAS
    2. blocked algorithms
    3. matrix-matrix kernels
    4. memory hierarchy
    5. parallelization
    6. vectorization

    Qualifiers

    • Article

Cited By

    • (2024) An Automated Approach for Improving the Inference Latency and Energy Efficiency of Pretrained CNNs by Removing Irrelevant Pixels with Focused Convolutions. Proceedings of the 29th Asia and South Pacific Design Automation Conference, 890-895. DOI: 10.1109/ASP-DAC58780.2024.10473884. Online publication date: 22-Jan-2024.
    • (2024) Automatic generation of ARM NEON micro-kernels for matrix multiplication. The Journal of Supercomputing 80, 10, 13873-13899. DOI: 10.1007/s11227-024-05955-8. Online publication date: 1-Jul-2024.
    • (2022) Hardware-friendly User-specific Machine Learning for Edge Devices. ACM Transactions on Embedded Computing Systems 21, 5, 1-29. DOI: 10.1145/3524125. Online publication date: 8-Oct-2022.
    • (2022) Contention Grading and Adaptive Model Selection for Machine Vision in Embedded Systems. ACM Transactions on Embedded Computing Systems 21, 5, 1-29. DOI: 10.1145/3520134. Online publication date: 8-Oct-2022.
    • (2022) DynO: Dynamic Onloading of Deep Neural Networks from Cloud to Device. ACM Transactions on Embedded Computing Systems 21, 6, 1-24. DOI: 10.1145/3510831. Online publication date: 18-Oct-2022.
    • (2022) Accelerated Fire Detection and Localization at Edge. ACM Transactions on Embedded Computing Systems 21, 6, 1-27. DOI: 10.1145/3510027. Online publication date: 18-Oct-2022.
    • (2022) TensorRT-Based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards. ACM Transactions on Embedded Computing Systems 21, 5, 1-26. DOI: 10.1145/3508391. Online publication date: 8-Oct-2022.
    • (2022) TAB: Unified and Optimized Ternary, Binary, and Mixed-precision Neural Network Inference on the Edge. ACM Transactions on Embedded Computing Systems 21, 5, 1-26. DOI: 10.1145/3508390. Online publication date: 8-Oct-2022.
    • (2022) Cache Interference-aware Task Partitioning for Non-preemptive Real-time Multi-core Systems. ACM Transactions on Embedded Computing Systems 21, 3, 1-28. DOI: 10.1145/3487581. Online publication date: 28-May-2022.
    • (2022) Edge Intelligence: Concepts, Architectures, Applications, and Future Directions. ACM Transactions on Embedded Computing Systems 21, 5, 1-41. DOI: 10.1145/3486674. Online publication date: 8-Oct-2022.
