
Algorithm 784: GEMM-based level 3 BLAS: portability and optimization issues

Published: 01 September 1998
  • Abstract

    This companion article discusses portability and optimization issues of the GEMM-based level 3 BLAS model implementations and the performance evaluation benchmark. All software comes in all four data types (single- and double-precision, real and complex) and is designed to be easy to implement and use on different platforms. Each of the GEMM-based routines has a few machine-dependent parameters that specify internal block sizes, cache characteristics, and branch points for alternative code sections. These parameters provide a means of adjusting the routines to the characteristics of a given memory hierarchy.
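As a rough illustration of what a machine-dependent block-size parameter does, the following Python sketch multiplies matrices tile by tile. The names `BLOCK` and `blocked_gemm` are hypothetical and are not taken from the article's software; the sketch only assumes that a block size controls the order in which memory is traversed, never the numerical result.

```python
# Hypothetical tuning parameter for this sketch: the edge length of the
# square tiles that are meant to stay resident in cache.
BLOCK = 2


def blocked_gemm(a, b, block=BLOCK):
    """Return a @ b for list-of-lists matrices, computed tile by tile."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for p0 in range(0, k, block):
            for j0 in range(0, n, block):
                # Multiply one tile of a by one tile of b and
                # accumulate into the matching tile of c.
                for i in range(i0, min(i0 + block, m)):
                    for p in range(p0, min(p0 + block, k)):
                        aip = a[i][p]
                        for j in range(j0, min(j0 + block, n)):
                            c[i][j] += aip * b[p][j]
    return c
```

Calling `blocked_gemm(a, b, block=1)` and `blocked_gemm(a, b, block=64)` gives the same product; only the traversal order, and therefore the cache behavior on a real machine, differs.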

    Supplementary Material

    GZ File (784.gz)
    Software for "GEMM-Based Level 3 BLAS: Portability and Optimization Issues"

    References

    [1]
    ANDERSON, E., BAI, Z., BISCHOF, C., DEMMEL, J., DONGARRA, J., DU CROZ, J., GREENBAUM, A., HAMMARLING, S., MCKENNEY, A., OSTROUCHOV, S., AND SORENSEN, D. 1992. LAPACK Users' Guide. SIAM Publications.
    [2]
    BAILEY, D. 1995. Unfavorable strides in cache memory systems. Sci. Program. 4, 53-58.
    [3]
    DONGARRA, J., DU CROZ, J., DUFF, I., AND HAMMARLING, S. 1990a. A set of level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Softw. 16, 1 (Mar.), 1-17.
    [4]
    DONGARRA, J., DU CROZ, J., DUFF, I., AND HAMMARLING, S. 1990b. Algorithm 679: A set of level 3 Basic Linear Algebra Subprograms: Model implementation and test programs. ACM Trans. Math. Softw. 16, 1 (Mar.), 18-28.
    [5]
    KÅGSTRÖM, B., LING, P., AND VAN LOAN, C. 1998. GEMM-based level 3 BLAS: High-performance model implementations and performance evaluation benchmark. ACM Trans. Math. Softw. This issue.

    Reviews

    Timothy R. Hopkins

    The basic linear algebra subroutines (BLAS) consist of three libraries (known as Levels 1, 2, and 3) and form an integral part of much of the important numerical software developed over the last two decades. Efficient implementations of these libraries often lead in turn to large gains in the efficiency of higher-level routines, such as the LAPACK library of linear algebra software. Many vendors supply versions of the BLAS tuned to a particular platform, and these are often hand-coded to extract the best performance from the target hardware. The hierarchical memory organization common to many current systems makes the development of efficient, platform-specific Level 3 BLAS (which perform matrix-matrix operations) especially challenging and expensive. The authors of this pair of papers show how it is possible to produce an efficient BLAS Level 3 library based on highly optimized versions of the single Level 3 routine that performs a general matrix-matrix multiply (GEMM) and a small number of simpler Level 1 and Level 2 routines. They also provide a model implementation of their GEMM-based routines along with comprehensive benchmarking software that allows users to measure the quality of vendor implementations against an efficient, portable, Fortran 77 version of the library. The papers provide a detailed account of the strategies required to squeeze the last drop of efficiency from today's processors. Although targeted at the matrix-matrix multiply, the lessons and techniques employed will be valuable to anyone wishing to obtain similar performance from other higher-level numerical operations.
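The idea of building a Level 3 operation mostly out of GEMM calls can be sketched in a few lines of Python. The example below is an assumption-laden illustration, not the authors' Fortran 77 code: it performs a symmetric rank-k update of the lower triangle, C := C + A·Aᵀ, sending the rectangular off-diagonal blocks to a GEMM stand-in and handling the small triangular diagonal blocks with plain dot products. The names `gemm_nt`, `syrk_lower`, and the `block` parameter are hypothetical.

```python
def gemm_nt(a_rows, b_rows, c, ci0, cj0):
    """GEMM stand-in: C[ci0.., cj0..] += a_rows @ transpose(b_rows).

    In a real library this would be the highly tuned kernel.
    """
    for i, ai in enumerate(a_rows):
        for j, bj in enumerate(b_rows):
            c[ci0 + i][cj0 + j] += sum(x * y for x, y in zip(ai, bj))


def syrk_lower(a, c, block=2):
    """Update the lower triangle of c in place: C := C + A @ A^T.

    `block` is a hypothetical tuning parameter for this sketch.
    """
    n = len(a)
    for i0 in range(0, n, block):
        i1 = min(i0 + block, n)
        # Everything left of the diagonal block is a full rectangle,
        # so a single general matrix-matrix multiply covers it.
        if i0 > 0:
            gemm_nt(a[i0:i1], a[0:i0], c, i0, 0)
        # Diagonal block: only its lower triangle is referenced, so it
        # is handled with simple dot products (Level 1 style work).
        for i in range(i0, i1):
            for j in range(i0, i + 1):
                c[i][j] += sum(x * y for x, y in zip(a[i], a[j]))
    return c
```

With a larger `block`, more of the diagonal work moves to the simple loops; with a smaller one, more of the arithmetic goes through the GEMM stand-in at the cost of more calls. The strictly upper triangle of `c` is never touched.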


    Published In

    ACM Transactions on Mathematical Software  Volume 24, Issue 3
    Sept. 1998
    95 pages
    ISSN:0098-3500
    EISSN:1557-7295
    DOI:10.1145/292395

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. GEMM-based level 3 BLAS
    2. blocked algorithms
    3. matrix-matrix kernels
    4. memory hierarchy
    5. parallelization
    6. vectorization

    Qualifiers

    • Article

    Cited By

    • (2021) NUMA-Aware DGEMM Based on 64-Bit ARMv8 Multicore Processors Architecture. Electronics 10:16 (1984). DOI: 10.3390/electronics10161984. Online publication date: 17-Aug-2021
    • (2018) Hybrid-Grained Dynamic Load Balanced GEMM on NUMA Architectures. Electronics 7:12 (359). DOI: 10.3390/electronics7120359. Online publication date: 27-Nov-2018
    • (2018) SCP. ACM Transactions on Architecture and Code Optimization 15:4 (1-21). DOI: 10.1145/3274654. Online publication date: 10-Oct-2018
    • (2016) User-Centric Scheduling and Governing on Mobile Devices with big.LITTLE Processors. ACM Transactions on Embedded Computing Systems 15:1 (1-22). DOI: 10.1145/2829946. Online publication date: 28-Jan-2016
    • (2016) A Cache-Based Flash Translation Layer for TLC-Based Multimedia Storage Devices. ACM Transactions on Embedded Computing Systems 15:1 (1-28). DOI: 10.1145/2820614. Online publication date: 13-Jan-2016
    • (2015) An efficient and tunable matrix-disguising method toward privacy-preserving computation. Security and Communication Networks 8:17 (3099-3110). DOI: 10.1002/sec.1235. Online publication date: 25-Nov-2015
    • (2013) The Challenge of Cross-language Interoperability. Queue 11:10 (20-28). DOI: 10.1145/2542661.2543971. Online publication date: 8-Oct-2013
    • (2011) Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems. ACM Transactions on Mathematical Software 38:1 (1-30). DOI: 10.1145/2049662.2049664. Online publication date: 7-Dec-2011
    • (2010) Parallel Solvers for Sylvester-Type Matrix Equations with Applications in Condition Estimation, Part I. ACM Transactions on Mathematical Software 37:3 (1-32). DOI: 10.1145/1824801.1824810. Online publication date: 1-Sep-2010
    • (2009) Adaptive Winograd's matrix multiplications. ACM Transactions on Mathematical Software 36:1 (1-23). DOI: 10.1145/1486525.1486528. Online publication date: 16-Mar-2009
