Design, implementation and testing of extended and mixed precision BLAS

Published: 01 June 2002

Abstract

This article describes the design rationale, a C implementation, and conformance testing of a subset of the new Standard for the BLAS (Basic Linear Algebra Subroutines): Extended and Mixed Precision BLAS. Permitting higher internal precision and mixed input/output types and precisions allows us to implement some algorithms that are simpler, more accurate, and sometimes faster than possible without these features. The new BLAS are challenging to implement and test because there are many more subroutines than in the existing Standard, and because we must be able to assess whether a higher precision is used for internal computations than is used for either input or output variables. We have therefore developed an automated process of generating and systematically testing these routines. Our methodology is applicable to languages besides C. In particular, our algorithms used in the testing code will be valuable to all other BLAS implementors. Our extra precision routines achieve excellent performance---close to half of the machine peak Megaflop rate even for the Level 2 BLAS, when the data access is stride one.
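The extra internal precision referred to in the abstract is typically carried in double-double arithmetic, i.e. an unevaluated sum of two doubles (see the Dekker, Møller, Knuth, Priest, and Bailey entries in the references). The C sketch below is purely illustrative and is not the XBLAS reference code: `two_sum` is the classical error-free transformation, and `dot_extended` shows the flavor of a dot product whose inputs and output are ordinary doubles while the running sum is accumulated with extra precision. Product rounding errors are ignored here to keep the sketch short.

```c
#include <stdio.h>

/* Error-free transformation (Knuth/Moller two-sum):
   s + err equals a + b exactly; s is the rounded sum, err the rounding error. */
static void two_sum(double a, double b, double *s, double *err) {
    *s = a + b;
    double bb = *s - a;
    *err = (a - (*s - bb)) + (b - bb);
}

/* Dot product whose running sum is held as an unevaluated pair (hi, lo),
   roughly doubling the internal precision, then rounded back to double
   on output. Illustrative only; not the Standard's interface. */
static double dot_extended(int n, const double x[], const double y[]) {
    double hi = 0.0, lo = 0.0;
    for (int i = 0; i < n; i++) {
        double s, e;
        two_sum(hi, x[i] * y[i], &s, &e);  /* exact sum of hi and the product */
        hi = s;
        lo += e;                           /* accumulate the rounding errors */
    }
    return hi + lo;
}

int main(void) {
    double x[3] = {1e16, 1.0, -1e16};
    double y[3] = {1.0, 1.0, 1.0};
    /* Plain double accumulation would lose the middle term and return 0;
       the compensated accumulation returns the exact answer, 1. */
    printf("%g\n", dot_extended(3, x, y));
    return 0;
}
```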

References

[1]
Basic Linear Algebra Subprograms (BLAS) Standard. 2000. http://www.netlib.org/utk/papers/blast-forum.html.
[2]
IEEE. 2001. 754 Revision. http://grouper.ieee.org/groups/754/.
[3]
Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. 1999. LAPACK Users' Guide, Release 3.0. SIAM, Philadelphia, Pa.
[4]
ANSI/IEEE Std. 754-1985. IEEE Standard for Binary Floating Point Arithmetic.
[5]
ANSI/IEEE Std. 854-1987. IEEE Standard for Radix Independent Floating Point Arithmetic.
[6]
ISO/IEC 9899:1999. Standard for the C programming language (C99). Jan 99 draft available at http://anubis.dkuug.dk/JTC1/SC22/WG14/www/docs/n869. Final version to be available from http://www.iso.ch.
[7]
Bailey, D. 2000. A Fortran-90 double-double precision library. http://www.nersc.gov/~dhbailey/mpdist/mpdist.html.
[8]
Bailey, D. H. 1995. A Fortran-90 based multiprecision system. ACM Trans. Math. Softw. 21, 4, 379--387.
[9]
Bilmes, J., Asanović, K., Demmel, J., Lam, D., and Chin, C. 1996. The PHiPAC WWW home page. http://www.icsi.berkeley.edu/~bilmes/phipac.
[10]
Blackford, L. S., Choi, J., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., and Whaley, R. C. 1997. ScaLAPACK Users' Guide. SIAM, Philadelphia, Pa.
[11]
Bleher, J., Roeder, A., and Rump, S. 1985. ACRITH: High-Accuracy Arithmetic---An advanced tool for numerical computation. In Proceedings of the 7th Symposium on Computer Arithmetic. IEEE Computer Society Press, Urbana, Ill.
[12]
Brent, R. 1978. A Fortran multiple precision arithmetic package. ACM Trans. Math. Softw. 4, 57--70.
[13]
Briggs, K. 1998. Doubledouble floating point arithmetic. http://www-epidem.plantsci.cam.ac.uk/~kbriggs/doubledouble.html.
[14]
Dekker, T. 1971. A floating-point technique for extending the available precision. Numer. Math. 18, 224--242.
[15]
Demmel, J. 1984. Underflow and the reliability of numerical software. SIAM J. Sci. Stat. Comput. 5, 4 (Dec.), 887--919.
[16]
Demmel, J. and Li, X. 1994. Faster numerical algorithms via exception handling. IEEE Trans. Comput. 43, 8, 983--992. LAPACK Working Note 59.
[17]
Demmel, J. W. 1997. Applied Numerical Linear Algebra. SIAM, Philadelphia, Pa.
[18]
Dhillon, I., Fann, G., and Parlett, B. 1997. Application of a new algorithm for the symmetric eigenproblem to computational quantum chemistry. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing. SIAM, Philadelphia, Pa.
[19]
Dhillon, I. S. 1997. A new O(n²) algorithm for the symmetric tridiagonal eigenvalue/eigenvector problem. Ph.D. dissertation. University of California, Berkeley, Berkeley, Calif.
[20]
Dongarra, J., Bunch, J., Moler, C., and Stewart, G. W. 1979. LINPACK User's Guide. SIAM, Philadelphia, Pa.
[21]
Dongarra, J., Du Croz, J., Duff, I. S., and Hammarling, S. 1990. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16, 1--17.
[22]
Dongarra, J., Du Croz, J., Hammarling, S., and Hanson, R. J. 1988. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Softw. 14, 1 (Mar.), 1--17.
[23]
Eisenstat, S. C., Elman, H. C., and Schultz, M. H. 1983. Variational iterative methods for nonsymmetric systems of linear equations. SIAM J. Numer. Anal. 20, 345--357.
[24]
Forsythe, G. E. and Moler, C. B. 1967. Computer Solution of Linear Algebraic Systems. Prentice-Hall, Englewood Cliffs, N.J.
[25]
Gupta, A., Joshi, M., Karypis, G., and Kumar, V. 1999. PSPASES: A scalable parallel direct solver for sparse symmetric positive definite systems. http://www.cs.umn.edu/mjoshi/pspases.
[26]
Henry, G. 1998. UNIX extended precision library for the Pentium. http://www.cs.utk.edu/~ghenry/distrib/archive.htm.
[27]
Higham, N. J. 1996. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, Pa.
[28]
Kahan, W. 1966 (revised June 1968). Accurate eigenvalues of a symmetric tridiagonal matrix. Computer Science Dept. Technical Report CS41, Stanford University, Stanford, Calif., July.
[29]
Kahan, W. 1998. Matlab's loss is nobody's gain. http://www.cs.berkeley.edu/~wkahan/MxMulEps.pdf.
[30]
Kahan, W. and Darcy, J. D. 1998. How Java's floating-point hurts everyone everywhere. http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf.
[31]
Kahan, W. and LeBlanc, E. 1985. Anomalies in the IBM ACRITH package. In Proceedings of the 7th Symposium on Computer Arithmetic. IEEE Computer Society Press, Urbana, Ill.
[32]
Knuth, D. 1969. The Art of Computer Programming, vol. 2. Addison-Wesley, Reading, Mass.
[33]
Kulisch, U. and Miranker, W., Eds. 1983. A New Approach to Scientific Computing. Academic Press, New York.
[34]
Lawson, C., Hanson, R., Kincaid, D., and Krogh, F. 1979. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Softw. 5, 308--323.
[35]
Li, X. S. and Demmel, J. W. 1998. Making sparse Gaussian elimination scalable by static pivoting. In Proceedings of SC98. Orlando, Fla.
[36]
Li, X. S. and Demmel, J. W. 2001. SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. Tech. rep., Lawrence Berkeley National Laboratory, December. In preparation.
[37]
Li, X. S., Demmel, J. W., Bailey, D. H., Henry, G., Hida, Y., Iskandar, J., Kahan, W., Kapur, A., Martin, M. C., Tung, T., and Yoo, D. J. 2000. Design, implementation and testing of extended and mixed precision BLAS. Tech. Rep. LBNL-45991, Lawrence Berkeley National Laboratory, June. http://www.nersc.gov/~xiaoye/XBLAS/.
[38]
M4 macro processor. http://www.cs.utah.edu/csinfo/texinfo/m4/m4.html.
[39]
Macsyma, Inc. 1996. Macsyma Mathematics and System Reference Manual, 16th ed, 589 pages.
[40]
Møller, O. 1965. Quasi double precision in floating-point arithmetic. BIT 5, 37--50.
[41]
Monagan, M., Geddes, K., Heal, K., Labahn, G., and Vorkoetter, S. 1997. Maple V Programming Guide for Release 5. Springer-Verlag, New York.
[42]
Oettli, W. and Prager, W. 1964. Compatibility of approximate solution of linear equations with given error bounds for coefficients and right-hand sides. Numer. Math. 6, 405--409.
[43]
Parlett, B. and Dhillon, I. 1997. Fernando's solution to Wilkinson's problem: An application of double factorization. Lin. Alg. Appl. 267, 247--279.
[44]
Parlett, B. and Marques, O. 2000. An implementation of the DQDS algorithm (Positive case). Lin. Alg. Appl. 309, 217--259.
[45]
Pichat, M. 1972. Correction d'une somme en arithmetique à virgule flottante. Numer. Math. 19, 400--406.
[46]
Priest, D. 1991. Algorithms for arbitrary precision floating point arithmetic. In Proceedings of the 10th Symposium on Computer Arithmetic, P. Kornerup and D. Matula, Eds. IEEE Computer Society Press, Grenoble, France, 132--145.
[47]
Saad, Y. and Schultz, M. H. 1986. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput. 7, 856--869.
[48]
Shewchuk, J. R. 1997. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Disc. Comput. Geom. 18, 305--363.
[49]
Smith, B. T., Boyle, J. M., Dongarra, J. J., Garbow, B. S., Ikebe, Y., Klema, V. C., and Moler, C. B. 1976. Matrix Eigensystem Routines---EISPACK Guide. In Lecture Notes in Computer Science, vol. 6. Springer-Verlag, Berlin, Germany.
[50]
Turner, K. and Walker, H. F. 1992. Efficient high accuracy solutions with GMRES(m). SIAM J. Sci. Stat. Comput. 13, 815--825.
[51]
Whaley, R. C. and Dongarra, J. 1998. The ATLAS home page. http://www.netlib.org/atlas/.
[52]
Wolfram, S. 1988. Mathematica: A System for Doing Mathematics by Computer. Addison-Wesley, Reading, Mass.



    Reviews

    Timothy R. Hopkins

    This paper describes a proposed extension to the current basic linear algebra subprograms (BLAS) that will provide efficiency gains from using mixed precision arguments (for example, multiplication of a REAL matrix and a COMPLEX matrix), and allow the optional use of extended precision arithmetic within BLAS code.

    In section 2, the reader is presented with a well-chosen set of numerical methods that could potentially benefit from the use of such an extended BLAS, which could provide improvements in both accuracy and reliability. The examples discussed cover a wide range of linear algebra calculations, from the use of iterative refinement in the direct solution of linear systems, through accelerating iterative linear equation solvers and the solution of least squares problems, to eigensystem solvers. Compelling evidence is given for the usefulness of this extended set of routines.

    A detailed account follows of the history of the availability of extended precision arithmetic over the past four decades, and of how, essentially, the provision of good floating-point hardware has been sacrificed to commercial considerations. A number of software implementations of double-double arithmetic are also described. Double-double arithmetic is shown to be an attractive option for performing computations within BLAS routines, but far less useful for more general calculations. The performance figures given show that extended precision can be provided in a cost-effective way.

    The five design goals presented in section 4 ensure that the number of possible variants of each of the 29 original BLAS chosen for this extended set is kept at a sensible level. Even so, there is an obvious need to generate these routines automatically from a master template; how this was accomplished, using the m4 macro processor, is described.

    The final section presents an in-depth description of the testing of a reference implementation, written in C. There is a vast amount of code; for example, the variations of the dot product routine, DOT, generate eleven thousand lines. Once again, there is a need to automate this process as far as possible, while ensuring that the software under test is exhaustively exercised. A wealth of detail is presented on the methods adopted and, especially, on the generation of test data.

    The only downsides to the paper are, first, that the Web link to the BLAS Standard quoted in the paper seems to have changed already; it is now (October 2002) apparently at http://www.netlib.org/blas/blast-forum/. In addition, the link to the work of Briggs is no longer operational. Second, it is very difficult to differentiate between (or indeed find) some of the line graphs presented in several of the figures.

    Apart from these very minor problems, this is a superb paper for anyone who is interested in the finer points of computer arithmetic and their importance to the performance of numerical computation. The meticulous attention to detail makes it an extremely worthwhile study. The amount of effort put into the testing phase should be an example to everyone producing numerical software.

    Online Computing Reviews Service
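As a concrete illustration of the mixed precision combinations the review mentions (inputs, internal accumulation, and output in different precisions), here is a minimal C sketch. The routine name and interface are simplified and hypothetical, not the actual routine signatures defined by the Standard; it only shows a single precision matrix combined with a double precision vector and result, with the accumulation carried in double.

```c
#include <stddef.h>

/* Hypothetical, simplified mixed precision matrix-vector product:
   y = A * x with a single precision (float) matrix A, a double precision
   vector x, a double precision result y, and accumulation in double.
   The extended and mixed precision BLAS define many such type
   combinations; this sketch illustrates the idea, not their interface. */
static void gemv_mixed(size_t m, size_t n,
                       const float *A, size_t lda,   /* row-major, leading dimension lda */
                       const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double acc = 0.0;                 /* internal accumulation in double */
        for (size_t j = 0; j < n; j++)
            acc += (double)A[i * lda + j] * x[j];
        y[i] = acc;                       /* result delivered in double */
    }
}
```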


    Published In

    ACM Transactions on Mathematical Software, Volume 28, Issue 2
    June 2002
    151 pages
    ISSN:0098-3500
    EISSN:1557-7295
    DOI:10.1145/567806

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 June 2002
    Published in TOMS Volume 28, Issue 2


    Author Tags

    1. BLAS
    2. double-double arithmetic
    3. extended and mixed precision

    Qualifiers

    • Article


    Cited By

    • (2024) Useful applications of correctly-rounded operators of the form ab + cd + e. 2024 IEEE 31st Symposium on Computer Arithmetic (ARITH), 32-39. DOI: 10.1109/ARITH61463.2024.00015. Online publication date: 10-Jun-2024.
    • (2024) Acceleration of iterative refinement for singular value decomposition. Numerical Algorithms 95:2, 979-1009. DOI: 10.1007/s11075-023-01596-9. Online publication date: 1-Feb-2024.
    • (2023) Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor. 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 608-615. DOI: 10.1109/MCSoC60832.2023.00094. Online publication date: 18-Dec-2023.
    • (2023) Leveraging Mixed Precision in Exponential Time Integration Methods. 2023 IEEE High Performance Extreme Computing Conference (HPEC), 1-8. DOI: 10.1109/HPEC58863.2023.10363489. Online publication date: 25-Sep-2023.
    • (2023) LAProof: A Library of Formal Proofs of Accuracy and Correctness for Linear Algebra Programs. 2023 IEEE 30th Symposium on Computer Arithmetic (ARITH), 36-43. DOI: 10.1109/ARITH58626.2023.00021. Online publication date: 4-Sep-2023.
    • (2023) PACF. Applied Mathematics and Computation 440:C. DOI: 10.1016/j.amc.2022.127611. Online publication date: 1-Mar-2023.
    • (2023) ddRingAllreduce: a high-precision RingAllreduce algorithm. CCF Transactions on High Performance Computing 5:3, 245-257. DOI: 10.1007/s42514-023-00150-2. Online publication date: 5-Jul-2023.
    • (2022) Formalization of Double-Word Arithmetic, and Comments on "Tight and Rigorous Error Bounds for Basic Building Blocks of Double-Word Arithmetic". ACM Transactions on Mathematical Software 48:1, 1-24. DOI: 10.1145/3484514. Online publication date: 16-Feb-2022.
    • (2022) Accelerating Variants of the Conjugate Gradient with the Variable Precision Processor. 2022 IEEE 29th Symposium on Computer Arithmetic (ARITH), 51-57. DOI: 10.1109/ARITH54963.2022.00017. Online publication date: Sep-2022.
    • (2022) Mixed precision algorithms in numerical linear algebra. Acta Numerica 31, 347-414. DOI: 10.1017/S0962492922000022. Online publication date: 9-Jun-2022.
