Accurate and efficient floating point summation
J Demmel, Y Hida - SIAM Journal on Scientific Computing, 2004 - SIAM
J Demmel, Y Hida
SIAM Journal on Scientific Computing, 2004•SIAMWe present and analyze several simple algorithms for accurately computing the sum of n
floating point numbers using a wider accumulator. Let f and F be the number of significant
bits in the summands and the accumulator, respectively. Then assuming gradual underflow,
no overflow, and round-to-nearest arithmetic, up to approximately 2 Ff numbers can be
added accurately by simply summing the terms in decreasing order of exponents, yielding a
sum correct to within about 1.5 units in the last place (ulps). We apply this result to the …
floating point numbers using a wider accumulator. Let f and F be the number of significant
bits in the summands and the accumulator, respectively. Then assuming gradual underflow,
no overflow, and round-to-nearest arithmetic, up to approximately 2 Ff numbers can be
added accurately by simply summing the terms in decreasing order of exponents, yielding a
sum correct to within about 1.5 units in the last place (ulps). We apply this result to the …
We present and analyze several simple algorithms for accurately computing the sum of n floating point numbers using a wider accumulator. Let f and F be the number of significant bits in the summands and the accumulator, respectively. Then assuming gradual underflow, no overflow, and round-to-nearest arithmetic, up to approximately 2F-f numbers can be added accurately by simply summing the terms in decreasing order of exponents, yielding a sum correct to within about 1.5 units in the last place (ulps). We apply this result to the floating point formats in the IEEE floating point standard. For example, a dot product of single precision vectors of length at most 33 computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps. If double-extended precision is used, the vector length can be as large as 65,537. We also investigate how the cost of sorting can be reduced or eliminated while retaining accuracy.
Society for Industrial and Applied Mathematics