Comparison between binary and decimal floating-point numbers
Nicolas Brisebarre, Christoph Lauter, Marc Mezzarobba, Jean-Michel Muller

To cite this version:
Nicolas Brisebarre, Christoph Lauter, Marc Mezzarobba, Jean-Michel Muller. Comparison between binary and decimal floating-point numbers. IEEE Transactions on Computers, Institute of Electrical and Electronics Engineers, 2016, 65 (7), pp. 2032-2044. doi:10.1109/TC.2015.2479602. hal-01021928v2

HAL Id: hal-01021928
https://hal.archives-ouvertes.fr/hal-01021928v2
Submitted on 25 Jun 2015
Comparison between binary and decimal floating-point numbers
Nicolas Brisebarre, Christoph Lauter, Marc Mezzarobba, and Jean-Michel Muller
Abstract—We introduce an algorithm to compare a binary floating-point (FP) number and a decimal FP number, assuming the “binary
encoding” of the decimal formats is used, and with a special emphasis on the basic interchange formats specified by the IEEE
754-2008 standard for FP arithmetic. It is a two-step algorithm: a first pass, based on the exponents only, quickly eliminates most
cases; then, when the first pass does not suffice, a more accurate second pass is performed. We provide an implementation of several
variants of our algorithm, and compare them.
1  INTRODUCTION
The IEEE 754-2008 Standard for Floating-Point Arithmetic [5] specifies binary (radix-2) and decimal (radix-10)
floating-point number formats for a variety of precisions.
The so-called “basic interchange formats” are presented in
Table 1.
The Standard neither requires nor forbids comparisons
between floating-point numbers of different radices (it states
that floating-point data represented in different formats shall
be comparable as long as the operands’ formats have the same
radix). However, such “mixed-radix” comparisons may offer
several advantages. It is not infrequent to read decimal data
from a database and to have to compare it to some binary
floating-point number. The comparison may be inaccurate
if the decimal number is preliminarily converted to binary,
or, respectively, if the binary number is first converted to
decimal.
Consider for instance the following C code:
double x = ...;
_Decimal64 y = ...;
if (x <= y)
...
The standardization of decimal floating-point arithmetic
in C [6] is still at draft stage, and compilers supporting
decimal floating-point arithmetic handle code sequences
such as the previous one at their discretion and often in an
unsatisfactory way. As it happens, Intel’s icc 12.1.3 translates
this sequence into a conversion from binary to decimal
followed by a decimal comparison. The compiler emits no
• N. Brisebarre and J.-M. Muller are with CNRS, Laboratoire LIP, ENS Lyon, INRIA, Université Claude Bernard Lyon 1, Lyon, France. E-mail: nicolas.brisebarre@ens-lyon.fr, jean-michel.muller@ens-lyon.fr
• C. Lauter is with Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6, F-75005, Paris, France. E-mail: christoph.lauter@lip6.fr
• This work was initiated while M. Mezzarobba was with the Research Institute for Symbolic Computation, Johannes Kepler University, Linz, Austria. M. Mezzarobba is now with CNRS, UMR 7606, LIP6, F-75005, Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6, F-75005, Paris, France. E-mail: marc@mezzarobba.net

This work was partly supported by the TaMaDi project of the French Agence Nationale de la Recherche, and by the Austrian Science Fund (FWF) grants P22748-N18 and Y464-N18.
                     binary32   binary64   binary128
precision (bits)     24         53         113
emin                 −126       −1022      −16382
emax                 +127       +1023      +16383

                     decimal64  decimal128
precision (digits)   16         34
emin                 −383       −6143
emax                 +384       +6144

TABLE 1
The basic binary and decimal interchange formats specified by the IEEE 754-2008 Standard.
warning that the boolean result might not be the expected
one because of rounding.
This kind of strategy may lead to inconsistencies. Consider such a “naïve” approach built as follows: when comparing a binary floating-point number x2 of format F2 and a decimal floating-point number x10 of format F10, we first convert x10 to the binary format F2 (that is, we replace it by the F2 number nearest x10), and then we perform the comparison in binary. Denote the comparison operators so defined as <, ≤, >, and ≥. Consider the following variables (all exactly represented in their respective formats):
• x = 3602879701896397/2^55, declared as a binary64 number;
• y = 13421773/2^27, declared as a binary32 number;
• z = 1/10, declared as a decimal64 number.
Then it holds that x < y, but also y ≤ z and z ≤ x. Such
an inconsistent result might for instance suffice to prevent a
sorting program from terminating.
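Absent compiler support for decimal types, the cycle can be reproduced using only the correctly rounded binary images of 1/10; the following is a minimal sketch (variable and function names are ours):

```c
#include <assert.h>

/* x = 3602879701896397/2^55 (binary64) and y = 13421773/2^27 (binary32)
   are exactly representable; they are precisely the binary64 and
   binary32 roundings of the decimal number z = 1/10. */
static const double x = 3602879701896397.0 / 36028797018963968.0;
static const float  y = 13421773.0f / 134217728.0f;

/* Naive mixed-format comparisons: z is replaced by its rounding
   to the other operand's format before comparing. */
static int inconsistent_cycle(void) {
    int x_lt_y = x < (double)y;   /* exact comparison: x < y        */
    int y_le_z = y <= 0.1f;       /* z rounded to binary32 equals y */
    int z_le_x = 0.1 <= x;        /* z rounded to binary64 equals x */
    return x_lt_y && y_le_z && z_le_x;
}
```

All three comparisons hold simultaneously, which is exactly the non-transitive cycle described above.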
Remark that in principle, x2 and x10 could both be converted to some longer format in such a way that the naïve
method yields a correct answer. However, that method
requires a correctly rounded radix conversion, which is an
intrinsically more expensive operation than the comparison
itself.
An experienced programmer is likely to explicitly convert
all variables to a larger format. However we believe that
ultimately the task of making sure that the comparisons
are consistent and correctly performed should be left to
the compiler, and that the only fully consistent way of
comparing two numeric variables x and y of a different type
is to effectively compare x and y , and not to compare an
approximation to x to an approximation to y .
In the following, we describe algorithms to perform such
“exact” comparisons between binary numbers in any of the
basic binary interchange formats and decimal numbers in
any of the basic decimal interchange formats. Our algorithms were developed with a software implementation in
mind. They end up being pretty generic, but we make no
claim as to their applicability to hardware implementations.
It is natural to require that mixed-radix comparisons not
signal any floating-point exception, except in situations that
parallel exceptional conditions specified in the Standard [5,
Section 5.11] for regular comparisons. For this reason, we
avoid floating-point operations that might signal exceptions,
and mostly use integer arithmetic to implement our comparisons.
Our method is based on two steps:
• Algorithm 1 tries to compare numbers just by examining their floating-point exponents;
• when Algorithm 1 is inconclusive, Algorithm 2 provides the answer.
Algorithm 2 admits many variants and depends on a number of parameters. Tables 7 and 8 suggest suitable parameters for most typical use cases.
Implementing the comparison for a given pair of formats
just requires the implementation of Algorithms 1 and 2. The
analysis of the algorithms as a function of all parameters, as
presented in Sections 4 to 6, is somewhat technical. These
details can be skipped at first reading.
This paper is an extended version of the article [2].
2  SETTING AND OUTLINE

We consider a binary format of precision p2, minimum exponent e2^min and maximum exponent e2^max, and a decimal format of precision p10, minimum exponent e10^min and maximum exponent e10^max. We want to compare a binary floating-point number x2 and a decimal floating-point number x10.

Without loss of generality we assume x2 > 0 and x10 > 0 (when x2 and x10 are negative, the problem reduces to comparing −x2 and −x10, and when they have different signs the comparison is straightforward). The floating-point representations of x2 and x10 are

    x2 = M2 · 2^(e2−p2+1),    x10 = M10 · 10^(e10−p10+1),

where M2, M10, e2 and e10 are integers that satisfy

    e2^min − p2 + 1 ≤ e2 ≤ e2^max,    e10^min ≤ e10 ≤ e10^max,
    2^(p2−1) ≤ M2 ≤ 2^p2 − 1,         1 ≤ M10 ≤ 10^p10 − 1,    (1)

with p2, p10 ≥ 1. (The choice of lower bound for e2 makes the condition 2^(p2−1) ≤ M2 hold even for subnormal binary numbers.)

We assume that the so-called binary encoding (BID) [5], [8] of IEEE 754-2008 is used for the decimal format¹, so that the integer M10 is easily accessible in binary. Denote by

    p′10 = ⌈p10 · log2 10⌉

the number of bits that are necessary for representing the decimal significands in binary.

When x2 and x10 have significantly different orders of magnitude, examining their exponents will suffice to compare them. Hence, we first perform a simple exponent-based test (Section 3). When this does not suffice, a second step compares the significands multiplied by suitable powers of 2 and 5. We compute “worst cases” that determine how accurate the second step needs to be (Section 4), and show how to implement it efficiently (Section 5). Section 6 is an aside describing a simpler algorithm that can be used if we only want to decide whether x2 and x10 are equal. Finally, Section 7 discusses our implementation of the comparison algorithms and presents experimental results.

3  FIRST STEP: ELIMINATING THE “SIMPLE CASES” BY EXAMINING THE EXPONENTS

3.1 Normalization
As we have 1 ≤ M10 ≤ 10^p10 − 1, there exists a unique ν ∈ {0, 1, 2, . . . , p′10 − 1} such that

    2^(p′10−1) ≤ 2^ν · M10 ≤ 2^(p′10) − 1.

Our initial problem of comparing x2 and x10 reduces to comparing M2 · 2^(e2−p2+1+ν) and (2^ν · M10) · 10^(e10−p10+1).

The fact that we “normalize” the decimal significand M10 by a binary shift between two consecutive powers of two is of course questionable; M10 could also be normalized into the range 10^(p10−1) ≤ 10^t · M10 ≤ 10^p10 − 1. However, hardware support for this operation [11] is not widespread. A decimal normalization would thus require a loop, provoking pipeline stalls, whereas the proposed binary normalization can exploit an existing hardware leading-zero counter with straight-line code.

3.2 Comparing the Exponents
Define

    m = M2,                               n = M10 · 2^ν,
    h = e2 − e10 + ν + p10 − p′10 + 1,    g = e10 − p10 + 1,    w = p′10 − p2 − 1,    (2)

so that

    2^ν · x2 = m · 2^(h+g+w),    2^ν · x10 = n · 10^g.

Our comparison problem becomes:

    Compare m · 2^(h+w) with n · 5^g.
1. This encoding is typically used in software implementations of
decimal floating-point arithmetic, even if BID-based hardware designs
have been proposed [11].
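On a 64-bit target, the binary normalization of Section 3.1 is a leading-zero count followed by a shift; the following sketch, for decimal64 significands (p′10 = 54), assumes the GCC/Clang builtin __builtin_clzll:

```c
#include <assert.h>
#include <stdint.h>

/* Shift a decimal64 significand M10 (1 <= M10 <= 10^16 - 1) into
   the binary-normalized range [2^53, 2^54 - 1]. Returns the unique
   shift nu such that n = M10 << nu is normalized. */
static int normalize_d64(uint64_t M10, uint64_t *n) {
    int msb = 63 - __builtin_clzll(M10);  /* index of the leading 1 bit */
    int nu  = 53 - msb;                   /* move the leading bit to index 53 */
    *n = M10 << nu;
    return nu;
}
```

This is the straight-line-code alternative to a decimal normalization loop mentioned above.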
                  b32/d64     b32/d128     b64/d64      b64/d128     b128/d64       b128/d128
p2                24          24           53           53           113            113
p10               16          34           16           34           16             34
p′10              54          113          54           113          54             113
w                 29          88           0            59           −60            −1
h(1)min, h(1)max  −570, 526   −6371, 6304  −1495, 1422  −7296, 7200  −16915, 16782  −22716, 22560
smin              18          23           19           23           27             27
datatype          int32       int64        int32        int64        int64          int64

TABLE 2
The various parameters involved in Step 1 of the comparison algorithm.
We have

    m_min = 2^(p2−1) ≤ m ≤ m_max = 2^p2 − 1,
    n_min = 2^(p′10−1) ≤ n ≤ n_max = 2^(p′10) − 1.    (3)

It is clear that m_min · 2^(h+w) > n_max · 5^g implies x2 > x10, while m_max · 2^(h+w) < n_min · 5^g implies x2 < x10. This gives

    (1 − 2^(−p′10)) · 5^g < 2^(h−2)  ⇒  x10 < x2,
    (1 − 2^(−p2)) · 2^h < 5^g        ⇒  x2 < x10.    (4)

In order to compare x2 and x10 based on these implications, define

    ϕ(h) = ⌊h · log5 2⌋.

Proposition 1. We have

    g < ϕ(h) ⇒ x2 > x10,
    g > ϕ(h) ⇒ x2 < x10.

Proof: If g < ϕ(h) then g ≤ ϕ(h) − 1, hence g ≤ h·log5 2 − 1. This implies that 5^g ≤ (1/5) · 2^h < 2^h / (4 · (1 − 2^(−p′10))), therefore, from (4), x10 < x2. If g > ϕ(h) then g ≥ ϕ(h) + 1, hence g > h·log5 2, so that 5^g > 2^h > (1 − 2^(−p2)) · 2^h. This implies, from (4), x2 < x10.

Now consider the range of h: by (2), h lies between

    h(1)min = (e2^min − p2 + 1) − e10^max + p10 − p′10 + 1

and

    h(1)max = e2^max − e10^min + p10.

That range is given in Table 2 for the basic IEEE formats. Knowing that range, it is easy to implement ϕ as follows.

Proposition 2. Denote by ⌊·⌉ the nearest integer function. For large enough s ∈ N, the function defined by

    ϕ̂(h) = ⌊L · h · 2^(−s)⌋,    with L = ⌊2^s · log5 2⌉,

satisfies ϕ(h) = ϕ̂(h) for all h in the range [h(1)min, h(1)max].

Proposition 2 is an immediate consequence of the irrationality of log5 2. For known, moderate values of h(1)min and h(1)max, the optimal choice smin of s is easy to find and small. For instance, if the binary format is binary64 and the decimal format is decimal64, then smin = 19.

Table 2 gives the value of smin for the basic IEEE formats. The product L · h, for h in the indicated range and s = smin, can be computed exactly in (signed or unsigned) integer arithmetic, with the indicated data type. Computing ⌊ξ · 2^(−β)⌋ of course reduces to a right-shift by β bits.

Propositions 1 and 2 yield the following algorithm.

Algorithm 1. First, exponent-based step
1  compute h = e2 − e10 + ν + p10 − p′10 + 1 and g = e10 − p10 + 1;
2  with the appropriate value of s, compute ϕ(h) = ⌊L · h · 2^(−s)⌋ using integer arithmetic;
3  if g < ϕ(h) then
4    return “x2 > x10”
5  else if g > ϕ(h) then
6    return “x2 < x10”
7  else (g = ϕ(h))
8    first step is inconclusive (perform the second step).

Note that, when x10 admits multiple distinct representations in the precision-p10 decimal format (i.e., when its cohort is non-trivial [5]), the success of the first step may depend on the specific representation passed as input. For instance, assume that the binary and decimal formats are binary64 and decimal64, respectively. Both A = {M10 = 10^15, e10 = 0} and B = {M10 = 1, e10 = 15} are valid representations of the integer 1. Assume we are trying to compare x10 = 1 to x2 = 2. Using representation A, we have ν = 4, h = −32, and ϕ(h) = −14 > g = −15, hence the test from Algorithm 1 shows that x10 < x2. In contrast, if x10 is given in the form B, we get ν = 53, ϕ(h) = ϕ(2) = 0 = g, and the test is inconclusive.

3.3 How Often is the First Step Enough?
We may quantify the quality of this first filter as follows.
We say that Algorithm 1 succeeds if it answers “x2 > x10 ”
or “x2 < x10 ” without proceeding to the second step,
and fails otherwise. Let X2 and X10 denote the sets of
representations of positive, finite numbers in the binary, resp.
decimal formats of interest. (In the case of X2 , each number
has a single representation.) Assuming zeros, infinities, and
NaNs have been handled before, the input of Algorithm 1
may be thought of as a pair (ξ2 , ξ10 ) ∈ X2 × X10 .
Proposition 3. The proportion of input pairs (ξ2, ξ10) ∈ X2 × X10 for which Algorithm 1 fails is bounded by

    min { 1 / (e10^max − e10^min + 1),  4 / (e2^max − e2^min + 1) },

assuming p2 ≥ 3.

Proof. The first step fails if and only if ϕ(h) = g. Write h = e2 + t, that is, t = p10 − p′10 + 1 − e10 + ν. Then ϕ(h) = g rewrites as

    ⌊(e2 + t) · log5 2⌋ = g,

which implies

    −t + g·log2 5 ≤ e2 < −t + (g + 1)·log2 5 < −t + g·log2 5 + 2.4.

The value of ν is determined by M10, so that t and g = e10 − p10 + 1 depend on ξ10 only. Thus, for any given ξ10 ∈ X10, there can be at most 3 values of e2 for which ϕ(h) = g. A similar argument shows that for given e2 and M10, there exists at most one value of e10 such that ϕ(h) = g.

Let X2^norm and X2^sub be the subsets of X2 consisting of normal and subnormal numbers respectively. Let ri = ei^max − ei^min + 1. The number Nnorm of pairs (ξ2, ξ10) ∈ X2^norm × X10 such that ϕ(h) = g satisfies

    Nnorm ≤ #{M10 : ξ10 ∈ X10} · #{M2 : ξ2 ∈ X2^norm} · #{(e2, e10) : ξ2 ∈ X2^norm ∧ ϕ(h) = g}
          ≤ (10^p10 − 1) · 2^(p2−1) · min{r2, 3r10}.

For subnormal x2, we have the bound

    Nsub := #{(ξ2, ξ10) ∈ X2^sub × X10 : ϕ(h) = g} ≤ #{M10} · #X2^sub = (10^p10 − 1) · (2^(p2−1) − 1).

The total number of elements of X2 × X10 for which the first step fails is bounded by Nnorm + Nsub. This is to be compared with

    #X2 = r2 · 2^(p2−1) + (p2 − 1) · (2^(p2−1) − 1) ≥ (r2 + 1) · 2^(p2−1),
    #X10 = r10 · (10^p10 − 1).

We obtain

    (Nnorm + Nsub) / #(X2 × X10) ≤ min { 1/r10, 4/r2 }.

These are rough estimates. One way to get tighter bounds for specific formats is simply to count, for each value of ν, the pairs (e2, e10) such that ϕ(h) = g. For instance, in the case of comparison between binary64 and decimal64 floating-point numbers, one can check that the failure rate in the sense of Proposition 3 is less than 0.1%. As a matter of course, pairs (ξ2, ξ10) will almost never be equidistributed in practice. Hence the previous estimate should not be interpreted as a probability of success of Step 1. It seems more realistic to assume that a well-written numerical algorithm will mostly perform comparisons between numbers which are suspected to be close to each other. For instance, in an iterative algorithm where comparisons are used as part of a convergence test, it is to be expected that most comparisons need to proceed to the second step. Conversely, there are scenarios, e.g., checking for out-of-range data, where the first step should be enough.
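For the binary64/decimal64 pair, Algorithm 1 amounts to a handful of integer operations. The following is a minimal sketch with s = 19 (per Table 2) and L = ⌊2^19 · log5 2⌉ = 225799 (the numeric constant is ours, derived from that choice of s); it assumes the usual arithmetic behavior of >> on negative signed operands:

```c
#include <assert.h>
#include <stdint.h>

/* phi(h) = floor(h * log5(2)), computed as floor(L * h / 2^s)
   with s = 19 and L = 225799, valid over the b64/d64 range of h.
   Relies on arithmetic right shift (= floor) for negative values. */
static int32_t phi(int32_t h) {
    return (int32_t)(((int64_t)225799 * h) >> 19);
}

/* First, exponent-based step for binary64 vs decimal64
   (p10 = 16, p'10 = 54). Inputs are the exponents e2, e10 and
   the normalization shift nu of Section 3.1. Returns +1 (x2 > x10),
   -1 (x2 < x10), or 0 (inconclusive: run the second step). */
static int first_step(int32_t e2, int32_t e10, int32_t nu) {
    int32_t h = e2 - e10 + nu + 16 - 54 + 1;
    int32_t g = e10 - 16 + 1;
    if (g < phi(h)) return +1;
    if (g > phi(h)) return -1;
    return 0;
}
```

On the cohort example of Section 3.2, comparing x10 = 1 in the form M10 = 10^15, e10 = 0 (hence ν = 4) with x2 = 2 (e2 = 1) gives h = −32 and decides x2 > x10, while the form M10 = 1, e10 = 15 (ν = 53) is inconclusive.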
4  SECOND STEP: A CLOSER LOOK AT THE SIGNIFICANDS

4.1 Problem Statement
In the following we assume that g = ϕ(h), i.e.,

    e10 − p10 + 1 = ⌊(e2 − e10 + ν + p10 − p′10 + 1) · log5 2⌋.

(Otherwise, the first step already allowed us to compare x2 and x10.)

Define a function

    f(h) = 5^ϕ(h) / 2^(h+w).

We have

    f(h) · n > m ⇒ x10 > x2,
    f(h) · n < m ⇒ x10 < x2,    (5)
    f(h) · n = m ⇒ x10 = x2.

The second test consists in performing this comparison, with f(h) · n replaced by an accurate enough approximation.

In order to ensure that an approximate test is indeed equivalent to (5), we need a lower bound η on the minimum nonzero value of

    dh(m, n) = | 5^ϕ(h)/2^(h+w) − m/n |    (6)

that may appear at this stage. We want η to be as tight as possible in order to avoid unduly costly computations when approximating f(h) · n. The search space is constrained by the following observations.

Proposition 4. Let g and h be defined by Equation (2). The equality g = ϕ(h) implies

    (e2^min − p2 − p′10 + 3) / (1 + log5 2)  ≤  h  <  (e2^max + 2) / (1 + log5 2).    (7)

Additionally, n satisfies the following properties:
1) if n > 10^p10, then n is even;
2) if ν′ = h + ϕ(h) − e2^max + p′10 − 2 is nonnegative (which holds for large enough h), then 2^(ν′) divides n.

Proof: From (2), we get

    e2 = h + g + p′10 − ν − 2.

Since e2^min − p2 + 1 ≤ e2 ≤ e2^max and as we assumed g = ϕ(h), it follows that

    e2^min − p2 + 1 ≤ h + ϕ(h) − ν + p′10 − 2 ≤ e2^max.

As h·log5 2 − 1 ≤ ϕ(h) < h·log5 2 by definition of ϕ, therefore we have

    e2^min − p2 + ν − p′10 + 3 ≤ (1 + log5 2) · h < e2^max + ν − p′10 + 3,

and the bounds (7) follow because 0 ≤ ν ≤ p′10 − 1.

The binary normalization of M10, yielding n, implies that 2^ν | n. If n > 10^p10, then, by (1) and (2), it follows that ν ≥ 1 and n is even. As h + ϕ(h) − ν ≤ e2^max − p′10 + 2, we also have ν ≥ ν′. Since ϕ is increasing, there exists h0 such that ν′ ≥ 0 for h ≥ h0.

Table 3 gives the range (7) for h with respect to the comparisons between basic IEEE formats.

4.2 Required Worst-Case Accuracy
Let us now deal with the problem of computing η, considering dh(m, n) under the constraints given by Equation (3) and Proposition 4. A similar problem was considered by Cornea et al. [3] in the context of correctly rounded binary to decimal conversions. Their techniques yield worst and bad cases for the approximation problem we consider. We will take advantage of them in Section 7 to test our algorithms on
                  b32/d64   b32/d128  b64/d64    b64/d128   b128/d64       b128/d128
h(2)min, h(2)max  −140, 90  −181, 90  −787, 716  −828, 716  −11565, 11452  −11606, 11452
g(2)min, g(2)max  −61, 38   −78, 38   −339, 308  −357, 308  −4981, 4932    −4999, 4932
#g                100       117       648        666        9914           9932
h0                54        12        680        639        11416          11375

TABLE 3
Range of h for which we may have g = ϕ(h), and size of the table used in the “direct method” of Section 5.1.
many cases when x2 and x10 are very close. Here, we favor a
different approach that is less computationally intensive and
mathematically simpler, making the results easier to check
either manually or with a proof-assistant.
Problem 1. Find the smallest nonzero value of

    dh(m, n) = | 5^ϕ(h)/2^(h+w) − m/n |

subject to the constraints

    2^(p2−1) ≤ m ≤ 2^p2 − 1,
    2^(p′10−1) ≤ n ≤ 2^(p′10) − 1,
    h(2)min ≤ h ≤ h(2)max,
    n is even if n > 10^p10,
    if h ≥ h0, then 2^(ν′) | n,

where

    h(2)min = ⌈(e2^min − p2 − p′10 + 3) / (1 + log5 2)⌉,
    h(2)max = ⌊(e2^max + 2) / (1 + log5 2)⌋,
    h0 = ⌈(e2^max − p′10 + 3) / (1 + log5 2)⌉,
    ν′ = h + ϕ(h) − e2^max + p′10 − 2.
We recall how such a problem can be solved using the
classical theory of continued fractions [4], [7], [9].
Given α ∈ Q, build two finite sequences (ai)_{0≤i≤n} and (ri)_{0≤i≤n} by setting r0 = α and

    ai = ⌊ri⌋,    ri+1 = 1/(ri − ai)  if ai ≠ ri.

For all 0 ≤ i ≤ n, the rational number

    pi/qi = a0 + 1/(a1 + 1/(a2 + ⋯ + 1/ai))

is called the ith convergent of the continued fraction expansion of α. Observe that the process defining the ai essentially is the Euclidean algorithm. If we assume p−1 = 1, q−1 = 0, p0 = a0, q0 = 1, we have pi+1 = ai+1 · pi + pi−1, qi+1 = ai+1 · qi + qi−1, and α = pn/qn.
It is a classical result [7, Theorem 19] that any rational number m/n such that

    | α − m/n | < 1/(2n²)

is a convergent of the continued fraction expansion of α.
For basic IEEE formats, it turns out that this classical result
is enough to solve Problem 1, as cases when it is not precise
enough can be handled in an ad hoc way.
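The convergent computation is just the Euclidean algorithm with the recurrence above. The following small sketch handles rationals fitting in 64 bits (the actual worst-case search for Problem 1 must of course run at higher precision):

```c
#include <assert.h>
#include <stdint.h>

/* Convergents p_i/q_i of num/den, stopping before q exceeds qmax.
   Uses p_{i+1} = a_{i+1} p_i + p_{i-1}, q_{i+1} = a_{i+1} q_i + q_{i-1},
   seeded with p_{-1}/q_{-1} = 1/0 and p_{-2}/q_{-2} = 0/1.
   Returns the number of convergents stored in p[], q[]. */
static int convergents(int64_t num, int64_t den, int64_t qmax,
                       int64_t p[], int64_t q[]) {
    int64_t pp = 0, qp = 1;   /* p_{i-2}, q_{i-2} */
    int64_t pc = 1, qc = 0;   /* p_{i-1}, q_{i-1} */
    int cnt = 0;
    while (den != 0) {
        int64_t a = num / den;          /* a_i = floor(r_i) */
        int64_t pn = a * pc + pp;
        int64_t qn = a * qc + qp;
        if (qn > qmax) break;
        p[cnt] = pn; q[cnt] = qn; cnt++;
        pp = pc; qp = qc; pc = pn; qc = qn;
        int64_t r = num % den;          /* Euclidean step */
        num = den; den = r;
    }
    return cnt;
}
```

For instance, the convergents of 355/113 are 3/1, 22/7 and 355/113.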
First consider the case max(0, −w) ≤ h ≤ p2. We can then write

    dh(m, n) = | 5^ϕ(h) · n − 2^(h+w) · m | / (2^(h+w) · n),

where the numerator and denominator are both integers. If dh(m, n) ≠ 0, we have |5^ϕ(h) · n − 2^(h+w) · m| ≥ 1, hence dh(m, n) ≥ 1/(2^(h+w) · n) > 2^(p2−h−2p′10+1) ≥ 2^(−2p′10+1). For the range of h where dh(m, n) might be zero, the nonzero minimum is thus no less than 2^(−2p′10+1).
If now p2 + 1 ≤ h ≤ h(2)max, then h + w ≥ p′10, so 5^ϕ(h) and 2^(h+w) are integers, and 2^(h+w) · m is divisible by 2^(p′10). As n ≤ 2^(p′10) − 1, we have dh(m, n) ≠ 0. To start with, assume that dh(m, n) ≤ 2^(−2p′10−1). Then, the integers m and n ≤ 2^(p′10) − 1 satisfy

    | 5^ϕ(h)/2^(h+w) − m/n | ≤ 2^(−2p′10−1) < 1/(2n²),

and hence m/n is a convergent of the continued fraction expansion of α = 5^ϕ(h)/2^(h+w).

We compute the convergents with denominators less than 2^(p′10) and take the minimum of the |α − pi/qi| for which there exists k ∈ N such that m = k·pi and n = k·qi satisfy the constraints of Problem 1. Notice that this strategy only yields the actual minimum nonzero value of dh(m, n) if we eventually find, for some h, a convergent with |α − pi/qi| < 2^(−2p′10−1). Otherwise, taking η = 2^(−2p′10−1) yields a valid (yet not tight) lower bound.
The case h(2)min ≤ h ≤ min(0, −w) is similar. A numerator of dh(m, n) then is |2^(−h−w) · n − 5^(−ϕ(h)) · m| and a denominator is 5^(−ϕ(h)) · n. We notice that 5^(−ϕ(h)) ≥ 2^(p′10) if and only if h ≤ −p′10. If −p′10 + 1 ≤ h ≤ min(0, −w) and dh(m, n) ≠ 0, we have |2^(−h−w) · n − 5^(−ϕ(h)) · m| ≥ 1, hence dh(m, n) ≥ 1/(5^(−ϕ(h)) · n) > 2^(−2p′10). For h(2)min ≤ h ≤ −p′10, we use the same continued fraction tools as above.
There remains the case min(0, −w) ≤ h ≤ max(0, −w). If 0 ≤ h ≤ −w, a numerator of dh(m, n) is |5^ϕ(h) · 2^(−h−w) · n − m| and a denominator is n. Hence, if dh(m, n) ≠ 0, we have dh(m, n) ≥ 1/n > 2^(−p′10). If −w ≤ h ≤ 0, a numerator of dh(m, n) is |n − 5^(−ϕ(h)) · 2^(h+w) · m| and a denominator is 5^(−ϕ(h)) · 2^(h+w) · n ≤ 5 · 2^(−h) · 2^(h+w) · n < 5 · 2^(2p′10−p2−1). It follows that if dh(m, n) ≠ 0, we have dh(m, n) > (1/5) · 2^(−2p′10+p2+1).
Proposition 5. The minimum nonzero values of dh (m, n) under
the constraints from Problem 1 for the basic IEEE formats are
attained for the parameters given in Table 4.
            h      m                                    n                                    log2(η⁻¹)
b32/d64     50     10888194                             13802425659501406                    111.40
b32/d128    −159   11386091                             8169119658476861812680212016502305   229.57
b64/d64     −275   4988915232824583                     12364820988483254                    113.68
b64/d128    −818   5148744585188163                     9254355313724266263661769079234135   233.58
b128/d64    2546   7116022508838657793249305056613439   13857400902051554                    126.77
b128/d128   10378  7977485665655127446147737154136553   9844227914381600512882010261817769   237.14

TABLE 4
Worst cases for dh(m, n) and corresponding lower bounds η.
Proof. This follows from applying the algorithm outlined
above to the parameters associated with each basic IEEE
format. Scripts to verify the results are provided along with
our example implementation of the comparison algorithm
(see Section 7).
5  INEQUALITY TESTING
As already mentioned, the first step of the full comparison
algorithm is straightforwardly implementable in 32-bit or
64-bit integer arithmetic. The second step reduces to comparing m and f (h) · n, and we have seen in Section 4 that
it is enough to know f (h) with a relative accuracy given by
the last column of Table 4. We now discuss several ways to
perform that comparison. The main difficulty is to efficiently
evaluate f (h) · n with just enough accuracy.
5.1 Direct Method
A direct implementation of Equation (5) just replaces f (h)
with a sufficiently accurate approximation, read in a table
indexed with h. The actual accuracy requirements can be
stated as follows. Recall that η denotes a (tight) lower bound
on (6).
Proposition 6. Assume that µ approximates f(h) · n with relative accuracy χ < η/2^(−w+2) or better, that is,

    | f(h) · n − µ | ≤ χ · f(h) · n < (η/2^(−w+2)) · f(h) · n.    (8)

The following implications hold:

    µ > m + χ · 2^(p2+1)    ⇒  x10 > x2,
    µ < m − χ · 2^(p2+1)    ⇒  x10 < x2,    (9)
    |m − µ| ≤ χ · 2^(p2+1)  ⇒  x10 = x2.

Proof: First notice that f(h) ≤ 2^(−w) = 2^(p2−p′10+1) for all h. Since n < 2^(p′10), condition (8) implies |µ − f(h) · n| < 2^(p2+1) · χ, and hence |f(h) · n − m| ≥ |µ − m| − 2^(p2+1) · χ.

Now consider each of the possible cases that appear in (9). If |µ − m| > 2^(p2+1) · χ, then f(h) · n − m and µ − m have the same sign. The first two implications from (5) then translate into the corresponding cases from (9).

If finally |µ − m| ≤ 2^(p2+1) · χ, then the triangle inequality yields |f(h) · n − m| ≤ 2^(p2+2) · χ, which implies |f(h) − m/n| < 2^(p2+2) · χ/n < 2^(p2+w) · η/n ≤ η. By definition of η, this cannot happen unless m/n = f(h). This accounts for the last case.
For instance, in the case of binary64–decimal64 comparisons, it is more than enough to compute the product f (h)·n
in 128-bit arithmetic.
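To illustrate the comparison (5) itself: whenever the products involved fit in 128 bits, “compare m · 2^(h+w) with n · 5^g” can even be evaluated exactly. The following toy sketch uses the GCC/Clang __int128 extension; it is an illustration of the exact comparison, not the tabulated method of this section:

```c
#include <assert.h>
#include <stdint.h>

static unsigned __int128 pow5(int k) {
    unsigned __int128 r = 1;
    while (k-- > 0) r *= 5;
    return r;
}

/* Exact evaluation of "compare m * 2^(h+w) with n * 5^g" for small
   exponents (both sides must fit in 128 bits; no overflow check).
   Returns +1 if x2 > x10, 0 if x2 = x10, -1 if x2 < x10. */
static int cmp_exact(uint64_t m, uint64_t n, int h, int g, int w) {
    unsigned __int128 lhs = m, rhs = n;
    int e = h + w;
    if (e >= 0) lhs <<= e; else rhs <<= -e;            /* powers of 2 */
    if (g >= 0) rhs *= pow5(g); else lhs *= pow5(-g);  /* powers of 5 */
    return (lhs > rhs) - (lhs < rhs);
}
```

For b64/d64 (w = 0), comparing x2 = 1 (m = 2^52) with x10 = 1 in the form M10 = 1, e10 = 15 (n = 2^53, h = 1, g = 0) yields equality, while x2 = 2 (h = 2) yields x2 > x10.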
Proposition 4 gives the number of values of h to consider
in the table of f (h), which grows quite large (23.5 kB for
b64/d64 comparisons, 721 kB in the b128/d128 case, assuming a memory alignment on 8-byte boundaries). However,
it could possibly be shared with other algorithms such
as binary-decimal conversions. Additionally, h is available
early in the algorithm flow, so that the memory latency for
the table access can be hidden behind the first step.
The number of elements in the table can be reduced at the price of a few additional operations. Several consecutive h map to the same value g = ϕ(h). The corresponding values of f(h) = 5^ϕ(h) · 2^(−h−w) are binary shifts of each other. Hence, we may store the most significant bits of f(h) as a function of g, and shift the value read off the table by an appropriate amount to recover f(h). The table size goes down by a factor of about log2(5) ≈ 2.3.
More precisely, define F(g) = 5^g · 2^(−ψ(g)), where

    ψ(g) = ⌊g · log2 5⌋.    (10)

We then have f(h) = F(ϕ(h)) · 2^(−ρ(h)) with ρ(h) = h − ψ(ϕ(h)). The computation of ψ and ρ is similar to that of ϕ in Step 1. In addition, we may check that

    1 ≤ F(ϕ(h)) < 2  and  0 ≤ ρ(h) ≤ 3

for all h. Since ρ(h) is nonnegative, multiplication by 2^ρ(h) can be implemented with a bitshift, not requiring branches.
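Like ϕ in Step 1, ψ can be computed as a multiplication by a precomputed constant followed by a right shift. A sketch with s = 16 and K = ⌊2^16 · log2 5⌉ = 152170 (these values of s and K are ours, for illustration; the minimal valid s over the target range of g should be validated exhaustively, as in Proposition 2):

```c
#include <assert.h>
#include <stdint.h>

/* psi(g) = floor(g * log2(5)), approximated as floor(K * g / 2^16)
   with K = 152170. Assumes arithmetic right shift (= floor) on
   negative operands, as provided by mainstream compilers. */
static int32_t psi(int32_t g) {
    return (int32_t)(((int64_t)152170 * g) >> 16);
}
```

ρ(h) = h − ψ(ϕ(h)) then follows from ϕ and ψ.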
We did not implement this size reduction: instead, we
concentrated on another size reduction opportunity that we
shall describe now.
5.2 Bipartite Table
That second table size reduction method uses a bipartite table. Additionally, it takes advantage of the fact that, for small nonnegative g, the exact value of 5^g takes less space than an accurate enough approximation of f(h) (for instance, 5^g fits on 64 bits for g ≤ 27). In the case of a typical implementation of comparison between binary64 and decimal64 numbers, the bipartite table uses only 800 bytes of storage.
Most of the overhead in arithmetic operations for the bipartite table method can be hidden on current processors through increased instruction-level parallelism. The reduction in table size also helps decrease the probability of cache misses.
Recall that we suppose g = ϕ(h), so that h lies between bounds h(2)min, h(2)max deduced from Proposition 4. Corresponding bounds g(2)min, g(2)max on g follow since ϕ is nondecreasing.
For some integer “splitting factor” γ > 0 and ε = ±1, write

    g = ϕ(h) = ε · (γq − r),    q = ⌈ε·g/γ⌉,    (11)

so that

    f(h) = (5^(γq) / 5^r)^ε · 2^(−h−w).
Typically, γ will be chosen as a power of 2, but other values can occasionally be useful. With these definitions, we have 0 ≤ r ≤ γ − 1, and q lies between

    ⌈ε · g(2)min / γ⌉    and    ⌈ε · g(2)max / γ⌉

(the order of the bounds depends on ε). The integer 5^r fits on ψ(r) + 1 bits, where ψ is the function defined by (10).
Instead of tabulating f(h) directly, we will use two tables: one containing the most significant bits of 5^(γq) roughly to the precision dictated by the worst cases computed in Section 4, and the other containing the exact value 5^r for 0 ≤ r ≤ γ − 1. We denote by λ1 and λ2 the respective precisions (entry sizes in bits) of these tables. Based on these ranges and precisions, we assume

    λ1 ≥ log2(η⁻¹) − w + 3,    (12)
    λ2 ≥ ψ(γ − 1) + 1.    (13)
In addition, it is natural to suppose λ1 larger than or close
to λ2 : otherwise, an algorithm using two approximate tables
will likely be more efficient.
In practice, λ2 will typically be one of the available machine word widths, while, for reasons that should become
clear later, λ1 will be chosen slightly smaller than a word
size. For each pair of basic IEEE formats, Tables 7 and 8
provide values of ε, γ, λ1, λ2 (and other parameters whose
definition follows later) suitable for typical processors. The
suggested values were chosen by optimizing either the table
size or the number of elementary multiplications in the
algorithm, with or without the constraint that γ be a power
of two.
It will prove convenient to store the table entries left-aligned, in a fashion similar to floating-point significands. Thus, we set

    θ1(q) = 5^(γq) · 2^(λ1−1−ψ(γq)),    θ2(r) = 5^r · 2^(λ2−1−ψ(r)),    (14)

where the power-of-two factors provide for the desired alignment. One can check that

    2^(λ1−1) ≤ θ1(q) < 2^(λ1),    2^(λ2−1) ≤ θ2(r) < 2^(λ2)

for all h. The value f(h) now decomposes as

    f(h) = 5^g / 2^(h+w) = (θ1(q)/θ2(r))^ε · 2^(ε·(λ2−λ1) − σ(h) − w),    (15)

where

    σ(h) = h − ε · (ψ(γq) − ψ(r)).    (16)

Equations (10) and (16) imply that

    h − g·log2 5 − 1 < σ(h) < h − g·log2 5 + 1.

As we also have h·log5 2 − 1 ≤ g = ϕ(h) < h·log5 2 by definition of ϕ, it follows that

    0 ≤ σ(h) ≤ 3.    (17)

Now let τ ≥ −1 be an integer serving as an adjustment parameter to be chosen later. In practice, we will usually² choose τ = 0 and sometimes τ = −1 (see Tables 7 and 8). For most purposes the reader may simply assume τ = 0. Define

    Δ = θ1(q) · n · 2^(τ−p′10) − θ2(r) · m · 2^(τ+λ1−λ2+σ(h)−p2−1),    ε = +1,
    Δ = θ1(q) · m · 2^(τ−p2−1) − θ2(r) · n · 2^(τ+λ1−λ2−σ(h)−p′10),    ε = −1.

The relations |x2| > |x10|, |x2| = |x10|, and |x2| < |x10| hold respectively when εΔ < 0, Δ = 0, and εΔ > 0.
Proposition 7. Unless x2 = x10, we have |Δ| ≥ 2^(τ+1).

Proof: Assume x2 ≠ x10. For ε = +1, and using (15), the definition of Δ rewrites as

    Δ = θ2(r) · n · 2^(τ+λ1−λ2+σ(h)−p2−1) · ( f(h) − m/n ).

Since f(h) = m/n is equivalent to x2 = x10, we know by Proposition 5 that |f(h) − m/n| ≥ η. Together with the bounds (3), (12), (14), and (17), this implies

    log2 |Δ| ≥ (λ2 − 1) + (p′10 − 1) + τ + λ1 − λ2 − p2 − 1 + log2 η
             = (λ1 + log2 η + w − 3) + τ + 1 ≥ τ + 1.

Similarly, for ε = −1, we have

    Δ = −θ1(q) · n · 2^(τ−p2−1) · ( f(h) − m/n ),

so that

    log2 |Δ| ≥ (λ1 − 1) + (p′10 − 1) + τ − p2 − 1 + log2 η ≥ τ + 1

when Δ ≠ 0.
As already explained, the values of θ2(r) are integers and can be tabulated exactly. In contrast, only an approximation of θ1(q) can be stored. We represent these values as ⌈θ1(q)⌉, hence replacing Δ by the easy-to-compute

    Δ̃ = ⌊⌈θ1(q)⌉ · n · 2^(τ−p′10)⌋ − ⌊θ2(r) · m · 2^(τ+λ1−λ2+σ(h)−p2−1)⌋,    ε = +1,
    Δ̃ = ⌊⌈θ1(q)⌉ · m · 2^(τ−p2−1)⌋ − ⌊θ2(r) · n · 2^(τ+λ1−λ2−σ(h)−p′10)⌋,    ε = −1.
Proposition 7 is in principle enough to compare x2 with
x10 , using a criterion similar to that from Proposition 6. Yet
additional properties hold that allow for a more efficient
final decision step.
2. See the comments following Equations (19) to (21). Yet allowing
for positive values of τ enlarges the possible choices of parameters to
satisfy (21) (resp. (21’)) for highly unusual floating-point formats and
implementation environments. For instance, it makes it possible to store
the table for θ1 in a more compact fashion, using a different word size
for the table than for the computation.
Proposition 8. Assume g = ϕ(h), and let ∆̃ be the signed integer defined above. For τ ≥ 0, the following equivalences hold:

    ∆ < 0 ⟺ ∆̃ ≤ −2^τ,
    ∆ = 0 ⟺ 0 ≤ ∆̃ ≤ 2^τ,
    ∆ > 0 ⟺ ∆̃ ≥ 2^{τ+1}.

Furthermore, if τ ∈ {−1, 0} and

    λ1 − λ2 + τ ≥ p2 + 1 (ε = +1), resp. p′10 + 3 (ε = −1),    (18)

then ∆ and ∆̃ have the same sign (in particular, ∆̃ = 0 if and only if ∆ = 0).

Proof: Write the definition of ∆̃ as

    ∆̃ = ⌊⌈θ1(q)⌉X⌋ − ⌊θ2(r)X′⌋,

where the expressions of X and X′ depend on ε, and define error terms δtbl, δrnd, δ′rnd by

    δtbl = ⌈θ1(q)⌉ − θ1(q),    δrnd = ⌊⌈θ1(q)⌉X⌋ − ⌈θ1(q)⌉X,
    δ′rnd = ⌊θ2(r)X′⌋ − θ2(r)X′.

The δ's then satisfy

    0 ≤ δtbl < 1,    −1 < δrnd, δ′rnd ≤ 0.

For both possible values of ε, we have 0 ≤ X ≤ 2^τ, and hence

    −1 < ∆̃ − ∆ = X·δtbl + δrnd − δ′rnd < 2^τ + 1.

If ∆ < 0, Proposition 7 implies that

    ∆̃ < −2^{τ+1} + 2^τ + 1 = −2^τ + 1.

It follows that ∆̃ ≤ −⌊2^τ⌋ since ∆̃ is an integer. By the same reasoning, ∆ > 0 implies ∆̃ ≥ ⌊2^{τ+1}⌋, and ∆ = 0 implies 0 ≤ ∆̃ ≤ ⌈2^τ⌉. This proves the first claim.

Now assume that (18) holds. In this case, we have δ′rnd = 0, so that −1 < ∆̃ − ∆ < 2^τ. The upper bounds on ∆̃ become ∆̃ ≤ −⌊2^τ⌋ − 1 for ∆ < 0 and ∆̃ ≤ ⌈2^τ⌉ − 1 for ∆ = 0. When −1 ≤ τ ≤ 0, these estimates respectively imply ∆̃ < 0 and ∆̃ = 0, while ∆̃ ≥ ⌊2^{τ+1}⌋ implies ∆ > 0. The second claim follows.

The conclusion ∆̃ = 0 = ∆ when x2 = x10 for certain parameter choices means that there is no error in the approximate computation of ∆ in these cases. Specifically, the tabulation error δtbl and the rounding error δrnd cancel out, thanks to our choice to tabulate the ceiling of θ1(q) (and not, say, the floor).

We now describe in some detail the algorithm resulting from Proposition 8, in the case ε = +1. In particular, we will determine integer datatypes on which all steps can be performed without overflow, and arrange the powers of two in the expression of ∆̃ in such a way that computing the floor functions comes down to dropping the least significant words of multiple-word integers.

Let W : N → N denote a function satisfying W(k) ≥ k for all k. In practice, W(k) will typically be the width of the smallest "machine word" that holds at least k bits. We further require that (for ε = +1):

    W(p′10) − p′10 ≥ 1,    (19)
    W(λ1) − λ1 ≥ W(λ2) − λ2 + τ + 3,    (20)
    W(p2) − p2 + W(λ2) − λ2 ≥ W(λ1) − λ1 − τ + 1.    (21)

These assumptions are all very mild in view of the IEEE 754-2008 Standard. Indeed, for all IEEE interchange formats (including binary16 and formats wider than 64 bits), the space taken up by the exponent and sign leaves us with at least 5 bits of headroom if we choose for W(p2), resp. W(p′10), the word width of the format. Even if we force W(λ2) = λ2 and τ = 0, this always allows for a choice of λ1, λ2 and W(λ1) satisfying (12), (13) and (19)–(21).

3. Actually 0 ≤ X ≤ 2^{τ−1} for ε = −1: this dissymmetry is related to our choice of always expressing the lower bound on ∆ in terms of η rather than using a lower bound on f(h)^{−1} − n/m when ε = −1.
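Conditions (19)–(21) are easy to check mechanically when selecting parameters, much as a parameter-selection script would. A minimal sketch, taking p2 = 53 and p′10 = 54 (the binary64 and decimal64 significand widths in bits, fixed earlier in the paper and assumed here) together with the b64/d64 column of Table 7:

```python
def check_conditions_eps_plus(p2, p10p, lam1, lam2, W_p2, W_p10p, W_lam1, W_lam2, tau):
    """Check conditions (19)-(21) for the epsilon = +1 variant of Algorithm 2."""
    c19 = W_p10p - p10p >= 1
    c20 = W_lam1 - lam1 >= W_lam2 - lam2 + tau + 3
    c21 = W_p2 - p2 + W_lam2 - lam2 >= W_lam1 - lam1 - tau + 1
    return c19 and c20 and c21

# b64/d64 column of Table 7 (gamma = 2^k, optimizing multiplication count).
assert check_conditions_eps_plus(p2=53, p10p=54, lam1=125, lam2=32,
                                 W_p2=64, W_p10p=64, W_lam1=128, W_lam2=32, tau=0)

# Shrinking W(lambda1) down to lambda1 itself violates (20):
assert not check_conditions_eps_plus(p2=53, p10p=54, lam1=125, lam2=32,
                                     W_p2=64, W_p10p=64, W_lam1=125, W_lam2=32, tau=0)
```

The same three-line predicate, with (19′)–(21′) substituted, covers the ε = −1 variant described below.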
Denote x+ = max(x, 0) and x− = max(−x, 0), so that x = x+ − x−.

Algorithm 2. Step 2, second method, ε = +1.
1. Compute q and r as defined in (11), with ε = +1. Compute σ(h) using (16) and Proposition 2.
2. Read ⌈θ1(q)⌉ from the table, storing it on a W(λ1)-bit (multiple-)word.
3. Compute

       A = (⌈θ1(q)⌉ · 2^{τ^+}) · (n · 2^{W(p′10)−p′10−τ^−}) · 2^{−W(p′10)}

   by (constant) left shifts followed by a W(λ1)-by-W(p′10)-bit multiplication, dropping the least significant word(s) of the result that correspond to W(p′10) bits.
4. Compute

       B = θ2(r) · (m · 2^{σ(h)+ω}) · 2^{W(λ1)−W(λ2)−W(p2)},

   where the constant ω is given by

       ω = W(p2) − p2 + W(λ2) − λ2 − W(λ1) + λ1 + τ − 1,

   in a similar way, using a W(λ2)-by-W(p2)-bit multiplication.
5. Compute the difference ∆̃ = A − B as a W(λ1)-bit signed integer.
6. Compute ∆′ = ⌊∆̃ · 2^{−τ−1}⌋ · 2^{τ+1} by masking the τ + 1 least significant bits of ∆̃.
7. Return
       "x10 < x2" if ∆′ < 0,
       "x10 = x2" if ∆′ = 0,
       "x10 > x2" if ∆′ > 0.
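Step 6 deserves a remark: on a two's complement representation, clearing the τ + 1 low-order bits of ∆̃ is exactly the computation ⌊∆̃ · 2^{−τ−1}⌋ · 2^{τ+1}, including for negative values. A small sketch (Python integers behave as two's complement of unbounded width, so the same mask works for any sign):

```python
def step6(delta_tilde, tau):
    """Clear the tau+1 least significant bits: floor(delta / 2^(tau+1)) * 2^(tau+1).

    Python integers act as infinite-width two's complement, so the bitwise
    AND implements the sign-preserving floor for negative values as well."""
    return delta_tilde & ~((1 << (tau + 1)) - 1)

tau = 0
for d in range(-8, 9):
    assert step6(d, tau) == (d >> (tau + 1)) << (tau + 1)

# The thresholds of Proposition 8 map to the sign of delta' as claimed:
assert step6(2**tau, tau) == 0          # 0 <= delta~ <= 2^tau    =>  delta' = 0
assert step6(-2**tau, tau) < 0          # delta~ <= -2^tau        =>  delta' < 0
assert step6(2**(tau + 1), tau) > 0     # delta~ >= 2^(tau+1)     =>  delta' > 0
```

In the generated C code, the same effect is obtained by masking the low-order words of the multiple-word integer holding ∆̃.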
Proposition 9. When either τ ≥ 0, or τ = −1 and (18) holds, Algorithm 2 correctly decides the ordering between x2 and x10.

Proof: Hypotheses (19) and (21), combined with the inequalities σ(h) ≥ 0 and τ ≥ −1, imply that the shifts in Steps 3 and 4 are left shifts. They can be performed without overflow on unsigned words of respective widths W(λ1), W(p′10) and W(p2) since W(λ1) − λ1 ≥ τ + 1 and W(p2) − p2 ≥ ω + 3, both by (20). The same inequality implies that the (positive) integers A and B both fit on W(λ1) − 1 bits, so that their difference can be computed by a subtraction on W(λ1) bits. The quantity A − B computed in Step 5 agrees with the previous definition of ∆̃, and ∆′ has the same sign as ∆̃, hence Proposition 8 implies that the relation returned in Step 7 is correct.
When τ ∈ {−1, 0} and (18) holds, no low-order bits are dropped in Step 4, and Step 6 can be omitted (that is, replaced by ∆′ := ∆̃). Also observe that a version of Step 7 returning −1, 0 or 1 according to the comparison result can be implemented without branching, using bitwise operations on the individual words which make up the representation of ∆′ in two's complement encoding.
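The branchless variant of Step 7 can be sketched as follows; the word layout (unsigned words, least significant first) is a hypothetical choice for illustration, not the generated code's actual representation:

```python
def branchless_sign(words, width):
    """Sign (-1, 0, +1) of a multiple-word two's complement integer.

    `words` holds the unsigned words, least significant first; each word is
    `width` bits wide. Only bitwise and arithmetic operations are used."""
    mask = (1 << width) - 1
    acc = 0
    for wd in words:
        acc |= wd                                   # acc == 0 iff the integer is 0
    neg = words[-1] >> (width - 1)                  # sign bit of the top word
    nz = ((acc | ((-acc) & mask)) >> (width - 1)) & 1   # 1 iff some bit is set
    return nz - 2 * neg                             # neg == 1 implies nz == 1

def to_words(x, width, nwords):
    """Two's complement encoding of x on nwords words of `width` bits each."""
    x &= (1 << (width * nwords)) - 1
    return [(x >> (i * width)) & ((1 << width) - 1) for i in range(nwords)]

assert branchless_sign(to_words(0, 32, 4), 32) == 0
assert branchless_sign(to_words(123456789, 32, 4), 32) == 1
assert branchless_sign(to_words(-5, 32, 4), 32) == -1
```

The `nz` computation uses the classic identity that, for a nonzero w-bit value v, the word v | (−v mod 2^w) always has its top bit set.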
The algorithm for ε = −1 is completely similar, except that (19)–(21) become

    W(p2) − p2 ≥ 2,    (19′)
    W(λ1) − λ1 ≥ W(λ2) − λ2 + τ + 1,    (20′)
    W(p′10) − p′10 + W(λ2) − λ2 ≥ W(λ1) − λ1 − τ + 3,    (21′)

and A and B are computed as

    A = (⌈θ1(q)⌉ · 2^{τ^+}) · (m · 2^{W(p2)−p2−τ^−−1}) · 2^{−W(p2)},
    B = θ2(r) · (n · 2^{ω−σ(h)}) · 2^{W(λ1)−W(λ2)−W(p′10)},

with

    ω = W(p′10) − p′10 + W(λ2) − λ2 − W(λ1) + λ1 + τ,

respectively using a W(λ1)-by-W(p2)-bit and a W(λ2)-by-W(p′10)-bit multiplication.
Tables 7 and 8 suggest parameter choices for comparisons in basic IEEE formats. (We only claim that these are reasonable choices that satisfy all conditions, not that they are optimal for any well-defined metric.) Observe that, though it has the same overall structure, the algorithm presented in [2] is not strictly speaking a special case of Algorithm 2, as the definition of the bipartite table (cf. (11)) is slightly different.
6 AN EQUALITY TEST

Besides deciding the two possible inequalities, the direct and bipartite methods described in the previous sections are able to determine cases when x2 is equal to x10. However, when it comes to solely determining such equality cases, additional properties lead to a simpler and faster algorithm.

The condition x2 = x10 is equivalent to

    m · 2^{h+w} = n · 5^g,    (22)

and thus implies certain divisibility relations between m, n, 5^{±g} and 2^{±h±w}. We now describe an algorithm that takes advantage of these relations to decide (22) using only binary shifts and multiplications on no more than max(p′10 + 2, p2 + 1) bits.

Proposition 10. Assume x2 = x10. Then, we have

    g_min^eq ≤ g ≤ g_max^eq,    −p′10 + 1 ≤ h ≤ p2,

where

    g_min^eq = −⌊p10 (1 + log5 2)⌋,    g_max^eq = ⌊p2 log5 2⌋.

Additionally,
• if g ≥ 0, then 5^g divides m, otherwise 5^{−g} divides M10 (and n);
• if h ≥ −w, then 2^{h+w} divides n, otherwise 2^{−(h+w)} divides m.

Proof: First observe that g must be equal to ϕ(h) by Proposition 1, so g ≥ 0 if and only if h ≥ 0. The divisibility properties follow immediately from the coprimality of 2 and 5.

When g ≥ 0, they imply 5^g ≤ m ≤ 2^{p2} − 1, whence g < p2 log5 2. Additionally, 2^{−h−w} n is an integer. Since h ≥ 0, it follows that 2^h ≤ 2^{−w} n < 2^{p′10−w} and hence h ≤ p′10 − w − 1 = p2.

If now g < 0, then 5^{−g} divides m = 2^ν M10 and hence divides M10, while 2^{−h} divides 2^w m. Since M10 < 10^{p10} and 2^w m < 2^{p2+w}, we have −g < p10 log5 10 and −h < p10 log2 10 < p′10.

These properties translate into Algorithm 3 below. Recall that x+ = max(x, 0) and x− = x+ − x. (Branchless algorithms exist to compute x+ and x− [1].) Let p = max(p′10 + 2, p2 + 1).

Algorithm 3. Equality test.
1. If not g_min^eq ≤ g ≤ g_max^eq, then return false.
2. Set u = (w + h)− and v = (w + h)+.
3. Using right shifts, compute m1 = ⌊2^{−u} m⌋ and n1 = ⌊2^{−v} n⌋.
4. Using left shifts, compute m2 = 2^u m1 and n2 = 2^v n1.
5. Read 5^{g+} and 5^{g−} from a table (or compute them).
6. Compute m3 = 5^{g−} m1 with a p′10 × p2 → p bit multiplication (meaning that the result can be anything if the product does not fit on p bits).
7. Compute n3 = 5^{g+} n1 with a p2 × p′10 → p bit multiplication.
8. Return (m2 = m) ∧ (n2 = n) ∧ (g = ϕ(h)) ∧ (n3 = m3).

As a matter of course, the algorithm can return false as soon as any of the conditions from Step 8 is known to be unsatisfied.

Proposition 11. Algorithm 3 correctly decides whether x2 = x10.

Proof: First observe that, if any of the conditions g = ϕ(h), g_min^eq ≤ g ≤ g_max^eq, m2 = m and n2 = n is violated, then x2 ≠ x10 by Propositions 1 and 10. Indeed, x2 = x10 implies m2 = m and n2 = n due to the divisibility conditions of Proposition 10. In all these cases, the algorithm correctly returns false. In particular, no table access occurs unless g lies between g_min^eq and g_max^eq.

Now assume that these four conditions hold. In particular, we have m1 = 2^{−u} m as m2 = m, and similarly n1 = 2^{−v} n. The bounds on g imply that 5^{g+} fits on p2 bits and 5^{g−} fits on p′10 bits. If g and w + h are both nonnegative, then m3 = m, and we have

    5^{g+} n1 = 5^{ϕ(h)} 2^{−(w+h)} n < 2^{−w+p′10} = 2^{p2+1} ≤ 2^p.

By the same reasoning, if g, w + h ≤ 0, then n3 = n and

    5^{−g} 2^{w+h} m < 5 · 2^{p′10−1} < 2^p.

If now g ≥ 0 with h ≤ −w, then m3 = m1 and

    5^g n < 2^{h+p′10} ≤ 2^{−w+p′10} = 2^{p2+1} ≤ 2^p.

Similarly, for g ≤ 0, −h ≤ w, we have

    5^{−g} m < 5 · 2^{−h+p2} ≤ 5 · 2^{w+p2} ≤ 2^p.

In all four cases,

    m3 = 5^{g−} 2^{−u} m    and    n3 = 5^{g+} 2^{−v} n

are computed without overflow. Therefore

    m3 / n3 = (m · 2^{(h+w)+ − (h+w)−}) / (n · 5^{g+ − g−}) = (m · 2^{h+w}) / (n · 5^g),

and x2 = x10 if and only if m3 = n3, by (22).
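The shift-and-multiply structure of Algorithm 3 can be sketched as follows, with Python's unbounded integers standing in for the p-bit multiplications (the point of Proposition 11 being precisely that p bits suffice). Step 1 and the g = ϕ(h) test of Step 8 are omitted here, since they involve quantities defined earlier in the paper; the remaining steps already decide condition (22) exactly when the arithmetic cannot overflow, which the sketch cross-checks against rational arithmetic:

```python
from fractions import Fraction
import random

def equality_test(m, n, h, g, w):
    """Steps 2-8 of Algorithm 3 (minus the g == phi(h) prefilter): decide
    whether m * 2^(h+w) == n * 5^g, using only shifts and multiplications.
    Python big integers stand in for the p-bit multiplications of the paper."""
    u = max(-(w + h), 0)                  # (w+h)^-
    v = max(w + h, 0)                     # (w+h)^+
    m1, n1 = m >> u, n >> v               # Step 3: right shifts
    m2, n2 = m1 << u, n1 << v             # Step 4: left shifts
    g_plus, g_minus = max(g, 0), max(-g, 0)
    m3 = 5**g_minus * m1                  # Step 6
    n3 = 5**g_plus * n1                   # Step 7
    return m2 == m and n2 == n and m3 == n3

# Cross-check against exact rational arithmetic on a few hundred random cases
# (w = p'10 - p2 - 1 = 0 for binary64/decimal64).
random.seed(1)
for _ in range(300):
    m = random.randrange(1, 1 << 53)
    n = random.randrange(1, 10**16)
    h = random.randrange(-60, 60)
    g = random.randrange(-25, 25)
    exact = Fraction(m) * Fraction(2)**h == Fraction(n) * Fraction(5)**g
    assert equality_test(m, n, h, g, 0) == exact

# An equality case built by hand: m * 2^h = n * 5^g with m = 5^3 * a, n = a * 2^4.
a = 123456
assert equality_test(5**3 * a, a * 2**4, 4, 3, 0)
assert not equality_test(5**3 * a + 1, a * 2**4, 4, 3, 0)
```

In the fixed-width setting of the paper, the g = ϕ(h) check and the bounds of Step 1 are what guarantee that the two p-bit products cannot silently overflow.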
7 EXPERIMENTAL RESULTS
For each combination of a binary basic IEEE format and
a decimal one, we implemented the comparison algorithm
consisting of the first step (Section 3) combined with the
version of the second step based on a bipartite table (Section 5.2). We used the parameters suggested in the second
part of Table 8, that is, for the case when γ is a power of two
and one tries to reduce table size.
The implementation was aided by scripts that take as
input the parameters from Tables 2 and 8 and produce C
code with GNU extensions [10]. Non-standard extensions
are used for 128-bit integer arithmetic and other basic integer operations. No further effort at optimization was made.
Our implementations, the code generator itself, as well
as various scripts to select the parameters are available at
http://hal.archives-ouvertes.fr/hal-01021928/.
The code generator can easily be adapted to produce other
variants of the algorithm.
We tested the implementations on random inputs generated as follows. For each decimal exponent and each quantum [5, 2.1.42], we generate a few random significands and round the resulting decimal numbers upwards and downwards to the considered binary format. (We exclude the decimal exponents that merely lead to underflow or overflow on the binary side.)
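This generation scheme can be sketched as follows; the sketch is a simplified stand-in for the actual scripts (it draws one significand at a time over a narrow exponent slice and does not enumerate quanta), using exact rational arithmetic to obtain the two neighboring binary64 roundings:

```python
from fractions import Fraction
import math, random

def binary64_neighbors(sig, exp10):
    """Binary64 roundings (downward and upward) of the decimal value sig * 10^exp10."""
    x = Fraction(sig) * Fraction(10)**exp10
    f = float(x)                       # int/int division, hence correctly rounded
    if Fraction(f) > x:
        return math.nextafter(f, -math.inf), f
    if Fraction(f) < x:
        return f, math.nextafter(f, math.inf)
    return f, f                        # the decimal value is exactly representable

random.seed(42)
cases = []
for exp10 in range(-30, 30):           # a narrow slice of the decimal64 exponent range
    for _ in range(3):
        sig = random.randrange(1, 10**16)   # 16-digit decimal64 significands
        down, up = binary64_neighbors(sig, exp10)
        cases.append((sig, exp10, down, up))

for sig, exp10, down, up in cases:
    x = Fraction(sig) * Fraction(10)**exp10
    assert Fraction(down) <= x <= Fraction(up)
    assert up == down or up == math.nextafter(down, math.inf)
```

Each generated pair brackets the decimal value by one unit in the last place of binary64, which exercises both the "easy" and the "hard" paths of the comparison.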
Table 6 reports the observed timings for our random
test data. The generated code was executed on a system
equipped with a quad-core Intel Core i5-3320M processor
clocked at 2.6 GHz, running Linux 3.16 in 64 bit mode.
We used the gcc 4.9.2 compiler, at optimization level -O3,
with the -march=native flag. The measurements were
performed using the Read Time-Stamp Counter (RDTSC) instruction
after pipeline serialization. The serialization and function
call overhead was estimated by timing an empty function
and subtracted off. The caches were preheated by additional
comparison function calls, the results of which were discarded.
It can be observed that "balanced" comparisons (d64/b64, d128/b128) perform significantly better than "unbalanced" ones. A quick look at the generated assembly code suggests that there is room for performance improvements, especially in cases involving 128-bit floating-point formats.
We performed more detailed experiments in the special
case of binary64–decimal64 comparisons. In complement to
the generated code, we wrote a version of the bipartite table
method with some manual optimization, and implemented
the method based on direct tabulation of f (h) presented
in Section 5.1. We tested these implementations thoroughly
with test vectors that extensively cover all floating-point input classes (normal numbers, subnormals, Not-A-Numbers,
zeros, and infinities) and exercise all possible values of the
normalization parameter ν (cf. Section 3) as well as all
possible outcomes for both of the algorithm’s steps.
Finally, we compared them for performance to the naïve
comparison method where one of the inputs is converted to
the radix of the other input. These experimental results are
reported in Table 5. In the last two rows, “easy” cases are
cases when the first step of our algorithm succeeds, while
“hard” cases refer to those for which running the second
step is necessary.
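As a two-line illustration of why the naïve method is incorrect, consider the exact decimal value 0.1 (representable in decimal64, here modeled with Python's `Decimal`) against the binary64 number closest to 0.1. Converting the decimal input to binary64 makes the two spuriously compare equal, whereas an exact comparison shows the binary64 value to be strictly larger:

```python
from fractions import Fraction
from decimal import Decimal

x10 = Decimal("0.1")   # exact decimal value 1/10
x2 = 0.1               # the binary64 number closest to 1/10

# Naive method: convert x10 to binary64, then compare. The conversion rounds,
# so the two values compare equal even though they are distinct reals.
assert float(x10) == x2

# Exact comparison: binary64 "0.1" is slightly above 1/10, so in fact x2 > x10.
assert Fraction(x2) > Fraction(1, 10)
```

Our two-step algorithm returns "x2 > x10" on this input, while the naïve conversion-based comparison reports equality.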
8 CONCLUSION
Though not foreseen in the IEEE 754-2008 Standard, exact
comparisons between floating-point formats of different
radices would enrich the current floating-point environment
and make numerical software safer and easier to prove.
This paper has investigated the feasibility of such comparisons. A simple test has been presented that eliminates
most of the comparison inputs. For the remaining cases, two
algorithms were proposed, a direct method and a technique
based on a more compact bipartite table. For instance, in the
case of binary64–decimal64, the bipartite table uses only 800
bytes of table space.
Both methods have been proven correct, implemented, and thoroughly tested. According to our experiments, they outperform the naïve comparison technique consisting in converting one of the inputs to the other input's format. Furthermore, they always return a correct answer, which is not the case for the naïve technique.
Since the algorithmic problems of exact binary-to-decimal comparison and correctly rounded radix conversion are related, future investigations should also consider the possible reuse of tables for both problems. Finally, one should mention that considerable additional work is required in order to enable mixed-radix comparisons in the case when the decimal floating-point number is stored in densely packed decimal representation.
TABLE 5
Timings (in cycles, min/avg/max) for binary64–decimal64 comparisons.

                                          Naïve method,    Naïve method,    Direct method   Bipartite table method
                                          x10 → binary64   x2 → decimal64   (Section 5.1)   (manual implementation,
                                          (incorrect)      (incorrect)                      Section 5.2)
Special cases (±0, NaNs, Inf)             18/–/164         21/–/178         12/–/82         11/–/74
x2, x10 of opposite sign                  30/107/188       44/124/179       15/30/76        16/31/73
x2 normal, same sign, "easy" cases        31/107/184       48/132/180       25/57/118       21/43/95
x2 subnormal, same sign, "easy" cases     45/106/178       89/119/179       153/163/189     19/27/73
x2 normal, same sign, "hard" cases        60/87/186        39/78/180        45/56/139       29/40/119
x2 subnormal, same sign, "hard" cases     79/106/167       78/107/179       171/177/189     34/49/94

REFERENCES

[1] Jörg Arndt, Matters computational. Ideas, algorithms, source code, Springer, 2011.
[2] Nicolas Brisebarre, Christoph Lauter, Marc Mezzarobba, and Jean-Michel Muller, Comparison between binary64 and decimal64 floating-point numbers, ARITH 21 (Austin, Texas, USA) (Alberto Nannarelli, Peter-Michael Seidel, and Ping Tak Peter Tang, eds.), IEEE Computer Society, 2013, pp. 145–152.
[3] Marius Cornea, Cristina Anderson, John Harrison, Ping Tak Peter Tang, Eric Schneider, and Evgeny Gvozdev, A software implementation of the IEEE 754R decimal floating-point arithmetic using the binary encoding format, IEEE Transactions on Computers 58 (2009), no. 2, 148–162.
[4] Godfrey H. Hardy and Edward M. Wright, An introduction to the theory of numbers, Oxford University Press, London, 1979.
[5] IEEE Computer Society, IEEE standard for floating-point arithmetic, IEEE Standard 754-2008, August 2008, available at http://ieeexplore.ieee.org/servlet/opac?punumber=4610933.
[6] ISO/IEC JTC 1/SC 22/WG 14, Extension for the programming language C to support decimal floating-point arithmetic, Proposed Draft Technical Report, May 2008.
[7] Aleksandr Ya. Khinchin, Continued fractions, Dover, New York, 1997.
[8] Jean-Michel Muller, Nicolas Brisebarre, Florent de Dinechin, Claude-Pierre Jeannerod, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Damien Stehlé, and Serge Torres, Handbook of floating-point arithmetic, Birkhäuser, 2010.
[9] Oskar Perron, Die Lehre von den Kettenbrüchen, 3rd ed., Teubner, Stuttgart, 1954–57.
[10] The GNU project, GNU compiler collection, 1987–2014, available at http://gcc.gnu.org/.
[11] Charles Tsen, Sonia González-Navarro, and Michael Schulte, Hardware design of a binary integer decimal-based floating-point adder, ICCD 2007: 25th International Conference on Computer Design, IEEE, 2007, pp. 288–295.
TABLE 6
Average run time (in cycles) for the bipartite table method, using the parameters suggested in Table 8 for γ = 2^k, optimizing table size.

        d64    d128
b32      41     235
b64      36     266
b128     75      61
TABLE 7
Suggested parameters for Algorithm 2 on a 32-bit machine.

                                  b32/d64   b32/d128   b64/d64    b64/d128   b128/d64     b128/d128
W(p2)                             32        32         64         64         128          128
W(p′10)                           64        128        64         128        64           128

γ = 2^k, optimizing multiplication count:
ε                                 −1        −1         +1         −1         +1           +1
γ                                 8         8          8          8          8            8
range of q                        −4, 8     −4, 10     −42, 39    −38, 45    −622, 617    −624, 617
min λ1 satisfying (12)            86        145        117        178        190          242
min λ2 satisfying (13)            17        17         17         17         17           17
chosen λ1, W(λ1)                  95, 96    159, 160   125, 128   191, 192   190, 192     253, 256
chosen λ2, W(λ2)                  32, 32    32, 32     32, 32     32, 32     32, 32       32, 32
τ                                 0         0          0          0          −1           0
Step 6 of Algo. 2 can be omitted  yes       yes        yes        yes        yes          yes
total table size in bytes         188       332        1344       2048       29792        39776
32-bit muls                       5         9          10         16         16           36

γ = 2^k, optimizing table size:
ε                                 −1        −1         +1         −1         −1           +1
γ                                 8         16         32         32         64           64
range of q                        −4, 8     −2, 5      −10, 10    −9, 12     −77, 78      −78, 78
min λ1 satisfying (12)            86        145        117        178        190          242
min λ2 satisfying (13)            17        35         72         72         147          147
chosen λ1, W(λ1)                  95, 96    159, 160   125, 128   191, 192   191, 192     253, 256
chosen λ2, W(λ2)                  32, 32    64, 64     96, 96     96, 96     160, 160     160, 160
τ                                 0         0          0          0          0            0
Step 6 of Algo. 2 can be omitted  yes       no         no         no         no           no
total table size in bytes         188       288        720        912        5024         6304
32-bit muls                       5         13         14         24         34           52

any γ, optimizing multiplication count:
ε                                 −1        −1         +1         −1         +1           +1
γ                                 13        13         14         14         14           14
range of q                        −2, 5     −2, 6      −24, 22    −22, 26    −355, 353    −357, 353
min λ1 satisfying (12)            86        145        117        178        190          242
min λ2 satisfying (13)            28        28         31         31         31           31
chosen λ1, W(λ1)                  95, 96    159, 160   125, 128   191, 192   190, 192     253, 256
chosen λ2, W(λ2)                  32, 32    32, 32     32, 32     32, 32     32, 32       32, 32
τ                                 0         0          0          0          −1           0
Step 6 of Algo. 2 can be omitted  yes       yes        yes        yes        yes          yes
total table size in bytes         148       232        808        1232       17072        22808
32-bit muls                       5         9          10         16         16           36

any γ, optimizing table size:
ε                                 −1        −1         +1         +1         −1           +1
γ                                 13        13         28         28         69           83
range of q                        −2, 5     −2, 6      −12, 11    −12, 11    −71, 73      −60, 60
min λ1 satisfying (12)            86        145        117        178        190          242
min λ2 satisfying (13)            28        28         63         63         158          191
chosen λ1, W(λ1)                  95, 96    159, 160   125, 128   189, 192   191, 192     253, 256
chosen λ2, W(λ2)                  32, 32    32, 32     64, 64     64, 64     160, 160     192, 192
τ                                 0         0          0          0          0            0
Step 6 of Algo. 2 can be omitted  yes       yes        yes        yes        no           no
total table size in bytes         148       232        608        800        4860         5864
32-bit muls                       5         9          12         28         34           56
TABLE 8
Suggested parameters for Algorithm 2 on a 64-bit machine.

                                  b32/d64   b32/d128   b64/d64    b64/d128   b128/d64     b128/d128
W(p2)                             64        64         64         64         128          128
W(p′10)                           64        128        64         128        64           128

γ = 2^k, optimizing multiplication count:
ε                                 +1        −1         +1         −1         +1           +1
γ                                 16        16         16         16         16           16
range of q                        −3, 3     −2, 5      −21, 20    −19, 23    −311, 309    −312, 309
min λ1 satisfying (12)            86        145        117        178        190          242
min λ2 satisfying (13)            35        35         35         35         35           35
chosen λ1, W(λ1)                  125, 128  191, 192   125, 128   191, 192   190, 192     253, 256
chosen λ2, W(λ2)                  64, 64    64, 64     64, 64     64, 64     64, 64       64, 64
τ                                 0         0          0          0          −1           0
Step 6 of Algo. 2 can be omitted  yes       yes        yes        yes        yes          yes
total table size in bytes         240       320        800        1160       15032        20032
64-bit muls                       3         5          3          5          5            10

γ = 2^k, optimizing table size:
ε                                 +1        −1         +1         −1         −1           +1
γ                                 16        16         16         32         64           64
range of q                        −3, 3     −2, 5      −21, 20    −9, 12     −77, 78      −78, 78
min λ1 satisfying (12)            86        145        117        178        190          242
min λ2 satisfying (13)            35        35         35         72         147          147
chosen λ1, W(λ1)                  125, 128  191, 192   125, 128   191, 192   191, 192     253, 256
chosen λ2, W(λ2)                  64, 64    64, 64     64, 64     128, 128   192, 192     192, 192
τ                                 0         0          0          0          0            0
Step 6 of Algo. 2 can be omitted  yes       yes        yes        no         no           no
total table size in bytes         240       320        800        1040       5280         6560
64-bit muls                       3         5          3          7          9            14

any γ, optimizing multiplication count:
ε                                 +1        −1         +1         −1         +1           +1
γ                                 13        20         28         28         28           28
range of q                        −4, 3     −1, 4      −12, 11    −11, 13    −177, 177    −178, 177
min λ1 satisfying (12)            86        145        117        178        190          242
min λ2 satisfying (13)            28        45         63         63         63           63
chosen λ1, W(λ1)                  125, 128  191, 192   125, 128   191, 192   190, 192     253, 256
chosen λ2, W(λ2)                  64, 64    64, 64     64, 64     64, 64     64, 64       64, 64
τ                                 0         0          0          0          −1           0
Step 6 of Algo. 2 can be omitted  yes       yes        yes        yes        yes          yes
total table size in bytes         232       304        608        824        8744         11616
64-bit muls                       3         5          3          5          5            10

any γ, optimizing table size:
ε                                 +1        −1         +1         +1         −1           +1
γ                                 13        20         28         28         82           83
range of q                        −4, 3     −1, 4      −12, 11    −12, 11    −60, 61      −60, 60
min λ1 satisfying (12)            86        145        117        178        190          242
min λ2 satisfying (13)            28        45         63         63         189          191
chosen λ1, W(λ1)                  125, 128  191, 192   125, 128   189, 192   191, 192     253, 256
chosen λ2, W(λ2)                  64, 64    64, 64     64, 64     64, 64     192, 192     192, 192
τ                                 0         0          0          0          0            0
Step 6 of Algo. 2 can be omitted  yes       yes        yes        yes        no           no
total table size in bytes         232       304        608        800        4896         5864
64-bit muls                       3         5          3          7          9            14