
Chapter 1

Numbers and Precision

You should be quite familiar with base-10 arithmetic¹; digital computers, on
the other hand, use base-2 or binary representation for arithmetic. Like
base-10, binary is also written using positional notation, however restricted
to only allowing the digits 0 and 1. For example, the binary number 10110.1_2
is converted to base-10 as follows:

    10110.1_2 = 1 × 2^4 + 0 × 2^3 + 1 × 2^2 + 1 × 2^1 + 0 × 2^0 + 1 × 2^{-1}
              = 16 + 4 + 2 + 1/2
              = 22.5                                                       (1.1)

Similarly to base-10, the position of a digit relative to the decimal point
determines its value: if a digit is n places to the left of the decimal
point, it has a value of 2^{n-1}, whilst if it is n places to the right, it
has a value of 2^{-n} (see Fig. 1.1).

! If working outside of base-10, we use the notation X_B to denote that X is
a number represented in base-B.

Figure 1.1: The value of binary digits in fixed point representation.

¹ We are assuming that your physics undergraduate course was taught in base-10!
Converting from base-10 to binary is a little more involved, but still quite
elementary. For example, to convert the number 23.625 to binary, we start
with the integer component 23, and repeatedly divide by 2, storing each
remainder until we get a result of 0:

23/2 = 11 remainder 1
⇒ 11/2 = 5 remainder 1
⇒ 5/2 = 2 remainder 1
⇒ 2/2 = 1 remainder 0
⇒ 1/2 = 0 remainder 1

The remainder of the first division gives us the least significant digit,
and the remainder of the final division gives us the most significant digit.
Therefore, reading the remainders from bottom to top gives us 23 = 10111_2.
To convert the fractional part 0.625, we instead multiply the fractional
part by 2 each time, storing each integer until we have a fractional part of
zero:

0.625 × 2 = 1.25
⇒ 0.25 × 2 = 0.5
⇒ 0.5 × 2 = 1.0

The first multiplication gives us the most significant digit, and the final
multiplication gives us the least significant digit. So, reading from top to
bottom, 0.625 = 0.101_2.
Putting these both together, we have 23.625 = 10111.101_2.
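
These two procedures are easy to automate. The following short Python sketch
(the helper names int_to_binary and frac_to_binary are our own, not from any
library) implements the repeated-division and repeated-multiplication steps
described above:

    def int_to_binary(n):
        """Convert a non-negative integer to a binary string by repeated division by 2."""
        if n == 0:
            return "0"
        bits = []
        while n > 0:
            n, remainder = divmod(n, 2)
            bits.append(str(remainder))       # least significant digit first
        return "".join(reversed(bits))        # read the remainders from bottom to top

    def frac_to_binary(f, max_bits=12):
        """Convert a fraction 0 <= f < 1 to binary by repeated multiplication by 2."""
        bits = []
        while f > 0 and len(bits) < max_bits:  # stop if the expansion does not terminate
            f *= 2
            integer_part = int(f)
            bits.append(str(integer_part))     # most significant digit first
            f -= integer_part
        return "".join(bits)

    print(int_to_binary(23), frac_to_binary(0.625))   # 10111 101, i.e. 23.625 = 10111.101_2

Running it on 23 and 0.625 reproduces 10111_2 and 0.101_2; for a fraction whose
binary expansion does not terminate, the loop simply stops after max_bits digits.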

1.1 Fixed-Point Representation


! Each bit stores one binary digit of 0 or 1 (hence the name).

As computers are limited in the amount of memory they have to store
numbers, we must choose a finite set of ‘bits’ to represent our real-valued
numbers, split between the integer part of the number and the fractional
part of the number. This is referred to as fixed-point representation. For
example, a 16-bit fixed-point representation might choose to use 4 bits for
the integer, and 12 bits for the fractional part. Since the location of the
decimal place is fixed, we can write out the base-10 value of this example
fixed-point number convention as follows:

    base-10 value = \sum_{i=1}^{N} M_i 2^{I-i}                             (1.2)

where M is our binary fixed-point representation, M_i represents the ith digit,
N is the total number of bits, and I the number of integer bits. In our 16-bit
convention, we'd set N = 16 and I = 4.
Let's consider some real numbers in base-10, their conversion to base-2,
and their resulting 16-bit fixed-point representation (using 4 bits for the
integer, and 12 bits for the fractional part).

(a) 10 = 1010_2 → 1010.000000000000_2

(b) −0.8125 = −0.1101_2 → −0000.110100000000_2

(c) 9.2 = 1001.0011001100110011…_2 → 1001.001100110011_2

Now, let's convert the fixed-point representations back into base-10.

(a) 1010.000000000000_2 = 2^3 + 2^1 = 10

(b) −0000.110100000000_2 = −2^{-1} − 2^{-2} − 2^{-4} = −0.8125

(c) 1001.001100110011_2 = 2^3 + 2^0 + 2^{-3} + 2^{-4} + 2^{-7} + 2^{-8} + 2^{-11} + 2^{-12}
                        = 9.19995

What is happening here? In the first two examples, 10 and −0.8125 were able
to be represented precisely using the 16 bits available to us in our fixed-point
representation. On the other hand, when converted to binary, 9.2 requires
more than 16 bits in the fractional part to properly represent its value. In
fact, in the case of 9.2, the binary decimal is non-terminating! This is similar
to how 1/3 = 0.3333 . . . is represented in decimal form in base-10. As we
can’t construct a computer with infinite memory, we must truncate the binary
representation of numbers to fit within our fixed-point representation. This
is known as round-off error or truncation error.
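
To see this truncation error concretely, here is a minimal Python sketch of the
16-bit convention above (4 integer and 12 fractional bits), implemented by
scaling by 2^12 and truncating; this is one common way to realise fixed point,
not the only one, and the helper names are our own:

    FRACTIONAL_BITS = 12   # 4 integer bits + 12 fractional bits = 16 bits total

    def to_fixed_point(x):
        """Encode x by scaling by 2**12 and truncating to an integer."""
        return int(x * 2**FRACTIONAL_BITS)

    def from_fixed_point(m):
        """Decode the stored integer back to a real number."""
        return m / 2**FRACTIONAL_BITS

    for x in (10, -0.8125, 9.2):
        print(x, "->", from_fixed_point(to_fixed_point(x)))

Here 10 and −0.8125 survive the round trip exactly, while 9.2 comes back as
9.199951171875, exactly as in example (c) above.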

1.2 Floating-Point Representation


In general usage, computers do not use fixed-point representation to encode
real numbers. Not being able to control exactly where the decimal point goes
in fixed-point representation is inconvenient in most cases. For example, to
represent 1034832.12 in fixed point, we could reduce truncation error by
having a large number of integer bits, whereas to represent 0.84719830 we
would reduce truncation error by having a larger number of fractional bits.
The solution to this problem is to use floating-point representation,
which works in a similar manner to scientific notation in ensuring that all
numbers, regardless of their size, are represented using the same number of
significant figures. Using our earlier examples, let's write these in scientific
notation in base-2, using 8 significant figures:

(a) 10 = 1010_2 = 1.0100000_2 × 2^3

(b) −0.8125 = −0.1101_2 = −1.1010000_2 × 2^{-1}

(c) 9.2 = 1001.0011001100110011…_2 ≈ 1.0010011_2 × 2^3

Unlike base-10, we use exponents of 2 in our base-2 scientific notation; this is
because, in binary positional notation, shifting the binary point by n places
corresponds to multiplying the number by 2^n. Even more importantly, as we are
restricting the number of significant figures in our base-2 scientific notation,
there is still truncation error in our representation of 9.2.
Floating-point representation as used by digital computers is written in
the following normalised form:

    (−1)^S × (1 + \sum_i M_i 2^{-i}) × 2^E,    E = e − d                   (1.3)

! You might sometimes see 1 + \sum_i M_i 2^{-i} written as 1.M, signifying the
implied leading one and the fractional significant digits of the mantissa.
where

• S is the sign bit: S = 0 for positive numbers, S = 1 for negative numbers,

• M is the mantissa: bits that encode the fractional significant figures
  (the significant figures to the right of the decimal point; the leading
  1 is implied),

• E is the exponent of the number to be encoded,

• d is the offset or excess: the excess allows the representation of negative
  exponents, and is determined by the floating-point standard used,

• e: bits that encode the exponent after being shifted by the excess.

! For example, if we have the binary number 1.01_2 × 2^5 and an excess of 16, then
  • S = 0
  • M = 01_2 (drop the leading 1)
  • E = 5
  • e = E + d = 21 = 10101_2

The most common standard for floating-point representation is IEEE754; in
fact, this standard is so universal that it will be very unlikely for you to
encounter a computer that doesn't use it (unless you happen to find yourself
using a mainframe or very specialised supercomputer!).

1.2.1 IEEE754 32-Bit Single Precision

The IEEE754 32 bit single precision floating-point number is so-called because
of the 32 bits required to implement this standard. The bits are assigned as
follows: 1 bit for the sign bit, 8 bits for the exponent, and 23 bits for
the mantissa. Because of the implicit leading digit in the mantissa, IEEE754
32 bit single precision therefore has 24 bits of precision, while the 8 bits in
the exponent have an excess of d = 127.

! 24 bits of binary precision gives log_10(2^24) ≈ 7.22 significant digits in base-10.

! With only 8 bits, the largest integer we can represent in binary is
2^8 − 1 = 255: 11111111_2. This gives a range of [0, 255]. By introducing an
excess of 2^8/2 − 1 = 127, we shift this range to [−127, 128], allowing us to
encode negative exponents.

To see how this works, let's have a look at a quick example.

Example 1.1
What is the IEEE754 single precision floating-point representation of (a)
−0.8125, and (b) 9.2?

Solution:

(a) We know that −0.8125 = −0.1101_2 = −1.101_2 × 2^{-1}.
Since this is a negative number, the sign bit has a value of 1, S = 1.
The exponent is −1; accounting for the excess, we therefore have
e = −1 + 127 = 126 = 01111110_2.
Finally, we drop the leading 1 to form the mantissa. Therefore,
M = 10100000000000000000000_2.
Thus, in single-precision floating point representation, −0.8125 is given
by 1 01111110 10100000000000000000000

(b) We know that 9.2 = 1001.0011001100110011…_2 ≈ 1.0010011_2 × 2^3.
Since this is a positive number, the sign bit has a value of 0, S = 0.
The exponent is 3; accounting for the excess, we therefore have
e = 3 + 127 = 130 = 10000010_2.
Finally, we drop the leading 1 to form the mantissa. Therefore,
M = 00100110011001100110011_2.
Thus, in single-precision floating point representation, 9.2 is given by
0 10000010 00100110011001100110011
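
You can verify these bit patterns on your own machine. The sketch below uses
Python's standard struct module to read out the raw 32 bits of a single
precision value and split them into the sign, exponent, and mantissa fields
(the helper name float32_fields is ours):

    import struct

    def float32_fields(x):
        """Return the sign, exponent, and mantissa bit strings of x as an IEEE754 float32."""
        (raw,) = struct.unpack(">I", struct.pack(">f", x))  # raw 32-bit pattern, big-endian
        bits = f"{raw:032b}"
        return bits[0], bits[1:9], bits[9:]   # 1 sign bit, 8 exponent bits, 23 mantissa bits

    print(float32_fields(-0.8125))  # ('1', '01111110', '10100000000000000000000')
    print(float32_fields(9.2))      # ('0', '10000010', '00100110011001100110011')

Packing as a big-endian float (">f") and unpacking as a big-endian unsigned
integer (">I") guarantees the bits are read in the conventional sign, exponent,
mantissa order.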

Problem question

Convert the single precision floating-point representation of 9.2, calculated
above, back into decimal. What is the value actually stored in
0 10000010 00100110011001100110011?
What is the percentage difference between the actual stored value and 9.2?
This is the truncation error.

Before we move on, let's try encoding 0 as a single precision floating-point
number. It doesn't really matter what sign we pick, so we'll let S = 0. For
the exponent, we want the smallest possible value; this corresponds to
e = 0, or 2^{-127}. So that gives us the normalised floating-point representation
of

    1.M × 2^{-127}

But wait, we've forgotten about the leading 1 in the mantissa: so even if
we choose M = 0, we are still left with 1 × 2^{-127}, or 5.88 × 10^{-39} in base-10.
A very small number for sure, but definitely not zero!

! Actually, IEEE754 allows a signed zero; i.e. there is slightly differing
behaviour for positive zero +0 and negative zero −0.

In order to make sure that it is possible to encode zero in floating-point
representation, IEEE754 mandates that if the exponent bits are all zero (e = 0),
then the implied leading 1 in the mantissa is replaced by an implied leading 0,
and the represented exponent is E = −d + 1, giving E = −126 in the single
precision case. This is known as denormalised form. We can therefore
use denormalised form to encode zero as an IEEE754 single precision floating
point number:

0 00000000 00000000000000000000000 ≡ (−1)^0 × 0 × 2^{-126} = 0

So, in summary:

• If 1 ≤ e ≤ 2d (or −126 ≤ E ≤ 127), we use normalised form, and the
  mantissa has an implied leading 1:

      (−1)^S × 1.M × 2^{e−d}

• If e = 0, we use denormalised form. We set E = 1 − d = −126, and the
  mantissa has an implied leading 0:

      (−1)^S × 0.M × 2^{1−d}
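
The two cases above translate directly into code. The following sketch (a
hand-rolled decoder of ours, not a library routine) takes a bit pattern written
as 'sign exponent mantissa' and applies the normalised or denormalised rule as
appropriate; the reserved all-ones exponent, discussed in the next section, is
not handled here:

    def decode_float32(pattern):
        """Decode a 'S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM' bit string using d = 127."""
        s, e_bits, m_bits = pattern.split()
        sign = (-1) ** int(s)
        e = int(e_bits, 2)
        frac = sum(int(bit) * 2**-(i + 1) for i, bit in enumerate(m_bits))
        if e == 0:                                 # denormalised: implied leading 0, E = 1 - d
            return sign * frac * 2**-126
        return sign * (1 + frac) * 2**(e - 127)    # normalised: implied leading 1

    print(decode_float32("1 01111110 10100000000000000000000"))  # -0.8125
    print(decode_float32("0 00000000 00000000000000000000000"))  # 0.0
    print(decode_float32("0 00000000 00000000000000000000001"))  # ~1.4e-45, a denormalised value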

Example 1.2
What are the (a) smallest and (b) largest possible positive single precision
floating point numbers? (Ignore the case of zero.)

Solution:
(a) The smallest possible positive single precision number has e = 0, and
so must be denormalised. As it is larger than zero, it therefore has a
mantissa of 1 in the least significant bit position, and an exponent of
E = −126:

    0 00000000 00000000000000000000001

Converting this into base-10:

    0.M × 2^{-126} = 2^{-23} × 2^{-126} = 2^{-149} ≈ 1.40 × 10^{-45}

(b) The largest possible positive single precision number has E = 127, and
so must be normalised. It has a mantissa of all 1s:

    0 11111110 11111111111111111111111

Converting this into base-10:

    1.M × 2^{127} = (1 + \sum_{n=1}^{23} 2^{-n}) × 2^{127} ≈ 3.40 × 10^{38}
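
If NumPy is available, both extremes can be checked against its finfo report;
nextafter is used here to step from zero up to the smallest positive
(denormalised) value:

    import numpy as np

    print(np.finfo(np.float32).max)                    # ~3.4028235e+38, largest normalised value
    print(np.nextafter(np.float32(0), np.float32(1)))  # ~1.4e-45, smallest positive denormalised value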

Problem question

Calculate (a) the largest possible denormalised single precision number,
and (b) the smallest possible normalised single precision number. What
do you notice? Do they overlap, or is there a gap?

1.2.2 Non-numeric Values


The eagle-eyed reader might have noticed something odd in the above section.
“Hang on, if the exponent has 8 bits, then

    11111111_2 = \sum_{i=0}^{7} 2^i = 255

is the largest possible exponent. Subtracting the excess gives 255 − 127 = 128.
Then why did you say that the largest exponent in normalised form is only
127?!”

Well, uh... you caught me. Turns out, the case where the exponent is all 1s
is reserved in IEEE754 for non-numeric values.

• Infinity: If the exponent is all 1s, and the mantissa is all zeros, this
represents positive or negative infinity, depending on the sign bit:

1 11111111 00000000000000000000000 ≡ −∞
0 11111111 00000000000000000000000 ≡ +∞

This behaves like you'd expect; any operation whose result exceeds the largest
possible floating-point number will overflow to +∞. Adding or multiplying
infinities will result in further infinities.

• Not a Number (NaN): If the exponent is all 1s, and the mantissa is
non-zero, this represents NaN:

    0 11111111 00000001000000100001001 ≡ NaN

NaNs are the result of invalid operations, such as taking the square root of a
negative number (IEEE754 floating-point numbers represent real numbers),
or attempting to divide zero by zero.
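
Both special values are straightforward to produce in NumPy single precision;
by default NumPy issues overflow and invalid-value warnings for these
operations rather than stopping, so they are silenced here:

    import numpy as np

    with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
        largest = np.finfo(np.float32).max
        print(np.float32(2) * largest)        # inf: the result overflows the largest float32
        print(np.sqrt(np.float32(-1)))        # nan: square root of a negative real number
        print(np.float32(0) / np.float32(0))  # nan: 0/0 is an invalid operation

    print(np.inf + 1, np.inf * 2)             # inf inf: arithmetic with infinities stays infinite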

1.2.3 IEEE754 64 Bit Double Precision


While single-precision floating point numbers are extremely useful in day-to-
day calculation, sometimes we need more precision (that is, more significant
digits — requiring a higher number of bits in the mantissa), or a larger range
(more bits in the exponent). The IEEE754 64 bit double precision floating-
point standard helps with both of these issues, as we now have 64 bits to
work with; this is divided up to give 1 sign bit (as before), 11 exponent bits,
and 52 mantissa bits. With 11 exponent bits, we can form integers from 0 to
2^11 − 1 = 2047; thus, the excess or offset for double precision floating-point
numbers is 2^11/2 − 1 = 1023. Taking into account the denormalised case,
this allows exponents to range from 2^{-1022} to 2^{1023}.
To see what the double precision floating-point range is, consider the
largest possible positive number:

0 11111111110 1111111111111111111111111111111111111111111111111111

In binary, this is equivalent to

    (1 + \sum_{i=1}^{52} 2^{-i}) × 2^{1023} ≈ 1.80 × 10^{308}              (1.4)

This is roughly 10^{270} times larger than the maximum possible single precision
value of only 3.40 × 10^{38}.

Furthermore, the 53 binary bits of precision (52 from the mantissa, and 1
from the implied leading digit) results in log_10(2^53) ≈ 15.9 significant figures
in base-10, more than double the number of significant digits in the single
precision case. You can see how double precision numbers can be much more
useful when you need to deal with large numbers and many significant digits!
We can go even higher than double precision if we’d like; an IEEE754
standard also exists for 128 bit quadruple (quad) precision — containing
1 sign bit, 15 exponent bits, and 112 mantissa bits, for 128 bits total. A
summary of single, double, and quad precision is provided in Table 1.1.
! If the IEEE754 floating point standard has an excess of d, then the allowed
exponent range is [1 − d, d].

Name     Sign bit   Exponent bits   Mantissa bits   Total bits   Excess   Base-10 sig. figs.
Single   1          8               23              32           127      ∼7.2
Double   1          11              52              64           1023     ∼15.9
Quad     1          15              112             128          16383    ∼34.0

Table 1.1: IEEE754 floating-point number standards
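
The single and double precision rows of Table 1.1 can be checked with NumPy's
finfo (quad precision is not a portable NumPy dtype, so it is omitted here);
note that finfo quotes the number of guaranteed decimal digits, which is
slightly lower than the log_10(2^{M+1}) estimates in the table:

    import numpy as np

    for dtype in (np.float32, np.float64):
        info = np.finfo(dtype)
        print(f"{dtype.__name__}: {info.nexp} exponent bits, {info.nmant} mantissa bits, "
              f"excess {info.maxexp - 1}, ~{info.precision} guaranteed decimal digits")
    # float32: 8 exponent bits, 23 mantissa bits, excess 127, ~6 guaranteed decimal digits
    # float64: 11 exponent bits, 52 mantissa bits, excess 1023, ~15 guaranteed decimal digits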

Problem question

Using the identity

    \sum_{i=1}^{N} 2^{-i} = 1 − 2^{-N},

derive expressions for the largest and smallest positive IEEE754 floating-
point number with e exponent bits and m mantissa bits. Assume that
when the exponent bits are all zero, the floating-point number is denor-
malised.
Hint: if there are e exponent bits, then the excess is given by d = 2^e/2 − 1.
1.3 Floating-Point Arithmetic

When working through theoretical calculations using our standard mathematical
toolset, we are used to applying the standard properties of arithmetic; for
example, multiplicative and additive commutativity, associativity, and
distributivity. But do these properties still apply in the case of arithmetic
performed on computers using floating-point representation? Commutativity
still holds, as the order we multiply or add floating-point numbers should
not affect the result. But what about associativity? Let's work through a
quick example to see what happens.

• Property of commutativity:
      x + y = y + x
      x × y = y × x
• Property of associativity:
      (x + y) + z = x + (y + z)
      (x × y) × z = x × (y × z)
• Property of distributivity:
      x × (y + z) = x × y + x × z

Table 1.2: Standard properties of arithmetic

Example 1.3
Consider the expression −5 + 5 + 0.05. Using the property of associativity, we
could calculate this using either

(a) (−5 + 5) + 0.05, or


(b) −5 + (5 + 0.05)

These are equivalent if we can represent these numbers exactly, with no loss
of precision.
Without evaluating (a) or (b), convert them into single precision floating-
point representation. Are they still equivalent?

Solution: We’ll start by converting the numbers into single precision floating-
point representation; that is, converting them into binary with 24 significant
digits.

• 5 → 1.01000000000000000000000_2 × 2^2
• 0.05 → 1.10011001100110011001101_2 × 2^{-5}

Using this to evaluate the two expressions (a) and (b):

(a) For the first expression, we have 5 − 5 = 0 regardless of whether we use
exact or floating-point representation; a number minus itself should
always give zero. Therefore, after calculating the result, we are left with
our floating-point number representing 0.05. Converting this back into
base-10,

    1.10011001100110011001101_2 × 2^{-5} → 0.050000000745058059692…

The error we see in the 10th decimal place is round-off/truncation error,
due to the limited precision.

(b) For this case, we must first add 5 and 0.05 in single precision floating-
point representation. In order to add them together, we need 0.05 to
have the same exponent in floating-point form as 5:

    1.10011001100110011001101_2 × 2^{-5} → 0.00000011001100110011001_2 × 2^2

Note that we must truncate the rescaled value after 24 significant figures,
as we only use 24 bits in single-precision. Adding 5 and then subtracting
5 results in no change. Converting to base-10,

    0.00000011001100110011001_2 × 2^2 → 0.049999713897705078125…

So, in the above example, we can see that associativity doesn’t hold for
floating point numbers — by changing the order in which we add floating-point
numbers together, we change the amount of truncation error that occurs! This
can’t be eliminated entirely; all we can do is to try and perform operations
on numbers of similar magnitude first, in order to reduce truncation error.
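
Example 1.3 is easy to reproduce in NumPy single precision. The exact digits
differ slightly from the hand calculation above, because the hardware rounds
(rather than truncates) each intermediate result, but the two groupings still
disagree:

    import numpy as np

    a, b, c = np.float32(-5), np.float32(5), np.float32(0.05)

    print((a + b) + c)                 # prints 0.05: -5 + 5 cancels exactly, leaving the stored 0.05
    print(a + (b + c))                 # prints a value slightly above 0.05: low-order bits of 0.05
                                       # are lost when it is aligned to the exponent of 5
    print((a + b) + c == a + (b + c))  # False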

Like associativity, the property of distributivity also does not always hold
for floating-point arithmetic. In fact, simple equations such as

    0.1 + 0.1 + 0.1 = 0.3

are no longer perfectly guaranteed in floating-point arithmetic. To see why,
let's convert 0.1 to a single-precision floating-point number:

    0.1 → 1.10011001100110011001101_2 × 2^{-4}

Converting back to base-10, we see that truncation error has occurred, with the
floating-point representation slightly overestimating the value of 0.1:

    1.10011001100110011001101_2 × 2^{-4} = 0.10000000149011611938…

Therefore, adding three copies of this stored value in single-precision floating
point arithmetic gives

    0.1 + 0.1 + 0.1 = 0.30000000447034835815429688…

But what is 0.3 in single precision floating-point representation? Doing a
quick conversion, we find that

    0.3 → 0.30000001192092895508…

which is a different result from 0.1 + 0.1 + 0.1. Therefore, in single precision
floating-point arithmetic, 0.1 + 0.1 + 0.1 ≠ 0.3.
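
You can observe the same effect on any machine with Python's built-in floats,
which are IEEE754 double precision numbers; the stored values and round-off
differ from the single precision case worked through above, but the conclusion
is the same:

    x = 0.1 + 0.1 + 0.1         # each 0.1 is stored inexactly, and each addition is rounded

    print(x)                    # 0.30000000000000004
    print(x == 0.3)             # False
    print(abs(x - 0.3) < 1e-9)  # True: in practice, compare floats to within a tolerance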

1.3.1 Machine Epsilon


In order to get a rough ‘upper bound’ on the relative round-off error that oc-
curs during floating-point arithmetic, it is common to see standards (such as
IEEE754), programming languages (such as Fortran and Python), and soft-
ware packages (such as Mathematica) quote something called the machine
epsilon.

Machine Epsilon and Round-Off Error

Machine epsilon is defined by the difference between the floating-point rep-
resentation of 1 and the next highest number. In single precision, we know
that

    1 ≡ 1.00000000000000000000000_2

The next highest number is represented by changing the least significant bit
in the mantissa to 1, giving

    1.00000000000000000000001_2

Therefore, in single precision IEEE754 floating-point representation,

    ε = 1.00000000000000000000001_2 − 1.00000000000000000000000_2 = 2^{-23}

Repeating this process for double precision floating-point numbers results in
ε = 2^{-52}. In general, if the mantissa is encoded using M bits (resulting in
M + 1 significant figures), then the machine epsilon is ε = 2^{-M}.
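
Machine epsilon can be found empirically by halving a trial value until adding
it to 1 no longer changes the result; if NumPy is available, finfo reports the
same quantity directly:

    import numpy as np

    def machine_epsilon(dtype):
        """Halve eps until 1 + eps is indistinguishable from 1 in the given precision."""
        one, eps = dtype(1), dtype(1)
        while one + eps / dtype(2) > one:
            eps = eps / dtype(2)
        return eps

    print(machine_epsilon(np.float32), np.finfo(np.float32).eps)  # 1.1920929e-07  = 2**-23
    print(machine_epsilon(np.float64), np.finfo(np.float64).eps)  # 2.220446e-16   = 2**-52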

The important take-away here is that this value represents the relative spacing
between neighbouring floating-point numbers, and as such provides an upper
bound on the relative truncation error introduced by floating-point
representation.
To see how this works, let's consider a floating-point representation of
a number x; we'll denote this fl(x). To calculate the relative error of
the floating-point representation, we subtract the original exact value x, and
divide by x:

    relative error = (fl(x) − x) / x

Now, we know that the relative error must be bounded by the machine epsilon,
so substituting this in to form an inequality:

    −ε ≤ (fl(x) − x) / x ≤ ε                                               (1.5)
By taking the absolute value and rearranging this equation, we also come
across an expression for an upper bound on the absolute error of the
floating-point representation:

    |fl(x) − x| ≤ ε|x|

! This result also generalises to operations on floating-point numbers. For
example, the square root

    |fl(√x) − √x| ≤ ε|√x|

and addition

    |fl(x + y) − (x + y)| ≤ ε|x + y|

Ultimately, however, while the IEEE754 floating-point numbers have
their quirks and setbacks, alternative proposals and standards have their
own oddities. The sheer dominance of the IEEE754 standard in modern
computing means that, although we sometimes still have to worry about
precision, accuracy, and round-off errors, we can do so using a standard that
will work the same way on almost all modern machines.
In the next two chapters, we’ll start putting our knowledge of floating-
point representation into practice — by actually bringing them into existence
on a computer near you — using either Fortran (Chap. 2) or Python (Chap.
3).2 It is up to you which programming language you would like to delve into,
as Chap. 4 and onwards will include both Fortran and Python examples.
Further reading

This chapter introduces the topic of floating-point arithmetic, and forces
you to consider a topic (number representation) that normally does not
appear in most analytic work. As we will see repeatedly in this text,
however, you could write a whole textbook on this topic — and people
have! If you would like to read more on floating-point representation,
how it’s implemented, and relevant applications to computational science
and mathematics, you cannot go past the Handbook of Floating-Point
Arithmetic, the definitive textbook on the subject.

² This is a choose your own adventure textbook!

• Muller, J.M., Brunie, N., de Dinechin, F., et al. (2018). Handbook
of Floating-Point Arithmetic. New York: Springer International
Publishing, ISBN 978-3-319-76526-6.
