Chapter 1 - Izaac-Wang - Computational Quantum Mechanics (2018)
! If working outside of base-10, we use the notation X_B to denote that X is a number represented in base-B.

You should be quite familiar with base-10 arithmetic¹; digital computers, on the other hand, use base-2 or binary representation for arithmetic. Like base-10, binary is also written using positional notation, however restricted to only allowing the digits 0 and 1. For example, the binary number 10110.1₂ is converted to base-10 as follows:
10110.1₂ = 1 × 2^4 + 0 × 2^3 + 1 × 2^2 + 1 × 2^1 + 0 × 2^0 + 1 × 2^-1
         = 16 + 4 + 2 + 1/2
         = 22.5                                                        (1.1)

Figure 1.1 The value of binary digits in fixed point representation

Similarly to base-10, the position of the digits relative to the decimal point determines the value of the digit: if a digit is n places to the left of the decimal point, it has a value of 2^(n-1), whilst if it is n places to the right, it has a value of 2^-n (see Fig. 1.1).
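If you'd like to verify this kind of conversion on a computer, here is a minimal Python sketch (not from the text; the helper name binary_to_decimal is ours) that applies the positional rule digit by digit:

```python
def binary_to_decimal(bits):
    """Evaluate a string of binary digits, optionally containing a point."""
    integer, _, fraction = bits.partition(".")
    value = int(integer, 2) if integer else 0      # digits to the left of the point
    for n, digit in enumerate(fraction, start=1):  # n places to the right of the point
        value += int(digit) * 2.0 ** (-n)          # each contributes digit × 2^-n
    return value

print(binary_to_decimal("10110.1"))   # 22.5, in agreement with Eq. (1.1)
```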
Converting from base-10 to binary is a little more involved, but still quite
elementary. For example, to convert the number 23.625 to binary, we start
with the integer component 23, and continuously divide by 2, storing each
remainder until we get a result of 0:
23/2 = 11 remainder 1
⇒ 11/2 = 5 remainder 1
⇒ 5/2 = 2 remainder 1
⇒ 2/2 = 1 remainder 0
⇒ 1/2 = 0 remainder 1
The remainder of the first division gives us the least significant digit,
and the remainder of the final division gives us the most significant digit.
Therefore, reading the remainders from bottom to top gives us 23 = 10111₂.
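As a quick illustration, the repeated-division procedure can be sketched in a few lines of Python (the helper name is ours, not the book's):

```python
def integer_to_binary(n):
    """Binary digits of a non-negative integer, returned as a string."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)        # divide by 2, keep the remainder
        remainders.append(str(r))  # least significant digit first
    return "".join(reversed(remainders))  # read the remainders bottom to top

print(integer_to_binary(23))   # 10111
```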
¹ We are assuming that your physics undergraduate course was taught in base-10!

To convert the fractional part 0.625, we instead multiply the fractional part by 2 each time, storing each integer part until we have a fractional part of zero:
0.625 × 2 = 1.25
⇒ 0.25 × 2 = 0.5
⇒ 0.5 × 2 = 1.0
The first multiplication gives us the most significant digit, and the final
multiplication gives us the least significant digit. So, reading from top to
bottom, 0.625 = 0.101₂.
Putting these both together, we have 23.625 = 10111.101₂.
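The repeated-multiplication procedure for the fractional part can be sketched similarly (again, the helper name and the max_bits cut-off are ours; the cut-off guards against the non-terminating expansions discussed next):

```python
def fraction_to_binary(x, max_bits=30):
    """Binary digits of a fraction 0 <= x < 1, returned as a string."""
    digits = []
    while x > 0 and len(digits) < max_bits:
        x *= 2
        digits.append(str(int(x)))  # store the integer part each time
        x -= int(x)                 # keep only the fractional part
    return "".join(digits)

print(fraction_to_binary(0.625))   # 101, so 23.625 = 10111.101 in binary
print(fraction_to_binary(0.2))     # 001100110011001100110011001100 (0.2 never terminates in binary)
```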
(a) 1010.000000000000₂ = 2^3 + 2^1 = 10
What is happening here? In the first two examples, 10 and −0.8125 could be represented exactly using the 16 bits available to us in our fixed-point representation. On the other hand, when converted to binary, 9.2 requires more than 16 bits in the fractional part to properly represent its value. In fact, in the case of 9.2, the binary expansion is non-terminating! This is similar to how 1/3 = 0.3333 . . . never terminates in base-10. As we can't construct a computer with infinite memory, we must truncate the binary representation of numbers to fit within our fixed-point representation. This is known as round-off error or truncation error.
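You can see the effect of this truncation directly in Python, whose built-in floats also store only a finite number of binary digits (64 of them, rather than our 16-bit fixed-point format); Decimal lets us display the stored value exactly (a quick sketch, not from the text):

```python
from decimal import Decimal

# 9.2 cannot be stored exactly in a finite number of binary digits, so the
# value that actually gets stored is very slightly different:
print(Decimal(9.2))   # 9.199999999999999289457264239899814128875732421875
print(Decimal(0.5))   # 0.5 (a terminating binary fraction is stored exactly)
```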
where
• e: the bits that encode the exponent after being shifted by the excess.

! For example, if we have the binary number 1.01₂ × 2^5 and an excess of 16, then
  • s = 0
  • M = 01₂ (drop the leading 1)
  • E = 5
  • e = E + d = 21 = 10101₂

The most common standard for floating-point representation is IEEE754; in fact, this standard is so universal that it will be very unlikely for you to encounter a computer that doesn't use it (unless you happen to find yourself using a mainframe or very specialised supercomputer!).
1.2.1 IEEE754 32-Bit Single Precision
The IEEE754 32 bit single precision floating-point number is so-called because of the 32 bits required to implement this standard. The bits are assigned as follows: 1 bit for the sign bit, 8 bits for the exponent, and 23 bits for the mantissa.
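To make the layout concrete, here is a small Python sketch (the helper float32_bits is our own, built on the standard struct module) that prints the sign, exponent, and mantissa fields of a single precision number:

```python
import struct

def float32_bits(x):
    """Return the sign, exponent, and mantissa bit fields of a float32."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the 32 bits as an integer
    bits = f"{raw:032b}"
    return bits[0], bits[1:9], bits[9:]                 # 1 + 8 + 23 bits

sign, exponent, mantissa = float32_bits(-0.8125)
print(sign, exponent, mantissa)
# 1 01111110 10100000000000000000000
# i.e. s = 1, e = 126 so E = 126 - 127 = -1, and -1.101₂ × 2^-1 = -0.8125
```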
Before we move on, let's try encoding 0 as a single precision floating-point number. It doesn't really matter what sign we pick, so we'll let S = 0. For the exponent, we want the smallest possible value; this corresponds to e = 0, or 2^-127. So that gives us the normalised floating-point representation of

1.M × 2^-127

! Actually, IEEE754 allows a signed zero; i.e. there is slightly differing behaviour for positive zero +0 and negative zero −0.

But wait, we've forgotten about the leading 1 in the mantissa; so even if we choose M = 0, we are still left with 1 × 2^-127, or 5.88 × 10^-39 in base-10. A very small number for sure, but definitely not zero!
In order to make sure that it is possible to encode zero in floating-point representation, IEEE754 mandates that if the exponent bits are all zero (e = 0), then the implied leading 1 in the mantissa is replaced by an implied leading 0, and the represented exponent is E = −d + 1, giving E = −126 in the single precision case. This is known as denormalised form. We can therefore use denormalised form to encode zero as an IEEE754 single precision floating point number:

0 00000000 00000000000000000000000

So, in summary: if the exponent bits are all zero, the number is denormalised and represents 0.M × 2^-126; otherwise it is normalised and represents 1.M × 2^E, where E = e − d.
Example 1.2
What are the (a) smallest and (b) largest possible positive single precision floating point numbers? (Ignore the case of zero.)
Solution:
(a) The smallest possible positive single precision number has e = 0, and
so must be denormalised. As it is larger than zero, it therefore has a
mantissa of 1 in the least significant bit position, and an exponent of
E = −126:
0 00000000 00000000000000000000001
Converting this into base-10:
0.M × 2^-126 = 2^-23 × 2^-126 ≈ 1.40 × 10^-45
(b) The largest possible positive single precision number has E = 127, and
so must be normalised. It has a mantissa of all 1s:
0 11111110 11111111111111111111111
Converting this into base-10:
1.M × 2^127 = (1 + ∑_{n=1}^{23} 2^-n) × 2^127 ≈ 3.40 × 10^38
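These limits are easy to check against a library implementation. A quick verification sketch (using NumPy, which exposes the IEEE754 single precision parameters through np.finfo):

```python
import numpy as np

info = np.finfo(np.float32)
print(info.max)    # ≈ 3.4028235e+38, the largest normalised value
print(info.tiny)   # ≈ 1.1754944e-38, the smallest normalised value (1.0 × 2^-126)

# The smallest denormalised value is one step above zero:
print(np.nextafter(np.float32(0), np.float32(1)))   # ≈ 1.4e-45, i.e. 2^-23 × 2^-126
```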
"But hold on: with 8 exponent bits, the largest possible value of e is 11111111₂ = 255. Subtracting the excess gives 255 − 127 = 128. Then why did you say that the largest exponent in normalised form is only 127?!"
Well, uh... you caught me. Turns out, the case where the exponent is all 1s
is reserved in IEEE754 for non-numeric values.
• Infinity: If the exponent is all 1s, and the mantissa is all zeros, this
represents positive or negative infinity, depending on the sign bit:
1 11111111 00000000000000000000000 ≡ −∞
0 11111111 00000000000000000000000 ≡ +∞
This behaves like you'd expect; multiply the largest possible floating-point number by two, and you'll get +∞ as a result. Adding or multiplying infinities will result in further infinities.
• Not a Number (NaN): If the exponent is all 1s, and the mantissa is non-zero, this represents NaN. NaNs are the result of invalid operations, such as taking the square root of a negative number (IEEE754 floating-point numbers represent real numbers), or evaluating indeterminate forms such as 0/0 and ∞ − ∞.
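A quick sketch of these special values in action (using NumPy's float32 for single precision; the overflowing and invalid operations will also emit RuntimeWarnings):

```python
import numpy as np

largest = np.finfo(np.float32).max
print(largest * np.float32(2))          # inf: overflows past the largest single precision value
print(np.float32(-1) / np.float32(0))   # -inf: dividing a non-zero number by zero
print(np.sqrt(np.float32(-1)))          # nan: invalid operation on real numbers
print(np.float32(0) / np.float32(0))    # nan: 0/0 is indeterminate
```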
The same scheme extends to the IEEE754 64 bit double precision format, which assigns 1 bit for the sign, 11 bits for the exponent (with an excess of 1023), and 52 bits for the mantissa. The largest possible positive double precision number is therefore

0 11111111110 1111111111111111111111111111111111111111111111111111

or approximately 1.80 × 10^308 in base-10, roughly 10^270 times larger than the maximum possible single precision value of only 3.40 × 10^38.
Furthermore, the 53 binary bits of precision (52 from the mantissa, and 1 from the implied leading digit) result in log₁₀ 2^53 ≈ 15.9 significant figures in base-10, more than double the number of significant digits in the single precision case. You can see how double precision numbers can be much more useful when you need to deal with large numbers and many significant digits!
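The quoted numbers of significant figures follow directly from the bit counts; a one-line check in Python (an illustrative sketch):

```python
import math

# 53 bits of binary precision correspond to roughly 16 decimal significant
# figures, compared with about 7 for the 24 bits of single precision:
print(math.log10(2**53))   # ≈ 15.95
print(math.log10(2**24))   # ≈ 7.22
```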
We can go even higher than double precision if we’d like; an IEEE754
standard also exists for 128 bit quadruple (quad) precision — containing
1 sign bit, 15 exponent bits, and 112 mantissa bits, for 128 bits total. A
summary of single, double, and quad precision is provided in Table 1.1.
! If the IEEE754 floating point standard has an excess of d, then the allowed exponent range is [1 − d, d].

Table 1.1 Summary of the IEEE754 single, double, and quad precision formats

Name     Sign bit   Exponent bits   Mantissa bits   Total bits   Excess   Base-10 sig. figs.
Single   1          8               23              32           127      ~7.2
Double   1          11              52              64           1023     ~15.9
Quad     1          15              112             128          16383    ~34.0
Problem question
Derive expressions for the largest and smallest positive IEEE754 floating-point numbers with e exponent bits and m mantissa bits. Assume that when the exponent bits are all zero, the floating-point number is denormalised.
Hint: if there are e exponent bits, then the excess is given by d = 2^e/2 − 1 = 2^(e-1) − 1.
1.3 Floating-Point Arithmetic

Table 1.2 Standard properties of arithmetic
• Property of commutativity:
  x + y = y + x
  x × y = y × x
• Property of associativity:
  (x + y) + z = x + (y + z)
  (x × y) × z = x × (y × z)
• Property of distributivity:
  x × (y + z) = x × y + x × z

When working through theoretical calculations using our standard mathematical toolset, we are used to applying the standard properties of arithmetic, such as multiplicative and additive commutativity, associativity, and distributivity (see Table 1.2). But do these properties still apply in the case of arithmetic performed on computers using floating-point representation? Commutativity still holds, as the order in which we multiply or add floating-point numbers should not affect the result. But what about associativity? Let's work through a quick example to see what happens.
Example 1.3
Consider the expression −5 + 5 + 0.05. Using the property of associativity, we could calculate this using either

(a) (−5 + 5) + 0.05, or
(b) −5 + (5 + 0.05).

These are equivalent if we can represent these numbers exactly, with no loss of precision.
Without evaluating (a) or (b), convert them into single precision floating-
point representation. Are they still equivalent?
Solution: We'll start by converting the numbers into single precision floating-point representation; that is, converting them into binary with 24 significant digits:

• 5 → 1.01000000000000000000000₂ × 2^2
• 0.05 → 1.10011001100110011001101₂ × 2^-5

In case (a), −5 + 5 = 0 is computed exactly, and adding 0.05 then simply gives the single precision value of 0.05 above.

In case (b), to compute 5 + 0.05 we must first rescale 0.05 to the same exponent as 5, truncating the rescaled value after 24 significant figures, as we only have 24 bits of precision in single precision:

0.05 → 0.00000011001100110011001₂ × 2^2

Adding 5 and then subtracting 5 results in no further change. Converting to base-10,

0.00000011001100110011001₂ × 2^2 → 0.049999713897705078125 . . .

which is not the same as the value of 0.05 obtained in case (a).
So, in the above example, we can see that associativity doesn't hold for floating-point numbers: by changing the order in which we add floating-point numbers together, we change the amount of truncation error that occurs! This can't be eliminated entirely; all we can do is try to perform operations on numbers of similar magnitude first, in order to reduce truncation error.
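You can reproduce this behaviour directly in single precision. Note that a real IEEE754 machine rounds to nearest rather than simply truncating, so the digits differ slightly from the worked example above, but the two groupings still disagree (a sketch using NumPy's float32):

```python
import numpy as np

a = (np.float32(-5) + np.float32(5)) + np.float32(0.05)   # grouping (a)
b = np.float32(-5) + (np.float32(5) + np.float32(0.05))   # grouping (b)

print(f"{a:.20f}")   # 0.05000000074505805969
print(f"{b:.20f}")   # 0.05000019073486328125
print(a == b)        # False: associativity does not hold
```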
Like associativity, the property of distributivity also does not always hold for floating-point arithmetic. In fact, even simple equations such as

0.1 + 0.1 + 0.1 = 0.3

are no longer perfectly guaranteed in floating-point arithmetic. To see why, let's convert 0.1 to a single-precision floating-point number:

0.1 → 1.10011001100110011001101₂ × 2^-4
Converting back to base-10, we see that truncation error has occurred; the floating-point representation slightly overestimates the value of 0.1:

1.10011001100110011001101₂ × 2^-4 = 0.10000000149011611938 . . .

Therefore, in single-precision floating point arithmetic,

0.1 + 0.1 + 0.1 = 0.30000000447034835815429688 . . .
But what is 0.3 in single precision floating-point representation? Doing a
quick conversion, we find that
0.3 → 0.30000001192092895508

which is a different result from 0.1 + 0.1 + 0.1. Therefore, in single precision floating-point arithmetic, 0.1 + 0.1 + 0.1 ≠ 0.3.
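The same thing happens with Python's built-in floats, which are IEEE754 double precision; the extra bits make the error smaller, but do not remove it (a quick sketch):

```python
from decimal import Decimal

total = 0.1 + 0.1 + 0.1
print(total == 0.3)     # False
print(Decimal(total))   # 0.3000000000000000444089209850062616169452667236328125
print(Decimal(0.3))     # 0.299999999999999988897769753748434595763683319091796875
```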
The important take-away here is that this value represents the spacing between 1 and the next largest representable floating-point number, and as such it provides an upper bound on the relative truncation error introduced by floating-point numbers.
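For reference, the machine epsilon of the common IEEE754 formats can be queried directly (a sketch using NumPy; the bound is demonstrated here on the 9.2 example from earlier):

```python
import numpy as np

print(np.finfo(np.float32).eps)   # ≈ 1.19e-07, i.e. 2^-23
print(np.finfo(np.float64).eps)   # ≈ 2.22e-16, i.e. 2^-52

# The relative error of storing a value in single precision is bounded by eps:
x = 9.2
relative_error = abs(float(np.float32(x)) - x) / x
print(relative_error)                              # ≈ 2.1e-08
print(relative_error <= np.finfo(np.float32).eps)  # True
```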
To see how this works, let's consider the floating-point representation of a number x; we'll denote this fl(x). To calculate the relative error of the floating-point representation, we subtract the original exact value x, and divide by x:

relative error = (fl(x) − x) / x

Now, we know that the relative error must be bounded by the machine epsilon ε, so substituting this in to form an inequality:

−ε ≤ (fl(x) − x) / x ≤ ε          (1.5)
By taking the absolute value and rearranging this equation, we also come across an expression for an upper bound on the absolute error of the floating-point representation:

|fl(x) − x| ≤ ε|x|

! This result also generalises to operations on floating-point numbers. For example, the square root,

|fl(√x) − √x| ≤ ε|√x|

and addition,

|fl(x + y) − (x + y)| ≤ ε|x + y|

Ultimately, however, while the IEEE754 floating-point numbers have their quirks and setbacks, alternative proposals and standards have their own oddities. The sheer dominance of the IEEE754 standard in modern computing means that, although we sometimes still have to worry about precision, accuracy, and round-off errors, we can do so using a standard that will work the same way on almost all modern machines.
In the next two chapters, we'll start putting our knowledge of floating-point representation into practice, by actually bringing floating-point numbers into existence on a computer near you, using either Fortran (Chap. 2) or Python (Chap. 3).² It is up to you which programming language you would like to delve into, as Chap. 4 and onwards will include both Fortran and Python examples.
² This is a choose your own adventure textbook!

Further reading