Lec07 - Computer Arithmetic - Floating-Point Representation and Arithmetic
Fixed-Point Representation - Drawbacks
±S × B^(±E)
Store the number with three fields:
• Sign: plus (0) or minus (1)
• Significand S (or Mantissa or Fraction)
• Exponent E
Base B is implicit and need not be stored
Assumption: Radix point is to the right of MSB of significand
Typical 32-bit Floating-Point Format
± Significand × 2^(±exponent)
Leftmost bit (MSB): sign of the number (0 = positive, 1 = negative)
Exponent is stored in excess (biased) notation.
A fixed value, called the bias, is subtracted from the stored field to get the true exponent value.
Typically the bias equals 2^(k−1) − 1, where k is the number of bits in the binary exponent field.
Excess-127 (biased exponent) means:
• 8-bit exponent field
• 8 bits yield the stored values 0 to 255
• Subtract 127 (= 2^(8−1) − 1) to get the true exponent value
• True exponent range: −127 to +128
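The bias rule above can be sketched in a few lines of Python. The function names `bias`, `to_biased` and `to_true` are hypothetical helpers introduced only for this illustration.

```python
def bias(k: int) -> int:
    """Bias for a k-bit exponent field: 2^(k-1) - 1."""
    return (1 << (k - 1)) - 1


def to_biased(true_exponent: int, k: int = 8) -> int:
    """Value stored in the k-bit exponent field for a given true exponent."""
    stored = true_exponent + bias(k)
    if not 0 <= stored < (1 << k):
        raise ValueError("exponent does not fit in the field")
    return stored


def to_true(stored: int, k: int = 8) -> int:
    """True exponent recovered from the stored (biased) field."""
    return stored - bias(k)


print(bias(8))                    # 127
print(to_biased(20))              # 147
print(to_biased(-20))             # 107
print(to_true(0), to_true(255))   # -127 128
```

The last line reproduces the raw range of −127 to +128 for an 8-bit field.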
Normalization
Floating-Point Representation - Examples
[Figure: two example 32-bit words, both negative (sign bit 1)]
• First word: true exponent 20, biased exponent 127 + 20 = 147
• Second word (normalized): true exponent −20, biased exponent 127 − 20 = 107
Density of Floating-Point Numbers
Range and precision – Trade-off
IEEE Standard 754
Floating-Point Format (1)…
• Standard for floating-point storage, defined in IEEE 754 and adopted in 1985
• Facilitates portability of programs from one processor to another
• Widely adopted; used on virtually all processors and arithmetic coprocessors
• Defines 32-bit (single-precision) and 64-bit (double-precision) formats, with 8-bit and 11-bit exponents respectively
• Extended formats for intermediate results:
  o Additional bits in significand (extended precision)
  o Additional bits in exponent (extended range)
IEEE Standard 754
Floating-Point Format (2)…
IEEE Standard 754
Floating-Point Format (3)…
(a) Single format: sign bit (1 bit) | biased exponent (8 bits) | significand (23 bits)
(b) Double format: sign bit (1 bit) | biased exponent (11 bits) | significand (52 bits)
IEEE Standard 754
Floating-Point Format - Parameters
Parameter \ Format          Single           Single Extended   Double             Double Extended
Word width (bits)           32               ≥ 43              64                 ≥ 79
Exponent width (bits)       8                ≥ 11              11                 ≥ 15
Exponent bias               127              unspecified       1023               unspecified
Maximum exponent            127              ≥ 1023            1023               ≥ 16383
Minimum exponent            −126             ≤ −1022           −1022              ≤ −16382
Number range (base 10)      10^−38, 10^+38   unspecified       10^−308, 10^+308   unspecified
Significand width (bits)*   23               ≥ 31              52                 ≥ 63
Number of exponents         254              unspecified       2046               unspecified
Number of fractions         2^23             unspecified       2^52               unspecified
Number of values            1.98 × 2^31      unspecified       1.99 × 2^63        unspecified
* Does not include implied bit
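Several rows of the parameter table follow directly from the exponent width and the significand width. The sketch below derives them, assuming (as in IEEE 754) that the all-zeros and all-ones exponent fields are reserved for special values, which the table implies but does not state. `format_parameters` is a hypothetical helper name.

```python
def format_parameters(exp_bits: int, frac_bits: int) -> dict:
    """Derive table rows from the field widths of a basic format."""
    bias = (1 << (exp_bits - 1)) - 1
    return {
        "word_width": 1 + exp_bits + frac_bits,   # sign + exponent + significand
        "bias": bias,
        "max_exponent": bias,                     # stored field 2^k - 2, minus bias
        "min_exponent": 1 - bias,                 # stored field 1, minus bias
        "num_exponents": (1 << exp_bits) - 2,     # two field values are reserved
        "num_fractions": 1 << frac_bits,
    }


single = format_parameters(8, 23)
double = format_parameters(11, 52)
print(single["max_exponent"], single["min_exponent"], single["num_exponents"])
# 127 -126 254
print(double["max_exponent"], double["min_exponent"], double["num_exponents"])
# 1023 -1022 2046
```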
IEEE Standard 754
Floating-Point Format – Example 1
Convert the given numbers to IEEE single precision format:
(a) 199.953125₁₀ = 1100 0111.111101₂
    = 1.1000111111101₂ × 2^7
    Sign: + (0)
    Biased exponent: 7 + 127 = 134₁₀ = 1000 0110₂
    Significand: 1.1000111111101 (the bits after the binary point are stored, zero-padded to 23 bits)
    sign | biased exponent | significand
    0    | 1000 0110       | 1000 1111 1110 1000 0000 000
(b) ...
    Sign: − (1)
    Biased exponent: 6 + 127 = 133₁₀ = 1000 0101₂
    Significand: 1.00110110110... (23 bits after the binary point are stored)
    sign | biased exponent | significand
    1    | 1000 0101       | 0011 0110 1100 1100 1100 110
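A conversion like part (a) can be cross-checked with Python's standard `struct` module, which packs a value into the IEEE single-precision layout. This is only a verification sketch; `single_fields` is a hypothetical helper name.

```python
import struct


def single_fields(x: float) -> tuple:
    """Split the IEEE single-precision encoding of x into its three fields."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31
    biased_exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF          # the 23 stored significand bits
    return sign, biased_exponent, fraction


sign, exponent, fraction = single_fields(199.953125)
print(sign, exponent, format(fraction, "023b"))
# 0 134 10001111111010000000000
```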
IEEE Standard 754
Floating-Point Format – Example 2
Convert the given IEEE single precision floating-point numbers to their
decimal equivalent:
(a) 0100 0101 1001 1100 0100 0001 0000 0000₂
    sign | biased exponent | significand
    0    | 1000 1011       | 0011 1000 1000 0010 0000 000
    Sign: +
    True exponent: 139 − 127 = 12₁₀
    Significand: 1.001110001000001₂
    Value = +1.001110001000001₂ × 2^12 = 1001110001000.001₂
    = 5000.125₁₀
(b) Result: −1.1111001111111₂ × 2^9 = −1111100111.1111₂
    = −999.9375₁₀
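The decoding steps (split the fields, remove the bias, restore the implied leading 1) can be sketched as follows. The helper `decode_single` is hypothetical and handles only normalized numbers; exact rational arithmetic from the standard `fractions` module avoids any rounding in the check.

```python
from fractions import Fraction


def decode_single(bits: str) -> Fraction:
    """Decode a 32-bit IEEE single-precision pattern (normalized numbers only)."""
    bits = bits.replace(" ", "")
    if len(bits) != 32:
        raise ValueError("expected 32 bits")
    sign = -1 if bits[0] == "1" else 1
    biased = int(bits[1:9], 2)
    if biased in (0, 255):
        raise ValueError("special values are not handled in this sketch")
    # Implied leading 1 plus the 23 stored fraction bits.
    significand = 1 + Fraction(int(bits[9:], 2), 1 << 23)
    return sign * significand * Fraction(2) ** (biased - 127)


value = decode_single("0100 0101 1001 1100 0100 0001 0000 0000")
print(float(value))   # 5000.125
```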
Floating-Point Format – Interpretation of Numbers
Floating-point Arithmetic Operations
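For X = Xs × B^XE and Y = Ys × B^YE (with XE ≤ YE):
X + Y = (Xs × B^(XE−YE) + Ys) × B^YE
X − Y = (Xs × B^(XE−YE) − Ys) × B^YE
X × Y = (Xs × Ys) × B^(XE+YE)
X ÷ Y = (Xs ÷ Ys) × B^(XE−YE)
Examples:
X = 0.3 × 10^2 = 30
Y = 0.2 × 10^3 = 200
X + Y = (0.3 × 10^(2−3) + 0.2) × 10^3 = 0.23 × 10^3 = 230
X − Y = (0.3 × 10^(2−3) − 0.2) × 10^3 = (−0.17) × 10^3 = −170
X × Y = (0.3 × 0.2) × 10^(2+3) = 0.06 × 10^5 = 6000
X ÷ Y = (0.3 ÷ 0.2) × 10^(2−3) = 1.5 × 10^−1 = 0.15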
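The four operations can be checked with a small sketch that represents each number as a (significand, exponent) pair in base 10. The pair representation and the function names are assumptions made for this illustration; no normalization or rounding is modelled.

```python
from fractions import Fraction

B = 10  # base, implicit in the representation


def value(n):
    """Numeric value of a (significand, exponent) pair."""
    s, e = n
    return s * Fraction(B) ** e


def add(x, y):
    """X + Y: scale the number with the smaller exponent, then add significands."""
    (xs, xe), (ys, ye) = x, y
    if xe > ye:                       # ensure XE <= YE, as the formula assumes
        (xs, xe), (ys, ye) = (ys, ye), (xs, xe)
    return (xs * Fraction(B) ** (xe - ye) + ys, ye)


def sub(x, y):
    """X - Y: change the sign of Y, then add."""
    ys, ye = y
    return add(x, (-ys, ye))


def mul(x, y):
    """X * Y: multiply significands, add exponents."""
    return (x[0] * y[0], x[1] + y[1])


def div(x, y):
    """X / Y: divide significands, subtract exponents."""
    return (x[0] / y[0], x[1] - y[1])


X = (Fraction(3, 10), 2)   # 0.3 x 10^2
Y = (Fraction(2, 10), 3)   # 0.2 x 10^3
for op in (add, sub, mul, div):
    print(op.__name__, float(value(op(X, Y))))
# add 230.0
# sub -170.0
# mul 6000.0
# div 0.15
```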
Floating-point Arithmetic Operations – Conditions
  0.12345 × 10^5
+ 0.56789 × 10^3
= ?.????? × 10^?
Floating-Point Arithmetic +/- (3)…
Floating-Point Arithmetic +/- (4)…
Floating-Point Arithmetic +/- (5)…
Phase 3: Addition of significands
• After the numbers have been aligned, the two significands are added together, taking their signs into account.
• A carry out of the most significant bit can cause significand overflow:
  o If this occurs, the significand of the result is shifted right and the exponent is incremented.
  o Incrementing the exponent can in turn cause exponent overflow; this is reported and the operation stops.
Example:
    1.1101 × 2^4
  + 0.0101 × 2^4
  = 10.0010 × 2^4 → 1.0001 × 2^5
Phase 4: Normalization
• Lastly, the result is normalized by shifting significand digits left until the most significant digit is non-zero.
• Each shift decrements the exponent and could therefore cause exponent underflow.
• Finally, the result is rounded and reported.
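The phases can be sketched with exact rational significands. This is a simplified illustration, not a hardware-accurate model: the zero check and alignment steps are assumed from the usual add/subtract flow, and exponent overflow, exponent underflow and rounding are deliberately left out. `fp_add` is a hypothetical name.

```python
from fractions import Fraction


def fp_add(xs, xe, ys, ye):
    """Add xs * 2^xe and ys * 2^ye; significands are signed Fractions."""
    # Zero check: if one operand is zero, the other is the result.
    if xs == 0:
        return ys, ye
    if ys == 0:
        return xs, xe
    # Alignment: shift the significand with the smaller exponent right.
    while xe < ye:
        xs, xe = xs / 2, xe + 1
    while ye < xe:
        ys, ye = ys / 2, ye + 1
    # Phase 3: add signed significands.
    zs, ze = xs + ys, xe
    if zs == 0:
        return Fraction(0), 0
    if abs(zs) >= 2:                  # significand overflow: shift right
        zs, ze = zs / 2, ze + 1
    # Phase 4: normalize by shifting left until the leading digit is non-zero.
    while abs(zs) < 1:
        zs, ze = zs * 2, ze - 1
    return zs, ze


# 1.1101 x 2^4 + 0.0101 x 2^4 = 10.0010 x 2^4 -> 1.0001 x 2^5
print(fp_add(Fraction(0b11101, 16), 4, Fraction(0b00101, 16), 4))
# (Fraction(17, 16), 5)

# 1.01101 x 2^7 - 1.10101 x 2^6 -> 1.00101 x 2^6 (needs a left shift)
print(fp_add(Fraction(0b101101, 32), 7, -Fraction(0b110101, 32), 6))
# (Fraction(37, 32), 6)
```

Here 17/16 is 1.0001₂ and 37/32 is 1.00101₂.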
X + Y = Z (ADD)
[Flowchart boxes: ADD; SUBTRACT (change sign of Y); X = 0?; Y = 0?; Exponents equal?; Shift significand right; Significand = 0?; Add signed significands; Exponent overflow? (report overflow); Results normalized?; Exponent underflow?; Round result; RETURN]
Worked example:
X = 1.01101 × 2^7
Y = 1.10101 × 2^6
Exponents are not equal, so shift the significand of Y right: Y = 0.110101 × 2^7
Add signed significands:
    1.01101  × 2^7
  + 0.110101 × 2^7
  = 10.001111 × 2^7
Result is not normalized (carry out of the MSB): shift the significand right and increment the exponent
Z = 1.0001111 × 2^8
X − Y = Z (SUBTRACT)
[Flowchart boxes: SUBTRACT (change sign of Y); ADD; X = 0?; Y = 0?; Exponents equal?; Shift significand right; Significand = 0?; Add signed significands; Results normalized?; Exponent underflow?; Round result]
Worked example:
X = 1.01101 × 2^7
Y = 1.10101 × 2^6
Change sign of Y. Exponents are not equal, so shift the significand of Y right: Y = 0.110101 × 2^7
Add signed significands:
    1.01101  × 2^7
  − 0.110101 × 2^7
  = 0.100101 × 2^7
Result is not normalized: shift the significand left and decrement the exponent
Z = 1.00101 × 2^6
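(ii) Value 0.1101011 with four fraction bits retained:
     retained bits: 0.1101
     extra bits: 0.0000011
     LSB of retained bits: 0.0001
     The extra bits are less than half of the LSB (0.00001), so the stored value is 0.1101.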
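The rounding decision in case (ii) can be sketched on bit strings. The tie rule (exactly half of the LSB rounds to the even neighbour) is an assumption taken from the IEEE 754 default rounding mode, not from the slide; `round_fraction` is a hypothetical helper.

```python
def round_fraction(bits: str, keep: int) -> str:
    """Round a binary string such as '0.1101011' to `keep` fraction bits."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    whole, frac = bits.split(".")
    extra = len(frac) - keep            # number of bits to discard
    if extra <= 0:
        return bits
    value = int(whole + frac, 2)
    kept = value >> extra               # retained bits
    rest = value & ((1 << extra) - 1)   # extra (discarded) bits
    half = 1 << (extra - 1)             # half of the retained LSB
    # Round up if more than half; on an exact tie, round to even (assumption).
    if rest > half or (rest == half and kept & 1):
        kept += 1
    out = format(kept, "b").zfill(len(whole) + keep)
    return out[:-keep] + "." + out[-keep:]


print(round_fraction("0.1101011", 4))   # 0.1101  (extra bits below half)
print(round_fraction("0.1101101", 4))   # 0.1110  (extra bits above half)
```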
Sign: + (0)
Biased exponent: 9 + 127 = 136₁₀ = 1000 1000₂
Significand: 1.001110001011₂ (the bits after the binary point are stored, zero-padded to 23 bits)
sign | biased exponent | significand
0    | 1000 1000       | 0011 1000 1011 0000 0000 000
Floating-Point Arithmetic ×/÷
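Floating-point Multiplication
[Flowchart boxes: MULTIPLY (X × Y = Z); X = 0?; Y = 0?; Add exponents; Multiply significands; Normalize; Round; RETURN]
Worked example:
X = 6.25₁₀ = 110.01₂ = 1.1001₂ × 2^2
Y = 12.5₁₀ = 1100.1₂ = 1.1001₂ × 2^3
Add exponents: E1 = 127 + 2 = 129, E2 = 127 + 3 = 130, E1 + E2 = 259
The sum contains the bias twice, so it is subtracted once: 259 − 127 = 132 (true exponent 5)
Multiply significands: 1.1001₂ × 1.1001₂ = 10.01110001₂
Result: 10.01110001₂ × 2^5
Normalize: 1.001110001₂ × 2^6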
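The multiplication example can be reproduced with a short sketch that works on biased exponents. It assumes single-precision bias 127 and normalized inputs, and it omits the overflow, underflow and rounding checks; `fp_multiply` is a hypothetical name.

```python
from fractions import Fraction

BIAS = 127  # single-precision exponent bias


def fp_multiply(xs, xe_biased, ys, ye_biased):
    """Multiply two numbers given as (significand, biased exponent)."""
    if xs == 0 or ys == 0:
        return Fraction(0), 0
    # Each stored exponent includes the bias, so subtract it once from the sum.
    ze = xe_biased + ye_biased - BIAS
    zs = xs * ys
    # The product of two significands in [1, 2) lies in [1, 4):
    # at most one right shift is needed to normalize.
    if abs(zs) >= 2:
        zs, ze = zs / 2, ze + 1
    return zs, ze


x = (Fraction(0b11001, 16), 129)    # 6.25  = 1.1001 x 2^2
y = (Fraction(0b11001, 16), 130)    # 12.5  = 1.1001 x 2^3
zs, ze = fp_multiply(*x, *y)
print(zs, ze - BIAS)                            # 625/512 6
print(float(zs * Fraction(2) ** (ze - BIAS)))   # 78.125
```

Here 625/512 is 1.001110001₂, matching the normalized result 1.001110001₂ × 2^6.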
Floating-point Division
[Flowchart boxes: DIVIDE (X ÷ Y = Z); X = 0? (yes: Z ← 0); Y = 0? (yes: Z ← ∞); Subtract exponents; Add bias; Divide significands; Normalize; Round; RETURN]
Worked example:
X = 95.625₁₀ = 101 1111.101₂ = 1.011111101₂ × 2^6
Y = 3.75₁₀ = 11.11₂ = 1.111₂ × 2^1
Subtract exponents: E1 = 127 + 1 = 128 (divisor Y), E2 = 127 + 6 = 133 (dividend X), E2 − E1 = 5
Add bias: ET = 127 + 5 = 132
Divide significands: 1.011111101₂ ÷ 1.111₂ = 0.110011₂
Result: 0.110011₂ × 2^5
Normalize: 1.10011₂ × 2^4
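The division example can be reproduced in the same style. The sketch assumes single-precision bias 127 and normalized inputs, omits overflow, underflow and rounding, and simply raises an exception for a zero divisor instead of modelling a special result; `fp_divide` is a hypothetical name.

```python
from fractions import Fraction

BIAS = 127  # single-precision exponent bias


def fp_divide(xs, xe_biased, ys, ye_biased):
    """Divide X by Y, each given as (significand, biased exponent)."""
    if ys == 0:
        raise ZeroDivisionError("divisor is zero (not modelled in this sketch)")
    if xs == 0:
        return Fraction(0), 0
    # The biases cancel in the difference, so the bias is added back once.
    ze = xe_biased - ye_biased + BIAS
    zs = xs / ys
    # The quotient of two significands in [1, 2) lies in (1/2, 2):
    # at most one left shift is needed to normalize.
    if abs(zs) < 1:
        zs, ze = zs * 2, ze - 1
    return zs, ze


x = (Fraction(0b1011111101, 512), 133)   # 95.625 = 1.011111101 x 2^6
y = (Fraction(0b1111, 8), 128)           # 3.75   = 1.111 x 2^1
zs, ze = fp_divide(*x, *y)
print(zs, ze - BIAS)                            # 51/32 4
print(float(zs * Fraction(2) ** (ze - BIAS)))   # 25.5
```

Here 51/32 is 1.10011₂, matching the normalized result 1.10011₂ × 2^4.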
PROBLEM (1)
SOLUTION (1)…
Step 3: For the 8-bit biased exponent field, the bias is
2^(k−1) − 1 = 2^(8−1) − 1 = 127
Add the bias 127 to the exponent 9 and convert the sum to binary to obtain the 8-bit biased exponent:
127 + 9 = 136 (1000 1000₂)
Step 4: Since the given number is negative, set the sign bit (MSB) to 1.
Step 5: Pack the result into the IEEE 32-bit format:
1 1000 1000 0100 0000 0100 0000 0000 000
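The same number in IEEE 64-bit (double) format, with the 11-bit biased exponent 1023 + 9 = 1032 (1000 0001 000₂) and a 52-bit significand:
1 1000 0001 000 0100 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000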
PROBLEM (2)
Perform the following addition using floating-point arithmetic and show how the numbers would be stored in IEEE single-precision format:
68.3₁₀ + 12.2₁₀
68₁₀ = 100 0100₂
0.3₁₀ → 0.3 × 2 = 0.6
        0.6 × 2 = 1.2
        0.2 × 2 = 0.4
        0.4 × 2 = 0.8
        0.8 × 2 = 1.6
        0.6 × 2 = 1.2
        ...
68.3₁₀ = 100 0100.01001 1001 ...₂
       = 1.00 0100 01001 1001 ...₂ × 2^6
Only 24 significand bits can be stored. In a 32-bit register the significand is
1000 1000 1001 1001 1001 1001 | 1001 1001
The discarded bits are more than half of the LSB, so 1 is added to the LSB.
Sign: + (0)
Biased exponent: 6 + 127 = 133₁₀ = 1000 0101₂
Significand: 1.00010001001... (23 bits after the binary point are stored)
sign | biased exponent | significand
0    | 1000 0101       | 0001 0001 0011 0011 0011 010
SOLUTION (2)…
12₁₀ = 1100₂
0.2₁₀ → 0.2 × 2 = 0.4
        0.4 × 2 = 0.8
        0.8 × 2 = 1.6
        0.6 × 2 = 1.2
        0.2 × 2 = 0.4
        ...
12.2₁₀ = 1100.0011 0011 ...₂
       = 1.100 0011 0011 ...₂ × 2^3
Only 24 significand bits can be stored. In a 32-bit register the significand is
1100 0011 0011 0011 0011 0011 | 0011 0011
The discarded bits are less than half of the LSB, so they are dropped.
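The two rounding decisions can be checked exactly with rational arithmetic. The sketch below normalizes a positive value, keeps 24 significand bits, and reports how large the discarded part is relative to the LSB. It is an illustration only: `round_to_24_bits` is a hypothetical helper, exact ties are left unrounded, and a carry that would lengthen the significand is not handled.

```python
from fractions import Fraction


def round_to_24_bits(x: Fraction):
    """Return (true exponent, 24-bit significand as an int, discarded part in LSB units)."""
    if x <= 0:
        raise ValueError("this sketch handles positive values only")
    e = 0
    while x >= 2:                     # normalize so that 1 <= x < 2
        x, e = x / 2, e + 1
    while x < 1:
        x, e = x * 2, e - 1
    scaled = x * (1 << 23)            # 1 integer bit + 23 fraction bits
    kept = scaled.numerator // scaled.denominator
    discarded = scaled - kept         # between 0 and 1, in units of the LSB
    if discarded > Fraction(1, 2):    # more than half of the LSB: add 1 to the LSB
        kept += 1
    return e, kept, discarded


for text in ("68.3", "12.2"):
    e, kept, discarded = round_to_24_bits(Fraction(text))
    print(text, e, format(kept, "024b"), discarded)
# 68.3 6 100010001001100110011010 3/5
# 12.2 3 110000110011001100110011 1/5
```

Dropping the leading 1 of each 24-bit significand gives the 23 stored bits.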