
Friendly Introduction To Numerical Analysis 1st Edition Bradie Solutions Manual PDF


Friendly Introduction To Numerical Analysis 1st Edition Bradie Solutions Manual

Full Download: https://testbanklive.com/download/friendly-introduction-to-numerical-analysis-1st-edition-bradie-solutions-manua

Floating Point Number Systems 1

1.3 Floating Point Number Systems

1. Provide the floating point equivalent for each of the following numbers from the
floating point number system F(10, 4, 0, 4). Consider both chopping and rounding.
Compute the absolute and relative error in each floating point equivalent.
(a) π (b) e
(c) √2 (d) 1/7
(e) cos 22° (f) ln 10
(g) ∛9

In the following table, δ denotes the absolute error and ε the relative error.

              Chopping                                      Rounding
y             fl(y)     δ                ε                  fl(y)     δ                ε
π             3.141     5.927 × 10^-4    1.886 × 10^-4      3.142     4.073 × 10^-4    1.297 × 10^-4
e             2.718     2.818 × 10^-4    1.037 × 10^-4      2.718     2.818 × 10^-4    1.037 × 10^-4
√2            1.414     2.136 × 10^-4    1.510 × 10^-4      1.414     2.136 × 10^-4    1.510 × 10^-4
1/7           0.1428    5.714 × 10^-5    4.000 × 10^-4      0.1429    4.286 × 10^-5    3.000 × 10^-4
cos 22°       0.9271    8.385 × 10^-5    9.044 × 10^-5      0.9272    1.615 × 10^-5    1.741 × 10^-5
ln 10         2.302     5.851 × 10^-4    2.541 × 10^-4      2.303     4.149 × 10^-4    1.802 × 10^-4
∛9            2.080     8.382 × 10^-5    4.030 × 10^-5      2.080     8.382 × 10^-5    4.030 × 10^-5
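The chopping and rounding in Exercise 1 can be checked numerically. The following is a minimal sketch, assuming Python's `decimal` module as a stand-in for four-digit decimal arithmetic; the helper names `fl` and `errors` are illustrative, and the exponent limits of F(10, 4, 0, 4) are not enforced.

```python
import math
from decimal import Context, ROUND_DOWN, ROUND_HALF_UP

def fl(y, digits=4, mode="chop"):
    """Return the `digits`-digit decimal floating point equivalent of y.

    ROUND_DOWN truncates (chopping); ROUND_HALF_UP rounds to nearest.
    Exponent bounds of the number system are NOT checked (assumption).
    """
    rounding = ROUND_DOWN if mode == "chop" else ROUND_HALF_UP
    return Context(prec=digits, rounding=rounding).create_decimal_from_float(y)

def errors(y, approx):
    """Absolute error delta and relative error eps of approx as an estimate of y."""
    delta = abs(float(approx) - y)
    return delta, delta / abs(y)

values = [("pi", math.pi), ("1/7", 1 / 7), ("cos 22 deg", math.cos(math.radians(22)))]
for name, y in values:
    for mode in ("chop", "round"):
        approx = fl(y, mode=mode)
        delta, eps = errors(y, approx)
        print(f"{name:11s} {mode:5s} fl(y) = {approx}  delta = {delta:.3e}  eps = {eps:.3e}")
```

Because the "exact" value y is itself a double precision number, the printed errors are reliable only to the four digits quoted in the table.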

2. Prove the bounds on the absolute and relative roundoff error associated with
rounding:
\[ |fl_{\text{round}}(y) - y| \le \frac{1}{2}\beta^{e-k} \quad\text{and}\quad \frac{|fl_{\text{round}}(y) - y|}{|y|} \le \frac{1}{2}\beta^{1-k}. \]

Consider the floating point system F(β, k, m, M) with rounding. Let y be a real
number whose expansion is given by
\[ y = \pm(0.d_1 d_2 d_3 \cdots d_k d_{k+1} \cdots)_\beta \times \beta^e \]
with $d_1 \ne 0$ and $m \le e \le M$. If we let d denote β/2, then a bound on the
absolute size of the roundoff error is
\[ |fl_{\text{round}}(y) - y| \le (0.d)_\beta \times \beta^{e-k} = \frac{1}{2}\beta^{e-k}. \]
Provided $y \ne 0$, given the restriction on $d_1$,
\[ |y| = (0.d_1 d_2 d_3 \cdots)_\beta \times \beta^e \ge (0.1)_\beta \times \beta^e = \beta^{e-1}. \]
Therefore, the relative error in $fl_{\text{round}}(y)$ is bounded by
\[ \frac{|fl_{\text{round}}(y) - y|}{|y|} \le \frac{\frac{1}{2}\beta^{e-k}}{\beta^{e-1}} = \frac{1}{2}\beta^{1-k}. \]

3. Show that machine precision is the smallest floating point number, v, such that
fl(1 + v) > 1.

First consider the floating point number system F(β, k, m, M) with chopping. The
number one is represented by the expansion
\[ (0.1\underbrace{00\cdots00}_{k-1\ \text{zeros}})_\beta \times \beta^1. \]
If we let
\[ v = u = \beta^{1-k} = (0.1\underbrace{00\cdots00}_{k-1\ \text{zeros}})_\beta \times \beta^{2-k} = (0.\underbrace{00\cdots00}_{k-1\ \text{zeros}}1\underbrace{00\cdots00}_{k-1\ \text{zeros}})_\beta \times \beta^1, \]
then
\[ 1 + v = (0.1\underbrace{00\cdots00}_{k-2\ \text{zeros}}1\underbrace{00\cdots00}_{k-1\ \text{zeros}})_\beta \times \beta^1 \]
and
\[ fl_{\text{chop}}(1 + v) = 1.00\cdots001 > 1. \]
If we assign to v any value smaller than u, then the kth digit in the mantissa of
1 + v is zero and $fl_{\text{chop}}(1 + v) = 1$. Thus, with chopping, machine precision is
the smallest floating point number, v, such that fl(1 + v) > 1.

Now, consider the floating point number system F(β, k, m, M) with rounding. For
notational convenience, let d denote β/2. If we take
\[ v = u = \frac{1}{2}\beta^{1-k} = (0.d\underbrace{00\cdots00}_{k-1\ \text{zeros}})_\beta \times \beta^{1-k} = (0.\underbrace{00\cdots00}_{k\ \text{zeros}}d\underbrace{00\cdots00}_{k-1\ \text{zeros}})_\beta \times \beta^1, \]
then
\[ 1 + v = (0.1\underbrace{00\cdots00}_{k-1\ \text{zeros}}d\underbrace{00\cdots00}_{k-1\ \text{zeros}})_\beta \times \beta^1 \]
and
\[ fl_{\text{round}}(1 + v) = 1.00\cdots001 > 1. \]
If we assign to v any value smaller than u, then the (k + 1)st digit in the mantissa
of 1 + v is smaller than β/2 and $fl_{\text{round}}(1 + v) = 1$. Thus, with rounding, machine
precision is the smallest floating point number, v, such that fl(1 + v) > 1.

4. (a) Construct an algorithm to determine machine precision and another
algorithm to determine the smallest positive number of a floating point number
system.
(b) Implement the algorithms from part (a) to determine machine precision
and the smallest positive number on your computing system. Consider
both single and double precision.
(c) Assuming that your computing system uses β = 2 and rounding, use the
results from part (b) to determine the values for k and m.

(a) Assuming the floating point system uses rounding, here is an algorithm to
determine machine precision. Multiplication by β is performed in the output
step because the while loop terminates when one too many divisions by β
have been carried out.
    GIVEN:   base β
    STEP 1:  initialize u = 1/2
    STEP 2:  while (1 + u > 1)
                 replace u by u/β
    OUTPUT:  β·u
Here is an algorithm to determine the smallest positive number, assuming that
underflow is handled by setting the value to zero.
    GIVEN:   base β
    STEP 1:  initialize temp = 1
    STEP 2:  while (temp > 0)
    STEP 3:      set sm = temp
                 replace temp by temp/β
    OUTPUT:  sm
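The two algorithms translate almost line for line into code. The sketch below is one possible implementation, assuming Python, whose built-in `float` is an IEEE double on common platforms; the function names are illustrative, and single precision would need a different numeric type.

```python
import sys

def machine_precision(beta=2.0):
    """Run the while loop from part (a) in the native float type."""
    u = 0.5
    while 1.0 + u > 1.0:
        u = u / beta
    return beta * u  # undo the one division too many

def smallest_positive(beta=2.0):
    """Divide by beta until underflow to zero; return the last nonzero value."""
    temp = 1.0
    sm = temp
    while temp > 0.0:
        sm = temp
        temp = temp / beta
    return sm

print(machine_precision(), sys.float_info.epsilon)
print(smallest_positive())
```

The first line of output compares the loop's result with the value the Python runtime reports as `sys.float_info.epsilon`.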
(b) Answers will of course vary. On a SunBlade 100, machine precision in both
single and double precision is $2.22045 \times 10^{-16}$. The smallest positive number in
single precision is $1.4013 \times 10^{-45}$ and in double precision is $4.94066 \times 10^{-324}$.

(c) In general, machine precision with rounding is $\frac{1}{2}\beta^{1-k}$ and the smallest positive
number is $(0.1)_\beta \times \beta^m = \beta^{m-1}$. Assuming β = 2, we solve $2^{-k} = 2.22045 \times 10^{-16}$
to find k = 52 in both single and double precision on the SunBlade 100.
In single precision, we solve $2^{m-1} = 1.4013 \times 10^{-45}$ to find m = −148; in
double precision, the equation $2^{m-1} = 4.94066 \times 10^{-324}$ yields m = −1073.
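The equations in part (c) are solved by taking base-2 logarithms. A small sketch, assuming Python and using the measured values quoted in part (b) as inputs (the function names are illustrative):

```python
import math

def k_from_machine_precision(u):
    # With beta = 2 and rounding, u = (1/2) * 2**(1 - k) = 2**(-k).
    return round(-math.log2(u))

def m_from_smallest_positive(s):
    # The smallest positive number is 2**(m - 1).
    return round(math.log2(s)) + 1

print(k_from_machine_precision(2.22045e-16))
print(m_from_smallest_positive(1.4013e-45))
print(m_from_smallest_positive(4.94066e-324))
```

Rounding to the nearest integer absorbs the truncation in the quoted decimal values.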

5. Determine machine precision, the smallest positive number and the largest
positive number for the floating point number system used by your calculator.
Assuming the calculator uses β = 10, determine the values for k, m and M.

Answers will of course vary. On a Casio fx-300SA, machine precision is
$5 \times 10^{-10}$, the smallest positive number is $10^{-99}$ and the largest positive number is
$9.999999999 \times 10^{99}$. In general, machine precision with rounding is $\frac{1}{2}\beta^{1-k}$, the
smallest positive number is $(0.1)_\beta \times \beta^m = \beta^{m-1}$ and the largest positive number is
$(1 - \beta^{-k}) \times \beta^M$. Assuming that the calculator uses β = 10, the values for machine
precision, the smallest positive number and the largest positive number on the Casio
fx-300SA determine k = 10, m = −98 and M = 100.

6. Determine the number of significant decimal digits and the number of significant
binary digits to which each of the following pairs of numbers agree.
(a) 355/113 and π
(b) 685/252 and e
(c) √10002 and √10001
(d) 103/280 and 1/e

(a) Because
\[ \frac{|355/113 - \pi|}{|\pi|} = 8.491 \times 10^{-8} \]
and
\[ 10^{-8} < 8.491 \times 10^{-8} \le 10^{-7}, \]
it follows that 355/113 and π agree to at least 7 and at most 8 decimal digits.
Since
\[ 2^{-24} = 5.960 \times 10^{-8} < 8.491 \times 10^{-8} < 1.192 \times 10^{-7} = 2^{-23}, \]
we see that 355/113 and π agree to at least 23 and at most 24 binary digits.
(b) Because
\[ \frac{|685/252 - e|}{|e|} = 1.025 \times 10^{-5} \]
and
\[ 10^{-5} < 1.025 \times 10^{-5} \le 10^{-4}, \]
it follows that 685/252 and e agree to at least 4 and at most 5 decimal digits.
Since
\[ 2^{-17} = 7.629 \times 10^{-6} < 1.025 \times 10^{-5} < 1.526 \times 10^{-5} = 2^{-16}, \]
we see that 685/252 and e agree to at least 16 and at most 17 binary digits.
(c) Because
\[ \frac{|\sqrt{10002} - \sqrt{10001}|}{|\sqrt{10001}|} = 4.999 \times 10^{-5} \]
and
\[ 10^{-5} < 4.999 \times 10^{-5} \le 10^{-4}, \]
it follows that √10002 and √10001 agree to at least 4 and at most 5 decimal
digits. Since
\[ 2^{-15} = 3.052 \times 10^{-5} < 4.999 \times 10^{-5} < 6.103 \times 10^{-5} = 2^{-14}, \]
we see that √10002 and √10001 agree to at least 14 and at most 15 binary
digits.
(d) Because
\[ \frac{|103/280 - 1/e|}{|1/e|} = 6.061 \times 10^{-5} \]
and
\[ 10^{-5} < 6.061 \times 10^{-5} \le 10^{-4}, \]
it follows that 103/280 and 1/e agree to at least 4 and at most 5 decimal digits.
Since
\[ 2^{-15} = 3.052 \times 10^{-5} < 6.061 \times 10^{-5} < 6.103 \times 10^{-5} = 2^{-14}, \]
we see that 103/280 and 1/e agree to at least 14 and at most 15 binary digits.
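The bracketing used in all four parts follows one pattern: find the integer t with base^(−(t+1)) < relative error ≤ base^(−t). The following sketch, assuming Python and an illustrative helper named `agreement`, applies that pattern in base 10 and base 2.

```python
import math

def agreement(approx, exact, base):
    """Return (at_least, at_most) digits of agreement in the given base.

    Convention taken from the solution above: if
    base**(-(t + 1)) < rel <= base**(-t), the two numbers agree to
    at least t and at most t + 1 digits.
    """
    rel = abs(approx - exact) / abs(exact)
    t = math.floor(-math.log(rel, base))
    return t, t + 1

pairs = [
    ("355/113 vs pi", 355 / 113, math.pi),
    ("685/252 vs e", 685 / 252, math.e),
    ("sqrt(10002) vs sqrt(10001)", math.sqrt(10002), math.sqrt(10001)),
    ("103/280 vs 1/e", 103 / 280, 1 / math.e),
]
for label, approx, exact in pairs:
    print(label, "decimal:", agreement(approx, exact, 10), "binary:", agreement(approx, exact, 2))
```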

7. The ideal gas law states that PV = nRT, where P is the pressure of the gas, V
is the volume, n is the number of moles, T is the temperature and R = 0.08206
atm·m^3/(moles·K) is the universal gas constant.
(a) Experimentally, it has been determined that P = 0.750 atm, V = 1.15 m^3
and T = 294.1 K. Assuming that all values have been rounded to the digits
shown, in what range of values does n fall?
(b) Experimentally, it has been determined that V = 0.331 m^3, n = 0.00712
moles and T = 264.7 K. Assuming that all values have been rounded to the
digits shown, in what range of values does P fall?

(a) With
    0.7495 atm < P < 0.7505 atm
    1.145 m^3 < V < 1.155 m^3
    294.05 K < T < 294.15 K
it follows from the ideal gas law that
\[ \frac{(0.7495)(1.145)}{(0.08206)(294.15)} < n < \frac{(0.7505)(1.155)}{(0.08206)(294.05)} \]
or 0.03555 moles < n < 0.03592 moles.
(b) With
    0.3305 m^3 < V < 0.3315 m^3
    0.007115 moles < n < 0.007125 moles
    264.65 K < T < 264.75 K
it follows from the ideal gas law that
\[ \frac{(0.007115)(0.08206)(264.65)}{0.3315} < P < \frac{(0.007125)(0.08206)(264.75)}{0.3305} \]
or 0.46612 atm < P < 0.46836 atm.
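The bounds above come from pushing each measured quantity to whichever end of its rounding interval makes the result smallest or largest. A minimal sketch of that bookkeeping, assuming Python; the helper names `rounded_interval`, `n_range` and `P_range` are illustrative.

```python
R = 0.08206  # gas constant as given in the problem statement

def rounded_interval(value, last_place):
    """Reals that round to `value`, where `last_place` is the place value of the last digit shown."""
    half = last_place / 2
    return value - half, value + half

def n_range(P, V, T):
    (P_lo, P_hi), (V_lo, V_hi), (T_lo, T_hi) = P, V, T
    # n = PV/(RT) increases with P and V and decreases with T (all positive).
    return P_lo * V_lo / (R * T_hi), P_hi * V_hi / (R * T_lo)

def P_range(n, V, T):
    (n_lo, n_hi), (V_lo, V_hi), (T_lo, T_hi) = n, V, T
    # P = nRT/V increases with n and T and decreases with V (all positive).
    return n_lo * R * T_lo / V_hi, n_hi * R * T_hi / V_lo

print(n_range(rounded_interval(0.750, 0.001), rounded_interval(1.15, 0.01), rounded_interval(294.1, 0.1)))
print(P_range(rounded_interval(0.00712, 0.00001), rounded_interval(0.331, 0.001), rounded_interval(264.7, 0.1)))
```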

8. In a physics laboratory, students measure the mass of a rectangular block to be
243.27 ± 0.005 grams. The length, width and depth of the block are measured
to be 7.8 ± 0.05 cm, 3.1 ± 0.05 cm and 4.2 ± 0.05 cm, respectively.
(a) In what range of values does the volume of the block fall?
(b) In what range of values does the density of the block fall? Density is mass
per unit volume.

(a) With
    7.75 cm < length < 7.85 cm
    3.05 cm < width < 3.15 cm
    4.15 cm < depth < 4.25 cm
it follows that
\[ 98.095625\ \text{cm}^3 = (7.75)(3.05)(4.15) < \text{volume} < (7.85)(3.15)(4.25) = 105.091875\ \text{cm}^3. \]

(b) Density is defined as mass per unit volume. It is given that
    243.265 grams < mass < 243.275 grams,
and we determined in part (a) that 98.095625 cm^3 < volume < 105.091875 cm^3,
so
\[ 2.31\ \frac{\text{grams}}{\text{cm}^3} \approx \frac{243.265}{105.091875} < \text{density} < \frac{243.275}{98.095625} \approx 2.48\ \frac{\text{grams}}{\text{cm}^3}. \]

9. Students are using a pendulum to experimentally determine the acceleration
due to gravity, g. They measure the period, T, of the pendulum to be 2.2
seconds, and the length, l, of the pendulum to be 1.15 meters. Assuming that
all values are correct to the digits shown, in what range of values does g fall?
The variables in this problem are related by the formula $T = 2\pi\sqrt{l/g}$.

Solving $T = 2\pi\sqrt{l/g}$ for g yields $g = 4\pi^2 l/T^2$. With
    2.15 s < T < 2.25 s and 1.145 m < l < 1.155 m,
it follows that
\[ 4\pi^2\,\frac{1.145}{2.25^2} < g < 4\pi^2\,\frac{1.155}{2.15^2}, \]
or 8.929 m/s^2 < g < 9.864 m/s^2.
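The endpoints can be evaluated directly. A short sketch, assuming Python and an illustrative function name `g_range`, pairs the shortest length with the longest period for the lower bound and the reverse for the upper bound.

```python
import math

def g_range(T_lo, T_hi, l_lo, l_hi):
    """Bounds on g = 4*pi^2*l/T^2, which increases with l and decreases with T (T > 0)."""
    def g(l, T):
        return 4 * math.pi ** 2 * l / T ** 2
    return g(l_lo, T_hi), g(l_hi, T_lo)

lo, hi = g_range(2.15, 2.25, 1.145, 1.155)
print(f"{lo:.3f} m/s^2 < g < {hi:.3f} m/s^2")
```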

10. Determine machine precision, the smallest positive number and the largest posi-
tive number in the IEEE standard double precision system. Approximately how
many significant decimal digits does the double precision standard supply?

With β = 2 and k = 53, machine precision with rounding is
\[ u = \frac{1}{2}\,2^{1-53} = 2^{-53} \approx 1.11 \times 10^{-16}. \]
Accordingly, there are between 15 and 16 significant decimal digits available in IEEE
standard double precision. The smallest positive number in double precision is
\[ (0.1)_2 \times 2^{-1021} = 2^{-1022} \approx 2.23 \times 10^{-308}, \]
while the largest positive number is
\[ (0.111\cdots1)_2 \times 2^{1024} = (1 - 2^{-53})\,2^{1024} \approx 1.80 \times 10^{308}. \]
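These values can be compared against what a language runtime reports. The sketch below assumes Python, whose `float` is an IEEE double on common platforms. Note that `sys.float_info.epsilon` is the spacing between 1 and the next larger double, $\beta^{1-k} = 2^{-52}$, which is twice the u defined here.

```python
import sys

u = 2.0 ** -53                               # machine precision with rounding
smallest = 2.0 ** -1022                      # smallest positive normalized number
# (1 - 2^-53) * 2^1024, rearranged so that no intermediate value overflows:
largest = (2.0 - 2.0 ** -52) * 2.0 ** 1023

print(u, sys.float_info.epsilon / 2)
print(smallest, sys.float_info.min)
print(largest, sys.float_info.max)
print(sys.float_info.dig)                    # decimal digits that always survive a round trip
```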

11. In addition to the standard single and double precision floating point systems,
Intel microprocessors also have an extended precision system F(2, 64, −16381, 16384).
Determine machine precision, the smallest positive number and the largest pos-
itive number for this extended precision system.

With β = 2 and k = 64, machine precision with rounding is
\[ u = \frac{1}{2}\,2^{1-64} = 2^{-64} \approx 5.42 \times 10^{-20}. \]
Accordingly, there are between 19 and 20 significant decimal digits available in this
extended precision format. The smallest positive number is
\[ (0.1)_2 \times 2^{-16381} = 2^{-16382} \approx 3.36 \times 10^{-4932}, \]
while the largest positive number is
\[ (0.111\cdots1)_2 \times 2^{16384} = (1 - 2^{-64})\,2^{16384} \approx 1.19 \times 10^{4932}. \]

12. IBM System/390 mainframes provide three floating point number systems: short
precision F(16, 6, −64, 63), long precision F(16, 14, −64, 63) and extended preci-
sion F(16, 28, −64, 63). Compare machine precision, the smallest positive num-
ber and the largest positive number for each of these number systems.

In the short precision system F(16, 6, −64, 63), machine precision with rounding is
\[ u = \frac{1}{2}\,16^{1-6} = 2^{-21} \approx 4.77 \times 10^{-7}, \]
while machine precision with rounding in the long precision system F(16, 14, −64, 63)
is
\[ u = \frac{1}{2}\,16^{1-14} = 2^{-53} \approx 1.11 \times 10^{-16}. \]
In the extended precision system F(16, 28, −64, 63), machine precision with rounding
is
\[ u = \frac{1}{2}\,16^{1-28} = 2^{-109} \approx 1.54 \times 10^{-33}. \]
Accordingly, the short precision system provides between 6 and 7 significant decimal
digits, the long precision system provides between 15 and 16 significant decimal
digits and the extended precision system provides between 32 and 33 significant
decimal digits. In all three systems, the smallest positive number is
\[ (0.1)_{16} \times 16^{-64} = 16^{-65} \approx 5.40 \times 10^{-79}. \]
The largest positive number in short, long and extended precision is
\[ (1 - 16^{-6}) \cdot 16^{63}, \quad (1 - 16^{-14}) \cdot 16^{63}, \quad\text{and}\quad (1 - 16^{-28}) \cdot 16^{63}, \]
respectively.

13. A common floating point number system used on modern calculators is
F(10, 10, −98, 100). Determine machine precision, the smallest positive number
and the largest positive number for this number system.

With β = 10 and k = 10, machine precision with rounding is
\[ u = \frac{1}{2}\,10^{1-10} = 5 \times 10^{-10}. \]
Accordingly, there are between 9 and 10 significant decimal digits available in
F(10, 10, −98, 100). The smallest positive number is
\[ (0.1)_{10} \times 10^{-98} = 10^{-99}, \]
while the largest positive number is
\[ (1 - 10^{-10})\,10^{100} = 9.999999999 \times 10^{99}. \]
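Exercises 10 through 13 all apply the same three formulas to different parameter sets. The following sketch, assuming Python's `fractions` module for exact rational arithmetic and an illustrative function name `system_parameters`, evaluates them for any F(β, k, m, M).

```python
from fractions import Fraction

def system_parameters(beta, k, m, M):
    """Exact machine precision (with rounding), smallest and largest positive numbers."""
    beta = Fraction(beta)
    u = Fraction(1, 2) * beta ** (1 - k)          # (1/2) * beta^(1-k)
    smallest = beta ** (m - 1)                    # (0.1)_beta * beta^m
    largest = (1 - beta ** (-k)) * beta ** M      # (0.dd...d)_beta * beta^M with d = beta - 1
    return u, smallest, largest

systems = {
    "IBM short": (16, 6, -64, 63),
    "IBM long": (16, 14, -64, 63),
    "calculator": (10, 10, -98, 100),
}
for name, params in systems.items():
    u, smallest, largest = system_parameters(*params)
    print(f"{name:10s} u = {float(u):.3e}  smallest = {float(smallest):.3e}  largest = {float(largest):.3e}")
```

Exact fractions avoid overflow and underflow while the values are being formed; conversion to `float` is used only for display, so systems whose extremes exceed the double precision range (such as the Intel extended format) should be inspected as fractions instead.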

14. (a) Show that the number of elements in the set F(β, k, m, M) is given by
$1 + 2(\beta - 1)\beta^{k-1}(M - m + 1)$.
(b) How many elements are in the IEEE standard single precision number
system?
(c) How many elements are in the IEEE standard double precision number
system?

(a) Let's start by counting the number of positive elements in F(β, k, m, M).
The only restriction on the mantissa is that the first digit cannot be zero;
hence, there are $(\beta - 1)\beta^{k-1}$ different mantissas that can be formed. With
a minimum exponent of m and a maximum exponent of M, there are M − m + 1
different exponents. Consequently, there are $(\beta - 1)\beta^{k-1}(M - m + 1)$
positive elements in F(β, k, m, M). Because the number system contains
a zero element and equally many negative elements as positive elements, it
follows that F(β, k, m, M) contains a total of
\[ 1 + 2(\beta - 1)\beta^{k-1}(M - m + 1) \text{ elements.} \]

(b) IEEE standard single precision is the system F(2, 24, −125, 128). Therefore,
the IEEE standard single precision number system has
\[ 1 + 2(2 - 1)2^{24-1}(128 - (-125) + 1) = 4,261,412,865 \]
elements.
(c) IEEE standard double precision is the system F(2, 53, −1021, 1024). Therefore,
the IEEE standard double precision number system has
\[ 1 + 2(2 - 1)2^{53-1}(1024 - (-1021) + 1) = 18,428,729,675,200,069,633 \approx 1.84 \times 10^{19} \]
elements.
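The counting formula is easy to evaluate with exact integers, and for a very small system it can be checked by enumerating every element. The sketch below assumes Python; `count_elements` and `brute_force_count` are illustrative names, and the tiny system F(2, 3, −1, 2) is chosen only as a test case.

```python
from fractions import Fraction
from itertools import product

def count_elements(beta, k, m, M):
    """Number of elements of F(beta, k, m, M) from the closed-form count."""
    return 1 + 2 * (beta - 1) * beta ** (k - 1) * (M - m + 1)

def brute_force_count(beta, k, m, M):
    """Enumerate all normalized mantissas and exponents (feasible for tiny systems only)."""
    positives = set()
    for digits in product(range(beta), repeat=k):
        if digits[0] == 0:
            continue  # leading digit must be nonzero
        mantissa = sum(Fraction(d, beta ** (i + 1)) for i, d in enumerate(digits))
        for e in range(m, M + 1):
            positives.add(mantissa * Fraction(beta) ** e)
    return 1 + 2 * len(positives)  # zero, plus positives and their negatives

print(count_elements(2, 24, -125, 128))
print(count_elements(2, 53, -1021, 1024))
print(brute_force_count(2, 3, -1, 2), count_elements(2, 3, -1, 2))
```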

15. Consider the function $f(x) = x^2 - 4x + 4$.
(a) What are the zeros of f?
(b) Suppose we were to change the constant term to $4 - 10^{-8}$. What are
the zeros of this new function? Relative to the size of the change in the
constant term, how big is the change in the zeros of the function?
(c) Now, suppose we were to change the constant term to $4 + 10^{-8}$. What are
the zeros of this new function? Relative to the size of the change in the
constant term, how big is the change in the zeros of the function?

(a) The function $f(x) = x^2 - 4x + 4 = (x - 2)^2$ has a (repeated) zero at x = 2.
(b) Now consider the function $f(x) = x^2 - 4x + (4 - 10^{-8})$. By the quadratic
formula, the zeros of this new function are
\[ x = \frac{4 \pm \sqrt{16 - 4(4 - 10^{-8})}}{2} = 2 \pm 10^{-4} = 1.9999,\ 2.0001. \]
Observe that the change in the zeros (±0.0001) is 10,000 times larger than
the change in the constant term in the function.
(c) Finally, consider the function $f(x) = x^2 - 4x + (4 + 10^{-8})$. By the quadratic
formula, the zeros of this new function are
\[ x = \frac{4 \pm \sqrt{16 - 4(4 + 10^{-8})}}{2} = 2 \pm 0.0001 \cdot i. \]
Observe that the change in the zeros (±0.0001 · i) is 10,000 times larger than
the change in the constant term in the function.
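The sensitivity of the repeated zero can be observed numerically. A minimal sketch, assuming Python's `cmath` module so that the negative discriminant in part (c) is handled; the function name `zeros` is illustrative. The subtraction inside the square root cancels many digits, so the computed zeros match the exact ones only to roughly eleven decimal places.

```python
import cmath

def zeros(c):
    """Zeros of x^2 - 4x + c from the quadratic formula, returned as complex numbers."""
    root = cmath.sqrt(16 - 4 * c)
    return (4 - root) / 2, (4 + root) / 2

for c in (4, 4 - 1e-8, 4 + 1e-8):
    print(c, zeros(c))
```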

16. Consider the linear, first-order differential equation
\[ \frac{dx}{dt} + \frac{1}{t}x = \frac{\sin t}{t}. \]
(a) Solve this equation subject to the initial condition $x(\pi/2) = x_0$.
(b) Solve this equation subject to the perturbed initial condition
$x(\pi/2) = x_0 + \epsilon$.
(c) By considering the difference between the solutions obtained in parts (a)
and (b), comment on the conditioning of this problem.

(a) Multiplying
\[ \frac{dx}{dt} + \frac{1}{t}x = \frac{\sin t}{t} \]
by t yields
\[ t\frac{dx}{dt} + x = \sin t. \]
Note that the terms on the left-hand side of this latter equation are equal
to the derivative of the product tx. Integrating both sides of this equation
therefore produces
\[ tx = -\cos t + C \quad\text{or}\quad x(t) = \frac{C - \cos t}{t}, \]
where C is a constant of integration. Using the initial condition $x(\pi/2) = x_0$,
we determine
\[ x_0 = \frac{C - 0}{\pi/2} \quad\text{or}\quad C = \frac{\pi x_0}{2}. \]
Hence, the solution of the initial value problem is
\[ x(t) = \frac{\pi x_0}{2t} - \frac{\cos t}{t}. \]
(b) The general solution to the differential equation remains
\[ x(t) = \frac{C - \cos t}{t}, \]
where C is a constant of integration. Using the initial condition
$x(\pi/2) = x_0 + \epsilon$, we determine
\[ x_0 + \epsilon = \frac{C - 0}{\pi/2} \quad\text{or}\quad C = \frac{\pi(x_0 + \epsilon)}{2}. \]
Hence, the solution to the perturbed initial value problem is
\[ x(t) = \frac{\pi(x_0 + \epsilon)}{2t} - \frac{\cos t}{t}. \]
(c) The difference between the solutions obtained in parts (a) and (b) is
\[ \frac{\pi\epsilon}{2t}. \]
Because this difference decays to zero as t → ∞, this problem is not
ill-conditioned.

17. Consider the linear, first-order differential equation
\[ \frac{dx}{dt} - \frac{1}{t}x = t\sin t. \]
(a) Solve this equation subject to the initial condition $x(\pi/2) = x_0$.
(b) Solve this equation subject to the perturbed initial condition
$x(\pi/2) = x_0 + \epsilon$.
(c) By considering the difference between the solutions obtained in parts (a)
and (b), comment on the conditioning of this problem.

(a) Multiplying
\[ \frac{dx}{dt} - \frac{1}{t}x = t\sin t \]
by $t^{-1}$ yields
\[ \frac{1}{t}\frac{dx}{dt} - \frac{1}{t^2}x = \sin t. \]
Note that the terms on the left-hand side of this latter equation are equal to
the derivative of the product $t^{-1}x$. Integrating both sides of this equation
therefore produces
\[ \frac{x}{t} = -\cos t + C \quad\text{or}\quad x(t) = t(C - \cos t), \]
where C is a constant of integration. Using the initial condition $x(\pi/2) = x_0$,
we determine
\[ x_0 = \frac{\pi}{2}(C - 0) \quad\text{or}\quad C = \frac{2x_0}{\pi}. \]
Hence, the solution of the initial value problem is
\[ x(t) = t\left(\frac{2x_0}{\pi} - \cos t\right). \]

(b) The general solution to the differential equation remains
\[ x(t) = t(C - \cos t), \]
where C is a constant of integration. Using the initial condition
$x(\pi/2) = x_0 + \epsilon$, we determine
\[ x_0 + \epsilon = \frac{\pi}{2}(C - 0) \quad\text{or}\quad C = \frac{2(x_0 + \epsilon)}{\pi}. \]
Hence, the solution to the perturbed initial value problem is
\[ x(t) = t\left(\frac{2(x_0 + \epsilon)}{\pi} - \cos t\right). \]

(c) The difference between the solutions obtained in parts (a) and (b) is
\[ \frac{2\epsilon t}{\pi}. \]
Because this difference tends toward infinity as t → ∞, meaning that a small
change in input data can result in a large change in the output, this problem
is ill conditioned.
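The contrast between Exercises 16 and 17 shows up clearly when the two closed-form solutions are evaluated for a perturbed initial value. The sketch below assumes Python; the values x0 = 1 and ε = 10^-6 are illustrative choices, not taken from the text.

```python
import math

def x16(t, x0):
    # Closed-form solution from Exercise 16: x(t) = pi*x0/(2t) - cos(t)/t
    return math.pi * x0 / (2 * t) - math.cos(t) / t

def x17(t, x0):
    # Closed-form solution from Exercise 17: x(t) = t*(2*x0/pi - cos t)
    return t * (2 * x0 / math.pi - math.cos(t))

x0, eps = 1.0, 1e-6  # hypothetical initial value and perturbation
for t in (math.pi / 2, 10.0, 100.0, 1000.0):
    d16 = x16(t, x0 + eps) - x16(t, x0)   # expected: pi*eps/(2t), shrinking with t
    d17 = x17(t, x0 + eps) - x17(t, x0)   # expected: 2*eps*t/pi, growing with t
    print(f"t = {t:8.2f}   difference (16) = {d16:.3e}   difference (17) = {d17:.3e}")
```

At t = π/2 both differences equal ε; after that the first shrinks like 1/t while the second grows linearly in t.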

18. Consider the linear system of equations
\[ \begin{bmatrix} 1.1 & 2.1 \\ 2 & 3.8 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = b. \]
(a) Solve the system for the right-hand side vector $b = [\,3.2 \;\; 5.8\,]^T$.
(b) Solve the system for the right-hand side vector $b = [\,3.21 \;\; 5.79\,]^T$.
(c) Solve the system for the right-hand side vector $b = [\,3.1 \;\; 5.7\,]^T$.
(d) By considering the difference between the solutions obtained in parts (a),
(b), and (c), comment on the conditioning of this problem.

(a) The system of equations is
    1.1x + 2.1y = 3.2
      2x + 3.8y = 5.8
If we multiply the first equation by 2 and the second equation by −1.1 and
then add, we obtain 0.02y = 0.02. Thus, y = 1. Back substituting into either
of the original equations yields x = 1.
(b) Working as we did in part (a), we find the solution corresponding to the
right-hand side vector $b = [\,3.21 \;\; 5.79\,]^T$ is x = −1.95 and y = 2.55.
(c) Working as we did in part (a), we find the solution corresponding to the
right-hand side vector $b = [\,3.1 \;\; 5.7\,]^T$ is x = 9.5 and y = −3.5.
(d) Given that small changes to the right-hand side vector resulted in relatively
large changes to the solution vector, it appears that this problem is ill
conditioned.
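The three solves can be repeated in a few lines. The sketch below assumes Python and uses Cramer's rule rather than the elimination described above (a swapped-in technique chosen for brevity); `solve_2x2` is an illustrative name. The determinant of the coefficient matrix is (1.1)(3.8) − (2.1)(2) = −0.02, and its smallness relative to the entries is consistent with the sensitivity observed in part (d).

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve [[a11, a12], [a21, a22]] [x, y]^T = [b1, b2]^T by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y

A = (1.1, 2.1, 2.0, 3.8)
for b in ((3.2, 5.8), (3.21, 5.79), (3.1, 5.7)):
    x, y = solve_2x2(*A, *b)
    print(f"b = {b}:  x = {x:.4f}, y = {y:.4f}")
```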
