Solutions: Modern Statistics
May 3, 2023
Springer Nature
Chapter 1
Analyzing Variability: Descriptive Statistics
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mistat
from scipy import stats
Solution 1.1 random.choices selects 𝑘 values from the list using sampling with
replacement.
import random
random.seed(1)
values = random.choices([1, 2, 3, 4, 5, 6], k=50)
The expected frequency in each cell under randomness is 50/6 ≈ 8.3. Your numerical results will differ, due to randomness.
Solution 1.2 The Python function range is an iterator. As we need a list of values,
we need to explicitly convert it.
x = list(range(50))
y = [5 + 2.5 * xi for xi in x]
y = [yi + random.uniform(-10, 10) for yi in y]
pd.DataFrame({'x': x, 'y': y}).plot.scatter(x='x', y='y')
plt.show()
[Figure: scatter plot of y versus x]
Solution 1.3 The sums obtained for the different values of p were:

p     sum
0.1   4
0.3   12
0.7   33
0.9   43

Notice that the expected values of the sums are 5, 15, 35 and 45.
Solution 1.4 We can plot the data and calculate mean and standard deviation.
plt.show()
[Figure: plot of the measurements from the two instruments]
As shown in the figure, the measurements on Instrument 1 seem to be accurate but less precise than those on Instrument 2. Instrument 2 seems to have an upward bias (inaccurate). Quantitatively, the mean of the measurements on Instrument 1 is $\bar{X}_1 = 10.034$ and its standard deviation is $S_1 = 0.871$. For Instrument 2 we have $\bar{X}_2 = 10.983$ and $S_2 = 0.569$.
Solution 1.5 If the scale is inaccurate it will show on the average a deterministic
component different than the nominal weight. If the scale is imprecise, different
weight measurements will show a high degree of variability around the correct
nominal weight. Problems with stability arise when the accuracy of the scale changes
with time, and the scale should be recalibrated.
Solution 1.6 The method random.choices creates a random selection with re-
placement. Note that in range(start, end) the end argument is excluded. We
therefore need to set it to 101.
import random
random.choices(range(1, 101), k=20)
[6, 88, 57, 20, 51, 49, 36, 35, 54, 63, 62, 46, 3, 23, 18, 59, 87, 80,
80, 82]
Solution 1.7
import random
random.sample(range(11, 31), 10)
[19, 12, 13, 28, 11, 18, 26, 23, 15, 14]
Solution 1.8 (i) $26^5 = 11{,}881{,}376$; (ii) $7{,}893{,}600$; (iii) $26^3 = 17{,}576$; (iv) $2^{10} = 1024$; (v) $\binom{10}{5} = 252$.
Solution 1.10
filmsp = mistat.load_data('FILMSP')
filmsp.plot.hist()
plt.show()
[Figure: histogram of FILMSP with frequency on the vertical axis]
Solution 1.11
coal = mistat.load_data('COAL')
pd.DataFrame(coal.value_counts(sort=False))
count
COAL
3 17
6 3
4 6
0 33
5 5
2 18
7 1
1 28
Solution 1.12 For (1) and (2), we can use the pandas value_counts method. e.g.:
car = mistat.load_data('CAR')
car['cyl'].value_counts(sort=False)
cyl
4 66
6 30
8 13
Name: count, dtype: int64
(iii) Frequency distribution of turn diameter:

turn
(28, 30] 3
(30, 32] 16
(32, 34] 16
(34, 36] 26
(36, 38] 20
(38, 40] 18
(40, 42] 8
(42, 44] 2
Name: count, dtype: int64
Note that the bin intervals are open on the left and closed on the right.
(iv) Frequency distribution of Horsepower:
hp
(50, 75] 7
(75, 100] 34
(100, 125] 22
(125, 150] 18
(150, 175] 17
(175, 200] 6
(200, 225] 4
(225, 250] 1
Name: count, dtype: int64
(v) Frequency distribution of mpg:

mpg
(9, 14] 1
(14, 19] 42
(19, 24] 41
(24, 29] 22
(29, 34] 3
Name: count, dtype: int64
Solution 1.13
filmsp = mistat.load_data('FILMSP')
filmsp = filmsp.sort_values(ignore_index=True) # sort and reset index
0.00 66.0
0.25 102.0
0.50 105.0
0.75 109.0
1.00 118.0
Name: FILMSP, dtype: float64
0.80 109.8
0.90 111.0
0.99 114.0
Name: FILMSP, dtype: float64
Here is a solution that uses pure Python. Note that pandas' quantile implements different interpolation methods, which will lead to differences for smaller datasets. We therefore recommend using the library method and selecting the interpolation that is most suitable for your use case.
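The pure-Python code itself is not reproduced in this extract. A minimal sketch of such a computation, using linear interpolation between order statistics (the book's exact convention may differ, which would explain small discrepancies such as 109.5 vs. 109.8 for the 0.8 quantile), could look like this:

import math

def quantile(values, p):
    # quantile by linear interpolation between order statistics
    xs = sorted(values)
    pos = p * (len(xs) - 1)      # fractional position in the sorted sample
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

for p in (0, 0.25, 0.5, 0.75, 0.8, 0.9, 0.99, 1.0):
    print(p, quantile(filmsp, p))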
0 66.0
0.25 102.0
0.5 105.0
0.75 109.0
0.8 109.5
0.9 111.0
0.99 114.0
1.0 118.0
Solution 1.14
filmsp = mistat.load_data('FILMSP')
n = len(filmsp)
mean = filmsp.mean()
deviations = [film - mean for film in filmsp]
S = math.sqrt(sum(deviation**2 for deviation in deviations) / n)
# moment-based coefficients of skewness and kurtosis
skewness = sum(deviation**3 for deviation in deviations) / (n * S**3)
kurtosis = sum(deviation**4 for deviation in deviations) / (n * S**4)
print('Python:\n',
      f'Skewness: {skewness}, Kurtosis: {kurtosis}')
print('Pandas:\n',
      f'Skewness: {filmsp.skew()}, Kurtosis: {filmsp.kurtosis()}')
Python:
Skewness: -1.8098727695275856, Kurtosis: 9.014427238360716
Pandas:
Skewness: -1.8224949285588137, Kurtosis: 6.183511188870432
The distribution of film speed is negatively skewed and much steeper than the normal distribution. Note that the calculated values differ between the methods: pandas applies bias corrections and reports excess kurtosis, while the moment-based formulas above do not.
Solution 1.15 The pandas groupby method groups the data based on the value. We
can then calculate individual statistics for each group.
car = mistat.load_data('CAR')
car['mpg'].groupby(by=car['origin']).mean()
car['mpg'].groupby(by=car['origin']).std()
# calculate both at the same time
print(car['mpg'].groupby(by=car['origin']).agg(['mean', 'std']))
mean std
origin
1 20.931034 3.597573
2 19.500000 2.623855
3 23.108108 4.280341
Solution 1.16 We first create a subset of the data frame that contains only US made
cars and then calculate the statistics for this subset only.
car = mistat.load_data('CAR')
car_US = car[car['origin'] == 1]
gamma = car_US['turn'].std() / car_US['turn'].mean()
Solution 1.17
car = mistat.load_data('CAR')
car_US = car[car['origin'] == 1]
car_Asia = car[car['origin'] == 3]
print('US')
print('mean', car_US['turn'].mean())
print('geometric mean', stats.gmean(car_US['turn']))
print('Japanese')
print('mean', car_Asia['turn'].mean())
print('geometric mean', stats.gmean(car_Asia['turn']))
US
mean 37.203448275862065
geometric mean 37.06877691910792
Japanese
mean 33.04594594594595
geometric mean 32.97599107553825
We see that $\bar{X}$ is greater than $G$ in both cases. The cars from Asia have a smaller mean turn diameter.
Solution 1.18
filmsp = mistat.load_data('FILMSP')
Xbar = filmsp.mean()
S = filmsp.std()
print(f'mean: {Xbar}, stddev: {S}')
expected = {1: 0.68, 2: 0.95, 3: 0.997}
for k in (1, 2, 3):
left = Xbar - k * S
right = Xbar + k * S
proportion = sum(left < film < right for film in filmsp)
print(f'X +/- {k}S: ',
f'actual freq. {proportion}, ',
f'pred. freq. {expected[k] * len(filmsp):.2f}')
The discrepancies between the actual and predicted frequencies are due to the fact that the distribution of film speed is neither symmetric nor bell-shaped.
Solution 1.19
car = mistat.load_data('CAR')
car.boxplot(column='mpg', by='origin')
plt.show()
[Figure: box plots of mpg grouped by origin]
Solution 1.20
oturb = mistat.load_data('OTURB')
mistat.stemLeafDiagram(oturb, 2, leafUnit=0.01)
4 2 3444
18 2 55555666677789
40 3 0000001111112222333345
(15) 3 566677788899999
45 4 00022334444
34 4 566888999
25 5 0112333
18 5 6789
14 6 01122233444
3 6 788
• $X_{(1)} = 0.23$,
• $Q_1 = X_{(25.25)} = X_{(25)} + 0.25(X_{(26)} - X_{(25)}) = 0.31$,
• $M_e = X_{(50.5)} = 0.385$,
• $Q_3 = X_{(75.75)} = 0.49 + 0.75(0.50 - 0.49) = 0.4975$,
• $X_{(n)} = 0.68$.
Solution 1.21
from scipy.stats import trim_mean
oturb = mistat.load_data('OTURB')
print(f'T(0.1) = {trim_mean(oturb, 0.1)}')
print(f'S(0.1) = {trim_std(oturb, 0.1)}')
T(0.1) = 0.40558750000000005
S(0.1) = 0.09982289003530202
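Note that trim_std is not part of scipy.stats; the call above presumably refers to a helper from the accompanying package. A minimal sketch of a trimmed standard deviation, assuming it is defined as the standard deviation of the sample after removing a fraction alpha from each tail, is:

def trim_std(x, alpha):
    # standard deviation of the middle (1 - 2*alpha) portion of the sorted sample
    x = np.sort(np.asarray(x))
    k = int(np.floor(len(x) * alpha))   # number of observations trimmed on each side
    return np.std(x[k:len(x) - k], ddof=1)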
Solution 1.22
germanCars = [10, 10.9, 4.8, 6.4, 7.9, 8.9, 8.5, 6.9, 7.1,
5.5, 6.4, 8.7, 5.1, 6.0, 7.5]
japaneseCars = [9.4, 9.5, 7.1, 8.0, 8.9, 7.7, 10.5, 6.5, 6.7,
9.3, 5.7, 12.5, 7.2, 9.1, 8.3, 8.2, 8.5, 6.8, 9.5, 9.7]
# convert to pandas Series
germanCars = pd.Series(germanCars)
japaneseCars = pd.Series(japaneseCars)
# use describe to calculate statistics
comparison = pd.DataFrame({
'German': germanCars.describe(),
'Japanese': japaneseCars.describe(),
})
print(comparison)
German Japanese
count 15.000000 20.000000
mean 7.373333 8.455000
std 1.780235 1.589596
min 4.800000 5.700000
25% 6.200000 7.175000
50% 7.100000 8.400000
75% 8.600000 9.425000
max 10.900000 12.500000
Solution 1.23
hadpas = mistat.load_data('HADPAS')
sampleStatistics = pd.DataFrame({
'res3': hadpas['res3'].describe(),
'res7': hadpas['res7'].describe(),
})
print(sampleStatistics)
res3 res7
count 192.000000 192.000000
mean 1965.239583 1857.776042
std 163.528165 151.535930
min 1587.000000 1420.000000
25% 1860.000000 1772.250000
50% 1967.000000 1880.000000
75% 2088.750000 1960.000000
max 2427.000000 2200.000000
Histogram:
ax = hadpas.hist(column='res3', alpha=0.5)
hadpas.hist(column='res7', alpha=0.5, ax=ax)
plt.show()
[Figure: overlaid histograms of res3 and res7]
We overlay both histograms in one plot and make them transparent (alpha).
Stem and leaf diagrams:
print('res3')
mistat.stemLeafDiagram(hadpas['res3'], 2, leafUnit=10)
print('res7')
mistat.stemLeafDiagram(hadpas['res7'], 2, leafUnit=10)
res3
1 15 8
6 16 01124
14 16 56788889
1 Analyzing Variability: Descriptive Statistics 15
22 17 00000234
32 17 5566667899
45 18 0011112233444
60 18 556666677888899
87 19 000000111111223333344444444
(18) 19 566666666667888889
87 20 00000000122222333334444
64 20 55556666666677789999
44 21 0000011222233344444
25 21 566667788888
13 22 000111234
4 22 668
0 23
1 24 2
res7
1 14 2
9 15 11222244
15 15 667789
23 16 00012334
30 16 5566799
40 17 0022233334
54 17 66666666777999
79 18 0000222222222233344444444
(28) 18 5555556666666778888888999999
85 19 000000001111112222222233333444444
52 19 566666667777888888889999
28 20 0000111222333444
12 20 678
9 21 1123344
2 21 8
1 22 0
Solution 1.24
hadpas.boxplot(column='res3')
plt.show()
[Figure: box plot of res3]
Lower whisker starts at max(1587, 1511.7) = 1587 = 𝑋 (1) ; upper whisker ends
at min(2427, 2440.5) = 2427 = 𝑋 (𝑛) . There are no outliers.
Chapter 2
Probability Models and Distribution Functions
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
[Figure: the sets A and B plotted in the plane]
$$(A \cap B)^c = (A \cap B^c) \cup (A^c \cap B) \cup (A^c \cap B^c) = (A \cap B^c) \cup A^c = A^c \cup B^c.$$
$$B = B \cap S = B \cap \left(\bigcup_{i=1}^{n} A_i\right) = \bigcup_{i=1}^{n} A_i B.$$
Solution 2.7
$$\begin{aligned}
\Pr\{A \cup B \cup C\} &= \Pr\{(A \cup B) \cup C\} = \Pr\{A \cup B\} + \Pr\{C\} - \Pr\{(A \cup B) \cap C\} \\
&= \Pr\{A\} + \Pr\{B\} - \Pr\{A \cap B\} + \Pr\{C\} - \Pr\{A \cap C\} - \Pr\{B \cap C\} + \Pr\{A \cap B \cap C\}.
\end{aligned}$$
Solution 2.8 We have shown in Exercise 2.6 that $B = \bigcup_{i=1}^{n} A_i B$. Moreover, since $\{A_1, \dots, A_n\}$ is a partition, $A_i B \cap A_j B = (A_i \cap A_j) \cap B = \emptyset \cap B = \emptyset$ for all $i \ne j$. Hence, from Axiom 3,
$$\Pr\{B\} = \Pr\left\{\bigcup_{i=1}^{n} A_i B\right\} = \sum_{i=1}^{n} \Pr\{A_i B\}.$$
Solution 2.9
$$S = \{(i_1, i_2) : i_j = 1, \dots, 6,\ j = 1, 2\},\qquad A = \{(i_1, i_2) : i_1 + i_2 = 10\} = \{(4, 6), (5, 5), (6, 4)\},$$
$$\Pr\{A\} = \frac{3}{36} = \frac{1}{12}.$$

Solution 2.10
$$\Pr\{B\} = \Pr\{A_{150}\} - \Pr\{A_{280}\} = \exp\left(-\frac{150}{200}\right) - \exp\left(-\frac{280}{200}\right) = 0.2258.$$
Solution 2.11
$$\frac{\binom{10}{2}\binom{10}{2}\binom{15}{2}\binom{5}{2}}{\binom{40}{8}} = 0.02765.$$

Solution 2.12 (i) $100^5 = 10^{10}$; (ii) $\binom{100}{5} = 75{,}287{,}520$.
Solution 2.13 $N = 1000$, $M = 900$, $n = 10$.
(i) $\Pr\{X \ge 8\} = \sum_{j=8}^{10} \binom{10}{j}(0.9)^j(0.1)^{10-j} = 0.9298.$
(ii) $\Pr\{X \ge 8\} = \sum_{j=8}^{10} \dfrac{\binom{900}{j}\binom{100}{10-j}}{\binom{1000}{10}} = 0.9308.$
Solution 2.19
Solution 2.21 Let $n$ be the number of people in the party. The probability that all their birthdays fall on different days is $\Pr\{D_n\} = \prod_{j=1}^{n} \frac{365 - j + 1}{365}$.
(i) If $n = 10$, $\Pr\{D_{10}\} = 0.8831$.
(ii) If $n = 23$, $\Pr\{D_{23}\} = 0.4927$. Thus, the probability of at least 2 persons with the same birthday, when $n = 23$, is $1 - \Pr\{D_{23}\} = 0.5073 > \frac{1}{2}$.
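The probabilities Pr{D_n} can be reproduced with a short computation:

import math
for n in (10, 23):
    p_distinct = math.prod((365 - j + 1) / 365 for j in range(1, n + 1))
    print(n, round(p_distinct, 4))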
Solution 2.22 $\left(\dfrac{7}{10}\right)^{10} = 0.02825$.

Solution 2.23 $\prod_{j=1}^{10} \left(1 - \dfrac{1}{24 - j + 1}\right) = 0.5833$.
Solution 2.24 (i) $\dfrac{\binom{4}{1}\binom{86}{4}}{\binom{100}{5}} = 0.1128$; (ii) $\dfrac{\binom{4}{1}\binom{10}{1}\binom{86}{3}}{\binom{100}{5}} = 0.0544$; (iii) $1 - \dfrac{\binom{86}{5}}{\binom{100}{5}} = 0.5374$.

Solution 2.25 $\dfrac{4}{\binom{10}{2}} = 0.0889$.
Solution 2.26 The sample median is $X_{(6)}$, where $X_{(1)} < \cdots < X_{(11)}$ are the ordered sample values. $\Pr\{X_{(6)} = k\} = \dfrac{\binom{k-1}{5}\binom{20-k}{5}}{\binom{20}{11}}$, $k = 6, \dots, 15$. This is the probability distribution of the sample median. The probabilities are
𝑘 6 7 8 9 10 11 12 13 14 15
Pr .01192 .04598 .09902 .15404 .18905 .18905 .15404 .09902 .04598 .01192
Solution 2.27 Without loss of generality, assume that the stick is of length 1. Let 𝑥, 𝑦
and (1 − 𝑥 − 𝑦), 0 < 𝑥, 𝑦 < 1, be the length of the 3 pieces. Obviously, 0 < 𝑥 + 𝑦 < 1.
All points in S = {(𝑥, 𝑦) : 𝑥, 𝑦 > 0, 𝑥 + 𝑦 < 1} are uniformly distributed. In order
that the three pieces can form a triangle, the following three conditions should be
satisfied:
(i) 𝑥 + 𝑦 > (1 − 𝑥 − 𝑦)
(ii) 𝑥 + (1 − 𝑥 − 𝑦) > 𝑦
(iii) 𝑦 + (1 − 𝑥 − 𝑦) > 𝑥.
The set of points (𝑥, 𝑦) satisfying (i), (ii) and (iii) is bounded by a triangle of area
1/8. S is bounded by a triangle of area 1/2. Hence, the required probability is 1/4.
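A quick Monte Carlo check of the result; the break points are modeled here as two independent uniform points on the stick, which is equivalent to the uniform distribution on S used above:

import random
random.seed(1)
N = 100_000
count = 0
for _ in range(N):
    u, v = random.random(), random.random()
    x, y = min(u, v), abs(u - v)   # lengths of the first two pieces
    z = 1 - x - y                  # length of the third piece
    if x + y > z and x + z > y and y + z > x:
        count += 1
print(count / N)   # should be close to 1/4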
Solution 2.28 Consider Fig. 2.1. Suppose that the particle is moving along the circumference of the circle in a counterclockwise direction. Then, using the notation in the diagram, $\Pr\{\text{hit}\} = \phi/2\pi$. Since $OD = 1 = OE$, $OB = \sqrt{a^2 + h^2} = OC$ and the lines $DB$ and $EC$ are tangential to the circle, it follows that the triangles $\triangle ODB$ and $\triangle OEC$ are congruent. Thus $m(\angle DOB) = m(\angle EOC)$, and it is easily seen that $\phi = 2\alpha$. Now $\alpha = \tan^{-1}\frac{h}{a}$, and hence $\Pr\{\text{hit}\} = \frac{1}{\pi}\tan^{-1}\frac{h}{a}$.

[Figure: diagram of the circle with the points O, A, B, D, E and the angle α]
Solution 2.29 $1 - (0.999)^{100} - 100(0.001)(0.999)^{99} - \binom{100}{2}(0.001)^2(0.999)^{98} = 0.0001504$.

Solution 2.30 The probability that $n$ tosses are required is $p(n) = (n-1)\left(\frac{1}{2}\right)^n$, $n \ge 2$. Thus, $p(4) = 3 \cdot \frac{1}{2^4} = \frac{3}{16}$.
Solution 2.31 $S = \{(i_1, \dots, i_{10}) : i_j = 0, 1,\ j = 1, \dots, 10\}$. One random variable is the number of 1's in an element, i.e., for $\boldsymbol{\omega} = (i_1, \dots, i_{10})$, $X_1(\boldsymbol{\omega}) = \sum_{j=1}^{10} i_j$. Another random variable is the number of zeros to the left of the first 1, i.e., $X_2(\boldsymbol{\omega}) = \sum_{j=1}^{10} \prod_{k=1}^{j}(1 - i_k)$. Notice that $X_2(\boldsymbol{\omega}) = 0$ if $i_1 = 1$ and $X_2(\boldsymbol{\omega}) = 10$ if $i_1 = i_2 = \cdots = i_{10} = 0$. The probability distribution of $X_1$ is $\Pr\{X_1 = k\} = \binom{10}{k}/2^{10}$, $k = 0, 1, \dots, 10$. The probability distribution of $X_2$ is
$$\Pr\{X_2 = k\} = \begin{cases} \left(\frac{1}{2}\right)^{k+1}, & k = 0, \dots, 9 \\ \left(\frac{1}{2}\right)^{10}, & k = 10. \end{cases}$$
Solution 2.32 (i) Since $\sum_{x=0}^{\infty} \frac{5^x}{x!} = e^5$, we have $\sum_{x=0}^{\infty} p(x) = 1$; (ii) $\Pr\{X \le 1\} = e^{-5}(1 + 5) = 0.0404$; (iii) $\Pr\{X \le 7\} = 0.8666$.
Solution 2.33 (i) Pr{𝑋 = −1} = 0.3; (ii) Pr{−0.5 < 𝑋 < 0} = 0.1; (iii) Pr{0 ≤ 𝑋 <
0.75} = 0.425; (iv) Pr{𝑋 = 1} = 0; (v) 𝐸 {𝑋 } = −0.25, 𝑉 {𝑋 } = 0.4042.
Solution 2.34
$$E\{X\} = \int_0^{\infty} (1 - F(x))\,\mathrm{d}x = \int_0^{\infty} e^{-x^2/2\sigma^2}\,\mathrm{d}x = \sigma\sqrt{\frac{\pi}{2}}.$$

Solution 2.35
$$E\{X\} = \frac{1}{N}\sum_{i=1}^{N} i = \frac{N+1}{2};\qquad E\{X^2\} = \frac{1}{N}\sum_{i=1}^{N} i^2 = \frac{(N+1)(2N+1)}{6}.$$
When $j = l$ the term is $(-1)^l \mu_1^l$. When $j = l - 1$ the term is $(-1)^{l-1} l \mu_1^{l-1}\mu_1 = (-1)^{l-1} l \mu_1^l$. Thus, the sum of the last 2 terms is $(-1)^{l-1}(l-1)\mu_1^l$ and we have
$$\mu_l^* = \sum_{j=0}^{l-2} (-1)^j \binom{l}{j} \mu_1^j \mu_{l-j} + (-1)^{l-1}(l-1)\mu_1^l.$$
Solution 2.39 We saw in the solution of Exercise 2.33 that $\mu_1 = -0.25$. Moreover, $\mu_2 = V\{X\} + \mu_1^2 = 0.4667$.
Solution 2.40 $M_X(t) = \dfrac{1}{t(b-a)}\left(e^{tb} - e^{ta}\right)$, $-\infty < t < \infty$, $a < b$.
$$E\{X\} = \frac{a+b}{2};\qquad V\{X\} = \frac{(b-a)^2}{12}.$$
Solution 2.41 (i) For $t < \lambda$ we have $M(t) = \left(1 - \frac{t}{\lambda}\right)^{-1}$,
$$M'(t) = \frac{1}{\lambda}\left(1 - \frac{t}{\lambda}\right)^{-2},\ \mu_1 = M'(0) = \frac{1}{\lambda};\qquad M''(t) = \frac{2}{\lambda^2}\left(1 - \frac{t}{\lambda}\right)^{-3},\ \mu_2 = M''(0) = \frac{2}{\lambda^2};$$
$$M^{(3)}(t) = \frac{6}{\lambda^3}\left(1 - \frac{t}{\lambda}\right)^{-4},\ \mu_3 = M^{(3)}(0) = \frac{6}{\lambda^3};\qquad M^{(4)}(t) = \frac{24}{\lambda^4}\left(1 - \frac{t}{\lambda}\right)^{-5},\ \mu_4 = M^{(4)}(0) = \frac{24}{\lambda^4}.$$
(ii) The central moments are
$$\mu_1^* = 0,\quad \mu_2^* = \frac{1}{\lambda^2},\quad \mu_3^* = \mu_3 - 3\mu_2\mu_1 + 2\mu_1^3 = \frac{6}{\lambda^3} - \frac{6}{\lambda^3} + \frac{2}{\lambda^3} = \frac{2}{\lambda^3},$$
$$\mu_4^* = \mu_4 - 4\mu_3\mu_1 + 6\mu_2\mu_1^2 - 3\mu_1^4 = \frac{1}{\lambda^4}(24 - 4\cdot 6 + 6\cdot 2 - 3) = \frac{9}{\lambda^4}.$$
(iii) The index of kurtosis is $\beta_4 = \dfrac{\mu_4^*}{(\mu_2^*)^2} = 9$.
x = list(range(15))
table = pd.DataFrame({
'x': x,
'p.d.f.': [stats.binom(20, 0.17).pmf(x) for x in x],
'c.d.f.': [stats.binom(20, 0.17).cdf(x) for x in x],
})
print(table)
x p.d.f. c.d.f.
0 0 2.407475e-02 0.024075
1 1 9.861947e-02 0.122694
2 2 1.918921e-01 0.314586
3 3 2.358192e-01 0.550406
4 4 2.052764e-01 0.755682
5 5 1.345426e-01 0.890224
6 6 6.889229e-02 0.959117
7 7 2.822094e-02 0.987338
8 8 9.392812e-03 0.996731
9 9 2.565105e-03 0.999296
10 10 5.779213e-04 0.999874
11 11 1.076086e-04 0.999981
12 12 1.653023e-05 0.999998
13 13 2.083514e-06 1.000000
14 14 2.133719e-07 1.000000
Solution 2.45 $\Pr\{\text{no defective chip on the board}\} = p^{50}$. Solving $p^{50} = 0.99$ yields $p = (0.99)^{1/50} = 0.999799$.
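A quick numerical check:

print(0.99**(1/50))   # 0.999798...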
Solution 2.46 Notice first that $\lim_{n\to\infty,\ np\to\lambda} b(0; n, p) = \lim_{n\to\infty}\left(1 - \frac{\lambda}{n}\right)^n = e^{-\lambda}$. Moreover, for all $j = 0, 1, \dots, n-1$, $\frac{b(j+1; n, p)}{b(j; n, p)} = \frac{n-j}{j+1}\cdot\frac{p}{1-p}$. Thus, by induction on $j$, for $j > 0$,
$$\lim_{\substack{n\to\infty\\ np\to\lambda}} b(j; n, p) = \lim_{\substack{n\to\infty\\ np\to\lambda}} b(j-1; n, p)\cdot\frac{n-j+1}{j}\cdot\frac{p}{1-p} = e^{-\lambda}\frac{\lambda^{j-1}}{(j-1)!}\lim_{\substack{n\to\infty\\ np\to\lambda}}\frac{(n-j+1)p}{j\left(1 - \frac{\lambda}{n}\right)} = e^{-\lambda}\frac{\lambda^{j-1}}{(j-1)!}\cdot\frac{\lambda}{j} = e^{-\lambda}\frac{\lambda^j}{j!}.$$
Solution 2.50
$$\Pr\{R\} = 1 - H(3; 100, 10, 20) + \sum_{i=1}^{3} h(i; 100, 10, 20)\,[1 - H(3-i; 80, 10-i, 40)] = 0.87395.$$
Solution 2.51 The m.g.f. of the Poisson distribution with parameter $\lambda$, $P(\lambda)$, is $M(t) = e^{\lambda(e^t - 1)}$. Differentiating and evaluating at $t = 0$ gives
$$\mu_1 = \lambda,\quad \mu_2 = \lambda^2 + \lambda,\quad \mu_3 = \lambda^3 + 3\lambda^2 + \lambda,\quad \mu_4 = \lambda^4 + 6\lambda^3 + 7\lambda^2 + \lambda,$$
$$\mu_1^* = 0,\quad \mu_2^* = \lambda,\quad \mu_3^* = \lambda,\quad \mu_4^* = 3\lambda^2 + \lambda.$$
Thus, the indices of skewness and kurtosis are $\beta_3 = \lambda^{-1/2}$ and $\beta_4 = 3 + \frac{1}{\lambda}$. For $\lambda = 10$ we have $\beta_3 = 0.3162$ and $\beta_4 = 3.1$.
Solution 2.52 Let 𝑋 be the number of blemishes observed. Pr{𝑋 > 2} = 0.1912.
Solution 2.53 Using the Poisson approximation with 𝑁 = 8000 and 𝑝 = 380 × 10−6 ,
we have 𝜆 = 3.04 and Pr{𝑋 > 6} = 0.0356, where 𝑋 is the number of insertion
errors in 2 hours of operation.
Solution 2.55 Using Python we obtain that for the $NB(p, k)$ with $p = 0.01$ and $k = 3$, $Q_1 = 170$, $M_e = 265$, and $Q_3 = 389$.
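The Python computation is not shown in this extract. One way to obtain such quantiles is scipy's nbinom, which counts the number of failures before the k-th success; the book's NB convention may differ by an offset, so the values should be interpreted accordingly:

q = stats.nbinom(3, 0.01).ppf([0.25, 0.5, 0.75])
print(q)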
$$\mu_1^* = 0,\quad \mu_2^* = \frac{1-p}{p^2},\quad \mu_3^* = \frac{1}{p^3}(1-p)(2-p),\quad \mu_4^* = \frac{1}{p^4}(9 - 18p + 10p^2 - p^3).$$
Thus the indices of skewness and kurtosis are $\beta_3^* = \dfrac{2-p}{\sqrt{1-p}}$ and $\beta_4^* = \dfrac{9 - 9p + p^2}{1-p}$.
Solution 2.58 If there are 𝑛 chips, 𝑛 > 50, the probability of at least 50 good ones is
1 − 𝐵(49; 𝑛, 0.998). Thus, 𝑛 is the smallest integer > 50 for which 𝐵(49; 𝑛, 0.998) <
0.05. It is sufficient to order 51 chips.
Solution 2.59 If $X$ has a geometric distribution then, for every $j$, $j = 1, 2, \dots$, $\Pr\{X > j\} = (1-p)^j$. Thus,
$$\Pr\{X > n + m \mid X > m\} = \frac{\Pr\{X > n+m\}}{\Pr\{X > m\}} = \frac{(1-p)^{n+m}}{(1-p)^m} = (1-p)^n = \Pr\{X > n\}.$$
Solution 2.60 For $0 < y < 1$, $\Pr\{F(X) \le y\} = \Pr\{X \le F^{-1}(y)\} = F(F^{-1}(y)) = y$. Hence, the distribution of $F(X)$ is uniform on $(0, 1)$. Conversely, if $U$ has a uniform distribution on $(0, 1)$, then $F^{-1}(U)$ has the c.d.f. $F$.
Solution 2.61 $E\{U(10, 50)\} = 30$; $V\{U(10, 50)\} = \dfrac{1600}{12} = 133.33$; $\sigma\{U(10, 50)\} = \dfrac{40}{2\sqrt{3}} = 11.547$.
Solution 2.62 Let $X = -\log(U)$ where $U$ has a uniform distribution on $(0, 1)$. Then $\Pr\{X \le x\} = \Pr\{U \ge e^{-x}\} = 1 - e^{-x}$ for $x \ge 0$, so $X \sim E(1)$.
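A small simulation illustrates that X = −log(U) indeed behaves like an E(1) random variable (sample size and seed are arbitrary choices here):

import numpy as np
rng = np.random.default_rng(1)
U = rng.uniform(size=100_000)
X = -np.log(U)
print(X.mean(), X.var())                 # both should be close to 1
print(np.mean(X <= 1), 1 - np.exp(-1))   # empirical vs. theoretical cdf at x = 1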
Solution 2.63 (i) Pr{92 < 𝑋 < 108} = 0.4062; (ii) Pr{𝑋 > 105} = 0.3694;
(iii) Pr{2𝑋 + 5 < 200} = Pr{𝑋 < 97.5} = 0.4338.
rv = stats.norm(100, 15)
print('(i)', rv.cdf(108) - rv.cdf(92))
print('(ii)', 1 - rv.cdf(105))
print('(iii)', rv.cdf((200 - 5)/2))
(i) 0.4061971427922976
(ii) 0.36944134018176367
(iii) 0.43381616738909634
Solution 2.64 Let 𝑧 𝛼 denote the 𝛼 quantile of a 𝑁 (0, 1) distribution. Then the two
equations 𝜇 + 𝑧 .9 𝜎 = 15 and 𝜇 + 𝑧 .99 𝜎 = 20 yield the solution 𝜇 = 8.8670 and
𝜎 = 4.7856.
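The two equations can also be solved numerically, e.g.:

import numpy as np
z90 = stats.norm.ppf(0.90)
z99 = stats.norm.ppf(0.99)
# solve mu + z90*sigma = 15 and mu + z99*sigma = 20
mu, sigma = np.linalg.solve([[1, z90], [1, z99]], [15, 20])
print(mu, sigma)   # approximately 8.867 and 4.786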
Solution 2.65 Due to symmetry, Pr{𝑌 > 0} = Pr{𝑌 < 0} = Pr{𝐸 < 𝑣}, where
𝐸 ∼ 𝑁 (0, 1). If the probability of a bit error is 𝛼 = 0.01, then Pr{𝐸 < 𝑣} = Φ(𝑣) =
1 − 𝛼 = 0.99.
Thus 𝑣 = 𝑧 .99 = 2.3263.
Solution 2.66 Let $X_p$ denote the diameter of an aluminum pin and $X_h$ denote the size of a hole drilled in an aluminum plate. If $X_p \sim N(10, 0.02)$ and $X_h \sim N(\mu_d, 0.02)$, then the probability that the pin will not enter the hole is $\Pr\{X_h - X_p < 0\}$. Now $X_h - X_p \sim N\left(\mu_d - 10, \sqrt{0.02^2 + 0.02^2}\right)$, and for $\Pr\{X_h - X_p < 0\} = 0.01$ we obtain $\mu_d = 10.0658$ mm. (The fact that the sum of two independent normal random variables is normally distributed should be given to the student since it has not yet been covered in the text.)
Solution 2.67 For $X_1, \dots, X_n$ i.i.d. $N(\mu, \sigma^2)$, $Y = \sum_{i=1}^n iX_i \sim N(\mu_Y, \sigma_Y^2)$ where
$$\mu_Y = \mu\sum_{i=1}^n i = \mu\frac{n(n+1)}{2}\qquad\text{and}\qquad \sigma_Y^2 = \sigma^2\sum_{i=1}^n i^2 = \sigma^2\frac{n(n+1)(2n+1)}{6}.$$
Solution 2.68 Pr{𝑋 > 300} = Pr{log 𝑋 > 5.7038} = 1 − Φ(0.7038) = 0.24078.
Solution 2.69 For $X \sim e^{N(\mu, \sigma)}$, $X \sim e^Y$ where $Y \sim N(\mu, \sigma)$ and $M_Y(t) = e^{\mu t + \sigma^2 t^2/2}$. Thus
$$\xi = E\{X\} = E\{e^Y\} = M_Y(1) = e^{\mu + \sigma^2/2}.$$
Since $E\{X^2\} = E\{e^{2Y}\} = M_Y(2) = e^{2\mu + 2\sigma^2}$, we have
$$V\{X\} = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2} = e^{2\mu + \sigma^2}(e^{\sigma^2} - 1) = \xi^2(e^{\sigma^2} - 1).$$
Solution 2.74
$$\begin{aligned}
G(t; k, \lambda) &= \frac{\lambda^k}{(k-1)!}\int_0^t x^{k-1}e^{-\lambda x}\,\mathrm{d}x \\
&= \frac{\lambda^k}{k!}t^k e^{-\lambda t} + \frac{\lambda^{k+1}}{k!}\int_0^t x^k e^{-\lambda x}\,\mathrm{d}x \\
&= \frac{\lambda^k}{k!}t^k e^{-\lambda t} + \frac{\lambda^{k+1}}{(k+1)!}t^{k+1}e^{-\lambda t} + \frac{\lambda^{k+2}}{(k+1)!}\int_0^t x^{k+1}e^{-\lambda x}\,\mathrm{d}x \\
&= \cdots \\
&= e^{-\lambda t}\sum_{j=k}^{\infty}\frac{(\lambda t)^j}{j!} = 1 - e^{-\lambda t}\sum_{j=0}^{k-1}\frac{(\lambda t)^j}{j!}.
\end{aligned}$$
Solution 2.75 $\Gamma(1.17) = 0.9267$, $\Gamma\left(\frac{1}{2}\right) = 1.77245$, $\Gamma\left(\frac{3}{2}\right) = \frac{1}{2}\Gamma\left(\frac{1}{2}\right) = 0.88623$.
2 2 2 2
from scipy.special import gamma
print(gamma(1.17), gamma(1 / 2), gamma(3 / 2))
Solution 2.76 The moment generating function of the sum of independent random variables is the product of their respective m.g.f.'s. Thus, if $X_1, \dots, X_k$ are i.i.d. $E(\beta)$, using the result of Exercise 2.73, $M_S(t) = \prod_{i=1}^{k}(1 - \beta t)^{-1} = (1 - \beta t)^{-k}$, $t < \frac{1}{\beta}$, where $S = \sum_{i=1}^{k} X_i$. On the other hand, $(1 - \beta t)^{-k}$ is the m.g.f. of $G(k, \beta)$.
Solution 2.77 The expected value and variance of $W(2, 3.5)$ are
$$E\{W(2, 3.5)\} = 3.5\,\Gamma\left(1 + \tfrac{1}{2}\right) = 3.1018,\qquad V\{W(2, 3.5)\} = (3.5)^2\left[\Gamma\left(1 + \tfrac{2}{2}\right) - \Gamma^2\left(1 + \tfrac{1}{2}\right)\right] = 2.6289.$$
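These values can be verified with scipy's weibull_min distribution (shape 2, scale 3.5):

W = stats.weibull_min(2, scale=3.5)
print(W.mean(), W.var())   # approximately 3.1018 and 2.6289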
Solution 2.78 Let $T$ be the number of days until failure. $T \sim W(1.5, 500) \sim 500\,W(1.5, 1)$.
$$\Pr\{T \ge 600\} = \Pr\left\{W(1.5, 1) \ge \frac{6}{5}\right\} = e^{-(6/5)^{1.5}} = 0.2686.$$

Solution 2.79 Let $X \sim \text{Beta}\left(\frac{1}{2}, \frac{3}{2}\right)$. Then
$$E\{X\} = \frac{\frac{1}{2}}{\frac{1}{2} + \frac{3}{2}} = \frac{1}{4},\qquad V\{X\} = \frac{\frac{1}{2}\cdot\frac{3}{2}}{2^2\cdot 3} = \frac{1}{16}\qquad\text{and}\qquad \sigma\{X\} = \frac{1}{4}.$$
Solution 2.80 Let $X \sim \text{Beta}(\nu, \nu)$. The first four moments are
$$\mu_1 = \frac{\nu}{2\nu} = \frac{1}{2},\qquad \mu_2 = \frac{B(\nu+2, \nu)}{B(\nu, \nu)} = \frac{\nu+1}{2(2\nu+1)},$$
$$\mu_3 = \frac{B(\nu+3, \nu)}{B(\nu, \nu)} = \frac{(\nu+1)(\nu+2)}{2(2\nu+1)(2\nu+2)},\qquad \mu_4 = \frac{B(\nu+4, \nu)}{B(\nu, \nu)} = \frac{(\nu+1)(\nu+2)(\nu+3)}{2(2\nu+1)(2\nu+2)(2\nu+3)}.$$
The variance is $\sigma^2 = \dfrac{1}{4(2\nu+1)}$ and the fourth central moment is
$$\mu_4^* = \mu_4 - 4\mu_3\mu_1 + 6\mu_2\mu_1^2 - 3\mu_1^4 = \frac{3}{16(3 + 8\nu + 4\nu^2)}.$$
Finally, the index of kurtosis is $\beta_2 = \dfrac{\mu_4^*}{\sigma^4} = \dfrac{3(1 + 2\nu)}{3 + 2\nu}$.
Solution 2.81 Let $(X, Y)$ have the joint p.d.f.
$$f(x, y) = \begin{cases} \frac{1}{2}, & (x, y) \in S \\ 0, & \text{otherwise.} \end{cases}$$
(i) The marginal distributions of $X$ and $Y$ have p.d.f.'s
$$f_X(x) = \frac{1}{2}\int_{-1+|x|}^{1-|x|}\mathrm{d}y = 1 - |x|,\quad -1 < x < 1,\qquad\text{and by symmetry}\qquad f_Y(y) = 1 - |y|,\quad -1 < y < 1.$$
(ii) $E\{X\} = E\{Y\} = 0$, $V\{X\} = V\{Y\} = 2\int_0^1 y^2(1-y)\,\mathrm{d}y = 2B(3, 2) = \frac{1}{6}$.
Solution 2.82 The marginal p.d.f. of $Y$ is $f(y) = e^{-y}$, $y > 0$, that is, $Y \sim E(1)$. The conditional p.d.f. of $X$, given $Y = y$, is $f(x \mid y) = \frac{1}{y}e^{-x/y}$, which is the p.d.f. of an exponential with parameter $y$. Thus, $E\{X \mid Y = y\} = y$, and $E\{X\} = E\{E\{X \mid Y\}\} = E\{Y\} = 1$. Also,
$$E\{XY\} = E\{Y\,E\{X \mid Y\}\} = E\{Y^2\} = 2,$$
$$\sigma_X^2 = E\{V\{X \mid Y\}\} + V\{E\{X \mid Y\}\} = E\{Y^2\} + V\{Y\} = 2 + 1 = 3.$$
The correlation between $X$ and $Y$ is $\rho_{XY} = \frac{1}{\sqrt{3}}$.
Solution 2.83 Let $(X, Y)$ have joint p.d.f. $f(x, y) = 2$ if $(x, y) \in T$, and $0$ otherwise. The marginal densities of $X$ and $Y$ are
$$f_X(x) = 2(1 - x),\ 0 \le x \le 1,\qquad f_Y(y) = 2(1 - y),\ 0 \le y \le 1.$$
Notice that $f(x, y) \ne f_X(x)f_Y(y)$ for $x = \frac{1}{2}$, $y = \frac{1}{4}$. Thus, $X$ and $Y$ are dependent.
$$\mathrm{cov}(X, Y) = E\{XY\} - E\{X\}E\{Y\} = E\{XY\} - \frac{1}{9},$$
$$E\{XY\} = 2\int_0^1\int_0^{1-x} xy\,\mathrm{d}y\,\mathrm{d}x = \int_0^1 x(1-x)^2\,\mathrm{d}x = B(2, 3) = \frac{1}{12}.$$
Hence, $\mathrm{cov}(X, Y) = \frac{1}{12} - \frac{1}{9} = -\frac{1}{36}$.
Solution 2.84 $J \mid N \sim B(N, p)$; $N \sim P(\lambda)$. $E\{N\} = \lambda$, $V\{N\} = \lambda$, $E\{J\} = \lambda p$.
$$V\{J\} = E\{V\{J \mid N\}\} + V\{E\{J \mid N\}\} = E\{Np(1-p)\} + V\{Np\} = \lambda p(1-p) + p^2\lambda = \lambda p,$$
$$E\{JN\} = E\{N\,E\{J \mid N\}\} = pE\{N^2\} = p(\lambda + \lambda^2).$$
Hence, $\mathrm{cov}(J, N) = p\lambda(1 + \lambda) - p\lambda^2 = p\lambda$ and $\rho_{JN} = \dfrac{p\lambda}{\sqrt{\lambda}\sqrt{\lambda p}} = \sqrt{p}$.
Solution 2.85 Let $X \sim G(2, 100) \sim 100\,G(2, 1)$ and $Y \sim W(1.5, 500) \sim 500\,W(1.5, 1)$. Then $XY \sim 5\times 10^4\, G(2, 1)\cdot W(1.5, 1)$ and $V\{XY\} = 25\times 10^8\cdot V\{GW\}$, where $G \sim G(2, 1)$ and $W \sim W\left(\frac{3}{2}, 1\right)$.
$$V\{GW\} = E\{G^2\}V\{W\} + E^2\{W\}V\{G\} = 6\left[\Gamma\left(1 + \tfrac{4}{3}\right) - \Gamma^2\left(1 + \tfrac{2}{3}\right)\right] + 2\,\Gamma^2\left(1 + \tfrac{2}{3}\right) = 3.88404.$$
Thus $V\{XY\} = 9.7101\times 10^9$.
Solution 2.87 Using the notation of Example 2.34, the joint p.d.f. of $J_1$ and $J_2$ is
$$p(j_1, j_2) = \frac{\binom{20}{j_1}\binom{50}{j_2}\binom{30}{20 - j_1 - j_2}}{\binom{100}{20}},\qquad 0 \le j_1, j_2;\ j_1 + j_2 \le 20.$$
The marginal distribution of $J_1$ is $H(100, 20, 20)$. The marginal distribution of $J_2$ is $H(100, 50, 20)$. Accordingly,
$$V\{J_1\} = 20\times 0.2\times 0.8\times\left(1 - \frac{19}{99}\right) = 2.585859,\qquad V\{J_2\} = 20\times 0.5\times 0.5\times\left(1 - \frac{19}{99}\right) = 4.040404.$$
The conditional distribution of $J_1$, given $J_2$, is $H(50, 20, 20 - J_2)$. Hence,
$$E\{J_1 J_2\} = E\{E\{J_1 J_2 \mid J_2\}\} = E\left\{J_2(20 - J_2)\cdot\frac{2}{5}\right\} = 8E\{J_2\} - 0.4E\{J_2^2\} = 80 - 0.4\times 104.040404 = 38.38381$$
and $\mathrm{cov}(J_1, J_2) = -1.61616$. Finally, the correlation between $J_1$ and $J_2$ is
$$\rho = \frac{-1.61616}{\sqrt{2.585859\times 4.040404}} = -0.50.$$
$$\begin{aligned}
\Pr\{X_{(1)} \le x \mid J \ge 1\} &= \sum_{k=1}^{10} b(k; 10, 0.95)\left(1 - e^{-kx/10}\right) \\
&= 1 - \sum_{k=1}^{10}\binom{10}{k}\left(0.95\,e^{-x/10}\right)^k(0.05)^{10-k} \\
&= 1 - \left[0.05 + 0.95\,e^{-x/10}\right]^{10} + (0.05)^{10} \\
&\approx 1 - \left(0.05 + 0.95\,e^{-x/10}\right)^{10}.
\end{aligned}$$
$$= 2772\sum_{j=0}^{5}(-1)^j\binom{5}{j}\frac{1}{\lambda(6+j)^2} = 0.73654/\lambda.$$
Solution 2.95 Let $U \sim N(0, 1)$ and $X \sim N(\mu, \sigma)$. We assume that $U$ and $X$ are independent. Then $\Phi(X) = \Pr\{U < X \mid X\}$ and therefore
$$E\{\Phi(X)\} = E\{\Pr\{U < X \mid X\}\} = \Pr\{U < X\} = \Pr\{U - X < 0\} = \Phi\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right).$$
The last equality follows from the fact that $U - X \sim N\left(-\mu, \sqrt{1 + \sigma^2}\right)$.
Solution 2.99 Let $Y = X_1 + X_2$ where $X_1, X_2$ are i.i.d. uniform on $(0, 1)$. Then the p.d.f.'s are $f_1(x) = f_2(x) = I\{0 < x < 1\}$ and
$$g(y) = \begin{cases} \int_0^y \mathrm{d}x = y, & 0 \le y < 1 \\ \int_{y-1}^1 \mathrm{d}x = 2 - y, & 1 \le y \le 2. \end{cases}$$
Solution 2.100 𝑋1 , 𝑋2 are i.i.d. 𝐸 (1). 𝑈 = 𝑋1 − 𝑋2 . Pr{𝑈 ≤ 𝑢} = 0 𝑒 −𝑥 Pr{𝑋1 ≤
𝑢 + 𝑥} d𝑥. Notice that −∞ < 𝑢 < ∞ and Pr{𝑋1 ≤ 𝑢 + 𝑥} = 0 if 𝑥 + 𝑢 < 0. Let
𝑎 + = max(𝑎, 0). Then
∫ ∞
+
Pr{𝑈 ≤ 𝑢} = 𝑒 −𝑥 (1 − 𝑒 −(𝑢+𝑥) ) d𝑥
0
∫ ∞
+
=1− 𝑒 −𝑥−(𝑢+𝑥) d𝑥
0
1 − 12 𝑒 −𝑢 , if 𝑢 ≥ 0
=
1 𝑒 − |𝑢 | ,
if 𝑢 < 0
2
1
Thus, the p.d.f. of 𝑈 is 𝑔(𝑢) = 𝑒 −|𝑢 | , −∞ < 𝑢 < ∞.
2
Solution 2.101 $T = X_1 + \cdots + X_{20} \sim N(20\mu, 20\sigma^2)$. $\Pr\{T \le 50\} \approx \Phi\left(\dfrac{50 - 40}{44.7214}\right) = 0.5885$.
Solution 2.102 $X \sim B(200, 0.15)$. $\mu = np = 30$, $\sigma = \sqrt{np(1-p)} = 5.0497$.
$$\Pr\{25 < X < 35\} \approx \Phi\left(\frac{34.5 - 30}{5.0497}\right) - \Phi\left(\frac{25.5 - 30}{5.0497}\right) = 0.6271.$$
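The normal approximation with continuity correction can be compared against the exact binomial probability:

B = stats.binom(200, 0.15)
Z = stats.norm(30, 5.0497)
print(B.cdf(34) - B.cdf(25))       # exact
print(Z.cdf(34.5) - Z.cdf(25.5))   # normal approximation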
Solution 2.103 $X \sim P(200)$. $\Pr\{190 < X < 210\} \approx 2\Phi\left(\dfrac{9.5}{\sqrt{200}}\right) - 1 = 0.4983$.
Solution 2.104 $X \sim \text{Beta}(3, 5)$. $\mu = E\{X\} = \frac{3}{8} = 0.375$. $\sigma = \sqrt{V\{X\}} = \left(\dfrac{3\cdot 5}{8^2\cdot 9}\right)^{1/2} = 0.161374$.
$$\Pr\{|\bar{X}_{200} - 0.375| < 0.2282\} \approx 2\Phi\left(\frac{\sqrt{200}\cdot 0.2282}{0.161374}\right) - 1 = 1.$$
Solution 2.105 𝑡 .95 [10] = 1.8125, 𝑡 .95 [15] = 1.7531, 𝑡 .95 [20] = 1.7247.
print(stats.t.ppf(0.95, 10))
print(stats.t.ppf(0.95, 15))
print(stats.t.ppf(0.95, 20))
1.8124611228107335
1.7530503556925547
1.7247182429207857
Solution 2.106 $F_{.95}[10, 30] = 2.1646$, $F_{.95}[15, 30] = 2.0148$, $F_{.95}[20, 30] = 1.9317$.
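The values were presumably produced analogously to the previous solution, e.g.:

print(stats.f.ppf(0.95, 10, 30))
print(stats.f.ppf(0.95, 15, 30))
print(stats.f.ppf(0.95, 20, 30))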
2.164579917125473
2.0148036912954885
1.931653475236928
Solution 2.107 The solution to this problem is based on the fact, which is not discussed in the text, that $t[\nu]$ is distributed like the ratio of two independent random variables, $N(0, 1)$ and $\sqrt{\chi^2[\nu]/\nu}$. Accordingly, $t[n] \sim \dfrac{N(0, 1)}{\sqrt{\chi^2[n]/n}}$, where $N(0, 1)$ and $\chi^2[n]$ are independent. Hence $t^2[n] \sim \dfrac{(N(0, 1))^2}{\chi^2[n]/n} \sim F[1, n]$. Thus, since $\Pr\{F[1, n] \le F_{1-\alpha}[1, n]\} = 1 - \alpha$,
$$\Pr\left\{-\sqrt{F_{1-\alpha}[1, n]} \le t[n] \le \sqrt{F_{1-\alpha}[1, n]}\right\} = 1 - \alpha.$$
It follows that $\sqrt{F_{1-\alpha}[1, n]} = t_{1-\alpha/2}[n]$, or $F_{1-\alpha}[1, n] = t^2_{1-\alpha/2}[n]$. (If you assign this problem, please inform the students of the above fact.)
Solution 2.108 A random variable $F[\nu_1, \nu_2]$ is distributed like the ratio of two independent random variables $\chi^2[\nu_1]/\nu_1$ and $\chi^2[\nu_2]/\nu_2$. Accordingly, $F[\nu_1, \nu_2] \sim \dfrac{\chi^2[\nu_1]/\nu_1}{\chi^2[\nu_2]/\nu_2}$ and
$$\begin{aligned}
1 - \alpha &= \Pr\{F[\nu_1, \nu_2] \le F_{1-\alpha}[\nu_1, \nu_2]\} = \Pr\left\{\frac{\chi_1^2[\nu_1]/\nu_1}{\chi_2^2[\nu_2]/\nu_2} \le F_{1-\alpha}[\nu_1, \nu_2]\right\} \\
&= \Pr\left\{\frac{\chi_2^2[\nu_2]/\nu_2}{\chi_1^2[\nu_1]/\nu_1} \ge \frac{1}{F_{1-\alpha}[\nu_1, \nu_2]}\right\} = \Pr\left\{F[\nu_2, \nu_1] \ge \frac{1}{F_{1-\alpha}[\nu_1, \nu_2]}\right\} = \Pr\{F[\nu_2, \nu_1] \ge F_{\alpha}[\nu_2, \nu_1]\}.
\end{aligned}$$
Hence $F_{1-\alpha}[\nu_1, \nu_2] = \dfrac{1}{F_{\alpha}[\nu_2, \nu_1]}$.
Solution 2.109 Using the fact that $t[\nu] \sim \dfrac{N(0, 1)}{\sqrt{\chi^2[\nu]/\nu}}$, where $N(0, 1)$ and $\chi^2[\nu]$ are independent,
$$V\{t[\nu]\} = V\left\{\frac{N(0, 1)}{\sqrt{\chi^2[\nu]/\nu}}\right\} = E\left\{V\left\{\frac{N(0, 1)}{\sqrt{\chi^2[\nu]/\nu}}\,\Bigg|\,\chi^2[\nu]\right\}\right\} + V\left\{E\left\{\frac{N(0, 1)}{\sqrt{\chi^2[\nu]/\nu}}\,\Bigg|\,\chi^2[\nu]\right\}\right\}.$$
By independence, $V\left\{\frac{N(0, 1)}{\sqrt{\chi^2[\nu]/\nu}}\,\Big|\,\chi^2[\nu]\right\} = \frac{\nu}{\chi^2[\nu]}$, and $E\left\{\frac{N(0, 1)}{\sqrt{\chi^2[\nu]/\nu}}\,\Big|\,\chi^2[\nu]\right\} = 0$. Thus, $V\{t[\nu]\} = \nu E\left\{\frac{1}{\chi^2[\nu]}\right\}$. Since $\chi^2[\nu] \sim G\left(\frac{\nu}{2}, 2\right) \sim 2G\left(\frac{\nu}{2}, 1\right)$,
$$E\left\{\frac{1}{\chi^2[\nu]}\right\} = \frac{1}{2}\cdot\frac{1}{\Gamma\left(\frac{\nu}{2}\right)}\int_0^{\infty} x^{\nu/2 - 2}e^{-x}\,\mathrm{d}x = \frac{1}{2}\cdot\frac{\Gamma\left(\frac{\nu}{2} - 1\right)}{\Gamma\left(\frac{\nu}{2}\right)} = \frac{1}{2}\cdot\frac{1}{\frac{\nu}{2} - 1} = \frac{1}{\nu - 2}.$$
Finally, $V\{t[\nu]\} = \dfrac{\nu}{\nu - 2}$, $\nu > 2$.
Solution 2.110 $E\{F[3, 10]\} = \dfrac{10}{8} = 1.25$, $V\{F[3, 10]\} = \dfrac{2\cdot 10^2\cdot 11}{3\cdot 8^2\cdot 6} = 1.9097$.
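A quick check with scipy:

F = stats.f(3, 10)
print(F.mean(), F.var())   # approximately 1.25 and 1.9097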
Chapter 3
Statistical Inference and Bootstrapping
import random
import math
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import pingouin as pg
import mistat
import os
os.environ['OUTDATED_IGNORE'] = '1'
Solution 3.1 By the WLLN, for any 𝜖 > 0, lim𝑛→∞ Pr{|𝑀𝑙 − 𝜇𝑙 | < 𝜖 } = 1. Hence,
𝑀𝑙 is a consistent estimator of the 𝑙-th moment.
Solution 3.2 Using the CLT, $\Pr\{|\bar{X}_n - \mu| < 1\} \approx 2\Phi\left(\frac{\sqrt{n}}{\sigma}\right) - 1$. To determine the sample size $n$ so that this probability is 0.95 we set $2\Phi\left(\frac{\sqrt{n}}{\sigma}\right) - 1 = 0.95$ and solve for $n$. This gives $\frac{\sqrt{n}}{\sigma} = z_{.975} = 1.96$. Thus $n \ge \sigma^2(1.96)^2 = 424$ for $\sigma = 10.5$.
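A direct computation of the required sample size:

import math
n = math.ceil((stats.norm.ppf(0.975) * 10.5)**2)
print(n)   # 424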
Solution 3.3 $\hat{\xi}_p = \bar{X}_n + z_p\hat{\sigma}_n$, where $\hat{\sigma}_n^2 = \frac{1}{n}\sum (X_i - \bar{X}_n)^2$.
Solution 3.4 Let (𝑋1 , 𝑌1 ), . . . , (𝑋𝑛 , 𝑌𝑛 ) be a random sample from a bivariate normal
distribution with density 𝑓 (𝑥, 𝑦; 𝜇, 𝜂, 𝜎𝑋 , 𝜎𝑌 , 𝜌) as given in Eq. (4.6.6). Let 𝑍𝑖 =
𝑋𝑖𝑌𝑖 for 𝑖 = 1, . . . , 𝑛. Then the first moment of 𝑍 is given by
$$\mu_1(F_Z) = E\{Z\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f(x, y; \mu, \eta, \sigma_X, \sigma_Y, \rho)\,\mathrm{d}x\,\mathrm{d}y = \mu\eta + \rho\sigma_X\sigma_Y.$$
Using this fact, as well as the first 2 moments of 𝑋 and 𝑌 , we get the following
moment equations:
$$\frac{1}{n}\sum_{i=1}^{n} X_i = \mu,\qquad \frac{1}{n}\sum_{i=1}^{n} Y_i = \eta,$$
$$\frac{1}{n}\sum_{i=1}^{n} X_i^2 = \sigma_X^2 + \mu^2,\qquad \frac{1}{n}\sum_{i=1}^{n} Y_i^2 = \sigma_Y^2 + \eta^2,$$
$$\frac{1}{n}\sum_{i=1}^{n} X_iY_i = \mu\eta + \rho\sigma_X\sigma_Y.$$
$$\hat{\nu}_1 = M_1(M_1 - M_2)/\hat{\sigma}_n^2\qquad\text{and}\qquad \hat{\nu}_2 = (M_1 - M_2)(1 - M_1)/\hat{\sigma}_n^2.$$
Solution 3.6 $V\{\bar{Y}_w\} = \dfrac{1}{\left(\sum_{i=1}^k w_i\right)^2}\sum_{i=1}^{k} w_i^2\dfrac{\sigma_i^2}{n_i}$. Let $\lambda_i = \dfrac{w_i}{\sum_{j=1}^k w_j}$, $\sum_{i=1}^{k}\lambda_i = 1$. We find weights $\lambda_i$ which minimize $V\{\bar{Y}_w\}$ under the constraint $\sum_{i=1}^{k}\lambda_i = 1$. The Lagrangian is
$$L(\lambda_1, \dots, \lambda_k, \rho) = \sum_{i=1}^{k}\lambda_i^2\frac{\sigma_i^2}{n_i} + \rho\left(\sum_{i=1}^{k}\lambda_i - 1\right).$$
Differentiating with respect to $\lambda_i$, we get
$$\frac{\partial}{\partial\lambda_i}L(\lambda_1, \dots, \lambda_k, \rho) = 2\lambda_i\frac{\sigma_i^2}{n_i} + \rho,\quad i = 1, \dots, k,\qquad\text{and}\qquad \frac{\partial}{\partial\rho}L(\lambda_1, \dots, \lambda_k, \rho) = \sum_{i=1}^{k}\lambda_i - 1.$$
Also
$$\mathrm{cov}(\hat{\beta}_0, \hat{\beta}_1) = \mathrm{cov}\left(\sum_{i=1}^{n}\left(\frac{1}{n} - w_i\bar{x}_n\right)Y_i,\ \sum_{i=1}^{n} w_iY_i\right) = \sigma^2\sum_{i=1}^{n}\left(\frac{1}{n} - w_i\bar{x}_n\right)w_i = -\sigma^2\frac{\bar{x}_n}{SS_x}.$$
$$\rho_{\hat{\beta}_0, \hat{\beta}_1} = \frac{-\sigma^2\dfrac{\bar{x}_n}{SS_x}}{\left[\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}_n^2}{SS_x}\right)\dfrac{\sigma^2}{SS_x}\right]^{1/2}} = -\frac{\bar{x}_n}{\left(\dfrac{1}{n}\sum_{i=1}^{n} x_i^2\right)^{1/2}}.$$
$$M_X(t) = \frac{p^2}{(1 - (1-p)e^t)^2},\qquad t < -\log(1-p).$$
Let $X_1, X_2, \dots, X_n$ be i.i.d. $NB(2, p)$; then the m.g.f. of $T_n = \sum_{i=1}^{n} X_i$ is
$$M_{T_n}(t) = \frac{p^{2n}}{(1 - (1-p)e^t)^{2n}},\qquad t < -\log(1-p).$$
Thus, $T_n \sim NB(2n, p)$. According to Example 3.4, the MLE of $p$, based on $T_n$ (which is a sufficient statistic) is $\hat{p}_n = \dfrac{2n}{T_n + 2n} = \dfrac{2}{\bar{X}_n + 2}$, where $\bar{X}_n = T_n/n$ is the sample mean.
(i) According to the WLLN, $\bar{X}_n$ converges in probability to $E\{X_1\} = \dfrac{2(1-p)}{p}$. Substituting $\dfrac{2(1-p)}{p}$ for $\bar{X}_n$ in the formula of $\hat{p}_n$ we obtain $p^* = \dfrac{2}{\frac{2(1-p)}{p} + 2} = p$. This shows that the limit in probability, as $n \to \infty$, of $\hat{p}_n$ is $p$.
(ii) Substituting $k = 2n$ in the formulas of Example 3.4 we obtain
$$\text{Bias}(\hat{p}_n) \approx \frac{3p(1-p)}{4n}\qquad\text{and}\qquad V\{\hat{p}_n\} \approx \frac{p^2(1-p)}{2n}.$$
Solution 3.13 The likelihood function of $\mu$ and $\beta$ is
$$L(\mu, \beta) = I\{X_{(1)} \ge \mu\}\,\frac{1}{\beta^n}\exp\left\{-\frac{1}{\beta}\sum_{i=1}^{n}(X_{(i)} - X_{(1)}) - \frac{n}{\beta}(X_{(1)} - \mu)\right\},$$
where $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ are the order statistics.
(ii) Furthermore, $L^*(\beta)$ is maximized by $\hat{\beta}_n = \frac{1}{n}\sum_{i=2}^{n}(X_{(i)} - X_{(1)})$. The MLEs are $\hat{\mu} = X_{(1)}$ and $\hat{\beta}_n = \frac{1}{n}\sum_{i=2}^{n}(X_{(i)} - X_{(1)})$.
(iii) $X_{(1)}$ is distributed like $\mu + \frac{1}{n}E(\beta)$, with p.d.f.
$$f_{(1)}(x; \mu, \beta) = I\{x \ge \mu\}\,\frac{n}{\beta}e^{-\frac{n}{\beta}(x - \mu)}.$$
Thus, the joint p.d.f. of $(X_1, \dots, X_n)$ is factored into a product of the p.d.f. of $X_{(1)}$ and a function of $\hat{\beta}_n$, which does not depend on $X_{(1)}$ (nor on $\mu$). This implies that $X_{(1)}$ and $\hat{\beta}_n$ are independent. $V\{\hat{\mu}\} = V\{X_{(1)}\} = \frac{\beta^2}{n^2}$. It can be shown that $\hat{\beta}_n \sim \frac{1}{n}G(n-1, \beta)$. Accordingly, $V\{\hat{\beta}_n\} = \frac{n-1}{n^2}\beta^2 = \frac{1}{n}\left(1 - \frac{1}{n}\right)\beta^2$.
Solution 3.14 In sampling with replacement, the number of defective items in the
sample, 𝑋, has the binomial distribution 𝐵(𝑛, 𝑝). We test the hypotheses 𝐻0 : 𝑝 ≤
0.03 against 𝐻1 : 𝑝 > 0.03. 𝐻0 is rejected if 𝑋 > 𝐵−1 (1 − 𝛼, 20, 0.03). For 𝛼 = 0.05
the rejection criterion is 𝑘 𝛼 = 𝐵−1 (0.95, 20, 0.03) = 2. Since the number of defective
items in the sample is 𝑋 = 2, 𝐻0 is not rejected at the 𝛼 = 0.05 significance level.
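The critical value can be obtained directly from the binomial quantile function:

print(stats.binom.ppf(0.95, 20, 0.03))   # 2.0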
Fig. 3.1 The OC function B(2; 30, p)

Solution 3.15 The OC function is $OC(p) = B(2; 30, p)$, $0 < p < 1$. A plot of this OC function is given in Fig. 3.1.
Solution 3.16
$$1 - \Phi\left(\frac{p_1 - p_0}{\sqrt{p_1 q_1}}\sqrt{n} - z_{1-\alpha}\sqrt{\frac{p_0 q_0}{p_1 q_1}}\right) = \beta,$$
or, equivalently,
$$\frac{p_1 - p_0}{\sqrt{p_1 q_1}}\sqrt{n} - z_{1-\alpha}\sqrt{\frac{p_0 q_0}{p_1 q_1}} = z_{1-\beta}.$$
This gives
$$n \approx \frac{\left(z_{1-\alpha}\sqrt{p_0 q_0} + z_{1-\beta}\sqrt{p_1 q_1}\right)^2}{(p_1 - p_0)^2} = \frac{(1.645)^2\left(\sqrt{0.01\times 0.99} + \sqrt{0.03\times 0.97}\right)^2}{(0.02)^2} = 494.$$
For this sample size, the critical value is $k_\alpha = np_0 + z_{1-\alpha}\sqrt{np_0 q_0} = 8.58$. Thus, $H_0$ is rejected if there are more than 8 "successes" in a sample of size 494.
Solution 3.17 $\bar{X}_n \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
(i) If $\mu = \mu_0$ the probability that $\bar{X}_n$ will be outside the control limits is
$$\Pr\left\{\bar{X}_n < \mu_0 - \frac{3\sigma}{\sqrt{n}}\right\} + \Pr\left\{\bar{X}_n > \mu_0 + \frac{3\sigma}{\sqrt{n}}\right\} = \Phi(-3) + 1 - \Phi(3) = 0.0027.$$
socell = mistat.load_data('SOCELL')
t1 = socell['t1']
pvalue 0.35
socell = mistat.load_data('SOCELL')
t2 = socell['t2']
pvalue 0.03
Solution 3.20 Let $n = 30$, $\alpha = 0.01$. The $OC(\delta)$ function for a one-sided $t$-test is
$$OC(\delta) = 1 - \Phi\left(\frac{\delta\sqrt{30} - 2.462\left(1 - \frac{1}{232}\right)}{\left(1 + \frac{6.0614}{58}\right)^{1/2}}\right) = 1 - \Phi(5.2117\,\delta - 2.3325).$$
a = np.sqrt(30)
b = 2.462 * (1 - 1/232)
f = np.sqrt(1 + 6.0614 / 58)
delta = np.arange(0, 1.1, 0.1)  # grid of delta values for the table below
OC_delta = 1 - stats.norm.cdf((a * delta - b) / f)
𝛿 OC(𝛿)
0.0 0.990164
0.1 0.964958
0.2 0.901509
0.3 0.779063
0.4 0.597882
0.5 0.392312
0.6 0.213462
0.7 0.094149
0.8 0.033120
0.9 0.009188
1.0 0.001994
Solution 3.21 Let $n = 31$, $\alpha = 0.10$. The OC function for testing $H_0: \sigma^2 \le \sigma_0^2$ against $H_1: \sigma^2 > \sigma_0^2$ is
$$OC(\sigma^2) = \Pr\left\{S^2 \le \frac{\sigma_0^2}{n-1}\chi^2_{0.9}[n-1]\right\} = \Pr\left\{\chi^2[30] \le \frac{\sigma_0^2}{\sigma^2}\chi^2_{0.9}[30]\right\} = 1 - P\left(\frac{30}{2} - 1;\ \frac{\sigma_0^2}{\sigma^2}\cdot\frac{\chi^2_{0.9}[30]}{2}\right).$$
The values of $OC(\sigma^2)$ for $\sigma^2 = 1.0, 1.1, \dots, 2.0$ are given in the following table (here $\sigma_0^2 = 1$):
𝜎2 OC(𝜎 2 )
1.0 0.900000
1.1 0.810804
1.2 0.700684
1.3 0.582928
1.4 0.469471
1.5 0.368201
1.6 0.282781
1.7 0.213695
1.8 0.159540
1.9 0.118063
2.0 0.086834
Fig. 3.2 Comparison of the exact OC(p) with its normal approximation

$$OC(p) = 1 - \Phi\left(\frac{p - p_0}{\sqrt{pq}}\sqrt{n} - z_{1-\alpha}\sqrt{\frac{p_0 q_0}{pq}}\right),\qquad p \ge p_0.$$
In Fig. 3.2, we present the graph of $OC(p)$, for $p_0 = 0.1$, $n = 100$, $\alpha = 0.05$, both using the exact solution and the normal approximation.
n = 100
p0 = 0.1
alpha = 0.05
df = len(data) - 1   # `data` holds the 20 observations (not shown in this extract)
mean = np.mean(data)
sem = stats.sem(data)
# 99% confidence interval for the mean based on the t-distribution
print(stats.t.interval(0.99, df, loc=mean, scale=sem))
(20.136889216656858, 21.38411078334315)
Confidence Intervals

Variable   N    Mean     StDev   SE      99.0% C.I. Mean
Sample     20   20.760   0.975   0.218   (20.137, 21.384)
(ii) A 99% C.I. for 𝜎 2 is (0.468, 2.638).
0.46795850248657883
2.6380728125212016
Solution 3.27 Let 𝑋 ∼ 𝐵(𝑛, 𝜃). For 𝑋 = 17 and 𝑛 = 20, a confidence interval for 𝜃,
at level 0.95, is (0.6211, 0.9679).
alpha = 1 - 0.95
X = 17
n = 20
F1 = stats.f(2*(n-X+1), 2*X).ppf(1 - alpha/2)
F2 = stats.f(2*(X+1), 2*(n-X)).ppf(1 - alpha/2)
pL = X / (X + (n-X+1) * F1)
pU = (X+1) * F2 / (n-X + (X+1) * F2)
print(pL, pU)
0.6210731734546859 0.9679290628145363
Solution 3.28 From the data we have $n = 10$ and $T_{10} = 134$. For $\alpha = 0.05$, $\lambda_L = 11.319$ and $\lambda_U = 15.871$.

n = 10       # number of observations
T_n = 134    # total count in the sample
alpha = 0.05
# exact solution
print(stats.chi2(2 * T_n + 2).ppf(alpha/2) / (2 * n))
print(stats.chi2(2 * T_n + 2).ppf(1 - alpha/2) / (2 * n))
# approximate solution
nu = 2 * T_n + 2
print((nu + stats.norm.ppf(alpha/2) * np.sqrt(2 * nu)) / (2 * n))
print((nu + stats.norm.ppf(1-alpha/2) * np.sqrt(2 * nu)) / (2 * n))
11.318870163746238
15.870459268116013
11.222727638613012
15.777272361386988
Solution 3.29 For 𝑛 = 20, 𝜎 = 5, 𝑌¯20 = 13.75, 𝛼 = 0.05 and 𝛽 = 0.1, the tolerance
interval is (3.33, 24.17).
yarnstrg = mistat.load_data('YARNSTRG')
n = len(yarnstrg)
Ybar = yarnstrg.mean()
S = yarnstrg.std()
0.7306652047594424 5.117020795240556
Fig. 3.3 Q–Q plot of ISC-t1 (SOCELL.csv); R² = 0.911
Solution 3.33 As is shown in the normal probability plots in Fig. 3.4, the hypothesis
of normality is not rejected in either case.
Solution 3.34 Frequency distribution for turn diameter:
Fig. 3.4 Normal Q–Q plots; R² = 0.985 and R² = 0.986
The expected frequencies were computed for 𝑁 (35.5138, 3.3208). Here 𝜒2 = 12.4,
d.f. = 8 and the 𝑃 value is 0.134. The differences from normal are not significant.
observed expected (O-E)^2/E
turn
[28, 31) 11 8.197333 0.958231
[31, 32) 8 6.318478 0.447500
[32, 33) 9 8.668695 0.012662
[33, 34) 6 10.869453 2.181487
[34, 35) 18 12.455897 2.467673
[35, 36) 8 13.045357 1.951317
[36, 37) 13 12.486789 0.021093
[37, 38) 6 10.923435 2.219102
[38, 39) 9 8.733354 0.008141
[39, 40) 8 6.381395 0.410550
[40, 44) 13 9.052637 1.721231
chi2-statistic of fit 12.398987638400024
Solution 3.35 $D^*_{109} = 0.0702$. For $\alpha = 0.05$, $k^*_\alpha = 0.895\Big/\left(\sqrt{109} - 0.01 + \frac{0.85}{\sqrt{109}}\right) = 0.0851$. The deviations from the normal distribution are not significant.
Solution 3.36 For $X \sim P(100)$, $n_0 = P^{-1}\left(\frac{0.2}{0.3}, 100\right) = 100 + z_{.67}\times 10 = 105$.
random.seed(1)
car = mistat.load_data('CAR')
Fig. 3.5 Q–Q plot of the means of samples from the mpg data; R² = 0.992
S = np.std(car['mpg'], ddof=1)
print('standard deviation of mpg', S)
print('S/8', S/8)
S_resample = np.std(means, ddof=1)
print('S.E.\{X\}', S_resample)
random.seed(1)
yarnstrg = mistat.load_data('YARNSTRG')

def confidence_interval(x):
    # two-sigma confidence interval for the mean (assumed definition;
    # the original helper is not shown in this extract)
    m = np.mean(x)
    se = np.std(x, ddof=1) / np.sqrt(len(x))
    return (m - 2 * se, m + 2 * se)

mean = np.mean(yarnstrg)
outside = 0
for _ in range(500):
    sample = random.choices(yarnstrg, k=30)
    ci = confidence_interval(sample)
    if mean < ci[0] or ci[1] < mean:
        outside += 1
hat_alpha = outside / 500
ci = confidence_interval(yarnstrg)
print(f' Mean: {mean}')
print(f' 2-sigma-CI: {ci[0]:.1f} - {ci[1]:.1f}')
print(f' proportion outside: {hat_alpha:.3f}')
Mean: 2.9238429999999993
2-sigma-CI: 2.7 - 3.1
proportion outside: 0.068
random.seed(1)
car = mistat.load_data('CAR')
us_cars = car[car['origin'] == 1]
us_turn = list(us_cars['turn'])
sample_means = []
for _ in range(100):
x = random.choices(us_turn, k=58)
sample_means.append(np.mean(x))
0.23
We obtained 𝑃˜ = 0.23. The mean 𝑋¯ = 37.203 is not significantly larger than 37.
stats.binom(50, 0.03).ppf(0.95)
4.0
random.seed(1)
cyclt = mistat.load_data('CYCLT')
# The compute_bootci calls are not shown in this extract; a plausible
# reconstruction using pingouin follows.
B = pg.compute_bootci(cyclt, func='mean', n_boot=1000,
                      confidence=0.95, return_dist=True, seed=1)
ci_mean, dist_mean = B
print(f' Mean: {np.mean(dist_mean):.3f}')
print(f' 95%-CI: {ci_mean[0]:.3f} - {ci_mean[1]:.3f}')
B = pg.compute_bootci(cyclt, func='std', n_boot=1000,
                      confidence=0.95, return_dist=True, seed=1)
ci_std, dist_std = B
print(f' Mean: {np.mean(dist_std):.3f}')
print(f' 95%-CI: {ci_std[0]:.3f} - {ci_std[1]:.3f}')
Mean: 0.652
95%-CI: 0.550 - 0.760
Mean: 0.370
95%-CI: 0.340 - 0.400
cyclt = mistat.load_data('CYCLT')
ebd = {}
for quantile in (1, 2, 3):
    B = pg.compute_bootci(cyclt, func=lambda x: np.quantile(x, 0.25 * quantile),
                          n_boot=1000, confidence=0.95, return_dist=True, seed=1)
    ci, dist = B
    ebd[quantile] = dist
    print(f'Quantile {quantile}: {np.mean(dist):.3f} 95%-CI: {ci[0]:.3f} - {ci[1]:.3f}')

Fig. 3.7 Histograms of EBD for quartiles for the CYCLT.csv data
socell = mistat.load_data('SOCELL')
t1 = socell['t1']
t2 = socell['t2']
# assumed helper (not shown in this extract): correlation of a resampled set of row indices
idx = list(range(len(t1)))
def sample_correlation(x):
    x = [int(i) for i in x]
    return np.corrcoef(t1.iloc[x], t2.iloc[x])[0, 1]
B = pg.compute_bootci(idx, func=sample_correlation,
                      n_boot=1000, confidence=0.95, return_dist=True, seed=1)
ci, dist = B
print(f'rho_XY: {np.mean(dist):.3f} 95%-CI: {ci[0]:.3f} - {ci[1]:.3f}')
Fig. 3.8 Histogram of the EBD for the correlation for the SOCELL.csv data
car = mistat.load_data('CAR')
mpg = car['mpg']
hp = car['hp']
idx = list(range(len(mpg)))
sample_intercept = []
sample_slope = []
for _ in range(1000):
x = random.choices(idx, k=len(idx))
result = stats.linregress(hp[x], mpg[x])
sample_intercept.append(result.intercept)
sample_slope.append(result.slope)
hm = np.mean(hp)
print(np.std(sample_intercept))
print(np.std(sample_slope))
(iii) The bootstrap S.E. of slope and intercept are 1.017 and 0.00747, respectively. The standard errors of a and b, according to the formulas of Sect. 4.3.2.1, are 0.8099 and 0.00619, respectively. The bootstrap estimates are quite close to the correct values.
cyclt = mistat.load_data('CYCLT')
X50 = np.mean(cyclt)
SD50 = np.std(cyclt)
result = stats.ttest_1samp(cyclt, 0.55)
print(f'Xmean_50 = {X50:.3f}')
print(result)
Xmean_50 = 0.652
TtestResult(statistic=1.9425149510299369, pvalue=0.057833259176805,
df=49)
p*-value: 0.024
almpin = mistat.load_data('ALMPIN')
diam1 = almpin['diam1']
diam2 = almpin['diam2']
Fig. 3.9 Box plots of diam1 and diam2 measurements of the ALMPIN dataset
almpin.boxplot(column=['diam1', 'diam2'])
plt.show()
Solution 3.52 The variances were already compared in the previous exercise. To compare the means use:

almpin = mistat.load_data('ALMPIN')
diam1 = almpin['diam1']
diam2 = almpin['diam2']
# Compare means
mean_diam1 = np.mean(diam1)
mean_diam2 = np.mean(diam2)
print(f'Mean diam1: {mean_diam1:.5f}')
print(f'Mean diam2: {mean_diam2:.5f}')
# two-sample t-test (assumed; the original test call is not shown in this extract)
_, p = stats.ttest_ind(diam1, diam2)
print(f'p-value: {p:.3f}')
The bootstrap based p-value for the comparison of the means is:
random.seed(1)
# return studentized distance between random samples from diam1 and diam2
def stat_func():
d1 = random.choices(diam1, k=len(diam1))
d2 = random.choices(diam2, k=len(diam2))
return stats.ttest_ind(d1, d2).statistic
p*-value: 0.014
The bootstrap based p-value for the comparison of the variances is:
# Step 2: compute Wi
Wi = Bt / S2
# Step 3: compute F*
FBoot = Wi.max(axis=1) / Wi.min(axis=1)
FBoot95 = np.quantile(FBoot, 0.95)
print('FBoot 95%', FBoot95)
pstar = sum(FBoot >= F0)/len(FBoot)
print(f'p*-value: {pstar}')
S2 diam1 0.000266
diam2 0.000320
dtype: float64
F0 1.2016104294478573
FBoot 95% 1.1855457165968324
p*-value: 0.04
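The construction of the bootstrap matrix Bt of re-sampled variances (the "Step 1" referred to above) is not shown in this extract. A minimal sketch for the ALMPIN case, assuming 500 bootstrap replicates, is given below; the MPG block that follows relies on an analogous Bt for its three origin columns.

# Step 1 (sketch): bootstrap the column variances
np.random.seed(1)
columns = ['diam1', 'diam2']
S2 = almpin[columns].var(ddof=1)
F0 = S2.max() / S2.min()
Bt = pd.DataFrame({
    c: [almpin[c].sample(frac=1, replace=True).var(ddof=1) for _ in range(500)]
    for c in columns
})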
3 Statistical Inference and Bootstrapping 61
mpg = mistat.load_data('MPG')
columns = ['origin1', 'origin2', 'origin3']
# variance for each column
S2 = mpg[columns].var(axis=0, ddof=1)
F0 = max(S2) / min(S2)
print('S2', S2)
print('F0', F0)
# Step 2: compute Wi
Wi = Bt / S2
# Step 3: compute F*
FBoot = Wi.max(axis=1) / Wi.min(axis=1)
FBoot95 = np.quantile(FBoot, 0.95)
print('FBoot 95%', FBoot95)
pstar = sum(FBoot >= F0)/len(FBoot)
print(f'p*-value: {pstar}')
S2 origin1 12.942529
origin2 6.884615
origin3 18.321321
dtype: float64
F0 2.6611975103595213
FBoot 95% 2.6925366761838987
p*-value: 0.058
Fig. 3.10 Distribution of EBD for Exercise 3.54 (histogram of F* values)
mpg = mistat.load_data('MPG.csv')
samples = [mpg[key].dropna() for key in ['origin1', 'origin2', 'origin3']]
def test_statistic_F(samples):
return stats.f_oneway(*samples).statistic
F0 = test_statistic_F(samples)
# deviations of the group means from the grand mean, used to center the bootstrap
# samples under the null hypothesis (assumed definition; not shown in this extract)
grand_mean = np.mean(np.concatenate([np.asarray(s) for s in samples]))
DB = [np.mean(s) - grand_mean for s in samples]
Ns = 1000
Fstar = []
for _ in range(Ns):
Ysamples = []
for sample, DBi in zip(samples, DB):
Xstar = np.array(random.choices(sample, k=len(sample)))
Ysamples.append(Xstar - DBi)
Fs = test_statistic_F(Ysamples)
Fstar.append(Fs)
Fstar = np.array(Fstar)
print(f'F = {F0:.3f}')
print('ratio', sum(Fstar > F0)/len(Fstar))
ax = pd.Series(Fstar).hist(bins=14, color='grey')
ax.axvline(F0, color='black', lw=2)
ax.set_xlabel('F* values')
plt.show()
F = 6.076
ratio 0.003
3 Statistical Inference and Bootstrapping 63
F = 6.076, P* = 0.003 and the hypothesis of equal means is rejected. See Fig. 3.10 for the calculated EBD.
np.random.seed(1)
The tolerance intervals of the number of defective items in future batches of size
𝑁 = 50, with 𝛼 = 0.05 and 𝛽 = 0.05 are
Limits
𝑝 Lower Upper
0.2 1 23
0.1 0 17
0.05 0 9
np.random.seed(1)
oturb = mistat.load_data('OTURB.csv')
# helper used below (assumed definition; not shown in this extract)
def getQuantile(x, p):
    return np.quantile(x, p)
B_025 = pg.compute_bootci(oturb, func=lambda x: getQuantile(x, p=0.025),
n_boot=500, seed=1, return_dist=True)
B_975 = pg.compute_bootci(oturb, func=lambda x: getQuantile(x, p=0.975),
n_boot=500, seed=1, return_dist=True)
tol_int = [np.quantile(B_025[1], 0.025),np.quantile(B_975[1], 0.975)]
print(f'Tolerance interval ({tol_int[0]}, {tol_int[1]})')
cyclt = mistat.load_data('CYCLT.csv')
# make use of the fact that a True value is interpreted as 1 and False as 0
print('Values greater 0.7:', sum(cyclt>0.7))
64 3 Statistical Inference and Bootstrapping
We find that in the sample of $n = 50$ cycle times, there are $X = 20$ values greater than 0.7. If the hypothesis is $H_0: \xi_{.5} \le 0.7$, the probability of observing a value smaller than 0.7 is $p \ge \frac{1}{2}$. Thus, the sign test rejects $H_0$ if $X < B^{-1}(\alpha; 50, \frac{1}{2})$. For $\alpha = 0.10$ the critical value is $k_\alpha = 20$. $H_0$ is not rejected.
Solution 3.58 We apply the wilcoxon test from scipy on the differences of oelect
from 220.
oelect = mistat.load_data('OELECT.csv')
print(stats.wilcoxon(oelect-220))
WilcoxonResult(statistic=1916.0, pvalue=0.051047599707252124)
car = mistat.load_data('CAR.csv')
fourCylinder = car[car['cyl'] == 4]
uscars = fourCylinder[fourCylinder['origin'] == 1]
foreign = fourCylinder[fourCylinder['origin'] != 1]
The original stat 3.08 is outside of the distribution of the bootstrap samples. The
difference between the means of the turn diameters is therefore significant. Foreign
cars have on the average a smaller turn diameter.
Chapter 4
Variability in Several Dimensions and
Regression Models
import random
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats as sms
import seaborn as sns
import matplotlib.pyplot as plt
import mistat
Solution 4.1 In Fig. 4.1, one sees that horsepower and miles per gallon are inversely proportional. Turn diameter seems to increase with horsepower.
car = mistat.load_data('CAR')
sns.pairplot(car[['turn', 'hp', 'mpg']])
plt.show()
[In this build the pairplot call raises an OptionError ("No such keys(s): 'mode.use_inf_as_null'") caused by a seaborn/pandas version incompatibility; Fig. 4.1 shows the intended scatterplot matrix.]
Solution 4.2 The box plots in Fig. 4.2 show that cars from Asia generally have the smallest turn diameter. The maximal turn diameter of cars from Asia is smaller than the median turn diameter of U.S. cars. European cars tend to have a larger turn diameter than those from Asia, but smaller than those from the U.S.
car = mistat.load_data('CAR')
ax = car.boxplot(column='turn', by='origin')
ax.set_title('')
ax.get_figure().suptitle('')
ax.set_xlabel('origin')
plt.show()
Fig. 4.1 Scatterplot matrix of turn, hp, and mpg for the CAR dataset

Fig. 4.2 Box plots of turn diameter by origin for the CAR dataset

Fig. 4.3 Multiple box plots of Res 3 grouped by hybrid number
Solution 4.3 (i) The multiple box plots (see Fig. 4.3) show that the conditional distributions of res3 at different hybrids are different.
hadpas = mistat.load_data('HADPAS')
ax = hadpas.boxplot(column='res3', by='hyb')
ax.set_title('')
ax.get_figure().suptitle('')
ax.set_xlabel('Hybrid number')
ax.set_ylabel('Res 3')
plt.show()
(ii) The matrix plot of all the Res variables (see Fig. 4.4) reveals that Res 3 and Res 7 are positively correlated. Res 20 is generally larger than the corresponding Res 14. Res 18 and Res 20 seem to be negatively associated.
sns.pairplot(hadpas[['res3', 'res7', 'res18', 'res14', 'res20']])
plt.show()

[In this build the pairplot call again raises an OptionError caused by a seaborn/pandas version incompatibility; Fig. 4.4 shows the intended scatterplot matrix.]
Solution 4.4 The joint frequency distribution of horsepower versus miles per gallon
is
car = mistat.load_data('CAR')
binned_car = pd.DataFrame({
'hp': pd.cut(car['hp'], bins=np.arange(50, 275, 25)),
'mpg': pd.cut(car['mpg'], bins=np.arange(10, 40, 5)),
})
freqDist = pd.crosstab(binned_car['hp'], binned_car['mpg'])
print(freqDist)
# You can get distributions for hp and mpg by summing along an axis
print(freqDist.sum(axis=0))
print(freqDist.sum(axis=1))
Fig. 4.4 Scatterplot matrix of the Res variables for the HADPAS dataset
mpg (10, 15] (15, 20] (20, 25] (25, 30] (30, 35]
hp
(50, 75] 0 1 0 4 2
(75, 100] 0 0 23 11 0
(100, 125] 0 10 11 1 0
(125, 150] 0 14 3 1 0
(150, 175] 0 17 0 0 0
(175, 200] 1 5 0 0 0
(200, 225] 1 3 0 0 0
(225, 250] 0 1 0 0 0
mpg
(10, 15] 2
(15, 20] 51
(20, 25] 37
(25, 30] 17
(30, 35] 2
dtype: int64
hp
(50, 75] 7
(75, 100] 34
(100, 125] 22
(125, 150] 18
(150, 175] 17
(175, 200] 6
(200, 225] 4
(225, 250] 1
dtype: int64
The intervals for HP are from 50 to 250 at fixed length of 25. The intervals for
MPG are from 10 to 35 at length 5. Students may get different results by defining
the intervals differently.
Solution 4.5 The joint frequency distribution of Res 3 and Res 14 is given in the
following table:
hadpas = mistat.load_data('HADPAS')
binned_hadpas = pd.DataFrame({
'res3': pd.cut(hadpas['res3'], bins=np.arange(1580, 2780, 200)),
'res14': pd.cut(hadpas['res14'], bins=np.arange(900, 3000, 300)),
})
pd.crosstab(binned_hadpas['res14'], binned_hadpas['res3'])
The intervals for Res 3 start at 1580 and end at 2580 with length of 200. The
intervals of Res 14 start at 900 and end at 2700 with length of 300.
Solution 4.6 The following is the conditional frequency distribution of Res 3, given
that Res 14 is between 1300 and 1500 ohms:
hadpas = mistat.load_data('HADPAS')
in_range = hadpas[hadpas['res14'].between(1300, 1500)]
pd.cut(in_range['res3'], bins=np.arange(1580, 2780, 200)).value_counts(sort=False)
res3
(1580, 1780] 8
(1780, 1980] 21
(1980, 2180] 25
(2180, 2380] 2
(2380, 2580] 0
Name: count, dtype: int64
Solution 4.7 Following the instructions in the question we obtained the following
results:
hadpas = mistat.load_data('HADPAS')
bins = [900, 1200, 1500, 1800, 2100, 3000]
binned_res14 = pd.cut(hadpas['res14'], bins=bins)
results = []
for group, df in hadpas.groupby(binned_res14):
res3 = df['res3']
results.append({
'res3': group,
'N': len(res3),
'mean': res3.mean(),
'std': res3.std(),
})
pd.DataFrame(results)
df = pd.DataFrame([
[10.0, 8.04, 10.0, 9.14, 10.0, 7.46, 8.0, 6.58],
[8.0, 6.95, 8.0, 8.14, 8.0, 6.77, 8.0, 5.76],
[13.0, 7.58, 13.0, 8.74, 13.0, 12.74, 8.0, 7.71],
[9.0, 8.81, 9.0, 8.77, 9.0, 7.11, 8.0, 8.84],
[11.0, 8.33, 11.0, 9.26, 11.0, 7.81, 8.0, 8.47],
[14.0, 9.96, 14.0, 8.10, 14.0, 8.84, 8.0, 7.04],
[6.0, 7.24, 6.0, 6.13, 6.0, 6.08, 8.0, 5.25],
[4.0, 4.26, 4.0, 3.10, 4.0, 5.39, 19.0, 12.50],
[12.0, 10.84, 12.0, 9.13, 12.0, 8.15, 8.0, 5.56],
[7.0, 4.82, 7.0, 7.26, 7.0, 6.42, 8.0, 7.91],
[5.0, 5.68, 5.0, 4.74, 5.0, 5.73, 8.0, 6.89],
], columns=['x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'x4', 'y4'])
results = []
for i in (1, 2, 3, 4):
x = df[f'x{i}']
y = df[f'y{i}']
model = smf.ols(formula=f'y{i} ~ 1 + x{i}', data=df).fit()
results.append({
'Data Set': i,
'Intercept': model.params['Intercept'],
'Slope': model.params[f'x{i}'],
'R2': model.rsquared,
})
pd.DataFrame(results)
Notice the influence of the point (19,12.5) on the regression in Data Set 4. Without
this point the correlation between 𝑥 and 𝑦 is zero.
Fig. 4.5 Anscombe’s quartet { fig:AnscombeQuartet}
{ fig:AnscombeQuartet} The dataset is known as Anscombe’s quartet (see Fig. 4.5). The four data sets not only have identical fitted linear regressions, they also have identical means and variances of 𝑥 and 𝑦 and identical correlations between 𝑥 and 𝑦. The dataset clearly demonstrates the importance of visualization in data analysis.
car = mistat.load_data('CAR')
car[['turn', 'hp', 'mpg']].corr()
turn hp mpg
turn 1.000000 0.507610 -0.541061
hp 0.507610 1.000000 -0.754716
mpg -0.541061 -0.754716 1.000000
Solution 4.10 $SSE = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2})^2$.
(i)
$$\frac{\partial}{\partial \beta_0} SSE = -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2})$$
$$\frac{\partial}{\partial \beta_1} SSE = -2 \sum_{i=1}^{n} X_{i1} (Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2})$$
$$\frac{\partial}{\partial \beta_2} SSE = -2 \sum_{i=1}^{n} X_{i2} (Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2}).$$
Equating these partial derivatives to zero and arranging terms, we arrive at the
following set of linear equations:
$$\begin{pmatrix} n & \sum_{i=1}^{n} X_{i1} & \sum_{i=1}^{n} X_{i2} \\ \sum_{i=1}^{n} X_{i1} & \sum_{i=1}^{n} X_{i1}^2 & \sum_{i=1}^{n} X_{i1} X_{i2} \\ \sum_{i=1}^{n} X_{i2} & \sum_{i=1}^{n} X_{i1} X_{i2} & \sum_{i=1}^{n} X_{i2}^2 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_{i1} Y_i \\ \sum_{i=1}^{n} X_{i2} Y_i \end{pmatrix}$$
(ii) Let $b_0, b_1, b_2$ be the (unique) solution. From the first equation we get, after dividing by $n$, $b_0 = \bar{Y} - \bar{X}_1 b_1 - \bar{X}_2 b_2$, where $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$, $\bar{X}_1 = \frac{1}{n}\sum_{i=1}^{n} X_{i1}$, and $\bar{X}_2 = \frac{1}{n}\sum_{i=1}^{n} X_{i2}$. Substituting $b_0$ in the second and third equations and arranging terms, we obtain the reduced system of equations:
$$\begin{pmatrix} Q_1 - n\bar{X}_1^2 & P_{12} - n\bar{X}_1\bar{X}_2 \\ P_{12} - n\bar{X}_1\bar{X}_2 & Q_2 - n\bar{X}_2^2 \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} P_{1y} - n\bar{X}_1\bar{Y} \\ P_{2y} - n\bar{X}_2\bar{Y} \end{pmatrix}$$
where $Q_1 = \sum X_{i1}^2$, $Q_2 = \sum X_{i2}^2$, $P_{12} = \sum X_{i1} X_{i2}$, $P_{1y} = \sum X_{i1} Y_i$, and $P_{2y} = \sum X_{i2} Y_i$. Dividing both sides by $(n-1)$ we obtain Eq. (4.4.3), and solving we get $b_1$ and $b_2$.
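As a numerical illustration, the normal equations can also be solved directly with NumPy; a minimal sketch, using the ALMPIN variables of a later solution purely as example data:
almpin = mistat.load_data('ALMPIN')
# design matrix with an intercept column and two predictors
X = np.column_stack([np.ones(len(almpin)), almpin['diam2'], almpin['diam3']])
y = almpin['capDiam']
# solve (X'X) b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # least squares coefficients b0, b1, b2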
car = mistat.load_data('CAR')
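The code that produced the following output is not reproduced in the text; a minimal sketch of one way to obtain it, assuming statsmodels.formula.api is imported as smf as elsewhere in this chapter (the variable names below are illustrative):
model_mpg = smf.ols('mpg ~ hp + 1', data=car).fit()
model_turn = smf.ols('turn ~ hp + 1', data=car).fit()
print('Model mpg ~ hp + 1:')
print(model_mpg.params)
print('Model turn ~ hp + 1:')
print(model_turn.params)
residuals = pd.DataFrame({'e1': model_mpg.resid, 'e2': model_turn.resid})
# partial correlation of mpg and turn given hp = correlation of the residuals
print('Partial correlation', residuals['e1'].corr(residuals['e2']))
# regression of e1 on e2 without intercept
model_resid = smf.ols('e1 ~ e2 - 1', data=residuals).fit()
print('Model e1 ~ e2:')
print(model_resid.params)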
Model mpg ~ hp + 1:
Intercept 30.663308
hp -0.073611
dtype: float64
Model turn ~ hp + 1:
Intercept 30.281255
hp 0.041971
dtype: float64
Partial correlation -0.27945246615045016
Model e1 ~ e2:
e2 -0.251008
dtype: float64
Solution 4.14 The regression of MPG on HP is MPG = 30.6633 − 0.07361 HP. The regression of TurnD on HP is TurnD = 30.2813 + 0.041971 HP. The regression of the residuals 𝑒ˆ1 on 𝑒ˆ2 is 𝑒ˆ1 = −0.251 · 𝑒ˆ2. Thus, the partial correlation between MPG and TurnD, given HP, is −0.279.
almpin = mistat.load_data('ALMPIN')
model = smf.ols('capDiam ~ 1 + diam2 + diam3', data=almpin).fit()
model.summary2()
<class 'statsmodels.iolib.summary2.Summary'>
"""
Results: Ordinary least squares
===================================================================
Model: OLS Adj. R-squared: 0.842
Dependent Variable: capDiam AIC: -482.1542
Date: 2023-05-03 21:16 BIC: -475.4087
No. Observations: 70 Log-Likelihood: 244.08
Df Model: 2 F-statistic: 184.2
Df Residuals: 67 Prob (F-statistic): 5.89e-28
R-squared: 0.846 Scale: 5.7272e-05
---------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
---------------------------------------------------------------------
Intercept 4.7565 0.5501 8.6467 0.0000 3.6585 5.8544
diam2 0.5040 0.1607 3.1359 0.0025 0.1832 0.8248
diam3 0.5203 0.1744 2.9830 0.0040 0.1722 0.8684
-------------------------------------------------------------------
Omnibus: 1.078 Durbin-Watson: 2.350
Prob(Omnibus): 0.583 Jarque-Bera (JB): 0.976
Skew: -0.071 Prob(JB): 0.614
Kurtosis: 2.439 Condition No.: 8689
===================================================================
* The condition number is large (9e+03). This might indicate
strong multicollinearity or other numerical problems.
"""
Solution 4.16 The regression of yield (𝑌𝑖𝑒𝑙𝑑) on the four variables is:
gasol = mistat.load_data('GASOL')
# rename column 'yield' to 'Yield' as 'yield' is a special keyword in Python
gasol = gasol.rename(columns={'yield': 'Yield'})
model = smf.ols(formula='Yield ~ x1 + x2 + astm + endPt',
data=gasol).fit()
print(model.summary2())
(ii) $R^2 = 0.962$.
(iii) The regression coefficient of $x_2$ is not significant.
(iv) Running the multiple regression again without $x_2$, we obtain the reduced regression equation.
{ fig:qqPlotGasolRegResid} Fig. 4.6 $Q$–$Q$ plot of the gasol regression residuals
(ii)
$$Q = I - H, \qquad Q^2 = (I - H)(I - H) = I - H - H + H^2 = I - H = Q.$$
$$s_e^2 = \mathbf{y}' Q Q \mathbf{y} / (n - k - 1) = \mathbf{y}' Q \mathbf{y} / (n - k - 1).$$
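A quick numerical check of the idempotence of $Q$; a sketch using the ALMPIN design matrix of the regression above:
X = np.column_stack([np.ones(len(almpin)), almpin['diam2'], almpin['diam3']])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
Q = np.eye(len(almpin)) - H
print(np.allclose(Q @ Q, Q))           # True: Q is idempotent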
Solution 4.19 $1 - R^2_{y(x)} = \dfrac{SSE}{SSD_y}$, where $SSE = \hat{\mathbf{e}}'\hat{\mathbf{e}} = \|\hat{\mathbf{e}}\|^2$.
$$\mathrm{cov}\Big(\sum_{i=1}^{n} \beta_i X_i,\; \sum_{j=1}^{n} \gamma_j X_j\Big) = \sum_{i=1}^{n} \sum_{j=1}^{n} \beta_i \gamma_j \,\mathrm{cov}(X_i, X_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} \beta_i \gamma_j \Sigma_{ij} = \boldsymbol{\beta}' \boldsymbol{\Sigma} \boldsymbol{\gamma}.$$
Solution 4.22 From the model, $\Sigma(\mathbf{Y}) = \sigma^2 I$ and $\mathbf{b} = B\mathbf{Y}$ with $B = (X'X)^{-1}X'$. Hence
$$\Sigma(\mathbf{b}) = B\,\Sigma(\mathbf{Y})\,B' = \sigma^2 B B' = \sigma^2 (X'X)^{-1} X'X (X'X)^{-1} = \sigma^2 (X'X)^{-1}.$$
Solution 4.23 Rearrange the dataset into the format suitable for the test outlined in the multiple linear regression section.
socell = mistat.load_data('SOCELL')
# combine the two datasets and add the additional columns z and w
socell_1 = socell[['t3', 't1']].copy()
socell_1.columns = ['t3', 't']
socell_1['z'] = 0
socell_2 = socell[['t3', 't2']].copy()
socell_2.columns = ['t3', 't']
socell_2['z'] = 1
combined = pd.concat([socell_1, socell_2])
combined['w'] = combined['z'] * combined['t']
Neither the 𝑧 nor the 𝑤 coefficient is significant: the 𝑃-values corresponding to 𝑧 and 𝑤 are 0.128 and 0.426, respectively. Accordingly, we can conclude that the slopes and intercepts of the two simple linear regressions given above are not significantly different, and a single regression line can be fitted to the combined dataset.
df = mistat.load_data('CEMENT.csv')
(a) Regression of 𝑌 on 𝑋1 is
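The model fit producing the output below is not listed; a minimal sketch, assuming the usual statsmodels imports used elsewhere in this chapter:
model1 = smf.ols('y ~ x1', data=df).fit()
print(model1.summary())   # coefficient table shown below
r2 = model1.rsquared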
print(f'R-sq: {r2:.3f}')
anova = sms.anova.anova_lm(model1)
print('Analysis of Variance\n', anova)
F = anova.F['x1']
SSE_1 = anova.sum_sq['Residual']
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 81.4793 4.927 16.536 0.000 70.634 92.324
x1 1.8687 0.526 3.550 0.005 0.710 3.027
==============================================================================
R-sq: 0.534
Analysis of Variance
df sum_sq mean_sq F PR(>F)
x1 1.0 1450.076328 1450.076328 12.602518 0.004552
Residual 11.0 1265.686749 115.062432 NaN NaN
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 52.5773 2.286 22.998 0.000 47.483 57.671
x1 1.4683 0.121 12.105 0.000 1.198 1.739
x2 0.6623 0.046 14.442 0.000 0.560 0.764
==============================================================================
R-sq: 0.979
Analysis of Variance
df sum_sq mean_sq F PR(>F)
x1 1.0 1450.076328 1450.076328 250.425571 2.088092e-08
x2 1.0 1207.782266 1207.782266 208.581823 5.028960e-08
Residual 10.0 57.904483 5.790448 NaN NaN
Comparing models
df_resid ssr df_diff ss_diff F Pr(>F)
0 11.0 1265.686749 0.0 NaN NaN NaN
1 10.0 57.904483 1.0 1207.782266 208.581823 5.028960e-08
• 𝑠2𝑒2 = 5.79,
• 𝐹 = 12.60
• Partial-𝐹 = 208.582
Notice that SEQ SS for 𝑋2 = 2716.9(0.974 − 0.529) = 1207.782.
(c) The regression of 𝑌 on 𝑋1 , 𝑋2 , and 𝑋3 is
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 48.1936 3.913 12.315 0.000 39.341 57.046
x1 1.6959 0.205 8.290 0.000 1.233 2.159
x2 0.6569 0.044 14.851 0.000 0.557 0.757
x3 0.2500 0.185 1.354 0.209 -0.168 0.668
==============================================================================
R-sq: 0.982
Analysis of Variance
df sum_sq mean_sq F PR(>F)
x1 1.0 1450.076328 1450.076328 271.264194 4.995767e-08
x2 1.0 1207.782266 1207.782266 225.938509 1.107893e-07
x3 1.0 9.793869 9.793869 1.832128 2.088895e-01
Residual 9.0 48.110614 5.345624 NaN NaN
Comparing models
df_resid ssr df_diff ss_diff F Pr(>F)
0 10.0 57.904483 0.0 NaN NaN NaN
1 9.0 48.110614 1.0 9.793869 1.832128 0.208889
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 62.4054 70.071 0.891 0.399 -99.179 223.989
x1 1.5511 0.745 2.083 0.071 -0.166 3.269
x2 0.5102 0.724 0.705 0.501 -1.159 2.179
x3 0.1019 0.755 0.135 0.896 -1.638 1.842
x4 -0.1441 0.709 -0.203 0.844 -1.779 1.491
==============================================================================
R-sq: 0.982
Analysis of Variance
df sum_sq mean_sq F PR(>F)
x1 1.0 1450.076328 1450.076328 242.367918 2.887559e-07
x2 1.0 1207.782266 1207.782266 201.870528 5.863323e-07
x3 1.0 9.793869 9.793869 1.636962 2.366003e-01
x4 1.0 0.246975 0.246975 0.041280 8.440715e-01
Residual 8.0 47.863639 5.982955 NaN NaN
Comparing models
df_resid ssr df_diff ss_diff F Pr(>F)
0 9.0 48.110614 0.0 NaN NaN NaN
1 8.0 47.863639 1.0 0.246975 0.04128 0.844071
outcome = 'y'
all_vars = ['x1', 'x2', 'x3', 'x4']
Final model
y ~ 1 + x2 + x4 + x1
Intercept 71.648307
x2 0.416110
x4 -0.236540
x1 1.451938
dtype: float64
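The stepwise search itself is not listed; as a sketch, the selected model can be refitted directly with statsmodels (assuming the CEMENT data frame df from above):
model = smf.ols('y ~ 1 + x2 + x4 + x1', data=df).fit()
print(model.params)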
car = mistat.load_data('CAR')
car_3 = car[car['origin'] == 3]
print('Full dataset shape', car.shape)
print('Origin 3 dataset shape', car_3.shape)
model = smf.ols(formula='mpg ~ hp + 1', data=car_3).fit()
print(model.summary2())
influence = model.get_influence()
df = pd.DataFrame({
'hp': car_3['hp'],
'mpg': car_3['mpg'],
'resi': model.resid,
'sres': influence.resid_studentized_internal,
'hi': influence.hat_matrix_diag,
'D': influence.cooks_distance[0],
})
print(df.round(4))
Notice that points 51 and 94 have residuals with large magnitudes (−10.4 and 7.4). The same points also have the largest Cook’s distances (0.58 and 0.21). Points 100 and 102 have high HI values (leverage; 0.18 and 0.22).
np.random.seed(1)
settings = {'s': 0.005, 'v0': 0.002, 'k': 1000, 'p0': 90_000,
't': 290, 't0': 340}
results = []
n_simulation = 5
for m in [30, 40, 50, 60]:
simulator = mistat.PistonSimulator(m=m, n_simulation=n_simulation,
**settings)
sim_result = simulator.simulate()
results.extend([m, s] for s in sim_result['seconds'])
results = pd.DataFrame(results, columns=['m', 'seconds'])
group_std = results.groupby('m').std()
pooled_std = np.sqrt(np.sum(group_std**2) / len(group_std))[0]
print('Pooled standard deviation', pooled_std)
group_mean = results.groupby('m').mean()
ax = results.plot.scatter(x='m', y='seconds', color='black')
ax.errorbar(group_mean.index, results.groupby('m').mean().values.flatten(),
yerr=[pooled_std] * 4, color='grey')
plt.show()
We see that the differences between the sample means are not significant in spite
of the apparent upward trend in cycle times.
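This conclusion can be backed up with a formal one-way ANOVA; a minimal sketch using pingouin, which is also used later in this chapter:
import pingouin as pg
aov = pg.anova(dv='seconds', between='m', data=results)
print(aov.round(3))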
Solution 4.28 Prepare dataset and visualize distributions (see Fig. 4.8). { fig:boxplotIntegratedCircuits}
df = pd.DataFrame([
[2.58, 2.62, 2.22],
[2.48, 2.77, 1.73],
Fig. 4.7 ANOVA of the effect of changing weight in the piston simulation { fig:anovaWeightPiston}
{ fig:boxplotIntegratedCircuits} Fig. 4.8 Box plot of pre-etch line width from the integrated circuits fabrication process
experiment = df['Experiment']
mu = df['mu']
def onewayTest(x, verbose=False):
df = pd.DataFrame({
'value': x,
'variable': experiment,
})
aov = pg.anova(dv='value', between='variable', data=df)
return aov['F'].values[0]
Bt0 = onewayTest(mu)
print('Bt0', Bt0)
print('ratio', sum(B[1] >= Bt0)/len(B[1]))
Bt0 120.91709844559576
ratio 0.0
The bootstrap test also shows that the difference in means is significant.
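The loop generating the reference distribution B is not listed above; a sketch, assuming B[1] holds the F statistics computed on randomly permuted responses:
random.seed(1)
B_f = []
mu_values = list(mu)
for _ in range(1000):
    mu_star = random.sample(mu_values, len(mu_values))  # random permutation
    B_f.append(onewayTest(mu_star))
print('ratio', sum(b >= Bt0 for b in B_f) / len(B_f))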
Solution 4.29 Create dataset and visualize distribution (see Fig. 4.9). { fig:filmSpeedData}
df = pd.DataFrame({
'Batch A': [103, 107, 104, 102, 95, 91, 107, 99, 105, 105],
'Batch B': [104, 103, 106, 103, 107, 108, 104, 105, 105, 97],
})
fig, ax = plt.subplots()
bplot1 = ax.boxplot(df, labels=df.columns)
plt.show()
Fig. 4.9 Box plots of film speed for Batch A and Batch B { fig:filmSpeedData}
The randomization test gave a P value of 0.106. The difference between the means
is not significant.
(ii)
The ANOVA also shows no significant difference in the means. The 𝑃 value
is 0.228. Remember that the 𝐹-test in the ANOVA is based on the assumption of
normality and equal variances. The randomization test is nonparametric.
Solution 4.30 Define function that calculates the statistic and execute bootstrap.
def func_stats(x):
m = pd.Series(x).groupby(df['Experiment']).agg(['mean', 'count'])
top = np.sum(m['count'] * m['mean'] ** 2) - len(x)*np.mean(x)**2
return top / np.std(x) ** 2
Bt = []
mu = list(df['mu'])
for _ in range(1000):
mu_star = random.sample(mu, len(mu))
Bt.append(func_stats(mu_star))
Bt0 = func_stats(mu)
print('Bt0', Bt0)
print('ratio', sum(Bt >= Bt0)/len(Bt))
Bt0 26.986990459670288
ratio 0.0
Fig. 4.10 Box plot visualisation of the xDev distribution by crcBrd for the PLACE dataset { fig:boxplotXdevCrcBrdPlace}
The result demonstrates that the differences between the groups are significant.
Solution 4.31 Load the data and visualize the distributions (see Fig. 4.10). { fig:boxplotXdevCrcBrdPlace}
place = mistat.load_data('PLACE')
place.boxplot('xDev', by='crcBrd')
plt.show()
G1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
G2 = [10, 11, 12]
G3 = [13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26]
G4 = [20]
place['group'] = 'G1'
place.loc[place['crcBrd'].isin(G2), 'group'] = 'G2'
place.loc[place['crcBrd'].isin(G3), 'group'] = 'G3'
place.loc[place['crcBrd'].isin(G4), 'group'] = 'G4'
sem = statistics['sem'].values
sem = sem**2
sem = np.sqrt(sem[:-1] + sem[1:])
print(sem * 6.193)
print(757/644, 467/387, 486/459)
The assignment using a positional slice with .loc raises a TypeError in recent pandas versions ("Slicing a positional slice with .loc is not allowed, use .loc with labels or .iloc with positions instead").
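A sketch of the intended calculation, rewritten with .iloc so that it runs on recent pandas versions (assuming statistics is the per-group summary with a 'mean' column):
statistics['Diff'] = np.nan
diff = statistics['mean'].values[:-1] - statistics['mean'].values[1:]
statistics.iloc[1:, statistics.columns.get_loc('Diff')] = diff
statistics['CR'] = 6.193 * statistics['Diff']
print(statistics.round(3))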
The differences between the means of the groups are all significant.
Solution 4.32
df = pd.DataFrame({
'US': [33, 25],
'Europe': [7, 7],
'Asia': [26, 11],
})
print(df)
col_sums = df.sum(axis=0)
row_sums = df.sum(axis=1)
total = df.to_numpy().sum()
US Europe Asia
0 33 7 26
1 25 7 11
chi2: 2.440
p-value: 0.295
The chi-square test statistic is 𝑋 2 = 2.440 with d.f. = 2 and 𝑃 value = 0.295. The
null hypothesis that the number of cylinders a car has is independent of the origin of
the car is not rejected.
We can also use the scipy function chi2_contingency.
chi2 = stats.chi2_contingency(df)
print(f'chi2-statistic: {chi2[0]:.3f}')
print(f'p-value: {chi2[1]:.3f}')
print(f'd.f.: {chi2[2]}')
chi2-statistic: 2.440
p-value: 0.295
d.f.: 2
car = mistat.load_data('CAR')
binned_car = pd.DataFrame({
'turn': pd.cut(car['turn'], bins=[27, 30.6, 34.2, 37.8, 45]), #np.arange(27, 50, 3.6)),
'mpg': pd.cut(car['mpg'], bins=[12, 18, 24, 100]),
})
freqDist = pd.crosstab(binned_car['mpg'], binned_car['turn'])
print(freqDist)
chi2 = stats.chi2_contingency(freqDist)
print(f'chi2-statistic: {chi2[0]:.3f}')
print(f'p-value: {chi2[1]:.3f}')
print(f'd.f.: {chi2[2]}')
The dependence between turn diameter and miles per gallon is significant.
question_13 = pd.DataFrame({
'1': [0,0,0,1,0],
'2': [1,0,2,0,0],
'3': [1,2,6,5,1],
'4': [2,1,10,23,13],
'5': [0,1,1,15,100],
}, index = ['1', '2', '3', '4', '5']).transpose()
question_23 = pd.DataFrame({
'1': [1,0,0,3,1],
'2': [2,0,1,0,0],
'3': [0,4,2,3,0],
'4': [1,1,10,7,5],
'5': [0,0,1,30,134],
}, index = ['1', '2', '3', '4', '5']).transpose()
chi2_13 = stats.chi2_contingency(question_13)
chi2_23 = stats.chi2_contingency(question_23)
print('Question 1 vs 3')
print(f' Mean squared contingency : {msc_13:.3f}')
print(f' Tschuprov : {tschuprov_13:.3f}')
print(f" Cramer's index : {cramer_13:.3f}")
print('Question 2 vs 3')
print(f' Mean squared contingency : {msc_23:.3f}')
print(f' Tschuprov : {tschuprov_23:.3f}')
print(f" Cramer's index : {cramer_23:.3f}")
Question 1 vs 3
Mean squared contingency : 0.629
Tschuprov : 0.397
Cramer's index : 0.561
Question 2 vs 3
Mean squared contingency : 1.137
Tschuprov : 0.533
Cramer's index : 0.754
Chapter 5
Sampling for Estimation of Finite Population
Quantities
import random
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats
import matplotlib.pyplot as plt
import mistat
$$\Pr\{\sqrt{n}\,|\bar{X}_n - \mu_N| < \delta\} \approx 2\Phi\!\left(\frac{\delta}{\sigma_N}\right) - 1, \quad \text{as } n \to \infty.$$
Fig. 5.1 Distribution of the correlation between xDev and yDev for sampling with replacement (RSWR) and without replacement (RSWOR) { fig:exDistCorrPlace}
random.seed(1)
place = mistat.load_data('PLACE')
rswr = []
rswor = []
idx = list(range(len(place)))
for _ in range(100):
rswr.append(stat_func(random.choices(idx, k=20)))
rswor.append(stat_func(random.sample(idx, k=20)))
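The function stat_func is not shown in the listing; for Fig. 5.1 it would be defined before the sampling loop, for example as the sample correlation between xDev and yDev (an assumption consistent with the figure):
def stat_func(sample_idx):
    sample = place.iloc[sample_idx]
    return sample['xDev'].corr(sample['yDev'])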
Fig. 5.2 Distribution of the median of turn diameter, horsepower, and mpg of the CAR dataset using random sampling without replacement { fig:exDistMedianCAR}
{ fig:exDistMedianCAR} Solution 5.5 The Python code creates the histograms shown in Fig. 5.2.
random.seed(1)
car = mistat.load_data('CAR')
columns = ['turn', 'hp', 'mpg']
idx = list(range(len(car)))
result = []
for _ in range(200):
result.append(stat_func(random.sample(idx, k=50)))
result = pd.DataFrame(result)
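Here, too, stat_func is not shown; a definition consistent with Fig. 5.2 (medians of turn, hp, and mpg for the sampled rows) would be placed before the loop:
def stat_func(sample_idx):
    sample = car.iloc[sample_idx]
    return {c: sample[c].median() for c in columns}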
The total sample size is 𝑛 = 200 and the weights are 𝑊1 = 0.385, 𝑊2 = 0.115, 𝑊3 = 0.5. Thus,
𝑛1 = 77, 𝑛2 = 23, and 𝑛3 = 100.
sample_means = []
for _ in range(500):
m_1 = np.mean(random.sample(strata_1, k=n_1))
m_2 = np.mean(random.sample(strata_2, k=n_2))
m_3 = np.mean(random.sample(strata_3, k=n_3))
sample_means.append(w_1*m_1 + w_2*m_2 + w_3*m_3)
std_dev_sample_means = np.std(sample_means)
print(std_dev_sample_means)
print(stats.sem(place['xDev'], ddof=0))
3.442839155174113e-05
8.377967188860638e-05
$$\frac{W_i^2 \tilde{\sigma}_{N_i}^2}{n_i^2} = \lambda, \quad i = 1, \dots, k, \qquad \sum_{i=1}^{k} n_i = n.$$
Equivalently, $n_i = \frac{1}{\sqrt{\lambda}} W_i \tilde{\sigma}_{N_i}$ for $i = 1, \dots, k$, and $n = \frac{1}{\sqrt{\lambda}} \sum_{i=1}^{k} W_i \tilde{\sigma}_{N_i}$. Thus
$$n_i^0 = n\, \frac{W_i \tilde{\sigma}_{N_i}}{\sum_{j=1}^{k} W_j \tilde{\sigma}_{N_j}}, \quad i = 1, \dots, k.$$
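A tiny numerical illustration of the allocation formula, using the weights of the previous solution and hypothetical strata standard deviations:
W = np.array([0.385, 0.115, 0.5])
sigma = np.array([1.0, 2.0, 0.5])   # hypothetical strata standard deviations
n = 200
n_opt = n * W * sigma / np.sum(W * sigma)
print(n_opt.round(1))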
where $I_i = 1$ if the $i$-th population element is sampled, and $I_i = 0$ otherwise.
Notice that $I_1, \dots, I_N$ are independent of $e_1, \dots, e_N$. Hence, $E\{I_i e_i\} = 0$ for all $i = 1, \dots, N$. This proves that $\bar{Y}_n$ is prediction unbiased, irrespective of the sample strategy. The prediction MSE of $\bar{Y}_n$ is
$$PMSE\{\bar{Y}_n \mid \mathbf{s}\} = \frac{\sigma^2}{n}\Big(1 - \frac{n}{N}\Big)^2 + \frac{1}{N}\Big(1 - \frac{n}{N}\Big)\sigma^2 = \frac{\sigma^2}{n}\Big(1 - \frac{n}{N}\Big).$$
Notice that $PMSE\{\bar{Y}_n \mid \mathbf{s}\}$ is independent of $\mathbf{s}$, and is equal for all samples.
Let $I_i = 1$ if the $i$-th population element is in the sample $\mathbf{s}$, and $I_i = 0$ otherwise.
Recall that $y_i = \beta x_i + e_i$, $i = 1, \dots, N$, and that for any sampling strategy, $e_1, \dots, e_N$ are independent of $I_1, \dots, I_N$. Hence, since $\sum_{i=1}^{N} I_i = n$,
$$E\{\hat{Y}_{RA}\} = \bar{x}_N \cdot E\Big\{\frac{1}{n}\sum_{i=1}^{N} I_i \frac{\beta x_i + e_i}{x_i}\Big\} = \bar{x}_N \Big(\beta + \frac{1}{n}\sum_{i=1}^{N} E\Big\{I_i \frac{e_i}{x_i}\Big\}\Big) = \bar{x}_N \beta,$$
because $E\big\{I_i \frac{e_i}{x_i}\big\} = E\big\{\frac{I_i}{x_i}\big\} E\{e_i\} = 0$, $i = 1, \dots, N$. Thus, $E\{\hat{Y}_{RA} - \mu_N\} = 0$ and $\hat{Y}_{RA}$ is an unbiased predictor.
$$PMSE\{\hat{Y}_{RA}\} = E\Big\{\Big(\frac{\bar{x}_N}{n}\sum_{i=1}^{N} I_i \frac{e_i}{x_i} - \frac{1}{N}\sum_{i=1}^{N} e_i\Big)^2\Big\} = V\Big\{\sum_{i \in \mathbf{s}_n}\Big(\frac{\bar{x}_N}{n x_i} - \frac{1}{N}\Big) e_i - \frac{1}{N}\sum_{i' \in \mathbf{r}_n} e_{i'}\Big\}$$
where $\mathbf{s}_n$ is the set of elements in the sample and $\mathbf{r}_n$ is the set of elements in $\mathcal{P}$ but not in $\mathbf{s}_n$, $\mathbf{r}_n = \mathcal{P} - \mathbf{s}_n$. Since $e_1, \dots, e_N$ are uncorrelated,
$$PMSE\{\hat{Y}_{RA} \mid \mathbf{s}_n\} = \sigma^2 \sum_{i \in \mathbf{s}_n}\Big(\frac{\bar{x}_N}{n x_i} - \frac{1}{N}\Big)^2 + \sigma^2 \frac{N - n}{N^2} = \frac{\sigma^2}{N^2}\Big[(N - n) + \sum_{i \in \mathbf{s}_n}\Big(\frac{N \bar{x}_N}{n x_i} - 1\Big)^2\Big].$$
A sample $\mathbf{s}_n$ which minimizes $\sum_{i \in \mathbf{s}_n}\big(\frac{N \bar{x}_N}{n x_i} - 1\big)^2$ is optimal.
import math
import mistat
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa import tsatools
import statsmodels.formula.api as smf
Solution 6.2 (i) Fig. 6.1 shows the change of demand over time. { fig:seascom-timeline}
(ii) We first fit the seasonal trend to the data in the SeasCom data set. We use the linear model $Y = X\beta + \varepsilon$, where the $Y$ vector contains the 102 observations of the data set and $X$ is a $102 \times 4$ matrix of four column vectors. We combine $X$ and $Y$ in a data frame. In addition to the SeasCom data, we add a column of 1's (const), a column with the numbers 1 to 102 (trend), and two columns describing the seasonal pattern: the column season_1 consists of cos(𝜋 × trend/6), and the column season_2 of sin(𝜋 × trend/6).
seascom = mistat.load_data('SEASCOM.csv')
df = tsatools.add_trend(seascom, trend='ct')
df['season_1'] = [np.cos(math.pi * tx/6) for tx in df['trend']]
df['season_2'] = [np.sin(math.pi * tx/6) for tx in df['trend']]
print(df.head())
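The model fit itself is not listed; a minimal sketch with statsmodels:
model = smf.ols('SeasCom ~ trend + season_1 + season_2', data=df).fit()
print(model.params)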
Fig. 6.1 Change of demand over time in the SeasCom dataset { fig:seascom-timeline}
seascom = mistat.load_data('SEASCOM.csv')
fig, ax = plt.subplots()
ax.scatter(seascom.index, seascom, facecolors='none', edgecolors='grey')
model.predict(df).plot(ax=ax, color='black')
ax.set_xlabel('Time')
ax.set_ylabel('Data')
plt.show()
{ fig:seascom-model-deviation} (iii) Calculate the residuals and plot them (see Fig. 6.2).
U = df['SeasCom'] - model.predict(df)
fig, ax = plt.subplots()
ax.scatter(U.index, U, facecolors='none', edgecolors='black')
Fig. 6.2 Deviations of the SeasCom data from the fitted trend and seasonality model { fig:seascom-model-deviation}
ax.set_xlabel('Time')
ax.set_ylabel('Deviation')
plt.show()
Corr(Ut,Ut-1) = -0.191
Corr(Ut,Ut-2) = 0.132
Indeed the correlations between adjacent data points are 𝑐𝑜𝑟𝑟 (𝑋𝑡 , 𝑋𝑡−1 ) =
−0.191, and 𝑐𝑜𝑟𝑟 (𝑋𝑡 , 𝑋𝑡−2 ) = 0.132.
(iv) The plot of the deviations, together with the low autocorrelations, suggests that the residuals behave approximately like random noise.
Solution 6.3 According to Equation 6.2.2, the auto-correlation in a stationary MA(q) {eqn:ma-covariance}
is
$$\rho(h) = \frac{K(h)}{K(0)} = \frac{\sum_{j=0}^{q-h} \beta_j \beta_{j+h}}{\sum_{j=0}^{q} \beta_j^2}, \qquad 0 \le h \le q.$$
Notice that $\rho(h) = \rho(-h)$, and $\rho(h) = 0$ for all $h > q$.
Solution 6.4 In Python
0 1 2 3 4 5
K(h) 3.308 1.672 0.542 0.541 1.028 0.550
rho(h) 1.000 0.506 0.164 0.163 0.311 0.166
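The code producing this table is not shown; a generic helper implementing the MA(q) covariance formula of Solution 6.3 could look like this (the actual 𝛽 coefficients come from the exercise and are not reproduced here):
def ma_covariances(beta, sigma2=1.0, max_lag=5):
    # K(h) = sigma2 * sum_{j=0}^{q-h} beta_j beta_{j+h}, and rho(h) = K(h)/K(0)
    beta = np.asarray(beta)
    q = len(beta) - 1
    K = [sigma2 * sum(beta[j] * beta[j + h] for j in range(q - h + 1)) if h <= q else 0.0
         for h in range(max_lag + 1)]
    rho = [k / K[0] for k in K]
    return K, rho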
Solution 6.5 We consider the MA(∞) given by coefficients $\beta_j = q^j$, with $0 < q < 1$. In this case,
(i) $E\{X_t\} = 0$, and
(ii) $V\{X_t\} = \sigma^2 \sum_{j=0}^{\infty} q^{2j} = \sigma^2 / (1 - q^2)$.
(i) The process is stationary.
(ii) 𝐸 {𝑋𝑡 } = 0
(iii) According to the Yule-Walker equations,
$$\begin{pmatrix} 1 & -0.75 \\ -0.75 & 1 \end{pmatrix} \begin{pmatrix} K(0) \\ K(1) \end{pmatrix} = \begin{pmatrix} \sigma^2 \\ 0 \end{pmatrix}$$
$$X_t = (A_2(z))^{-1} \varepsilon_t = \varepsilon_t + 0.5\varepsilon_{t-1} - 0.08\varepsilon_{t-2} - 0.205\varepsilon_{t-3} - 0.0761\varepsilon_{t-4} + 0.0296\varepsilon_{t-5} + \dots$$
Accordingly,
Solution 6.11 Let 𝑋 denote the DOW1941 data set. We create a new set, 𝑌 of second
order difference, i.e., 𝑌𝑡 = 𝑋𝑡 − 2𝑋𝑡−1 + 𝑋𝑡−2 .
dow1941 = mistat.load_data('DOW1941.csv')
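The second-order differences can be formed directly with pandas, e.g.:
Y = dow1941.diff().diff().dropna()   # Y_t = X_t - 2 X_{t-1} + X_{t-2}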
(i) In the following table we present the autocorrelations, acf, and the partial
autocorrelations, pacf, of 𝑌 . For a visualisation see Fig. 6.3. { fig:acf-pacf-dow-second-order}
Fig. 6.3 Autocorrelations, acf, and the partial autocorrelations, pacf, of the second order differences
of the DOW1941 dataset. { fig:acf-pacf-dow-second-order}
{ fig:seascom-one-day-ahead-model} Solution 6.12 In Fig. 6.4 we present the seasonal SeasCom data together with the one-day ahead predictions. The predictions track the data very closely.
predictedError = mistat.optimalLinearPredictor(seascom_model.resid,
10, nlags=9)
predictedTrend = seascom_model.predict(seascom_df)
correctedTrend = predictedTrend + predictedError
fig, ax = plt.subplots()
ax.scatter(seascom_df.index, seascom_df['SeasCom'],
facecolors='none', edgecolors='grey')
predictedTrend.plot(ax=ax, color='grey')
correctedTrend.plot(ax=ax, color='black')
ax.set_xlabel('Time')
ax.set_ylabel('SeasCom data')
plt.show()
{ fig:seascom-one-day-ahead-model} Fig. 6.4 One-day ahead prediction model of the SeasCom data set
Chapter 7
Modern analytic methods: Part I
import warnings
import mistat
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
Solution 7.2 The decision tree model for testResult results in the confusion ma-
trix shown in Fig. 7.1. { fig:ex-confusion-matrix-testResult}
sensors = mistat.load_data('SENSORS.csv')
predictors = [c for c in sensors.columns if c.startswith('sensor')]
outcome = 'testResult'
X = sensors[predictors]
{ fig:ex-confusion-matrix-testResult} Fig. 7.1 Decision tree model to predict testResult from sensor data (Exc. 7.2)
y = sensors[outcome]
plt.show()
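The tree fit and the confusion-matrix display are not listed above; a minimal sketch (the ccp_alpha value mirrors later exercises and is an assumption here):
model = DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
model.fit(X, y)
ConfusionMatrixDisplay.from_predictions(y, model.predict(X))
plt.show()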
The model’s classification performance is very good. The test result ‘Good’, which corresponds to status ‘Pass’, is correctly predicted. Most of the other individual test results also have low misclassification rates. The likely reason is that each test result is associated with a specific subset of the sensors.
Confusion matrix
[[92 0]
[ 0 82]]
/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py:1395:
UserWarning: `use_label_encoder` is deprecated in 1.7.0.
warnings.warn("`use_label_encoder` is deprecated in 1.7.0.")
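The xgboost fit itself is not listed; a sketch of one way it could have been run (the 2×2 confusion matrix above suggests status as the target here; the label encoding and parameters are assumptions):
le = LabelEncoder()
y_status = le.fit_transform(sensors['status'])
xgb = XGBClassifier(random_state=0)
xgb.fit(X, y_status)
print('Confusion matrix')
print(confusion_matrix(y_status, xgb.predict(X)))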
var_imp = pd.DataFrame({
'importance': xgb.feature_importances_,
}, index=predictors)
var_imp = var_imp.sort_values(by='importance', ascending=False)
var_imp['order'] = range(1, len(var_imp) + 1)
print(var_imp.head(10))
var_imp.loc[var_imp.index.isin(['sensor18', 'sensor07', 'sensor21'])]
importance order
sensor18 0.290473 1
sensor54 0.288680 2
sensor53 0.105831 3
sensor55 0.062423 4
sensor61 0.058112 5
sensor48 0.040433 6
sensor07 0.026944 7
sensor12 0.015288 8
sensor03 0.013340 9
sensor52 0.013160 10
importance order
sensor18 0.290473 1
sensor07 0.026944 7
sensor21 0.000000 50
The decision tree model uses sensors 18, 7, and 21. The xgboost model identifies
sensor 18 as the most important variable. Sensor 7 is ranked 7th, sensor 21 has no
importance.
y = sensors['status']
Confusion matrix
[[92 0]
[ 0 82]]
var_imp = pd.DataFrame({
'importance': model.feature_importances_,
}, index=predictors)
var_imp = var_imp.sort_values(by='importance', ascending=False)
importance order
sensor61 0.138663 1
sensor18 0.100477 2
sensor53 0.079890 3
sensor52 0.076854 4
sensor46 0.052957 5
sensor50 0.051970 6
sensor44 0.042771 7
sensor48 0.037087 8
sensor24 0.036825 9
sensor21 0.035014 10
importance order
sensor18 0.100477 2
sensor21 0.035014 10
sensor07 0.023162 17
The decision tree model uses sensors 18, 7, and 21. Sensor 18 has the second
largest importance value, sensor 21 ranks 10th, and sensor 7 is on rank 17.
/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py:1395:
UserWarning: `use_label_encoder` is deprecated in 1.7.0.
warnings.warn("`use_label_encoder` is deprecated in 1.7.0.")
Decision tree model
Accuracy 0.900
[[37 2]
[ 5 26]]
Gradient boosting model
Accuracy 0.957
[[36 3]
[ 0 31]]
Random forest model
Accuracy 0.986
[[38 1]
[ 0 31]]
The accuracy for predicting the validation set is very good for all three models,
with random forest giving the best model with an accuracy of 0.986. The xgboost
model has a slightly lower accuracy of 0.957 and the accuracy for the decision tree
model is 0.900.
If you change the random seeds or remove them for the various commands, you
will see that the accuracies vary and that the order can change.
dt_model = DecisionTreeClassifier(ccp_alpha=0.012)
random_valid_acc = []
random_train_acc = []
org_valid_acc = []
org_train_acc = []
for _ in range(100):
train_X, valid_X, train_y, valid_y = train_test_split(X, y,
test_size=0.4)
dt_model.fit(train_X, train_y)
org_train_acc.append(accuracy_score(train_y, dt_model.predict(train_X)))
org_valid_acc.append(accuracy_score(valid_y, dt_model.predict(valid_X)))
ax = pd.Series(random_valid_acc).plot.density(color='C0')
pd.Series(random_train_acc).plot.density(color='C0', linestyle='--',
ax=ax)
pd.Series(org_valid_acc).plot.density(color='C1', ax=ax)
pd.Series(org_train_acc).plot.hist(color='C1', linestyle='--',
ax=ax)
ax.set_ylim(0, 40)
ax.set_xlim(0, 1.01)
plt.show()
Solution 7.7 Load data
distTower = mistat.load_data('DISTILLATION-TOWER.csv')
predictors = ['Temp1', 'Temp2', 'Temp3', 'Temp4', 'Temp5', 'Temp6',
'Temp7', 'Temp8', 'Temp9', 'Temp10', 'Temp11', 'Temp12']
outcome = 'VapourPressure'
Xr = distTower[predictors]
yr = distTower[outcome]
(ii) Determine model performance for different tree complexities, along with the dependence of the tree depth on the ccp_alpha parameter; see Fig. 7.2. { fig:exc-cpp-pruning}
The smallest validation set error is obtained for ccp_alpha = 0.372. The depen-
{ fig:exc-cpp-pruning} dence of training and validation error on ccp_alpha is shown in Fig. 7.2.
{ fig:exc-dtreeviz-visualization} (iii) The final model is visualized using dtreeviz in Fig. 7.3.
Fig. 7.2 Decision tree regressor complexity as a function of ccp_alpha. The validation set error is shown in black, the training set error in grey (Exercise 7.7) { fig:exc-cpp-pruning}
Solution 7.8 Vary the number of bins and the binning strategy. The influence of the two model parameters on model performance is shown in Fig. 7.4. { fig:nb-binning-performance}
y = sensors['status']
results = []
Fig. 7.3 Decision tree visualization of the regression tree (Exercise 7.7) { fig:exc-dtreeviz-visualization}
The quantile binning strategy (for each feature, each bin has the same number
of data points) with splitting each column into two bins leads to the best performing
model. With this strategy, performance declines with increasing number of bins. The
uniform (for each feature, each bin has the same width) and the kmeans (for each
feature, a 𝑘-means clustering is used to bin the feature) strategies on the other hand,
show increasing performance with increasing number of bins.
The confusion matrix for the best performing models is:
kbinsDiscretizer = KBinsDiscretizer(encode='ordinal',
strategy='quantile', n_bins=2)
X_binned = kbinsDiscretizer.fit_transform(X)
nb_model = MultinomialNB()
nb_model.fit(X_binned, y)
print('Confusion matrix')
print(confusion_matrix(y, nb_model.predict(X_binned)))
Confusion matrix
[[87 5]
[ 1 81]]
Fig. 7.4 Influence of the number of bins and binning strategy on model performance for the sensors data with status as outcome { fig:nb-binning-performance}
The decision tree model misclassified three of the ‘Pass’ data points as ‘Fail’. The Naïve Bayes model, on the other hand, misclassifies six data points. However, five of these are ‘Pass’ and predicted as ‘Fail’. Depending on your use case, you may prefer a model with more false negatives or false positives.
food = mistat.load_data('FOOD.csv')
scaler = StandardScaler()
model = AgglomerativeClustering(n_clusters=10, compute_distances=True)
X = scaler.fit_transform(food)
model = model.fit(X)
fig, ax = plt.subplots()
plot_dendrogram(model, ax=ax)
ax.set_title('Dendrogram')
ax.get_xaxis().set_ticks([])
plt.show()
food = mistat.load_data('FOOD.csv')
scaler = StandardScaler()
X = scaler.fit_transform(food)
Fig. 7.5 Hierarchical clustering of the food data set using Ward clustering { fig:food-ward-10-clusters}
{ fig:food-compare-linkage} The comparison of the different linkage methods is shown in Fig. 7.6. We can
see that Ward’s clustering gives the most balanced clusters; three bigger clusters and
seven small clusters. Complete, average, and single linkage lead to one big cluster.
sensors = mistat.load_data('SENSORS.csv')
predictors = [c for c in sensors.columns if c.startswith('sensor')]
outcome = 'status'
X = sensors[predictors]
scaler = StandardScaler()
X = scaler.fit_transform(X)
model = KMeans(n_clusters=10, random_state=1).fit(X)
{ fig:food-compare-linkage} Fig. 7.6 Comparison of different linkage methods (ward, complete, average, single) for hierarchical clustering of the food data set
df = pd.DataFrame({
'status': sensors['status'],
'testResult': sensors['testResult'],
'cluster': model.predict(X),
})
Status Fail
cluster
8 19
0 15
4 13
1 13
9 13
3 10
5 5
7 2
6 1
2 1
Name: count, dtype: int64
Status Pass
cluster
1 48
8 34
Name: count, dtype: int64
There are several clusters that contain only ‘Fail’ data points. They correspond to
specific sensor value combinations that are very distinct to the sensor values during
normal operation. The ‘Pass’ data points are found in two clusters. Both of these
clusters contain also ‘Fail’ data points.
Analyse cluster membership by testResult.
Cluster 1
testResult
Good 48
Brake 6
IMP 4
Grippers 1
Motor 1
ITM 1
Name: count, dtype: int64
Cluster 2
testResult
SOS 1
Name: count, dtype: int64
Cluster 3
testResult
Velocity Type I 10
Name: count, dtype: int64
Cluster 4
testResult
Grippers 10
ITM 3
Name: count, dtype: int64
Cluster 5
testResult
Velocity Type II 5
Name: count, dtype: int64
Cluster 6
testResult
SOS 1
Name: count, dtype: int64
Cluster 7
testResult
Grippers 2
Name: count, dtype: int64
Cluster 8
testResult
Good 34
Motor 15
Grippers 1
IMP 1
ITM 1
Velocity Type I 1
Name: count, dtype: int64
Cluster 9
testResult
ITM 13
Name: count, dtype: int64
We can see that some of the test results are only found in one or two clusters.
Solution 7.12 The scikit-learn K-means clustering method can return either the cluster assignment of a data point or its distances to all the cluster centers. We evaluate both as features for classification.
# Data preparation
sensors = mistat.load_data('SENSORS.csv')
predictors = [c for c in sensors.columns if c.startswith('sensor')]
outcome = 'status'
X = sensors[predictors]
scaler = StandardScaler()
X = scaler.fit_transform(X)
y = sensors[outcome]
First, we classify data points based on their cluster membership. In order to use that information in a classifier, we transform the cluster assignments using one-hot encoding.
n_clusters accuracy
0 2 0.529
1 3 0.805
2 4 0.805
3 5 0.799
4 6 0.793
5 7 0.897
6 8 0.868
7 9 0.879
8 10 0.816
9 11 0.879
10 12 0.879
11 13 0.879
12 14 0.897
13 15 0.897
14 16 0.885
15 17 0.879
16 18 0.891
17 19 0.902
Fig. 7.7 Dependence of accuracy on the number of clusters using cluster membership as feature (Exc. 7.12) { fig:cluster-number-model}
The accuracies are visualized in Fig. 7.7. We see that splitting the dataset into 7 { fig:cluster-number-model}
clusters gives a classification model with an accuracy of about 0.9.
Next we use the distance to the cluster centers as variable in the classifier.
results = []
clf = DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
for n_clusters in range(2, 20):
# fit a model and convert data to distances
model = KMeans(n_clusters=n_clusters, random_state=1)
model.fit(X)
Xcluster = model.transform(X)
n_clusters accuracy
0 2 0.983
1 3 0.977
2 4 0.977
3 5 0.977
4 6 0.989
5 7 0.977
6 8 0.977
7 9 0.977
8 10 0.977
9 11 0.977
10 12 0.977
11 13 0.983
12 14 0.977
Fig. 7.8 Dependence of accuracy on the number of clusters using the distance to the cluster centers as feature (Exc. 7.12) { fig:cluster-distance-model}
13 15 0.977
14 16 0.977
15 17 0.983
16 18 0.977
17 19 0.983
The accuracies of all models are very high. The largest accuracy is achieved for 6 clusters.
Based on these results, we would design the procedure using the decision tree
classifier combined with K-means clustering into six clusters. Using scikit-learn,
we can define the full procedure as a single pipeline as follows:
pipeline = make_pipeline(
StandardScaler(),
KMeans(n_clusters=6, random_state=1),
DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
)
X = sensors[predictors]
y = sensors[outcome]
process = pipeline.fit(X, y)
print('accuracy', accuracy_score(y, process.predict(X)))
print('Confusion matrix')
print(confusion_matrix(y, process.predict(X)))
accuracy 0.9885057471264368
Confusion matrix
[[91 1]
[ 1 81]]
import mistat
import networkx as nx
from pgmpy.estimators import HillClimbSearch
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
dissolution = mistat.load_data('DISSOLUTION.csv')
Use shift registration to align the dissolution data with spline interpolation of
order 1, 2, and 3.
fd_registered = {}
for order in (1, 2, 3):
fd.interpolation = SplineInterpolation(interpolation_order=order)
fd_registered[order] = shift_registration.fit_transform(fd)
Fig. 8.1 Mean dissolution curves for reference and test tablets derived using spline interpolation of order 1, 2, and 3 { fig:fdaDissolutionSplineComparison}
For each of the three registered datasets, calculate the mean dissolution curves
for reference and test tablets and plot the results.
{ fig:fdaDissolutionSplineComparison} The dissolution curves are shown in Fig. 8.1. We can see in all three graphs that the test tablets show a slightly faster dissolution than the reference tablets. Comparing the shapes of the curves, the linear spline interpolation shows a levelling off with time. With quadratic spline interpolation, the dissolution curves go through a maximum; this behavior is unrealistic. The cubic spline interpolation also leads to unrealistic curves that first level off and then start to increase again.
import skfda
from skfda import FDataGrid
fd = FDataGrid(pinchraw.transpose(), pinchtime)
fig = fd.plot()
ax = fig.axes[0]
ax.set_xlabel('Time [s]')
ax.set_ylabel('Pinch force')
plt.show()
Fig. 8.2 shows the measured pinch forces. { fig:fdaPinchforceOriginal} We can see that the onset of the force varies from 0.025 to 0.1 seconds, which makes it difficult to compare the shapes of the curves. The individual curves are not symmetric, with a faster onset of the force and a slower decline.
(iii) We create a variety of smoothed versions of the dataset to explore the effect of varying the smoothing_parameter.
import itertools
from skfda.preprocessing.smoothing.kernel_smoothers import NadarayaWatsonSmoother
Fig. 8.3 shows smoothed measurement curves for a variety of smoothing_parameter values. { fig:fdaPinchforceSmoothing} If the value is too large, the data are oversmoothed and the asymmetric shape of the curves is lost. With decreasing values, the shape is reproduced better but the curves get noisier again. We select 0.005 as the smoothing parameter.
smoother = NadarayaWatsonSmoother(smoothing_parameter=0.005)
fd_smooth = smoother.fit_transform(fd)
Fig. 8.3 Pinch force curves smoothed with different values of smoothing_parameter { fig:fdaPinchforceSmoothing}
Fig. 8.4 Pinch force curves after smoothing and landmark shift registration { fig:fdaPinchforceRegistered}
max_idx = fd_smooth.data_matrix.argmax(axis=1)
landmarks = [pinchtime[idx] for idx in max_idx]
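The registration call itself is not shown; a sketch using skfda's landmark shift registration (the exact function name differs between skfda versions, so treat this as an assumption):
from skfda.preprocessing.registration import landmark_shift_registration
fd_landmark = landmark_shift_registration(fd_smooth, landmarks)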
{ fig:fdaPinchforceRegistered} (iv) Fig. 8.4 shows the measurements after smoothing and landmark shift regis-
tration.
fig = fd_landmark.plot()
ax = fig.axes[0]
ax.set_xlabel('Time [s]')
ax.set_ylabel('Pinch force')
plt.show()
Fig. 8.5 Histogram of the moisture values in the moisture dataset { fig:fdaMoistureDistribution}
import skfda
frequencies = moisturespectrum['Moisturespectrum']['x']
spectra = moisturespectrum['Moisturespectrum']['y']
moisture = moisturevalues['Moisturevalues']
(ii) We can use a histogram to look at the distribution of the moisture values.
Fig. 8.5 shows a bimodal distribution of the moisture content with a clear separa- { fig:fdaMoistureDistribution}
tion of the two peaks. Based on this, we select 14.5 as the threshold to separate into
high and low moisture content.
(iii–iv) After transposing, the spectrum information is in the array format required for the FDataGrid class. As we can see in the left graph of Fig. 8.6, the spectra are not well aligned but show a large spread in the intensities. { fig:fdaMoistureSpectrum} This is likely due to the difficulty in having a clearly defined concentration across samples. To reduce this variation, we transform the intensities by dividing each spectrum by its sample mean.
intensities = spectra.transpose()
fd = skfda.FDataGrid(intensities, frequencies)
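The normalization step is a one-liner; a sketch, assuming intensities is a NumPy array:
normalized = intensities / intensities.mean(axis=1, keepdims=True)
fd_normalized = skfda.FDataGrid(normalized, frequencies)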
{ fig:fdaMoistureSpectrum} Fig. 8.6 Near-infrared spectra of the moisture dataset. Left: raw spectra, right: normalized spectra
{ fig:fdaMoistureSpectrum} As we can see in right graph of Fig. 8.6, the normalized spectra are now better
aligned. We also see that the overall shape of the spectra is fairly consistent between
samples.
(v) We repeat the model building both for the original and normalized spectra 50
times. At each iteration, we split the data set into training and test sets (50-50), build
the model with the training set and measure accuracy using the test set. By using the
same random seed for splitting the original and the normalized dataset, we can better
{ fig:fdaMoistureAccuracies} compare the models. The accuracies from the 50 iterations are compared in Fig. 8.7.
accuracies = []
for rs in range(50):
X_train, X_test, y_train, y_test = train_test_split(fd,
moisture_class, random_state=rs, test_size=0.5)
knn_original = KNeighborsClassifier()
knn_original.fit(X_train, y_train)
Fig. 8.7 Accuracies of classification models based on original and normalized spectra. The line indicates equal performance { fig:fdaMoistureAccuracies}
# mean of accuracies
mean_accuracy = accuracies.mean()
mean_accuracy
original 0.7976
normalized 0.9468
dtype: float64
Fig. 8.7 clearly shows that classification models based on the normalized spectra { fig:fdaMoistureAccuracies}
achieve better accuracies. The mean accuracy increases from 0.8 to 0.95.
Fig. 8.8 Mean absolute error of regression models using original and normalized spectra. The line indicates equal performance { fig:fdaMoistureMAE}
mae = []
for rs in range(50):
X_train, X_test, y_train, y_test = train_test_split(fd,
moisture, random_state=rs, test_size=0.5)
knn_original = KNeighborsRegressor()
knn_original.fit(X_train, y_train)
mae_original = mean_absolute_error(y_test, knn_original.predict(X_test))
# mean of MAE
mean_mae = mae.mean()
mean_mae
original 0.817016
normalized 0.433026
dtype: float64
{ fig:fdaMoistureMAE} Fig. 8.8 clearly shows that regression models based on the normalized spectra
achieve better performance. The mean absolute error decreases from 0.82 to 0.43.
(iii) We use the last regression model from (ii) to create a plot of actual versus
predicted moisture content for the test data.
Fig. 8.9 Actual versus predicted moisture content for the test data { fig:fdaMoisturePredictions}
y_pred = knn_normalized.predict(X_test)
predictions = pd.DataFrame({'actual': y_test, 'predicted': y_pred})
minmax = [min(*y_test, *y_pred), max(*y_test, *y_pred)]
ax = predictions.plot.scatter(x='actual', y='predicted')
ax.set_xlabel('Moisture content')
ax.set_ylabel('Predicted moisture content')
ax.plot(minmax, minmax, color='grey')
plt.show()
Fig. 8.9 shows two clusters of points. One cluster contains the samples with the { fig:fdaMoisturePredictions}
high moisture content and the other cluster the samples with low moisture content.
The clusters are well separated and the predictions are in the typical range for
each cluster. However, within a cluster, predictions and actual values are not highly
correlated. In other words, while the regression model can distinguish between
samples with a high and low moisture content, the moisture content is otherwise
not well predicted. There is therefore no advantage of using the regression model
compared to the classification model.
(ii) In Python:
fpca_original = FPCA(n_components=2)
_ = fpca_original.fit(fd)
fpca_normalized = FPCA(n_components=2)
_ = fpca_normalized.fit(fd_normalized)
Fig. 8.10 Projection of the spectra onto the first two principal components. Left: original spectra, right: normalized spectra { fig:fdaMoisturePCA}
fpca_df.plot.scatter(x=0, y=1,
c=['C1' if mc == 'high' else 'C2' for mc in moisture_class], ax=ax)
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
{ fig:fdaMoisturePCA} Fig. 8.10 compares the PCA projections for the original and normalized data.
We can see that the second principal component clearly separates the two moisture
content classes for the normalized spectra. This is not the case for the original spectra.
Solution 8.6 (i) We demonstrate a solution to this exercise using two blog posts preprocessed and included in the mistat package. The content of the posts was converted to plain text with each paragraph on a separate line. Loading the posts gives the variable blogs, a dictionary with labels as keys and text as values. We next split the data into a list of labels and a list of non-empty paragraphs.
paragraphs = []
labels = []
for blog, text in blogs.items():
for paragraph in text.split('\n'):
paragraph = paragraph.strip()
if not paragraph: # ignore empty paragraphs
continue
paragraphs.append(paragraph)
labels.append(blog)
import re
from sklearn.feature_extraction.text import CountVectorizer
def preprocessor(text):
text = text.lower()
text = re.sub(r'\d[\d,]*', '', text)
text = '\n'.join(line for line in text.split('\n')
if not line.startswith('ntsb'))
return text
vectorizer = CountVectorizer(preprocessor=preprocessor,
stop_words='english')
counts = vectorizer.fit_transform(paragraphs)
termCounts = np.array(counts.sum(axis=0)).flatten()
topCounts = termCounts.argsort()
terms = vectorizer.get_feature_names_out()
for n in reversed(topCounts[-10:]):
print(f'{terms[n]:14s} {termCounts[n]:3d}')
global 63
climate 59
warming 57
change 55
ice 35
sea 34
earth 33
ocean 29
temperatures 28
heat 25
(iv)
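The TF-IDF and SVD steps are not listed; a sketch with scikit-learn, using five components to match the five topics shown below:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
tfidf = TfidfTransformer().fit_transform(counts)
svd = TruncatedSVD(5)
lsa_tfidf = svd.fit_transform(Normalizer().fit_transform(tfidf))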
terms = vectorizer.get_feature_names_out()
data = {}
for i, component in enumerate(svd.components_, 1):
compSort = component.argsort()
idx = list(reversed(compSort[-10:]))
data[f'Topic {i}'] = [terms[n] for n in idx]
data[f'Loading {i}'] = [component[n] for n in idx]
df = pd.DataFrame(data)
print("{\\tiny")
print(df.style.format(precision=2).hide(axis='index').to_latex(hrules=True))
print("}")
Topic 1 Loading 1 Topic 2 Loading 2 Topic 3 Loading 3 Topic 4 Loading 4 Topic 5 Loading 5
change 0.24 ice 0.39 sea 0.25 extreme 0.54 snow 0.40
climate 0.24 sea 0.34 earth 0.23 events 0.30 cover 0.24
global 0.23 sheets 0.27 energy 0.21 heat 0.24 sea 0.16
sea 0.22 shrinking 0.19 light 0.21 precipitation 0.20 climate 0.12
warming 0.21 level 0.17 gases 0.19 earth 0.13 level 0.12
ice 0.20 arctic 0.15 ice 0.18 light 0.13 temperatures 0.12
temperature 0.18 ocean 0.13 infrared 0.17 energy 0.12 decreased 0.11
ocean 0.16 declining 0.10 greenhouse 0.15 gases 0.12 temperature 0.11
earth 0.16 levels 0.08 level 0.15 greenhouse 0.11 increase 0.10
extreme 0.15 glaciers 0.07 atmosphere 0.12 infrared 0.10 rise 0.10
We can identify topics related to sea warming, ice sheets melting, greenhouse
effect, extreme weather events, and hurricanes.
(v) Repeat the analysis requesting 10 components in the SVD.
svd = TruncatedSVD(10)
norm_tfidf = Normalizer().fit_transform(tfidf)
lsa_tfidf = svd.fit_transform(norm_tfidf)
terms = vectorizer.get_feature_names_out()
data = {}
for i, component in enumerate(svd.components_, 1):
compSort = component.argsort()
idx = list(reversed(compSort[-10:]))
data[f'Topic {i}'] = [terms[n] for n in idx]
data[f'Loading {i}'] = [component[n] for n in idx]
df = pd.DataFrame(data)
Topic 1 Loading 1 Topic 2 Loading 2 Topic 3 Loading 3 Topic 4 Loading 4 Topic 5 Loading 5
change 0.24 ice 0.39 sea 0.25 extreme 0.54 snow 0.39
climate 0.24 sea 0.35 earth 0.24 events 0.31 cover 0.23
global 0.23 sheets 0.27 energy 0.21 heat 0.24 sea 0.17
sea 0.22 shrinking 0.19 light 0.21 precipitation 0.20 level 0.13
warming 0.21 level 0.17 gases 0.19 light 0.13 temperatures 0.12
ice 0.20 arctic 0.15 ice 0.18 earth 0.13 climate 0.12
temperature 0.18 ocean 0.13 infrared 0.17 energy 0.12 hurricanes 0.12
ocean 0.16 declining 0.10 greenhouse 0.16 gases 0.12 decreased 0.11
earth 0.16 levels 0.08 level 0.15 greenhouse 0.11 increase 0.10
extreme 0.15 glaciers 0.07 atmosphere 0.12 infrared 0.10 rise 0.10
Topic 6 Loading 6 Topic 7 Loading 7 Topic 8 Loading 8 Topic 9 Loading 9 Topic 10 Loading 10
sea 0.38 snow 0.34 ocean 0.38 glaciers 0.36 responsibility 0.30
level 0.36 ocean 0.26 hurricanes 0.23 water 0.24 authorities 0.24
rise 0.17 cover 0.24 acidification 0.18 retreat 0.24 heat 0.23
extreme 0.14 acidification 0.17 glaciers 0.17 glacial 0.21 wildfires 0.20
events 0.11 extreme 0.16 water 0.13 months 0.16 pollution 0.19
global 0.11 pollution 0.13 waters 0.12 summer 0.14 arctic 0.13
hurricanes 0.11 carbon 0.13 temperatures 0.11 going 0.13 personal 0.12
temperature 0.10 decreased 0.12 reefs 0.09 plants 0.13 carbon 0.11
impacts 0.09 point 0.11 coral 0.09 animals 0.11 percent 0.11
coastal 0.08 warming 0.10 warmer 0.09 stream 0.11 reduce 0.10
The first five topics are identical to the result in (iv). This is an expected property
of the SVD.
Fig. 8.11 Projection of the documents onto the first two singular values. Red: blog post 1, green: blog post 2 { fig:nlpSVD}
{ fig:nlpSVD} (vi) Fig. 8.11 shows the individual documents projected onto the first two singular values of the LSA. Based on this visualization, we can say that the two documents discuss different aspects of global warming and that blog post 1 goes into more detail in this area.
fig, ax = plt.subplots()
blog1 = [label == 'blog-1' for label in labels]
blog2 = [label == 'blog-2' for label in labels]
ax.plot(lsa_tfidf[blog1, 0], lsa_tfidf[blog1, 1], 'ro')
ax.plot(lsa_tfidf[blog2, 0], lsa_tfidf[blog2, 1], 'go')
ax.set_xlabel('First component')
ax.set_ylabel('Second component')
plt.show()
Solution 8.7 We follow the same procedure as in Exercise 8.6 using a set of three {exc:nlp-topic-1}
articles preprocessed and included in the mistat package.
paragraphs = []
labels = []
for blog, text in blogs.items():
    for paragraph in text.split('\n'):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        paragraphs.append(paragraph)
        labels.append(blog)
def preprocessor(text):
    text = text.lower()
    text = re.sub(r'\d[\d,]*', '', text)
    text = '\n'.join(line for line in text.split('\n')
                     if not line.startswith('ntsb'))
    return text
TF-IDF transformation
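The construction of the document term matrix and the TF-IDF weights is not reproduced in this solution. Assuming the same pipeline as in Exercise 8.6 (CountVectorizer with the preprocessor defined above; this is an assumption carried over from that exercise), it could look like the following sketch:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# assumed vectorization step, mirroring Exercise 8.6
vectorizer = CountVectorizer(preprocessor=preprocessor, stop_words='english')
counts = vectorizer.fit_transform(paragraphs)
tfidf = TfidfTransformer().fit_transform(counts)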
svd = TruncatedSVD(10)
norm_tfidf = Normalizer().fit_transform(tfidf)
lsa_tfidf = svd.fit_transform(norm_tfidf)
Topics:
terms = vectorizer.get_feature_names_out()
data = {}
for i, component in enumerate(svd.components_, 1):
    compSort = component.argsort()
    idx = list(reversed(compSort[-10:]))
    data[f'Topic {i}'] = [terms[n] for n in idx]
    data[f'Loading {i}'] = [component[n] for n in idx]
df = pd.DataFrame(data)
Topic 1 Loading 1 Topic 2 Loading 2 Topic 3 Loading 3 Topic 4 Loading 4 Topic 5 Loading 5
labour 0.29 south 0.27 economic 0.24 capacity 0.21 covid 0.22
covid 0.22 labour 0.27 percent 0.23 financial 0.20 america 0.22
impact 0.19 north 0.22 covid 0.19 firms 0.19 reset 0.21
market 0.19 differences 0.20 gdp 0.17 household 0.17 latin 0.20
south 0.18 americas 0.17 impact 0.15 international 0.17 needs 0.18
america 0.16 channel 0.16 imf 0.14 access 0.14 economic 0.14
pandemic 0.15 agenda 0.14 social 0.12 largely 0.14 social 0.14
channel 0.15 covid 0.13 pre 0.12 depends 0.14 asymmetric 0.11
economic 0.14 post 0.11 world 0.11 state 0.13 region 0.11
north 0.14 impact 0.09 growth 0.11 support 0.13 labor 0.10
Topic 6 Loading 6 Topic 7 Loading 7 Topic 8 Loading 8 Topic 9 Loading 9 Topic 10 Loading 10
economic 0.26 covid 0.24 self 0.22 agenda 0.21 state 0.15
channel 0.20 asymmetric 0.21 employed 0.19 post 0.19 policies 0.14
social 0.16 occupations 0.20 unbearable 0.19 channel 0.18 needs 0.13
market 0.14 consequences 0.20 lightness 0.19 region 0.15 self 0.12
recovery 0.12 economic 0.17 employment 0.13 labour 0.12 capacity 0.12
labour 0.11 transition 0.13 countries 0.13 pandemic 0.12 unbearable 0.12
consequences 0.10 occupation 0.13 economic 0.13 new 0.09 lightness 0.12
covid 0.09 americas 0.13 channel 0.12 latin 0.09 covid 0.12
governments 0.09 north 0.13 argentina 0.12 given 0.08 employed 0.12
agenda 0.09 differences 0.12 social 0.12 occupations 0.08 reset 0.11
fig, ax = plt.subplots()
for blog in blogs:
    match = [label == blog for label in labels]
    ax.plot(lsa_tfidf[match, 0], lsa_tfidf[match, 1], 'o', label=blog)
ax.legend()
ax.set_xlabel('First component')
ax.set_ylabel('Second component')
plt.show()
Fig. 8.12 Projection of the documents onto the first two singular values; the legend identifies the three blog posts.
Solution 8.8 (i) We load the LAPTOP_REVIEWS data, combine the review title and content into a single Review column, and keep only rows that have a user rating, title, and content.
data = mistat.load_data('LAPTOP_REVIEWS')
data['Review'] = data['Review title'] + ' ' + data['Review content']
reviews = data.dropna(subset=['User rating', 'Review title', 'Review content'])
(ii) Convert the text representation into a document term matrix (DTM).
import re
from sklearn.feature_extraction.text import CountVectorizer
def preprocessor(text):
    text = text.lower()
    text = re.sub(r'\d[\d,]*', '', text)
    return text
vectorizer = CountVectorizer(preprocessor=preprocessor,
                             stop_words='english')
counts = vectorizer.fit_transform(reviews['Review'])
print('shape of DTM', counts.shape)
print('total number of terms', np.sum(counts))
(iii) Convert the counts in the document term matrix (DTM) using TF-IDF.
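The code for this step is not shown in the solution; a minimal sketch, assuming the counts matrix from step (ii), would be:
from sklearn.feature_extraction.text import TfidfTransformer
# convert the raw term counts into TF-IDF weights
tfidf = TfidfTransformer().fit_transform(counts)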
(iv) Using scikit-learn’s TruncatedSVD, we reduce the sparse TF-IDF matrix to a dense, low-dimensional representation.
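The corresponding code is omitted here; a sketch consistent with the shape reported below (20 components and a normalized TF-IDF input, as in the earlier exercises; these settings are assumptions) might look like this:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
# 20-component LSA on the normalized TF-IDF matrix (assumed settings)
svd = TruncatedSVD(20)
norm_tfidf = Normalizer().fit_transform(tfidf)
lsa_tfidf = svd.fit_transform(norm_tfidf)
print(lsa_tfidf.shape)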
(7433, 20)
(v) We use logistic regression to classify reviews, labeling those with a user rating of five as positive and all others as negative.
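The model-fitting code is not reproduced in the solution. A minimal sketch, assuming the lsa_tfidf representation from step (iv), a 70/30 train/test split, and accuracy as the reported metric (all assumptions), could be as follows; the value printed below is taken from the original solution and need not be reproduced exactly by this sketch:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# label reviews: rating of five -> positive (1), otherwise negative (0)
y = (reviews['User rating'] == 5).astype(int)
# hypothetical train/test split; the original split is not documented here
X_train, X_test, y_train, y_test = train_test_split(
    lsa_tfidf, y, test_size=0.3, random_state=1)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))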
0.769334229993275