
Ron Kenett, Shelemyahu Zacks, Peter Gedeck

Modern Statistics: A Computer


Based Approach with Python
Solutions

May 3, 2023

Springer Nature
Contents

1 Analyzing Variability: Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . 5

2 Probability Models and Distribution Functions . . . . . . . . . . . . . . . . . . . . 17

3 Statistical Inference and Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4 Variability in Several Dimensions and Regression Models . . . . . . . . . . . 65

5 Sampling for Estimation of Finite Population Quantities . . . . . . . . . . . . 91

6 Time Series Analysis and Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

7 Modern analytic methods: Part I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

8 Modern analytic methods: Part II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

Chapter 1
Analyzing Variability: Descriptive Statistics

Import required modules and define required functions

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mistat
from scipy import stats

def trim_std(data, alpha):
    """ Calculate trimmed standard deviation """
    data = np.array(data)
    data.sort()
    n = len(data)
    low = int(n * alpha) + 1
    high = int(n * (1 - alpha))
    return data[low:(high + 1)].std()

Solution 1.1 random.choices selects 𝑘 values from the list using sampling with
replacement.

import random
random.seed(1)
values = random.choices([1, 2, 3, 4, 5, 6], k=50)

Counter counts the number of occurrences of a given value in a list.

from collections import Counter


Counter(values)

Counter({2: 10, 3: 10, 1: 9, 6: 9, 5: 8, 4: 4})

The expected frequency in each cell, under randomness, is 50/6 ≈ 8.3. You will get different numerical results due to randomness.

Solution 1.2 The Python function range is an iterator. As we need a list of values,
we need to explicitly convert it.


x = list(range(50))
y = [5 + 2.5 * xi for xi in x]
y = [yi + random.uniform(-10, 10) for yi in y]
pd.DataFrame({'x': x, 'y': y}).plot.scatter(x='x', y='y')
plt.show()


Solution 1.3 In Python

from scipy.stats import binom


np.random.seed(1)

for p in (0.1, 0.3, 0.7, 0.9):
    X = binom.rvs(1, p, size=50)
    print(p, sum(X))

0.1 4
0.3 12
0.7 33
0.9 43

Notice that the expected values of the sums are 5, 15, 35 and 45.

Solution 1.4 We can plot the data and calculate mean and standard deviation.

inst1 = [9.490950, 10.436813, 9.681357, 10.996083, 10.226101, 10.253741,
         10.458926, 9.247097, 8.287045, 10.145414, 11.373981, 10.144389,
         11.265351, 7.956107, 10.166610, 10.800805, 9.372905, 10.199018,
         9.742579, 10.428091]
inst2 = [11.771486, 10.697693, 10.687212, 11.097567, 11.676099,
10.583907, 10.505690, 9.958557, 10.938350, 11.718334,
11.308556, 10.957640, 11.250546, 10.195894, 11.804038,
11.825099, 10.677206, 10.249831, 10.729174, 11.027622]
ax = pd.Series(inst1).plot(marker='o', linestyle='none',
fillstyle='none', color='black')
pd.Series(inst2).plot(marker='+', linestyle='none', ax=ax,
fillstyle='none', color='black')

plt.show()

print('mean inst1', np.mean(inst1))


print('stdev inst1', np.std(inst1, ddof=1))
print('mean inst2', np.mean(inst2))
print('stdev inst2', np.std(inst2, ddof=1))

mean inst1 10.03366815


stdev inst1 0.8708144577963102
mean inst2 10.98302505
stdev inst2 0.5685555119253366

As the plot shows, the measurements on Instrument 1 (circles) seem to be accurate but less precise than those on Instrument 2 (+). Instrument 2 seems to have an upward bias (inaccurate). Quantitatively, the mean of the measurements on Instrument 1 is X̄₁ = 10.034 and its standard deviation is S₁ = 0.871. For Instrument 2 we have X̄₂ = 10.983 and S₂ = 0.569.

Solution 1.5 If the scale is inaccurate it will show on the average a deterministic
component different than the nominal weight. If the scale is imprecise, different
weight measurements will show a high degree of variability around the correct
nominal weight. Problems with stability arise when the accuracy of the scale changes
with time, and the scale should be recalibrated.

Solution 1.6 The method random.choices creates a random selection with re-
placement. Note that in range(start, end) the end argument is excluded. We
therefore need to set it to 101.

import random
random.choices(range(1, 101), k=20)

[6, 88, 57, 20, 51, 49, 36, 35, 54, 63, 62, 46, 3, 23, 18, 59, 87, 80,
80, 82]

Solution 1.7 The method random.sample creates a random selection without replacement.

import random
random.sample(range(11, 31), 10)

[19, 12, 13, 28, 11, 18, 26, 23, 15, 14]

Solution 1.8 (i) 26⁵ = 11,881,376; (ii) 7,893,600; (iii) 26³ = 17,576; (iv) 2¹⁰ = 1024; (v) C(10, 5) = 252.
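These counts can be checked numerically. The sketch below uses Python's standard library; reading (ii) as the number of ordered selections of 5 distinct letters is our interpretation of the numerical answer, not something spelled out in the solution.

import math

print(26**5)             # (i)  11,881,376
print(math.perm(26, 5))  # (ii) 7,893,600 -- assuming ordered selection without repetition
print(26**3)             # (iii) 17,576
print(2**10)             # (iv) 1,024
print(math.comb(10, 5))  # (v)  252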

Solution 1.9 (i) discrete;


(ii) discrete;
(iii) continuous;
(iv) continuous.

Solution 1.10
filmsp = mistat.load_data('FILMSP')
filmsp.plot.hist()
plt.show()


Solution 1.11
coal = mistat.load_data('COAL')
pd.DataFrame(coal.value_counts(sort=False))

count
COAL
3 17
6 3
4 6
0 33
5 5
2 18
7 1
1 28

Solution 1.12 For (i) and (ii), we can use the pandas value_counts method, e.g.:

car = mistat.load_data('CAR')
car['cyl'].value_counts(sort=False)

cyl
4 66
6 30
8 13
Name: count, dtype: int64

(i) Frequency distribution of number of cylinders:


Cyl 4 6 8 Total
Frequency 66 30 13 109
(ii) Frequency distribution of car’s origin:
Origin U.S. Europe Asia Total
Frequency 58 14 37 109
For (iii) to (v), we need to bin the data first. We can use the pandas cut method for this.
(iii) Frequency distribution of Turn diameter: We determine the frequency distri-
bution on 8 intervals of length 2, from 28 to 44.

pd.cut(car['turn'], bins=range(28, 46, 2)).value_counts(sort=False)

turn
(28, 30] 3
(30, 32] 16
(32, 34] 16
(34, 36] 26
(36, 38] 20
(38, 40] 18
(40, 42] 8
(42, 44] 2
Name: count, dtype: int64

Note that the bin intervals are open on the left and closed on the right.
(iv) Frequency distribution of Horsepower:

pd.cut(car['hp'], bins=range(50, 275, 25)).value_counts(sort=False)

hp
(50, 75] 7
(75, 100] 34
(100, 125] 22
(125, 150] 18
(150, 175] 17
(175, 200] 6
(200, 225] 4
(225, 250] 1
Name: count, dtype: int64

(v) Frequency Distribution of MPG:



pd.cut(car['mpg'], bins=range(9, 38, 5)).value_counts(sort=False)

mpg
(9, 14] 1
(14, 19] 42
(19, 24] 41
(24, 29] 22
(29, 34] 3
Name: count, dtype: int64

Solution 1.13
filmsp = mistat.load_data('FILMSP')
filmsp = filmsp.sort_values(ignore_index=True) # sort and reset index

print(filmsp.quantile(q=[0, 0.25, 0.5, 0.75, 1.0]))


print(filmsp.quantile(q=[0.8, 0.9, 0.99]))

0.00 66.0
0.25 102.0
0.50 105.0
0.75 109.0
1.00 118.0
Name: FILMSP, dtype: float64
0.80 109.8
0.90 111.0
0.99 114.0
Name: FILMSP, dtype: float64

Here is a solution that uses pure Python. Note that pandas' quantile implements different interpolation methods, which can lead to differences for smaller datasets. We therefore recommend using the library method and selecting the interpolation that is most suitable for your use case.

def calculate_quantile(x, q):
    idx = (len(x) - 1) * q
    left = math.floor(idx)
    right = math.ceil(idx)
    return 0.5 * (x[left] + x[right])

for q in (0, 0.25, 0.5, 0.75, 0.8, 0.9, 0.99, 1.0):
    print(q, calculate_quantile(filmsp, q))

0 66.0
0.25 102.0
0.5 105.0
0.75 109.0
0.8 109.5
0.9 111.0
0.99 114.0
1.0 118.0

Solution 1.14
filmsp = mistat.load_data('FILMSP')
n = len(filmsp)
mean = filmsp.mean()
deviations = [film - mean for film in filmsp]
S = math.sqrt(sum(deviation**2 for deviation in deviations) / n)

skewness = sum(deviation**3 for deviation in deviations) / n / (S**3)


kurtosis = sum(deviation**4 for deviation in deviations) / n / (S**4)

print('Python:\n',
f'Skewness: {skewness}, Kurtosis: {kurtosis}')

print('Pandas:\n',
f'Skewness: {filmsp.skew()}, Kurtosis: {filmsp.kurtosis()}')

Python:
Skewness: -1.8098727695275856, Kurtosis: 9.014427238360716
Pandas:
Skewness: -1.8224949285588137, Kurtosis: 6.183511188870432

The distribution of film speed is negatively skewed and much steeper than the
normal distribution. Note that the calculated values differ between the methods.
Solution 1.15 The pandas groupby method groups the data based on the value. We
can then calculate individual statistics for each group.

car = mistat.load_data('CAR')
car['mpg'].groupby(by=car['origin']).mean()
car['mpg'].groupby(by=car['origin']).std()
# calculate both at the same time
print(car['mpg'].groupby(by=car['origin']).agg(['mean', 'std']))

mean std
origin
1 20.931034 3.597573
2 19.500000 2.623855
3 23.108108 4.280341

Solution 1.16 We first create a subset of the data frame that contains only US made
cars and then calculate the statistics for this subset only.

car = mistat.load_data('CAR')
car_US = car[car['origin'] == 1]
gamma = car_US['turn'].std() / car_US['turn'].mean()

Coefficient of variation gamma = 0.084.

Solution 1.17
car = mistat.load_data('CAR')

car_US = car[car['origin'] == 1]
car_Asia = car[car['origin'] == 3]
print('US')
print('mean', car_US['turn'].mean())
print('geometric mean', stats.gmean(car_US['turn']))
print('Japanese')
print('mean', car_Asia['turn'].mean())
print('geometric mean', stats.gmean(car_Asia['turn']))

US
mean 37.203448275862065
geometric mean 37.06877691910792
Japanese
mean 33.04594594594595
geometric mean 32.97599107553825

We see that 𝑋¯ is greater than 𝐺. The cars from Asia have smaller mean turn
diameter.

Solution 1.18
filmsp = mistat.load_data('FILMSP')

Xbar = filmsp.mean()
S = filmsp.std()
print(f'mean: {Xbar}, stddev: {S}')
expected = {1: 0.68, 2: 0.95, 3: 0.997}
for k in (1, 2, 3):
    left = Xbar - k * S
    right = Xbar + k * S
    proportion = sum(left < film < right for film in filmsp)
    print(f'X +/- {k}S: ',
          f'actual freq. {proportion}, ',
          f'pred. freq. {expected[k] * len(filmsp):.2f}')

mean: 104.59447004608295, stddev: 6.547657682704987


X +/- 1S: actual freq. 173, pred. freq. 147.56
X +/- 2S: actual freq. 205, pred. freq. 206.15
X +/- 3S: actual freq. 213, pred. freq. 216.35

The discrepancies between the actual and predicted frequencies are due to the fact that the distribution of film speed is neither symmetric nor bell-shaped.

Solution 1.19
car = mistat.load_data('CAR')
car.boxplot(column='mpg', by='origin')
plt.show()


Solution 1.20
oturb = mistat.load_data('OTURB')
mistat.stemLeafDiagram(oturb, 2, leafUnit=0.01)

4 2 3444
18 2 55555666677789
40 3 0000001111112222333345
(15) 3 566677788899999
45 4 00022334444
34 4 566888999
25 5 0112333
18 5 6789
14 6 01122233444
3 6 788

• X_(1) = 0.23,
• Q_1 = X_(25.25) = X_(25) + 0.25(X_(26) − X_(25)) = 0.31,
• M_e = X_(50.5) = 0.385,
• Q_3 = X_(75.75) = 0.49 + 0.75(0.50 − 0.49) = 0.4975,
• X_(n) = 0.68.

Solution 1.21
from scipy.stats import trim_mean

oturb = mistat.load_data('OTURB')
print(f'T(0.1) = {trim_mean(oturb, 0.1)}')
print(f'S(0.1) = {trim_std(oturb, 0.1)}')

T(0.1) = 0.40558750000000005
S(0.1) = 0.09982289003530202

𝑇¯𝛼 = 0.4056 and 𝑆 𝛼 = 0.0998, where 𝛼 = 0.10.

Solution 1.22
germanCars = [10, 10.9, 4.8, 6.4, 7.9, 8.9, 8.5, 6.9, 7.1,
5.5, 6.4, 8.7, 5.1, 6.0, 7.5]
japaneseCars = [9.4, 9.5, 7.1, 8.0, 8.9, 7.7, 10.5, 6.5, 6.7,
9.3, 5.7, 12.5, 7.2, 9.1, 8.3, 8.2, 8.5, 6.8, 9.5, 9.7]
# convert to pandas Series
germanCars = pd.Series(germanCars)
japaneseCars = pd.Series(japaneseCars)
# use describe to calculate statistics
comparison = pd.DataFrame({
'German': germanCars.describe(),
'Japanese': japaneseCars.describe(),
})
print(comparison)

German Japanese
count 15.000000 20.000000
mean 7.373333 8.455000
std 1.780235 1.589596
min 4.800000 5.700000
25% 6.200000 7.175000
50% 7.100000 8.400000
75% 8.600000 9.425000
max 10.900000 12.500000

Solution 1.23 Sample statistics:



hadpas = mistat.load_data('HADPAS')
sampleStatistics = pd.DataFrame({
'res3': hadpas['res3'].describe(),
'res7': hadpas['res7'].describe(),
})
print(sampleStatistics)

res3 res7
count 192.000000 192.000000
mean 1965.239583 1857.776042
std 163.528165 151.535930
min 1587.000000 1420.000000
25% 1860.000000 1772.250000
50% 1967.000000 1880.000000
75% 2088.750000 1960.000000
max 2427.000000 2200.000000

Histogram:

ax = hadpas.hist(column='res3', alpha=0.5)
hadpas.hist(column='res7', alpha=0.5, ax=ax)
plt.show()

We overlay both histograms in one plot and make them transparent (alpha).
Stem and leaf diagrams:

print('res3')
mistat.stemLeafDiagram(hadpas['res3'], 2, leafUnit=10)
print('res7')
mistat.stemLeafDiagram(hadpas['res7'], 2, leafUnit=10)

res3
1 15 8
6 16 01124
14 16 56788889

22 17 00000234
32 17 5566667899
45 18 0011112233444
60 18 556666677888899
87 19 000000111111223333344444444
(18) 19 566666666667888889
87 20 00000000122222333334444
64 20 55556666666677789999
44 21 0000011222233344444
25 21 566667788888
13 22 000111234
4 22 668
0 23
1 24 2
res7
1 14 2
9 15 11222244
15 15 667789
23 16 00012334
30 16 5566799
40 17 0022233334
54 17 66666666777999
79 18 0000222222222233344444444
(28) 18 5555556666666778888888999999
85 19 000000001111112222222233333444444
52 19 566666667777888888889999
28 20 0000111222333444
12 20 678
9 21 1123344
2 21 8
1 22 0

Solution 1.24
hadpas.boxplot(column='res3')
plt.show()

Lower whisker starts at max(1587, 1511.7) = 1587 = 𝑋 (1) ; upper whisker ends
at min(2427, 2440.5) = 2427 = 𝑋 (𝑛) . There are no outliers.
Chapter 2
Probability Models and Distribution Functions

Import required modules and define required functions

import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

Solution 2.1 (i) S = {(w_1, ..., w_20); w_j ∈ {G, D}, j = 1, ..., 20}.
(ii) 2²⁰ = 1,048,576.
(iii) A_n = {(w_1, ..., w_20) : Σ_{j=1}^{20} I{w_j = G} = n}, n = 0, ..., 20, where I{A} = 1 if A is true and I{A} = 0 otherwise. The number of elementary events in A_n is C(20, n) = 20!/(n!(20 − n)!).

Solution 2.2 S = {(𝜔1 , . . . , 𝜔10 ) : 10 ≤ 𝜔𝑖 ≤ 20, 𝑖 = 1, . . . , 10}. Looking at the


(𝜔1 , 𝜔2 ) components of 𝐴 and 𝐵 we have the following graphical representation:

[Figure: the sets A and B in the (ω₁, ω₂) plane.]


If (𝜔1 , . . . , 𝜔10 ) ∈ 𝐴 then (𝜔1 , . . . , 𝜔10 ) ∈ 𝐵. Thus 𝐴 ⊂ 𝐵. 𝐴 ∩ 𝐵 = 𝐴.

Solution 2.3 (i) S = {(i_1, ..., i_30) : i_j = 0, 1, j = 1, ..., 30}.
(ii) A_10 = {(1, 1, ..., 1, i_11, i_12, ..., i_30) : i_j = 0, 1, j = 11, ..., 30}. |A_10| = 2²⁰ = 1,048,576. (|A_10| denotes the number of elements in A_10.)
(iii) B_10 = {(i_1, ..., i_30) : i_j = 0, 1 and Σ_{j=1}^{30} i_j = 10}, |B_10| = C(30, 10) = 30,045,015. A_10 ⊄ B_10; in fact, A_10 has only one element belonging to B_10.

Solution 2.4 S = ( 𝐴 ∩ 𝐵) ∪ ( 𝐴 ∩ 𝐵 𝑐 ) ∪ ( 𝐴𝑐 ∩ 𝐵) ∪ ( 𝐴𝑐 ∩ 𝐵 𝑐 ), a union of mutually


disjoint sets.
(a) 𝐴 ∪ 𝐵 = ( 𝐴 ∩ 𝐵) ∪ ( 𝐴 ∩ 𝐵 𝑐 ) ∪ ( 𝐴𝑐 ∩ 𝐵). Hence, ( 𝐴 ∪ 𝐵) 𝑐 = 𝐴𝑐 ∩ 𝐵 𝑐 .
(b)

( 𝐴 ∩ 𝐵) 𝑐 = ( 𝐴 ∩ 𝐵 𝑐 ) ∪ ( 𝐴𝑐 ∩ 𝐵) ∪ ( 𝐴𝑐 ∩ 𝐵 𝑐 )
= ( 𝐴 ∩ 𝐵𝑐 ) ∪ 𝐴𝑐
= 𝐴𝑐 ∪ 𝐵𝑐 .

Solution 2.5 As in Exercise 2.1, A_n = {(ω_1, ..., ω_20) : Σ_{i=1}^{20} I{ω_i = G} = n}, n = 0, ..., 20. Thus, for any n ≠ n′, A_n ∩ A_{n′} = ∅; moreover, ∪_{n=0}^{20} A_n = S. Hence {A_0, ..., A_20} is a partition.

Solution 2.6 ∪_{i=1}^{n} A_i = S and A_i ∩ A_j = ∅ for all i ≠ j. Hence

B = B ∩ S = B ∩ (∪_{i=1}^{n} A_i) = ∪_{i=1}^{n} A_i B.

Solution 2.7

Pr{ 𝐴 ∪ 𝐵 ∪ 𝐶} = Pr{( 𝐴 ∪ 𝐵) ∪ 𝐶}
= Pr{( 𝐴 ∪ 𝐵)} + Pr{𝐶} − Pr{( 𝐴 ∪ 𝐵) ∩ 𝐶}
= Pr{ 𝐴} + Pr{𝐵} − Pr{ 𝐴 ∩ 𝐵} + Pr{𝐶}
− Pr{ 𝐴 ∩ 𝐶} − Pr{𝐵 ∩ 𝐶} + Pr{ 𝐴 ∩ 𝐵 ∩ 𝐶}
= Pr{ 𝐴} + Pr{𝐵} + Pr{𝐶} − Pr{ 𝐴 ∩ 𝐵}
− Pr{ 𝐴 ∩ 𝐶} − Pr{𝐵 ∩ 𝐶} + Pr{ 𝐴 ∩ 𝐵 ∩ 𝐶}.
Solution 2.8 We have shown in Exercise 2.6 that B = ∪_{i=1}^{n} A_i B. Moreover, since {A_1, ..., A_n} is a partition, A_i B ∩ A_j B = (A_i ∩ A_j) ∩ B = ∅ ∩ B = ∅ for all i ≠ j. Hence, from Axiom 3,

Pr{B} = Pr{∪_{i=1}^{n} A_i B} = Σ_{i=1}^{n} Pr{A_i B}.

Solution 2.9

S = {(i_1, i_2) : i_j = 1, ..., 6, j = 1, 2}
A = {(i_1, i_2) : i_1 + i_2 = 10} = {(4, 6), (5, 5), (6, 4)}
Pr{A} = 3/36 = 1/12.

Solution 2.10

Pr{B} = Pr{A_150} − Pr{A_280} = exp(−150/200) − exp(−280/200) = 0.2258.

Solution 2.11
10 10 15 5
2 2 2 2
40
= 0.02765
8
100
Solution 2.12 (i) 1005 = 1010 ; (ii) 5 = 75, 287, 520.
Solution 2.13 N = 1000, M = 900, n = 10.

(i) Pr{X ≥ 8} = Σ_{j=8}^{10} C(10, j) (0.9)^j (0.1)^{10−j} = 0.9298.

(ii) Pr{X ≥ 8} = Σ_{j=8}^{10} C(900, j) C(100, 10 − j) / C(1000, 10) = 0.9308.
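Both tail probabilities can be checked with scipy.stats; as a sketch, the hypergeometric parametrization below (population size, number of "good" items, sample size) follows scipy's convention.

from scipy import stats

# (i) binomial model: 10 draws with replacement, p = 0.9
print(1 - stats.binom(10, 0.9).cdf(7))            # approx. 0.9298
# (ii) hypergeometric model: N = 1000, M = 900 good items, n = 10 draws
print(1 - stats.hypergeom(1000, 900, 10).cdf(7))  # approx. 0.9308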

Solution 2.14 1 − (0.9)¹⁰ = 0.6513.

Solution 2.15 Pr{T > 300 | T > 200} = 0.6065.

Solution 2.16 (i) Pr{D | B} = 1/4; (ii) Pr{C | D} = 1.
Solution 2.17 Since 𝐴 and 𝐵 are independent, Pr{ 𝐴 ∩ 𝐵} = Pr{ 𝐴}Pr{𝐵}. Using this
fact and DeMorgan’s Law,
Pr{ 𝐴𝑐 ∩ 𝐵 𝑐 } = Pr{( 𝐴 ∪ 𝐵) 𝑐 }
= 1 − Pr{ 𝐴 ∪ 𝐵}
= 1 − (Pr{ 𝐴} + Pr{𝐵} − Pr{ 𝐴 ∩ 𝐵})
= 1 − Pr{ 𝐴} − Pr{𝐵} + Pr{ 𝐴} Pr{𝐵}
= Pr{ 𝐴𝑐 } − Pr{𝐵}(1 − Pr{ 𝐴})
= Pr{ 𝐴𝑐 }(1 − Pr{𝐵})
= Pr{ 𝐴𝑐 } Pr{𝐵 𝑐 }.
Since Pr{ 𝐴𝑐 ∩ 𝐵 𝑐 } = Pr{ 𝐴𝑐 } Pr{𝐵 𝑐 }, 𝐴𝑐 and 𝐵 𝑐 are independent.
Solution 2.18 We assume that Pr{A} > 0 and Pr{B} > 0. Thus, Pr{A} Pr{B} > 0. On the other hand, since A ∩ B = ∅, Pr{A ∩ B} = 0 ≠ Pr{A} Pr{B}, so A and B cannot be independent.

Solution 2.19

Pr{ 𝐴 ∪ 𝐵} = Pr{ 𝐴} + Pr{𝐵} − Pr{ 𝐴 ∩ 𝐵}


= Pr{ 𝐴} + Pr{𝐵} − Pr{ 𝐴} Pr{𝐵}
= Pr{ 𝐴}(1 − Pr{𝐵}) + Pr{𝐵}
= Pr{𝐵}(1 − Pr{𝐴}) + Pr{ 𝐴}.

Solution 2.20 By Bayes’ theorem,

Pr{D | A} = Pr{A | D} Pr{D} / (Pr{A | D} Pr{D} + Pr{A | G} Pr{G}) = (0.10 × 0.01) / (0.10 × 0.01 + 0.95 × 0.99) = 0.0011.
Additional problems in combinatorial and geometric probabilities

Solution 2.21 Let n be the number of people in the party. The probability that all their birthdays fall on different days is Pr{D_n} = Π_{j=1}^{n} (365 − j + 1)/365.
(i) If n = 10, Pr{D_10} = 0.8831.
(ii) If n = 23, Pr{D_23} = 0.4927. Thus, the probability of at least 2 persons with the same birthday, when n = 23, is 1 − Pr{D_23} = 0.5073 > 1/2.

Solution 2.22 (7/10)¹⁰ = 0.02825.

Solution 2.23 Π_{j=1}^{10} (1 − 1/(24 − j + 1)) = 0.5833.
Solution 2.24 (i) C(4, 1) C(86, 4) / C(100, 5) = 0.1128; (ii) C(4, 1) C(10, 1) C(86, 3) / C(100, 5) = 0.0544; (iii) 1 − C(86, 5)/C(100, 5) = 0.5374.

Solution 2.25 4 / C(10, 2) = 0.0889.

Solution 2.26 The sample median is X_(6), where X_(1) < ··· < X_(11) are the ordered sample values. Pr{X_(6) = k} = C(k − 1, 5) C(20 − k, 5) / C(20, 11), k = 6, ..., 15. This is the probability distribution of the sample median. The probabilities are

k    6       7       8       9       10      11      12      13      14      15
Pr   .01192  .04598  .09902  .15404  .18905  .18905  .15404  .09902  .04598  .01192
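As a quick numerical check, the distribution can be evaluated with scipy.special.comb (a sketch):

from scipy.special import comb

denom = comb(20, 11)
for k in range(6, 16):
    p = comb(k - 1, 5) * comb(20 - k, 5) / denom
    print(k, round(p, 5))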

Solution 2.27 Without loss of generality, assume that the stick is of length 1. Let 𝑥, 𝑦
and (1 − 𝑥 − 𝑦), 0 < 𝑥, 𝑦 < 1, be the length of the 3 pieces. Obviously, 0 < 𝑥 + 𝑦 < 1.
All points in S = {(𝑥, 𝑦) : 𝑥, 𝑦 > 0, 𝑥 + 𝑦 < 1} are uniformly distributed. In order

that the three pieces can form a triangle, the following three conditions should be
satisfied:
(i) 𝑥 + 𝑦 > (1 − 𝑥 − 𝑦)
(ii) 𝑥 + (1 − 𝑥 − 𝑦) > 𝑦
(iii) 𝑦 + (1 − 𝑥 − 𝑦) > 𝑥.
The set of points (𝑥, 𝑦) satisfying (i), (ii) and (iii) is bounded by a triangle of area
1/8. S is bounded by a triangle of area 1/2. Hence, the required probability is 1/4.

Solution 2.28 Consider Fig. 2.1. Suppose that the particle is moving along the circumference of the circle in a counterclockwise direction. Then, using the notation in the diagram, Pr{hit} = φ/2π. Since OD = 1 = OE, OB = √(a² + h²) = OC and the lines DB and EC are tangential to the circle, it follows that the triangles ΔODB and ΔOEC are congruent. Thus m(∠DOB) = m(∠EOC), and it is easily seen that φ = 2α. Now α = tan⁻¹(h/a), and hence, Pr{hit} = (1/π) tan⁻¹(h/a).

Fig. 2.1 Geometry of the solution

Solution 2.29 1 − (0.999)¹⁰⁰ − 100 × (0.001) × (0.999)⁹⁹ − C(100, 2) × (0.001)²(0.999)⁹⁸ = 0.0001504.

Solution 2.30 The probability that n tosses are required is p(n) = (n − 1)(1/2)ⁿ, n ≥ 2. Thus, p(4) = 3 · 1/2⁴ = 3/16.
Solution 2.31 S = {(i_1, ..., i_10) : i_j = 0, 1, j = 1, ..., 10}. One random variable is the number of 1's in an element, i.e., for ω = (i_1, ..., i_10), X_1(ω) = Σ_{j=1}^{10} i_j. Another random variable is the number of zeros to the left of the first one, i.e., X_2(ω) = Σ_{j=1}^{10} Π_{k=1}^{j} (1 − i_k). Notice that X_2(ω) = 0 if i_1 = 1 and X_2(ω) = 10 if i_1 = i_2 = ... = i_10 = 0. The probability distribution of X_1 is Pr{X_1 = k} = C(10, k)/2¹⁰, k = 0, 1, ..., 10. The probability distribution of X_2 is

Pr{X_2 = k} = (1/2)^{k+1} for k = 0, ..., 9, and Pr{X_2 = 10} = (1/2)¹⁰.

Solution 2.32 (i) Since Σ_{x=0}^{∞} 5^x/x! = e⁵, we have Σ_{x=0}^{∞} p(x) = 1; (ii) Pr{X ≤ 1} = e⁻⁵(1 + 5) = 0.0404; (iii) Pr{X ≤ 7} = 0.8666.
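These Poisson probabilities can be reproduced with scipy.stats (a minimal sketch):

from scipy import stats

rv = stats.poisson(5)
print(rv.cdf(1))   # Pr{X <= 1}, approx. 0.0404
print(rv.cdf(7))   # Pr{X <= 7}, approx. 0.8666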
Solution 2.33 (i) Pr{𝑋 = −1} = 0.3; (ii) Pr{−0.5 < 𝑋 < 0} = 0.1; (iii) Pr{0 ≤ 𝑋 <
0.75} = 0.425; (iv) Pr{𝑋 = 1} = 0; (v) 𝐸 {𝑋 } = −0.25, 𝑉 {𝑋 } = 0.4042.
Solution 2.34

E{X} = ∫_0^∞ (1 − F(x)) dx = ∫_0^∞ e^{−x²/2σ²} dx = σ √(π/2).
Solution 2.35

E{X} = (1/N) Σ_{i=1}^{N} i = (N + 1)/2;   E{X²} = (1/N) Σ_{i=1}^{N} i² = (N + 1)(2N + 1)/6;

V{X} = E{X²} − (E{X})² = [2(N + 1)(2N + 1) − 3(N + 1)²]/12 = (N² − 1)/12.
Solution 2.36 Pr{8 < X < 12} = Pr{|X − 10| < 2} ≥ 1 − V{X}/4 = 1 − 0.25/4 = 0.9375.
Solution 2.37 Notice that F(x) is the standard Cauchy distribution. The p-th quantile, x_p, satisfies the equation 1/2 + (1/π) tan⁻¹(x_p) = p, hence x_p = tan(π(p − 1/2)). For p = 0.25, 0.50, 0.75 we get x_.25 = −1, x_.50 = 0, x_.75 = 1, respectively.
Solution 2.38 μ*_l = E{(X − μ_1)^l} = Σ_{j=0}^{l} (−1)^j C(l, j) μ_1^j μ_{l−j}.
When j = l the term is (−1)^l μ_1^l. When j = l − 1 the term is (−1)^{l−1} l μ_1^{l−1} μ_1 = (−1)^{l−1} l μ_1^l. Thus, the sum of the last 2 terms is (−1)^{l−1}(l − 1) μ_1^l and we have

μ*_l = Σ_{j=0}^{l−2} (−1)^j C(l, j) μ_1^j μ_{l−j} + (−1)^{l−1}(l − 1) μ_1^l.

Solution 2.39 We saw in the solution of Exercise 2.33 that μ_1 = −0.25. Moreover, μ_2 = V{X} + μ_1² = 0.4667.
Solution 2.40 M_X(t) = (e^{tb} − e^{ta}) / (t(b − a)), −∞ < t < ∞, a < b.
E{X} = (a + b)/2;  V{X} = (b − a)²/12.

Solution 2.41 (i) For t < λ we have M(t) = (1 − t/λ)⁻¹,

M′(t) = (1/λ)(1 − t/λ)⁻², μ_1 = M′(0) = 1/λ
M″(t) = (2/λ²)(1 − t/λ)⁻³, μ_2 = M″(0) = 2/λ²
M⁽³⁾(t) = (6/λ³)(1 − t/λ)⁻⁴, μ_3 = M⁽³⁾(0) = 6/λ³
M⁽⁴⁾(t) = (24/λ⁴)(1 − t/λ)⁻⁵, μ_4 = M⁽⁴⁾(0) = 24/λ⁴.

(ii) The central moments are

μ*_1 = 0,
μ*_2 = 1/λ²,
μ*_3 = μ_3 − 3μ_2μ_1 + 2μ_1³ = 6/λ³ − 6/λ³ + 2/λ³ = 2/λ³,
μ*_4 = μ_4 − 4μ_3μ_1 + 6μ_2(μ_1)² − 3μ_1⁴ = (1/λ⁴)(24 − 4·6 + 6·2 − 3) = 9/λ⁴.

(iii) The index of kurtosis is β_4 = μ*_4/(μ*_2)² = 9.

Solution 2.42 scipy.stats.binom provides the distribution information.

x = list(range(15))
table = pd.DataFrame({
'x': x,
'p.d.f.': [stats.binom(20, 0.17).pmf(x) for x in x],
'c.d.f.': [stats.binom(20, 0.17).cdf(x) for x in x],
})
print(table)

x p.d.f. c.d.f.
0 0 2.407475e-02 0.024075
1 1 9.861947e-02 0.122694
2 2 1.918921e-01 0.314586
3 3 2.358192e-01 0.550406
4 4 2.052764e-01 0.755682
5 5 1.345426e-01 0.890224
6 6 6.889229e-02 0.959117
7 7 2.822094e-02 0.987338
8 8 9.392812e-03 0.996731
9 9 2.565105e-03 0.999296
10 10 5.779213e-04 0.999874
11 11 1.076086e-04 0.999981
12 12 1.653023e-05 0.999998
13 13 2.083514e-06 1.000000
14 14 2.133719e-07 1.000000

Solution 2.43 𝑄 1 = 2, Med = 3, 𝑄 3 = 4.



Solution 2.44 𝐸 {𝑋 } = 15.75, 𝜎 = 3.1996.

Solution 2.45 Pr{no defective chip on the board} = p⁵⁰. Solving p⁵⁰ = 0.99 yields p = (0.99)^{1/50} = 0.999799.
Solution 2.46 Notice first that lim_{n→∞, np→λ} b(0; n, p) = lim_{n→∞} (1 − λ/n)ⁿ = e^{−λ}. Moreover, for all j = 0, 1, ..., n − 1,

b(j + 1; n, p) / b(j; n, p) = ((n − j)/(j + 1)) · (p/(1 − p)).

Thus, by induction on j, for j > 0,

lim_{n→∞, np→λ} b(j; n, p) = lim_{n→∞, np→λ} b(j − 1; n, p) · ((n − j + 1)/j) · (p/(1 − p))
= e^{−λ} (λ^{j−1}/(j − 1)!) lim_{n→∞, np→λ} ((n − j + 1)p / (j(1 − λ/n)))
= e^{−λ} (λ^{j−1}/(j − 1)!) · (λ/j) = e^{−λ} λ^j/j!.

Solution 2.47 Using the Poisson approximation, λ = n · p = 1000 · 10⁻³ = 1.
Pr{X < 4} = e⁻¹ Σ_{j=0}^{3} 1/j! = 0.9810.

Solution 2.48 E{X} = 20 · (350/500) = 14;  V{X} = 20 · (350/500) · (150/500) · (1 − 19/499) = 4.0401.
Solution 2.49 Let 𝑋 be the number of defective items observed.
Pr{𝑋 > 1} = 1 − Pr{𝑋 ≤ 1} = 1 − 𝐻 (1; 500, 5, 50) = 0.0806.
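As a check, the hypergeometric tail probability can be computed with scipy.stats; the arguments below (population size, number of defectives, sample size) are scipy's convention.

from scipy import stats

# lot of 500 with 5 defective items, sample of 50 drawn without replacement
print(1 - stats.hypergeom(500, 5, 50).cdf(1))   # approx. 0.0806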

Solution 2.50

Pr{R} = 1 − H(3; 100, 10, 20) + Σ_{i=1}^{3} h(i; 100, 10, 20) [1 − H(3 − i; 80, 10 − i, 40)] = 0.87395.

Solution 2.51 The m.g.f. of the Poisson distribution with parameter λ, P(λ), is M(t) = exp{−λ(1 − e^t)}, −∞ < t < ∞. Accordingly,

M′(t) = λ M(t) e^t
M″(t) = (λ²e^{2t} + λe^t) M(t)
M⁽³⁾(t) = (λ³e^{3t} + 3λ²e^{2t} + λe^t) M(t)
M⁽⁴⁾(t) = (λ⁴e^{4t} + 6λ³e^{3t} + 7λ²e^{2t} + λe^t) M(t).

The moments and central moments are

μ_1 = λ,  μ*_1 = 0
μ_2 = λ² + λ,  μ*_2 = λ
μ_3 = λ³ + 3λ² + λ,  μ*_3 = λ
μ_4 = λ⁴ + 6λ³ + 7λ² + λ,  μ*_4 = 3λ² + λ.

Thus, the indices of skewness and kurtosis are β_3 = λ^{−1/2} and β_4 = 3 + 1/λ. For λ = 10 we have β_3 = 0.3162 and β_4 = 3.1.

Solution 2.52 Let 𝑋 be the number of blemishes observed. Pr{𝑋 > 2} = 0.1912.

Solution 2.53 Using the Poisson approximation with 𝑁 = 8000 and 𝑝 = 380 × 10−6 ,
we have 𝜆 = 3.04 and Pr{𝑋 > 6} = 0.0356, where 𝑋 is the number of insertion
errors in 2 hours of operation.

Solution 2.54 The distribution of 𝑁 is geometric with 𝑝 = 0.00038. 𝐸 {𝑁 } = 2631.6,


𝜎𝑁 = 2631.08.

Solution 2.55 Using Python we obtain that for the 𝑁 𝐵( 𝑝, 𝑘) with 𝑝 = 0.01 and
𝑘 = 3, 𝑄 1 = 170, 𝑀𝑒 = 265, and 𝑄 3 = 389.

stats.nbinom.ppf([0.25, 0.5, 0.75], 3, 0.01)

array([170., 265., 389.])

Solution 2.56 By definition, the m.g.f. of NB(p, k) is

M(t) = Σ_{i=0}^{∞} C(k + i − 1, k − 1) p^k ((1 − p)e^t)^i,  for t < −log(1 − p).

Thus M(t) = (p^k / (1 − (1 − p)e^t)^k) Σ_{i=0}^{∞} C(k + i − 1, k − 1) (1 − (1 − p)e^t)^k ((1 − p)e^t)^i. Since the last infinite series sums to one, M(t) = (p / (1 − (1 − p)e^t))^k, t < −log(1 − p).
Solution 2.57 M(t) = pe^t / (1 − (1 − p)e^t), for t < −log(1 − p). The derivatives of M(t) are

M′(t) = M(t)(1 − (1 − p)e^t)⁻¹
M″(t) = M(t)(1 − (1 − p)e^t)⁻²(1 + (1 − p)e^t)
M⁽³⁾(t) = M(t)(1 − (1 − p)e^t)⁻³ · [(1 + (1 − p)e^t)² + 2(1 − p)e^t]
M⁽⁴⁾(t) = M(t)(1 − (1 − p)e^t)⁻⁴ [1 + (1 − p)³e^{3t} + 11(1 − p)e^t + 11(1 − p)²e^{2t}].

The moments are

μ_1 = 1/p
μ_2 = (2 − p)/p²
μ_3 = [(2 − p)² + 2(1 − p)]/p³ = (6 − 6p + p²)/p³
μ_4 = [11(1 − p)(2 − p) + (1 − p)³ + 1]/p⁴ = (24 − 36p + 14p² − p³)/p⁴.

The central moments are

μ*_1 = 0
μ*_2 = (1 − p)/p²
μ*_3 = (1 − p)(2 − p)/p³
μ*_4 = (9 − 18p + 10p² − p³)/p⁴.

Thus the indices of skewness and kurtosis are β*_3 = (2 − p)/√(1 − p) and β*_4 = (9 − 9p + p²)/(1 − p).
Solution 2.58 If there are 𝑛 chips, 𝑛 > 50, the probability of at least 50 good ones is
1 − 𝐵(49; 𝑛, 0.998). Thus, 𝑛 is the smallest integer > 50 for which 𝐵(49; 𝑛, 0.998) <
0.05. It is sufficient to order 51 chips.
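The smallest order quantity can be found by a direct search over the binomial c.d.f.; a minimal sketch:

from scipy import stats

n = 50
while stats.binom(n, 0.998).cdf(49) >= 0.05:
    n += 1
print(n)   # 51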
Solution 2.59 If 𝑋 has a geometric distribution then, for every 𝑗, 𝑗 = 1, 2, . . .
Pr{𝑋 > 𝑗 } = (1 − 𝑝) 𝑗 . Thus,

Pr{𝑋 > 𝑛 + 𝑚}
Pr{𝑋 > 𝑛 + 𝑚 | 𝑋 > 𝑚} =
Pr{𝑋 > 𝑚}
(1 − 𝑝) 𝑛+𝑚
=
(1 − 𝑝) 𝑚
= (1 − 𝑝) 𝑛
= Pr{𝑋 > 𝑛}.

Solution 2.60 For 0 < 𝑦 < 1, Pr{𝐹 (𝑋) ≤ 𝑦} = Pr{𝑋 ≤ 𝐹 −1 (𝑦)} = 𝐹 (𝐹 −1 (𝑦)) = 𝑦.
Hence, the distribution of 𝐹 (𝑋) is uniform on (0, 1). Conversely, if 𝑈 has a uniform
distribution on (0, 1), then

Pr{𝐹 −1 (𝑈) ≤ 𝑥} = Pr{𝑈 ≤ 𝐹 (𝑥)} = 𝐹 (𝑥).



Solution 2.61 E{U(10, 50)} = 30;  V{U(10, 50)} = 1600/12 = 133.33;  σ{U(10, 50)} = 40/(2√3) = 11.547.
Solution 2.62 Let 𝑋 = − log(𝑈) where 𝑈 has a uniform distribution on (0,1).

Pr{𝑋 ≤ 𝑥} = Pr{− log(𝑈) ≤ 𝑥}


= Pr{𝑈 ≥ 𝑒 −𝑥 }
= 1 − 𝑒 −𝑥 .

Therefore 𝑋 has an exponential distribution 𝐸 (1).

Solution 2.63 (i) Pr{92 < 𝑋 < 108} = 0.4062; (ii) Pr{𝑋 > 105} = 0.3694;
(iii) Pr{2𝑋 + 5 < 200} = Pr{𝑋 < 97.5} = 0.4338.

rv = stats.norm(100, 15)
print('(i)', rv.cdf(108) - rv.cdf(92))
print('(ii)', 1 - rv.cdf(105))
print('(iii)', rv.cdf((200 - 5)/2))

(i) 0.4061971427922976
(ii) 0.36944134018176367
(iii) 0.43381616738909634

Solution 2.64 Let 𝑧 𝛼 denote the 𝛼 quantile of a 𝑁 (0, 1) distribution. Then the two
equations 𝜇 + 𝑧 .9 𝜎 = 15 and 𝜇 + 𝑧 .99 𝜎 = 20 yield the solution 𝜇 = 8.8670 and
𝜎 = 4.7856.
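The two linear equations can also be solved numerically (a sketch; the z quantiles come from scipy.stats):

from scipy import stats

z90, z99 = stats.norm.ppf([0.9, 0.99])
sigma = (20 - 15) / (z99 - z90)
mu = 15 - z90 * sigma
print(mu, sigma)   # approx. 8.867 and 4.786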

Solution 2.65 Due to symmetry, Pr{𝑌 > 0} = Pr{𝑌 < 0} = Pr{𝐸 < 𝑣}, where
𝐸 ∼ 𝑁 (0, 1). If the probability of a bit error is 𝛼 = 0.01, then Pr{𝐸 < 𝑣} = Φ(𝑣) =
1 − 𝛼 = 0.99.
Thus 𝑣 = 𝑧 .99 = 2.3263.

Solution 2.66 Let X_p denote the diameter of an aluminum pin and X_h denote the size of a hole drilled in an aluminum plate. If X_p ∼ N(10, 0.02) and X_h ∼ N(μ_d, 0.02), then the probability that the pin will not enter the hole is Pr{X_h − X_p < 0}. Now X_h − X_p ∼ N(μ_d − 10, √(0.02² + 0.02²)), and for Pr{X_h − X_p < 0} = 0.01 we obtain μ_d = 10.0658 mm. (The fact that the sum of two independent normal random variables is normally distributed should be given to the student, since it has not yet been covered in the text.)

Solution 2.67 For X_1, ..., X_n i.i.d. N(μ, σ²), Y = Σ_{i=1}^{n} i X_i ∼ N(μ_Y, σ_Y²), where μ_Y = μ Σ_{i=1}^{n} i = μ n(n + 1)/2 and σ_Y² = σ² Σ_{i=1}^{n} i² = σ² n(n + 1)(2n + 1)/6.
Solution 2.68 Pr{𝑋 > 300} = Pr{log 𝑋 > 5.7038} = 1 − Φ(0.7038) = 0.24078.
Solution 2.69 For X ∼ e^{N(μ,σ)}, X ∼ e^Y where Y ∼ N(μ, σ), and M_Y(t) = e^{μt + σ²t²/2}.

ξ = E{X} = E{e^Y} = M_Y(1) = e^{μ + σ²/2}.

Since E{X²} = E{e^{2Y}} = M_Y(2) = e^{2μ + 2σ²} we have

V{X} = e^{2μ+2σ²} − e^{2μ+σ²} = e^{2μ+σ²}(e^{σ²} − 1) = ξ²(e^{σ²} − 1).

Solution 2.70 The quantiles of 𝐸 (𝛽) are 𝑥 𝑝 = −𝛽 log(1 − 𝑝). Hence,


𝑄 1 = 0.2877𝛽, 𝑀𝑒 = 0.6931𝛽, 𝑄 3 = 1.3863𝛽.

Solution 2.71 If 𝑋 ∼ 𝐸 (𝛽), Pr{𝑋 > 𝛽} = 𝑒 −𝛽/𝛽 = 𝑒 −1 = 0.3679.

Solution 2.72 The m.g.f. of E(β) is

M(t) = (1/β) ∫_0^∞ e^{tx − x/β} dx = (1/β) ∫_0^∞ e^{−(1 − tβ)x/β} dx = (1 − tβ)⁻¹,  for t < 1/β.
Solution 2.73 By independence,

M_{X_1+X_2+X_3}(t) = E{e^{t(X_1+X_2+X_3)}} = Π_{i=1}^{3} E{e^{tX_i}} = (1 − βt)⁻³,  t < 1/β.

Thus X_1 + X_2 + X_3 ∼ G(3, β) (see Exercise 2.76). Using the formula of the next exercise,

Pr{X_1 + X_2 + X_3 ≥ 3β} = Pr{βG(3, 1) ≥ 3β} = Pr{G(3, 1) ≥ 3} = e⁻³ Σ_{j=0}^{2} 3^j/j! = 0.4232.
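The final probability can be checked with the gamma survival function in scipy.stats:

from scipy import stats

print(stats.gamma(3).sf(3))   # Pr{G(3,1) >= 3}, approx. 0.4232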

Solution 2.74

G(t; k, λ) = (λ^k/(k − 1)!) ∫_0^t x^{k−1} e^{−λx} dx
= (λ^k/k!) t^k e^{−λt} + (λ^{k+1}/k!) ∫_0^t x^k e^{−λx} dx
= (λ^k/k!) t^k e^{−λt} + (λ^{k+1}/(k + 1)!) t^{k+1} e^{−λt} + (λ^{k+2}/(k + 1)!) ∫_0^t x^{k+1} e^{−λx} dx
= ···
= e^{−λt} Σ_{j=k}^{∞} (λt)^j/j!
= 1 − e^{−λt} Σ_{j=0}^{k−1} (λt)^j/j!.

Solution 2.75 Γ(1.17) = 0.9267, Γ(1/2) = 1.77245, Γ(3/2) = (1/2)Γ(1/2) = 0.88623.
from scipy.special import gamma
print(gamma(1.17), gamma(1 / 2), gamma(3 / 2))

0.9266996106177159 1.7724538509055159 0.8862269254527579

Solution 2.76 The moment generating function of the sum of independent random variables is the product of their respective m.g.f.'s. Thus, if X_1, ..., X_k are i.i.d. E(β), using the result of Exercise 2.73, M_S(t) = Π_{i=1}^{k} (1 − βt)⁻¹ = (1 − βt)⁻ᵏ, t < 1/β, where S = Σ_{i=1}^{k} X_i. On the other hand, (1 − βt)⁻ᵏ is the m.g.f. of G(k, β).
Solution 2.77 The expected value and variance of W(2, 3.5) are

E{W(2, 3.5)} = 3.5 × Γ(1 + 1/2) = 3.1018,
V{W(2, 3.5)} = (3.5)² [Γ(1 + 2/2) − Γ²(1 + 1/2)] = 2.6289.
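These values agree with scipy's Weibull implementation (weibull_min with shape 2 and scale 3.5); a sketch:

from scipy import stats

rv = stats.weibull_min(2, scale=3.5)
print(rv.mean(), rv.var())   # approx. 3.1018 and 2.6289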

Solution 2.78 Let T be the number of days until failure. T ∼ W(1.5, 500) ∼ 500·W(1.5, 1).
Pr{T ≥ 600} = Pr{W(1.5, 1) ≥ 6/5} = e^{−(6/5)^{1.5}} = 0.2686.
 
Solution 2.79 Let X ∼ Beta(1/2, 3/2).
E{X} = (1/2)/((1/2) + (3/2)) = 1/4,  V{X} = (1/2 · 3/2)/(2² · 3) = 1/16, and σ{X} = 1/4.

Solution 2.80 Let X ∼ Beta(ν, ν). The first four moments are

μ_1 = ν/(2ν) = 1/2
μ_2 = B(ν + 2, ν)/B(ν, ν) = (ν + 1)/(2(2ν + 1))
μ_3 = B(ν + 3, ν)/B(ν, ν) = (ν + 1)(ν + 2)/(2(2ν + 1)(2ν + 2))
μ_4 = B(ν + 4, ν)/B(ν, ν) = (ν + 1)(ν + 2)(ν + 3)/(2(2ν + 1)(2ν + 2)(2ν + 3)).

The variance is σ² = 1/(4(2ν + 1)) and the fourth central moment is

μ*_4 = μ_4 − 4μ_3μ_1 + 6μ_2μ_1² − 3μ_1⁴ = 3/(16(3 + 8ν + 4ν²)).

Finally, the index of kurtosis is β_2 = μ*_4/σ⁴ = 3(1 + 2ν)/(3 + 2ν).
Solution 2.81 Let (X, Y) have the joint p.d.f.

f(x, y) = 1/2 if (x, y) ∈ S, and 0 otherwise.

(i) The marginal distributions of X and Y have p.d.f.'s f_X(x) = (1/2) ∫_{−1+|x|}^{1−|x|} dy = 1 − |x|, −1 < x < 1, and by symmetry, f_Y(y) = 1 − |y|, −1 < y < 1.
(ii) E{X} = E{Y} = 0, V{X} = V{Y} = 2 ∫_0^1 y²(1 − y) dy = 2B(3, 2) = 1/6.
Solution 2.82 The marginal p.d.f. of Y is f(y) = e^{−y}, y > 0, that is, Y ∼ E(1). The conditional p.d.f. of X, given Y = y, is f(x | y) = (1/y)e^{−x/y}, which is the p.d.f. of an exponential with parameter y. Thus, E{X | Y = y} = y, and E{X} = E{E{X | Y}} = E{Y} = 1. Also,

E{XY} = E{Y E{X | Y}} = E{Y²} = 2.

Hence, cov(X, Y) = E{XY} − E{X}E{Y} = 1. The variance of Y is σ_Y² = 1. The variance of X is

σ_X² = E{V{X | Y}} + V{E{X | Y}} = E{Y²} + V{Y} = 2 + 1 = 3.

The correlation between X and Y is ρ_XY = 1/√3.
Solution 2.83 Let (X, Y) have joint p.d.f. f(x, y) = 2 if (x, y) ∈ T, and 0 otherwise.
The marginal densities of X and Y are
f_X(x) = 2(1 − x), 0 ≤ x ≤ 1
f_Y(y) = 2(1 − y), 0 ≤ y ≤ 1.
Notice that f(x, y) ≠ f_X(x) f_Y(y) for x = 1/2, y = 1/4. Thus, X and Y are dependent.

cov(X, Y) = E{XY} − E{X}E{Y} = E{XY} − 1/9.

E{XY} = 2 ∫_0^1 ∫_0^{1−x} x y dy dx = ∫_0^1 x(1 − x)² dx = B(2, 3) = 1/12.

Hence, cov(X, Y) = 1/12 − 1/9 = −1/36.
Solution 2.84 J | N ∼ B(N, p); N ∼ P(λ). E{N} = λ, V{N} = λ, E{J} = λp.

V{J} = E{V{J | N}} + V{E{J | N}} = E{Np(1 − p)} + V{Np} = λp(1 − p) + p²λ = λp.
E{JN} = E{N E{J | N}} = pE{N²} = p(λ + λ²).

Hence, cov(J, N) = pλ(1 + λ) − pλ·λ = pλ and ρ_JN = pλ/(λ√p) = √p.
Solution 2.85 Let X ∼ G(2, 100) ∼ 100·G(2, 1) and Y ∼ W(1.5, 500) ∼ 500·W(1.5, 1). Then XY ∼ 5 × 10⁴ G(2, 1) · W(1.5, 1) and V{XY} = 25 × 10⁸ · V{GW}, where G ∼ G(2, 1) and W ∼ W(3/2, 1).

V{GW} = E{G²}V{W} + E²{W}V{G} = 6[Γ(1 + 4/3) − Γ²(1 + 2/3)] + 2·Γ²(1 + 2/3) = 3.88404.

Thus V{XY} = 9.7101 × 10⁹.

Solution 2.86 Using the notation of Example 2.33,

(i) Pr{J_2 + J_3 ≤ 20} = B(20; 3500, 0.005) = 0.7699.
(ii) J_3 | J_2 = 15 ∼ B(3485, 0.004/0.999).
(iii) λ = 3485 × 0.004/0.999 = 13.954, Pr{J_2 ≤ 15 | J_3 = 15} ≈ P(15; 13.954) = 0.6739.

Solution 2.87 Using the notation of Example 2.34, the joint p.d.f. of J_1 and J_2 is

p(j_1, j_2) = C(20, j_1) C(50, j_2) C(30, 20 − j_1 − j_2) / C(100, 20),  0 ≤ j_1, j_2; j_1 + j_2 ≤ 20.

The marginal distribution of J_1 is H(100, 20, 20). The marginal distribution of J_2 is H(100, 50, 20). Accordingly,

V{J_1} = 20 × 0.2 × 0.8 × (1 − 19/99) = 2.585859,
V{J_2} = 20 × 0.5 × 0.5 × (1 − 19/99) = 4.040404.

The conditional distribution of J_1, given J_2, is H(50, 20, 20 − J_2). Hence,

E{J_1 J_2} = E{E{J_1 J_2 | J_2}} = E{J_2(20 − J_2) × 2/5} = 8E{J_2} − 0.4E{J_2²} = 80 − 0.4 × 104.040404 = 38.38381

and cov(J_1, J_2) = −1.61616. Finally, the correlation between J_1 and J_2 is ρ = −1.61616/√(2.585859 × 4.040404) = −0.50.

Solution 2.88 𝑉 {𝑌 | 𝑋 } = 150, 𝑉 {𝑌 } = 200, 𝑉 {𝑌 | 𝑋 } = 𝑉 {𝑌 }(1 − 𝜌 2 ). Hence


|𝜌| = 0.5. The sign of 𝜌 cannot be determined.
 
Solution 2.89 (i) X_(1) ∼ E(100/10), E{X_(1)} = 10; (ii) E{X_(10)} = 100 Σ_{i=1}^{10} 1/i = 292.8968.

Solution 2.90 J ∼ B(10, 0.95). If {J = j}, j ≥ 1, X_(1) is the minimum of a sample of j i.i.d. E(10) random variables. Thus X_(1) | J = j ∼ E(10/j).

(i) Pr{J = k, X_(1) ≤ x} = b(k; 10, 0.95)(1 − e^{−kx/10}), k = 1, 2, ..., 10.
(ii) First note that Pr{J ≥ 1} = 1 − (0.05)¹⁰ ≈ 1.

Pr{X_(1) ≤ x | J ≥ 1} = Σ_{k=1}^{10} b(k; 10, 0.95)(1 − e^{−kx/10})
= 1 − Σ_{k=1}^{10} C(10, k)(0.95e^{−x/10})^k (0.05)^{10−k}
= 1 − [0.05 + 0.95e^{−x/10}]¹⁰ + (0.05)¹⁰
≈ 1 − (0.05 + 0.95e^{−x/10})¹⁰.

Solution 2.91 The median is M_e = X_(6).
(a) The p.d.f. of M_e is f_(6)(x) = (11!/(5!5!)) λ(1 − e^{−λx})⁵ e^{−6λx}, x ≥ 0.
(b) The expected value of M_e is

E{X_(6)} = 2772λ ∫_0^∞ x(1 − e^{−λx})⁵ e^{−6λx} dx
= 2772λ Σ_{j=0}^{5} (−1)^j C(5, j) ∫_0^∞ x e^{−λx(6+j)} dx
= 2772 Σ_{j=0}^{5} (−1)^j C(5, j) 1/(λ(6 + j)²)
= 0.73654/λ.

Solution 2.92 Let X and Y be i.i.d. E(β), T = X + Y and W = X − Y.

V{T + (1/2)W} = V{(3/2)X + (1/2)Y} = β²((3/2)² + (1/2)²) = 2.5β².

Solution 2.93 cov(𝑋, 𝑋 + 𝑌 ) = cov(𝑋, 𝑋) + cov(𝑋, 𝑌 ) = 𝑉 {𝑋 } = 𝜎 2 .

Solution 2.94 𝑉 {𝛼𝑋 + 𝛽𝑌 } = 𝛼2 𝜎𝑋2 + 𝛽2 𝜎𝑌2 + 2𝛼𝛽cov(𝑋, 𝑌 ) = 𝛼2 𝜎𝑋2 + 𝛽2 𝜎𝑌2 +


2𝛼𝛽𝜌 𝑋𝑌 𝜎𝑋 𝜎𝑌 .

Solution 2.95 Let U ∼ N(0, 1) and X ∼ N(μ, σ). We assume that U and X are independent. Then Φ(X) = Pr{U < X | X} and therefore

E{Φ(X)} = E{Pr{U < X | X}} = Pr{U < X} = Pr{U − X < 0} = Φ(μ/√(1 + σ²)).

The last equality follows from the fact that U − X ∼ N(−μ, √(1 + σ²)).

Solution 2.96 Let U_1, U_2, X be independent random variables; U_1, U_2 i.i.d. N(0, 1). Then Φ²(X) = Pr{U_1 ≤ X, U_2 ≤ X | X}. Hence

E{Φ²(X)} = Pr{U_1 ≤ X, U_2 ≤ X} = Pr{U_1 − X ≤ 0, U_2 − X ≤ 0}.

Since (U_1 − X, U_2 − X) have a bivariate normal distribution with means (−μ, −μ) and variance-covariance matrix

V = [ 1 + σ²    σ²
       σ²    1 + σ² ],

it follows that

E{Φ²(X)} = Φ_2(μ/√(1 + σ²), μ/√(1 + σ²); σ²/(1 + σ²)).
Solution 2.97 Since 𝑋 and 𝑌 are independent, 𝑇 = 𝑋 + 𝑌 ∼ 𝑃(12), Pr{𝑇 > 15} =
0.1556.
Solution 2.98 Let F_2(x) = ∫_{−∞}^{x} f_2(z) dz be the c.d.f. of X_2. Since X_1 and X_2 are independent,

Pr{Y ≤ y} = ∫_{−∞}^{∞} f_1(x) Pr{X_2 ≤ y − x} dx = ∫_{−∞}^{∞} f_1(x) F_2(y − x) dx.

Therefore, the p.d.f. of Y is

g(y) = (d/dy) Pr{Y ≤ y} = ∫_{−∞}^{∞} f_1(x) f_2(y − x) dx.

Solution 2.99 Let Y = X_1 + X_2 where X_1, X_2 are i.i.d. uniform on (0, 1). Then the p.d.f.'s are f_1(x) = f_2(x) = I{0 < x < 1} and

g(y) = ∫_0^y dx = y, if 0 ≤ y < 1;  g(y) = ∫_{y−1}^{1} dx = 2 − y, if 1 ≤ y ≤ 2.
Solution 2.100 X_1, X_2 are i.i.d. E(1). U = X_1 − X_2. Pr{U ≤ u} = ∫_0^∞ e^{−x} Pr{X_1 ≤ u + x} dx. Notice that −∞ < u < ∞ and Pr{X_1 ≤ u + x} = 0 if x + u < 0. Let a⁺ = max(a, 0). Then

Pr{U ≤ u} = ∫_0^∞ e^{−x}(1 − e^{−(u+x)⁺}) dx = 1 − ∫_0^∞ e^{−x−(u+x)⁺} dx
= 1 − (1/2)e^{−u}, if u ≥ 0;  = (1/2)e^{−|u|}, if u < 0.

Thus, the p.d.f. of U is g(u) = (1/2)e^{−|u|}, −∞ < u < ∞.
Solution 2.101 T = X_1 + ··· + X_20 ∼ N(20μ, 20σ²). Pr{T ≤ 50} ≈ Φ((50 − 40)/44.7214) = 0.5885.

Solution 2.102 X ∼ B(200, 0.15). μ = np = 30, σ = √(np(1 − p)) = 5.0497.
Pr{25 < X < 35} ≈ Φ((34.5 − 30)/5.0497) − Φ((25.5 − 30)/5.0497) = 0.6271.

Solution 2.103 X ∼ P(200). Pr{190 < X < 210} ≈ 2Φ(9.5/√200) − 1 = 0.4983.

Solution 2.104 X ∼ Beta(3, 5). μ = E{X} = 3/8 = 0.375. σ = √V{X} = (3·5/(8²·9))^{1/2} = 0.161374.
Pr{|X̄_200 − 0.375| < 0.2282} ≈ 2Φ(√200 × 0.2282/0.161374) − 1 = 1.

Solution 2.105 𝑡 .95 [10] = 1.8125, 𝑡 .95 [15] = 1.7531, 𝑡 .95 [20] = 1.7247.

print(stats.t.ppf(0.95, 10))
print(stats.t.ppf(0.95, 15))
print(stats.t.ppf(0.95, 20))

1.8124611228107335
1.7530503556925547
1.7247182429207857

Solution 2.106 𝐹.95 [10, 30] = 2.1646, 𝐹.95 [15, 30] = 2.0148, 𝐹.95 [20, 30] =
1.9317.

print(stats.f.ppf(0.95, 10, 30))


print(stats.f.ppf(0.95, 15, 30))
print(stats.f.ppf(0.95, 20, 30))

2.164579917125473
2.0148036912954885
1.931653475236928

Solution 2.107 The solution to this problem is based on the fact, which is not discussed in the text, that t[ν] is distributed like the ratio of two independent random variables, N(0, 1) and √(χ²[ν]/ν). Accordingly, t[n] ∼ N(0, 1)/√(χ²[n]/n), where N(0, 1) and χ²[n] are independent, and t²[n] ∼ (N(0, 1))²/(χ²[n]/n) ∼ F[1, n]. Thus, since Pr{F[1, n] ≤ F_{1−α}[1, n]} = 1 − α,

Pr{−√(F_{1−α}[1, n]) ≤ t[n] ≤ √(F_{1−α}[1, n])} = 1 − α.

It follows that √(F_{1−α}[1, n]) = t_{1−α/2}[n], or F_{1−α}[1, n] = t²_{1−α/2}[n]. (If you assign this problem, please inform the students of the above fact.)

Solution 2.108 A random variable F[ν_1, ν_2] is distributed like the ratio of two independent random variables χ²[ν_1]/ν_1 and χ²[ν_2]/ν_2. Accordingly, F[ν_1, ν_2] ∼ (χ²[ν_1]/ν_1)/(χ²[ν_2]/ν_2) and

1 − α = Pr{F[ν_1, ν_2] ≤ F_{1−α}[ν_1, ν_2]}
= Pr{(χ_1²[ν_1]/ν_1)/(χ_2²[ν_2]/ν_2) ≤ F_{1−α}[ν_1, ν_2]}
= Pr{(χ_2²[ν_2]/ν_2)/(χ_1²[ν_1]/ν_1) ≥ 1/F_{1−α}[ν_1, ν_2]}
= Pr{F[ν_2, ν_1] ≥ 1/F_{1−α}[ν_1, ν_2]}
= Pr{F[ν_2, ν_1] ≥ F_α[ν_2, ν_1]}.

Hence F_{1−α}[ν_1, ν_2] = 1/F_α[ν_2, ν_1].
Solution 2.109 Using the fact that t[ν] ∼ N(0, 1)/√(χ²[ν]/ν), where N(0, 1) and χ²[ν] are independent,

V{t[ν]} = V{N(0, 1)/√(χ²[ν]/ν)}
= E{V{N(0, 1)/√(χ²[ν]/ν) | χ²[ν]}} + V{E{N(0, 1)/√(χ²[ν]/ν) | χ²[ν]}}.

By independence, V{N(0, 1)/√(χ²[ν]/ν) | χ²[ν]} = ν/χ²[ν], and E{N(0, 1)/√(χ²[ν]/ν) | χ²[ν]} = 0.
Thus, V{t[ν]} = ν E{1/χ²[ν]}. Since χ²[ν] ∼ G(ν/2, 2) ∼ 2G(ν/2, 1),

E{1/χ²[ν]} = (1/2) · (1/Γ(ν/2)) ∫_0^∞ x^{ν/2−2} e^{−x} dx = (1/2) · Γ(ν/2 − 1)/Γ(ν/2) = (1/2) · 1/(ν/2 − 1) = 1/(ν − 2).

Finally, V{t[ν]} = ν/(ν − 2), ν > 2.
Solution 2.110 E{F[3, 10]} = 10/8 = 1.25, V{F[3, 10]} = (2 · 10² · 11)/(3 · 8² · 6) = 1.9097.
Chapter 3
Statistical Inference and Bootstrapping

Import required modules and define required functions

import random
import math
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import pingouin as pg
import mistat
import os
os.environ['OUTDATED_IGNORE'] = '1'

Solution 3.1 By the WLLN, for any 𝜖 > 0, lim𝑛→∞ Pr{|𝑀𝑙 − 𝜇𝑙 | < 𝜖 } = 1. Hence,
𝑀𝑙 is a consistent estimator of the 𝑙-th moment.
Solution 3.2 Using the CLT, Pr{|X̄_n − μ| < 1} ≈ 2Φ(√n/σ) − 1. To determine the sample size n so that this probability is 0.95, we set 2Φ(√n/σ) − 1 = 0.95 and solve for n. This gives √n/σ = z_.975 = 1.96. Thus n ≥ σ²(1.96)² = 424 for σ = 10.5.
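A direct computation of the required sample size, as a sketch:

import numpy as np
from scipy import stats

sigma = 10.5
n = (sigma * stats.norm.ppf(0.975))**2
print(np.ceil(n))   # 424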

Solution 3.3 ξ̂_p = X̄_n + z_p σ̂_n, where σ̂_n² = (1/n) Σ_{i=1}^{n} (X_i − X̄_n)².
Solution 3.4 Let (𝑋1 , 𝑌1 ), . . . , (𝑋𝑛 , 𝑌𝑛 ) be a random sample from a bivariate normal
distribution with density 𝑓 (𝑥, 𝑦; 𝜇, 𝜂, 𝜎𝑋 , 𝜎𝑌 , 𝜌) as given in Eq. (4.6.6). Let 𝑍𝑖 =
𝑋𝑖𝑌𝑖 for 𝑖 = 1, . . . , 𝑛. Then the first moment of 𝑍 is given by

𝜇1 (𝐹𝑍 ) = 𝐸 {𝑍 }
∫ ∞∫ ∞
= 𝑥𝑦 𝑓 (𝑥, 𝑦; 𝜇, 𝜂, 𝜎𝑋 , 𝜎𝑌 , 𝜌) d𝑥 d𝑦
−∞ −∞
= 𝜇𝜂 + 𝜌𝜎𝑋 𝜎𝑌 .


Using this fact, as well as the first 2 moments of X and Y, we get the following moment equations:

(1/n) Σ X_i = μ,  (1/n) Σ Y_i = η,
(1/n) Σ X_i² = σ_X² + μ²,  (1/n) Σ Y_i² = σ_Y² + η²,
(1/n) Σ X_iY_i = μη + ρσ_Xσ_Y.

Solving these equations for the 5 parameters gives

ρ̂_n = [(1/n) Σ X_iY_i − X̄Ȳ] / {[(1/n) Σ X_i² − X̄²] · [(1/n) Σ Y_i² − Ȳ²]}^{1/2},

or equivalently,

ρ̂_n = (1/n) Σ (X_i − X̄_n)(Y_i − Ȳ_n) / {[(1/n) Σ (X_i − X̄)²] · [(1/n) Σ (Y_i − Ȳ)²]}^{1/2}.

Solution 3.5 The two first moments are

μ_1 = ν_1/(ν_1 + ν_2),  μ_2 = ν_1(ν_1 + 1)/((ν_1 + ν_2)(ν_1 + ν_2 + 1)).

Equating the theoretical moments to the sample moments M_1 = (1/n) Σ X_i and M_2 = (1/n) Σ X_i², we obtain with σ̂_n² = M_2 − M_1² the moment equation estimators

ν̂_1 = M_1(M_1 − M_2)/σ̂_n²  and  ν̂_2 = (M_1 − M_2)(1 − M_1)/σ̂_n².
Solution 3.6 V{Ȳ_w} = (1/(Σ_{i=1}^{k} w_i)²) Σ_{i=1}^{k} w_i² σ_i²/n_i. Let λ_i = w_i/Σ_{j=1}^{k} w_j, so that Σ_{i=1}^{k} λ_i = 1. We find weights λ_i which minimize V{Ȳ_w} under the constraint Σ λ_i = 1. The Lagrangian is L(λ_1, ..., λ_k, ρ) = Σ_{i=1}^{k} λ_i² σ_i²/n_i + ρ(Σ_{i=1}^{k} λ_i − 1). Differentiating with respect to λ_i, we get

∂L/∂λ_i = 2λ_i σ_i²/n_i + ρ, i = 1, ..., k,  and  ∂L/∂ρ = Σ_{i=1}^{k} λ_i − 1.

Equating the partial derivatives to zero, we get λ_i⁰ = −(ρ/2) n_i/σ_i² for i = 1, ..., k and Σ λ_i⁰ = −(ρ/2) Σ n_i/σ_i² = 1. Thus, −ρ/2 = 1/Σ_{i=1}^{k} n_i/σ_i², λ_i⁰ = (n_i/σ_i²)/Σ_{j=1}^{k} n_j/σ_j², and therefore w_i = n_i/σ_i².

Solution 3.7 Since the Y_i are uncorrelated,

V{β̂_1} = Σ_{i=1}^{n} w_i² V{Y_i} = σ² Σ_{i=1}^{n} (x_i − x̄_n)²/SS_x² = σ²/SS_x,  where SS_x = Σ_{i=1}^{n} (x_i − x̄_n)².

Solution 3.8 Let w_i = (x_i − x̄_n)/SS_x for i = 1, ..., n, where SS_x = Σ_{i=1}^{n} (x_i − x̄_n)². Then we have

Σ_{i=1}^{n} w_i = 0  and  Σ_{i=1}^{n} w_i² = 1/SS_x.

Hence,

V{β̂_0} = V{Ȳ_n − β̂_1 x̄_n}
= V{Ȳ_n − (Σ_{i=1}^{n} w_iY_i) x̄_n}
= V{Σ_{i=1}^{n} (1/n − w_i x̄_n) Y_i}
= Σ_{i=1}^{n} (1/n − w_i x̄_n)² σ²
= σ² Σ_{i=1}^{n} (1/n² − 2w_i x̄_n/n + w_i² x̄_n²)
= σ² (1/n − (2/n) x̄_n Σ w_i + x̄_n² Σ w_i²)
= σ² (1/n + x̄_n²/SS_x).

Also

cov(β̂_0, β̂_1) = cov(Σ_{i=1}^{n} (1/n − w_i x̄_n)Y_i, Σ_{i=1}^{n} w_iY_i)
= σ² Σ_{i=1}^{n} (1/n − w_i x̄_n) w_i
= −σ² x̄_n/SS_x.

Solution 3.9 The correlation between β̂_0 and β̂_1 is

ρ_{β̂_0, β̂_1} = (−σ² x̄_n/SS_x) / [σ² (1/n + x̄_n²/SS_x) · σ²/SS_x]^{1/2} = −x̄_n / [(1/n) Σ_{i=1}^{n} x_i²]^{1/2}.

Solution 3.10 X_1, X_2, ..., X_n i.i.d., X_1 ∼ P(λ). The likelihood function is

L(λ; X_1, ..., X_n) = e^{−nλ} λ^{Σ X_i} / Π_{i=1}^{n} X_i!.

Thus, (∂/∂λ) log L(λ; X_1, ..., X_n) = −n + Σ X_i/λ. Equating this to zero and solving for λ, we get λ̂_n = (1/n) Σ X_i = X̄_n.
Solution 3.11 Since ν is known, the likelihood of β is L(β) = C_n (1/β^{nν}) e^{−Σ X_i/β}, 0 < β < ∞, where C_n does not depend on β. The log-likelihood function is

l(β) = log C_n − nν log β − (1/β) Σ_{i=1}^{n} X_i.

The score function is l′(β) = −nν/β + Σ X_i/β². Equating the score to 0 and solving for β, we obtain the MLE β̂ = (1/(nν)) Σ_{i=1}^{n} X_i = X̄_n/ν. The variance of the MLE is V{β̂} = β²/(nν).
Solution 3.12 We proved in Exercise 2.56 that the m.g.f. of NB(2, p) is

M_X(t) = p²/(1 − (1 − p)e^t)²,  t < −log(1 − p).

Let X_1, X_2, ..., X_n be i.i.d. NB(2, p); then the m.g.f. of T_n = Σ_{i=1}^{n} X_i is

M_{T_n}(t) = p^{2n}/(1 − (1 − p)e^t)^{2n},  t < −log(1 − p).

Thus, T_n ∼ NB(2n, p). According to Example 3.4, the MLE of p, based on T_n (which is a sufficient statistic), is p̂_n = 2n/(T_n + 2n) = 2/(X̄_n + 2), where X̄_n = T_n/n is the sample mean.

(i) According to the WLLN, X̄_n converges in probability to E{X_1} = 2(1 − p)/p. Substituting 2(1 − p)/p for X̄_n in the formula of p̂_n we obtain p* = 2/(2(1 − p)/p + 2) = p. This shows that the limit in probability, as n → ∞, of p̂_n is p.
(ii) Substituting k = 2n in the formulas of Example 3.4 we obtain

Bias(p̂_n) ≈ 3p(1 − p)/(4n)  and  V{p̂_n} ≈ p²(1 − p)/(2n).
Solution 3.13 The likelihood function of μ and β is

L(μ, β) = I{X_(1) ≥ μ} (1/βⁿ) exp{−(1/β) Σ_{i=1}^{n} (X_(i) − X_(1)) − (n/β)(X_(1) − μ)},

for −∞ < μ ≤ X_(1), 0 < β < ∞.

(i) L(μ, β) is maximized by μ̂ = X_(1), that is,

L*(β) = sup_{μ ≤ X_(1)} L(μ, β) = (1/βⁿ) exp{−(1/β) Σ_{i=1}^{n} (X_(i) − X_(1))},

where X_(1) < X_(2) < ··· < X_(n) are the ordered statistics.
(ii) Furthermore, L*(β) is maximized by β̂_n = (1/n) Σ_{i=2}^{n} (X_(i) − X_(1)). The MLEs are μ̂ = X_(1) and β̂_n = (1/n) Σ_{i=2}^{n} (X_(i) − X_(1)).
(iii) X_(1) is distributed like μ + E(β/n), with p.d.f.

f_(1)(x; μ, β) = I{x ≥ μ} (n/β) e^{−(n/β)(x − μ)}.

Thus, the joint p.d.f. of (X_1, ..., X_n) is factored into a product of the p.d.f. of X_(1) and a function of β̂_n, which does not depend on X_(1) (nor on μ). This implies that X_(1) and β̂_n are independent. V{μ̂} = V{X_(1)} = β²/n². It can be shown that β̂_n ∼ (1/n) G(n − 1, β). Accordingly, V{β̂_n} = ((n − 1)/n²) β² = (1/n)(1 − 1/n) β².
Solution 3.14 In sampling with replacement, the number of defective items in the
sample, 𝑋, has the binomial distribution 𝐵(𝑛, 𝑝). We test the hypotheses 𝐻0 : 𝑝 ≤
0.03 against 𝐻1 : 𝑝 > 0.03. 𝐻0 is rejected if 𝑋 > 𝐵−1 (1 − 𝛼, 20, 0.03). For 𝛼 = 0.05
the rejection criterion is 𝑘 𝛼 = 𝐵−1 (0.95, 20, 0.03) = 2. Since the number of defective
items in the sample is 𝑋 = 2, 𝐻0 is not rejected at the 𝛼 = 0.05 significance level.
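The rejection limit can be obtained from the binomial quantile function; a sketch:

from scipy import stats

k_alpha = stats.binom(20, 0.03).ppf(0.95)
print(k_alpha)   # 2.0, so H0 is rejected only if X > 2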

Fig. 3.1 The OC function B(2; 30, p)

Solution 3.15 The OC function is OC(p) = B(2; 30, p), 0 < p < 1. A plot of this OC function is given in Fig. 3.1.
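A plot like Fig. 3.1 can be produced with a few lines of Python; a sketch using scipy and matplotlib:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

p = np.linspace(0.01, 0.5, 100)
plt.plot(p, stats.binom(30, p).cdf(2), color='black')  # OC(p) = B(2; 30, p)
plt.xlabel('p')
plt.ylabel('OC(p)')
plt.show()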

Solution 3.16 Let p_0 = 0.01, p_1 = 0.03, α = 0.05, β = 0.05. According to Eq. (3.3.12), the sample size n should satisfy

1 − Φ(((p_1 − p_0)/√(p_1 q_1)) √n − z_{1−α} √(p_0 q_0/(p_1 q_1))) = β

or, equivalently,

((p_1 − p_0)/√(p_1 q_1)) √n − z_{1−α} √(p_0 q_0/(p_1 q_1)) = z_{1−β}.

This gives

n ≈ (z_{1−α} √(p_0 q_0) + z_{1−β} √(p_1 q_1))²/(p_1 − p_0)² = (1.645)²(√(0.01 × 0.99) + √(0.03 × 0.97))²/(0.02)² = 494.

For this sample size, the critical value is k_α = np_0 + z_{1−α} √(np_0 q_0) = 8.58. Thus, H_0 is rejected if there are more than 8 "successes" in a sample of size 494.
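The sample size and critical value can be reproduced numerically (a sketch of the normal-approximation formulas above):

import numpy as np
from scipy import stats

p0, p1, alpha, beta = 0.01, 0.03, 0.05, 0.05
z_a, z_b = stats.norm.ppf(1 - alpha), stats.norm.ppf(1 - beta)
n = (z_a * np.sqrt(p0 * (1 - p0)) + z_b * np.sqrt(p1 * (1 - p1)))**2 / (p1 - p0)**2
n = int(np.ceil(n))
k_alpha = n * p0 + z_a * np.sqrt(n * p0 * (1 - p0))
print(n, k_alpha)   # 494 and approx. 8.58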
Solution 3.17 X̄_n ∼ N(μ, σ/√n).
(i) If μ = μ_0, the probability that X̄_n will be outside the control limits is

Pr{X̄_n < μ_0 − 3σ/√n} + Pr{X̄_n > μ_0 + 3σ/√n} = Φ(−3) + 1 − Φ(3) = 0.0027.

(ii) (1 − 0.0027)²⁰ = 0.9474.
(iii) If μ = μ_0 + 2σ/√n, the probability that X̄_n will be outside the control limits is Φ(−5) + 1 − Φ(1) = 0.1587.
(iv) (1 − 0.1587)¹⁰ = 0.1777.

Solution 3.18 We can run the 1-sample 𝑡-test in Python as follows:

socell = mistat.load_data('SOCELL')
t1 = socell['t1']

statistic, pvalue = stats.ttest_1samp(t1, 4.0)


# divide pvalue by two for one-sided test
pvalue = pvalue / 2
print(f'pvalue {pvalue:.2f}')

pvalue 0.35

The hypothesis 𝐻0 : 𝜇 ≥ 4.0 amps is not rejected.

Solution 3.19 We can run the 1-sample 𝑡-test in Python as follows:

socell = mistat.load_data('SOCELL')
t2 = socell['t2']

statistic, pvalue = stats.ttest_1samp(t2, 4.0)


# divide pvalue by two for one-sided test
pvalue = pvalue / 2
print(f'pvalue {pvalue:.2f}')

pvalue 0.03

The hypothesis 𝐻0 : 𝜇 ≥ 4.0 amps is rejected at a 0.05 level of significance.

Solution 3.20 Let n = 30, α = 0.01. The OC(δ) function for a one-sided t-test is

OC(δ) = 1 − Φ((δ√30 − 2.462(1 − 1/232)) / (1 + 6.0614/58)^{1/2}) = 1 − Φ(5.2117δ − 2.3325).
delta = np.linspace(0, 1.0, 11)

a = np.sqrt(30)
b = 2.462 * ( 1 - 1/232)
f = np.sqrt(1 + 6.0614 / 58)
OC_delta = 1 - stats.norm.cdf((a * delta - b) / f)

Values of OC(δ) for δ = 0 to 1 in steps of 0.1 are given in the following table.



𝛿 OC(𝛿)
0.0 0.990164
0.1 0.964958
0.2 0.901509
0.3 0.779063
0.4 0.597882
0.5 0.392312
0.6 0.213462
0.7 0.094149
0.8 0.033120
0.9 0.009188
1.0 0.001994

Solution 3.21 Let n = 31, α = 0.10. The OC function for testing H_0: σ² ≤ σ_0² against H_1: σ² > σ_0² is

OC(σ²) = Pr{S² ≤ (σ_0²/(n − 1)) χ²_{0.9}[n − 1]}
= Pr{χ²[30] ≤ (σ_0²/σ²) χ²_{0.9}[30]}
= 1 − P(30/2 − 1; (σ_0²/(2σ²)) χ²_{0.9}[30]).

sigma2 = np.linspace(1, 2, 11)


OC_sigma2 = 1 - stats.poisson.cdf(30 / 2 - 1,
stats.chi2(30).ppf(0.90) / (2 * sigma2))

The values of OC(σ²) for σ² = 1 to 2 in steps of 0.1 are given in the following table (here σ_0² = 1):

𝜎2 OC(𝜎 2 )
1.0 0.900000
1.1 0.810804
1.2 0.700684
1.3 0.582928
1.4 0.469471
1.5 0.368201
1.6 0.282781
1.7 0.213695
1.8 0.159540
1.9 0.118063
2.0 0.086834

Fig. 3.2 Comparison of the exact OC(p) and its normal approximation

Solution 3.22 The OC function, for testing H_0: p ≤ p_0 against H_1: p > p_0, is approximated by

OC(p) = 1 − Φ((p − p_0) √(n/(pq)) − z_{1−α} √(p_0 q_0/(pq))),

for p ≥ p_0. In Fig. 3.2 we present the graph of OC(p), for p_0 = 0.1, n = 100, α = 0.05, using both the exact solution and the normal approximation.

n = 100
p0 = 0.1
alpha = 0.05

c_alpha = stats.binom(n, p0).ppf(1 - alpha)


p = np.linspace(0.1, 0.35, 20)
OC_exact = stats.binom(n, p).cdf(c_alpha)

z_ma = stats.norm.ppf(1 - alpha)


q0 = 1- p0
pq = p * (1 - p)
OC_approx = 1 - stats.norm.cdf((p - p0) * np.sqrt((n / pq)) -
z_ma * np.sqrt(p0 * q0 / pq))
df = pd.DataFrame({
'p': p,
'OC(p)': OC_exact,
'$OC_{approx}(p)$': OC_approx,
})
df.plot.line(x='p', y=['OC(p)', '$OC_{approx}(p)$'])
plt.show()

Solution 3.23 The power function is

ψ(σ²) = Pr{S² ≥ (σ_0²/(n − 1)) χ²_{1−α}[n − 1]} = Pr{χ²[n − 1] ≥ (σ_0²/σ²) χ²_{1−α}[n − 1]}.
Solution 3.24 The power function is ψ(ρ) = Pr{F[n_1 − 1, n_2 − 1] ≥ (1/ρ) F_{1−α}[n_1 − 1, n_2 − 1]}, for ρ ≥ 1, where ρ = σ_1²/σ_2².
Solution 3.25 (i) Using the following Python commands we get a 99% C.I. for 𝜇:

data = [20.74, 20.85, 20.54, 20.05, 20.08, 22.55, 19.61, 19.72,


20.34, 20.37, 22.69, 20.79, 21.76, 21.94, 20.31, 21.38,
20.42, 20.86, 18.80, 21.41]
alpha = 1 - 0.99

df = len(data) - 1
mean = np.mean(data)
sem = stats.sem(data)

print(stats.t.interval(1 - alpha, df, loc=mean, scale=sem))

(20.136889216656858, 21.38411078334315)

Confidence Intervals
Variable   N    Mean    StDev   SE Mean   99.0% C.I.
Sample     20   20.760  0.975   0.218     (20.137, 21.384)
(ii) A 99% C.I. for 𝜎 2 is (0.468, 2.638).

var = np.var(data, ddof=1)


print(df * var / stats.chi2(df).ppf(1 - alpha/2))
print(df * var / stats.chi2(df).ppf(alpha/2))

0.46795850248657883
2.6380728125212016

(iii) A 99% C.I. for 𝜎 is (0.684, 1.624).


Solution 3.26 Let (μ_{.99}, μ̄_{.99}) be a confidence interval for μ, at level 0.99. Let (σ_{.99}, σ̄_{.99}) be a confidence interval for σ at level 0.99. Let ξ = μ_{.99} + 2σ_{.99} and ξ̄ = μ̄_{.99} + 2σ̄_{.99}. Then

Pr{ξ ≤ μ + 2σ ≤ ξ̄} ≥ Pr{μ_{.99} ≤ μ ≤ μ̄_{.99}, σ_{.99} ≤ σ ≤ σ̄_{.99}} ≥ 0.98.

Thus, (ξ, ξ̄) is a confidence interval for μ + 2σ, with confidence level greater than or equal to 0.98. Using the data of the previous problem, a 98% C.I. for μ + 2σ is (21.505, 24.632).

Solution 3.27 Let 𝑋 ∼ 𝐵(𝑛, 𝜃). For 𝑋 = 17 and 𝑛 = 20, a confidence interval for 𝜃,
at level 0.95, is (0.6211, 0.9679).

alpha = 1 - 0.95
X = 17
n = 20
F1 = stats.f(2*(n-X+1), 2*X).ppf(1 - alpha/2)
F2 = stats.f(2*(X+1), 2*(n-X)).ppf(1 - alpha/2)
pL = X / (X + (n-X+1) * F1)
pU = (X+1) * F2 / (n-X + (X+1) * F2)
print(pL, pU)

0.6210731734546859 0.9679290628145363

Solution 3.28 From the data we have 𝑛 = 10 and 𝑇10 = 134. For 𝛼 = 0.05, 𝜆 𝐿 =
11.319 and 𝜆𝑈 = 15.871.

X = [14, 16, 11, 19, 11, 9, 12, 15, 14, 13]


alpha = 1 - 0.95
T_n = np.sum(X)

# exact solution
print(stats.chi2(2 * T_n + 2).ppf(alpha/2) / (2 * len(X)))
print(stats.chi2(2 * T_n + 2).ppf(1 - alpha/2) / (2 * len(X)))

# approximate solution
nu = 2 * T_n + 2
print((nu + stats.norm.ppf(alpha/2) * np.sqrt(2 * nu)) / (2 * len(X)))
print((nu + stats.norm.ppf(1-alpha/2) * np.sqrt(2 * nu)) / (2 * len(X)))

11.318870163746238
15.870459268116013
11.222727638613012
15.777272361386988

Solution 3.29 For 𝑛 = 20, 𝜎 = 5, 𝑌¯20 = 13.75, 𝛼 = 0.05 and 𝛽 = 0.1, the tolerance
interval is (3.33, 24.17).

Solution 3.30 Use the following Python code:

yarnstrg = mistat.load_data('YARNSTRG')
n = len(yarnstrg)
Ybar = yarnstrg.mean()
S = yarnstrg.std()

alpha, beta = 0.025, 0.025


z_1a = stats.norm.ppf(1-alpha)
z_1b = stats.norm.ppf(1-beta)
z_a = stats.norm.ppf(alpha)
z_b = stats.norm.ppf(beta)

t_abn = (z_1b/(1 - z_1a**2/(2*n)) +


z_1a*np.sqrt(1 + z_b**2/2 - z_1a**2/(2*n)) /
(np.sqrt(n)*(1 - z_1a**2/(2*n))))

print(Ybar - t_abn*S, Ybar + t_abn*S)

0.7306652047594424 5.117020795240556

Fig. 3.3 Q–Q plot of ISC-t1 (SOCELL.csv); R² = 0.911

From the data we have X̄_100 = 2.9238 and S_100 = 0.9378.

t(0.025, 0.025, 100) = 1.96/(1 − 1.96²/200) + 1.96(1 + 1.96²/2 − 1.96²/200)^{1/2} / (10(1 − 1.96²/200)) = 2.3388.

The tolerance interval is (0.7306, 5.1171).


Solution 3.31 From the data, 𝑌 (1) = 1.151 and 𝑌 (100) = 5.790. For 𝑛 = 100 and
𝛽 = 0.10 we have 1 − 𝛼 = 0.988. For 𝛽 = 0.05, 1 − 𝛼 = 0.847, the tolerance interval
is (1.151, 5.790). The nonparametric tolerance interval is shorter and is shifted to
the right with a lower confidence level.
Solution 3.32 The following is a normal probability plot of ISC-t1. According to Fig. 3.3, the hypothesis of normality is not rejected.

Solution 3.33 As is shown in the normal probability plots in Fig. 3.4, the hypothesis of normality is not rejected in either case.
Solution 3.34 Frequency distribution for turn diameter:

Fig. 3.4 Q–Q plots of CAR.csv data: turn diameter (R² = 0.985) and log(horse-power) (R² = 0.986)

Interval    Observed   Expected   (𝑂 − 𝐸)²/𝐸
   - 31          11     8.1972      0.9583
31 - 32           8     6.3185      0.4475
32 - 33           9     8.6687      0.0127
33 - 34           6    10.8695      2.1815
34 - 35          18    12.4559      2.4677
35 - 36           8    13.0454      1.9513
36 - 37          13    12.4868      0.0211
37 - 38           6    10.9234      2.2191
38 - 39           9     8.7333      0.0081
39 - 40           8     6.3814      0.4106
40 -             13     9.0529      1.7213
Total           109       —        12.399

The expected frequencies were computed for 𝑁 (35.5138, 3.3208). Here 𝜒2 = 12.4,
d.f. = 8 and the 𝑃 value is 0.134. The differences from normal are not significant.
observed expected (O-E)^2/E
turn
[28, 31) 11 8.197333 0.958231
[31, 32) 8 6.318478 0.447500
[32, 33) 9 8.668695 0.012662
[33, 34) 6 10.869453 2.181487
[34, 35) 18 12.455897 2.467673
[35, 36) 8 13.045357 1.951317
[36, 37) 13 12.486789 0.021093
[37, 38) 6 10.923435 2.219102
[38, 39) 9 8.733354 0.008141
[39, 40) 8 6.381395 0.410550
[40, 44) 13 9.052637 1.721231
chi2-statistic of fit 12.398987638400024

chi2[8] for 95% 15.50731305586545


p-value of observed statistic 0.1342700576126994
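The commands that produced this table are not shown in this extract; a sketch of how it can be computed, assuming the class limits above and the sample mean and standard deviation of the turn variable of CAR.csv:

car = mistat.load_data('CAR')
turn = car['turn']
mean, std = turn.mean(), turn.std(ddof=1)
bins = [28, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 44]
observed = pd.cut(turn, bins=bins, right=False).value_counts(sort=False)
# expected frequencies under N(mean, std) for the same class limits
expected = len(turn) * np.diff(stats.norm(mean, std).cdf(bins))
chi2_stat = np.sum((observed.values - expected)**2 / expected)
print('chi2-statistic of fit', chi2_stat)
print('chi2[8] for 95%', stats.chi2(8).ppf(0.95))
print('p-value of observed statistic', 1 - stats.chi2(8).cdf(chi2_stat))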

Solution 3.35 In Python:
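The commands producing the output below are not shown in this extract; a minimal sketch, assuming the Kolmogorov–Smirnov test is applied to the turn variable of CAR.csv with its estimated mean and standard deviation, and using the 𝑘*_𝛼 expression given after the output:

car = mistat.load_data('CAR')
turn = car['turn']
# KS statistic against the normal distribution with estimated parameters
print(stats.kstest(turn, 'norm', args=(turn.mean(), turn.std(ddof=1))))
n = len(turn)
print('k_alpha', 0.895 / (np.sqrt(n) - 0.01 + 0.85 / np.sqrt(n)))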


KstestResult(statistic=0.07019153486614366, pvalue=0.6303356787948367,
statistic_location=32.2, statistic_sign=1)
k_alpha 0.08514304524687971


 
𝐷*_{109} = 0.0702. For 𝛼 = 0.05, 𝑘*_𝛼 = 0.895/(√109 − 0.01 + 0.85/√109) = 0.0851. The deviations from the normal distribution are not significant.
 
Solution 3.36 For 𝑋 ∼ 𝑃(100), 𝑛0 = 𝑃⁻¹(0.2/0.3, 100) ≈ 100 + 𝑧.67 × 10 = 105.

Solution 3.37 Given 𝑋 = 6, the posterior distribution of 𝑝 is Beta(9,11).


Solution 3.38 𝐸{𝑝 | 𝑋 = 6} = 9/20 = 0.45 and 𝑉{𝑝 | 𝑋 = 6} = 99/(20² × 21) = 0.0118, so 𝜎_{𝑝|𝑋=6} = 0.1086.
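These moments can be checked numerically from the Beta(9, 11) posterior:

posterior = stats.beta(9, 11)
print(posterior.mean(), posterior.var(), posterior.std())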
Solution 3.39 Let 𝑋 | 𝜆 ∼ 𝑃(𝜆) where 𝜆 ∼ 𝐺(2, 50).
(i) The posterior distribution of 𝜆 | 𝑋 = 82 is 𝐺(84, 50/51).
(ii) 𝐺.025(84, 50/51) = 65.6879 and 𝐺.975(84, 50/51) = 100.873.
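The quantiles can be computed with scipy, using the shape–scale parametrization of the Gamma distribution:

posterior = stats.gamma(a=84, scale=50/51)
print(posterior.ppf(0.025), posterior.ppf(0.975))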
Solution 3.40 The posterior probability for 𝜆0 = 70 is

$$\pi(72) = \frac{\frac{1}{3}\,p(72; 70)}{\frac{1}{3}\,p(72; 70) + \frac{2}{3}\,p(72; 90)} = 0.771.$$

𝐻0: 𝜆 = 𝜆0 is accepted if 𝜋(𝑋) > 𝑟0/(𝑟0 + 𝑟1) = 0.4. Thus, 𝐻0 is accepted.
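A quick numerical check of this posterior probability:

p70 = stats.poisson(70).pmf(72)
p90 = stats.poisson(90).pmf(72)
print((p70 / 3) / (p70 / 3 + 2 * p90 / 3))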
Solution 3.41 The credibility interval for 𝜇 is (43.235,60.765). Since the posterior
distribution of 𝜇 is symmetric, this credibility interval is also a HPD interval.
Solution 3.42 In Python:

random.seed(1)
car = mistat.load_data('CAR')

def mean_of_sample_with_replacement(x, n):
    sample = random.choices(x, k=n)
    return np.mean(sample)

means = [mean_of_sample_with_replacement(car['mpg'], 64)
         for _ in range(200)]

fig, ax = plt.subplots(figsize=[4, 4])


pg.qqplot(means, ax=ax)
plt.show()

{ fig:qqplotSampleMeansMPG} Fig. 3.5 𝑄–𝑄 Plot of mean of samples from mpg data (𝑅² = 0.992)

{ fig:qqplotSampleMeansMPG} As Fig. 3.5 shows, the resampling distribution is approximately normal.

S = np.std(car['mpg'], ddof=1)
print('standard deviation of mpg', S)
print('S/8', S/8)
S_resample = np.std(means, ddof=1)
print('S.E.\{X\}', S_resample)

standard deviation of mpg 3.9172332424696052


S/8 0.48965415530870066
S.E.\{X\} 0.44718791755315535

Executing the macro 200 times, we obtained S.E.{𝑋̄} = 𝑆/8 = 0.48965. The standard deviation of the resampled distribution is 0.4472. This is a resampling estimate of S.E.{𝑋̄}.
Solution 3.43 In our particular execution with 𝑀 = 500, we have a proportion
𝛼ˆ = 0.07 of cases in which the bootstrap confidence intervals do not cover the
mean of yarnstrg, 𝜇 = 2.9238. This is not significantly different from the nominal
𝛼 = 0.05. The determination of the proportion 𝛼ˆ can be done by using the following
commands:

random.seed(1)
yarnstrg = mistat.load_data('YARNSTRG')

def confidence_interval(x, nsigma=2):
    sample_mean = np.mean(x)
    sigma = np.std(x, ddof=1) / np.sqrt(len(x))
    return (sample_mean - nsigma * sigma, sample_mean + nsigma * sigma)

mean = np.mean(yarnstrg)
outside = 0
for _ in range(500):
    sample = random.choices(yarnstrg, k=30)
    ci = confidence_interval(sample)
    if mean < ci[0] or ci[1] < mean:
        outside += 1

hat_alpha = outside / 500

ci = confidence_interval(yarnstrg)
print(f' Mean: {mean}')
print(f' 2-sigma-CI: {ci[0]:.1f} - {ci[1]:.1f}')
print(f' proportion outside: {hat_alpha:.3f}')

Mean: 2.9238429999999993
2-sigma-CI: 2.7 - 3.1
proportion outside: 0.068

Solution 3.44 In Python:

random.seed(1)
car = mistat.load_data('CAR')
us_cars = car[car['origin'] == 1]
us_turn = list(us_cars['turn'])

sample_means = []
for _ in range(100):
    x = random.choices(us_turn, k=58)
    sample_means.append(np.mean(x))

is_larger = sum(m > 37.406 for m in sample_means)
ratio = is_larger / len(sample_means)
print(ratio)

0.23

We obtained 𝑃˜ = 0.23. The mean 𝑋¯ = 37.203 is not significantly larger than 37.

Solution 3.45 Let 𝑋50 be the number of non-conforming units in a sample of 𝑛 = 50 items. We reject 𝐻0, at level 𝛼 = 0.05, if 𝑋50 > 𝐵⁻¹(0.95, 50, 0.03) = 4. The criterion 𝑘𝛼 is obtained by using the Python commands:

stats.binom(50, 0.03).ppf(0.95)

4.0

Solution 3.46 (i) Calculation of 95% confidence intervals:

random.seed(1)
cyclt = mistat.load_data('CYCLT')

Fig. 3.6 Histograms of EBD for CYCLT.csv data (bootstrap distributions of the mean and of the standard deviation) { fig:histEBD_CYCLT}

B = pg.compute_bootci(cyclt, func=np.mean, n_boot=1000,
                      confidence=0.95, return_dist=True, seed=1)
ci_mean, dist_mean = B
print(f' Mean: {np.mean(dist_mean):.3f}')
print(f' 95%-CI: {ci_mean[0]:.3f} - {ci_mean[1]:.3f}')

B = pg.compute_bootci(cyclt, func=lambda x: np.std(x, ddof=1), n_boot=1000,
                      confidence=0.95, return_dist=True, seed=1)
ci_std, dist_std = B
print(f' Mean: {np.mean(dist_std):.3f}')
print(f' 95%-CI: {ci_std[0]:.3f} - {ci_std[1]:.3f}')

Mean: 0.652
95%-CI: 0.550 - 0.760
Mean: 0.370
95%-CI: 0.340 - 0.400

(ii) Histograms of the EBD, see Fig. 3.6. { fig:histEBD_CYCLT}

fig, axes = plt.subplots(figsize=[6, 3], ncols=2)


axes[0].hist(dist_mean, color='grey', bins=17)
axes[1].hist(dist_std, color='grey', bins=17)
axes[0].set_title('Bootstrap distribution mean')
axes[1].set_title('Bootstrap distribution std.dev.')
plt.tight_layout()
plt.show()

Solution 3.47 In Python:

cyclt = mistat.load_data('CYCLT')
ebd = {}
for quantile in (1, 2, 3):
    B = pg.compute_bootci(cyclt, func=lambda x: np.quantile(x, 0.25 * quantile),
                          n_boot=1000, confidence=0.95, return_dist=True, seed=1)
    ci, dist = B
    ebd[quantile] = dist
    print(f'Quantile {quantile}: {np.mean(dist):.3f} 95%-CI: {ci[0]:.3f} - {ci[1]:.3f}')

{ fig:histEBD_CYCLT_quartiles} Fig. 3.7 Histograms of EBD for quartiles for CYCLT.csv data

Quantile 1: 0.306 95%-CI: 0.230 - 0.420


Quantile 2: 0.573 95%-CI: 0.380 - 1.010
Quantile 3: 1.060 95%-CI: 0.660 - 1.090

{ fig:histEBD_CYCLT_quartiles} (ii) Histograms of the EBD, see Fig. 3.7.

fig, axes = plt.subplots(figsize=[6, 2], ncols=3)


axes[0].hist(ebd[1], color='grey', bins=17)
axes[1].hist(ebd[2], color='grey', bins=17)
axes[2].hist(ebd[3], color='grey', bins=17)
axes[0].set_xlabel('Q1')
axes[1].set_xlabel('Q2')
axes[2].set_xlabel('Q3')
axes[0].set_ylabel('Frequency')
plt.tight_layout()
plt.show()

Solution 3.48 In Python:

socell = mistat.load_data('SOCELL')
t1 = socell['t1']
t2 = socell['t2']

# use the index
idx = list(range(len(socell)))

def sample_correlation(x):
    return stats.pearsonr(t1[x], t2[x])[0]

B = pg.compute_bootci(idx, func=sample_correlation,
                      n_boot=1000, confidence=0.95, return_dist=True, seed=1)
ci, dist = B
print(f'rho_XY: {np.mean(dist):.3f} 95%-CI: {ci[0]:.3f} - {ci[1]:.3f}')

rho_XY: 0.975 95%-CI: 0.940 - 0.990

{ fig:histEBD_SOCELL_correlation} Histogram of bootstrap correlations, see Fig. 3.8.



Fig. 3.8 Histograms of EBD for correlation for SOCELL.csv data { fig:histEBD_SOCELL_correlation}

fig, ax = plt.subplots(figsize=[4, 4])


ax.hist(dist, color='grey', bins=17)
plt.show()

Solution 3.49 (i) and (ii)

car = mistat.load_data('CAR')
mpg = car['mpg']
hp = car['hp']

idx = list(range(len(mpg)))
sample_intercept = []
sample_slope = []
for _ in range(1000):
    x = random.choices(idx, k=len(idx))
    result = stats.linregress(hp[x], mpg[x])
    sample_intercept.append(result.intercept)
    sample_slope.append(result.slope)

ci = np.quantile(sample_intercept, [0.025, 0.975])


print(f'intercept (a): {np.mean(sample_intercept):.3f} ' +
f'95%-CI: {ci[0]:.3f} - {ci[1]:.3f}')
ci = np.quantile(sample_slope, [0.025, 0.975])
print(f'slope (b): {np.mean(sample_slope):.4f} ' +
f'95%-CI: {ci[0]:.4f} - {ci[1]:.4f}')

reg = stats.linregress(hp, mpg)



hm = np.mean(hp)

print(np.std(sample_intercept))
print(np.std(sample_slope))

intercept (a): 30.724 95%-CI: 28.766 - 32.691


slope (b): -0.0741 95%-CI: -0.0891 - -0.0599
1.0170449375724464
0.0074732552885114645

(iii) The bootstrap S.E. of slope and intercept are 1.017 and 0.00747, respectively.
{sec:least-squares-single} The standard errors of 𝑎 and 𝑏, according to the formulas of Sect. 4.3.2.1 are 0.8099
and 0.00619, respectively. The bootstrap estimates are quite close to the correct
values.
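As a cross-check (not part of the original solution), scipy's linregress reports these standard errors directly through the stderr and intercept_stderr attributes of its result object:

reg = stats.linregress(hp, mpg)
print('S.E. intercept:', reg.intercept_stderr)
print('S.E. slope:', reg.stderr)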

Solution 3.50 In Python:

cyclt = mistat.load_data('CYCLT')
X50 = np.mean(cyclt)
SD50 = np.std(cyclt)
result = stats.ttest_1samp(cyclt, 0.55)
print(f'Xmean_50 = {X50:.3f}')
print(result)

B = pg.compute_bootci(cyclt, func=np.mean, n_boot=1000,


confidence=0.95, return_dist=True, seed=1)
ci_mean, dist = B
pstar = sum(dist < 0.55) / len(dist)
print(f'p*-value: {pstar}')

Xmean_50 = 0.652
TtestResult(statistic=1.9425149510299369, pvalue=0.057833259176805,
df=49)
p*-value: 0.024

The mean of the sample is 𝑋̄_50 = 0.652. The studentized difference from 𝜇0 = 0.55 is 𝑡 = 1.943.
(i) The t-test obtained a 𝑃-level of 0.058 and the bootstrap resulted in 𝑃* = 0.024.
(ii) Yes, but 𝜇 is very close to the lower bootstrap confidence limit (0.540). The null hypothesis 𝐻0: 𝜇 = 0.55 is accepted.
(iii) No, but since 𝑃* is close to 0.05, we expect that the bootstrap confidence interval will be very close to 𝜇0.

Solution 3.51 In Python:

almpin = mistat.load_data('ALMPIN')
diam1 = almpin['diam1']
diam2 = almpin['diam2']

# calculate the ratio of the two variances:


var_diam1 = np.var(diam1)
var_diam2 = np.var(diam2)
F = var_diam2 / var_diam1
print(f'Variance diam1: {var_diam1:.5f}')
print(f'Variance diam2: {var_diam2:.5f}')
print(f'Ratio: {F:.4f}')

{ fig:boxplot-almpin} Fig. 3.9 Box plots of diam1 and diam2 measurements of the ALMPIN dataset

# Calculate the p-value
p_value = stats.f.cdf(F, len(diam1) - 1, len(diam2) - 1)
print(f'p-value: {p_value:.3f}')

Variance diam1: 0.00027


Variance diam2: 0.00032
Ratio: 1.2016
p-value: 0.776

The variances are therefore not significantly different.


(ii) The box plots of the two measurements are shown in Fig. 3.9. { fig:boxplot-almpin}

almpin.boxplot(column=['diam1', 'diam2'])
plt.show()

Solution 3.52 The variances were already compared in the previous exercise. To
compare the means use:

almpin = mistat.load_data('ALMPIN')
diam1 = almpin['diam1']
diam2 = almpin['diam2']

# Compare means
mean_diam1 = np.mean(diam1)
mean_diam2 = np.mean(diam2)
print(f'Mean diam1: {mean_diam1:.5f}')
print(f'Mean diam2: {mean_diam2:.5f}')

# calculate studentized difference and p-value


se1, se2 = stats.sem(diam1), stats.sem(diam2)
sed = np.sqrt(se1**2.0 + se2**2.0)
t_stat = (mean_diam1 - mean_diam2) / sed
print(f'Studentized difference: {t_stat:.3f}')
df = len(diam1) + len(diam2) - 2
p = (1 - stats.t.cdf(abs(t_stat), df)) * 2

print(f'p-value: {p:.3f}')

# or use any of the available implementations of the t-test


print(stats.ttest_ind(diam1, diam2))

Mean diam1: 9.99286


Mean diam2: 9.98729
Studentized difference: 1.912
p-value: 0.058
Ttest_indResult(statistic=1.9119658005133064,
pvalue=0.05795318184124417)

The bootstrap based p-value for the comparison of the means is:

random.seed(1)

# return studentized distance between random samples from diam1 and diam2
def stat_func():
    d1 = random.choices(diam1, k=len(diam1))
    d2 = random.choices(diam2, k=len(diam2))
    return stats.ttest_ind(d1, d2).statistic

dist = np.array([stat_func() for _ in range(1000)])

pstar = sum(dist < 0) / len(dist)


print(f'p*-value: {pstar}')

p*-value: 0.014

The bootstrap based p-value for the comparison of the variances is:

columns = ['diam1', 'diam2']


# variance for each column
S2 = almpin[columns].var(axis=0, ddof=0)
F0 = max(S2) / min(S2)
print('S2', S2)
print('F0', F0)

# Step 1: sample variances of bootstrapped samples for each column


seed = 1
B = {}
for column in columns:
    ci = pg.compute_bootci(almpin[column], func='var', n_boot=500,
                           confidence=0.95, seed=seed, return_dist=True)
    B[column] = ci[1]
Bt = pd.DataFrame(B)

# Step 2: compute Wi
Wi = Bt / S2

# Step 3: compute F*
FBoot = Wi.max(axis=1) / Wi.min(axis=1)
FBoot95 = np.quantile(FBoot, 0.95)
print('FBoot 95%', FBoot95)
pstar = sum(FBoot >= F0)/len(FBoot)
print(f'p*-value: {pstar}')

S2 diam1 0.000266
diam2 0.000320
dtype: float64
F0 1.2016104294478573
FBoot 95% 1.1855457165968324
p*-value: 0.04

The variance of Sample 1 is 𝑆²_1 = 0.00027. The variance of Sample 2 is 𝑆²_2 = 0.00032. The variance ratio is 𝐹 = 𝑆²_2/𝑆²_1 = 1.202. The bootstrap level for variance ratios is 𝑃* = 0.04.

Solution 3.53 In Python:

mpg = mistat.load_data('MPG')
columns = ['origin1', 'origin2', 'origin3']
# variance for each column
S2 = mpg[columns].var(axis=0, ddof=1)
F0 = max(S2) / min(S2)
print('S2', S2)
print('F0', F0)

# Step 1: sample variances of bootstrapped samples for each column


seed = 1
B = {}
for column in columns:
    ci = pg.compute_bootci(mpg[column].dropna(), func='var', n_boot=500,
                           confidence=0.95, seed=seed, return_dist=True)
    B[column] = ci[1]
Bt = pd.DataFrame(B)

# Step 2: compute Wi
Wi = Bt / S2

# Step 3: compute F*
FBoot = Wi.max(axis=1) / Wi.min(axis=1)
FBoot95 = np.quantile(FBoot, 0.95)
print('FBoot 95%', FBoot95)
pstar = sum(FBoot >= F0)/len(FBoot)
print(f'p*-value: {pstar}')

S2 origin1 12.942529
origin2 6.884615
origin3 18.321321
dtype: float64
F0 2.6611975103595213
FBoot 95% 2.6925366761838987
p*-value: 0.058

With 𝑀 = 500 we obtained the following results:


• 1st sample variance = 12.9425,
• 2nd sample variance = 6.8846,
• 3rd sample variance = 18.3213,
𝐹max/min = 2.6612 and the bootstrap 𝑃 value is 𝑃∗ = 0.058. The bootstrap test does
not reject the hypothesis of equal variances at the 0.05 significance level.

Solution 3.54 With 𝑀 = 500 we obtained

𝑋̄_1 = 20.931, 𝑆²_1 = 12.9425
𝑋̄_2 = 19.5, 𝑆²_2 = 6.8846
𝑋̄_3 = 23.1081, 𝑆²_3 = 18.3213

Using the approach shown in Sect. 3.11.5.2, we get: {sec:comp-means-one-way-anova}



{ fig:mpg-equal-mean} Fig. 3.10 Distribution of EBD for Exercise 3.54 (histogram of 𝐹* values)

mpg = mistat.load_data('MPG.csv')
samples = [mpg[key].dropna() for key in ['origin1', 'origin2', 'origin3']]

def test_statistic_F(samples):
    return stats.f_oneway(*samples).statistic

# Calculate sample shifts
Ni = np.array([len(sample) for sample in samples])
N = np.sum(Ni)
XBni = np.array([np.mean(sample) for sample in samples])
XBB = np.sum(Ni * XBni) / N
DB = XBni - XBB

F0 = test_statistic_F(samples)
Ns = 1000
Fstar = []
for _ in range(Ns):
    Ysamples = []
    for sample, DBi in zip(samples, DB):
        Xstar = np.array(random.choices(sample, k=len(sample)))
        Ysamples.append(Xstar - DBi)
    Fs = test_statistic_F(Ysamples)
    Fstar.append(Fs)
Fstar = np.array(Fstar)

print(f'F = {F0:.3f}')
print('ratio', sum(Fstar > F0)/len(Fstar))

ax = pd.Series(Fstar).hist(bins=14, color='grey')
ax.axvline(F0, color='black', lw=2)
ax.set_xlabel('F* values')
plt.show()

F = 6.076
ratio 0.003

{ fig:mpg-equal-mean} 𝐹 = 6.076, 𝑃∗ = 0.003 and the hypothesis of equal means is rejected. See Fig.
3.10 for the calculated EBD.

Solution 3.55 In Python:

np.random.seed(1)

def qbinomBoot(x, p):
    return stats.binom.ppf(p, 50, p=x.mean())

for p_real in (0.2, 0.1, 0.05):
    defects = stats.bernoulli.rvs(p_real, size=50)
    B_025 = pg.compute_bootci(defects, func=lambda x: qbinomBoot(x, p=0.025),
                              n_boot=500, seed=1, return_dist=True)
    B_975 = pg.compute_bootci(defects, func=lambda x: qbinomBoot(x, p=0.975),
                              n_boot=500, seed=1, return_dist=True)
    tol_int = [np.quantile(B_025[1], 0.025), np.quantile(B_975[1], 0.975)]
    print(f'Tolerance interval p={p_real}: ({tol_int[0]}, {tol_int[1]})')

Tolerance interval p=0.2: (1.0, 22.0)


Tolerance interval p=0.1: (0.0, 17.0)
Tolerance interval p=0.05: (0.0, 9.0)

The tolerance intervals of the number of defective items in future batches of size
𝑁 = 50, with 𝛼 = 0.05 and 𝛽 = 0.05 are

Limits
𝑝 Lower Upper
0.2 1 23
0.1 0 17
0.05 0 9

Solution 3.56 In Python:

np.random.seed(1)

def getQuantile(x, p):
    return np.quantile(x, p)

oturb = mistat.load_data('OTURB.csv')
B_025 = pg.compute_bootci(oturb, func=lambda x: getQuantile(x, p=0.025),
n_boot=500, seed=1, return_dist=True)
B_975 = pg.compute_bootci(oturb, func=lambda x: getQuantile(x, p=0.975),
n_boot=500, seed=1, return_dist=True)
tol_int = [np.quantile(B_025[1], 0.025),np.quantile(B_975[1], 0.975)]
print(f'Tolerance interval ({tol_int[0]}, {tol_int[1]})')

Tolerance interval (0.2399, 0.68305)

A (0.95, 0.95) tolerance interval for OTURB.csv is (0.24, 0.683).

Solution 3.57 In Python:

cyclt = mistat.load_data('CYCLT.csv')
# make use of the fact that a True value is interpreted as 1 and False as 0
print('Values greater 0.7:', sum(cyclt>0.7))

Values greater 0.7: 20

We find that in the sample of 𝑛 = 50 cycle times, there are 𝑋 = 20 values greater than 0.7 (and hence 30 values below 0.7). If the hypothesis is 𝐻0: 𝜉.5 ≤ 0.7, the probability of observing a value smaller than 0.7 is 𝑝 ≥ 1/2. Thus, the sign test rejects 𝐻0 if the number of values below 0.7 is smaller than 𝐵⁻¹(𝛼; 50, 1/2). For 𝛼 = 0.10 the critical value is 𝑘𝛼 = 20; since 30 ≥ 20, 𝐻0 is not rejected.
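The critical value 𝐵⁻¹(0.10; 50, 1/2) can be obtained with scipy's binomial quantile function:

print(stats.binom(50, 0.5).ppf(0.10))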

Solution 3.58 We apply the wilcoxon test from scipy on the differences of oelect
from 220.

oelect = mistat.load_data('OELECT.csv')
print(stats.wilcoxon(oelect-220))

WilcoxonResult(statistic=1916.0, pvalue=0.051047599707252124)

The 𝑃 value of the test is 0.051, so the null hypothesis is rejected at the 0.10 significance level but not at the 0.05 level.

Solution 3.59 In Python

car = mistat.load_data('CAR.csv')
fourCylinder = car[car['cyl'] == 4]
uscars = fourCylinder[fourCylinder['origin'] == 1]
foreign = fourCylinder[fourCylinder['origin'] != 1]

print(f'Mean of Sample 1 (U.S. made) {np.mean(uscars["turn"]):.3f}')


print(f'Mean of Sample 2 (foreign) {np.mean(foreign["turn"]):.3f}')

_ = mistat.randomizationTest(uscars['turn'], foreign['turn'], np.mean,


aggregate_stats=lambda x: x[0] - x[1],
n_boot=1000, seed=1)

Mean of Sample 1 (U.S. made) 36.255


Mean of Sample 2 (foreign) 33.179
Original stat is 3.075758
Original stat is at quantile 1001 of 1001 (100.00%)
Distribution of bootstrap samples:
min: -2.12, median: 0.01, max: 2.68

The original stat 3.08 is outside of the distribution of the bootstrap samples. The
difference between the means of the turn diameters is therefore significant. Foreign
cars have on the average a smaller turn diameter.
Chapter 4
Variability in Several Dimensions and
Regression Models

Import required modules and define required functions

import random
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats as sms
import seaborn as sns
import matplotlib.pyplot as plt

import mistat

Solution 4.1 In Fig. 4.1, one sees that horsepower and miles per gallon are inversely { fig:ex_car_pairplot}
proportional. Turn diameter seems to increase with horsepower.

car = mistat.load_data('CAR')
sns.pairplot(car[['turn', 'hp', 'mpg']])
plt.show()


Solution 4.2 The box plots in Fig. 4.2 show that cars from Asia generally have the { fig:ex_car_boxplots}
smallest turn diameter. The maximal turn diameter of cars from Asia is smaller
than the median turn diameter of U.S. cars. European cars tend to have larger turn
diameter than those from Asia, but smaller than those from the U.S.

car = mistat.load_data('CAR')

ax = car.boxplot(column='turn', by='origin')
ax.set_title('')
ax.get_figure().suptitle('')
ax.set_xlabel('origin')
plt.show()

Fig. 4.1 Scatterplot matrix for CAR dataset (turn, hp, mpg) { fig:ex_car_pairplot}

Fig. 4.2 Boxplots of turn diameter by origin for CAR dataset { fig:ex_car_boxplots}

Fig. 4.3 Multiple box plots of Res 3 grouped by hybrid number { fig:ex_hadpas_plot_i}

Solution 4.3 (i) The multiple box plots (see Fig. 4.3) show that the conditional { fig:ex_hadpas_plot_i}
distributions of res3 at different hybrids are different.

hadpas = mistat.load_data('HADPAS')
ax = hadpas.boxplot(column='res3', by='hyb')
ax.set_title('')
ax.get_figure().suptitle('')
ax.set_xlabel('Hybrid number')
ax.set_ylabel('Res 3')
plt.show()

(ii) The matrix plot of all the Res variables (see Fig. 4.4) reveals that Res 3 and { fig:ex_hadpas_plot_ii}
Res 7 are positively correlated. Res 20 is generally larger than the corresponding Res
14. Res 18 and Res 20 seem to be negatively associated.

sns.pairplot(hadpas[['res3', 'res7', 'res18', 'res14', 'res20']])


plt.show()


Solution 4.4 The joint frequency distribution of horsepower versus miles per gallon
is

car = mistat.load_data('CAR')
binned_car = pd.DataFrame({
'hp': pd.cut(car['hp'], bins=np.arange(50, 275, 25)),
'mpg': pd.cut(car['mpg'], bins=np.arange(10, 40, 5)),
})
freqDist = pd.crosstab(binned_car['hp'], binned_car['mpg'])
print(freqDist)
# You can get distributions for hp and mpg by summing along an axis
print(freqDist.sum(axis=0))
print(freqDist.sum(axis=1))

Fig. 4.4 Scatterplot matrix Res variables for HADPAS dataset (res3, res7, res18, res14, res20) { fig:ex_hadpas_plot_ii}

mpg (10, 15] (15, 20] (20, 25] (25, 30] (30, 35]
hp
(50, 75] 0 1 0 4 2
(75, 100] 0 0 23 11 0
(100, 125] 0 10 11 1 0
(125, 150] 0 14 3 1 0
(150, 175] 0 17 0 0 0
(175, 200] 1 5 0 0 0
(200, 225] 1 3 0 0 0
(225, 250] 0 1 0 0 0
mpg
(10, 15] 2
(15, 20] 51
(20, 25] 37
(25, 30] 17
(30, 35] 2
dtype: int64
hp
(50, 75] 7

(75, 100] 34
(100, 125] 22
(125, 150] 18
(150, 175] 17
(175, 200] 6
(200, 225] 4
(225, 250] 1
dtype: int64

The intervals for HP are from 50 to 250 at fixed length of 25. The intervals for
MPG are from 10 to 35 at length 5. Students may get different results by defining
the intervals differently.

Solution 4.5 The joint frequency distribution of Res 3 and Res 14 is given in the
following table:

hadpas = mistat.load_data('HADPAS')
binned_hadpas = pd.DataFrame({
'res3': pd.cut(hadpas['res3'], bins=np.arange(1580, 2780, 200)),
'res14': pd.cut(hadpas['res14'], bins=np.arange(900, 3000, 300)),
})
pd.crosstab(binned_hadpas['res14'], binned_hadpas['res3'])

res3 (1580, 1780] (1780, 1980] (1980, 2180] (2180, 2380]


res14
(900, 1200] 11 3 3 0 \
(1200, 1500] 11 33 28 2
(1500, 1800] 5 16 24 6
(1800, 2100] 2 11 12 5
(2100, 2400] 0 9 8 1
(2400, 2700] 0 0 1 0

res3 (2380, 2580]


res14
(900, 1200] 0
(1200, 1500] 0
(1500, 1800] 0
(1800, 2100] 1
(2100, 2400] 0
(2400, 2700] 0

The intervals for Res 3 start at 1580 and end at 2580 with length of 200. The
intervals of Res 14 start at 900 and end at 2700 with length of 300.

Solution 4.6 The following is the conditional frequency distribution of Res 3, given
that Res 14 is between 1300 and 1500 ohms:

hadpas = mistat.load_data('HADPAS')
in_range = hadpas[hadpas['res14'].between(1300, 1500)]
pd.cut(in_range['res3'], bins=np.arange(1580, 2780, 200)).value_counts(sort=False)

res3
(1580, 1780] 8
(1780, 1980] 21
(1980, 2180] 25
(2180, 2380] 2
(2380, 2580] 0
Name: count, dtype: int64

Solution 4.7 Following the instructions in the question we obtained the following
results:

hadpas = mistat.load_data('HADPAS')
bins = [900, 1200, 1500, 1800, 2100, 3000]
binned_res14 = pd.cut(hadpas['res14'], bins=bins)

results = []
for group, df in hadpas.groupby(binned_res14):
    res3 = df['res3']
    results.append({
        'res3': group,
        'N': len(res3),
        'mean': res3.mean(),
        'std': res3.std(),
    })
pd.DataFrame(results)

res3 N mean std


0 (900, 1200] 17 1779.117647 162.348730
1 (1200, 1500] 74 1952.175676 154.728251
2 (1500, 1800] 51 1997.196078 151.608841
3 (1800, 2100] 31 2024.774194 156.749845
4 (2100, 3000] 19 1999.736842 121.505758

Solution 4.8 In Python

df = pd.DataFrame([
[10.0, 8.04, 10.0, 9.14, 10.0, 7.46, 8.0, 6.58],
[8.0, 6.95, 8.0, 8.14, 8.0, 6.77, 8.0, 5.76],
[13.0, 7.58, 13.0, 8.74, 13.0, 12.74, 8.0, 7.71],
[9.0, 8.81, 9.0, 8.77, 9.0, 7.11, 8.0, 8.84],
[11.0, 8.33, 11.0, 9.26, 11.0, 7.81, 8.0, 8.47],
[14.0, 9.96, 14.0, 8.10, 14.0, 8.84, 8.0, 7.04],
[6.0, 7.24, 6.0, 6.13, 6.0, 6.08, 8.0, 5.25],
[4.0, 4.26, 4.0, 3.10, 4.0, 5.39, 19.0, 12.50],
[12.0, 10.84, 12.0, 9.13, 12.0, 8.15, 8.0, 5.56],
[7.0, 4.82, 7.0, 7.26, 7.0, 6.42, 8.0, 7.91],
[5.0, 5.68, 5.0, 4.74, 5.0, 5.73, 8.0, 6.89],
], columns=['x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'x4', 'y4'])

results = []
for i in (1, 2, 3, 4):
    x = df[f'x{i}']
    y = df[f'y{i}']
    model = smf.ols(formula=f'y{i} ~ 1 + x{i}', data=df).fit()
    results.append({
        'Data Set': i,
        'Intercept': model.params['Intercept'],
        'Slope': model.params[f'x{i}'],
        'R2': model.rsquared,
    })
pd.DataFrame(results)

Data Set Intercept Slope R2


0 1 3.000091 0.500091 0.666542
1 2 3.000909 0.500000 0.666242
2 3 3.002455 0.499727 0.666324
3 4 3.001727 0.499909 0.666707

Notice the influence of the point (19,12.5) on the regression in Data Set 4. Without
this point the correlation between 𝑥 and 𝑦 is zero.

Fig. 4.5 Anscombe’s quartet { fig:AnscombeQuartet}

{ fig:AnscombeQuartet} The dataset is known as Anscombe’s quartet (see Fig. 4.5). Not only do the four sets have identical linear regressions, they also have identical means and variances of 𝑥 and 𝑦, and identical correlations between 𝑥 and 𝑦. The dataset clearly demonstrates the importance of visualization in data analysis.

Solution 4.9 The correlation matrix:

car = mistat.load_data('CAR')
car[['turn', 'hp', 'mpg']].corr()

turn hp mpg
turn 1.000000 0.507610 -0.541061
hp 0.507610 1.000000 -0.754716
mpg -0.541061 -0.754716 1.000000

Solution 4.10 $SSE = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2})^2$.

(i)

$$\frac{\partial}{\partial \beta_0} SSE = -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2}),$$
$$\frac{\partial}{\partial \beta_1} SSE = -2 \sum_{i=1}^{n} X_{i1}(Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2}),$$
$$\frac{\partial}{\partial \beta_2} SSE = -2 \sum_{i=1}^{n} X_{i2}(Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2}).$$

Equating these partial derivatives to zero and arranging terms, we arrive at the following set of linear equations:

$$\begin{pmatrix} n & \sum_{i=1}^{n} X_{i1} & \sum_{i=1}^{n} X_{i2} \\ \sum_{i=1}^{n} X_{i1} & \sum_{i=1}^{n} X_{i1}^2 & \sum_{i=1}^{n} X_{i1}X_{i2} \\ \sum_{i=1}^{n} X_{i2} & \sum_{i=1}^{n} X_{i1}X_{i2} & \sum_{i=1}^{n} X_{i2}^2 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_{i1}Y_i \\ \sum_{i=1}^{n} X_{i2}Y_i \end{pmatrix}$$

(ii) Let $b_0, b_1, b_2$ be the (unique) solution. From the first equation we get, after dividing by $n$, $b_0 = \bar{Y} - \bar{X}_1 b_1 - \bar{X}_2 b_2$, where $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$, $\bar{X}_1 = \frac{1}{n}\sum_{i=1}^{n} X_{i1}$, $\bar{X}_2 = \frac{1}{n}\sum_{i=1}^{n} X_{i2}$. Substituting $b_0$ in the second and third equations and arranging terms, we obtain the reduced system of equations:

$$\begin{pmatrix} Q_1 - n\bar{X}_1^2 & P_{12} - n\bar{X}_1\bar{X}_2 \\ P_{12} - n\bar{X}_1\bar{X}_2 & Q_2 - n\bar{X}_2^2 \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} P_{1y} - n\bar{X}_1\bar{Y} \\ P_{2y} - n\bar{X}_2\bar{Y} \end{pmatrix}$$

where $Q_1 = \sum X_{i1}^2$, $Q_2 = \sum X_{i2}^2$, $P_{12} = \sum X_{i1}X_{i2}$, $P_{1y} = \sum X_{i1}Y_i$ and $P_{2y} = \sum X_{i2}Y_i$. Dividing both sides by $(n-1)$ we obtain Eq. (4.4.3), and solving we get $b_1$ and $b_2$.
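As a numerical illustration (not part of the original solution), the normal equations can be solved directly with numpy; using mpg as 𝑌 and hp, turn as 𝑋1, 𝑋2 from CAR.csv, the solution agrees with the least-squares fit of the next exercise:

car = mistat.load_data('CAR')
X = np.column_stack([np.ones(len(car)), car['hp'], car['turn']])
y = car['mpg'].values
# solve (X'X) b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)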

Solution 4.11 We get the following result using statsmodels

car = mistat.load_data('CAR')

model = smf.ols(formula='mpg ~ 1 + hp + turn', data=car).fit()


print(model.summary2())

Results: Ordinary least squares


=================================================================
Model: OLS Adj. R-squared: 0.596
Dependent Variable: mpg AIC: 511.2247
Date: 2023-05-03 21:16 BIC: 519.2988
No. Observations: 109 Log-Likelihood: -252.61
Df Model: 2 F-statistic: 80.57
Df Residuals: 106 Prob (F-statistic): 5.29e-22
R-squared: 0.603 Scale: 6.2035
------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
------------------------------------------------------------------
Intercept 38.2642 2.6541 14.4167 0.0000 33.0020 43.5263
hp -0.0631 0.0069 -9.1070 0.0000 -0.0768 -0.0493

turn -0.2510 0.0838 -2.9965 0.0034 -0.4171 -0.0849


-----------------------------------------------------------------
Omnibus: 12.000 Durbin-Watson: 2.021
Prob(Omnibus): 0.002 Jarque-Bera (JB): 25.976
Skew: -0.335 Prob(JB): 0.000
Kurtosis: 5.296 Condition No.: 1507
=================================================================
* The condition number is large (2e+03). This might indicate
strong multicollinearity or other numerical problems.

The regression equation is MPG = 38.3 − 0.251 × turn − 0.0631 × hp.

We see that only 60% of the variability in MPG is explained by the linear relationship with Turn and HP. Both variables contribute significantly to the regression.

Solution 4.12 The partial correlation is −0.70378.

car = mistat.load_data('CAR')

# y: mpg, x1: cyl, x2: hp


model_1 = smf.ols(formula='mpg ~ cyl + 1', data=car).fit()
e_1 = model_1.resid

model_2 = smf.ols(formula='hp ~ cyl + 1', data=car).fit()


e_2 = model_2.resid

print(f'Partial correlation {stats.pearsonr(e_1, e_2)[0]:.5f}')

Partial correlation -0.70378
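The same partial correlation can also be obtained, as a cross-check, with pingouin's partial_corr function (pingouin is imported as pg in the chapter header):

print(pg.partial_corr(data=car, x='mpg', y='hp', covar='cyl'))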

Solution 4.13 In Python:

car = mistat.load_data('CAR')

# y: mpg, x1: hp, x2: turn


model_1 = smf.ols(formula='mpg ~ hp + 1', data=car).fit()
e_1 = model_1.resid
print('Model mpg ~ hp + 1:\n', model_1.params)

model_2 = smf.ols(formula='turn ~ hp + 1', data=car).fit()


e_2 = model_2.resid
print('Model turn ~ hp + 1:\n', model_2.params)

print('Partial correlation', stats.pearsonr(e_1, e_2)[0])


df = pd.DataFrame({'e1': e_1, 'e2': e_2})
model_partial = smf.ols(formula='e1 ~ e2 - 1', data=df).fit()
# print(model_partial.summary2())
print('Model e1 ~ e2:\n', model_partial.params)

Model mpg ~ hp + 1:
Intercept 30.663308
hp -0.073611
dtype: float64
Model turn ~ hp + 1:
Intercept 30.281255
hp 0.041971
dtype: float64
Partial correlation -0.27945246615045016
Model e1 ~ e2:
e2 -0.251008
dtype: float64

The partial regression equation is 𝑒ˆ1 = −0.251𝑒ˆ2 .

Solution 4.14 The regression of MPG on HP is MPG = 30.6633 − 0.07361 HP. The regression of TurnD on HP is TurnD = 30.2813 + 0.041971 HP. The regression of the residuals 𝑒̂1 on 𝑒̂2 is 𝑒̂1 = −0.251 · 𝑒̂2. Thus,

Const.: 𝑏0 = 30.6633 + 30.2813 × 0.251 = 38.2639
HP: 𝑏1 = −0.07361 + 0.041971 × 0.251 = −0.063076
TurnD: 𝑏2 = −0.251.

Solution 4.15 The regression of Cap Diameter on Diam2 and Diam3 is

almpin = mistat.load_data('ALMPIN')
model = smf.ols('capDiam ~ 1 + diam2 + diam3', data=almpin).fit()
model.summary2()

<class 'statsmodels.iolib.summary2.Summary'>
"""
Results: Ordinary least squares
===================================================================
Model: OLS Adj. R-squared: 0.842
Dependent Variable: capDiam AIC: -482.1542
Date: 2023-05-03 21:16 BIC: -475.4087
No. Observations: 70 Log-Likelihood: 244.08
Df Model: 2 F-statistic: 184.2
Df Residuals: 67 Prob (F-statistic): 5.89e-28
R-squared: 0.846 Scale: 5.7272e-05
---------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
---------------------------------------------------------------------
Intercept 4.7565 0.5501 8.6467 0.0000 3.6585 5.8544
diam2 0.5040 0.1607 3.1359 0.0025 0.1832 0.8248
diam3 0.5203 0.1744 2.9830 0.0040 0.1722 0.8684
-------------------------------------------------------------------
Omnibus: 1.078 Durbin-Watson: 2.350
Prob(Omnibus): 0.583 Jarque-Bera (JB): 0.976
Skew: -0.071 Prob(JB): 0.614
Kurtosis: 2.439 Condition No.: 8689
===================================================================
* The condition number is large (9e+03). This might indicate
strong multicollinearity or other numerical problems.
"""

The dependence of CapDiam on Diam2, without Diam1 is significant. This is due


to the fact that Diam1 and Diam2 are highly correlated (𝜌 = 0.957). If Diam1 is in
the regression, then Diam2 does not furnish additional information on CapDiam. If
Diam1 is not included then Diam2 is very informative.

Solution 4.16 The regression of yield (𝑌𝑖𝑒𝑙𝑑) on the four variables is:

gasol = mistat.load_data('GASOL')
# rename column 'yield' to 'Yield' as 'yield' is a special keyword in Python
gasol = gasol.rename(columns={'yield': 'Yield'})
model = smf.ols(formula='Yield ~ x1 + x2 + astm + endPt',
data=gasol).fit()
print(model.summary2())

Results: Ordinary least squares


=================================================================
Model: OLS Adj. R-squared: 0.957
Dependent Variable: Yield AIC: 146.8308
Date: 2023-05-03 21:16 BIC: 154.1595
No. Observations: 32 Log-Likelihood: -68.415
Df Model: 4 F-statistic: 171.7
Df Residuals: 27 Prob (F-statistic): 8.82e-19
R-squared: 0.962 Scale: 4.9927
------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
------------------------------------------------------------------
Intercept -6.8208 10.1232 -0.6738 0.5062 -27.5918 13.9502
x1 0.2272 0.0999 2.2739 0.0311 0.0222 0.4323
x2 0.5537 0.3698 1.4976 0.1458 -0.2049 1.3124
astm -0.1495 0.0292 -5.1160 0.0000 -0.2095 -0.0896
endPt 0.1547 0.0064 23.9922 0.0000 0.1414 0.1679
-----------------------------------------------------------------
Omnibus: 0.635 Durbin-Watson: 1.402
Prob(Omnibus): 0.728 Jarque-Bera (JB): 0.719
Skew: 0.190 Prob(JB): 0.698
Kurtosis: 2.371 Condition No.: 10714
=================================================================
* The condition number is large (1e+04). This might indicate
strong multicollinearity or other numerical problems.

(i) The regression equation is

𝑦ˆ = −6.8 + 0.227𝑥1 + 0.554𝑥2 − 0.150 astm + 0.155 endPt.

(ii) 𝑅 2 = 0.962.
(iii) The regression coefficient of 𝑥 2 is not significant.
(iv) Running the multiple regression again, without 𝑥2 , we obtain the equation

model = smf.ols(formula='Yield ~ x1 + astm + endPt', data=gasol).fit()


print(model.summary2())

Results: Ordinary least squares


=================================================================
Model: OLS Adj. R-squared: 0.955
Dependent Variable: Yield AIC: 147.3842
Date: 2023-05-03 21:16 BIC: 153.2471
No. Observations: 32 Log-Likelihood: -69.692
Df Model: 3 F-statistic: 218.5
Df Residuals: 28 Prob (F-statistic): 1.59e-19
R-squared: 0.959 Scale: 5.2143
------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
------------------------------------------------------------------
Intercept 4.0320 7.2233 0.5582 0.5811 -10.7643 18.8284
x1 0.2217 0.1021 2.1725 0.0384 0.0127 0.4308
astm -0.1866 0.0159 -11.7177 0.0000 -0.2192 -0.1540
endPt 0.1565 0.0065 24.2238 0.0000 0.1433 0.1698
-----------------------------------------------------------------
Omnibus: 0.679 Durbin-Watson: 1.150
Prob(Omnibus): 0.712 Jarque-Bera (JB): 0.738
Skew: 0.174 Prob(JB): 0.692
Kurtosis: 2.343 Condition No.: 7479
=================================================================
* The condition number is large (7e+03). This might indicate
strong multicollinearity or other numerical problems.

{ fig:qqPlotGasolRegResid} Fig. 4.6 𝑄–𝑄 plot of gasol regression residuals (𝑅² = 0.982)

𝑦̂ = 4.03 + 0.222𝑥1 − 0.187 astm + 0.157 endPt,

with 𝑅² = 0.959. Variables 𝑥1, astm and endPt are important.
(v) Normal probability plotting of the residuals 𝑒̂ from the equation of (iv) shows that they are normally distributed (see Fig. 4.6). { fig:qqPlotGasolRegResid}

Solution 4.17 (i) 𝐻 = 𝑋𝐵 = 𝑋[𝑋′𝑋]⁻¹𝑋′, so

𝐻² = 𝑋[𝑋′𝑋]⁻¹𝑋′𝑋[𝑋′𝑋]⁻¹𝑋′ = 𝑋[𝑋′𝑋]⁻¹𝑋′ = 𝐻.

(ii) 𝑄 = 𝐼 − 𝐻, so

𝑄² = (𝐼 − 𝐻)(𝐼 − 𝐻) = 𝐼 − 𝐻 − 𝐻 + 𝐻² = 𝐼 − 𝐻 = 𝑄.

Hence 𝑠²_𝑒 = y′𝑄𝑄y/(𝑛 − 𝑘 − 1) = y′𝑄y/(𝑛 − 𝑘 − 1).
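A numerical illustration (not part of the original solution) of the idempotency of 𝐻 and 𝑄, using hp and turn from CAR.csv as predictor columns:

car = mistat.load_data('CAR')
X = np.column_stack([np.ones(len(car)), car['hp'], car['turn']])
H = X @ np.linalg.inv(X.T @ X) @ X.T
Q = np.eye(len(car)) - H
print(np.allclose(H @ H, H), np.allclose(Q @ Q, Q))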

Solution 4.18 We have ŷ = 𝑋𝜷̂ = 𝑋𝐵y = 𝐻y and ê = 𝑄y = (𝐼 − 𝐻)y. Therefore

ŷ′ê = y′𝐻(𝐼 − 𝐻)y = y′𝐻y − y′𝐻²y = 0.

Solution 4.19 1 − 𝑅²_{𝑦(𝑥)} = 𝑆𝑆𝐸/𝑆𝑆𝐷_𝑦, where 𝑆𝑆𝐸 = ê′ê = ‖ê‖².

Solution 4.20 From the basic properties of the cov(𝑋, 𝑌) operator,

cov(∑_{i=1}^n 𝛽_i 𝑋_i, ∑_{j=1}^n 𝛾_j 𝑋_j) = ∑_{i=1}^n ∑_{j=1}^n 𝛽_i 𝛾_j cov(𝑋_i, 𝑋_j) = ∑_{i=1}^n ∑_{j=1}^n 𝛽_i 𝛾_j 𝚺_{ij} = 𝜷′𝚺𝜸.

Solution 4.21 W = (𝑊_1, . . . , 𝑊_m)′ where 𝑊_i = b_i′X (𝑖 = 1, . . . , 𝑚), and b_i′ is the 𝑖-th row vector of 𝐵. Thus, by the previous exercise, cov(𝑊_i, 𝑊_j) = b_i′𝚺b_j. This is the (𝑖, 𝑗) element of the covariance matrix of W. Hence, the covariance matrix of W is C(W) = 𝐵𝚺𝐵′.

Solution 4.22 From the model, 𝚺(Y) = 𝜎²𝐼 and b = 𝐵Y. Hence

𝚺(b) = 𝐵𝚺(Y)𝐵′ = 𝜎²𝐵𝐵′ = 𝜎²[𝑋′𝑋]⁻¹𝑋′𝑋[𝑋′𝑋]⁻¹ = 𝜎²[𝑋′𝑋]⁻¹.

Solution 4.23 Rearrange the dataset into the format suitable for the test outlines in
the multiple linear regression section.

socell = mistat.load_data('SOCELL')

# combine the two datasets and add the additional columns z and w
socell_1 = socell[['t3', 't1']].copy()
socell_1.columns = ['t3', 't']
socell_1['z'] = 0
socell_2 = socell[['t3', 't2']].copy()
socell_2.columns = ['t3', 't']
socell_2['z'] = 1
combined = pd.concat([socell_1, socell_2])
combined['w'] = combined['z'] * combined['t']

# multiple linear regression model


model_test = smf.ols(formula='t3 ~ t + z + w + 1',
data=combined).fit()
print(model_test.summary2())

Results: Ordinary least squares


==================================================================

Model: OLS Adj. R-squared: 0.952


Dependent Variable: t3 AIC: -58.2308
Date: 2023-05-03 21:16 BIC: -52.3678
No. Observations: 32 Log-Likelihood: 33.115
Df Model: 3 F-statistic: 205.8
Df Residuals: 28 Prob (F-statistic): 3.55e-19
R-squared: 0.957 Scale: 0.0084460
--------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
--------------------------------------------------------------------
Intercept 0.5187 0.2144 2.4196 0.0223 0.0796 0.9578
t 0.9411 0.0539 17.4664 0.0000 0.8307 1.0515
z -0.5052 0.3220 -1.5688 0.1279 -1.1648 0.1545
w 0.0633 0.0783 0.8081 0.4259 -0.0971 0.2237
------------------------------------------------------------------
Omnibus: 1.002 Durbin-Watson: 1.528
Prob(Omnibus): 0.606 Jarque-Bera (JB): 0.852
Skew: -0.378 Prob(JB): 0.653
Kurtosis: 2.739 Condition No.: 111
==================================================================

Neither 𝑧 nor 𝑤 is significant: the 𝑃-values corresponding to 𝑧 and 𝑤 are 0.128 and 0.426, respectively. Accordingly, we can conclude that the slopes and intercepts of the two simple linear regressions given above are not significantly different. Combining the data we have the following regression line for the combined dataset:

model_combined = smf.ols(formula='t3 ~ t + 1',


data=combined).fit()
print(model_combined.summary2())

Results: Ordinary least squares


=================================================================
Model: OLS Adj. R-squared: 0.870
Dependent Variable: t3 AIC: -28.1379
Date: 2023-05-03 21:16 BIC: -25.2064
No. Observations: 32 Log-Likelihood: 16.069
Df Model: 1 F-statistic: 208.3
Df Residuals: 30 Prob (F-statistic): 4.86e-15
R-squared: 0.874 Scale: 0.022876
-------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
-------------------------------------------------------------------
Intercept 0.6151 0.2527 2.4343 0.0211 0.0990 1.1311
t 0.8882 0.0615 14.4327 0.0000 0.7625 1.0139
-----------------------------------------------------------------
Omnibus: 1.072 Durbin-Watson: 0.667
Prob(Omnibus): 0.585 Jarque-Bera (JB): 0.883
Skew: 0.114 Prob(JB): 0.643
Kurtosis: 2.219 Condition No.: 41
=================================================================

Solution 4.24 Load the data frame

df = mistat.load_data('CEMENT.csv')

(a) Regression of 𝑌 on 𝑋1 is

model1 = smf.ols('y ~ x1 + 1', data=df).fit()


print(model1.summary().tables[1])
r2 = model1.rsquared

print(f'R-sq: {r2:.3f}')

anova = sms.anova.anova_lm(model1)
print('Analysis of Variance\n', anova)

F = anova.F['x1']
SSE_1 = anova.sum_sq['Residual']

==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 81.4793 4.927 16.536 0.000 70.634 92.324
x1 1.8687 0.526 3.550 0.005 0.710 3.027
==============================================================================
R-sq: 0.534
Analysis of Variance
df sum_sq mean_sq F PR(>F)
x1 1.0 1450.076328 1450.076328 12.602518 0.004552
Residual 11.0 1265.686749 115.062432 NaN NaN

• 𝑅²_{𝑌|(𝑋1)} = 0.534
• 𝑆𝑆𝐸_1 = 1265.7
• 𝐹 = 12.60 (in the 1st stage 𝐹 is equal to the partial-𝐹)
(b) The regression of 𝑌 on 𝑋1 and 𝑋2 is

model2 = smf.ols('y ~ x1 + x2 + 1', data=df).fit()


r2 = model2.rsquared
print(model2.summary().tables[1])
print(f'R-sq: {r2:.3f}')
anova = sms.anova.anova_lm(model2)
print('Analysis of Variance\n', anova)
SEQ_SS_X2 = anova.sum_sq['x2']
SSE_2 = anova.sum_sq['Residual']
s2e2 = anova.mean_sq['Residual']
partialF = np.sum(anova.sum_sq) * (model2.rsquared - model1.rsquared) / s2e2

anova = sms.anova.anova_lm(model1, model2)


print('Comparing models\n', anova)
partialF = anova.F[1]

==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 52.5773 2.286 22.998 0.000 47.483 57.671
x1 1.4683 0.121 12.105 0.000 1.198 1.739
x2 0.6623 0.046 14.442 0.000 0.560 0.764
==============================================================================
R-sq: 0.979
Analysis of Variance
df sum_sq mean_sq F PR(>F)
x1 1.0 1450.076328 1450.076328 250.425571 2.088092e-08
x2 1.0 1207.782266 1207.782266 208.581823 5.028960e-08
Residual 10.0 57.904483 5.790448 NaN NaN
Comparing models
df_resid ssr df_diff ss_diff F Pr(>F)
0 11.0 1265.686749 0.0 NaN NaN NaN
1 10.0 57.904483 1.0 1207.782266 208.581823 5.028960e-08

• 𝑅²_{𝑌|(𝑋1,𝑋2)} = 0.979
• 𝑆𝑆𝐸_2 = 57.9
• 𝑠²_{𝑒2} = 5.79
• 𝐹 = 12.60
• Partial-𝐹 = 208.582
Notice that SEQ SS for 𝑋2 = 2716.9 × (0.974 − 0.529) = 1207.782.
(c) The regression of 𝑌 on 𝑋1 , 𝑋2 , and 𝑋3 is

model3 = smf.ols('y ~ x1 + x2 + x3 + 1', data=df).fit()


r2 = model3.rsquared
print(model3.summary().tables[1])
print(f'R-sq: {r2:.3f}')
anova = sms.anova.anova_lm(model3)
print('Analysis of Variance\n', anova)
SEQ_SS_X3 = anova.sum_sq['x3']
SSE_3 = anova.sum_sq['Residual']
s2e3 = anova.mean_sq['Residual']

anova = sms.anova.anova_lm(model2, model3)


print('Comparing models\n', anova)
partialF = anova.F[1]

==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 48.1936 3.913 12.315 0.000 39.341 57.046
x1 1.6959 0.205 8.290 0.000 1.233 2.159
x2 0.6569 0.044 14.851 0.000 0.557 0.757
x3 0.2500 0.185 1.354 0.209 -0.168 0.668
==============================================================================
R-sq: 0.982
Analysis of Variance
df sum_sq mean_sq F PR(>F)
x1 1.0 1450.076328 1450.076328 271.264194 4.995767e-08
x2 1.0 1207.782266 1207.782266 225.938509 1.107893e-07
x3 1.0 9.793869 9.793869 1.832128 2.088895e-01
Residual 9.0 48.110614 5.345624 NaN NaN
Comparing models
df_resid ssr df_diff ss_diff F Pr(>F)
0 10.0 57.904483 0.0 NaN NaN NaN
1 9.0 48.110614 1.0 9.793869 1.832128 0.208889

• 𝑅²_{𝑌|(𝑋1,𝑋2,𝑋3)} = 0.982
• Partial-𝐹 = 1.832
The SEQ SS of 𝑋3 is 9.79. The .95-quantile of 𝐹[1, 9] is 5.117. Thus, the contribution of 𝑋3 is not significant.
(d) The regression of 𝑌 on 𝑋1 , 𝑋2 , 𝑋3 , and 𝑋4 is

model4 = smf.ols('y ~ x1 + x2 + x3 + x4 + 1', data=df).fit()


r2 = model4.rsquared
print(model4.summary().tables[1])
print(f'R-sq: {r2:.3f}')
anova = sms.anova.anova_lm(model4)
print('Analysis of Variance\n', anova)
SEQ_SS_X4 = anova.sum_sq['x4']
SSE_4 = anova.sum_sq['Residual']
s2e4 = anova.mean_sq['Residual']

anova = sms.anova.anova_lm(model3, model4)


print('Comparing models\n', anova)
partialF = anova.F[1]

==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 62.4054 70.071 0.891 0.399 -99.179 223.989
x1 1.5511 0.745 2.083 0.071 -0.166 3.269
x2 0.5102 0.724 0.705 0.501 -1.159 2.179
x3 0.1019 0.755 0.135 0.896 -1.638 1.842
x4 -0.1441 0.709 -0.203 0.844 -1.779 1.491
==============================================================================
R-sq: 0.982
Analysis of Variance
df sum_sq mean_sq F PR(>F)
x1 1.0 1450.076328 1450.076328 242.367918 2.887559e-07
x2 1.0 1207.782266 1207.782266 201.870528 5.863323e-07
x3 1.0 9.793869 9.793869 1.636962 2.366003e-01
x4 1.0 0.246975 0.246975 0.041280 8.440715e-01
Residual 8.0 47.863639 5.982955 NaN NaN
Comparing models
df_resid ssr df_diff ss_diff F Pr(>F)
0 9.0 48.110614 0.0 NaN NaN NaN
1 8.0 47.863639 1.0 0.246975 0.04128 0.844071

• $R^2_{Y|(X_1,X_2,X_3,X_4)} = 0.982$
• Partial-$F$ = 0.041
The effect of 𝑋4 is not significant.
Solution 4.25 Using the step-wise regression method from the mistat package, we
get:

outcome = 'y'
all_vars = ['x1', 'x2', 'x3', 'x4']

included, model = mistat.stepwise_regression(outcome, all_vars, df)

formula = ' + '.join(included)


formula = f'{outcome} ~ 1 + {formula}'
print()
print('Final model')
print(formula)
print(model.params)

Step 1 add - (F: 22.80) x4


Step 2 add - (F: 108.22) x1 x4
Step 3 add - (F: 5.03) x1 x2 x4

Final model
y ~ 1 + x2 + x4 + x1
Intercept 71.648307
x2 0.416110
x4 -0.236540
x1 1.451938
dtype: float64

Solution 4.26 Build the regression model using statsmodels.

car = mistat.load_data('CAR')
car_3 = car[car['origin'] == 3]
print('Full dataset shape', car.shape)
print('Origin 3 dataset shape', car_3.shape)
model = smf.ols(formula='mpg ~ hp + 1', data=car_3).fit()
print(model.summary2())

Full dataset shape (109, 5)


Origin 3 dataset shape (37, 5)
Results: Ordinary least squares
=================================================================
Model: OLS Adj. R-squared: 0.400
Dependent Variable: mpg AIC: 195.6458
Date: 2023-05-03 21:16 BIC: 198.8676
No. Observations: 37 Log-Likelihood: -95.823
Df Model: 1 F-statistic: 25.00
Df Residuals: 35 Prob (F-statistic): 1.61e-05
R-squared: 0.417 Scale: 10.994
------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
------------------------------------------------------------------
Intercept 31.8328 1.8282 17.4117 0.0000 28.1213 35.5444
hp -0.0799 0.0160 -4.9996 0.0000 -0.1123 -0.0474
-----------------------------------------------------------------
Omnibus: 7.375 Durbin-Watson: 1.688
Prob(Omnibus): 0.025 Jarque-Bera (JB): 6.684
Skew: -0.675 Prob(JB): 0.035
Kurtosis: 4.584 Condition No.: 384
=================================================================

Compute the additional properties

influence = model.get_influence()
df = pd.DataFrame({
'hp': car_3['hp'],
'mpg': car_3['mpg'],
'resi': model.resid,
'sres': influence.resid_studentized_internal,
'hi': influence.hat_matrix_diag,
'D': influence.cooks_distance[0],
})
print(df.round(4))

hp mpg resi sres hi D


0 118 25 2.5936 0.7937 0.0288 0.0093
1 161 18 -0.9714 -0.3070 0.0893 0.0046
51 55 17 -10.4392 -3.3101 0.0953 0.5770
52 98 23 -1.0041 -0.3075 0.0299 0.0015
53 92 27 2.5166 0.7722 0.0339 0.0105
54 92 29 4.5166 1.3859 0.0339 0.0337
55 104 20 -3.5248 -1.0781 0.0277 0.0165
56 68 27 0.5993 0.1871 0.0665 0.0012
57 70 31 4.7591 1.4826 0.0627 0.0736
58 110 20 -3.0455 -0.9312 0.0270 0.0120
62 121 19 -3.1668 -0.9699 0.0303 0.0147
63 82 24 -1.2823 -0.3956 0.0442 0.0036
64 110 22 -1.0455 -0.3197 0.0270 0.0014
65 158 19 -0.2110 -0.0664 0.0823 0.0002
71 92 26 1.5166 0.4654 0.0339 0.0038
72 102 22 -1.6846 -0.5154 0.0282 0.0039
73 81 27 1.6378 0.5056 0.0455 0.0061
74 142 18 -2.4892 -0.7710 0.0520 0.0163
75 107 18 -5.2852 -1.6161 0.0271 0.0364
76 160 19 -0.0513 -0.0162 0.0869 0.0000
77 90 24 -0.6432 -0.1975 0.0356 0.0007
78 90 26 1.3568 0.4167 0.0356 0.0032
79 97 21 -3.0840 -0.9446 0.0305 0.0140
80 106 18 -5.3650 -1.6406 0.0273 0.0377
81 140 20 -0.6489 -0.2007 0.0490 0.0010
82 165 18 -0.6518 -0.2071 0.0993 0.0024
94 66 34 7.4396 2.3272 0.0704 0.2051
95 97 23 -1.0840 -0.3320 0.0305 0.0017
96 100 25 1.1557 0.3537 0.0290 0.0019

97 115 24 1.3539 0.4141 0.0278 0.0025


98 115 26 3.3539 1.0259 0.0278 0.0151
99 90 27 2.3568 0.7238 0.0356 0.0097
100 190 19 2.3453 0.7805 0.1786 0.0662
101 115 25 2.3539 0.7200 0.0278 0.0074
102 200 18 2.1441 0.7315 0.2184 0.0748
103 78 28 2.3982 0.7419 0.0497 0.0144
108 64 28 1.2798 0.4012 0.0745 0.0065

Notice that points 51 and 94 have residuals with large magnitude (-10.4, 7.4).
Points 51 and 94 also have the largest Cook’s distance (0.58, 0.21). Points 100 and
102 have high HI values (leverage; 0.18, 0.22).

Solution 4.27 This solution uses version 2 of the Piston simulator.


Run piston simulation for different piston weights and visualize variation of times
(see Fig. 4.7). { fig:anovaWeightPiston}

np.random.seed(1)
settings = {'s': 0.005, 'v0': 0.002, 'k': 1000, 'p0': 90_000,
't': 290, 't0': 340}
results = []
n_simulation = 5
for m in [30, 40, 50, 60]:
simulator = mistat.PistonSimulator(m=m, n_simulation=n_simulation,
**settings)
sim_result = simulator.simulate()
results.extend([m, s] for s in sim_result['seconds'])
results = pd.DataFrame(results, columns=['m', 'seconds'])

group_std = results.groupby('m').std()
pooled_std = np.sqrt(np.sum(group_std**2) / len(group_std))[0]
print('Pooled standard deviation', pooled_std)

group_mean = results.groupby('m').mean()
ax = results.plot.scatter(x='m', y='seconds', color='black')
ax.errorbar(group_mean.index, results.groupby('m').mean().values.flatten(),
yerr=[pooled_std] * 4, color='grey')
plt.show()

Pooled standard deviation 0.013088056556052113

Perform ANOVA of data.

model = smf.ols(formula='seconds ~ C(m)', data=results).fit()


aov_table = sm.stats.anova_lm(model)
aov_table

df sum_sq mean_sq F PR(>F)


C(m) 3.0 0.000333 0.000111 0.648836 0.595055
Residual 16.0 0.002741 0.000171 NaN NaN

We see that the differences between the sample means are not significant in spite
of the apparent upward trend in cycle times.

Solution 4.28 Prepare dataset and visualize distributions (see Fig. 4.8). { fig:boxplotIntegratedCircuits}

df = pd.DataFrame([
[2.58, 2.62, 2.22],
[2.48, 2.77, 1.73],
Fig. 4.7 ANOVA of effect of changing weight in piston simulation { fig:anovaWeightPiston}

{ fig:boxplotIntegratedCircuits} Fig. 4.8 Box plot of pre-etch line width from integrated circuits fabrication process

[2.52, 2.69, 2.00],


[2.50, 2.80, 1.86],
[2.53, 2.87, 2.04],
[2.46, 2.67, 2.15],
[2.52, 2.71, 2.18],
[2.49, 2.77, 1.86],
[2.58, 2.87, 1.84],
[2.51, 2.97, 1.86]
], columns=['Exp. 1', 'Exp. 2', 'Exp. 3'])
df.boxplot()

# Convert data frame to long format using melt


df = df.melt(var_name='Experiment', value_name='mu')

Analysis using ANOVA:



model = smf.ols(formula='mu ~ C(Experiment)', data=df).fit()


aov_table = sm.stats.anova_lm(model)
aov_table

df sum_sq mean_sq F PR(>F)


C(Experiment) 2.0 3.336327 1.668163 120.917098 3.352509e-14
Residual 27.0 0.372490 0.013796 NaN NaN

The difference in the experiments is significant.


Bootstrap test:

experiment = df['Experiment']
mu = df['mu']
def onewayTest(x, verbose=False):
df = pd.DataFrame({
'value': x,
'variable': experiment,
})
aov = pg.anova(dv='value', between='variable', data=df)
return aov['F'].values[0]

B = pg.compute_bootci(mu, func=onewayTest, n_boot=1000,


seed=1, return_dist=True)

Bt0 = onewayTest(mu)
print('Bt0', Bt0)
print('ratio', sum(B[1] >= Bt0)/len(B[1]))

Bt0 120.91709844559576
ratio 0.0

The bootstrap test also shows that the difference in means is significant.

Solution 4.29 Create dataset and visualize distribution (see Fig. 4.9). { fig:filmSpeedData}

df = pd.DataFrame({
'Batch A': [103, 107, 104, 102, 95, 91, 107, 99, 105, 105],
'Batch B': [104, 103, 106, 103, 107, 108, 104, 105, 105, 97],
})
fig, ax = plt.subplots()
bplot1 = ax.boxplot(df, labels=df.columns)
plt.show()

# the following code failed at some point using pandas


# df.boxplot()
# plt.show()

(i) Randomization test (see Sect. 3.13.2) {sec:randomizaton-test}

dist = mistat.randomizationTest(df['Batch A'], df['Batch B'], np.mean,


aggregate_stats=lambda x: x[0] - x[1],
n_boot=10000, seed=1)
# ax = sns.distplot(dist)
# ax.axvline(np.mean(df['Batch A']) - np.mean(df['Batch B']))

Original stat is -2.400000


Original stat is at quantile 1062 of 10001 (10.62%)
Distribution of bootstrap samples:
min: -5.40, median: 0.00, max: 5.60
Fig. 4.9 Box plot of film speed data { fig:filmSpeedData}

The randomization test gave a P value of 0.106. The difference between the means
is not significant.
(ii)

# Convert data frame to long format using melt


df = df.melt(var_name='Batch', value_name='film_speed')

model = smf.ols(formula='film_speed ~ C(Batch)', data=df).fit()


aov_table = sm.stats.anova_lm(model)
aov_table

df sum_sq mean_sq F PR(>F)


C(Batch) 1.0 28.8 28.800000 1.555822 0.228263
Residual 18.0 333.2 18.511111 NaN NaN

The ANOVA also shows no significant difference in the means. The 𝑃 value
is 0.228. Remember that the 𝐹-test in the ANOVA is based on the assumption of
normality and equal variances. The randomization test is nonparametric.
Solution 4.30 Define function that calculates the statistic and execute bootstrap.

def func_stats(x):
m = pd.Series(x).groupby(df['Experiment']).agg(['mean', 'count'])
top = np.sum(m['count'] * m['mean'] ** 2) - len(x)*np.mean(x)**2
return top / np.std(x) ** 2

Bt = []
mu = list(df['mu'])
for _ in range(1000):
mu_star = random.sample(mu, len(mu))
Bt.append(func_stats(mu_star))

Bt0 = func_stats(mu)
print('Bt0', Bt0)
print('ratio', sum(Bt >= Bt0)/len(Bt))

Bt0 26.986990459670288
ratio 0.0
Fig. 4.10 Box plot visualisation of xDev distribution by crcBrd for the PLACE dataset { fig:boxplotXdevCrcBrdPlace}

The result demonstrates that the differences between the experiments are significant.

Solution 4.31 Load the data and visualize the distributions (see Fig. 4.10). { fig:boxplotXdevCrcBrdPlace}

place = mistat.load_data('PLACE')
place.boxplot('xDev', by='crcBrd')
plt.show()

(a) ANOVA for the full dataset

model = smf.ols(formula='xDev ~ C(crcBrd)', data=place).fit()


aov_table = sm.stats.anova_lm(model)
aov_table

df sum_sq mean_sq F PR(>F)


C(crcBrd) 25.0 0.001128 4.512471e-05 203.292511 2.009252e-206
Residual 390.0 0.000087 2.219694e-07 NaN NaN

(b) There seem to be four homogeneous groups: $G_1 = \{1, 2, \dots, 9\}$, $G_2 = \{10, 11, 12\}$, $G_3 = \{13, \dots, 19, 21, \dots, 26\}$, $G_4 = \{20\}$.
In multiple comparisons we use the Scheffé coefficient $S_{.05} = (25 \times F_{.95}[25, 390])^{1/2} = (25 \times 1.534)^{1/2} = 6.193$. The group means and standard errors are:

G1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
G2 = [10, 11, 12]
G3 = [13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26]
G4 = [20]
place['group'] = 'G1'
place.loc[place['crcBrd'].isin(G2), 'group'] = 'G2'
place.loc[place['crcBrd'].isin(G3), 'group'] = 'G3'
place.loc[place['crcBrd'].isin(G4), 'group'] = 'G4'

statistics = place['xDev'].groupby(place['group']).agg(['mean', 'sem', 'count'])
statistics = statistics.sort_values(['mean'], ascending=False)
print(statistics.round(8))

# differences between consecutive (ordered) group means
statistics['Diff'] = 0.0
diff = statistics['mean'][:-1].values - statistics['mean'][1:].values
print(diff)
# use iloc for positional assignment (loc with a positional slice raises a TypeError)
statistics.iloc[1:, statistics.columns.get_loc('Diff')] = diff
statistics['CR'] = 6.193 * statistics['Diff']
print(statistics.round(8))

# Scheffe critical differences based on the standard errors of the group means
sem = statistics['sem'].values
sem = np.sqrt(sem[:-1]**2 + sem[1:]**2)
print(sem * 6.193)

mean sem count


group
G4 0.003778 0.000100 16
G3 0.002268 0.000030 208
G2 0.000006 0.000055 48
G1 -0.001062 0.000050 144
[0.00151029 0.00226138 0.00106826]
[0.00151029 0.00226138 0.00106826]


The differences between the means of the groups are all significant.

from statsmodels.stats.multicomp import pairwise_tukeyhsd


m_comp = pairwise_tukeyhsd(endog=place['xDev'], groups=place['group'],
alpha=0.05)
print(m_comp)

Multiple Comparison of Means - Tukey HSD, FWER=0.05


=================================================
group1 group2 meandiff p-adj lower upper reject
-------------------------------------------------
G1 G2 0.0011 0.0 0.0009 0.0013 True
G1 G3 0.0033 0.0 0.0032 0.0035 True
G1 G4 0.0048 0.0 0.0045 0.0052 True
G2 G3 0.0023 0.0 0.0021 0.0025 True
G2 G4 0.0038 0.0 0.0034 0.0041 True
G3 G4 0.0015 0.0 0.0012 0.0018 True
-------------------------------------------------

Solution
df 4.32
= pd.DataFrame({
'US': [33, 25],
'Europe': [7, 7],

'Asia': [26, 11],


})

print(df)

col_sums = df.sum(axis=0)
row_sums = df.sum(axis=1)
total = df.to_numpy().sum()

expected_frequencies = np.outer(row_sums, col_sums) / total

chi2 = (df - expected_frequencies) ** 2 / expected_frequencies


chi2 = chi2.to_numpy().sum()
print(f'chi2: {chi2:.3f}')
print(f'p-value: {1 - stats.chi2.cdf(chi2, 2):.3f}')

US Europe Asia
0 33 7 26
1 25 7 11
chi2: 2.440
p-value: 0.295

The chi-square test statistic is 𝑋 2 = 2.440 with d.f. = 2 and 𝑃 value = 0.295. The
null hypothesis that the number of cylinders a car has is independent of the origin of
the car is not rejected.
We can also use the scipy function chi2_contingency.

chi2 = stats.chi2_contingency(df)
print(f'chi2-statistic: {chi2[0]:.3f}')
print(f'p-value: {chi2[1]:.3f}')
print(f'd.f.: {chi2[2]}')

chi2-statistic: 2.440
p-value: 0.295
d.f.: 2

Solution 4.33 In Python:

car = mistat.load_data('CAR')
binned_car = pd.DataFrame({
'turn': pd.cut(car['turn'], bins=[27, 30.6, 34.2, 37.8, 45]), #np.arange(27, 50, 3.6)),
'mpg': pd.cut(car['mpg'], bins=[12, 18, 24, 100]),
})
freqDist = pd.crosstab(binned_car['mpg'], binned_car['turn'])
print(freqDist)

chi2 = stats.chi2_contingency(freqDist)
print(f'chi2-statistic: {chi2[0]:.3f}')
print(f'p-value: {chi2[1]:.3f}')
print(f'd.f.: {chi2[2]}')

turn (27.0, 30.6] (30.6, 34.2] (34.2, 37.8] (37.8, 45.0]


mpg
(12, 18] 2 4 10 15
(18, 24] 0 12 26 15
(24, 100] 4 15 6 0
chi2-statistic: 34.990
p-value: 0.000
d.f.: 6

The dependence between turn diameter and miles per gallon is significant.

Solution 4.34 In Python:

question_13 = pd.DataFrame({
'1': [0,0,0,1,0],
'2': [1,0,2,0,0],
'3': [1,2,6,5,1],
'4': [2,1,10,23,13],
'5': [0,1,1,15,100],
}, index = ['1', '2', '3', '4', '5']).transpose()
question_23 = pd.DataFrame({
'1': [1,0,0,3,1],
'2': [2,0,1,0,0],
'3': [0,4,2,3,0],
'4': [1,1,10,7,5],
'5': [0,0,1,30,134],
}, index = ['1', '2', '3', '4', '5']).transpose()

chi2_13 = stats.chi2_contingency(question_13)
chi2_23 = stats.chi2_contingency(question_23)

msc_13 = chi2_13[0] / question_13.to_numpy().sum()


tschuprov_13 = np.sqrt(msc_13 / (2 * 2)) # (4 * 4))
cramer_13 = np.sqrt(msc_13 / 2) # min(4, 4))

msc_23 = chi2_23[0] / question_23.to_numpy().sum()


tschuprov_23 = np.sqrt(msc_23 / 4) # (4 * 4))
cramer_23 = np.sqrt(msc_23 / 2) # min(4, 4))

print('Question 1 vs 3')
print(f' Mean squared contingency : {msc_13:.3f}')
print(f' Tschuprov : {tschuprov_13:.3f}')
print(f" Cramer's index : {cramer_13:.3f}")
print('Question 2 vs 3')
print(f' Mean squared contingency : {msc_23:.3f}')
print(f' Tschuprov : {tschuprov_23:.3f}')
print(f" Cramer's index : {cramer_23:.3f}")

Question 1 vs 3
Mean squared contingency : 0.629
Tschuprov : 0.397
Cramer's index : 0.561
Question 2 vs 3
Mean squared contingency : 1.137
Tschuprov : 0.533
Cramer's index : 0.754
Chapter 5
Sampling for Estimation of Finite Population Quantities

Import required modules and define required functions

import random
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats
import matplotlib.pyplot as plt
import mistat

Solution 5.1 Define the binary random variables

$$I_{ij} = \begin{cases} 1, & \text{if the } j\text{-th element is selected at the } i\text{-th sampling} \\ 0, & \text{otherwise.} \end{cases}$$

The random variables $X_1, \dots, X_n$ are given by $X_i = \sum_{j=1}^{N} x_j I_{ij}$, $i = 1, \dots, n$. Since
sampling is RSWR, $\Pr\{X_i = x_j\} = 1/N$ for all $i = 1, \dots, n$ and $j = 1, \dots, N$.
Hence, $\Pr\{X_i \le x\} = F_N(x)$ for all $x$, and all $i = 1, \dots, n$. Moreover, by definition
of RSWR, the vectors $\mathbf{I}_i = (I_{i1}, \dots, I_{iN})$, $i = 1, \dots, n$, are mutually independent.
Therefore $X_1, \dots, X_n$ are i.i.d., having a common c.d.f. $F_N(x)$.
Solution 5.2 In continuation of the previous exercise, $E\{X_i\} = \frac{1}{N}\sum_{i=1}^{N} x_i = \mu_N$.
Therefore, by the weak law of large numbers, $\lim_{n\to\infty} P\{|\bar{X}_n - \mu_N| < \epsilon\} = 1$.
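A minimal simulation sketch illustrates the convergence; the population below is made up for illustration only:

import numpy as np

rng = np.random.default_rng(1)
population = rng.uniform(0, 100, size=200)   # hypothetical finite population
mu_N = population.mean()
for n in [10, 100, 1000, 10000]:
    sample = rng.choice(population, size=n, replace=True)   # RSWR
    print(n, round(abs(sample.mean() - mu_N), 4))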
Solution 5.3 By the CLT ($0 < \sigma_N^2 < \infty$),

$$\Pr\{\sqrt{n}\,|\bar{X}_n - \mu_N| < \delta\} \approx 2\Phi\left(\frac{\delta}{\sigma_N}\right) - 1,$$

as $n \to \infty$.
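A simulation sketch (again with a made-up population) compares the empirical probability with this normal approximation:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
population = rng.uniform(0, 100, size=200)   # hypothetical finite population
mu_N, sigma_N = population.mean(), population.std()

n, delta = 100, 5
means = np.array([rng.choice(population, size=n, replace=True).mean()
                  for _ in range(5000)])
empirical = np.mean(np.sqrt(n) * np.abs(means - mu_N) < delta)
approx = 2 * stats.norm.cdf(delta / sigma_N) - 1
print(round(empirical, 3), round(approx, 3))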

Fig. 5.1 Distribution of correlation between xDev and yDev for sampling with and without replacement. { fig:exDistCorrPlace}

Solution 5.4 We create samples of size 𝑘 = 20 with and without replacement,


determine the correlation coefficient and finally create the two histograms (see Fig. { fig:exDistCorrPlace}
5.1).

random.seed(1)
place = mistat.load_data('PLACE')

# calculate correlation coefficient based on a sample of rows


def stat_func(idx):
return stats.pearsonr(place['xDev'][idx], place['yDev'][idx])[0]

rswr = []
rswor = []
idx = list(range(len(place)))
for _ in range(100):
rswr.append(stat_func(random.choices(idx, k=20)))
rswor.append(stat_func(random.sample(idx, k=20)))

corr_range = (min(*rswr, *rswor), max(*rswr, *rswor))

def makeHistogram(title, ax, data, xrange):


ax = pd.Series(data).hist(color='grey', ax=ax, bins=20)
ax.set_title(title)
ax.set_xlabel('Sample correlation')
ax.set_ylabel('Frequency')
ax.set_xlim(*xrange)

fig, axes = plt.subplots(figsize=[5, 3], ncols=2)


makeHistogram('RSWR', axes[0], rswr, corr_range)
makeHistogram('RSWOR', axes[1], rswor, corr_range)
plt.tight_layout()
plt.show()
Fig. 5.2 Distribution of median of turn-diameter, horsepower, and mpg of the CAR dataset using random sampling without replacement. { fig:exDistMedianCAR}

{ fig:exDistMedianCAR} Solution 5.5 The Python code creates the histograms shown in Fig. 5.2.

random.seed(1)
car = mistat.load_data('CAR')
columns = ['turn', 'hp', 'mpg']

# calculate correlation coefficient based on a sample of rows


def stat_func(idx):
sample = car[columns].loc[idx,]
return sample.median()

idx = list(range(len(car)))
result = []
for _ in range(200):
result.append(stat_func(random.sample(idx, k=50)))
result = pd.DataFrame(result)

fig, axes = plt.subplots(figsize=[8, 3], ncols=3)


for ax, column in zip(axes, columns):
result[column].hist(color='grey', ax=ax)
ax.set_xlabel(f'Median {column}')
ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()

Solution 5.6 For RSWOR,

$$\mathrm{S.E.}\{\bar{X}_n\} = \frac{\sigma}{\sqrt{n}}\left(1 - \frac{n-1}{N-1}\right)^{1/2}.$$

Equating the standard error to $\delta$ we get $n_1 = 30$, $n_2 = 116$, $n_3 = 84$.
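A small search sketch finds such sample sizes numerically; σ, N, and δ below are placeholders for the values given in the exercise:

import numpy as np

def smallest_n(sigma, N, delta):
    # smallest n for which the RSWOR standard error drops below delta
    for n in range(2, N + 1):
        se = sigma / np.sqrt(n) * np.sqrt(1 - (n - 1) / (N - 1))
        if se <= delta:
            return n
    return N

# placeholder inputs; replace with the values from the exercise
print(smallest_n(sigma=10, N=1000, delta=1.5))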


Solution 5.7 The required sample size is a solution of the equation

$$0.002 = 2 \cdot 1.96 \cdot \sqrt{\frac{P(1-P)}{n}\left(1 - \frac{n-1}{N}\right)}.$$

The solution is $n = 1611$.
Solution 5.8 The following are Python commands to estimate the mean of all 𝑁 =
416 𝑥-dev values by stratified sampling with proportional allocation. The total sample

size is 𝑛 = 200 and the weights are 𝑊1 = 0.385, 𝑊2 = 0.115, 𝑊3 = 0.5. Thus,
𝑛1 = 77, 𝑛2 = 23, and 𝑛3 = 100.

# load dataset and split into strata


place = mistat.load_data('PLACE')
strata_1 = list(place['xDev'][:160])
strata_2 = list(place['xDev'][160:208])
strata_3 = list(place['xDev'][208:])
N = len(place)
w_1 = 0.385
w_2 = 0.115
w_3 = 0.5
n_1 = int(w_1 * 200)
n_2 = int(w_2 * 200)
n_3 = int(w_3 * 200)

sample_means = []
for _ in range(500):
m_1 = np.mean(random.sample(strata_1, k=n_1))
m_2 = np.mean(random.sample(strata_2, k=n_2))
m_3 = np.mean(random.sample(strata_3, k=n_3))
sample_means.append(w_1*m_1 + w_2*m_2 + w_3*m_3)
std_dev_sample_means = np.std(sample_means)
print(std_dev_sample_means)
print(stats.sem(place['xDev'], ddof=0))

3.442839155174113e-05
8.377967188860638e-05

The standard deviation of the estimated means is an estimate of S.E.( 𝜇ˆ 𝑁 ). The


true value of this S.E. is 0.000034442.
Solution 5.9 $L(n_1, \dots, n_k; \lambda) = \sum_{i=1}^{k} W_i^2 \frac{\tilde{\sigma}_{N_i}^2}{n_i} - \lambda\left(n - \sum_{i=1}^{k} n_i\right)$. Partial differentiation of $L$ w.r.t. $n_1, \dots, n_k$ and $\lambda$ and equating the result to zero yields the following equations:

$$\frac{W_i^2 \tilde{\sigma}_{N_i}^2}{n_i^2} = \lambda, \quad i = 1, \dots, k$$
$$\sum_{i=1}^{k} n_i = n.$$

Equivalently, $n_i = \frac{1}{\sqrt{\lambda}} W_i \tilde{\sigma}_{N_i}$, for $i = 1, \dots, k$ and $n = \frac{1}{\sqrt{\lambda}} \sum_{i=1}^{k} W_i \tilde{\sigma}_{N_i}$. Thus

$$n_i^0 = n \frac{W_i \tilde{\sigma}_{N_i}}{\sum_{j=1}^{k} W_j \tilde{\sigma}_{N_j}}, \quad i = 1, \dots, k.$$
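A short numerical sketch of this optimal allocation; the weights and strata standard deviations below are illustrative values only:

import numpy as np

n = 200
W = np.array([0.385, 0.115, 0.5])       # strata weights (illustrative)
sigma = np.array([20.0, 35.0, 10.0])    # strata standard deviations (made up)

n_i = n * W * sigma / np.sum(W * sigma)
print(np.round(n_i).astype(int))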

Solution 5.10 The prediction model is $y_i = \beta + e_i$, $i = 1, \dots, N$, $E\{e_i\} = 0$, $V\{e_i\} = \sigma^2$, $\mathrm{cov}(e_i, e_j) = 0$ for all $i \ne j$.

$$E\{\bar{Y}_n - \mu_N\} = E\left\{\beta + \frac{1}{n}\sum_{i=1}^{N} I_i e_i - \beta - \frac{1}{N}\sum_{i=1}^{N} e_i\right\} = \frac{1}{n}\sum_{i=1}^{N} E\{I_i e_i\},$$

where $I_i = 1$ if the $i$-th population element is sampled, and $I_i = 0$ otherwise.
Notice that $I_1, \dots, I_N$ are independent of $e_1, \dots, e_N$. Hence, $E\{I_i e_i\} = 0$ for all
$i = 1, \dots, N$. This proves that $\bar{Y}_n$ is prediction unbiased, irrespective of the sample
strategy. The prediction MSE of $\bar{Y}_n$ is

$$PMSE\{\bar{Y}_n\} = E\{(\bar{Y}_n - \mu_N)^2\} = V\left\{\frac{1}{n}\sum_{i=1}^{N} I_i e_i - \frac{1}{N}\sum_{i=1}^{N} e_i\right\} = V\left\{\left(\frac{1}{n} - \frac{1}{N}\right)\sum_{i=1}^{N} I_i e_i - \frac{1}{N}\sum_{i=1}^{N} (1 - I_i) e_i\right\}.$$

Let s denote the set of units in the sample. Then

$$PMSE\{\bar{Y}_n \mid s\} = \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)^2 + \frac{1}{N}\left(1 - \frac{n}{N}\right)\sigma^2 = \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right).$$

Notice that $PMSE\{\bar{Y}_n \mid s\}$ is independent of s, and is equal for all samples.

Solution 5.11 The model is $y_i = \beta_0 + \beta_1 x_i + e_i$, $i = 1, \dots, N$, $E\{e_i\} = 0$, $V\{e_i\} = \sigma^2 x_i$, $i = 1, \dots, N$. Given a sample $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$, we estimate $\beta_0$ and $\beta_1$
by the weighted LSE because the variances of $y_i$ depend on $x_i$, $i = 1, \dots, N$. These
weighted LSE values $\hat{\beta}_0$ and $\hat{\beta}_1$, minimizing $Q = \sum_{i=1}^{n} \frac{1}{X_i}(Y_i - \beta_0 - \beta_1 X_i)^2$, are given by

$$\hat{\beta}_1 = \frac{\bar{Y}_n \cdot \frac{1}{n}\sum_{i=1}^{n} \frac{1}{X_i} - \frac{1}{n}\sum_{i=1}^{n} \frac{Y_i}{X_i}}{\bar{X}_n \cdot \frac{1}{n}\sum_{i=1}^{n} \frac{1}{X_i} - 1} \quad \text{and} \quad \hat{\beta}_0 = \frac{1}{\sum_{i=1}^{n} \frac{1}{X_i}}\left(\sum_{i=1}^{n} \frac{Y_i}{X_i} - n\hat{\beta}_1\right).$$

It is straightforward to show that $E\{\hat{\beta}_1\} = \beta_1$ and $E\{\hat{\beta}_0\} = \beta_0$. Thus, an unbiased
predictor of $\mu_N$ is $\hat{\mu}_N = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}_N$.
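A simulation sketch (with made-up data and parameters) that evaluates these closed-form weighted LSE values and compares them with the estimates from statsmodels' WLS with weights 1/x_i:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=50)                  # hypothetical sample of x values
y = 5 + 2 * x + rng.normal(scale=np.sqrt(x))     # errors with variance proportional to x

# closed-form weighted LSE derived above
b1 = (y.mean() * np.mean(1 / x) - np.mean(y / x)) / (x.mean() * np.mean(1 / x) - 1)
b0 = (np.sum(y / x) - len(x) * b1) / np.sum(1 / x)
print(round(b0, 3), round(b1, 3))

# the same estimates from statsmodels WLS with weights 1/x
wls = smf.wls('y ~ x', data=pd.DataFrame({'x': x, 'y': y}), weights=1 / x).fit()
print(wls.params.round(3))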
Solution 5.12 The predictor $\hat{Y}_{RA}$ can be written as $\hat{Y}_{RA} = \bar{x}_N \cdot \frac{1}{n}\sum_{i=1}^{N} I_i \frac{y_i}{x_i}$, where $I_i = 1$ if the $i$-th population element is in the sample s, and $I_i = 0$ otherwise.

Recall that $y_i = \beta x_i + e_i$, $i = 1, \dots, N$, and that for any sampling strategy,
$e_1, \dots, e_N$ are independent of $I_1, \dots, I_N$. Hence, since $\sum_{i=1}^{N} I_i = n$,

$$E\{\hat{Y}_{RA}\} = \bar{x}_N \cdot \frac{1}{n}\sum_{i=1}^{N} E\left\{I_i \frac{\beta x_i + e_i}{x_i}\right\} = \bar{x}_N\left(\beta + \frac{1}{n}\sum_{i=1}^{N} E\left\{I_i \frac{e_i}{x_i}\right\}\right) = \bar{x}_N \beta,$$

because $E\left\{I_i \frac{e_i}{x_i}\right\} = E\left\{\frac{I_i}{x_i}\right\} E\{e_i\} = 0$, $i = 1, \dots, N$. Thus, $E\{\hat{Y}_{RA} - \mu_N\} = 0$ and
$\hat{Y}_{RA}$ is an unbiased predictor.

$$PMSE\{\hat{Y}_{RA}\} = E\left\{\left(\frac{\bar{x}_N}{n}\sum_{i=1}^{N} I_i \frac{e_i}{x_i} - \frac{1}{N}\sum_{i=1}^{N} e_i\right)^2\right\} = V\left\{\sum_{i \in s_n}\left(\frac{\bar{x}_N}{n x_i} - \frac{1}{N}\right) e_i - \sum_{i' \in r_n} \frac{e_{i'}}{N}\right\}$$

where $s_n$ is the set of elements in the sample and $r_n$ is the set of elements in P but
not in $s_n$, $r_n = P - s_n$. Since $e_1, \dots, e_N$ are uncorrelated,

$$PMSE\{\hat{Y}_{RA} \mid s_n\} = \sigma^2 \sum_{i \in s_n}\left(\frac{\bar{x}_N}{n x_i} - \frac{1}{N}\right)^2 + \sigma^2 \frac{N - n}{N^2} = \frac{\sigma^2}{N^2}\left[(N - n) + \sum_{i \in s_n}\left(\frac{N \bar{x}_N}{n x_i} - 1\right)^2\right].$$

A sample $s_n$ which minimizes $\sum_{i \in s_n}\left(\frac{N \bar{x}_N}{n x_i} - 1\right)^2$ is optimal.

The predictor $\hat{Y}_{RG}$ can be written as

$$\hat{Y}_{RG} = \bar{x}_N \frac{\sum_{i=1}^{N} I_i x_i y_i}{\sum_{i=1}^{N} I_i x_i^2} = \bar{x}_N \frac{\sum_{i=1}^{N} I_i x_i (\beta x_i + e_i)}{\sum_{i=1}^{N} I_i x_i^2} = \beta \bar{x}_N + \bar{x}_N \frac{\sum_{i=1}^{N} I_i x_i e_i}{\sum_{i=1}^{N} I_i x_i^2}.$$

Hence, $E\{\hat{Y}_{RG}\} = \beta \bar{x}_N$ and $\hat{Y}_{RG}$ is an unbiased predictor of $\mu_N$. The conditional
prediction MSE, given $s_n$, is

$$PMSE\{\hat{Y}_{RG} \mid s_n\} = \frac{\sigma^2}{N^2}\left[N + \frac{N^2 \bar{x}_N^2}{\sum_{i \in s_n} x_i^2} - 2 N \bar{x}_N \frac{n \bar{X}_n}{\sum_{i \in s_n} x_i^2}\right].$$
Chapter 6
Time Series Analysis and Prediction

Import required modules and define required functions

import math
import mistat
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa import tsatools
import statsmodels.formula.api as smf

Solution 6.1 TODO: Provide a sample solution

Solution 6.2 (i) Fig. 6.1 shows the change of demand over time. { fig:seascom-timeline}
(ii) We first fit the seasonal trend to the data in the SeasCom data set. We use
the linear model Y = Xβ + ε, where the Y vector holds the 102 observations of the data
set. X is a 102x4 matrix of four column vectors. We combine X and Y in a data
frame. In addition to the SeasCom data, we add a column of 1’s (const), a column
with numbers 1 to 102 (trend) and two columns to describe the seasonal pattern.
The column season_1 consists of cos(π × trend/6), and the column season_2 is
sin(π × trend/6).

seascom = mistat.load_data('SEASCOM.csv')
df = tsatools.add_trend(seascom, trend='ct')
df['season_1'] = [np.cos(math.pi * tx/6) for tx in df['trend']]
df['season_2'] = [np.sin(math.pi * tx/6) for tx in df['trend']]
print(df.head())

model = smf.ols(formula='SeasCom ~ trend + 1 + season_1 + season_2',


data=df).fit()
print(model.params)
print(f'r2-adj: {model.rsquared_adj:.3f}')

SeasCom const trend season_1 season_2


0 71.95623 1.0 1.0 8.660254e-01 0.500000
1 56.36048 1.0 2.0 5.000000e-01 0.866025
2 64.85331 1.0 3.0 6.123234e-17 1.000000

105
{ fig:seascom-timeline} Fig. 6.1 Seasonal trend model of SeasCom data set

3 59.93460 1.0 4.0 -5.000000e-01 0.866025


4 51.62297 1.0 5.0 -8.660254e-01 0.500000
Intercept 47.673469
trend 1.047236
season_1 10.653968
season_2 10.130145
dtype: float64
r2-adj: 0.981

The least squares estimate of β is

b = (47.67347, 1.04724, 10.65397, 10.13015)′

{ fig:seascom-timeline} The fitted trend is Ŷ_t = Xb (see Fig. 6.1).

seascom = mistat.load_data('SEASCOM.csv')
fig, ax = plt.subplots()
ax.scatter(seascom.index, seascom, facecolors='none', edgecolors='grey')
model.predict(df).plot(ax=ax, color='black')
ax.set_xlabel('Time')
ax.set_ylabel('Data')
plt.show()

{ fig:seascom-model-deviation} (iii) Calculate the residuals and plot them (see Fig. 6.2).

U = df['SeasCom'] - model.predict(df)
fig, ax = plt.subplots()
ax.scatter(U.index, U, facecolors='none', edgecolors='black')
Fig. 6.2 Deviation of SeasCom data from cyclical trend { fig:seascom-model-deviation}

ax.set_xlabel('Time')
ax.set_ylabel('Deviation')
plt.show()

(iv) Calculate the correlations

# use slices to get sublists


corr_1 = np.corrcoef(U[:-1], U[1:])[0][1]
corr_2 = np.corrcoef(U[:-2], U[2:])[0][1]
print(f'Corr(Ut,Ut-1) = {corr_1:.3f}')
print(f'Corr(Ut,Ut-2) = {corr_2:.3f}')

Corr(Ut,Ut-1) = -0.191
Corr(Ut,Ut-2) = 0.132

Indeed the correlations between adjacent data points are corr(X_t, X_{t−1}) =
−0.191, and corr(X_t, X_{t−2}) = 0.132.
(v) A plot of the deviations and the low autocorrelations suggests that the residuals
behave like random noise.

# keep some information for later exercises


seascom_model = model
seascom_df = df

Solution 6.3 According to Equation 6.2.2, the auto-correlation in a stationary MA(q) {eqn:ma-covariance}
is
$$\rho(h) = \frac{K(h)}{K(0)} = \frac{\sum_{j=0}^{q-h} \beta_j \beta_{j+h}}{\sum_{j=0}^{q} \beta_j^2}$$

for $0 \le h \le q$.
Notice that 𝜌(ℎ) = 𝜌(−ℎ), and 𝜌(ℎ) = 0 for all ℎ > 𝑞.
Solution 6.4 In Python

beta = np.array([1, 1.05, 0.76, -0.35, 0.45, 0.55])

data = []
n = len(beta)
sum_0 = np.sum(beta * beta)
for h in range(6):
    sum_h = np.sum(beta[:n-h] * beta[h:])
    data.append({
        'h': h,
        'K(h)': sum_h,
        'rho(h)': sum_h / sum_0,
    })
print(pd.DataFrame(data).set_index('h').T.round(3))

0 1 2 3 4 5
K(h) 3.308 1.672 0.542 0.541 1.028 0.550
rho(h) 1.000 0.506 0.164 0.163 0.311 0.166

Solution 6.5 We consider the MA(∞), given by coefficients $\beta_j = q^j$, with $0 < q < 1$.
In this case,
(i) $E\{X_t\} = 0$, and
(ii) $V\{X_t\} = \sigma^2 \sum_{j=0}^{\infty} q^{2j} = \sigma^2/(1 - q^2)$.
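A quick numerical check of the variance formula, truncating the infinite sum (q = 0.8 is an arbitrary example value):

import numpy as np

q = 0.8   # arbitrary example value
print(np.sum(q ** (2 * np.arange(1000))), 1 / (1 - q**2))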

Solution 6.6 We consider the AR(1), $X_t = 0.75 X_{t-1} + \varepsilon_t$.
(i) This time-series is equivalent to $X_t = \sum_{j=0}^{\infty} (0.75)^j \varepsilon_{t-j}$, hence it is covariance stationary.
(ii) $E\{X_t\} = 0$
(iii) According to the Yule-Walker equations,

$$\begin{pmatrix} 1 & -0.75 \\ -0.75 & 1 \end{pmatrix} \begin{pmatrix} K(0) \\ K(1) \end{pmatrix} = \begin{pmatrix} \sigma^2 \\ 0 \end{pmatrix}$$

It follows that $K(0) = 2.285714\,\sigma^2$ and $K(1) = 1.714286\,\sigma^2$.
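The Yule-Walker system can be solved numerically (right-hand side and solution in units of σ²):

import numpy as np

A = np.array([[1, -0.75], [-0.75, 1]])
K = np.linalg.solve(A, [1, 0])
print(K.round(6))   # [2.285714 1.714286]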


Solution 6.7 The given AR(2) can be written as $X_t - 0.5 X_{t-1} + 0.3 X_{t-2} = \varepsilon_t$.
(i) The corresponding characteristic polynomial is $P_2(z) = 0.3 - 0.5z + z^2$. The
two characteristic roots are $z_{1,2} = \frac{1}{4} \pm i\frac{\sqrt{95}}{20}$. These two roots lie inside the unit circle.
Hence this AR(2) is covariance stationary.
(ii) We can write $A_2(z) X_t = \varepsilon_t$, where $A_2(z) = 1 - 0.5 z^{-1} + 0.3 z^{-2}$. Furthermore,
$\phi_{1,2}$ are the two roots of $A_2(z) = 0$.
It follows that

$$X_t = (A_2(z))^{-1} \varepsilon_t = \varepsilon_t + 0.5\varepsilon_{t-1} - 0.08\varepsilon_{t-2} - 0.205\varepsilon_{t-3} - 0.0761\varepsilon_{t-4} + 0.0296\varepsilon_{t-5} + \dots$$
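The characteristic roots and their moduli can be checked with numpy:

import numpy as np

roots = np.roots([1, -0.5, 0.3])   # roots of z^2 - 0.5 z + 0.3
print(roots)                       # 0.25 +/- 0.487i
print(np.abs(roots))               # both approx. 0.548, i.e. inside the unit circle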

Solution 6.8 The Yule-Walker equations are:

$$\begin{pmatrix} 1 & -0.5 & 0.3 & -0.2 \\ -0.5 & 1.3 & -0.2 & 0 \\ 0.3 & -0.7 & 1 & 0 \\ -0.2 & 0.3 & -0.5 & 1 \end{pmatrix} \cdot \begin{pmatrix} K(0) \\ K(1) \\ K(2) \\ K(3) \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}$$

The solution is $K(0) = 1.2719$, $K(1) = 0.4825$, $K(2) = -0.0439$, $K(3) = 0.0877$.
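Solving the system numerically reproduces these values:

import numpy as np

A = np.array([[ 1.0, -0.5,  0.3, -0.2],
              [-0.5,  1.3, -0.2,  0.0],
              [ 0.3, -0.7,  1.0,  0.0],
              [-0.2,  0.3, -0.5,  1.0]])
K = np.linalg.solve(A, [1, 0, 0, 0])
print(K.round(4))   # [ 1.2719  0.4825 -0.0439  0.0877]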

Solution 6.9 The Toeplitz matrix is

$$R_4 = \begin{pmatrix} 1.0000 & 0.3794 & -0.0235 & 0.0690 \\ 0.3794 & 1.0000 & 0.3794 & -0.0235 \\ -0.0235 & 0.3794 & 1.0000 & 0.3794 \\ 0.0690 & -0.0235 & 0.3794 & 1.0000 \end{pmatrix}$$

Solution 6.10 This series is an ARMA(2,2), given by the equation

(1 − 𝑧−1 + 0.25𝑧−2 ) 𝑋𝑡 = (1 + 0.4𝑧−1 − 0.45𝑧−2 )𝜀 𝑡 .

Accordingly,

𝑋𝑡 = (1 + 0.4𝑧 −1 − 0.45𝑧−2 ) (1 − 𝑧−1 + 0.25𝑧−2 ) −1 𝜀 𝑡


= 𝜀 𝑡 + 1.4𝜀 𝑡−1 + 0.7𝜀 𝑡−2 + 0.35𝜀 𝑡−3 + 0.175𝜀 𝑡−4
+ 0.0875𝜀 𝑡−5 + 0.0438𝜀 𝑡−6 + 0.0219𝜀 𝑡−7 + 0.0109𝜀 𝑡−8 + ...
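These ψ-weights can be reproduced with statsmodels' arma2ma, passing the AR polynomial (1, −1, 0.25) and the MA polynomial (1, 0.4, −0.45):

from statsmodels.tsa.arima_process import arma2ma

psi = arma2ma([1, -1, 0.25], [1, 0.4, -0.45], lags=9)
print(psi.round(4))   # 1, 1.4, 0.7, 0.35, 0.175, ...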

Solution 6.11 Let 𝑋 denote the DOW1941 data set. We create a new set, 𝑌 of second
order difference, i.e., 𝑌𝑡 = 𝑋𝑡 − 2𝑋𝑡−1 + 𝑋𝑡−2 .

dow1941 = mistat.load_data('DOW1941.csv')

X = dow1941.values # extract values to remove index for calculations


Y = X[2:] - 2 * X[1:-1] + X[:-2]

(i) In the following table we present the autocorrelations, acf, and the partial
autocorrelations, pacf, of 𝑌 . For a visualisation see Fig. 6.3. { fig:acf-pacf-dow-second-order}

# use argument alpha to return confidence intervals


y_acf, ci_acf = acf(Y, nlags=11, fft=True, alpha=0.05)
y_pacf, ci_pacf = pacf(Y, nlags=11, alpha=0.05)

# determine if values are significantly different from zero


def is_significant(y, ci):
return not (ci[0] < 0 < ci[1])

s_acf = [is_significant(y, ci) for y, ci in zip(y_acf, ci_acf)]


s_pacf = [is_significant(y, ci) for y, ci in zip(y_pacf, ci_pacf)]
Fig. 6.3 Autocorrelations, acf, and the partial autocorrelations, pacf, of the second order differences of the DOW1941 dataset. { fig:acf-pacf-dow-second-order}

h acf S/NS pacf S/NS


1 -0.342 S -0.343 S
2 -0.179 S -0.337 S
3 0.033 NS -0.210 S
4 0.057 NS -0.100 NS
5 -0.080 NS -0.155 S
6 -0.047 NS -0.193 S
7 -0.010 NS -0.237 S
8 0.128 NS -0.074 NS
9 -0.065 NS -0.127 S
10 -0.071 NS -0.204 S
11 0.053 NS -0.193 S

S denotes significantly different from 0. NS denotes not significantly different


from 0.
(ii) All other correlations are not significant. It seems that the ARIMA(1,2,2) is a
good approximation.

{ fig:seascom-one-day-ahead-model} Solution 6.12 In Fig. 6.4, we present the seasonal data SeasCom and the one-day
ahead predictions. The predictions track the data very closely.

predictedError = mistat.optimalLinearPredictor(seascom_model.resid,
10, nlags=9)
predictedTrend = seascom_model.predict(seascom_df)
correctedTrend = predictedTrend + predictedError

fig, ax = plt.subplots()
ax.scatter(seascom_df.index, seascom_df['SeasCom'],
facecolors='none', edgecolors='grey')
predictedTrend.plot(ax=ax, color='grey')
correctedTrend.plot(ax=ax, color='black')
ax.set_xlabel('Time')
ax.set_ylabel('SeasCom data')
plt.show()
{ fig:seascom-one-day-ahead-model} Fig. 6.4 One-day ahead prediction model of SeasCom data set
Chapter 7
Modern analytic methods: Part I

Import required modules and define required functions

import warnings
import mistat
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Uncomment the following if xgboost crashes


import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

Solution 7.1 Articles reviewing the application of supervised and unsupervised


methods can be found online (see e.g., doi:10.1016/j.chaos.2020.110059)
An example for supervised applications is the classification of a COVID-19 based
on diagnostic data (see e.g., doi:10.1155/2021/4733167 or doi:10.3390/s21103322)
An example of unsupervised learning is hierarchical clustering to evaluate
COVID-19 pandemic preparedness and performance in 180 countries (see doi:10.1016/j.rinp.2021.104639)

Solution 7.2 The decision tree model for testResult results in the confusion ma-
trix shown in Fig. 7.1. { fig:ex-confusion-matrix-testResult}

sensors = mistat.load_data('SENSORS.csv')
predictors = [c for c in sensors.columns if c.startswith('sensor')]
outcome = 'testResult'
X = sensors[predictors]

113
The confusion matrix displayed in the figure (true label in rows, predicted label in columns) is:

                    Brake  Good  Grippers  IMP  ITM  Motor  SOS  Vel. I  Vel. II
Brake                   4     0         0    0    0      1    1       0        0
Good                    0    82         0    0    0      0    0       0        0
Grippers                0     0        14    0    0      0    0       0        0
IMP                     0     1         1    3    0      0    0       0        0
ITM                     0     2         0    0   31      0    0       0        0
Motor                   0     0         0    0    0     15    1       0        0
SOS                     0     0         0    0    0      0    2       0        0
Velocity Type I         0     0         0    0    0      0    0      11        0
Velocity Type II        0     1         0    0    0      0    0       0        4

{ fig:ex-confusion-matrix-testResult} Fig. 7.1 Decision tree model to predict testResult from sensor data (Exc. 7.2)

y = sensors[outcome]

# Train the model


clf = DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
clf.fit(X, y)

fig, ax = plt.subplots(figsize=(10, 6))


ConfusionMatrixDisplay.from_estimator(clf, X, y, ax=ax, cmap=plt.cm.Blues)

plt.show()

The model’s classification performance is very good. The test result ‘Good’, which
corresponds to status ‘Pass’, is correctly predicted. Most of the other individual test
results also have low misclassification rates. The likely reason for this is that each
test result is associated with a specific subset of the sensors.

Solution 7.3 In Python

# convert the status information into numerical labels


outcome = 'status'
y = sensors[outcome].replace({'Pass': 1, 'Fail': 0})

# Train the model


xgb = XGBClassifier(objective='binary:logistic', subsample=.63,
eval_metric='logloss', use_label_encoder=False,
seed=1)
xgb.fit(X, y)

# actual in rows / predicted in columns


print('Confusion matrix')
print(confusion_matrix(y, xgb.predict(X)))

Confusion matrix
[[92 0]
[ 0 82]]
/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py:1395:
UserWarning: `use_label_encoder` is deprecated in 1.7.0.
warnings.warn("`use_label_encoder` is deprecated in 1.7.0.")

The model’s confusion matrix is perfect.

var_imp = pd.DataFrame({
'importance': xgb.feature_importances_,
}, index=predictors)
var_imp = var_imp.sort_values(by='importance', ascending=False)
var_imp['order'] = range(1, len(var_imp) + 1)
print(var_imp.head(10))
var_imp.loc[var_imp.index.isin(['sensor18', 'sensor07', 'sensor21'])]

importance order
sensor18 0.290473 1
sensor54 0.288680 2
sensor53 0.105831 3
sensor55 0.062423 4
sensor61 0.058112 5
sensor48 0.040433 6
sensor07 0.026944 7
sensor12 0.015288 8
sensor03 0.013340 9
sensor52 0.013160 10

importance order
sensor18 0.290473 1
sensor07 0.026944 7
sensor21 0.000000 50

The decision tree model uses sensors 18, 7, and 21. The xgboost model identifies
sensor 18 as the most important variable. Sensor 7 is ranked 7th, sensor 21 has no
importance.

Solution 7.4 Create the random forest classifier model.

y = sensors['status']

# Train the model


model = RandomForestClassifier(ccp_alpha=0.012, random_state=0)
model.fit(X, y)

# actual in rows / predicted in columns


print('Confusion matrix')
print(confusion_matrix(y, model.predict(X)))

Confusion matrix
[[92 0]
[ 0 82]]

The model’s confusion matrix is perfect.

var_imp = pd.DataFrame({
'importance': model.feature_importances_,
}, index=predictors)
var_imp = var_imp.sort_values(by='importance', ascending=False)

var_imp['order'] = range(1, len(var_imp) + 1)


print(var_imp.head(10))
var_imp.loc[var_imp.index.isin(['sensor18', 'sensor07', 'sensor21'])]

importance order
sensor61 0.138663 1
sensor18 0.100477 2
sensor53 0.079890 3
sensor52 0.076854 4
sensor46 0.052957 5
sensor50 0.051970 6
sensor44 0.042771 7
sensor48 0.037087 8
sensor24 0.036825 9
sensor21 0.035014 10

importance order
sensor18 0.100477 2
sensor21 0.035014 10
sensor07 0.023162 17

The decision tree model uses sensors 18, 7, and 21. Sensor 18 has the second
largest importance value, sensor 21 ranks 10th, and sensor 7 is on rank 17.

Solution 7.5 In Python:

# convert outcome values from strings into numerical labels


# use le.inverse_transform to convert the predictions to strings
le = LabelEncoder()
y = le.fit_transform(sensors['status'])

train_X, valid_X, train_y, valid_y = train_test_split(X, y,


test_size=0.4, random_state=2)

dt_model = DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)


dt_model.fit(train_X, train_y)

xgb_model = XGBClassifier(objective='binary:logistic', subsample=.63,


eval_metric='logloss', use_label_encoder=False,
seed=1)
xgb_model.fit(train_X, train_y)

rf_model = RandomForestClassifier(ccp_alpha=0.014, random_state=0)


rf_model.fit(train_X, train_y)

print('Decision tree model')


print(f'Accuracy {accuracy_score(valid_y, dt_model.predict(valid_X)):.3f}')
print(confusion_matrix(valid_y, dt_model.predict(valid_X)))

print('Gradient boosting model')


print(f'Accuracy {accuracy_score(valid_y, xgb_model.predict(valid_X)):.3f}')
print(confusion_matrix(valid_y, xgb_model.predict(valid_X)))

print('Random forest model')


print(f'Accuracy {accuracy_score(valid_y, rf_model.predict(valid_X)):.3f}')
print(confusion_matrix(valid_y, rf_model.predict(valid_X)))

/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py:1395:
UserWarning: `use_label_encoder` is deprecated in 1.7.0.
warnings.warn("`use_label_encoder` is deprecated in 1.7.0.")
Decision tree model
Accuracy 0.900
[[37 2]

[ 5 26]]
Gradient boosting model
Accuracy 0.957
[[36 3]
[ 0 31]]
Random forest model
Accuracy 0.986
[[38 1]
[ 0 31]]

The accuracy for predicting the validation set is very good for all three models,
with random forest giving the best model with an accuracy of 0.986. The xgboost
model has a slightly lower accuracy of 0.957 and the accuracy for the decision tree
model is 0.900.
If you change the random seeds or remove them for the various commands, you
will see that the accuracies vary and that the order can change.

Solution 7.6 In Python:

dt_model = DecisionTreeClassifier(ccp_alpha=0.012)

random_valid_acc = []
random_train_acc = []
org_valid_acc = []
org_train_acc = []
for _ in range(100):
train_X, valid_X, train_y, valid_y = train_test_split(X, y,
test_size=0.4)
dt_model.fit(train_X, train_y)
org_train_acc.append(accuracy_score(train_y, dt_model.predict(train_X)))
org_valid_acc.append(accuracy_score(valid_y, dt_model.predict(valid_X)))

random_y = random.sample(list(train_y), len(train_y))


dt_model.fit(train_X, random_y)
random_train_acc.append(accuracy_score(random_y, dt_model.predict(train_X)))
random_valid_acc.append(accuracy_score(valid_y, dt_model.predict(valid_X)))

ax = pd.Series(random_valid_acc).plot.density(color='C0')
pd.Series(random_train_acc).plot.density(color='C0', linestyle='--',
ax=ax)
pd.Series(org_valid_acc).plot.density(color='C1', ax=ax)
pd.Series(org_train_acc).plot.hist(color='C1', linestyle='--',
ax=ax)
ax.set_ylim(0, 40)
ax.set_xlim(0, 1.01)
plt.show()
The resulting plot shows density curves of the training and validation accuracies (x-axis: accuracy, y-axis: frequency) for the models fitted to the original labels and to the randomly permuted labels.
Solution 7.7 Load data

distTower = mistat.load_data('DISTILLATION-TOWER.csv')
predictors = ['Temp1', 'Temp2', 'Temp3', 'Temp4', 'Temp5', 'Temp6',
'Temp7', 'Temp8', 'Temp9', 'Temp10', 'Temp11', 'Temp12']
outcome = 'VapourPressure'
Xr = distTower[predictors]
yr = distTower[outcome]

(i) Split dataset into train and validation set

train_X, valid_X, train_y, valid_y = train_test_split(Xr, yr,


test_size=0.2, random_state=2)

(ii) Determine model performance for different tree complexity along the depen-
{ fig:exc-cpp-pruning} dence of tree depth on ccp parameter; see Fig. 7.2.

# Code to analyze tree depth vs alpha


model = DecisionTreeRegressor(random_state=0)
path = model.cost_complexity_pruning_path(Xr, yr)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
mse = []
mse_train = []
for ccp_alpha in ccp_alphas:
model = DecisionTreeRegressor(random_state=0, ccp_alpha=ccp_alpha)
model.fit(train_X, train_y)
mse.append(mean_squared_error(valid_y, model.predict(valid_X)))
mse_train.append(mean_squared_error(train_y, model.predict(train_X)))
ccp_alpha = ccp_alphas[np.argmin(mse)]

The smallest validation set error is obtained for ccp_alpha = 0.372. The depen-
{ fig:exc-cpp-pruning} dence of training and validation error on ccp_alpha is shown in Fig. 7.2.
{ fig:exc-dtreeviz-visualization} (iii) The final model is visualized using dtreeviz in Fig. 7.3.
Fig. 7.2 Decision tree regressor complexity as a function of ccp_alpha. The validation set error is shown in black, the training set error in grey (Exercise 7.7) { fig:exc-cpp-pruning}

import dtreeviz

# Train the final model on the full data set
model = DecisionTreeRegressor(ccp_alpha=ccp_alpha, random_state=0)
model.fit(Xr, yr)

# build the dtreeviz model and render the tree
# (in recent dtreeviz versions the rendered tree is obtained with .view(),
#  which can also be saved, e.g. viz.view().save('tree.svg'))
viz = dtreeviz.model(model, Xr, yr,
                     target_name=outcome,
                     feature_names=list(Xr.columns))
viz.view()


Solution 7.8 Vary the number of bins and binning strategy. The influence of the two
model parameters on model performance is shown in Fig. 7.4. { fig:nb-binning-performance}

y = sensors['status']

results = []
Fig. 7.3 Decision tree visualization of regression tree (Exercise 7.7) { fig:exc-dtreeviz-visualization}

for strategy in 'uniform', 'quantile', 'kmeans':


for n_bins in range(2, 11):
kbinsDiscretizer = KBinsDiscretizer(encode='ordinal',
strategy=strategy, n_bins=n_bins)
X_binned = kbinsDiscretizer.fit_transform(X)
nb_model = MultinomialNB()
nb_model.fit(X_binned, y)
results.append({'strategy': strategy, 'n_bins': n_bins,
'accuracy': accuracy_score(y, nb_model.predict(X_binned))})
results = pd.DataFrame(results)
fig, ax = plt.subplots()
for key, group in results.groupby('strategy'):
group.plot(x='n_bins', y='accuracy', label=key, ax=ax)

The quantile binning strategy (for each feature, each bin has the same number
of data points) with splitting each column into two bins leads to the best performing
model. With this strategy, performance declines with increasing number of bins. The
uniform (for each feature, each bin has the same width) and the kmeans (for each
feature, a 𝑘-means clustering is used to bin the feature) strategies on the other hand,
show increasing performance with increasing number of bins.
The confusion matrix for the best performing models is:

kbinsDiscretizer = KBinsDiscretizer(encode='ordinal',
strategy='quantile', n_bins=2)
X_binned = kbinsDiscretizer.fit_transform(X)
nb_model = MultinomialNB()
nb_model.fit(X_binned, y)
print('Confusion matrix')
print(confusion_matrix(y, nb_model.predict(X_binned)))

Confusion matrix
[[87 5]
[ 1 81]]
Fig. 7.4 Influence of number of bins and binning strategy on model performance for the sensors data with status as outcome { fig:nb-binning-performance}

The decision tree model misclassified three of the ‘Pass’ data points as ‘Fail’.
The Naïve Bayes model on the other hand misclassifies six data points. However,
five of these are ‘Pass’ and predicted as ‘Fail’. Depending on your use case, you may
prefer a model with more false negatives or false positives.

Solution 7.9 In Python:

from sklearn.cluster import AgglomerativeClustering


from sklearn.preprocessing import StandardScaler
from mistat import plot_dendrogram

food = mistat.load_data('FOOD.csv')

scaler = StandardScaler()
model = AgglomerativeClustering(n_clusters=10, compute_distances=True)

X = scaler.fit_transform(food)
model = model.fit(X)
fig, ax = plt.subplots()
plot_dendrogram(model, ax=ax)
ax.set_title('Dendrogram')
ax.get_xaxis().set_ticks([])
plt.show()

Solution 7.10 In Python:

food = mistat.load_data('FOOD.csv')
scaler = StandardScaler()
X = scaler.fit_transform(food)

fig, axes = plt.subplots(ncols=2, nrows=2)


Fig. 7.5 Hierarchical clustering of food data set using Ward clustering { fig:food-ward-10-clusters}

for linkage, ax in zip(['ward', 'complete', 'average', 'single'], axes.flatten()):


model = AgglomerativeClustering(n_clusters=10, compute_distances=True,
linkage=linkage)
model = model.fit(X)
plot_dendrogram(model, ax=ax)
ax.set_title(linkage)
ax.get_xaxis().set_ticks([])
plt.show()

{ fig:food-compare-linkage} The comparison of the different linkage methods is shown in Fig. 7.6. We can
see that Ward’s clustering gives the most balanced clusters; three bigger clusters and
seven small clusters. Complete, average, and single linkage lead to one big cluster.

Solution 7.11 We determine 10 clusters using K-means clustering.

sensors = mistat.load_data('SENSORS.csv')
predictors = [c for c in sensors.columns if c.startswith('sensor')]
outcome = 'status'
X = sensors[predictors]

scaler = StandardScaler()
X = scaler.fit_transform(X)
model = KMeans(n_clusters=10, random_state=1).fit(X)

Combine the information and analyse cluster membership by status.


{ fig:food-compare-linkage} Fig. 7.6 Comparison of different linkage methods for hierarchical clustering of food data set

df = pd.DataFrame({
'status': sensors['status'],
'testResult': sensors['testResult'],
'cluster': model.predict(X),
})

for status, group in df.groupby('status'):


print(f'Status {status}')
print(group['cluster'].value_counts())

Status Fail
cluster
8 19
0 15
4 13
1 13
9 13
3 10
5 5
7 2
6 1
2 1
Name: count, dtype: int64
Status Pass
cluster
1 48
8 34
Name: count, dtype: int64

There are several clusters that contain only ‘Fail’ data points. They correspond to
specific sensor value combinations that are very distinct from the sensor values during
normal operation. The ‘Pass’ data points are found in two clusters. Both of these
clusters also contain ‘Fail’ data points.
Analyse cluster membership by testResult.

print('Number of clusters by testResult')


for cluster, group in df.groupby('cluster'):
print(f'Cluster {cluster}')
print(group['testResult'].value_counts())
print()

Number of clusters by testResult


Cluster 0
testResult
ITM 15
Name: count, dtype: int64

Cluster 1
testResult
Good 48
Brake 6
IMP 4
Grippers 1
Motor 1
ITM 1
Name: count, dtype: int64

Cluster 2
testResult
SOS 1
Name: count, dtype: int64

Cluster 3
testResult
Velocity Type I 10
Name: count, dtype: int64

Cluster 4
testResult
Grippers 10
ITM 3
Name: count, dtype: int64

Cluster 5
testResult
Velocity Type II 5
Name: count, dtype: int64

Cluster 6
testResult
SOS 1
Name: count, dtype: int64

Cluster 7
testResult
Grippers 2
Name: count, dtype: int64

Cluster 8
testResult
Good 34
Motor 15
Grippers 1
IMP 1
ITM 1
Velocity Type I 1
Name: count, dtype: int64

Cluster 9
testResult
ITM 13
Name: count, dtype: int64

We can see that some of the test results are only found in one or two clusters.

Solution 7.12 The scikit-learn K-means clustering method can return either the
cluster centers or the distances of a data point to all the cluster centers. We evaluate
both as features for classification.

# Data preparation
sensors = mistat.load_data('SENSORS.csv')
predictors = [c for c in sensors.columns if c.startswith('sensor')]
outcome = 'status'
X = sensors[predictors]
scaler = StandardScaler()
X = scaler.fit_transform(X)
y = sensors[outcome]

First, we classify data points based on their assigned cluster. In order to use that
information in a classifier, we transform the cluster assignment using one-hot
encoding.

# Iterate over increasing number of clusters


results = []
clf = DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
for n_clusters in range(2, 20):
# fit a model and assign the data to clusters
model = KMeans(n_clusters=n_clusters, random_state=1)
model.fit(X)
Xcluster = model.predict(X)
# to use the cluster number in a classifier, use one-hot encoding
# it's necessary to reshape the vector of cluster numbers into a column vector
Xcluster = OneHotEncoder().fit_transform(Xcluster.reshape(-1, 1))

# create a decision tree model and determine the accuracy


clf.fit(Xcluster, y)
results.append({'n_clusters': n_clusters,
'accuracy': accuracy_score(y, clf.predict(Xcluster))})
ax = pd.DataFrame(results).plot(x='n_clusters', y='accuracy')
ax.set_ylim(0.5, 1)
plt.show()
pd.DataFrame(results).round(3)

n_clusters accuracy
0 2 0.529
1 3 0.805
2 4 0.805
3 5 0.799
4 6 0.793
5 7 0.897
6 8 0.868
7 9 0.879
8 10 0.816
9 11 0.879
10 12 0.879
11 13 0.879
12 14 0.897
13 15 0.897
14 16 0.885
15 17 0.879
16 18 0.891
17 19 0.902
Fig. 7.7 Dependence of accuracy on number of clusters using cluster membership as feature (Exc. 7.12) { fig:cluster-number-model}

The accuracies are visualized in Fig. 7.7. We see that splitting the dataset into 7 { fig:cluster-number-model}
clusters gives a classification model with an accuracy of about 0.9.
Next we use the distance to the cluster centers as variable in the classifier.

results = []
clf = DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
for n_clusters in range(2, 20):
# fit a model and convert data to distances
model = KMeans(n_clusters=n_clusters, random_state=1)
model.fit(X)
Xcluster = model.transform(X)

# create a decision tree model and determine the accuracy


clf.fit(Xcluster, y)
results.append({'n_clusters': n_clusters,
'accuracy': accuracy_score(y, clf.predict(Xcluster))})
ax = pd.DataFrame(results).plot(x='n_clusters', y='accuracy')
ax.set_ylim(0.5, 1)
plt.show()
pd.DataFrame(results).round(3)

n_clusters accuracy
0 2 0.983
1 3 0.977
2 4 0.977
3 5 0.977
4 6 0.989
5 7 0.977
6 8 0.977
7 9 0.977
8 10 0.977
9 11 0.977
10 12 0.977
11 13 0.983
12 14 0.977
13 15 0.977
14 16 0.977
15 17 0.983
16 18 0.977
17 19 0.983

Fig. 7.8 Dependence of accuracy on number of clusters using distance to cluster center as feature (Exc. 7.12) { fig:cluster-distance-model}

The accuracies of all models are very high. The largest accuracy is achieved for 6
clusters.
Based on these results, we would design the procedure using the decision tree
classifier combined with K-means clustering into six clusters. Using scikit-learn,
we can define the full procedure as a single pipeline as follows:

pipeline = make_pipeline(
StandardScaler(),
KMeans(n_clusters=6, random_state=1),
DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
)
X = sensors[predictors]
y = sensors[outcome]

process = pipeline.fit(X, y)
print('accuracy', accuracy_score(y, process.predict(X)))
print('Confusion matrix')
print(confusion_matrix(y, process.predict(X)))

accuracy 0.9885057471264368
Confusion matrix
[[91 1]
[ 1 81]]

The final model has two misclassified data points.


Chapter 8
Modern analytic methods: Part II

Import required modules and define required functions

import mistat
import networkx as nx
from pgmpy.estimators import HillClimbSearch
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

Solution 8.1 Load the data and convert to FDataGrid.

from skfda import FDataGrid


from skfda.representation.interpolation import SplineInterpolation

dissolution = mistat.load_data('DISSOLUTION.csv')

# convert the data to FDataGrid


data = []
labels = []
names = []
for label, group in dissolution.groupby('Label'):
data.append(group['Data'].values)
labels.append('Reference' if label.endswith('R') else 'Test')
names.append(label)
labels = np.array(labels)
grid_points = np.array(sorted(dissolution['Time'].unique()))
fd = FDataGrid(np.array(data), grid_points,
dataset_name='Dissolution',
argument_names=['Time'],
coordinate_names=['Dissolution'])

Use shift registration to align the dissolution data with spline interpolation of
order 1, 2, and 3.

from skfda.preprocessing.registration import ShiftRegistration


shift_registration = ShiftRegistration()

fd_registered = {}
for order in (1, 2, 3):
fd.interpolation = SplineInterpolation(interpolation_order=order)
fd_registered[order] = shift_registration.fit_transform(fd)

Fig. 8.1 Mean dissolution curves for reference and test tablets derived using spline interpolation of order 1, 2, and 3.

For each of the three registered datasets, calculate the mean dissolution curves
for reference and test tablets and plot the results.

from skfda.exploratory import stats

group_colors = {'Reference': 'grey', 'Test': 'black'}

fig, axes = plt.subplots(ncols=3, figsize=(8, 3))


for ax, order in zip(axes, (1, 2, 3)):
mean_ref = stats.mean(fd_registered[order][labels=='Reference'])
mean_test = stats.mean(fd_registered[order][labels=='Test'])
means = mean_ref.concatenate(mean_test)
means.plot(axes=[ax], group=['Reference', 'Test'], group_colors=group_colors)
ax.set_title(f'Order {order}')
plt.tight_layout()

The dissolution curves are shown in Fig. 8.1. We can see in all three graphs
that the test tablets show a slightly faster dissolution than the reference tablets. If
we compare the shape of the curves, the curves from the linear spline interpolation
show a levelling off with time. In the quadratic spline interpolation result, the
dissolution curves go through a maximum. This behavior is unrealistic. The cubic
spline interpolation also leads to unrealistic curves that first level off and then start to
increase again.
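
The visual impression can be backed up numerically. The following sketch assumes fd_registered,
labels, and the stats module imported from skfda.exploratory above; it prints the pointwise
difference between the mean test and reference curves for the order-1 registration.

import numpy as np

# pointwise difference between mean test and reference dissolution curves
# for the linear-spline (order 1) registration (sketch)
mean_ref = stats.mean(fd_registered[1][labels == 'Reference'])
mean_test = stats.mean(fd_registered[1][labels == 'Test'])
diff = mean_test - mean_ref
print(np.round(diff.data_matrix.flatten(), 2))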

Solution 8.2 (i)

import skfda
from skfda import FDataGrid

pinchraw = skfda.datasets.fetch_cran('pinchraw', 'fda')['pinchraw']


pinchtime = skfda.datasets.fetch_cran('pinch', 'fda')['pinchtime']

fd = FDataGrid(pinchraw.transpose(), pinchtime)

Note that the measurement data need to be transposed.
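
A quick sanity check of the shapes (a sketch; it assumes pinchraw and the data matrix of fd
behave like numpy arrays) confirms that each of the twenty replications became one row of the grid:

# sanity check (sketch): after transposing, the first axis of the data
# matrix corresponds to the 20 replications, the second to the time points
print('pinchraw shape:   ', pinchraw.shape)
print('data_matrix shape:', fd.data_matrix.shape)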


(ii)

fig = fd.plot()
ax = fig.axes[0]
Fig. 8.2 Twenty measurements of pinch force

ax.set_xlabel('Time [s]')
ax.set_ylabel('Pinch force')
plt.show()

Fig. 8.2 shows the measured pinch forces. We can see that the start of the force
varies from 0.025 to 0.1 seconds. This makes it difficult to compare the shape of the
curves. The shape of the individual curves is not symmetric, with a faster onset of
the force and a slower decline.
(iii) We create a variety of smoothed versions of the dataset to explore the effect
of varying the smoothing_parameter.

import itertools
from skfda.preprocessing.smoothing.kernel_smoothers import NadarayaWatsonSmoother

def plotSmoothData(fd, smoothing_parameter, ax):


smoother = NadarayaWatsonSmoother(smoothing_parameter=smoothing_parameter)
fd_smooth = smoother.fit_transform(fd)
_ = fd_smooth.plot(axes=[ax])
ax.set_title(f'Smoothing parameter {smoothing_parameter}')
ax.set_xlabel('Time')
ax.set_ylabel('Pinch force')

fig, axes = plt.subplots(ncols=2, nrows=2)


axes = list(itertools.chain(*axes)) # flatten list of lists
for i, sp in enumerate([0.03, 0.01, 0.001, 0.0001]):
plotSmoothData(fd, sp, axes[i])
plt.tight_layout()

Fig. 8.3 shows smoothed measurement curves for a variety of smoothing_parameter
values. If values are too large, the data are oversmoothed and the asymmetric shape
of the curves is lost. With decreasing values, the shape is reproduced better but the
curves are getting noisier again. We select 0.005 as the smoothing parameter.

smoother = NadarayaWatsonSmoother(smoothing_parameter=0.005)
fd_smooth = smoother.fit_transform(fd)

(iii) We first determine the maxima of the smoothed curves:


Fig. 8.3 Effect of smoothing parameter on measurement curves

Fig. 8.4 Registered measurement curves of the Pinch dataset

max_idx = fd_smooth.data_matrix.argmax(axis=1)
landmarks = [pinchtime[idx] for idx in max_idx]

Use the landmarks to shift register the measurements:

from skfda.preprocessing.registration import landmark_shift


fd_landmark = landmark_shift(fd_smooth, landmarks)

(iv) Fig. 8.4 shows the measurements after smoothing and landmark shift registration.

fig = fd_landmark.plot()
ax = fig.axes[0]
ax.set_xlabel('Time [s]')
ax.set_ylabel('Pinch force')
plt.show()
Fig. 8.5 Histogram of moisture content

Solution 8.3 (i) Load the data

import skfda

moisturespectrum = skfda.datasets.fetch_cran('Moisturespectrum', 'fds')


moisturevalues = skfda.datasets.fetch_cran('Moisturevalues', 'fds')

frequencies = moisturespectrum['Moisturespectrum']['x']
spectra = moisturespectrum['Moisturespectrum']['y']
moisture = moisturevalues['Moisturevalues']

(ii) We can use a histogram to look at the distribution of the moisture values.

_ = pd.Series(moisture).hist(bins=20, color='grey', label='Moisture content')

Fig. 8.5 shows a bimodal distribution of the moisture content with a clear separation
of the two peaks. Based on this, we select 14.5 as the threshold to separate into
high and low moisture content.

moisture_class = ['high' if m > 14.5 else 'low' for m in moisture]

(iii - iv) The spectrum information is almost in the array format required for the
FDataGrid class; we only need to transpose it so that each row corresponds to one
sample. As we can see in the left graph of Fig. 8.6, the spectra are not well aligned but show
a large spread in the intensities. This is likely due to the difficulty in having a clearly
defined concentration for each sample. In order to reduce this variation, we can
standardize the intensities of each sample by subtracting its mean and dividing by its
standard deviation.

intensities = spectra.transpose()
fd = skfda.FDataGrid(intensities, frequencies)

# standardize each sample spectrum: subtract its mean and divide by its standard deviation


intensities_normalized = (intensities - intensities.mean(dim='dim_0')) / intensities.std(dim='dim_0')
fd_normalized = skfda.FDataGrid(intensities_normalized, frequencies)

Code for plotting the spectra:


Fig. 8.6 Near-infrared spectra of the moisture dataset. Left: raw spectra, Right: normalized spectra

fig, axes = plt.subplots(ncols=2)


_ = fd.plot(axes=axes[0], alpha=0.5,
# color lines by moisture class
group=moisture_class, group_names={'high': 'high', 'low': 'low'})
_ = fd_normalized.plot(axes=axes[1], alpha=0.5,
group=moisture_class, group_names={'high': 'high', 'low': 'low'})

As we can see in the right graph of Fig. 8.6, the normalized spectra are now better
aligned. We also see that the overall shape of the spectra is fairly consistent between
samples.
(v) We repeat the model building both for the original and normalized spectra 50
times. At each iteration, we split the data set into training and test sets (50-50), build
the model with the training set and measure accuracy using the test set. By using the
same random seed for splitting the original and the normalized dataset, we can better
compare the models. The accuracies from the 50 iterations are compared in Fig. 8.7.

from skfda.ml.classification import KNeighborsClassifier


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

accuracies = []
for rs in range(50):
X_train, X_test, y_train, y_test = train_test_split(fd,
moisture_class, random_state=rs, test_size=0.5)
knn_original = KNeighborsClassifier()
knn_original.fit(X_train, y_train)
Fig. 8.7 Accuracies of classification models based on original and normalized spectra. The line indicates equal performance.

acc_original = accuracy_score(y_test, knn_original.predict(X_test))

X_train, X_test, y_train, y_test = train_test_split(fd_normalized,
moisture_class, random_state=rs, test_size=0.5)
knn_normalized = KNeighborsClassifier()
knn_normalized.fit(X_train, y_train)
acc_normalized = accuracy_score(y_test, knn_normalized.predict(X_test))
accuracies.append({
'original': acc_original,
'normalized': acc_normalized,
})
accuracies = pd.DataFrame(accuracies)
ax = accuracies.plot.scatter(x='original', y='normalized')
_ = ax.plot([0.7, 0.9], [0.7, 0.9], color='black')
ax.set_xlabel('Accuracy of models based on original spectra')
ax.set_ylabel('Accuracy of models based on normalized spectra')
plt.show()

# mean of accuracies
mean_accuracy = accuracies.mean()
mean_accuracy

original 0.7976
normalized 0.9468
dtype: float64

Fig. 8.7 clearly shows that classification models based on the normalized spectra
achieve better accuracies. The mean accuracy increases from 0.8 to 0.95.
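
Because the same random seeds were used for both variants, the 50 accuracy values form
matched pairs and can also be compared with a paired test. The following sketch assumes
the accuracies data frame from above and uses the Wilcoxon signed-rank test from scipy.

from scipy.stats import wilcoxon

# paired comparison of the 50 matched accuracy values (sketch)
res = wilcoxon(accuracies['normalized'], accuracies['original'])
print(f'Wilcoxon signed-rank test p-value: {res.pvalue:.3g}')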

Solution 8.4 (i) See solution for Exercise 8.3.

(ii) We use the method skfda.ml.regression.KNeighborsRegressor to


build the regression models.

from skfda.ml.regression import KNeighborsRegressor


from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
Fig. 8.8 Mean absolute error of regression models using original and normalized spectra. The line indicates equal performance.

mae = []
for rs in range(50):
X_train, X_test, y_train, y_test = train_test_split(fd,
moisture, random_state=rs, test_size=0.5)
knn_original = KNeighborsRegressor()
knn_original.fit(X_train, y_train)
mae_original = mean_absolute_error(y_test, knn_original.predict(X_test))

X_train, X_test, y_train, y_test = train_test_split(fd_normalized,
moisture, random_state=rs, test_size=0.5)
knn_normalized = KNeighborsRegressor()
knn_normalized.fit(X_train, y_train)
mae_normalized = mean_absolute_error(y_test, knn_normalized.predict(X_test))
mae.append({
'original': mae_original,
'normalized': mae_normalized,
})
mae = pd.DataFrame(mae)
ax = mae.plot.scatter(x='original', y='normalized')
ax.plot([0.3, 1.0], [0.3, 1.0], color='black')
ax.set_xlabel('MAE of models based on original spectra')
ax.set_ylabel('MAE of models based on normalized spectra')
plt.show()

# mean of MAE
mean_mae = mae.mean()
mean_mae

original 0.817016
normalized 0.433026
dtype: float64

Fig. 8.8 clearly shows that regression models based on the normalized spectra
achieve better performance. The mean absolute error decreases from 0.82 to 0.43.
(iii) We use the last regression model from (ii) to create a plot of actual versus
predicted moisture content for the test data.
Fig. 8.9 Actual versus predicted moisture content

y_pred = knn_normalized.predict(X_test)
predictions = pd.DataFrame({'actual': y_test, 'predicted': y_pred})
minmax = [min(*y_test, *y_pred), max(*y_test, *y_pred)]

ax = predictions.plot.scatter(x='actual', y='predicted')
ax.set_xlabel('Moisture content')
ax.set_ylabel('Predicted moisture content')
ax.plot(minmax, minmax, color='grey')
plt.show()

Fig. 8.9 shows two clusters of points. One cluster contains the samples with the
high moisture content and the other cluster the samples with low moisture content.
The clusters are well separated and the predictions are in the typical range for
each cluster. However, within a cluster, predictions and actual values are not highly
correlated. In other words, while the regression model can distinguish between
samples with a high and low moisture content, the moisture content is otherwise
not well predicted. There is therefore no advantage in using the regression model
compared to the classification model.
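
The weak correlation within each cluster can be quantified directly. The sketch below assumes
the predictions data frame from above and reuses the 14.5 threshold chosen in Exercise 8.3.

# correlation between actual and predicted moisture within each class (sketch)
high = predictions['actual'] > 14.5
for name, subset in [('high', predictions[high]), ('low', predictions[~high])]:
    r = subset['actual'].corr(subset['predicted'])
    print(f'{name} moisture class: r = {r:.2f}')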

Solution 8.5 (i) See solution for Exercise 8.3.

(ii) In Python:

from skfda.preprocessing.dim_reduction.projection import FPCA

fpca_original = FPCA(n_components=2)
_ = fpca_original.fit(fd)

fpca_normalized = FPCA(n_components=2)
_ = fpca_normalized.fit(fd_normalized)

The projections of the spectra can now be visualized:

def plotFPCA(fpca, fd, ax):


fpca_df = pd.DataFrame(fpca.transform(fd))
Fig. 8.10 Projection of spectra onto first two principal components. Left: original spectra, right: normalized spectra

fpca_df.plot.scatter(x=0, y=1,
c=['C1' if mc == 'high' else 'C2' for mc in moisture_class], ax=ax)
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')

fig, axes = plt.subplots(ncols=2, figsize=[6, 3])


plotFPCA(fpca_original, fd, axes[0])
plotFPCA(fpca_normalized, fd_normalized, axes[1])
plt.tight_layout()

Fig. 8.10 compares the PCA projections for the original and normalized data.
We can see that the second principal component clearly separates the two moisture
content classes for the normalized spectra. This is not the case for the original spectra.
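
The fraction of functional variance captured by the two components can also be inspected.
This is a sketch; it assumes that skfda's FPCA exposes an explained_variance_ratio_
attribute analogous to scikit-learn's PCA.

# proportion of variance explained by the first two functional
# principal components (sketch)
print('original:  ', fpca_original.explained_variance_ratio_)
print('normalized:', fpca_normalized.explained_variance_ratio_)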

Solution 8.6 (i) We demonstrate a solution to this exercise using two blog posts
preprocessed and included in the mistat package. The content of the posts was
converted to plain text with each paragraph on a line. The two blog posts can be loaded as
follows:

from mistat.nlp import globalWarmingBlogs


blogs = globalWarmingBlogs()

The variable blogs is a dictionary with labels as keys and text as values. We next
split the data into a list of labels and a list of non-empty paragraphs.

paragraphs = []
labels = []
for blog, text in blogs.items():
for paragraph in text.split('\n'):
paragraph = paragraph.strip()
if not paragraph: # ignore empty paragraphs
continue
paragraphs.append(paragraph)
labels.append(blog)
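
As a quick check of the resulting corpus, we can count how many non-empty paragraphs each
blog contributes (a small sketch using collections.Counter):

from collections import Counter

# number of non-empty paragraphs per blog (sketch)
print(Counter(labels))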

(ii) Using CountVectorizer, transform the list of paragraphs into a document-term matrix (DTM).

import re
from sklearn.feature_extraction.text import CountVectorizer

def preprocessor(text):
text = text.lower()
text = re.sub(r'\d[\d,]*', '', text)
text = '\n'.join(line for line in text.split('\n')
if not line.startswith('ntsb'))
return text

vectorizer = CountVectorizer(preprocessor=preprocessor,
stop_words='english')
counts = vectorizer.fit_transform(paragraphs)

print('shape of DTM', counts.shape)


print('total number of terms', np.sum(counts))

shape of DTM (123, 1025)


total number of terms 2946
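
The DTM is stored as a sparse matrix; its sparsity can be checked directly (a sketch using the
counts matrix from above):

# fraction of non-zero entries in the document-term matrix (sketch)
sparsity = counts.nnz / (counts.shape[0] * counts.shape[1])
print(f'non-zero fraction: {sparsity:.4f}')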

The ten most frequently occurring terms are:

termCounts = np.array(counts.sum(axis=0)).flatten()
topCounts = termCounts.argsort()
terms = vectorizer.get_feature_names_out()
for n in reversed(topCounts[-10:]):
print(f'{terms[n]:14s} {termCounts[n]:3d}')

global 63
climate 59
warming 57
change 55
ice 35
sea 34
earth 33
ocean 29
temperatures 28
heat 25

(iii) Conversion of counts using TF-IDF.

from sklearn.feature_extraction.text import TfidfTransformer

tfidfTransformer = TfidfTransformer(smooth_idf=False, norm=None)


tfidf = tfidfTransformer.fit_transform(counts)

(iv)

from sklearn.decomposition import TruncatedSVD


from sklearn.preprocessing import Normalizer
svd = TruncatedSVD(5)
norm_tfidf = Normalizer().fit_transform(tfidf)
lsa_tfidf = svd.fit_transform(norm_tfidf)

We can analyze the loadings to get an idea of topics.

terms = vectorizer.get_feature_names_out()
data = {}
for i, component in enumerate(svd.components_, 1):
compSort = component.argsort()
idx = list(reversed(compSort[-10:]))
data[f'Topic {i}'] = [terms[n] for n in idx]
data[f'Loading {i}'] = [component[n] for n in idx]
df = pd.DataFrame(data)

print("{\\tiny")
print(df.style.format(precision=2).hide(axis='index').to_latex(hrules=True))
print("}")

Topic 1 Loading 1 Topic 2 Loading 2 Topic 3 Loading 3 Topic 4 Loading 4 Topic 5 Loading 5

change 0.24 ice 0.39 sea 0.25 extreme 0.54 snow 0.40
climate 0.24 sea 0.34 earth 0.23 events 0.30 cover 0.24
global 0.23 sheets 0.27 energy 0.21 heat 0.24 sea 0.16
sea 0.22 shrinking 0.19 light 0.21 precipitation 0.20 climate 0.12
warming 0.21 level 0.17 gases 0.19 earth 0.13 level 0.12
ice 0.20 arctic 0.15 ice 0.18 light 0.13 temperatures 0.12
temperature 0.18 ocean 0.13 infrared 0.17 energy 0.12 decreased 0.11
ocean 0.16 declining 0.10 greenhouse 0.15 gases 0.12 temperature 0.11
earth 0.16 levels 0.08 level 0.15 greenhouse 0.11 increase 0.10
extreme 0.15 glaciers 0.07 atmosphere 0.12 infrared 0.10 rise 0.10

We can identify topics related to sea warming, ice sheets melting, greenhouse
effect, extreme weather events, and hurricanes.
(v) Repeat the analysis requesting 10 components in the SVD.

svd = TruncatedSVD(10)
norm_tfidf = Normalizer().fit_transform(tfidf)
lsa_tfidf = svd.fit_transform(norm_tfidf)

We now get the following topics.

terms = vectorizer.get_feature_names_out()
data = {}
for i, component in enumerate(svd.components_, 1):
compSort = component.argsort()
idx = list(reversed(compSort[-10:]))
data[f'Topic {i}'] = [terms[n] for n in idx]
data[f'Loading {i}'] = [component[n] for n in idx]
df = pd.DataFrame(data)

Topic 1 Loading 1 Topic 2 Loading 2 Topic 3 Loading 3 Topic 4 Loading 4 Topic 5 Loading 5

change 0.24 ice 0.39 sea 0.25 extreme 0.54 snow 0.39
climate 0.24 sea 0.35 earth 0.24 events 0.31 cover 0.23
global 0.23 sheets 0.27 energy 0.21 heat 0.24 sea 0.17
sea 0.22 shrinking 0.19 light 0.21 precipitation 0.20 level 0.13
warming 0.21 level 0.17 gases 0.19 light 0.13 temperatures 0.12
ice 0.20 arctic 0.15 ice 0.18 earth 0.13 climate 0.12
temperature 0.18 ocean 0.13 infrared 0.17 energy 0.12 hurricanes 0.12
ocean 0.16 declining 0.10 greenhouse 0.16 gases 0.12 decreased 0.11
earth 0.16 levels 0.08 level 0.15 greenhouse 0.11 increase 0.10
extreme 0.15 glaciers 0.07 atmosphere 0.12 infrared 0.10 rise 0.10

Topic 6 Loading 6 Topic 7 Loading 7 Topic 8 Loading 8 Topic 9 Loading 9 Topic 10 Loading 10

sea 0.38 snow 0.34 ocean 0.38 glaciers 0.36 responsibility 0.30
level 0.36 ocean 0.26 hurricanes 0.23 water 0.24 authorities 0.24
rise 0.17 cover 0.24 acidification 0.18 retreat 0.24 heat 0.23
extreme 0.14 acidification 0.17 glaciers 0.17 glacial 0.21 wildfires 0.20
events 0.11 extreme 0.16 water 0.13 months 0.16 pollution 0.19
global 0.11 pollution 0.13 waters 0.12 summer 0.14 arctic 0.13
hurricanes 0.11 carbon 0.13 temperatures 0.11 going 0.13 personal 0.12
temperature 0.10 decreased 0.12 reefs 0.09 plants 0.13 carbon 0.11
impacts 0.09 point 0.11 coral 0.09 animals 0.11 percent 0.11
coastal 0.08 warming 0.10 warmer 0.09 stream 0.11 reduce 0.10

The first five topics are essentially identical to the result in (iv). This is an expected property
of the SVD.
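
The explained variance of the LSA components can be examined as well; the sketch below uses
the explained_variance_ratio_ attribute of the fitted 10-component TruncatedSVD.

# explained variance of the ten LSA components (sketch)
print(svd.explained_variance_ratio_.round(3))
print('cumulative:', round(float(svd.explained_variance_ratio_.sum()), 3))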
Fig. 8.11 Projection of the documents onto first two singular values. Red: blog post 1, green: blog post 2

(vi) Fig. 8.11 shows the individual documents projected onto the first two singular
values of the LSA. Based on this visualization, we can say that the two documents
discuss different aspects of global warming and that blog post 1 contains more details
about the area.

fig, ax = plt.subplots()
blog1 = [label == 'blog-1' for label in labels]
blog2 = [label == 'blog-2' for label in labels]
ax.plot(lsa_tfidf[blog1, 0], lsa_tfidf[blog1, 1], 'ro')
ax.plot(lsa_tfidf[blog2, 0], lsa_tfidf[blog2, 1], 'go')
ax.set_xlabel('First component')
ax.set_ylabel('Second component')
plt.show()

Solution 8.7 We follow the same procedure as in Exercise 8.6 using a set of three
articles preprocessed and included in the mistat package.

from mistat.nlp import covid19Blogs


blogs = covid19Blogs()

Determine DTM using paragraphs as documents:

paragraphs = []
labels = []
for blog, text in blogs.items():
for paragraph in text.split('\n'):
paragraph = paragraph.strip()
if not paragraph:
continue
paragraphs.append(paragraph)
labels.append(blog)

def preprocessor(text):
text = text.lower()
text = re.sub(r'\d[\d,]*', '', text)
text = '\n'.join(line for line in text.split('\n')
if not line.startswith('ntsb'))
return text

vectorizer = CountVectorizer(preprocessor=preprocessor, stop_words='english')


counts = vectorizer.fit_transform(paragraphs)

TF-IDF transformation

tfidfTransformer = TfidfTransformer(smooth_idf=False, norm=None)


tfidf = tfidfTransformer.fit_transform(counts)

Latent semantic analysis (LSA)

svd = TruncatedSVD(10)
tfidf = Normalizer().fit_transform(tfidf)
lsa_tfidf = svd.fit_transform(tfidf)

Topics:

terms = vectorizer.get_feature_names_out()
data = {}
for i, component in enumerate(svd.components_, 1):
compSort = component.argsort()
idx = list(reversed(compSort[-10:]))
data[f'Topic {i}'] = [terms[n] for n in idx]
data[f'Loading {i}'] = [component[n] for n in idx]
df = pd.DataFrame(data)

Topic 1 Loading 1 Topic 2 Loading 2 Topic 3 Loading 3 Topic 4 Loading 4 Topic 5 Loading 5

labour 0.29 south 0.27 economic 0.24 capacity 0.21 covid 0.22
covid 0.22 labour 0.27 percent 0.23 financial 0.20 america 0.22
impact 0.19 north 0.22 covid 0.19 firms 0.19 reset 0.21
market 0.19 differences 0.20 gdp 0.17 household 0.17 latin 0.20
south 0.18 americas 0.17 impact 0.15 international 0.17 needs 0.18
america 0.16 channel 0.16 imf 0.14 access 0.14 economic 0.14
pandemic 0.15 agenda 0.14 social 0.12 largely 0.14 social 0.14
channel 0.15 covid 0.13 pre 0.12 depends 0.14 asymmetric 0.11
economic 0.14 post 0.11 world 0.11 state 0.13 region 0.11
north 0.14 impact 0.09 growth 0.11 support 0.13 labor 0.10

Topic 6 Loading 6 Topic 7 Loading 7 Topic 8 Loading 8 Topic 9 Loading 9 Topic 10 Loading 10

economic 0.26 covid 0.24 self 0.22 agenda 0.21 state 0.15
channel 0.20 asymmetric 0.21 employed 0.19 post 0.19 policies 0.14
social 0.16 occupations 0.20 unbearable 0.19 channel 0.18 needs 0.13
market 0.14 consequences 0.20 lightness 0.19 region 0.15 self 0.12
recovery 0.12 economic 0.17 employment 0.13 labour 0.12 capacity 0.12
labour 0.11 transition 0.13 countries 0.13 pandemic 0.12 unbearable 0.12
consequences 0.10 occupation 0.13 economic 0.13 new 0.09 lightness 0.12
covid 0.09 americas 0.13 channel 0.12 latin 0.09 covid 0.12
governments 0.09 north 0.13 argentina 0.12 given 0.08 employed 0.12
agenda 0.09 differences 0.12 social 0.12 occupations 0.08 reset 0.11

Looking at the different loadings, we can see different topics emerging.


In Fig. 8.12, we can see that the paragraphs in the articles show more overlap
compared to what we've observed in Exercise 8.6.

fig, ax = plt.subplots()
for blog in blogs:
match = [label == blog for label in labels]
ax.plot(lsa_tfidf[match, 0], lsa_tfidf[match, 1], 'o', label=blog)
Fig. 8.12 Projection of the documents onto first two singular values for the three blog posts (blog-1, blog-2, blog-3)

ax.legend()
ax.set_xlabel('First component')
ax.set_ylabel('Second component')
plt.show()

Solution 8.8 (i) Load and preprocess the data

data = mistat.load_data('LAPTOP_REVIEWS')
data['Review'] = data['Review title'] + ' ' + data['Review content']
reviews = data.dropna(subset=['User rating', 'Review title', 'Review content'])

(ii) Convert the text representation into a document term matrix (DTM).

import re
from sklearn.feature_extraction.text import CountVectorizer
def preprocessor(text):
text = text.lower()
text = re.sub(r'\d[\d,]*', '', text)
return text

vectorizer = CountVectorizer(preprocessor=preprocessor,
stop_words='english')
counts = vectorizer.fit_transform(reviews['Review'])
print('shape of DTM', counts.shape)
print('total number of terms', np.sum(counts))

shape of DTM (7433, 12823)


total number of terms 251566

(iii) Convert the counts in the document term matrix (DTM) using TF-IDF.

from sklearn.feature_extraction.text import TfidfTransformer

tfidfTransformer = TfidfTransformer(smooth_idf=False, norm=None)


tfidf = tfidfTransformer.fit_transform(counts)

(iv) Using scikit-learn's TruncatedSVD method, we convert the sparse tfidf matrix to a denser representation.

from sklearn.decomposition import TruncatedSVD


from sklearn.preprocessing import Normalizer
svd = TruncatedSVD(20)
tfidf = Normalizer().fit_transform(tfidf)
lsa_tfidf = svd.fit_transform(tfidf)
print(lsa_tfidf.shape)

(7433, 20)

(v) We use logistic regression to classify reviews with a user rating of five as
positive and all other reviews as negative.

from sklearn.linear_model import LogisticRegression


from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

outcome = ['positive' if rating == 5 else 'negative'
for rating in reviews['User rating']]

# split dataset into 60% training and 40% test set


Xtrain, Xtest, ytrain, ytest = train_test_split(lsa_tfidf, outcome,
test_size=0.4, random_state=1)

# run logistic regression model on training


logit_reg = LogisticRegression(solver='lbfgs')
logit_reg.fit(Xtrain, ytrain)

# print confusion matrix and accuracy


accuracy = accuracy_score(ytest, logit_reg.predict(Xtest))
print(accuracy)
confusion_matrix(ytest, logit_reg.predict(Xtest))

0.769334229993275

array([[ 861, 385],


[ 301, 1427]])

The accuracy of the classification model on the test set is 0.77.
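
For context, this accuracy can be compared with the naive baseline of always predicting the
majority class, and precision and recall per class give a fuller picture. The sketch below
assumes ytest, Xtest, and logit_reg from the code above.

from collections import Counter
from sklearn.metrics import classification_report

# majority-class baseline accuracy (sketch)
label, count = Counter(ytest).most_common(1)[0]
print(f'baseline accuracy (always predicting "{label}"): {count / len(ytest):.3f}')

# per-class precision and recall of the logistic regression model
print(classification_report(ytest, logit_reg.predict(Xtest)))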
