Solutions: Modern Statistics
May 3, 2023
Springer Nature
Chapter 1
Analyzing Variability: Descriptive Statistics
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mistat
from scipy import stats
Solution 1.1 random.choices selects 𝑘 values from the list using sampling with
replacement.
import random
random.seed(1)
values = random.choices([1, 2, 3, 4, 5, 6], k=50)
The expected frequency in each cell under randomness is 50/6 ≈ 8.3. Your numerical results will differ, due to randomness.
Solution 1.2 The Python function range is an iterator. As we need a list of values,
we need to explicitly convert it.
x = list(range(50))
y = [5 + 2.5 * xi for xi in x]
y = [yi + random.uniform(-10, 10) for yi in y]
pd.DataFrame({'x': x, 'y': y}).plot.scatter(x='x', y='y')
plt.show()
[Figure: scatter plot of y versus x]
Solution 1.3 The sums obtained for the different values of p were:

p     sum
0.1   4
0.3   12
0.7   33
0.9   43

Notice that the expected values of the sums are 5, 15, 35 and 45.
Solution 1.4 We can plot the data and calculate mean and standard deviation.
plt.show()
[Figure: plot of the measurements from the two instruments]
As shown in the figure, the measurements on Instrument 1 seem to be accurate but less precise than those on Instrument 2. Instrument 2 seems to have an upward bias (inaccurate). Quantitatively, the mean of the measurements on Instrument 1 is $\bar{X}_1 = 10.034$ and its standard deviation is $S_1 = 0.871$. For Instrument 2 we have $\bar{X}_2 = 10.983$ and $S_2 = 0.569$.
Solution 1.5 If the scale is inaccurate it will show on the average a deterministic
component different than the nominal weight. If the scale is imprecise, different
weight measurements will show a high degree of variability around the correct
nominal weight. Problems with stability arise when the accuracy of the scale changes
with time, and the scale should be recalibrated.
Solution 1.6 The method random.choices creates a random selection with re-
placement. Note that in range(start, end) the end argument is excluded. We
therefore need to set it to 101.
import random
random.choices(range(1, 101), k=20)
[6, 88, 57, 20, 51, 49, 36, 35, 54, 63, 62, 46, 3, 23, 18, 59, 87, 80,
80, 82]
Solution 1.7
import random
random.sample(range(11, 31), 10)
[19, 12, 13, 28, 11, 18, 26, 23, 15, 14]
Solution 1.8 (i) $26^5 = 11{,}881{,}376$; (ii) $7{,}893{,}600$; (iii) $26^3 = 17{,}576$; (iv) $2^{10} = 1024$; (v) $\binom{10}{5} = 252$.
Solution 1.10
filmsp = mistat.load_data('FILMSP')
filmsp.plot.hist()
plt.show()
[Figure: histogram of FILMSP with frequency on the vertical axis]
Solution 1.11
coal = mistat.load_data('COAL')
pd.DataFrame(coal.value_counts(sort=False))
count
COAL
3 17
6 3
4 6
0 33
5 5
2 18
7 1
1 28
Solution 1.12 For (1) and (2), we can use the pandas value_counts method. e.g.:
car = mistat.load_data('CAR')
car['cyl'].value_counts(sort=False)
cyl
4 66
6 30
8 13
Name: count, dtype: int64
(iii) Frequency distribution of turn diameter:

turn
(28, 30] 3
(30, 32] 16
(32, 34] 16
(34, 36] 26
(36, 38] 20
(38, 40] 18
(40, 42] 8
(42, 44] 2
Name: count, dtype: int64
Note that the bin intervals are open on the left and closed on the right.
(iv) Frequency distribution of Horsepower:
hp
(50, 75] 7
(75, 100] 34
(100, 125] 22
(125, 150] 18
(150, 175] 17
(175, 200] 6
(200, 225] 4
(225, 250] 1
Name: count, dtype: int64
(v) Frequency distribution of mpg:

mpg
(9, 14] 1
(14, 19] 42
(19, 24] 41
(24, 29] 22
(29, 34] 3
Name: count, dtype: int64
Solution 1.13
filmsp = mistat.load_data('FILMSP')
filmsp = filmsp.sort_values(ignore_index=True) # sort and reset index
0.00 66.0
0.25 102.0
0.50 105.0
0.75 109.0
1.00 118.0
Name: FILMSP, dtype: float64
0.80 109.8
0.90 111.0
0.99 114.0
Name: FILMSP, dtype: float64
Here is a solution that uses pure Python. Note that pandas' quantile implements different interpolation methods, which will lead to differences for smaller datasets. We therefore recommend using the library method and selecting the interpolation that is most suitable for your use case.
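The pure-Python code itself is not reproduced in this extract. A minimal sketch of such a computation, using linear interpolation between order statistics (the book's exact convention may differ, which would explain small discrepancies such as 109.5 vs. 109.8 for the 0.8 quantile), could look like this:

import math

def quantile(values, p):
    # quantile by linear interpolation between order statistics
    xs = sorted(values)
    pos = p * (len(xs) - 1)      # fractional position in the sorted sample
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

for p in (0, 0.25, 0.5, 0.75, 0.8, 0.9, 0.99, 1.0):
    print(p, quantile(filmsp, p))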
0 66.0
0.25 102.0
0.5 105.0
0.75 109.0
0.8 109.5
0.9 111.0
0.99 114.0
1.0 118.0
Solution 1.14
filmsp = mistat.load_data('FILMSP')
n = len(filmsp)
mean = filmsp.mean()
deviations = [film - mean for film in filmsp]
S = math.sqrt(sum(deviation**2 for deviation in deviations) / n)
# moment-based coefficients of skewness and kurtosis
skewness = sum(deviation**3 for deviation in deviations) / (n * S**3)
kurtosis = sum(deviation**4 for deviation in deviations) / (n * S**4)
print('Python:\n',
      f'Skewness: {skewness}, Kurtosis: {kurtosis}')
print('Pandas:\n',
      f'Skewness: {filmsp.skew()}, Kurtosis: {filmsp.kurtosis()}')
Python:
Skewness: -1.8098727695275856, Kurtosis: 9.014427238360716
Pandas:
Skewness: -1.8224949285588137, Kurtosis: 6.183511188870432
The distribution of film speed is negatively skewed and much steeper than the normal distribution. Note that the calculated values differ between the methods: pandas applies bias corrections and reports excess kurtosis, while the moment-based formulas above do not.
Solution 1.15 The pandas groupby method groups the data based on the value. We
can then calculate individual statistics for each group.
car = mistat.load_data('CAR')
car['mpg'].groupby(by=car['origin']).mean()
car['mpg'].groupby(by=car['origin']).std()
# calculate both at the same time
print(car['mpg'].groupby(by=car['origin']).agg(['mean', 'std']))
mean std
origin
1 20.931034 3.597573
2 19.500000 2.623855
3 23.108108 4.280341
Solution 1.16 We first create a subset of the data frame that contains only US made
cars and then calculate the statistics for this subset only.
car = mistat.load_data('CAR')
car_US = car[car['origin'] == 1]
gamma = car_US['turn'].std() / car_US['turn'].mean()
Solution 1.17
car = mistat.load_data('CAR')
car_US = car[car['origin'] == 1]
car_Asia = car[car['origin'] == 3]
print('US')
print('mean', car_US['turn'].mean())
print('geometric mean', stats.gmean(car_US['turn']))
print('Japanese')
print('mean', car_Asia['turn'].mean())
print('geometric mean', stats.gmean(car_Asia['turn']))
US
mean 37.203448275862065
geometric mean 37.06877691910792
Japanese
mean 33.04594594594595
geometric mean 32.97599107553825
We see that $\bar{X}$ is greater than $G$ in both cases. The cars from Asia have a smaller mean turn diameter.
Solution 1.18
filmsp = mistat.load_data('FILMSP')
Xbar = filmsp.mean()
S = filmsp.std()
print(f'mean: {Xbar}, stddev: {S}')
expected = {1: 0.68, 2: 0.95, 3: 0.997}
for k in (1, 2, 3):
left = Xbar - k * S
right = Xbar + k * S
proportion = sum(left < film < right for film in filmsp)
print(f'X +/- {k}S: ',
f'actual freq. {proportion}, ',
f'pred. freq. {expected[k] * len(filmsp):.2f}')
The discrepancies between the actual and predicted frequencies are due to the fact that the distribution of film speed is neither symmetric nor bell-shaped.
Solution 1.19
car = mistat.load_data('CAR')
car.boxplot(column='mpg', by='origin')
plt.show()
[Figure: box plots of mpg grouped by origin]
Solution 1.20
oturb = mistat.load_data('OTURB')
mistat.stemLeafDiagram(oturb, 2, leafUnit=0.01)
4 2 3444
18 2 55555666677789
40 3 0000001111112222333345
(15) 3 566677788899999
45 4 00022334444
34 4 566888999
25 5 0112333
18 5 6789
14 6 01122233444
3 6 788
• $X_{(1)} = 0.23$,
• $Q_1 = X_{(25.25)} = X_{(25)} + 0.25(X_{(26)} - X_{(25)}) = 0.31$,
• $M_e = X_{(50.5)} = 0.385$,
• $Q_3 = X_{(75.75)} = 0.49 + 0.75(0.50 - 0.49) = 0.4975$,
• $X_{(n)} = 0.68$.
Solution 1.21
from scipy.stats import trim_mean
oturb = mistat.load_data('OTURB')
print(f'T(0.1) = {trim_mean(oturb, 0.1)}')
print(f'S(0.1) = {trim_std(oturb, 0.1)}')
T(0.1) = 0.40558750000000005
S(0.1) = 0.09982289003530202
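Note that trim_std is not part of scipy.stats; the call above presumably refers to a helper from the accompanying package. A minimal sketch of a trimmed standard deviation, assuming it is defined as the standard deviation of the sample after removing a fraction alpha from each tail, is:

def trim_std(x, alpha):
    # standard deviation of the middle (1 - 2*alpha) portion of the sorted sample
    x = np.sort(np.asarray(x))
    k = int(np.floor(len(x) * alpha))   # number of observations trimmed on each side
    return np.std(x[k:len(x) - k], ddof=1)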
Solution 1.22
germanCars = [10, 10.9, 4.8, 6.4, 7.9, 8.9, 8.5, 6.9, 7.1,
5.5, 6.4, 8.7, 5.1, 6.0, 7.5]
japaneseCars = [9.4, 9.5, 7.1, 8.0, 8.9, 7.7, 10.5, 6.5, 6.7,
9.3, 5.7, 12.5, 7.2, 9.1, 8.3, 8.2, 8.5, 6.8, 9.5, 9.7]
# convert to pandas Series
germanCars = pd.Series(germanCars)
japaneseCars = pd.Series(japaneseCars)
# use describe to calculate statistics
comparison = pd.DataFrame({
'German': germanCars.describe(),
'Japanese': japaneseCars.describe(),
})
print(comparison)
German Japanese
count 15.000000 20.000000
mean 7.373333 8.455000
std 1.780235 1.589596
min 4.800000 5.700000
25% 6.200000 7.175000
50% 7.100000 8.400000
75% 8.600000 9.425000
max 10.900000 12.500000
Solution 1.23
hadpas = mistat.load_data('HADPAS')
sampleStatistics = pd.DataFrame({
'res3': hadpas['res3'].describe(),
'res7': hadpas['res7'].describe(),
})
print(sampleStatistics)
res3 res7
count 192.000000 192.000000
mean 1965.239583 1857.776042
std 163.528165 151.535930
min 1587.000000 1420.000000
25% 1860.000000 1772.250000
50% 1967.000000 1880.000000
75% 2088.750000 1960.000000
max 2427.000000 2200.000000
Histogram:
ax = hadpas.hist(column='res3', alpha=0.5)
hadpas.hist(column='res7', alpha=0.5, ax=ax)
plt.show()
[Figure: overlaid histograms of res3 and res7]
We overlay both histograms in one plot and make them transparent (alpha).
Stem and leaf diagrams:
print('res3')
mistat.stemLeafDiagram(hadpas['res3'], 2, leafUnit=10)
print('res7')
mistat.stemLeafDiagram(hadpas['res7'], 2, leafUnit=10)
res3
1 15 8
6 16 01124
14 16 56788889
1 Analyzing Variability: Descriptive Statistics 15
22 17 00000234
32 17 5566667899
45 18 0011112233444
60 18 556666677888899
87 19 000000111111223333344444444
(18) 19 566666666667888889
87 20 00000000122222333334444
64 20 55556666666677789999
44 21 0000011222233344444
25 21 566667788888
13 22 000111234
4 22 668
0 23
1 24 2
res7
1 14 2
9 15 11222244
15 15 667789
23 16 00012334
30 16 5566799
40 17 0022233334
54 17 66666666777999
79 18 0000222222222233344444444
(28) 18 5555556666666778888888999999
85 19 000000001111112222222233333444444
52 19 566666667777888888889999
28 20 0000111222333444
12 20 678
9 21 1123344
2 21 8
1 22 0
Solution 1.24
hadpas.boxplot(column='res3')
plt.show()
[Figure: box plot of res3]
Lower whisker starts at max(1587, 1511.7) = 1587 = 𝑋 (1) ; upper whisker ends
at min(2427, 2440.5) = 2427 = 𝑋 (𝑛) . There are no outliers.
Chapter 2
Probability Models and Distribution Functions
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
[Figure: the sets A and B plotted in the plane]
$$(A \cap B)^c = (A \cap B^c) \cup (A^c \cap B) \cup (A^c \cap B^c) = (A \cap B^c) \cup A^c = A^c \cup B^c.$$
$$B = B \cap S = B \cap \left(\bigcup_{i=1}^{n} A_i\right) = \bigcup_{i=1}^{n} A_i B.$$
Solution 2.7
$$\begin{aligned}
\Pr\{A \cup B \cup C\} &= \Pr\{(A \cup B) \cup C\} = \Pr\{A \cup B\} + \Pr\{C\} - \Pr\{(A \cup B) \cap C\} \\
&= \Pr\{A\} + \Pr\{B\} - \Pr\{A \cap B\} + \Pr\{C\} - \Pr\{A \cap C\} - \Pr\{B \cap C\} + \Pr\{A \cap B \cap C\}.
\end{aligned}$$
Solution 2.8 We have shown in Exercise 2.6 that $B = \bigcup_{i=1}^{n} A_i B$. Moreover, since $\{A_1, \dots, A_n\}$ is a partition, $A_i B \cap A_j B = (A_i \cap A_j) \cap B = \emptyset \cap B = \emptyset$ for all $i \ne j$. Hence, from Axiom 3,
$$\Pr\{B\} = \Pr\left\{\bigcup_{i=1}^{n} A_i B\right\} = \sum_{i=1}^{n} \Pr\{A_i B\}.$$
Solution 2.9
$$S = \{(i_1, i_2) : i_j = 1, \dots, 6,\ j = 1, 2\},\qquad A = \{(i_1, i_2) : i_1 + i_2 = 10\} = \{(4, 6), (5, 5), (6, 4)\},$$
$$\Pr\{A\} = \frac{3}{36} = \frac{1}{12}.$$

Solution 2.10
$$\Pr\{B\} = \Pr\{A_{150}\} - \Pr\{A_{280}\} = \exp\left(-\frac{150}{200}\right) - \exp\left(-\frac{280}{200}\right) = 0.2258.$$
Solution 2.11
$$\frac{\binom{10}{2}\binom{10}{2}\binom{15}{2}\binom{5}{2}}{\binom{40}{8}} = 0.02765.$$

Solution 2.12 (i) $100^5 = 10^{10}$; (ii) $\binom{100}{5} = 75{,}287{,}520$.
Solution 2.13 $N = 1000$, $M = 900$, $n = 10$.
(i) $\Pr\{X \ge 8\} = \sum_{j=8}^{10} \binom{10}{j}(0.9)^j(0.1)^{10-j} = 0.9298.$
(ii) $\Pr\{X \ge 8\} = \sum_{j=8}^{10} \dfrac{\binom{900}{j}\binom{100}{10-j}}{\binom{1000}{10}} = 0.9308.$
Solution 2.19
Solution 2.21 Let $n$ be the number of people in the party. The probability that all their birthdays fall on different days is $\Pr\{D_n\} = \prod_{j=1}^{n} \frac{365 - j + 1}{365}$.
(i) If $n = 10$, $\Pr\{D_{10}\} = 0.8831$.
(ii) If $n = 23$, $\Pr\{D_{23}\} = 0.4927$. Thus, the probability of at least 2 persons with the same birthday, when $n = 23$, is $1 - \Pr\{D_{23}\} = 0.5073 > \frac{1}{2}$.
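The probabilities Pr{D_n} can be reproduced with a short computation:

import math
for n in (10, 23):
    p_distinct = math.prod((365 - j + 1) / 365 for j in range(1, n + 1))
    print(n, round(p_distinct, 4))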
Solution 2.22 $\left(\dfrac{7}{10}\right)^{10} = 0.02825$.

Solution 2.23 $\prod_{j=1}^{10} \left(1 - \dfrac{1}{24 - j + 1}\right) = 0.5833$.
Solution 2.24 (i) $\dfrac{\binom{4}{1}\binom{86}{4}}{\binom{100}{5}} = 0.1128$; (ii) $\dfrac{\binom{4}{1}\binom{10}{1}\binom{86}{3}}{\binom{100}{5}} = 0.0544$; (iii) $1 - \dfrac{\binom{86}{5}}{\binom{100}{5}} = 0.5374$.

Solution 2.25 $\dfrac{4}{\binom{10}{2}} = 0.0889$.
Solution 2.26 The sample median is $X_{(6)}$, where $X_{(1)} < \cdots < X_{(11)}$ are the ordered sample values. $\Pr\{X_{(6)} = k\} = \dfrac{\binom{k-1}{5}\binom{20-k}{5}}{\binom{20}{11}}$, $k = 6, \dots, 15$. This is the probability distribution of the sample median. The probabilities are
𝑘 6 7 8 9 10 11 12 13 14 15
Pr .01192 .04598 .09902 .15404 .18905 .18905 .15404 .09902 .04598 .01192
Solution 2.27 Without loss of generality, assume that the stick is of length 1. Let 𝑥, 𝑦
and (1 − 𝑥 − 𝑦), 0 < 𝑥, 𝑦 < 1, be the length of the 3 pieces. Obviously, 0 < 𝑥 + 𝑦 < 1.
All points in S = {(𝑥, 𝑦) : 𝑥, 𝑦 > 0, 𝑥 + 𝑦 < 1} are uniformly distributed. In order
that the three pieces can form a triangle, the following three conditions should be
satisfied:
(i) 𝑥 + 𝑦 > (1 − 𝑥 − 𝑦)
(ii) 𝑥 + (1 − 𝑥 − 𝑦) > 𝑦
(iii) 𝑦 + (1 − 𝑥 − 𝑦) > 𝑥.
The set of points (𝑥, 𝑦) satisfying (i), (ii) and (iii) is bounded by a triangle of area
1/8. S is bounded by a triangle of area 1/2. Hence, the required probability is 1/4.
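A quick Monte Carlo check of the result; the break points are modeled here as two independent uniform points on the stick, which is equivalent to the uniform distribution on S used above:

import random
random.seed(1)
N = 100_000
count = 0
for _ in range(N):
    u, v = random.random(), random.random()
    x, y = min(u, v), abs(u - v)   # lengths of the first two pieces
    z = 1 - x - y                  # length of the third piece
    if x + y > z and x + z > y and y + z > x:
        count += 1
print(count / N)   # should be close to 1/4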
Solution 2.28 Consider Fig. 2.1. Suppose that the particle is moving along the circumference of the circle in a counterclockwise direction. Then, using the notation in the diagram, $\Pr\{\text{hit}\} = \phi/2\pi$. Since $OD = 1 = OE$, $OB = \sqrt{a^2 + h^2} = OC$ and the lines $DB$ and $EC$ are tangential to the circle, it follows that the triangles $\triangle ODB$ and $\triangle OEC$ are congruent. Thus $m(\angle DOB) = m(\angle EOC)$, and it is easily seen that $\phi = 2\alpha$. Now $\alpha = \tan^{-1}\frac{h}{a}$, and hence $\Pr\{\text{hit}\} = \frac{1}{\pi}\tan^{-1}\frac{h}{a}$.

[Figure: diagram of the circle with the points O, A, B, D, E and the angle α]
Solution 2.29 $1 - (0.999)^{100} - 100(0.001)(0.999)^{99} - \binom{100}{2}(0.001)^2(0.999)^{98} = 0.0001504$.

Solution 2.30 The probability that $n$ tosses are required is $p(n) = (n-1)\left(\frac{1}{2}\right)^n$, $n \ge 2$. Thus, $p(4) = 3 \cdot \frac{1}{2^4} = \frac{3}{16}$.
Solution 2.31 $S = \{(i_1, \dots, i_{10}) : i_j = 0, 1,\ j = 1, \dots, 10\}$. One random variable is the number of 1's in an element, i.e., for $\boldsymbol{\omega} = (i_1, \dots, i_{10})$, $X_1(\boldsymbol{\omega}) = \sum_{j=1}^{10} i_j$. Another random variable is the number of zeros to the left of the first 1, i.e., $X_2(\boldsymbol{\omega}) = \sum_{j=1}^{10} \prod_{k=1}^{j}(1 - i_k)$. Notice that $X_2(\boldsymbol{\omega}) = 0$ if $i_1 = 1$ and $X_2(\boldsymbol{\omega}) = 10$ if $i_1 = i_2 = \cdots = i_{10} = 0$. The probability distribution of $X_1$ is $\Pr\{X_1 = k\} = \binom{10}{k}/2^{10}$, $k = 0, 1, \dots, 10$. The probability distribution of $X_2$ is
$$\Pr\{X_2 = k\} = \begin{cases} \left(\frac{1}{2}\right)^{k+1}, & k = 0, \dots, 9 \\ \left(\frac{1}{2}\right)^{10}, & k = 10. \end{cases}$$
Solution 2.32 (i) Since $\sum_{x=0}^{\infty} \frac{5^x}{x!} = e^5$, we have $\sum_{x=0}^{\infty} p(x) = 1$; (ii) $\Pr\{X \le 1\} = e^{-5}(1 + 5) = 0.0404$; (iii) $\Pr\{X \le 7\} = 0.8666$.
Solution 2.33 (i) Pr{𝑋 = −1} = 0.3; (ii) Pr{−0.5 < 𝑋 < 0} = 0.1; (iii) Pr{0 ≤ 𝑋 <
0.75} = 0.425; (iv) Pr{𝑋 = 1} = 0; (v) 𝐸 {𝑋 } = −0.25, 𝑉 {𝑋 } = 0.4042.
Solution 2.34
$$E\{X\} = \int_0^{\infty} (1 - F(x))\,\mathrm{d}x = \int_0^{\infty} e^{-x^2/2\sigma^2}\,\mathrm{d}x = \sigma\sqrt{\frac{\pi}{2}}.$$

Solution 2.35
$$E\{X\} = \frac{1}{N}\sum_{i=1}^{N} i = \frac{N+1}{2};\qquad E\{X^2\} = \frac{1}{N}\sum_{i=1}^{N} i^2 = \frac{(N+1)(2N+1)}{6}.$$
When $j = l$ the term is $(-1)^l \mu_1^l$. When $j = l - 1$ the term is $(-1)^{l-1} l \mu_1^{l-1}\mu_1 = (-1)^{l-1} l \mu_1^l$. Thus, the sum of the last 2 terms is $(-1)^{l-1}(l-1)\mu_1^l$ and we have
$$\mu_l^* = \sum_{j=0}^{l-2} (-1)^j \binom{l}{j} \mu_1^j \mu_{l-j} + (-1)^{l-1}(l-1)\mu_1^l.$$
Solution 2.39 We saw in the solution of Exercise 2.33 that $\mu_1 = -0.25$. Moreover, $\mu_2 = V\{X\} + \mu_1^2 = 0.4667$.
Solution 2.40 $M_X(t) = \dfrac{1}{t(b-a)}\left(e^{tb} - e^{ta}\right)$, $-\infty < t < \infty$, $a < b$.
$$E\{X\} = \frac{a+b}{2};\qquad V\{X\} = \frac{(b-a)^2}{12}.$$
Solution 2.41 (i) For $t < \lambda$ we have $M(t) = \left(1 - \frac{t}{\lambda}\right)^{-1}$,
$$M'(t) = \frac{1}{\lambda}\left(1 - \frac{t}{\lambda}\right)^{-2},\ \mu_1 = M'(0) = \frac{1}{\lambda};\qquad M''(t) = \frac{2}{\lambda^2}\left(1 - \frac{t}{\lambda}\right)^{-3},\ \mu_2 = M''(0) = \frac{2}{\lambda^2};$$
$$M^{(3)}(t) = \frac{6}{\lambda^3}\left(1 - \frac{t}{\lambda}\right)^{-4},\ \mu_3 = M^{(3)}(0) = \frac{6}{\lambda^3};\qquad M^{(4)}(t) = \frac{24}{\lambda^4}\left(1 - \frac{t}{\lambda}\right)^{-5},\ \mu_4 = M^{(4)}(0) = \frac{24}{\lambda^4}.$$
(ii) The central moments are
$$\mu_1^* = 0,\quad \mu_2^* = \frac{1}{\lambda^2},\quad \mu_3^* = \mu_3 - 3\mu_2\mu_1 + 2\mu_1^3 = \frac{6}{\lambda^3} - \frac{6}{\lambda^3} + \frac{2}{\lambda^3} = \frac{2}{\lambda^3},$$
$$\mu_4^* = \mu_4 - 4\mu_3\mu_1 + 6\mu_2\mu_1^2 - 3\mu_1^4 = \frac{1}{\lambda^4}(24 - 4\cdot 6 + 6\cdot 2 - 3) = \frac{9}{\lambda^4}.$$
(iii) The index of kurtosis is $\beta_4 = \dfrac{\mu_4^*}{(\mu_2^*)^2} = 9$.
x = list(range(15))
table = pd.DataFrame({
'x': x,
'p.d.f.': [stats.binom(20, 0.17).pmf(x) for x in x],
'c.d.f.': [stats.binom(20, 0.17).cdf(x) for x in x],
})
print(table)
x p.d.f. c.d.f.
0 0 2.407475e-02 0.024075
1 1 9.861947e-02 0.122694
2 2 1.918921e-01 0.314586
3 3 2.358192e-01 0.550406
4 4 2.052764e-01 0.755682
5 5 1.345426e-01 0.890224
6 6 6.889229e-02 0.959117
7 7 2.822094e-02 0.987338
8 8 9.392812e-03 0.996731
9 9 2.565105e-03 0.999296
10 10 5.779213e-04 0.999874
11 11 1.076086e-04 0.999981
12 12 1.653023e-05 0.999998
13 13 2.083514e-06 1.000000
14 14 2.133719e-07 1.000000
Solution 2.45 $\Pr\{\text{no defective chip on the board}\} = p^{50}$. Solving $p^{50} = 0.99$ yields $p = (0.99)^{1/50} = 0.999799$.
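A quick numerical check:

print(0.99**(1/50))   # 0.999798...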
Solution 2.46 Notice first that $\lim_{n\to\infty,\ np\to\lambda} b(0; n, p) = \lim_{n\to\infty}\left(1 - \frac{\lambda}{n}\right)^n = e^{-\lambda}$. Moreover, for all $j = 0, 1, \dots, n-1$, $\frac{b(j+1; n, p)}{b(j; n, p)} = \frac{n-j}{j+1}\cdot\frac{p}{1-p}$. Thus, by induction on $j$, for $j > 0$,
$$\lim_{\substack{n\to\infty\\ np\to\lambda}} b(j; n, p) = \lim_{\substack{n\to\infty\\ np\to\lambda}} b(j-1; n, p)\cdot\frac{n-j+1}{j}\cdot\frac{p}{1-p} = e^{-\lambda}\frac{\lambda^{j-1}}{(j-1)!}\lim_{\substack{n\to\infty\\ np\to\lambda}}\frac{(n-j+1)p}{j\left(1 - \frac{\lambda}{n}\right)} = e^{-\lambda}\frac{\lambda^{j-1}}{(j-1)!}\cdot\frac{\lambda}{j} = e^{-\lambda}\frac{\lambda^j}{j!}.$$
Solution 2.50
$$\Pr\{R\} = 1 - H(3; 100, 10, 20) + \sum_{i=1}^{3} h(i; 100, 10, 20)\,[1 - H(3-i; 80, 10-i, 40)] = 0.87395.$$
Solution 2.51 The m.g.f. of the Poisson distribution with parameter $\lambda$, $P(\lambda)$, is $M(t) = e^{\lambda(e^t - 1)}$. Differentiating and evaluating at $t = 0$ gives
$$\mu_1 = \lambda,\quad \mu_2 = \lambda^2 + \lambda,\quad \mu_3 = \lambda^3 + 3\lambda^2 + \lambda,\quad \mu_4 = \lambda^4 + 6\lambda^3 + 7\lambda^2 + \lambda,$$
$$\mu_1^* = 0,\quad \mu_2^* = \lambda,\quad \mu_3^* = \lambda,\quad \mu_4^* = 3\lambda^2 + \lambda.$$
Thus, the indices of skewness and kurtosis are $\beta_3 = \lambda^{-1/2}$ and $\beta_4 = 3 + \frac{1}{\lambda}$. For $\lambda = 10$ we have $\beta_3 = 0.3162$ and $\beta_4 = 3.1$.
Solution 2.52 Let 𝑋 be the number of blemishes observed. Pr{𝑋 > 2} = 0.1912.
Solution 2.53 Using the Poisson approximation with 𝑁 = 8000 and 𝑝 = 380 × 10−6 ,
we have 𝜆 = 3.04 and Pr{𝑋 > 6} = 0.0356, where 𝑋 is the number of insertion
errors in 2 hours of operation.
Solution 2.55 Using Python we obtain that for the $NB(p, k)$ with $p = 0.01$ and $k = 3$, $Q_1 = 170$, $M_e = 265$, and $Q_3 = 389$.
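The Python computation is not shown in this extract. One way to obtain such quantiles is scipy's nbinom, which counts the number of failures before the k-th success; the book's NB convention may differ by an offset, so the values should be interpreted accordingly:

q = stats.nbinom(3, 0.01).ppf([0.25, 0.5, 0.75])
print(q)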
$$\mu_1^* = 0,\quad \mu_2^* = \frac{1-p}{p^2},\quad \mu_3^* = \frac{1}{p^3}(1-p)(2-p),\quad \mu_4^* = \frac{1}{p^4}(9 - 18p + 10p^2 - p^3).$$
Thus the indices of skewness and kurtosis are $\beta_3^* = \dfrac{2-p}{\sqrt{1-p}}$ and $\beta_4^* = \dfrac{9 - 9p + p^2}{1-p}$.
Solution 2.58 If there are 𝑛 chips, 𝑛 > 50, the probability of at least 50 good ones is
1 − 𝐵(49; 𝑛, 0.998). Thus, 𝑛 is the smallest integer > 50 for which 𝐵(49; 𝑛, 0.998) <
0.05. It is sufficient to order 51 chips.
Solution 2.59 If $X$ has a geometric distribution then, for every $j$, $j = 1, 2, \dots$, $\Pr\{X > j\} = (1-p)^j$. Thus,
$$\Pr\{X > n + m \mid X > m\} = \frac{\Pr\{X > n+m\}}{\Pr\{X > m\}} = \frac{(1-p)^{n+m}}{(1-p)^m} = (1-p)^n = \Pr\{X > n\}.$$
Solution 2.60 For $0 < y < 1$, $\Pr\{F(X) \le y\} = \Pr\{X \le F^{-1}(y)\} = F(F^{-1}(y)) = y$. Hence, the distribution of $F(X)$ is uniform on $(0, 1)$. Conversely, if $U$ has a uniform distribution on $(0, 1)$, then $F^{-1}(U)$ has the c.d.f. $F$.
Solution 2.61 $E\{U(10, 50)\} = 30$; $V\{U(10, 50)\} = \dfrac{1600}{12} = 133.33$; $\sigma\{U(10, 50)\} = \dfrac{40}{2\sqrt{3}} = 11.547$.
Solution 2.62 Let $X = -\log(U)$ where $U$ has a uniform distribution on $(0, 1)$. Then $\Pr\{X \le x\} = \Pr\{U \ge e^{-x}\} = 1 - e^{-x}$ for $x \ge 0$, so $X \sim E(1)$.
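A small simulation illustrates that X = −log(U) indeed behaves like an E(1) random variable (sample size and seed are arbitrary choices here):

import numpy as np
rng = np.random.default_rng(1)
U = rng.uniform(size=100_000)
X = -np.log(U)
print(X.mean(), X.var())                 # both should be close to 1
print(np.mean(X <= 1), 1 - np.exp(-1))   # empirical vs. theoretical cdf at x = 1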
Solution 2.63 (i) Pr{92 < 𝑋 < 108} = 0.4062; (ii) Pr{𝑋 > 105} = 0.3694;
(iii) Pr{2𝑋 + 5 < 200} = Pr{𝑋 < 97.5} = 0.4338.
rv = stats.norm(100, 15)
print('(i)', rv.cdf(108) - rv.cdf(92))
print('(ii)', 1 - rv.cdf(105))
print('(iii)', rv.cdf((200 - 5)/2))
(i) 0.4061971427922976
(ii) 0.36944134018176367
(iii) 0.43381616738909634
Solution 2.64 Let 𝑧 𝛼 denote the 𝛼 quantile of a 𝑁 (0, 1) distribution. Then the two
equations 𝜇 + 𝑧 .9 𝜎 = 15 and 𝜇 + 𝑧 .99 𝜎 = 20 yield the solution 𝜇 = 8.8670 and
𝜎 = 4.7856.
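The two equations can also be solved numerically, e.g.:

import numpy as np
z90 = stats.norm.ppf(0.90)
z99 = stats.norm.ppf(0.99)
# solve mu + z90*sigma = 15 and mu + z99*sigma = 20
mu, sigma = np.linalg.solve([[1, z90], [1, z99]], [15, 20])
print(mu, sigma)   # approximately 8.867 and 4.786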
Solution 2.65 Due to symmetry, Pr{𝑌 > 0} = Pr{𝑌 < 0} = Pr{𝐸 < 𝑣}, where
𝐸 ∼ 𝑁 (0, 1). If the probability of a bit error is 𝛼 = 0.01, then Pr{𝐸 < 𝑣} = Φ(𝑣) =
1 − 𝛼 = 0.99.
Thus 𝑣 = 𝑧 .99 = 2.3263.
Solution 2.66 Let $X_p$ denote the diameter of an aluminum pin and $X_h$ denote the size of a hole drilled in an aluminum plate. If $X_p \sim N(10, 0.02)$ and $X_h \sim N(\mu_d, 0.02)$, then the probability that the pin will not enter the hole is $\Pr\{X_h - X_p < 0\}$. Now $X_h - X_p \sim N\left(\mu_d - 10, \sqrt{0.02^2 + 0.02^2}\right)$, and for $\Pr\{X_h - X_p < 0\} = 0.01$ we obtain $\mu_d = 10.0658$ mm. (The fact that the sum of two independent normal random variables is normally distributed should be given to the student since it has not yet been covered in the text.)
Solution 2.67 For $X_1, \dots, X_n$ i.i.d. $N(\mu, \sigma^2)$, $Y = \sum_{i=1}^n iX_i \sim N(\mu_Y, \sigma_Y^2)$ where
$$\mu_Y = \mu\sum_{i=1}^n i = \mu\frac{n(n+1)}{2}\qquad\text{and}\qquad \sigma_Y^2 = \sigma^2\sum_{i=1}^n i^2 = \sigma^2\frac{n(n+1)(2n+1)}{6}.$$
Solution 2.68 Pr{𝑋 > 300} = Pr{log 𝑋 > 5.7038} = 1 − Φ(0.7038) = 0.24078.
Solution 2.69 For $X \sim e^{N(\mu, \sigma)}$, $X \sim e^Y$ where $Y \sim N(\mu, \sigma)$ and $M_Y(t) = e^{\mu t + \sigma^2 t^2/2}$. Thus
$$\xi = E\{X\} = E\{e^Y\} = M_Y(1) = e^{\mu + \sigma^2/2}.$$
Since $E\{X^2\} = E\{e^{2Y}\} = M_Y(2) = e^{2\mu + 2\sigma^2}$, we have
$$V\{X\} = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2} = e^{2\mu + \sigma^2}(e^{\sigma^2} - 1) = \xi^2(e^{\sigma^2} - 1).$$
Solution 2.74
$$\begin{aligned}
G(t; k, \lambda) &= \frac{\lambda^k}{(k-1)!}\int_0^t x^{k-1}e^{-\lambda x}\,\mathrm{d}x \\
&= \frac{\lambda^k}{k!}t^k e^{-\lambda t} + \frac{\lambda^{k+1}}{k!}\int_0^t x^k e^{-\lambda x}\,\mathrm{d}x \\
&= \frac{\lambda^k}{k!}t^k e^{-\lambda t} + \frac{\lambda^{k+1}}{(k+1)!}t^{k+1}e^{-\lambda t} + \frac{\lambda^{k+2}}{(k+1)!}\int_0^t x^{k+1}e^{-\lambda x}\,\mathrm{d}x \\
&= \cdots \\
&= e^{-\lambda t}\sum_{j=k}^{\infty}\frac{(\lambda t)^j}{j!} = 1 - e^{-\lambda t}\sum_{j=0}^{k-1}\frac{(\lambda t)^j}{j!}.
\end{aligned}$$
Solution 2.75 $\Gamma(1.17) = 0.9267$, $\Gamma\left(\frac{1}{2}\right) = 1.77245$, $\Gamma\left(\frac{3}{2}\right) = \frac{1}{2}\Gamma\left(\frac{1}{2}\right) = 0.88623$.
2 2 2 2
from scipy.special import gamma
print(gamma(1.17), gamma(1 / 2), gamma(3 / 2))
Solution 2.76 The moment generating function of the sum of independent random variables is the product of their respective m.g.f.'s. Thus, if $X_1, \dots, X_k$ are i.i.d. $E(\beta)$, using the result of Exercise 2.73, $M_S(t) = \prod_{i=1}^{k}(1 - \beta t)^{-1} = (1 - \beta t)^{-k}$, $t < \frac{1}{\beta}$, where $S = \sum_{i=1}^{k} X_i$. On the other hand, $(1 - \beta t)^{-k}$ is the m.g.f. of $G(k, \beta)$.
Solution 2.77 The expected value and variance of $W(2, 3.5)$ are
$$E\{W(2, 3.5)\} = 3.5\,\Gamma\left(1 + \tfrac{1}{2}\right) = 3.1018,\qquad V\{W(2, 3.5)\} = (3.5)^2\left[\Gamma\left(1 + \tfrac{2}{2}\right) - \Gamma^2\left(1 + \tfrac{1}{2}\right)\right] = 2.6289.$$
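These values can be verified with scipy's weibull_min distribution (shape 2, scale 3.5):

W = stats.weibull_min(2, scale=3.5)
print(W.mean(), W.var())   # approximately 3.1018 and 2.6289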
Solution 2.78 Let $T$ be the number of days until failure. $T \sim W(1.5, 500) \sim 500\,W(1.5, 1)$.
$$\Pr\{T \ge 600\} = \Pr\left\{W(1.5, 1) \ge \frac{6}{5}\right\} = e^{-(6/5)^{1.5}} = 0.2686.$$

Solution 2.79 Let $X \sim \text{Beta}\left(\frac{1}{2}, \frac{3}{2}\right)$. Then
$$E\{X\} = \frac{\frac{1}{2}}{\frac{1}{2} + \frac{3}{2}} = \frac{1}{4},\qquad V\{X\} = \frac{\frac{1}{2}\cdot\frac{3}{2}}{2^2\cdot 3} = \frac{1}{16}\qquad\text{and}\qquad \sigma\{X\} = \frac{1}{4}.$$
Solution 2.80 Let $X \sim \text{Beta}(\nu, \nu)$. The first four moments are
$$\mu_1 = \frac{\nu}{2\nu} = \frac{1}{2},\qquad \mu_2 = \frac{B(\nu+2, \nu)}{B(\nu, \nu)} = \frac{\nu+1}{2(2\nu+1)},$$
$$\mu_3 = \frac{B(\nu+3, \nu)}{B(\nu, \nu)} = \frac{(\nu+1)(\nu+2)}{2(2\nu+1)(2\nu+2)},\qquad \mu_4 = \frac{B(\nu+4, \nu)}{B(\nu, \nu)} = \frac{(\nu+1)(\nu+2)(\nu+3)}{2(2\nu+1)(2\nu+2)(2\nu+3)}.$$
The variance is $\sigma^2 = \dfrac{1}{4(2\nu+1)}$ and the fourth central moment is
$$\mu_4^* = \mu_4 - 4\mu_3\mu_1 + 6\mu_2\mu_1^2 - 3\mu_1^4 = \frac{3}{16(3 + 8\nu + 4\nu^2)}.$$
Finally, the index of kurtosis is $\beta_2 = \dfrac{\mu_4^*}{\sigma^4} = \dfrac{3(1 + 2\nu)}{3 + 2\nu}$.
Solution 2.81 Let $(X, Y)$ have the joint p.d.f.
$$f(x, y) = \begin{cases} \frac{1}{2}, & (x, y) \in S \\ 0, & \text{otherwise.} \end{cases}$$
(i) The marginal distributions of $X$ and $Y$ have p.d.f.'s
$$f_X(x) = \frac{1}{2}\int_{-1+|x|}^{1-|x|}\mathrm{d}y = 1 - |x|,\quad -1 < x < 1,\qquad\text{and by symmetry}\qquad f_Y(y) = 1 - |y|,\quad -1 < y < 1.$$
(ii) $E\{X\} = E\{Y\} = 0$, $V\{X\} = V\{Y\} = 2\int_0^1 y^2(1-y)\,\mathrm{d}y = 2B(3, 2) = \frac{1}{6}$.
Solution 2.82 The marginal p.d.f. of $Y$ is $f(y) = e^{-y}$, $y > 0$, that is, $Y \sim E(1)$. The conditional p.d.f. of $X$, given $Y = y$, is $f(x \mid y) = \frac{1}{y}e^{-x/y}$, which is the p.d.f. of an exponential with parameter $y$. Thus, $E\{X \mid Y = y\} = y$, and $E\{X\} = E\{E\{X \mid Y\}\} = E\{Y\} = 1$. Also,
$$E\{XY\} = E\{Y\,E\{X \mid Y\}\} = E\{Y^2\} = 2,$$
$$\sigma_X^2 = E\{V\{X \mid Y\}\} + V\{E\{X \mid Y\}\} = E\{Y^2\} + V\{Y\} = 2 + 1 = 3.$$
The correlation between $X$ and $Y$ is $\rho_{XY} = \frac{1}{\sqrt{3}}$.
Solution 2.83 Let $(X, Y)$ have joint p.d.f. $f(x, y) = 2$ if $(x, y) \in T$, and $0$ otherwise. The marginal densities of $X$ and $Y$ are
$$f_X(x) = 2(1 - x),\ 0 \le x \le 1,\qquad f_Y(y) = 2(1 - y),\ 0 \le y \le 1.$$
Notice that $f(x, y) \ne f_X(x)f_Y(y)$ for $x = \frac{1}{2}$, $y = \frac{1}{4}$. Thus, $X$ and $Y$ are dependent.
$$\mathrm{cov}(X, Y) = E\{XY\} - E\{X\}E\{Y\} = E\{XY\} - \frac{1}{9},$$
$$E\{XY\} = 2\int_0^1\int_0^{1-x} xy\,\mathrm{d}y\,\mathrm{d}x = \int_0^1 x(1-x)^2\,\mathrm{d}x = B(2, 3) = \frac{1}{12}.$$
Hence, $\mathrm{cov}(X, Y) = \frac{1}{12} - \frac{1}{9} = -\frac{1}{36}$.
Solution 2.84 $J \mid N \sim B(N, p)$; $N \sim P(\lambda)$. $E\{N\} = \lambda$, $V\{N\} = \lambda$, $E\{J\} = \lambda p$.
$$V\{J\} = E\{V\{J \mid N\}\} + V\{E\{J \mid N\}\} = E\{Np(1-p)\} + V\{Np\} = \lambda p(1-p) + p^2\lambda = \lambda p,$$
$$E\{JN\} = E\{N\,E\{J \mid N\}\} = pE\{N^2\} = p(\lambda + \lambda^2).$$
Hence, $\mathrm{cov}(J, N) = p\lambda(1 + \lambda) - p\lambda^2 = p\lambda$ and $\rho_{JN} = \dfrac{p\lambda}{\sqrt{\lambda}\sqrt{\lambda p}} = \sqrt{p}$.
Solution 2.85 Let $X \sim G(2, 100) \sim 100\,G(2, 1)$ and $Y \sim W(1.5, 500) \sim 500\,W(1.5, 1)$. Then $XY \sim 5\times 10^4\, G(2, 1)\cdot W(1.5, 1)$ and $V\{XY\} = 25\times 10^8\cdot V\{GW\}$, where $G \sim G(2, 1)$ and $W \sim W\left(\frac{3}{2}, 1\right)$.
$$V\{GW\} = E\{G^2\}V\{W\} + E^2\{W\}V\{G\} = 6\left[\Gamma\left(1 + \tfrac{4}{3}\right) - \Gamma^2\left(1 + \tfrac{2}{3}\right)\right] + 2\,\Gamma^2\left(1 + \tfrac{2}{3}\right) = 3.88404.$$
Thus $V\{XY\} = 9.7101\times 10^9$.
Solution 2.87 Using the notation of Example 2.34, the joint p.d.f. of $J_1$ and $J_2$ is
$$p(j_1, j_2) = \frac{\binom{20}{j_1}\binom{50}{j_2}\binom{30}{20 - j_1 - j_2}}{\binom{100}{20}},\qquad 0 \le j_1, j_2;\ j_1 + j_2 \le 20.$$
The marginal distribution of $J_1$ is $H(100, 20, 20)$. The marginal distribution of $J_2$ is $H(100, 50, 20)$. Accordingly,
$$V\{J_1\} = 20\times 0.2\times 0.8\times\left(1 - \frac{19}{99}\right) = 2.585859,\qquad V\{J_2\} = 20\times 0.5\times 0.5\times\left(1 - \frac{19}{99}\right) = 4.040404.$$
The conditional distribution of $J_1$, given $J_2$, is $H(50, 20, 20 - J_2)$. Hence,
$$E\{J_1 J_2\} = E\{E\{J_1 J_2 \mid J_2\}\} = E\left\{J_2(20 - J_2)\cdot\frac{2}{5}\right\} = 8E\{J_2\} - 0.4E\{J_2^2\} = 80 - 0.4\times 104.040404 = 38.38381$$
and $\mathrm{cov}(J_1, J_2) = -1.61616$. Finally, the correlation between $J_1$ and $J_2$ is
$$\rho = \frac{-1.61616}{\sqrt{2.585859\times 4.040404}} = -0.50.$$
$$\begin{aligned}
\Pr\{X_{(1)} \le x \mid J \ge 1\} &= \sum_{k=1}^{10} b(k; 10, 0.95)\left(1 - e^{-kx/10}\right) \\
&= 1 - \sum_{k=1}^{10}\binom{10}{k}\left(0.95\,e^{-x/10}\right)^k(0.05)^{10-k} \\
&= 1 - \left[0.05 + 0.95\,e^{-x/10}\right]^{10} + (0.05)^{10} \\
&\approx 1 - \left(0.05 + 0.95\,e^{-x/10}\right)^{10}.
\end{aligned}$$
$$= 2772\sum_{j=0}^{5}(-1)^j\binom{5}{j}\frac{1}{\lambda(6+j)^2} = 0.73654/\lambda.$$
Solution 2.95 Let $U \sim N(0, 1)$ and $X \sim N(\mu, \sigma)$. We assume that $U$ and $X$ are independent. Then $\Phi(X) = \Pr\{U < X \mid X\}$ and therefore
$$E\{\Phi(X)\} = E\{\Pr\{U < X \mid X\}\} = \Pr\{U < X\} = \Pr\{U - X < 0\} = \Phi\left(\frac{\mu}{\sqrt{1 + \sigma^2}}\right).$$
The last equality follows from the fact that $U - X \sim N\left(-\mu, \sqrt{1 + \sigma^2}\right)$.
Solution 2.99 Let $Y = X_1 + X_2$ where $X_1, X_2$ are i.i.d. uniform on $(0, 1)$. Then the p.d.f.'s are $f_1(x) = f_2(x) = I\{0 < x < 1\}$ and
$$g(y) = \begin{cases} \int_0^y \mathrm{d}x = y, & 0 \le y < 1 \\ \int_{y-1}^1 \mathrm{d}x = 2 - y, & 1 \le y \le 2. \end{cases}$$
Solution 2.100 𝑋1 , 𝑋2 are i.i.d. 𝐸 (1). 𝑈 = 𝑋1 − 𝑋2 . Pr{𝑈 ≤ 𝑢} = 0 𝑒 −𝑥 Pr{𝑋1 ≤
𝑢 + 𝑥} d𝑥. Notice that −∞ < 𝑢 < ∞ and Pr{𝑋1 ≤ 𝑢 + 𝑥} = 0 if 𝑥 + 𝑢 < 0. Let
𝑎 + = max(𝑎, 0). Then
∫ ∞
+
Pr{𝑈 ≤ 𝑢} = 𝑒 −𝑥 (1 − 𝑒 −(𝑢+𝑥) ) d𝑥
0
∫ ∞
+
=1− 𝑒 −𝑥−(𝑢+𝑥) d𝑥
0
1 − 12 𝑒 −𝑢 , if 𝑢 ≥ 0
=
1 𝑒 − |𝑢 | ,
if 𝑢 < 0
2
1
Thus, the p.d.f. of 𝑈 is 𝑔(𝑢) = 𝑒 −|𝑢 | , −∞ < 𝑢 < ∞.
2
Solution 2.101 $T = X_1 + \cdots + X_{20} \sim N(20\mu, 20\sigma^2)$. $\Pr\{T \le 50\} \approx \Phi\left(\dfrac{50 - 40}{44.7214}\right) = 0.5885$.
Solution 2.102 $X \sim B(200, 0.15)$. $\mu = np = 30$, $\sigma = \sqrt{np(1-p)} = 5.0497$.
$$\Pr\{25 < X < 35\} \approx \Phi\left(\frac{34.5 - 30}{5.0497}\right) - \Phi\left(\frac{25.5 - 30}{5.0497}\right) = 0.6271.$$
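The normal approximation with continuity correction can be compared against the exact binomial probability:

B = stats.binom(200, 0.15)
Z = stats.norm(30, 5.0497)
print(B.cdf(34) - B.cdf(25))       # exact
print(Z.cdf(34.5) - Z.cdf(25.5))   # normal approximation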
Solution 2.103 $X \sim P(200)$. $\Pr\{190 < X < 210\} \approx 2\Phi\left(\dfrac{9.5}{\sqrt{200}}\right) - 1 = 0.4983$.
Solution 2.104 $X \sim \text{Beta}(3, 5)$. $\mu = E\{X\} = \frac{3}{8} = 0.375$. $\sigma = \sqrt{V\{X\}} = \left(\dfrac{3\cdot 5}{8^2\cdot 9}\right)^{1/2} = 0.161374$.
$$\Pr\{|\bar{X}_{200} - 0.375| < 0.2282\} \approx 2\Phi\left(\frac{\sqrt{200}\cdot 0.2282}{0.161374}\right) - 1 = 1.$$
Solution 2.105 𝑡 .95 [10] = 1.8125, 𝑡 .95 [15] = 1.7531, 𝑡 .95 [20] = 1.7247.
print(stats.t.ppf(0.95, 10))
print(stats.t.ppf(0.95, 15))
print(stats.t.ppf(0.95, 20))
1.8124611228107335
1.7530503556925547
1.7247182429207857
Solution 2.106 $F_{.95}[10, 30] = 2.1646$, $F_{.95}[15, 30] = 2.0148$, $F_{.95}[20, 30] = 1.9317$.
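The values were presumably produced analogously to the previous solution, e.g.:

print(stats.f.ppf(0.95, 10, 30))
print(stats.f.ppf(0.95, 15, 30))
print(stats.f.ppf(0.95, 20, 30))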
2.164579917125473
2.0148036912954885
1.931653475236928
Solution 2.107 The solution to this problem is based on the fact, which is not discussed in the text, that $t[\nu]$ is distributed like the ratio of two independent random variables, $N(0, 1)$ and $\sqrt{\chi^2[\nu]/\nu}$. Accordingly, $t[n] \sim \dfrac{N(0, 1)}{\sqrt{\chi^2[n]/n}}$, where $N(0, 1)$ and $\chi^2[n]$ are independent. Hence $t^2[n] \sim \dfrac{(N(0, 1))^2}{\chi^2[n]/n} \sim F[1, n]$. Thus, since $\Pr\{F[1, n] \le F_{1-\alpha}[1, n]\} = 1 - \alpha$,
$$\Pr\left\{-\sqrt{F_{1-\alpha}[1, n]} \le t[n] \le \sqrt{F_{1-\alpha}[1, n]}\right\} = 1 - \alpha.$$
It follows that $\sqrt{F_{1-\alpha}[1, n]} = t_{1-\alpha/2}[n]$, or $F_{1-\alpha}[1, n] = t^2_{1-\alpha/2}[n]$. (If you assign this problem, please inform the students of the above fact.)
Solution 2.108 A random variable $F[\nu_1, \nu_2]$ is distributed like the ratio of two independent random variables $\chi^2[\nu_1]/\nu_1$ and $\chi^2[\nu_2]/\nu_2$. Accordingly, $F[\nu_1, \nu_2] \sim \dfrac{\chi^2[\nu_1]/\nu_1}{\chi^2[\nu_2]/\nu_2}$ and
$$\begin{aligned}
1 - \alpha &= \Pr\{F[\nu_1, \nu_2] \le F_{1-\alpha}[\nu_1, \nu_2]\} = \Pr\left\{\frac{\chi_1^2[\nu_1]/\nu_1}{\chi_2^2[\nu_2]/\nu_2} \le F_{1-\alpha}[\nu_1, \nu_2]\right\} \\
&= \Pr\left\{\frac{\chi_2^2[\nu_2]/\nu_2}{\chi_1^2[\nu_1]/\nu_1} \ge \frac{1}{F_{1-\alpha}[\nu_1, \nu_2]}\right\} = \Pr\left\{F[\nu_2, \nu_1] \ge \frac{1}{F_{1-\alpha}[\nu_1, \nu_2]}\right\} = \Pr\{F[\nu_2, \nu_1] \ge F_{\alpha}[\nu_2, \nu_1]\}.
\end{aligned}$$
Hence $F_{1-\alpha}[\nu_1, \nu_2] = \dfrac{1}{F_{\alpha}[\nu_2, \nu_1]}$.
Solution 2.109 Using the fact that $t[\nu] \sim \dfrac{N(0, 1)}{\sqrt{\chi^2[\nu]/\nu}}$, where $N(0, 1)$ and $\chi^2[\nu]$ are independent,
$$V\{t[\nu]\} = V\left\{\frac{N(0, 1)}{\sqrt{\chi^2[\nu]/\nu}}\right\} = E\left\{V\left\{\frac{N(0, 1)}{\sqrt{\chi^2[\nu]/\nu}}\,\Bigg|\,\chi^2[\nu]\right\}\right\} + V\left\{E\left\{\frac{N(0, 1)}{\sqrt{\chi^2[\nu]/\nu}}\,\Bigg|\,\chi^2[\nu]\right\}\right\}.$$
By independence, $V\left\{\frac{N(0, 1)}{\sqrt{\chi^2[\nu]/\nu}}\,\Big|\,\chi^2[\nu]\right\} = \frac{\nu}{\chi^2[\nu]}$, and $E\left\{\frac{N(0, 1)}{\sqrt{\chi^2[\nu]/\nu}}\,\Big|\,\chi^2[\nu]\right\} = 0$. Thus, $V\{t[\nu]\} = \nu E\left\{\frac{1}{\chi^2[\nu]}\right\}$. Since $\chi^2[\nu] \sim G\left(\frac{\nu}{2}, 2\right) \sim 2G\left(\frac{\nu}{2}, 1\right)$,
$$E\left\{\frac{1}{\chi^2[\nu]}\right\} = \frac{1}{2}\cdot\frac{1}{\Gamma\left(\frac{\nu}{2}\right)}\int_0^{\infty} x^{\nu/2 - 2}e^{-x}\,\mathrm{d}x = \frac{1}{2}\cdot\frac{\Gamma\left(\frac{\nu}{2} - 1\right)}{\Gamma\left(\frac{\nu}{2}\right)} = \frac{1}{2}\cdot\frac{1}{\frac{\nu}{2} - 1} = \frac{1}{\nu - 2}.$$
Finally, $V\{t[\nu]\} = \dfrac{\nu}{\nu - 2}$, $\nu > 2$.
Solution 2.110 $E\{F[3, 10]\} = \dfrac{10}{8} = 1.25$, $V\{F[3, 10]\} = \dfrac{2\cdot 10^2\cdot 11}{3\cdot 8^2\cdot 6} = 1.9097$.
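A quick check with scipy:

F = stats.f(3, 10)
print(F.mean(), F.var())   # approximately 1.25 and 1.9097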
Chapter 3
Statistical Inference and Bootstrapping
import random
import math
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import pingouin as pg
import mistat
import os
os.environ['OUTDATED_IGNORE'] = '1'
Solution 3.1 By the WLLN, for any 𝜖 > 0, lim𝑛→∞ Pr{|𝑀𝑙 − 𝜇𝑙 | < 𝜖 } = 1. Hence,
𝑀𝑙 is a consistent estimator of the 𝑙-th moment.
Solution 3.2 Using the CLT, $\Pr\{|\bar{X}_n - \mu| < 1\} \approx 2\Phi\left(\frac{\sqrt{n}}{\sigma}\right) - 1$. To determine the sample size $n$ so that this probability is 0.95 we set $2\Phi\left(\frac{\sqrt{n}}{\sigma}\right) - 1 = 0.95$ and solve for $n$. This gives $\frac{\sqrt{n}}{\sigma} = z_{.975} = 1.96$. Thus $n \ge \sigma^2(1.96)^2 = 424$ for $\sigma = 10.5$.
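A direct computation of the required sample size:

import math
n = math.ceil((stats.norm.ppf(0.975) * 10.5)**2)
print(n)   # 424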
Solution 3.3 $\hat{\xi}_p = \bar{X}_n + z_p\hat{\sigma}_n$, where $\hat{\sigma}_n^2 = \frac{1}{n}\sum (X_i - \bar{X}_n)^2$.
Solution 3.4 Let (𝑋1 , 𝑌1 ), . . . , (𝑋𝑛 , 𝑌𝑛 ) be a random sample from a bivariate normal
distribution with density 𝑓 (𝑥, 𝑦; 𝜇, 𝜂, 𝜎𝑋 , 𝜎𝑌 , 𝜌) as given in Eq. (4.6.6). Let 𝑍𝑖 =
𝑋𝑖𝑌𝑖 for 𝑖 = 1, . . . , 𝑛. Then the first moment of 𝑍 is given by
$$\mu_1(F_Z) = E\{Z\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f(x, y; \mu, \eta, \sigma_X, \sigma_Y, \rho)\,\mathrm{d}x\,\mathrm{d}y = \mu\eta + \rho\sigma_X\sigma_Y.$$
Using this fact, as well as the first 2 moments of 𝑋 and 𝑌 , we get the following
moment equations:
$$\frac{1}{n}\sum_{i=1}^{n} X_i = \mu,\qquad \frac{1}{n}\sum_{i=1}^{n} Y_i = \eta,$$
$$\frac{1}{n}\sum_{i=1}^{n} X_i^2 = \sigma_X^2 + \mu^2,\qquad \frac{1}{n}\sum_{i=1}^{n} Y_i^2 = \sigma_Y^2 + \eta^2,$$
$$\frac{1}{n}\sum_{i=1}^{n} X_iY_i = \mu\eta + \rho\sigma_X\sigma_Y.$$
$$\hat{\nu}_1 = M_1(M_1 - M_2)/\hat{\sigma}_n^2\qquad\text{and}\qquad \hat{\nu}_2 = (M_1 - M_2)(1 - M_1)/\hat{\sigma}_n^2.$$
Solution 3.6 $V\{\bar{Y}_w\} = \dfrac{1}{\left(\sum_{i=1}^k w_i\right)^2}\sum_{i=1}^{k} w_i^2\dfrac{\sigma_i^2}{n_i}$. Let $\lambda_i = \dfrac{w_i}{\sum_{j=1}^k w_j}$, $\sum_{i=1}^{k}\lambda_i = 1$. We find weights $\lambda_i$ which minimize $V\{\bar{Y}_w\}$ under the constraint $\sum_{i=1}^{k}\lambda_i = 1$. The Lagrangian is
$$L(\lambda_1, \dots, \lambda_k, \rho) = \sum_{i=1}^{k}\lambda_i^2\frac{\sigma_i^2}{n_i} + \rho\left(\sum_{i=1}^{k}\lambda_i - 1\right).$$
Differentiating with respect to $\lambda_i$, we get
$$\frac{\partial}{\partial\lambda_i}L(\lambda_1, \dots, \lambda_k, \rho) = 2\lambda_i\frac{\sigma_i^2}{n_i} + \rho,\quad i = 1, \dots, k,\qquad\text{and}\qquad \frac{\partial}{\partial\rho}L(\lambda_1, \dots, \lambda_k, \rho) = \sum_{i=1}^{k}\lambda_i - 1.$$
Also
$$\mathrm{cov}(\hat{\beta}_0, \hat{\beta}_1) = \mathrm{cov}\left(\sum_{i=1}^{n}\left(\frac{1}{n} - w_i\bar{x}_n\right)Y_i,\ \sum_{i=1}^{n} w_iY_i\right) = \sigma^2\sum_{i=1}^{n}\left(\frac{1}{n} - w_i\bar{x}_n\right)w_i = -\sigma^2\frac{\bar{x}_n}{SS_x}.$$
$$\rho_{\hat{\beta}_0, \hat{\beta}_1} = \frac{-\sigma^2\dfrac{\bar{x}_n}{SS_x}}{\left[\sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}_n^2}{SS_x}\right)\dfrac{\sigma^2}{SS_x}\right]^{1/2}} = -\frac{\bar{x}_n}{\left(\dfrac{1}{n}\sum_{i=1}^{n} x_i^2\right)^{1/2}}.$$
$$M_X(t) = \frac{p^2}{(1 - (1-p)e^t)^2},\qquad t < -\log(1-p).$$
Let $X_1, X_2, \dots, X_n$ be i.i.d. $NB(2, p)$; then the m.g.f. of $T_n = \sum_{i=1}^{n} X_i$ is
$$M_{T_n}(t) = \frac{p^{2n}}{(1 - (1-p)e^t)^{2n}},\qquad t < -\log(1-p).$$
Thus, $T_n \sim NB(2n, p)$. According to Example 3.4, the MLE of $p$, based on $T_n$ (which is a sufficient statistic) is $\hat{p}_n = \dfrac{2n}{T_n + 2n} = \dfrac{2}{\bar{X}_n + 2}$, where $\bar{X}_n = T_n/n$ is the sample mean.
(i) According to the WLLN, $\bar{X}_n$ converges in probability to $E\{X_1\} = \dfrac{2(1-p)}{p}$. Substituting $\dfrac{2(1-p)}{p}$ for $\bar{X}_n$ in the formula of $\hat{p}_n$ we obtain $p^* = \dfrac{2}{\frac{2(1-p)}{p} + 2} = p$. This shows that the limit in probability, as $n \to \infty$, of $\hat{p}_n$ is $p$.
(ii) Substituting $k = 2n$ in the formulas of Example 3.4 we obtain
$$\text{Bias}(\hat{p}_n) \approx \frac{3p(1-p)}{4n}\qquad\text{and}\qquad V\{\hat{p}_n\} \approx \frac{p^2(1-p)}{2n}.$$
Solution 3.13 The likelihood function of $\mu$ and $\beta$ is
$$L(\mu, \beta) = I\{X_{(1)} \ge \mu\}\,\frac{1}{\beta^n}\exp\left\{-\frac{1}{\beta}\sum_{i=1}^{n}(X_{(i)} - X_{(1)}) - \frac{n}{\beta}(X_{(1)} - \mu)\right\},$$
where $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ are the order statistics.
(ii) Furthermore, $L^*(\beta)$ is maximized by $\hat{\beta}_n = \frac{1}{n}\sum_{i=2}^{n}(X_{(i)} - X_{(1)})$. The MLEs are $\hat{\mu} = X_{(1)}$ and $\hat{\beta}_n = \frac{1}{n}\sum_{i=2}^{n}(X_{(i)} - X_{(1)})$.
(iii) $X_{(1)}$ is distributed like $\mu + \frac{1}{n}E(\beta)$, with p.d.f.
$$f_{(1)}(x; \mu, \beta) = I\{x \ge \mu\}\,\frac{n}{\beta}e^{-\frac{n}{\beta}(x - \mu)}.$$
Thus, the joint p.d.f. of $(X_1, \dots, X_n)$ is factored into a product of the p.d.f. of $X_{(1)}$ and a function of $\hat{\beta}_n$, which does not depend on $X_{(1)}$ (nor on $\mu$). This implies that $X_{(1)}$ and $\hat{\beta}_n$ are independent. $V\{\hat{\mu}\} = V\{X_{(1)}\} = \frac{\beta^2}{n^2}$. It can be shown that $\hat{\beta}_n \sim \frac{1}{n}G(n-1, \beta)$. Accordingly, $V\{\hat{\beta}_n\} = \frac{n-1}{n^2}\beta^2 = \frac{1}{n}\left(1 - \frac{1}{n}\right)\beta^2$.
Solution 3.14 In sampling with replacement, the number of defective items in the
sample, 𝑋, has the binomial distribution 𝐵(𝑛, 𝑝). We test the hypotheses 𝐻0 : 𝑝 ≤
0.03 against 𝐻1 : 𝑝 > 0.03. 𝐻0 is rejected if 𝑋 > 𝐵−1 (1 − 𝛼, 20, 0.03). For 𝛼 = 0.05
the rejection criterion is 𝑘 𝛼 = 𝐵−1 (0.95, 20, 0.03) = 2. Since the number of defective
items in the sample is 𝑋 = 2, 𝐻0 is not rejected at the 𝛼 = 0.05 significance level.
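The critical value can be obtained directly from the binomial quantile function:

print(stats.binom.ppf(0.95, 20, 0.03))   # 2.0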
Fig. 3.1 The OC function B(2; 30, p)

Solution 3.15 The OC function is $OC(p) = B(2; 30, p)$, $0 < p < 1$. A plot of this OC function is given in Fig. 3.1.
Solution 3.16
$$1 - \Phi\left(\frac{p_1 - p_0}{\sqrt{p_1 q_1}}\sqrt{n} - z_{1-\alpha}\sqrt{\frac{p_0 q_0}{p_1 q_1}}\right) = \beta,$$
or, equivalently,
$$\frac{p_1 - p_0}{\sqrt{p_1 q_1}}\sqrt{n} - z_{1-\alpha}\sqrt{\frac{p_0 q_0}{p_1 q_1}} = z_{1-\beta}.$$
This gives
$$n \approx \frac{\left(z_{1-\alpha}\sqrt{p_0 q_0} + z_{1-\beta}\sqrt{p_1 q_1}\right)^2}{(p_1 - p_0)^2} = \frac{(1.645)^2\left(\sqrt{0.01\times 0.99} + \sqrt{0.03\times 0.97}\right)^2}{(0.02)^2} = 494.$$
For this sample size, the critical value is $k_\alpha = np_0 + z_{1-\alpha}\sqrt{np_0 q_0} = 8.58$. Thus, $H_0$ is rejected if there are more than 8 "successes" in a sample of size 494.
Solution 3.17 $\bar{X}_n \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
(i) If $\mu = \mu_0$ the probability that $\bar{X}_n$ will be outside the control limits is
$$\Pr\left\{\bar{X}_n < \mu_0 - \frac{3\sigma}{\sqrt{n}}\right\} + \Pr\left\{\bar{X}_n > \mu_0 + \frac{3\sigma}{\sqrt{n}}\right\} = \Phi(-3) + 1 - \Phi(3) = 0.0027.$$
socell = mistat.load_data('SOCELL')
t1 = socell['t1']
pvalue 0.35
socell = mistat.load_data('SOCELL')
t2 = socell['t2']
pvalue 0.03
Solution 3.20 Let $n = 30$, $\alpha = 0.01$. The $OC(\delta)$ function for a one-sided $t$-test is
$$OC(\delta) = 1 - \Phi\left(\frac{\delta\sqrt{30} - 2.462\left(1 - \frac{1}{232}\right)}{\left(1 + \frac{6.0614}{58}\right)^{1/2}}\right) = 1 - \Phi(5.2117\,\delta - 2.3325).$$
a = np.sqrt(30)
b = 2.462 * (1 - 1/232)
f = np.sqrt(1 + 6.0614 / 58)
delta = np.arange(0, 1.1, 0.1)  # grid of delta values for the table below
OC_delta = 1 - stats.norm.cdf((a * delta - b) / f)
𝛿 OC(𝛿)
0.0 0.990164
0.1 0.964958
0.2 0.901509
0.3 0.779063
0.4 0.597882
0.5 0.392312
0.6 0.213462
0.7 0.094149
0.8 0.033120
0.9 0.009188
1.0 0.001994
Solution 3.21 Let $n = 31$, $\alpha = 0.10$. The OC function for testing $H_0: \sigma^2 \le \sigma_0^2$ against $H_1: \sigma^2 > \sigma_0^2$ is
$$OC(\sigma^2) = \Pr\left\{S^2 \le \frac{\sigma_0^2}{n-1}\chi^2_{0.9}[n-1]\right\} = \Pr\left\{\chi^2[30] \le \frac{\sigma_0^2}{\sigma^2}\chi^2_{0.9}[30]\right\} = 1 - P\left(\frac{30}{2} - 1;\ \frac{\sigma_0^2}{\sigma^2}\cdot\frac{\chi^2_{0.9}[30]}{2}\right).$$
The values of $OC(\sigma^2)$ for $\sigma^2 = 1.0, 1.1, \dots, 2.0$ are given in the following table (here $\sigma_0^2 = 1$):
𝜎2 OC(𝜎 2 )
1.0 0.900000
1.1 0.810804
1.2 0.700684
1.3 0.582928
1.4 0.469471
1.5 0.368201
1.6 0.282781
1.7 0.213695
1.8 0.159540
1.9 0.118063
2.0 0.086834
Fig. 3.2 Comparison of the exact OC(p) with its normal approximation

$$OC(p) = 1 - \Phi\left(\frac{p - p_0}{\sqrt{pq}}\sqrt{n} - z_{1-\alpha}\sqrt{\frac{p_0 q_0}{pq}}\right),\qquad p \ge p_0.$$
In Fig. 3.2, we present the graph of $OC(p)$, for $p_0 = 0.1$, $n = 100$, $\alpha = 0.05$, both using the exact solution and the normal approximation.
n = 100
p0 = 0.1
alpha = 0.05
df = len(data) - 1   # `data` holds the 20 observations (not shown in this extract)
mean = np.mean(data)
sem = stats.sem(data)
# 99% confidence interval for the mean based on the t-distribution
print(stats.t.interval(0.99, df, loc=mean, scale=sem))
(20.136889216656858, 21.38411078334315)
Confidence Intervals

Variable   N    Mean     StDev   SE      99.0% C.I. Mean
Sample     20   20.760   0.975   0.218   (20.137, 21.384)
(ii) A 99% C.I. for 𝜎 2 is (0.468, 2.638).
0.46795850248657883
2.6380728125212016
Solution 3.27 Let 𝑋 ∼ 𝐵(𝑛, 𝜃). For 𝑋 = 17 and 𝑛 = 20, a confidence interval for 𝜃,
at level 0.95, is (0.6211, 0.9679).
alpha = 1 - 0.95
X = 17
n = 20
F1 = stats.f(2*(n-X+1), 2*X).ppf(1 - alpha/2)
F2 = stats.f(2*(X+1), 2*(n-X)).ppf(1 - alpha/2)
pL = X / (X + (n-X+1) * F1)
pU = (X+1) * F2 / (n-X + (X+1) * F2)
print(pL, pU)
0.6210731734546859 0.9679290628145363
Solution 3.28 From the data we have $n = 10$ and $T_{10} = 134$. For $\alpha = 0.05$, $\lambda_L = 11.319$ and $\lambda_U = 15.871$.

n = 10       # number of observations
T_n = 134    # total count in the sample
alpha = 0.05
# exact solution
print(stats.chi2(2 * T_n + 2).ppf(alpha/2) / (2 * n))
print(stats.chi2(2 * T_n + 2).ppf(1 - alpha/2) / (2 * n))
# approximate solution
nu = 2 * T_n + 2
print((nu + stats.norm.ppf(alpha/2) * np.sqrt(2 * nu)) / (2 * n))
print((nu + stats.norm.ppf(1-alpha/2) * np.sqrt(2 * nu)) / (2 * n))
11.318870163746238
15.870459268116013
11.222727638613012
15.777272361386988
Solution 3.29 For 𝑛 = 20, 𝜎 = 5, 𝑌¯20 = 13.75, 𝛼 = 0.05 and 𝛽 = 0.1, the tolerance
interval is (3.33, 24.17).
yarnstrg = mistat.load_data('YARNSTRG')
n = len(yarnstrg)
Ybar = yarnstrg.mean()
S = yarnstrg.std()
0.7306652047594424 5.117020795240556
Fig. 3.3 Q–Q plot of ISC-t1 (SOCELL.csv); R² = 0.911
Solution 3.33 As is shown in the normal probability plots in Fig. 3.4, the hypothesis
of normality is not rejected in either case.
Solution 3.34 Frequency distribution for turn diameter:
Fig. 3.4 Normal Q–Q plots; R² = 0.985 and R² = 0.986
The expected frequencies were computed for 𝑁 (35.5138, 3.3208). Here 𝜒2 = 12.4,
d.f. = 8 and the 𝑃 value is 0.134. The differences from normal are not significant.
observed expected (O-E)^2/E
turn
[28, 31) 11 8.197333 0.958231
[31, 32) 8 6.318478 0.447500
[32, 33) 9 8.668695 0.012662
[33, 34) 6 10.869453 2.181487
[34, 35) 18 12.455897 2.467673
[35, 36) 8 13.045357 1.951317
[36, 37) 13 12.486789 0.021093
[37, 38) 6 10.923435 2.219102
[38, 39) 9 8.733354 0.008141
[39, 40) 8 6.381395 0.410550
[40, 44) 13 9.052637 1.721231
chi2-statistic of fit 12.398987638400024
Solution 3.35 $D^*_{109} = 0.0702$. For $\alpha = 0.05$, $k^*_\alpha = 0.895\Big/\left(\sqrt{109} - 0.01 + \frac{0.85}{\sqrt{109}}\right) = 0.0851$. The deviations from the normal distribution are not significant.
Solution 3.36 For $X \sim P(100)$, $n_0 = P^{-1}\left(\frac{0.2}{0.3}, 100\right) = 100 + z_{.67}\times 10 = 105$.
random.seed(1)
car = mistat.load_data('CAR')
Fig. 3.5 Q–Q plot of the means of samples from the mpg data; R² = 0.992
S = np.std(car['mpg'], ddof=1)
print('standard deviation of mpg', S)
print('S/8', S/8)
S_resample = np.std(means, ddof=1)
print('S.E.\{X\}', S_resample)
random.seed(1)
yarnstrg = mistat.load_data('YARNSTRG')

def confidence_interval(x):
    # two-sigma confidence interval for the mean (assumed definition;
    # the original helper is not shown in this extract)
    m = np.mean(x)
    se = np.std(x, ddof=1) / np.sqrt(len(x))
    return (m - 2 * se, m + 2 * se)

mean = np.mean(yarnstrg)
outside = 0
for _ in range(500):
    sample = random.choices(yarnstrg, k=30)
    ci = confidence_interval(sample)
    if mean < ci[0] or ci[1] < mean:
        outside += 1
hat_alpha = outside / 500
ci = confidence_interval(yarnstrg)
print(f' Mean: {mean}')
print(f' 2-sigma-CI: {ci[0]:.1f} - {ci[1]:.1f}')
print(f' proportion outside: {hat_alpha:.3f}')
Mean: 2.9238429999999993
2-sigma-CI: 2.7 - 3.1
proportion outside: 0.068
random.seed(1)
car = mistat.load_data('CAR')
us_cars = car[car['origin'] == 1]
us_turn = list(us_cars['turn'])
sample_means = []
for _ in range(100):
x = random.choices(us_turn, k=58)
sample_means.append(np.mean(x))
0.23
We obtained 𝑃˜ = 0.23. The mean 𝑋¯ = 37.203 is not significantly larger than 37.
stats.binom(50, 0.03).ppf(0.95)
4.0
random.seed(1)
cyclt = mistat.load_data('CYCLT')
# The compute_bootci calls are not shown in this extract; a plausible
# reconstruction using pingouin follows.
B = pg.compute_bootci(cyclt, func='mean', n_boot=1000,
                      confidence=0.95, return_dist=True, seed=1)
ci_mean, dist_mean = B
print(f' Mean: {np.mean(dist_mean):.3f}')
print(f' 95%-CI: {ci_mean[0]:.3f} - {ci_mean[1]:.3f}')
B = pg.compute_bootci(cyclt, func='std', n_boot=1000,
                      confidence=0.95, return_dist=True, seed=1)
ci_std, dist_std = B
print(f' Mean: {np.mean(dist_std):.3f}')
print(f' 95%-CI: {ci_std[0]:.3f} - {ci_std[1]:.3f}')
Mean: 0.652
95%-CI: 0.550 - 0.760
Mean: 0.370
95%-CI: 0.340 - 0.400
cyclt = mistat.load_data('CYCLT')
ebd = {}
for quantile in (1, 2, 3):
    B = pg.compute_bootci(cyclt, func=lambda x: np.quantile(x, 0.25 * quantile),
                          n_boot=1000, confidence=0.95, return_dist=True, seed=1)
    ci, dist = B
    ebd[quantile] = dist
    print(f'Quantile {quantile}: {np.mean(dist):.3f} 95%-CI: {ci[0]:.3f} - {ci[1]:.3f}')

Fig. 3.7 Histograms of EBD for quartiles for the CYCLT.csv data
socell = mistat.load_data('SOCELL')
t1 = socell['t1']
t2 = socell['t2']
# assumed helper (not shown in this extract): correlation of a resampled set of row indices
idx = list(range(len(t1)))
def sample_correlation(x):
    x = [int(i) for i in x]
    return np.corrcoef(t1.iloc[x], t2.iloc[x])[0, 1]
B = pg.compute_bootci(idx, func=sample_correlation,
                      n_boot=1000, confidence=0.95, return_dist=True, seed=1)
ci, dist = B
print(f'rho_XY: {np.mean(dist):.3f} 95%-CI: {ci[0]:.3f} - {ci[1]:.3f}')
Fig. 3.8 Histogram of the EBD for the correlation for the SOCELL.csv data
car = mistat.load_data('CAR')
mpg = car['mpg']
hp = car['hp']
idx = list(range(len(mpg)))
sample_intercept = []
sample_slope = []
for _ in range(1000):
x = random.choices(idx, k=len(idx))
result = stats.linregress(hp[x], mpg[x])
sample_intercept.append(result.intercept)
sample_slope.append(result.slope)
hm = np.mean(hp)
print(np.std(sample_intercept))
print(np.std(sample_slope))
(iii) The bootstrap S.E. of slope and intercept are 1.017 and 0.00747, respectively. The standard errors of a and b, according to the formulas of Sect. 4.3.2.1, are 0.8099 and 0.00619, respectively. The bootstrap estimates are quite close to the correct values.
cyclt = mistat.load_data('CYCLT')
X50 = np.mean(cyclt)
SD50 = np.std(cyclt)
result = stats.ttest_1samp(cyclt, 0.55)
print(f'Xmean_50 = {X50:.3f}')
print(result)
Xmean_50 = 0.652
TtestResult(statistic=1.9425149510299369, pvalue=0.057833259176805,
df=49)
p*-value: 0.024
almpin = mistat.load_data('ALMPIN')
diam1 = almpin['diam1']
diam2 = almpin['diam2']
Fig. 3.9 Box plots of diam1 and diam2 measurements of the ALMPIN dataset
almpin.boxplot(column=['diam1', 'diam2'])
plt.show()
Solution 3.52 The variances were already compared in the previous exercise. To compare the means use:

almpin = mistat.load_data('ALMPIN')
diam1 = almpin['diam1']
diam2 = almpin['diam2']
# Compare means
mean_diam1 = np.mean(diam1)
mean_diam2 = np.mean(diam2)
print(f'Mean diam1: {mean_diam1:.5f}')
print(f'Mean diam2: {mean_diam2:.5f}')
# two-sample t-test (assumed; the original test call is not shown in this extract)
_, p = stats.ttest_ind(diam1, diam2)
print(f'p-value: {p:.3f}')
The bootstrap based p-value for the comparison of the means is:
random.seed(1)
# return studentized distance between random samples from diam1 and diam2
def stat_func():
d1 = random.choices(diam1, k=len(diam1))
d2 = random.choices(diam2, k=len(diam2))
return stats.ttest_ind(d1, d2).statistic
p*-value: 0.014
The bootstrap based p-value for the comparison of the variances is:
# Step 2: compute Wi
Wi = Bt / S2
# Step 3: compute F*
FBoot = Wi.max(axis=1) / Wi.min(axis=1)
FBoot95 = np.quantile(FBoot, 0.95)
print('FBoot 95%', FBoot95)
pstar = sum(FBoot >= F0)/len(FBoot)
print(f'p*-value: {pstar}')
S2 diam1 0.000266
diam2 0.000320
dtype: float64
F0 1.2016104294478573
FBoot 95% 1.1855457165968324
p*-value: 0.04
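The construction of the bootstrap matrix Bt of re-sampled variances (the "Step 1" referred to above) is not shown in this extract. A minimal sketch for the ALMPIN case, assuming 500 bootstrap replicates, is given below; the MPG block that follows relies on an analogous Bt for its three origin columns.

# Step 1 (sketch): bootstrap the column variances
np.random.seed(1)
columns = ['diam1', 'diam2']
S2 = almpin[columns].var(ddof=1)
F0 = S2.max() / S2.min()
Bt = pd.DataFrame({
    c: [almpin[c].sample(frac=1, replace=True).var(ddof=1) for _ in range(500)]
    for c in columns
})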
3 Statistical Inference and Bootstrapping 61
mpg = mistat.load_data('MPG')
columns = ['origin1', 'origin2', 'origin3']
# variance for each column
S2 = mpg[columns].var(axis=0, ddof=1)
F0 = max(S2) / min(S2)
print('S2', S2)
print('F0', F0)
# Step 2: compute Wi
Wi = Bt / S2
# Step 3: compute F*
FBoot = Wi.max(axis=1) / Wi.min(axis=1)
FBoot95 = np.quantile(FBoot, 0.95)
print('FBoot 95%', FBoot95)
pstar = sum(FBoot >= F0)/len(FBoot)
print(f'p*-value: {pstar}')
S2 origin1 12.942529
origin2 6.884615
origin3 18.321321
dtype: float64
F0 2.6611975103595213
FBoot 95% 2.6925366761838987
p*-value: 0.058
Fig. 3.10 Distribution of EBD for Exercise 3.54 (histogram of F* values)
mpg = mistat.load_data('MPG.csv')
samples = [mpg[key].dropna() for key in ['origin1', 'origin2', 'origin3']]
def test_statistic_F(samples):
return stats.f_oneway(*samples).statistic
F0 = test_statistic_F(samples)
# deviations of the group means from the grand mean, used to center the bootstrap
# samples under the null hypothesis (assumed definition; not shown in this extract)
grand_mean = np.mean(np.concatenate([np.asarray(s) for s in samples]))
DB = [np.mean(s) - grand_mean for s in samples]
Ns = 1000
Fstar = []
for _ in range(Ns):
Ysamples = []
for sample, DBi in zip(samples, DB):
Xstar = np.array(random.choices(sample, k=len(sample)))
Ysamples.append(Xstar - DBi)
Fs = test_statistic_F(Ysamples)
Fstar.append(Fs)
Fstar = np.array(Fstar)
print(f'F = {F0:.3f}')
print('ratio', sum(Fstar > F0)/len(Fstar))
ax = pd.Series(Fstar).hist(bins=14, color='grey')
ax.axvline(F0, color='black', lw=2)
ax.set_xlabel('F* values')
plt.show()
F = 6.076
ratio 0.003
3 Statistical Inference and Bootstrapping 63
F = 6.076, P* = 0.003 and the hypothesis of equal means is rejected. See Fig. 3.10 for the calculated EBD.
np.random.seed(1)
The tolerance intervals of the number of defective items in future batches of size
𝑁 = 50, with 𝛼 = 0.05 and 𝛽 = 0.05 are
Limits
𝑝 Lower Upper
0.2 1 23
0.1 0 17
0.05 0 9
np.random.seed(1)
oturb = mistat.load_data('OTURB.csv')
# helper used below (assumed definition; not shown in this extract)
def getQuantile(x, p):
    return np.quantile(x, p)
B_025 = pg.compute_bootci(oturb, func=lambda x: getQuantile(x, p=0.025),
n_boot=500, seed=1, return_dist=True)
B_975 = pg.compute_bootci(oturb, func=lambda x: getQuantile(x, p=0.975),
n_boot=500, seed=1, return_dist=True)
tol_int = [np.quantile(B_025[1], 0.025),np.quantile(B_975[1], 0.975)]
print(f'Tolerance interval ({tol_int[0]}, {tol_int[1]})')
cyclt = mistat.load_data('CYCLT.csv')
# make use of the fact that a True value is interpreted as 1 and False as 0
print('Values greater 0.7:', sum(cyclt>0.7))
64 3 Statistical Inference and Bootstrapping
We find that in the sample of $n = 50$ cycle times, there are $X = 20$ values greater than 0.7. If the hypothesis is $H_0: \xi_{.5} \le 0.7$, the probability of observing a value smaller than 0.7 is $p \ge \frac{1}{2}$. Thus, the sign test rejects $H_0$ if $X < B^{-1}(\alpha; 50, \frac{1}{2})$. For $\alpha = 0.10$ the critical value is $k_\alpha = 20$. $H_0$ is not rejected.
Solution 3.58 We apply the wilcoxon test from scipy on the differences of oelect
from 220.
oelect = mistat.load_data('OELECT.csv')
print(stats.wilcoxon(oelect-220))
WilcoxonResult(statistic=1916.0, pvalue=0.051047599707252124)
car = mistat.load_data('CAR.csv')
fourCylinder = car[car['cyl'] == 4]
uscars = fourCylinder[fourCylinder['origin'] == 1]
foreign = fourCylinder[fourCylinder['origin'] != 1]
The original stat 3.08 is outside of the distribution of the bootstrap samples. The
difference between the means of the turn diameters is therefore significant. Foreign
cars have on the average a smaller turn diameter.
Chapter 4
Variability in Several Dimensions and
Regression Models
import random
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats as sms
import seaborn as sns
import matplotlib.pyplot as plt
import mistat
Solution 4.1 In Fig. 4.1, one sees that horsepower and miles per gallon are inversely proportional. Turn diameter seems to increase with horsepower.
car = mistat.load_data('CAR')
sns.pairplot(car[['turn', 'hp', 'mpg']])
plt.show()
[In this build the pairplot call raises an OptionError ("No such keys(s): 'mode.use_inf_as_null'") caused by a seaborn/pandas version incompatibility; Fig. 4.1 shows the intended scatterplot matrix.]
Solution 4.2 The box plots in Fig. 4.2 show that cars from Asia generally have the smallest turn diameter. The maximal turn diameter of cars from Asia is smaller than the median turn diameter of U.S. cars. European cars tend to have a larger turn diameter than those from Asia, but smaller than those from the U.S.
car = mistat.load_data('CAR')
ax = car.boxplot(column='turn', by='origin')
ax.set_title('')
ax.get_figure().suptitle('')
ax.set_xlabel('origin')
plt.show()
Fig. 4.1 Scatterplot matrix of turn, hp, and mpg for the CAR dataset

Fig. 4.2 Box plots of turn diameter by origin for the CAR dataset

Fig. 4.3 Multiple box plots of Res 3 grouped by hybrid number
Solution 4.3 (i) The multiple box plots (see Fig. 4.3) show that the conditional distributions of res3 at different hybrids are different.
hadpas = mistat.load_data('HADPAS')
ax = hadpas.boxplot(column='res3', by='hyb')
ax.set_title('')
ax.get_figure().suptitle('')
ax.set_xlabel('Hybrid number')
ax.set_ylabel('Res 3')
plt.show()
(ii) The matrix plot of all the Res variables (see Fig. 4.4) reveals that Res 3 and Res 7 are positively correlated. Res 20 is generally larger than the corresponding Res 14. Res 18 and Res 20 seem to be negatively associated.
sns.pairplot(hadpas[['res3', 'res7', 'res18', 'res14', 'res20']])
plt.show()

[In this build the pairplot call again raises an OptionError caused by a seaborn/pandas version incompatibility; Fig. 4.4 shows the intended scatterplot matrix.]
Solution 4.4 The joint frequency distribution of horsepower versus miles per gallon
is
car = mistat.load_data('CAR')
binned_car = pd.DataFrame({
'hp': pd.cut(car['hp'], bins=np.arange(50, 275, 25)),
'mpg': pd.cut(car['mpg'], bins=np.arange(10, 40, 5)),
})
freqDist = pd.crosstab(binned_car['hp'], binned_car['mpg'])
print(freqDist)
# You can get distributions for hp and mpg by summing along an axis
print(freqDist.sum(axis=0))
print(freqDist.sum(axis=1))
Fig. 4.4 Scatterplot matrix of the Res variables for the HADPAS dataset
mpg (10, 15] (15, 20] (20, 25] (25, 30] (30, 35]
hp
(50, 75] 0 1 0 4 2
(75, 100] 0 0 23 11 0
(100, 125] 0 10 11 1 0
(125, 150] 0 14 3 1 0
(150, 175] 0 17 0 0 0
(175, 200] 1 5 0 0 0
(200, 225] 1 3 0 0 0
(225, 250] 0 1 0 0 0
mpg
(10, 15] 2
(15, 20] 51
(20, 25] 37
(25, 30] 17
(30, 35] 2
dtype: int64
hp
(50, 75] 7
(75, 100] 34
(100, 125] 22
(125, 150] 18
(150, 175] 17
(175, 200] 6
(200, 225] 4
(225, 250] 1
dtype: int64
The intervals for HP are from 50 to 250 at fixed length of 25. The intervals for
MPG are from 10 to 35 at length 5. Students may get different results by defining
the intervals differently.
Solution 4.5 The joint frequency distribution of Res 3 and Res 14 is given in the
following table:
hadpas = mistat.load_data('HADPAS')
binned_hadpas = pd.DataFrame({
'res3': pd.cut(hadpas['res3'], bins=np.arange(1580, 2780, 200)),
'res14': pd.cut(hadpas['res14'], bins=np.arange(900, 3000, 300)),
})
pd.crosstab(binned_hadpas['res14'], binned_hadpas['res3'])
The intervals for Res 3 start at 1580 and end at 2580 with length of 200. The
intervals of Res 14 start at 900 and end at 2700 with length of 300.
Solution 4.6 The following is the conditional frequency distribution of Res 3, given
that Res 14 is between 1300 and 1500 ohms:
hadpas = mistat.load_data('HADPAS')
in_range = hadpas[hadpas['res14'].between(1300, 1500)]
pd.cut(in_range['res3'], bins=np.arange(1580, 2780, 200)).value_counts(sort=False)
res3
(1580, 1780] 8
(1780, 1980] 21
(1980, 2180] 25
(2180, 2380] 2
(2380, 2580] 0
Name: count, dtype: int64
Solution 4.7 Following the instructions in the question we obtained the following
results:
hadpas = mistat.load_data('HADPAS')
bins = [900, 1200, 1500, 1800, 2100, 3000]
binned_res14 = pd.cut(hadpas['res14'], bins=bins)
results = []
for group, df in hadpas.groupby(binned_res14):
res3 = df['res3']
results.append({
'res3': group,
'N': len(res3),
'mean': res3.mean(),
'std': res3.std(),
})
pd.DataFrame(results)
df = pd.DataFrame([
[10.0, 8.04, 10.0, 9.14, 10.0, 7.46, 8.0, 6.58],
[8.0, 6.95, 8.0, 8.14, 8.0, 6.77, 8.0, 5.76],
[13.0, 7.58, 13.0, 8.74, 13.0, 12.74, 8.0, 7.71],
[9.0, 8.81, 9.0, 8.77, 9.0, 7.11, 8.0, 8.84],
[11.0, 8.33, 11.0, 9.26, 11.0, 7.81, 8.0, 8.47],
[14.0, 9.96, 14.0, 8.10, 14.0, 8.84, 8.0, 7.04],
[6.0, 7.24, 6.0, 6.13, 6.0, 6.08, 8.0, 5.25],
[4.0, 4.26, 4.0, 3.10, 4.0, 5.39, 19.0, 12.50],
[12.0, 10.84, 12.0, 9.13, 12.0, 8.15, 8.0, 5.56],
[7.0, 4.82, 7.0, 7.26, 7.0, 6.42, 8.0, 7.91],
[5.0, 5.68, 5.0, 4.74, 5.0, 5.73, 8.0, 6.89],
], columns=['x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'x4', 'y4'])
results = []
for i in (1, 2, 3, 4):
x = df[f'x{i}']
y = df[f'y{i}']
model = smf.ols(formula=f'y{i} ~ 1 + x{i}', data=df).fit()
results.append({
'Data Set': i,
'Intercept': model.params['Intercept'],
'Slope': model.params[f'x{i}'],
'R2': model.rsquared,
})
pd.DataFrame(results)
Notice the influence of the point (19,12.5) on the regression in Data Set 4. Without
this point the correlation between 𝑥 and 𝑦 is zero.
Fig. 4.5 Anscombe’s quartet { fig:AnscombeQuartet}
{ fig:AnscombeQuartet} The dataset is known as Anscombe’s quartet (see Fig. 4.5). The four data sets not only have identical fitted linear regressions, they also have identical means and variances of 𝑥 and 𝑦 and identical correlations between 𝑥 and 𝑦. The dataset clearly demonstrates the importance of visualization in data analysis.
car = mistat.load_data('CAR')
car[['turn', 'hp', 'mpg']].corr()
turn hp mpg
turn 1.000000 0.507610 -0.541061
hp 0.507610 1.000000 -0.754716
mpg -0.541061 -0.754716 1.000000
Solution 4.10 $SSE = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2})^2$.
(i)
$$\frac{\partial}{\partial \beta_0} SSE = -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2})$$
$$\frac{\partial}{\partial \beta_1} SSE = -2 \sum_{i=1}^{n} X_{i1} (Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2})$$
$$\frac{\partial}{\partial \beta_2} SSE = -2 \sum_{i=1}^{n} X_{i2} (Y_i - \beta_0 - \beta_1 X_{i1} - \beta_2 X_{i2}).$$
Equating these partial derivatives to zero and arranging terms, we arrive at the
following set of linear equations:
$$\begin{pmatrix} n & \sum_{i=1}^{n} X_{i1} & \sum_{i=1}^{n} X_{i2} \\ \sum_{i=1}^{n} X_{i1} & \sum_{i=1}^{n} X_{i1}^2 & \sum_{i=1}^{n} X_{i1} X_{i2} \\ \sum_{i=1}^{n} X_{i2} & \sum_{i=1}^{n} X_{i1} X_{i2} & \sum_{i=1}^{n} X_{i2}^2 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_{i1} Y_i \\ \sum_{i=1}^{n} X_{i2} Y_i \end{pmatrix}$$
(ii) Let $b_0, b_1, b_2$ be the (unique) solution. From the first equation we get, after dividing by $n$, $b_0 = \bar{Y} - \bar{X}_1 b_1 - \bar{X}_2 b_2$, where $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$, $\bar{X}_1 = \frac{1}{n}\sum_{i=1}^{n} X_{i1}$, and $\bar{X}_2 = \frac{1}{n}\sum_{i=1}^{n} X_{i2}$. Substituting $b_0$ in the second and third equations and arranging terms, we obtain the reduced system of equations:
$$\begin{pmatrix} Q_1 - n\bar{X}_1^2 & P_{12} - n\bar{X}_1\bar{X}_2 \\ P_{12} - n\bar{X}_1\bar{X}_2 & Q_2 - n\bar{X}_2^2 \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} P_{1y} - n\bar{X}_1\bar{Y} \\ P_{2y} - n\bar{X}_2\bar{Y} \end{pmatrix}$$
where $Q_1 = \sum X_{i1}^2$, $Q_2 = \sum X_{i2}^2$, $P_{12} = \sum X_{i1} X_{i2}$, $P_{1y} = \sum X_{i1} Y_i$, and $P_{2y} = \sum X_{i2} Y_i$. Dividing both sides by $(n-1)$ we obtain Eq. (4.4.3), and solving we get $b_1$ and $b_2$.
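As a numerical illustration, the normal equations can also be solved directly with NumPy; a minimal sketch, using the ALMPIN variables of a later solution purely as example data:
almpin = mistat.load_data('ALMPIN')
# design matrix with an intercept column and two predictors
X = np.column_stack([np.ones(len(almpin)), almpin['diam2'], almpin['diam3']])
y = almpin['capDiam']
# solve (X'X) b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # least squares coefficients b0, b1, b2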
car = mistat.load_data('CAR')
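The code that produced the following output is not reproduced in the text; a minimal sketch of one way to obtain it, assuming statsmodels.formula.api is imported as smf as elsewhere in this chapter (the variable names below are illustrative):
model_mpg = smf.ols('mpg ~ hp + 1', data=car).fit()
model_turn = smf.ols('turn ~ hp + 1', data=car).fit()
print('Model mpg ~ hp + 1:')
print(model_mpg.params)
print('Model turn ~ hp + 1:')
print(model_turn.params)
residuals = pd.DataFrame({'e1': model_mpg.resid, 'e2': model_turn.resid})
# partial correlation of mpg and turn given hp = correlation of the residuals
print('Partial correlation', residuals['e1'].corr(residuals['e2']))
# regression of e1 on e2 without intercept
model_resid = smf.ols('e1 ~ e2 - 1', data=residuals).fit()
print('Model e1 ~ e2:')
print(model_resid.params)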
Model mpg ~ hp + 1:
Intercept 30.663308
hp -0.073611
dtype: float64
Model turn ~ hp + 1:
Intercept 30.281255
hp 0.041971
dtype: float64
Partial correlation -0.27945246615045016
Model e1 ~ e2:
e2 -0.251008
dtype: float64
Solution 4.14 The regression of MPG on HP is MPG = 30.6633 − 0.07361 HP. The regression of TurnD on HP is TurnD = 30.2813 + 0.041971 HP. The regression of the residuals 𝑒ˆ1 on 𝑒ˆ2 is 𝑒ˆ1 = −0.251 · 𝑒ˆ2. Thus, the partial correlation between MPG and TurnD, given HP, is −0.279.
almpin = mistat.load_data('ALMPIN')
model = smf.ols('capDiam ~ 1 + diam2 + diam3', data=almpin).fit()
model.summary2()
<class 'statsmodels.iolib.summary2.Summary'>
"""
Results: Ordinary least squares
===================================================================
Model: OLS Adj. R-squared: 0.842
Dependent Variable: capDiam AIC: -482.1542
Date: 2023-05-03 21:16 BIC: -475.4087
No. Observations: 70 Log-Likelihood: 244.08
Df Model: 2 F-statistic: 184.2
Df Residuals: 67 Prob (F-statistic): 5.89e-28
R-squared: 0.846 Scale: 5.7272e-05
---------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
---------------------------------------------------------------------
Intercept 4.7565 0.5501 8.6467 0.0000 3.6585 5.8544
diam2 0.5040 0.1607 3.1359 0.0025 0.1832 0.8248
diam3 0.5203 0.1744 2.9830 0.0040 0.1722 0.8684
-------------------------------------------------------------------
Omnibus: 1.078 Durbin-Watson: 2.350
Prob(Omnibus): 0.583 Jarque-Bera (JB): 0.976
Skew: -0.071 Prob(JB): 0.614
Kurtosis: 2.439 Condition No.: 8689
===================================================================
* The condition number is large (9e+03). This might indicate
strong multicollinearity or other numerical problems.
"""
Solution 4.16 The regression of yield (𝑌𝑖𝑒𝑙𝑑) on the four variables is:
gasol = mistat.load_data('GASOL')
# rename column 'yield' to 'Yield' as 'yield' is a special keyword in Python
gasol = gasol.rename(columns={'yield': 'Yield'})
model = smf.ols(formula='Yield ~ x1 + x2 + astm + endPt',
data=gasol).fit()
print(model.summary2())
(ii) $R^2 = 0.962$.
(iii) The regression coefficient of $x_2$ is not significant.
(iv) Running the multiple regression again without $x_2$, we obtain the reduced regression equation.
{ fig:qqPlotGasolRegResid} Fig. 4.6 $Q$–$Q$ plot of the gasol regression residuals
(ii)
$$Q = I - H, \qquad Q^2 = (I - H)(I - H) = I - H - H + H^2 = I - H = Q.$$
$$s_e^2 = \mathbf{y}' Q Q \mathbf{y} / (n - k - 1) = \mathbf{y}' Q \mathbf{y} / (n - k - 1).$$
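A quick numerical check of the idempotence of $Q$; a sketch using the ALMPIN design matrix of the regression above:
X = np.column_stack([np.ones(len(almpin)), almpin['diam2'], almpin['diam3']])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
Q = np.eye(len(almpin)) - H
print(np.allclose(Q @ Q, Q))           # True: Q is idempotent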
Solution 4.19 $1 - R^2_{y(x)} = \dfrac{SSE}{SSD_y}$, where $SSE = \hat{\mathbf{e}}'\hat{\mathbf{e}} = \|\hat{\mathbf{e}}\|^2$.
$$\mathrm{cov}\Big(\sum_{i=1}^{n} \beta_i X_i,\; \sum_{j=1}^{n} \gamma_j X_j\Big) = \sum_{i=1}^{n} \sum_{j=1}^{n} \beta_i \gamma_j \,\mathrm{cov}(X_i, X_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} \beta_i \gamma_j \Sigma_{ij} = \boldsymbol{\beta}' \boldsymbol{\Sigma} \boldsymbol{\gamma}.$$
Solution 4.22 From the model, $\Sigma(\mathbf{Y}) = \sigma^2 I$ and $\mathbf{b} = B\mathbf{Y}$ with $B = (X'X)^{-1}X'$. Hence
$$\Sigma(\mathbf{b}) = B\,\Sigma(\mathbf{Y})\,B' = \sigma^2 B B' = \sigma^2 (X'X)^{-1} X'X (X'X)^{-1} = \sigma^2 (X'X)^{-1}.$$
Solution 4.23 Rearrange the dataset into the format suitable for the test outlined in the multiple linear regression section.
socell = mistat.load_data('SOCELL')
# combine the two datasets and add the additional columns z and w
socell_1 = socell[['t3', 't1']].copy()
socell_1.columns = ['t3', 't']
socell_1['z'] = 0
socell_2 = socell[['t3', 't2']].copy()
socell_2.columns = ['t3', 't']
socell_2['z'] = 1
combined = pd.concat([socell_1, socell_2])
combined['w'] = combined['z'] * combined['t']
Neither the 𝑧 nor the 𝑤 coefficient is significant: the 𝑃-values corresponding to 𝑧 and 𝑤 are 0.128 and 0.426, respectively. Accordingly, we can conclude that the slopes and intercepts of the two simple linear regressions given above are not significantly different, and a single regression line can be fitted to the combined dataset.
df = mistat.load_data('CEMENT.csv')
(a) Regression of 𝑌 on 𝑋1 is
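The model fit producing the output below is not listed; a minimal sketch, assuming the usual statsmodels imports used elsewhere in this chapter:
model1 = smf.ols('y ~ x1', data=df).fit()
print(model1.summary())   # coefficient table shown below
r2 = model1.rsquared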
print(f'R-sq: {r2:.3f}')
anova = sms.anova.anova_lm(model1)
print('Analysis of Variance\n', anova)
F = anova.F['x1']
SSE_1 = anova.sum_sq['Residual']
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 81.4793 4.927 16.536 0.000 70.634 92.324
x1 1.8687 0.526 3.550 0.005 0.710 3.027
==============================================================================
R-sq: 0.534
Analysis of Variance
df sum_sq mean_sq F PR(>F)
x1 1.0 1450.076328 1450.076328 12.602518 0.004552
Residual 11.0 1265.686749 115.062432 NaN NaN
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 52.5773 2.286 22.998 0.000 47.483 57.671
x1 1.4683 0.121 12.105 0.000 1.198 1.739
x2 0.6623 0.046 14.442 0.000 0.560 0.764
==============================================================================
R-sq: 0.979
Analysis of Variance
df sum_sq mean_sq F PR(>F)
x1 1.0 1450.076328 1450.076328 250.425571 2.088092e-08
x2 1.0 1207.782266 1207.782266 208.581823 5.028960e-08
Residual 10.0 57.904483 5.790448 NaN NaN
Comparing models
df_resid ssr df_diff ss_diff F Pr(>F)
0 11.0 1265.686749 0.0 NaN NaN NaN
1 10.0 57.904483 1.0 1207.782266 208.581823 5.028960e-08
• 𝑠2𝑒2 = 5.79,
• 𝐹 = 12.60
• Partial-𝐹 = 208.582
Notice that SEQ SS for 𝑋2 = 2716.9(0.974 − 0.529) = 1207.782.
(c) The regression of 𝑌 on 𝑋1 , 𝑋2 , and 𝑋3 is
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 48.1936 3.913 12.315 0.000 39.341 57.046
x1 1.6959 0.205 8.290 0.000 1.233 2.159
x2 0.6569 0.044 14.851 0.000 0.557 0.757
x3 0.2500 0.185 1.354 0.209 -0.168 0.668
==============================================================================
R-sq: 0.982
Analysis of Variance
df sum_sq mean_sq F PR(>F)
x1 1.0 1450.076328 1450.076328 271.264194 4.995767e-08
x2 1.0 1207.782266 1207.782266 225.938509 1.107893e-07
x3 1.0 9.793869 9.793869 1.832128 2.088895e-01
Residual 9.0 48.110614 5.345624 NaN NaN
Comparing models
df_resid ssr df_diff ss_diff F Pr(>F)
0 10.0 57.904483 0.0 NaN NaN NaN
1 9.0 48.110614 1.0 9.793869 1.832128 0.208889
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 62.4054 70.071 0.891 0.399 -99.179 223.989
x1 1.5511 0.745 2.083 0.071 -0.166 3.269
x2 0.5102 0.724 0.705 0.501 -1.159 2.179
x3 0.1019 0.755 0.135 0.896 -1.638 1.842
x4 -0.1441 0.709 -0.203 0.844 -1.779 1.491
==============================================================================
R-sq: 0.982
Analysis of Variance
df sum_sq mean_sq F PR(>F)
x1 1.0 1450.076328 1450.076328 242.367918 2.887559e-07
x2 1.0 1207.782266 1207.782266 201.870528 5.863323e-07
x3 1.0 9.793869 9.793869 1.636962 2.366003e-01
x4 1.0 0.246975 0.246975 0.041280 8.440715e-01
Residual 8.0 47.863639 5.982955 NaN NaN
Comparing models
df_resid ssr df_diff ss_diff F Pr(>F)
0 9.0 48.110614 0.0 NaN NaN NaN
1 8.0 47.863639 1.0 0.246975 0.04128 0.844071
outcome = 'y'
all_vars = ['x1', 'x2', 'x3', 'x4']
Final model
y ~ 1 + x2 + x4 + x1
Intercept 71.648307
x2 0.416110
x4 -0.236540
x1 1.451938
dtype: float64
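The stepwise search itself is not listed; as a sketch, the selected model can be refitted directly with statsmodels (assuming the CEMENT data frame df from above):
model = smf.ols('y ~ 1 + x2 + x4 + x1', data=df).fit()
print(model.params)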
car = mistat.load_data('CAR')
car_3 = car[car['origin'] == 3]
print('Full dataset shape', car.shape)
print('Origin 3 dataset shape', car_3.shape)
model = smf.ols(formula='mpg ~ hp + 1', data=car_3).fit()
print(model.summary2())
influence = model.get_influence()
df = pd.DataFrame({
'hp': car_3['hp'],
'mpg': car_3['mpg'],
'resi': model.resid,
'sres': influence.resid_studentized_internal,
'hi': influence.hat_matrix_diag,
'D': influence.cooks_distance[0],
})
print(df.round(4))
Notice that points 51 and 94 have residuals with large magnitudes (−10.4 and 7.4). The same points also have the largest Cook’s distances (0.58 and 0.21). Points 100 and 102 have high HI values (leverage; 0.18 and 0.22).
np.random.seed(1)
settings = {'s': 0.005, 'v0': 0.002, 'k': 1000, 'p0': 90_000,
't': 290, 't0': 340}
results = []
n_simulation = 5
for m in [30, 40, 50, 60]:
simulator = mistat.PistonSimulator(m=m, n_simulation=n_simulation,
**settings)
sim_result = simulator.simulate()
results.extend([m, s] for s in sim_result['seconds'])
results = pd.DataFrame(results, columns=['m', 'seconds'])
group_std = results.groupby('m').std()
pooled_std = np.sqrt(np.sum(group_std**2) / len(group_std))[0]
print('Pooled standard deviation', pooled_std)
group_mean = results.groupby('m').mean()
ax = results.plot.scatter(x='m', y='seconds', color='black')
ax.errorbar(group_mean.index, results.groupby('m').mean().values.flatten(),
yerr=[pooled_std] * 4, color='grey')
plt.show()
We see that the differences between the sample means are not significant in spite
of the apparent upward trend in cycle times.
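This conclusion can be backed up with a formal one-way ANOVA; a minimal sketch using pingouin, which is also used later in this chapter:
import pingouin as pg
aov = pg.anova(dv='seconds', between='m', data=results)
print(aov.round(3))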
Solution 4.28 Prepare dataset and visualize distributions (see Fig. 4.8). { fig:boxplotIntegratedCircuits}
df = pd.DataFrame([
[2.58, 2.62, 2.22],
[2.48, 2.77, 1.73],
Fig. 4.7 ANOVA of the effect of changing weight in the piston simulation { fig:anovaWeightPiston}
{ fig:boxplotIntegratedCircuits} Fig. 4.8 Box plot of pre-etch line width from the integrated circuits fabrication process
experiment = df['Experiment']
mu = df['mu']
def onewayTest(x, verbose=False):
df = pd.DataFrame({
'value': x,
'variable': experiment,
})
aov = pg.anova(dv='value', between='variable', data=df)
return aov['F'].values[0]
Bt0 = onewayTest(mu)
print('Bt0', Bt0)
print('ratio', sum(B[1] >= Bt0)/len(B[1]))
Bt0 120.91709844559576
ratio 0.0
The bootstrap test also shows that the difference in means is significant.
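The loop generating the reference distribution B is not listed above; a sketch, assuming B[1] holds the F statistics computed on randomly permuted responses:
random.seed(1)
B_f = []
mu_values = list(mu)
for _ in range(1000):
    mu_star = random.sample(mu_values, len(mu_values))  # random permutation
    B_f.append(onewayTest(mu_star))
print('ratio', sum(b >= Bt0 for b in B_f) / len(B_f))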
Solution 4.29 Create dataset and visualize distribution (see Fig. 4.9). { fig:filmSpeedData}
df = pd.DataFrame({
'Batch A': [103, 107, 104, 102, 95, 91, 107, 99, 105, 105],
'Batch B': [104, 103, 106, 103, 107, 108, 104, 105, 105, 97],
})
fig, ax = plt.subplots()
bplot1 = ax.boxplot(df, labels=df.columns)
plt.show()
Fig. 4.9 Box plots of film speed for Batch A and Batch B { fig:filmSpeedData}
The randomization test gave a P value of 0.106. The difference between the means
is not significant.
(ii)
The ANOVA also shows no significant difference in the means. The 𝑃 value
is 0.228. Remember that the 𝐹-test in the ANOVA is based on the assumption of
normality and equal variances. The randomization test is nonparametric.
Solution 4.30 Define function that calculates the statistic and execute bootstrap.
def func_stats(x):
m = pd.Series(x).groupby(df['Experiment']).agg(['mean', 'count'])
top = np.sum(m['count'] * m['mean'] ** 2) - len(x)*np.mean(x)**2
return top / np.std(x) ** 2
Bt = []
mu = list(df['mu'])
for _ in range(1000):
mu_star = random.sample(mu, len(mu))
Bt.append(func_stats(mu_star))
Bt0 = func_stats(mu)
print('Bt0', Bt0)
print('ratio', sum(Bt >= Bt0)/len(Bt))
Bt0 26.986990459670288
ratio 0.0
Fig. 4.10 Box plot visualisation of the xDev distribution by crcBrd for the PLACE dataset { fig:boxplotXdevCrcBrdPlace}
The result demonstrates that the differences between the groups are significant.
Solution 4.31 Load the data and visualize the distributions (see Fig. 4.10). { fig:boxplotXdevCrcBrdPlace}
place = mistat.load_data('PLACE')
place.boxplot('xDev', by='crcBrd')
plt.show()
G1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
G2 = [10, 11, 12]
G3 = [13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26]
G4 = [20]
place['group'] = 'G1'
place.loc[place['crcBrd'].isin(G2), 'group'] = 'G2'
place.loc[place['crcBrd'].isin(G3), 'group'] = 'G3'
place.loc[place['crcBrd'].isin(G4), 'group'] = 'G4'
sem = statistics['sem'].values
sem = sem**2
sem = np.sqrt(sem[:-1] + sem[1:])
print(sem * 6.193)
print(757/644, 467/387, 486/459)
The assignment using a positional slice with .loc raises a TypeError in recent pandas versions ("Slicing a positional slice with .loc is not allowed, use .loc with labels or .iloc with positions instead").
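A sketch of the intended calculation, rewritten with .iloc so that it runs on recent pandas versions (assuming statistics is the per-group summary with a 'mean' column):
statistics['Diff'] = np.nan
diff = statistics['mean'].values[:-1] - statistics['mean'].values[1:]
statistics.iloc[1:, statistics.columns.get_loc('Diff')] = diff
statistics['CR'] = 6.193 * statistics['Diff']
print(statistics.round(3))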
The differences between the means of the groups are all significant.
Solution 4.32
df = pd.DataFrame({
'US': [33, 25],
'Europe': [7, 7],
'Asia': [26, 11],
})
print(df)
col_sums = df.sum(axis=0)
row_sums = df.sum(axis=1)
total = df.to_numpy().sum()
US Europe Asia
0 33 7 26
1 25 7 11
chi2: 2.440
p-value: 0.295
The chi-square test statistic is 𝑋 2 = 2.440 with d.f. = 2 and 𝑃 value = 0.295. The
null hypothesis that the number of cylinders a car has is independent of the origin of
the car is not rejected.
We can also use the scipy function chi2_contingency.
chi2 = stats.chi2_contingency(df)
print(f'chi2-statistic: {chi2[0]:.3f}')
print(f'p-value: {chi2[1]:.3f}')
print(f'd.f.: {chi2[2]}')
chi2-statistic: 2.440
p-value: 0.295
d.f.: 2
car = mistat.load_data('CAR')
binned_car = pd.DataFrame({
'turn': pd.cut(car['turn'], bins=[27, 30.6, 34.2, 37.8, 45]), #np.arange(27, 50, 3.6)),
'mpg': pd.cut(car['mpg'], bins=[12, 18, 24, 100]),
})
freqDist = pd.crosstab(binned_car['mpg'], binned_car['turn'])
print(freqDist)
chi2 = stats.chi2_contingency(freqDist)
print(f'chi2-statistic: {chi2[0]:.3f}')
print(f'p-value: {chi2[1]:.3f}')
print(f'd.f.: {chi2[2]}')
The dependence between turn diameter and miles per gallon is significant.
question_13 = pd.DataFrame({
'1': [0,0,0,1,0],
'2': [1,0,2,0,0],
'3': [1,2,6,5,1],
'4': [2,1,10,23,13],
'5': [0,1,1,15,100],
}, index = ['1', '2', '3', '4', '5']).transpose()
question_23 = pd.DataFrame({
'1': [1,0,0,3,1],
'2': [2,0,1,0,0],
'3': [0,4,2,3,0],
'4': [1,1,10,7,5],
'5': [0,0,1,30,134],
}, index = ['1', '2', '3', '4', '5']).transpose()
chi2_13 = stats.chi2_contingency(question_13)
chi2_23 = stats.chi2_contingency(question_23)
print('Question 1 vs 3')
print(f' Mean squared contingency : {msc_13:.3f}')
print(f' Tschuprov : {tschuprov_13:.3f}')
print(f" Cramer's index : {cramer_13:.3f}")
print('Question 2 vs 3')
print(f' Mean squared contingency : {msc_23:.3f}')
print(f' Tschuprov : {tschuprov_23:.3f}')
print(f" Cramer's index : {cramer_23:.3f}")
Question 1 vs 3
Mean squared contingency : 0.629
Tschuprov : 0.397
Cramer's index : 0.561
Question 2 vs 3
Mean squared contingency : 1.137
Tschuprov : 0.533
Cramer's index : 0.754
Chapter 5
Sampling for Estimation of Finite Population
Quantities
import random
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats
import matplotlib.pyplot as plt
import mistat
$$\Pr\{\sqrt{n}\,|\bar{X}_n - \mu_N| < \delta\} \approx 2\Phi\!\left(\frac{\delta}{\sigma_N}\right) - 1, \quad \text{as } n \to \infty.$$
Fig. 5.1 Distribution of the correlation between xDev and yDev for sampling with replacement (RSWR) and without replacement (RSWOR) { fig:exDistCorrPlace}
random.seed(1)
place = mistat.load_data('PLACE')
rswr = []
rswor = []
idx = list(range(len(place)))
for _ in range(100):
rswr.append(stat_func(random.choices(idx, k=20)))
rswor.append(stat_func(random.sample(idx, k=20)))
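The function stat_func is not shown in the listing; for Fig. 5.1 it would be defined before the sampling loop, for example as the sample correlation between xDev and yDev (an assumption consistent with the figure):
def stat_func(sample_idx):
    sample = place.iloc[sample_idx]
    return sample['xDev'].corr(sample['yDev'])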
Fig. 5.2 Distribution of the median of turn diameter, horsepower, and mpg of the CAR dataset using random sampling without replacement { fig:exDistMedianCAR}
{ fig:exDistMedianCAR} Solution 5.5 The Python code creates the histograms shown in Fig. 5.2.
random.seed(1)
car = mistat.load_data('CAR')
columns = ['turn', 'hp', 'mpg']
idx = list(range(len(car)))
result = []
for _ in range(200):
result.append(stat_func(random.sample(idx, k=50)))
result = pd.DataFrame(result)
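Here, too, stat_func is not shown; a definition consistent with Fig. 5.2 (medians of turn, hp, and mpg for the sampled rows) would be placed before the loop:
def stat_func(sample_idx):
    sample = car.iloc[sample_idx]
    return {c: sample[c].median() for c in columns}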
The total sample size is 𝑛 = 200 and the weights are 𝑊1 = 0.385, 𝑊2 = 0.115, 𝑊3 = 0.5. Thus,
𝑛1 = 77, 𝑛2 = 23, and 𝑛3 = 100.
sample_means = []
for _ in range(500):
m_1 = np.mean(random.sample(strata_1, k=n_1))
m_2 = np.mean(random.sample(strata_2, k=n_2))
m_3 = np.mean(random.sample(strata_3, k=n_3))
sample_means.append(w_1*m_1 + w_2*m_2 + w_3*m_3)
std_dev_sample_means = np.std(sample_means)
print(std_dev_sample_means)
print(stats.sem(place['xDev'], ddof=0))
3.442839155174113e-05
8.377967188860638e-05
$$\frac{W_i^2 \tilde{\sigma}_{N_i}^2}{n_i^2} = \lambda, \quad i = 1, \dots, k, \qquad \sum_{i=1}^{k} n_i = n.$$
Equivalently, $n_i = \frac{1}{\sqrt{\lambda}} W_i \tilde{\sigma}_{N_i}$ for $i = 1, \dots, k$, and $n = \frac{1}{\sqrt{\lambda}} \sum_{i=1}^{k} W_i \tilde{\sigma}_{N_i}$. Thus
$$n_i^0 = n\, \frac{W_i \tilde{\sigma}_{N_i}}{\sum_{j=1}^{k} W_j \tilde{\sigma}_{N_j}}, \quad i = 1, \dots, k.$$
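A tiny numerical illustration of the allocation formula, using the weights of the previous solution and hypothetical strata standard deviations:
W = np.array([0.385, 0.115, 0.5])
sigma = np.array([1.0, 2.0, 0.5])   # hypothetical strata standard deviations
n = 200
n_opt = n * W * sigma / np.sum(W * sigma)
print(n_opt.round(1))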
where $I_i = 1$ if the $i$-th population element is sampled, and $I_i = 0$ otherwise.
Notice that $I_1, \dots, I_N$ are independent of $e_1, \dots, e_N$. Hence, $E\{I_i e_i\} = 0$ for all $i = 1, \dots, N$. This proves that $\bar{Y}_n$ is prediction unbiased, irrespective of the sample strategy. The prediction MSE of $\bar{Y}_n$ is
$$PMSE\{\bar{Y}_n \mid \mathbf{s}\} = \frac{\sigma^2}{n}\Big(1 - \frac{n}{N}\Big)^2 + \frac{1}{N}\Big(1 - \frac{n}{N}\Big)\sigma^2 = \frac{\sigma^2}{n}\Big(1 - \frac{n}{N}\Big).$$
Notice that $PMSE\{\bar{Y}_n \mid \mathbf{s}\}$ is independent of $\mathbf{s}$, and is equal for all samples.
Let $I_i = 1$ if the $i$-th population element is in the sample $\mathbf{s}$, and $I_i = 0$ otherwise.
Recall that $y_i = \beta x_i + e_i$, $i = 1, \dots, N$, and that for any sampling strategy, $e_1, \dots, e_N$ are independent of $I_1, \dots, I_N$. Hence, since $\sum_{i=1}^{N} I_i = n$,
$$E\{\hat{Y}_{RA}\} = \bar{x}_N \cdot E\Big\{\frac{1}{n}\sum_{i=1}^{N} I_i \frac{\beta x_i + e_i}{x_i}\Big\} = \bar{x}_N \Big(\beta + \frac{1}{n}\sum_{i=1}^{N} E\Big\{I_i \frac{e_i}{x_i}\Big\}\Big) = \bar{x}_N \beta,$$
because $E\big\{I_i \frac{e_i}{x_i}\big\} = E\big\{\frac{I_i}{x_i}\big\} E\{e_i\} = 0$, $i = 1, \dots, N$. Thus, $E\{\hat{Y}_{RA} - \mu_N\} = 0$ and $\hat{Y}_{RA}$ is an unbiased predictor.
$$PMSE\{\hat{Y}_{RA}\} = E\Big\{\Big(\frac{\bar{x}_N}{n}\sum_{i=1}^{N} I_i \frac{e_i}{x_i} - \frac{1}{N}\sum_{i=1}^{N} e_i\Big)^2\Big\} = V\Big\{\sum_{i \in \mathbf{s}_n}\Big(\frac{\bar{x}_N}{n x_i} - \frac{1}{N}\Big) e_i - \frac{1}{N}\sum_{i' \in \mathbf{r}_n} e_{i'}\Big\}$$
where $\mathbf{s}_n$ is the set of elements in the sample and $\mathbf{r}_n$ is the set of elements in $\mathcal{P}$ but not in $\mathbf{s}_n$, $\mathbf{r}_n = \mathcal{P} - \mathbf{s}_n$. Since $e_1, \dots, e_N$ are uncorrelated,
$$PMSE\{\hat{Y}_{RA} \mid \mathbf{s}_n\} = \sigma^2 \sum_{i \in \mathbf{s}_n}\Big(\frac{\bar{x}_N}{n x_i} - \frac{1}{N}\Big)^2 + \sigma^2 \frac{N - n}{N^2} = \frac{\sigma^2}{N^2}\Big[(N - n) + \sum_{i \in \mathbf{s}_n}\Big(\frac{N \bar{x}_N}{n x_i} - 1\Big)^2\Big].$$
A sample $\mathbf{s}_n$ which minimizes $\sum_{i \in \mathbf{s}_n}\big(\frac{N \bar{x}_N}{n x_i} - 1\big)^2$ is optimal.
import math
import mistat
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa import tsatools
import statsmodels.formula.api as smf
Solution 6.2 (i) Fig. 6.1 shows the change of demand over time. { fig:seascom-timeline}
(ii) We first fit the seasonal trend to the data in the SeasCom data set. We use the linear model $Y = X\beta + \varepsilon$, where the $Y$ vector contains the 102 observations of the data set and $X$ is a $102 \times 4$ matrix of four column vectors. We combine $X$ and $Y$ in a data frame. In addition to the SeasCom data, we add a column of 1's (const), a column with the numbers 1 to 102 (trend), and two columns describing the seasonal pattern: the column season_1 consists of cos(𝜋 × trend/6), and the column season_2 of sin(𝜋 × trend/6).
seascom = mistat.load_data('SEASCOM.csv')
df = tsatools.add_trend(seascom, trend='ct')
df['season_1'] = [np.cos(math.pi * tx/6) for tx in df['trend']]
df['season_2'] = [np.sin(math.pi * tx/6) for tx in df['trend']]
print(df.head())
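The model fit itself is not listed; a minimal sketch with statsmodels:
model = smf.ols('SeasCom ~ trend + season_1 + season_2', data=df).fit()
print(model.params)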
Fig. 6.1 Change of demand over time in the SeasCom dataset { fig:seascom-timeline}
seascom = mistat.load_data('SEASCOM.csv')
fig, ax = plt.subplots()
ax.scatter(seascom.index, seascom, facecolors='none', edgecolors='grey')
model.predict(df).plot(ax=ax, color='black')
ax.set_xlabel('Time')
ax.set_ylabel('Data')
plt.show()
{ fig:seascom-model-deviation} (iii) Calculate the residuals and plot them (see Fig. 6.2).
U = df['SeasCom'] - model.predict(df)
fig, ax = plt.subplots()
ax.scatter(U.index, U, facecolors='none', edgecolors='black')
Fig. 6.2 Deviations of the SeasCom data from the fitted trend and seasonality model { fig:seascom-model-deviation}
ax.set_xlabel('Time')
ax.set_ylabel('Deviation')
plt.show()
Corr(Ut,Ut-1) = -0.191
Corr(Ut,Ut-2) = 0.132
Indeed the correlations between adjacent data points are 𝑐𝑜𝑟𝑟 (𝑋𝑡 , 𝑋𝑡−1 ) =
−0.191, and 𝑐𝑜𝑟𝑟 (𝑋𝑡 , 𝑋𝑡−2 ) = 0.132.
(iv) The plot of the deviations, together with the low autocorrelations, suggests that the residuals behave approximately like random noise.
Solution 6.3 According to Equation 6.2.2, the auto-correlation in a stationary MA(q) {eqn:ma-covariance}
is
$$\rho(h) = \frac{K(h)}{K(0)} = \frac{\sum_{j=0}^{q-h} \beta_j \beta_{j+h}}{\sum_{j=0}^{q} \beta_j^2}, \qquad 0 \le h \le q.$$
Notice that $\rho(h) = \rho(-h)$, and $\rho(h) = 0$ for all $h > q$.
Solution 6.4 In Python
0 1 2 3 4 5
K(h) 3.308 1.672 0.542 0.541 1.028 0.550
rho(h) 1.000 0.506 0.164 0.163 0.311 0.166
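The code producing this table is not shown; a generic helper implementing the MA(q) covariance formula of Solution 6.3 could look like this (the actual 𝛽 coefficients come from the exercise and are not reproduced here):
def ma_covariances(beta, sigma2=1.0, max_lag=5):
    # K(h) = sigma2 * sum_{j=0}^{q-h} beta_j beta_{j+h}, and rho(h) = K(h)/K(0)
    beta = np.asarray(beta)
    q = len(beta) - 1
    K = [sigma2 * sum(beta[j] * beta[j + h] for j in range(q - h + 1)) if h <= q else 0.0
         for h in range(max_lag + 1)]
    rho = [k / K[0] for k in K]
    return K, rho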
Solution 6.5 We consider the MA(∞) given by coefficients $\beta_j = q^j$, with $0 < q < 1$. In this case,
(i) $E\{X_t\} = 0$, and
(ii) $V\{X_t\} = \sigma^2 \sum_{j=0}^{\infty} q^{2j} = \sigma^2 / (1 - q^2)$.
(i) The process is stationary.
(ii) 𝐸 {𝑋𝑡 } = 0
(iii) According to the Yule-Walker equations,
$$\begin{pmatrix} 1 & -0.75 \\ -0.75 & 1 \end{pmatrix} \begin{pmatrix} K(0) \\ K(1) \end{pmatrix} = \begin{pmatrix} \sigma^2 \\ 0 \end{pmatrix}$$
$$X_t = (A_2(z))^{-1} \varepsilon_t = \varepsilon_t + 0.5\varepsilon_{t-1} - 0.08\varepsilon_{t-2} - 0.205\varepsilon_{t-3} - 0.0761\varepsilon_{t-4} + 0.0296\varepsilon_{t-5} + \dots$$
Accordingly,
Solution 6.11 Let 𝑋 denote the DOW1941 data set. We create a new set, 𝑌 of second
order difference, i.e., 𝑌𝑡 = 𝑋𝑡 − 2𝑋𝑡−1 + 𝑋𝑡−2 .
dow1941 = mistat.load_data('DOW1941.csv')
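The second-order differences can be formed directly with pandas, e.g.:
Y = dow1941.diff().diff().dropna()   # Y_t = X_t - 2 X_{t-1} + X_{t-2}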
(i) In the following table we present the autocorrelations, acf, and the partial
autocorrelations, pacf, of 𝑌 . For a visualisation see Fig. 6.3. { fig:acf-pacf-dow-second-order}
Fig. 6.3 Autocorrelations, acf, and the partial autocorrelations, pacf, of the second order differences
of the DOW1941 dataset. { fig:acf-pacf-dow-second-order}
{ fig:seascom-one-day-ahead-model} Solution 6.12 In Fig. 6.4 we present the seasonal SeasCom data together with the one-day ahead predictions. The predictions track the data very closely.
predictedError = mistat.optimalLinearPredictor(seascom_model.resid,
10, nlags=9)
predictedTrend = seascom_model.predict(seascom_df)
correctedTrend = predictedTrend + predictedError
fig, ax = plt.subplots()
ax.scatter(seascom_df.index, seascom_df['SeasCom'],
facecolors='none', edgecolors='grey')
predictedTrend.plot(ax=ax, color='grey')
correctedTrend.plot(ax=ax, color='black')
ax.set_xlabel('Time')
ax.set_ylabel('SeasCom data')
plt.show()
{ fig:seascom-one-day-ahead-model} Fig. 6.4 One-day ahead prediction model of the SeasCom data set
Chapter 7
Modern analytic methods: Part I
import warnings
import mistat
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
Solution 7.2 The decision tree model for testResult results in the confusion ma-
trix shown in Fig. 7.1. { fig:ex-confusion-matrix-testResult}
sensors = mistat.load_data('SENSORS.csv')
predictors = [c for c in sensors.columns if c.startswith('sensor')]
outcome = 'testResult'
X = sensors[predictors]
{ fig:ex-confusion-matrix-testResult} Fig. 7.1 Decision tree model to predict testResult from sensor data (Exc. 7.2)
y = sensors[outcome]
plt.show()
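The tree fit and the confusion-matrix display are not listed above; a minimal sketch (the ccp_alpha value mirrors later exercises and is an assumption here):
model = DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
model.fit(X, y)
ConfusionMatrixDisplay.from_predictions(y, model.predict(X))
plt.show()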
The model’s classification performance is very good. The test result ‘Good’, which corresponds to status ‘Pass’, is correctly predicted. Most of the other individual test results also have low misclassification rates. The likely reason is that each test result is associated with a specific subset of the sensors.
Confusion matrix
[[92 0]
[ 0 82]]
/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py:1395:
UserWarning: `use_label_encoder` is deprecated in 1.7.0.
warnings.warn("`use_label_encoder` is deprecated in 1.7.0.")
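The xgboost fit itself is not listed; a sketch of one way it could have been run (the 2×2 confusion matrix above suggests status as the target here; the label encoding and parameters are assumptions):
le = LabelEncoder()
y_status = le.fit_transform(sensors['status'])
xgb = XGBClassifier(random_state=0)
xgb.fit(X, y_status)
print('Confusion matrix')
print(confusion_matrix(y_status, xgb.predict(X)))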
var_imp = pd.DataFrame({
'importance': xgb.feature_importances_,
}, index=predictors)
var_imp = var_imp.sort_values(by='importance', ascending=False)
var_imp['order'] = range(1, len(var_imp) + 1)
print(var_imp.head(10))
var_imp.loc[var_imp.index.isin(['sensor18', 'sensor07', 'sensor21'])]
importance order
sensor18 0.290473 1
sensor54 0.288680 2
sensor53 0.105831 3
sensor55 0.062423 4
sensor61 0.058112 5
sensor48 0.040433 6
sensor07 0.026944 7
sensor12 0.015288 8
sensor03 0.013340 9
sensor52 0.013160 10
importance order
sensor18 0.290473 1
sensor07 0.026944 7
sensor21 0.000000 50
The decision tree model uses sensors 18, 7, and 21. The xgboost model identifies
sensor 18 as the most important variable. Sensor 7 is ranked 7th, sensor 21 has no
importance.
y = sensors['status']
Confusion matrix
[[92 0]
[ 0 82]]
var_imp = pd.DataFrame({
'importance': model.feature_importances_,
}, index=predictors)
var_imp = var_imp.sort_values(by='importance', ascending=False)
importance order
sensor61 0.138663 1
sensor18 0.100477 2
sensor53 0.079890 3
sensor52 0.076854 4
sensor46 0.052957 5
sensor50 0.051970 6
sensor44 0.042771 7
sensor48 0.037087 8
sensor24 0.036825 9
sensor21 0.035014 10
importance order
sensor18 0.100477 2
sensor21 0.035014 10
sensor07 0.023162 17
The decision tree model uses sensors 18, 7, and 21. Sensor 18 has the second
largest importance value, sensor 21 ranks 10th, and sensor 7 is on rank 17.
/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py:1395:
UserWarning: `use_label_encoder` is deprecated in 1.7.0.
warnings.warn("`use_label_encoder` is deprecated in 1.7.0.")
Decision tree model
Accuracy 0.900
[[37 2]
[ 5 26]]
Gradient boosting model
Accuracy 0.957
[[36 3]
[ 0 31]]
Random forest model
Accuracy 0.986
[[38 1]
[ 0 31]]
The accuracy for predicting the validation set is very good for all three models,
with random forest giving the best model with an accuracy of 0.986. The xgboost
model has a slightly lower accuracy of 0.957 and the accuracy for the decision tree
model is 0.900.
If you change the random seeds or remove them for the various commands, you
will see that the accuracies vary and that the order can change.
dt_model = DecisionTreeClassifier(ccp_alpha=0.012)
random_valid_acc = []
random_train_acc = []
org_valid_acc = []
org_train_acc = []
for _ in range(100):
train_X, valid_X, train_y, valid_y = train_test_split(X, y,
test_size=0.4)
dt_model.fit(train_X, train_y)
org_train_acc.append(accuracy_score(train_y, dt_model.predict(train_X)))
org_valid_acc.append(accuracy_score(valid_y, dt_model.predict(valid_X)))
ax = pd.Series(random_valid_acc).plot.density(color='C0')
pd.Series(random_train_acc).plot.density(color='C0', linestyle='--',
ax=ax)
pd.Series(org_valid_acc).plot.density(color='C1', ax=ax)
pd.Series(org_train_acc).plot.hist(color='C1', linestyle='--',
ax=ax)
ax.set_ylim(0, 40)
ax.set_xlim(0, 1.01)
plt.show()
Solution 7.7 Load data
distTower = mistat.load_data('DISTILLATION-TOWER.csv')
predictors = ['Temp1', 'Temp2', 'Temp3', 'Temp4', 'Temp5', 'Temp6',
'Temp7', 'Temp8', 'Temp9', 'Temp10', 'Temp11', 'Temp12']
outcome = 'VapourPressure'
Xr = distTower[predictors]
yr = distTower[outcome]
(ii) Determine model performance for different tree complexities, along with the dependence of the tree depth on the ccp_alpha parameter; see Fig. 7.2. { fig:exc-cpp-pruning}
The smallest validation set error is obtained for ccp_alpha = 0.372. The depen-
{ fig:exc-cpp-pruning} dence of training and validation error on ccp_alpha is shown in Fig. 7.2.
{ fig:exc-dtreeviz-visualization} (iii) The final model is visualized using dtreeviz in Fig. 7.3.
Fig. 7.2 Decision tree regressor complexity as a function of ccp_alpha. The validation set error is shown in black, the training set error in grey (Exercise 7.7) { fig:exc-cpp-pruning}
Solution 7.8 Vary the number of bins and the binning strategy. The influence of the two model parameters on model performance is shown in Fig. 7.4. { fig:nb-binning-performance}
y = sensors['status']
results = []
Fig. 7.3 Decision tree visualization of the regression tree (Exercise 7.7) { fig:exc-dtreeviz-visualization}
The quantile binning strategy (for each feature, each bin has the same number
of data points) with splitting each column into two bins leads to the best performing
model. With this strategy, performance declines with increasing number of bins. The
uniform (for each feature, each bin has the same width) and the kmeans (for each
feature, a 𝑘-means clustering is used to bin the feature) strategies on the other hand,
show increasing performance with increasing number of bins.
The confusion matrix for the best performing models is:
kbinsDiscretizer = KBinsDiscretizer(encode='ordinal',
strategy='quantile', n_bins=2)
X_binned = kbinsDiscretizer.fit_transform(X)
nb_model = MultinomialNB()
nb_model.fit(X_binned, y)
print('Confusion matrix')
print(confusion_matrix(y, nb_model.predict(X_binned)))
Confusion matrix
[[87 5]
[ 1 81]]
Fig. 7.4 Influence of the number of bins and binning strategy on model performance for the sensors data with status as outcome { fig:nb-binning-performance}
The decision tree model misclassified three of the ‘Pass’ data points as ‘Fail’. The Naïve Bayes model, on the other hand, misclassifies six data points. However, five of these are ‘Pass’ and predicted as ‘Fail’. Depending on your use case, you may prefer a model with more false negatives or false positives.
food = mistat.load_data('FOOD.csv')
scaler = StandardScaler()
model = AgglomerativeClustering(n_clusters=10, compute_distances=True)
X = scaler.fit_transform(food)
model = model.fit(X)
fig, ax = plt.subplots()
plot_dendrogram(model, ax=ax)
ax.set_title('Dendrogram')
ax.get_xaxis().set_ticks([])
plt.show()
food = mistat.load_data('FOOD.csv')
scaler = StandardScaler()
X = scaler.fit_transform(food)
Fig. 7.5 Hierarchical clustering of the food data set using Ward clustering { fig:food-ward-10-clusters}
{ fig:food-compare-linkage} The comparison of the different linkage methods is shown in Fig. 7.6. We can
see that Ward’s clustering gives the most balanced clusters; three bigger clusters and
seven small clusters. Complete, average, and single linkage lead to one big cluster.
sensors = mistat.load_data('SENSORS.csv')
predictors = [c for c in sensors.columns if c.startswith('sensor')]
outcome = 'status'
X = sensors[predictors]
scaler = StandardScaler()
X = scaler.fit_transform(X)
model = KMeans(n_clusters=10, random_state=1).fit(X)
{ fig:food-compare-linkage} Fig. 7.6 Comparison of different linkage methods (ward, complete, average, single) for hierarchical clustering of the food data set
df = pd.DataFrame({
'status': sensors['status'],
'testResult': sensors['testResult'],
'cluster': model.predict(X),
})
Status Fail
cluster
8 19
0 15
4 13
1 13
9 13
3 10
5 5
7 2
6 1
2 1
Name: count, dtype: int64
Status Pass
cluster
1 48
8 34
Name: count, dtype: int64
There are several clusters that contain only ‘Fail’ data points. They correspond to
specific sensor value combinations that are very distinct to the sensor values during
normal operation. The ‘Pass’ data points are found in two clusters. Both of these
clusters contain also ‘Fail’ data points.
Analyse cluster membership by testResult.
Cluster 1
testResult
Good 48
Brake 6
IMP 4
Grippers 1
Motor 1
ITM 1
Name: count, dtype: int64
Cluster 2
testResult
SOS 1
Name: count, dtype: int64
Cluster 3
testResult
Velocity Type I 10
Name: count, dtype: int64
Cluster 4
testResult
Grippers 10
ITM 3
Name: count, dtype: int64
Cluster 5
testResult
Velocity Type II 5
Name: count, dtype: int64
Cluster 6
testResult
SOS 1
Name: count, dtype: int64
Cluster 7
testResult
Grippers 2
Name: count, dtype: int64
Cluster 8
testResult
Good 34
Motor 15
Grippers 1
IMP 1
ITM 1
Velocity Type I 1
Name: count, dtype: int64
Cluster 9
testResult
ITM 13
Name: count, dtype: int64
We can see that some of the test results are only found in one or two clusters.
Solution 7.12 The scikit-learn K-means clustering method can return either the cluster assignment of a data point or its distances to all the cluster centers. We evaluate both as features for classification.
# Data preparation
sensors = mistat.load_data('SENSORS.csv')
predictors = [c for c in sensors.columns if c.startswith('sensor')]
outcome = 'status'
X = sensors[predictors]
scaler = StandardScaler()
X = scaler.fit_transform(X)
y = sensors[outcome]
First, we classify data points based on their cluster membership. In order to use that information in a classifier, we transform the cluster assignments using one-hot encoding.
n_clusters accuracy
0 2 0.529
1 3 0.805
2 4 0.805
3 5 0.799
4 6 0.793
5 7 0.897
6 8 0.868
7 9 0.879
8 10 0.816
9 11 0.879
10 12 0.879
11 13 0.879
12 14 0.897
13 15 0.897
14 16 0.885
15 17 0.879
16 18 0.891
17 19 0.902
Fig. 7.7 Dependence of accuracy on the number of clusters using cluster membership as feature (Exc. 7.12) { fig:cluster-number-model}
The accuracies are visualized in Fig. 7.7. We see that splitting the dataset into 7 { fig:cluster-number-model}
clusters gives a classification model with an accuracy of about 0.9.
Next we use the distance to the cluster centers as variable in the classifier.
results = []
clf = DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
for n_clusters in range(2, 20):
# fit a model and convert data to distances
model = KMeans(n_clusters=n_clusters, random_state=1)
model.fit(X)
Xcluster = model.transform(X)
n_clusters accuracy
0 2 0.983
1 3 0.977
2 4 0.977
3 5 0.977
4 6 0.989
5 7 0.977
6 8 0.977
7 9 0.977
8 10 0.977
9 11 0.977
10 12 0.977
11 13 0.983
12 14 0.977
Fig. 7.8 Dependence of accuracy on the number of clusters using the distance to the cluster centers as feature (Exc. 7.12) { fig:cluster-distance-model}
13 15 0.977
14 16 0.977
15 17 0.983
16 18 0.977
17 19 0.983
The accuracies of all models are very high. The largest accuracy is achieved for 6 clusters.
Based on these results, we would design the procedure using the decision tree
classifier combined with K-means clustering into six clusters. Using scikit-learn,
we can define the full procedure as a single pipeline as follows:
pipeline = make_pipeline(
StandardScaler(),
KMeans(n_clusters=6, random_state=1),
DecisionTreeClassifier(ccp_alpha=0.012, random_state=0)
)
X = sensors[predictors]
y = sensors[outcome]
process = pipeline.fit(X, y)
print('accuracy', accuracy_score(y, process.predict(X)))
print('Confusion matrix')
print(confusion_matrix(y, process.predict(X)))
accuracy 0.9885057471264368
Confusion matrix
[[91 1]
[ 1 81]]
import mistat
import networkx as nx
from pgmpy.estimators import HillClimbSearch
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
dissolution = mistat.load_data('DISSOLUTION.csv')
Use shift registration to align the dissolution data with spline interpolation of
order 1, 2, and 3.
fd_registered = {}
for order in (1, 2, 3):
fd.interpolation = SplineInterpolation(interpolation_order=order)
fd_registered[order] = shift_registration.fit_transform(fd)
Fig. 8.1 Mean dissolution curves for reference and test tablets derived using spline interpolation of order 1, 2, and 3 { fig:fdaDissolutionSplineComparison}
For each of the three registered datasets, calculate the mean dissolution curves
for reference and test tablets and plot the results.
{ fig:fdaDissolutionSplineComparison} The dissolution curves are shown in Fig. 8.1. We can see in all three graphs that the test tablets show a slightly faster dissolution than the reference tablets. Comparing the shapes of the curves, the linear spline interpolation shows a levelling off with time. With quadratic spline interpolation, the dissolution curves go through a maximum; this behavior is unrealistic. The cubic spline interpolation also leads to unrealistic curves that first level off and then start to increase again.
import skfda
from skfda import FDataGrid
fd = FDataGrid(pinchraw.transpose(), pinchtime)
fig = fd.plot()
ax = fig.axes[0]
ax.set_xlabel('Time [s]')
ax.set_ylabel('Pinch force')
plt.show()
Fig. 8.2 shows the measured pinch forces. { fig:fdaPinchforceOriginal} We can see that the onset of the force varies from 0.025 to 0.1 seconds, which makes it difficult to compare the shapes of the curves. The individual curves are not symmetric, with a faster onset of the force and a slower decline.
(iii) We create a variety of smoothed versions of the dataset to explore the effect of varying the smoothing_parameter.
import itertools
from skfda.preprocessing.smoothing.kernel_smoothers import NadarayaWatsonSmoother
Fig. 8.3 shows smoothed measurement curves for a variety of smoothing_parameter values. { fig:fdaPinchforceSmoothing} If the value is too large, the data are oversmoothed and the asymmetric shape of the curves is lost. With decreasing values, the shape is reproduced better but the curves get noisier again. We select 0.005 as the smoothing parameter.
smoother = NadarayaWatsonSmoother(smoothing_parameter=0.005)
fd_smooth = smoother.fit_transform(fd)
Fig. 8.3 Pinch force curves smoothed with different values of smoothing_parameter { fig:fdaPinchforceSmoothing}
Fig. 8.4 Pinch force curves after smoothing and landmark shift registration { fig:fdaPinchforceRegistered}
max_idx = fd_smooth.data_matrix.argmax(axis=1)
landmarks = [pinchtime[idx] for idx in max_idx]
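The registration call itself is not shown; a sketch using skfda's landmark shift registration (the exact function name differs between skfda versions, so treat this as an assumption):
from skfda.preprocessing.registration import landmark_shift_registration
fd_landmark = landmark_shift_registration(fd_smooth, landmarks)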
{ fig:fdaPinchforceRegistered} (iv) Fig. 8.4 shows the measurements after smoothing and landmark shift regis-
tration.
fig = fd_landmark.plot()
ax = fig.axes[0]
ax.set_xlabel('Time [s]')
ax.set_ylabel('Pinch force')
plt.show()
Fig. 8.5 Histogram of the moisture values in the moisture dataset { fig:fdaMoistureDistribution}
import skfda
frequencies = moisturespectrum['Moisturespectrum']['x']
spectra = moisturespectrum['Moisturespectrum']['y']
moisture = moisturevalues['Moisturevalues']
(ii) We can use a histogram to look at the distribution of the moisture values.
Fig. 8.5 shows a bimodal distribution of the moisture content with a clear separa- { fig:fdaMoistureDistribution}
tion of the two peaks. Based on this, we select 14.5 as the threshold to separate into
high and low moisture content.
(iii–iv) After transposing, the spectrum information is in the array format required for the FDataGrid class. As we can see in the left graph of Fig. 8.6, the spectra are not well aligned but show a large spread in the intensities. { fig:fdaMoistureSpectrum} This is likely due to the difficulty in having a clearly defined concentration across samples. To reduce this variation, we transform the intensities by dividing each spectrum by its sample mean.
intensities = spectra.transpose()
fd = skfda.FDataGrid(intensities, frequencies)
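The normalization step is a one-liner; a sketch, assuming intensities is a NumPy array:
normalized = intensities / intensities.mean(axis=1, keepdims=True)
fd_normalized = skfda.FDataGrid(normalized, frequencies)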
{ fig:fdaMoistureSpectrum} Fig. 8.6 Near-infrared spectra of the moisture dataset. Left: raw spectra, right: normalized spectra
{ fig:fdaMoistureSpectrum} As we can see in right graph of Fig. 8.6, the normalized spectra are now better
aligned. We also see that the overall shape of the spectra is fairly consistent between
samples.
(v) We repeat the model building both for the original and normalized spectra 50
times. At each iteration, we split the data set into training and test sets (50-50), build
the model with the training set and measure accuracy using the test set. By using the
same random seed for splitting the original and the normalized dataset, we can better
{ fig:fdaMoistureAccuracies} compare the models. The accuracies from the 50 iterations are compared in Fig. 8.7.
accuracies = []
for rs in range(50):
X_train, X_test, y_train, y_test = train_test_split(fd,
moisture_class, random_state=rs, test_size=0.5)
knn_original = KNeighborsClassifier()
knn_original.fit(X_train, y_train)
Fig. 8.7 Accuracies of classification models based on original and normalized spectra. The line indicates equal performance { fig:fdaMoistureAccuracies}
# mean of accuracies
mean_accuracy = accuracies.mean()
mean_accuracy
original 0.7976
normalized 0.9468
dtype: float64
Fig. 8.7 clearly shows that classification models based on the normalized spectra { fig:fdaMoistureAccuracies}
achieve better accuracies. The mean accuracy increases from 0.8 to 0.95.
Fig. 8.8 Mean absolute error of regression models using original and normalized spectra. The line indicates equal performance { fig:fdaMoistureMAE}
mae = []
for rs in range(50):
X_train, X_test, y_train, y_test = train_test_split(fd,
moisture, random_state=rs, test_size=0.5)
knn_original = KNeighborsRegressor()
knn_original.fit(X_train, y_train)
mae_original = mean_absolute_error(y_test, knn_original.predict(X_test))
# mean of MAE
mean_mae = mae.mean()
mean_mae
original 0.817016
normalized 0.433026
dtype: float64
{ fig:fdaMoistureMAE} Fig. 8.8 clearly shows that regression models based on the normalized spectra
achieve better performance. The mean absolute error decreases from 0.82 to 0.43.
(iii) We use the last regression model from (ii) to create a plot of actual versus
predicted moisture content for the test data.
Fig. 8.9 Actual versus predicted moisture content for the test data { fig:fdaMoisturePredictions}
y_pred = knn_normalized.predict(X_test)
predictions = pd.DataFrame({'actual': y_test, 'predicted': y_pred})
minmax = [min(*y_test, *y_pred), max(*y_test, *y_pred)]
ax = predictions.plot.scatter(x='actual', y='predicted')
ax.set_xlabel('Moisture content')
ax.set_ylabel('Predicted moisture content')
ax.plot(minmax, minmax, color='grey')
plt.show()
Fig. 8.9 shows two clusters of points. One cluster contains the samples with the { fig:fdaMoisturePredictions}
high moisture content and the other cluster the samples with low moisture content.
The clusters are well separated and the predictions are in the typical range for
each cluster. However, within a cluster, predictions and actual values are not highly
correlated. In other words, while the regression model can distinguish between
samples with a high and low moisture content, the moisture content is otherwise
not well predicted. There is therefore no advantage of using the regression model
compared to the classification model.
(ii) In Python:
fpca_original = FPCA(n_components=2)
_ = fpca_original.fit(fd)
fpca_normalized = FPCA(n_components=2)
_ = fpca_normalized.fit(fd_normalized)
Fig. 8.10 Projection of the spectra onto the first two principal components. Left: original spectra, right: normalized spectra { fig:fdaMoisturePCA}
fpca_df.plot.scatter(x=0, y=1,
c=['C1' if mc == 'high' else 'C2' for mc in moisture_class], ax=ax)
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
{ fig:fdaMoisturePCA} Fig. 8.10 compares the PCA projections for the original and normalized data.
We can see that the second principal component clearly separates the two moisture
content classes for the normalized spectra. This is not the case for the original spectra.
Solution 8.6 (i) We demonstrate a solution to this exercise using two blog posts preprocessed and included in the mistat package. The content of the posts was converted to plain text with each paragraph on a separate line. Loading the posts gives the variable blogs, a dictionary with labels as keys and text as values. We next split the data into a list of labels and a list of non-empty paragraphs.
paragraphs = []
labels = []
for blog, text in blogs.items():
for paragraph in text.split('\n'):
paragraph = paragraph.strip()
if not paragraph: # ignore empty paragraphs
continue
paragraphs.append(paragraph)
labels.append(blog)
import re
from sklearn.feature_extraction.text import CountVectorizer
def preprocessor(text):
text = text.lower()
text = re.sub(r'\d[\d,]*', '', text)
text = '\n'.join(line for line in text.split('\n')
if not line.startswith('ntsb'))
return text
vectorizer = CountVectorizer(preprocessor=preprocessor,
stop_words='english')
counts = vectorizer.fit_transform(paragraphs)
termCounts = np.array(counts.sum(axis=0)).flatten()
topCounts = termCounts.argsort()
terms = vectorizer.get_feature_names_out()
for n in reversed(topCounts[-10:]):
print(f'{terms[n]:14s} {termCounts[n]:3d}')
global 63
climate 59
warming 57
change 55
ice 35
sea 34
earth 33
ocean 29
temperatures 28
heat 25
(iv)
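The TF-IDF and SVD steps are not listed; a sketch with scikit-learn, using five components to match the five topics shown below:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
tfidf = TfidfTransformer().fit_transform(counts)
svd = TruncatedSVD(5)
lsa_tfidf = svd.fit_transform(Normalizer().fit_transform(tfidf))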
terms = vectorizer.get_feature_names_out()
data = {}
for i, component in enumerate(svd.components_, 1):
compSort = component.argsort()
idx = list(reversed(compSort[-10:]))
data[f'Topic {i}'] = [terms[n] for n in idx]
data[f'Loading {i}'] = [component[n] for n in idx]
df = pd.DataFrame(data)
print("{\\tiny")
print(df.style.format(precision=2).hide(axis='index').to_latex(hrules=True))
print("}")
Topic 1 Loading 1 Topic 2 Loading 2 Topic 3 Loading 3 Topic 4 Loading 4 Topic 5 Loading 5
change 0.24 ice 0.39 sea 0.25 extreme 0.54 snow 0.40
climate 0.24 sea 0.34 earth 0.23 events 0.30 cover 0.24
global 0.23 sheets 0.27 energy 0.21 heat 0.24 sea 0.16
sea 0.22 shrinking 0.19 light 0.21 precipitation 0.20 climate 0.12
warming 0.21 level 0.17 gases 0.19 earth 0.13 level 0.12
ice 0.20 arctic 0.15 ice 0.18 light 0.13 temperatures 0.12
temperature 0.18 ocean 0.13 infrared 0.17 energy 0.12 decreased 0.11
ocean 0.16 declining 0.10 greenhouse 0.15 gases 0.12 temperature 0.11
earth 0.16 levels 0.08 level 0.15 greenhouse 0.11 increase 0.10
extreme 0.15 glaciers 0.07 atmosphere 0.12 infrared 0.10 rise 0.10
We can identify topics related to sea warming, ice sheets melting, greenhouse
effect, extreme weather events, and hurricanes.
(v) Repeat the analysis requesting 10 components in the SVD.
svd = TruncatedSVD(10)
norm_tfidf = Normalizer().fit_transform(tfidf)
lsa_tfidf = svd.fit_transform(norm_tfidf)
terms = vectorizer.get_feature_names_out()
data = {}
for i, component in enumerate(svd.components_, 1):
compSort = component.argsort()
idx = list(reversed(compSort[-10:]))
data[f'Topic {i}'] = [terms[n] for n in idx]
data[f'Loading {i}'] = [component[n] for n in idx]
df = pd.DataFrame(data)
Topic 1 Loading 1 Topic 2 Loading 2 Topic 3 Loading 3 Topic 4 Loading 4 Topic 5 Loading 5
change 0.24 ice 0.39 sea 0.25 extreme 0.54 snow 0.39
climate 0.24 sea 0.35 earth 0.24 events 0.31 cover 0.23
global 0.23 sheets 0.27 energy 0.21 heat 0.24 sea 0.17
sea 0.22 shrinking 0.19 light 0.21 precipitation 0.20 level 0.13
warming 0.21 level 0.17 gases 0.19 light 0.13 temperatures 0.12
ice 0.20 arctic 0.15 ice 0.18 earth 0.13 climate 0.12
temperature 0.18 ocean 0.13 infrared 0.17 energy 0.12 hurricanes 0.12
ocean 0.16 declining 0.10 greenhouse 0.16 gases 0.12 decreased 0.11
earth 0.16 levels 0.08 level 0.15 greenhouse 0.11 increase 0.10
extreme 0.15 glaciers 0.07 atmosphere 0.12 infrared 0.10 rise 0.10
Topic 6 Loading 6 Topic 7 Loading 7 Topic 8 Loading 8 Topic 9 Loading 9 Topic 10 Loading 10
sea 0.38 snow 0.34 ocean 0.38 glaciers 0.36 responsibility 0.30
level 0.36 ocean 0.26 hurricanes 0.23 water 0.24 authorities 0.24
rise 0.17 cover 0.24 acidification 0.18 retreat 0.24 heat 0.23
extreme 0.14 acidification 0.17 glaciers 0.17 glacial 0.21 wildfires 0.20
events 0.11 extreme 0.16 water 0.13 months 0.16 pollution 0.19
global 0.11 pollution 0.13 waters 0.12 summer 0.14 arctic 0.13
hurricanes 0.11 carbon 0.13 temperatures 0.11 going 0.13 personal 0.12
temperature 0.10 decreased 0.12 reefs 0.09 plants 0.13 carbon 0.11
impacts 0.09 point 0.11 coral 0.09 animals 0.11 percent 0.11
coastal 0.08 warming 0.10 warmer 0.09 stream 0.11 reduce 0.10
The first five topics are identical to the result in (iv). This is an expected property
of the SVD.
Fig. 8.11 Projection of the documents onto the first two singular values. Red: blog post 1, green: blog post 2 { fig:nlpSVD}
{ fig:nlpSVD} (vi) Fig. 8.11 shows the individual documents projected onto the first two singular values of the LSA. Based on this visualization, we can say that the two documents discuss different aspects of global warming and that blog post 1 goes into more detail in this area.
fig, ax = plt.subplots()
blog1 = [label == 'blog-1' for label in labels]
blog2 = [label == 'blog-2' for label in labels]
ax.plot(lsa_tfidf[blog1, 0], lsa_tfidf[blog1, 1], 'ro')
ax.plot(lsa_tfidf[blog2, 0], lsa_tfidf[blog2, 1], 'go')
ax.set_xlabel('First component')
ax.set_ylabel('Second component')
plt.show()
Solution 8.7 We follow the same procedure as in Exercise 8.6 using a set of three {exc:nlp-topic-1}
articles preprocessed and included in the mistat package.
paragraphs = []
labels = []
for blog, text in blogs.items():
    for paragraph in text.split('\n'):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        paragraphs.append(paragraph)
        labels.append(blog)
def preprocessor(text):
    text = text.lower()
    text = re.sub(r'\d[\d,]*', '', text)
    text = '\n'.join(line for line in text.split('\n')
                     if not line.startswith('ntsb'))
    return text
TF-IDF transformation
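The construction of the document term matrix and the TF-IDF weights is not reproduced in this solution. Assuming the same pipeline as in Exercise 8.6 (CountVectorizer with the preprocessor defined above; this is an assumption carried over from that exercise), it could look like the following sketch:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# assumed vectorization step, mirroring Exercise 8.6
vectorizer = CountVectorizer(preprocessor=preprocessor, stop_words='english')
counts = vectorizer.fit_transform(paragraphs)
tfidf = TfidfTransformer().fit_transform(counts)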
svd = TruncatedSVD(10)
norm_tfidf = Normalizer().fit_transform(tfidf)
lsa_tfidf = svd.fit_transform(norm_tfidf)
Topics:
terms = vectorizer.get_feature_names_out()
data = {}
for i, component in enumerate(svd.components_, 1):
    compSort = component.argsort()
    idx = list(reversed(compSort[-10:]))
    data[f'Topic {i}'] = [terms[n] for n in idx]
    data[f'Loading {i}'] = [component[n] for n in idx]
df = pd.DataFrame(data)
Topic 1 Loading 1 Topic 2 Loading 2 Topic 3 Loading 3 Topic 4 Loading 4 Topic 5 Loading 5
labour 0.29 south 0.27 economic 0.24 capacity 0.21 covid 0.22
covid 0.22 labour 0.27 percent 0.23 financial 0.20 america 0.22
impact 0.19 north 0.22 covid 0.19 firms 0.19 reset 0.21
market 0.19 differences 0.20 gdp 0.17 household 0.17 latin 0.20
south 0.18 americas 0.17 impact 0.15 international 0.17 needs 0.18
america 0.16 channel 0.16 imf 0.14 access 0.14 economic 0.14
pandemic 0.15 agenda 0.14 social 0.12 largely 0.14 social 0.14
channel 0.15 covid 0.13 pre 0.12 depends 0.14 asymmetric 0.11
economic 0.14 post 0.11 world 0.11 state 0.13 region 0.11
north 0.14 impact 0.09 growth 0.11 support 0.13 labor 0.10
Topic 6 Loading 6 Topic 7 Loading 7 Topic 8 Loading 8 Topic 9 Loading 9 Topic 10 Loading 10
economic 0.26 covid 0.24 self 0.22 agenda 0.21 state 0.15
channel 0.20 asymmetric 0.21 employed 0.19 post 0.19 policies 0.14
social 0.16 occupations 0.20 unbearable 0.19 channel 0.18 needs 0.13
market 0.14 consequences 0.20 lightness 0.19 region 0.15 self 0.12
recovery 0.12 economic 0.17 employment 0.13 labour 0.12 capacity 0.12
labour 0.11 transition 0.13 countries 0.13 pandemic 0.12 unbearable 0.12
consequences 0.10 occupation 0.13 economic 0.13 new 0.09 lightness 0.12
covid 0.09 americas 0.13 channel 0.12 latin 0.09 covid 0.12
governments 0.09 north 0.13 argentina 0.12 given 0.08 employed 0.12
agenda 0.09 differences 0.12 social 0.12 occupations 0.08 reset 0.11
fig, ax = plt.subplots()
for blog in blogs:
    match = [label == blog for label in labels]
    ax.plot(lsa_tfidf[match, 0], lsa_tfidf[match, 1], 'o', label=blog)
ax.legend()
ax.set_xlabel('First component')
ax.set_ylabel('Second component')
plt.show()
Fig. 8.12 Projection of the documents onto the first two singular values; the legend identifies the three blog posts.
Solution 8.8 (i) We load the LAPTOP_REVIEWS data, combine the review title and content into a single Review column, and keep only rows that have a user rating, title, and content.
data = mistat.load_data('LAPTOP_REVIEWS')
data['Review'] = data['Review title'] + ' ' + data['Review content']
reviews = data.dropna(subset=['User rating', 'Review title', 'Review content'])
(ii) Convert the text representation into a document term matrix (DTM).
import re
from sklearn.feature_extraction.text import CountVectorizer
def preprocessor(text):
    text = text.lower()
    text = re.sub(r'\d[\d,]*', '', text)
    return text
vectorizer = CountVectorizer(preprocessor=preprocessor,
                             stop_words='english')
counts = vectorizer.fit_transform(reviews['Review'])
print('shape of DTM', counts.shape)
print('total number of terms', np.sum(counts))
(iii) Convert the counts in the document term matrix (DTM) using TF-IDF.
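The code for this step is not shown in the solution; a minimal sketch, assuming the counts matrix from step (ii), would be:
from sklearn.feature_extraction.text import TfidfTransformer
# convert the raw term counts into TF-IDF weights
tfidf = TfidfTransformer().fit_transform(counts)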
(iv) Using scikit-learn’s TruncatedSVD, we reduce the sparse TF-IDF matrix to a dense, low-dimensional representation.
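The corresponding code is omitted here; a sketch consistent with the shape reported below (20 components and a normalized TF-IDF input, as in the earlier exercises; these settings are assumptions) might look like this:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
# 20-component LSA on the normalized TF-IDF matrix (assumed settings)
svd = TruncatedSVD(20)
norm_tfidf = Normalizer().fit_transform(tfidf)
lsa_tfidf = svd.fit_transform(norm_tfidf)
print(lsa_tfidf.shape)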
(7433, 20)
(v) We use logistic regression to classify reviews, labeling those with a user rating of five as positive and all others as negative.
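The model-fitting code is not reproduced in the solution. A minimal sketch, assuming the lsa_tfidf representation from step (iv), a 70/30 train/test split, and accuracy as the reported metric (all assumptions), could be as follows; the value printed below is taken from the original solution and need not be reproduced exactly by this sketch:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# label reviews: rating of five -> positive (1), otherwise negative (0)
y = (reviews['User rating'] == 5).astype(int)
# hypothetical train/test split; the original split is not documented here
X_train, X_test, y_train, y_test = train_test_split(
    lsa_tfidf, y, test_size=0.3, random_state=1)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))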
0.769334229993275