Statistics Fundamentals With Python
INTRODUCTION TO STATISTICS IN PYTHON
Maggie Matsui
Content Developer, DataCamp
What is statistics?
The field of statistics - the practice and study of collecting and analyzing data
How many occupants will your hotel have? How can you optimize occupancy?
How many sizes of jeans need to be manufactured so they can fit 95% of the population?
Should the same number of each size be produced?
Even so, this can't tell us if more violent scenes lead to more views
Likert scale responses are ordinal data, e.g. ..., Somewhat agree (4), Strongly agree (5)
import numpy as np
np.mean(car_speeds['speed_mph'])
40.09062
single 188
married 143
divorced 124
dtype: int64
Mammal sleep data
print(msleep)
Measures of center: mean, median, and mode

Mean:
np.mean(msleep['sleep_total'])

Median: sort the values and take the middle one. The sorted sleep totals (hours):
29     1.9
30     2.7
22     2.9
9      3.0
23     3.1
      ...
19    18.0
61    18.1
36    19.4
21    19.7
42    19.9

np.median(msleep['sleep_total'])
10.1

Mode: the most frequent value.
msleep['sleep_total'].value_counts()
12.5    4
10.1    3
14.9    2
11.0    2
8.4     2
       ...
14.3    1
17.0    1
Name: sleep_total, Length: 65, dtype: int64

msleep['vore'].value_counts()
herbi      32
omni       20
carni      19
insecti     5
Name: vore, dtype: int64

import statistics
statistics.mode(msleep['vore'])
'herbi'

Adding a single extreme value shifts the mean far more than the median:
mean      16.53
median    18.9
Name: sleep_total, dtype: float64

mean      13.22
median    18.1
Name: sleep_total, dtype: float64
What is spread?

Variance: the average squared distance from each data point to the mean.

# Sum of squared distances from the mean
dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
sq_dists = dists ** 2
np.sum(sq_dists)
1624.065542

# Divide by the number of data points - 1 to get the sample variance
np.var(msleep['sleep_total'], ddof=1)
19.805677

Without ddof=1, the population variance is calculated instead of the sample variance:
np.var(msleep['sleep_total'])
19.567055

Standard deviation: the square root of the variance.
np.std(msleep['sleep_total'], ddof=1)
4.450357

Mean absolute deviation (MAD):
np.mean(np.abs(dists))
3.566701
Standard deviation squares distances, penalizing longer distances more than shorter ones, while MAD penalizes each distance equally.
One isn't better than the other, but SD is more common than MAD.
Quartiles:
np.quantile(msleep['sleep_total'], [0, 0.25, 0.5, 0.75, 1])

Interquartile range (IQR): the distance between the 25th and 75th percentiles.
np.quantile(msleep['sleep_total'], 0.75) - np.quantile(msleep['sleep_total'], 0.25)
5.9

from scipy.stats import iqr
iqr(msleep['sleep_total'])
5.9

How do we know what a substantial difference is? A data point is an outlier if:
data < Q1 − 1.5 × IQR or data > Q3 + 1.5 × IQR
count 83.000000
mean 166.136349
std 786.839732
min 0.005000
25% 0.174000
50% 1.670000
75% 41.750000
max 6654.000000
Name: bodywt, dtype: float64
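Putting the outlier rule into practice on the heavily skewed body weights above, a minimal sketch (assuming the msleep DataFrame):

import numpy as np
from scipy.stats import iqr

# Quartiles and IQR of body weight
q1 = np.quantile(msleep['bodywt'], 0.25)
q3 = np.quantile(msleep['bodywt'], 0.75)
iqr_val = iqr(msleep['bodywt'])

# Outlier thresholds: 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr_val
upper = q3 + 1.5 * iqr_val

# Rows whose body weight falls outside the thresholds
outliers = msleep[(msleep['bodywt'] < lower) | (msleep['bodywt'] > upper)]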
Measuring chance
What's the probability of an event?
P(Brian) = 1/4 = 25%
name n_sales
2 Claire 75
np.random.seed(10)
sales_counts.sample()
name n_sales
1 Brian 128
P(Claire) = 1/3 = 33% (sampling without replacement: Brian has been picked and is not returned)
name n_sales
1 Brian 128
2 Claire 75
P(Claire) = 1/4 = 25% (sampling with replacement: Brian is returned to the pool)
name n_sales
1 Brian 128
2 Claire 75
1 Brian 128
3 Damian 69
0 Amir 178
Rolling the dice
A fair die as a probability distribution:
   number      prob
0       1  0.166667
1       2  0.166667
2       3  0.166667
3       4  0.166667
4       5  0.166667
5       6  0.166667

Sampling rolls from it, e.g. rolls_10 = die.sample(10, replace=True), returns rows like:
   number      prob
0       1  0.166667
0       1  0.166667
4       5  0.166667
1       2  0.166667
5       6  0.166667
...

Expected value of a roll:
np.mean(die['number'])
3.5
np.mean(rolls_10['number']) = 3.0     vs. np.mean(die['number']) = 3.5
np.mean(rolls_100['number']) = 3.4    vs. np.mean(die['number']) = 3.5
np.mean(rolls_1000['number']) = 3.48  vs. np.mean(die['number']) = 3.5

As the sample size increases, the sample mean approaches the theoretical mean (the law of large numbers).
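A minimal sketch of that simulation (die is assumed to be the DataFrame defined above):

import numpy as np
import pandas as pd

die = pd.DataFrame({'number': [1, 2, 3, 4, 5, 6],
                    'prob': [1/6] * 6})

# Larger samples of rolls have means closer to the theoretical 3.5
rolls_10 = die.sample(10, replace=True)
rolls_1000 = die.sample(1000, replace=True)
print(np.mean(rolls_10['number']), np.mean(rolls_1000['number']))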
Waiting for the bus
from scipy.stats import uniform
uniform.cdf(7, 0, 12)                          # P(wait <= 7 min), assuming arrivals uniform over 0-12 minutes
0.5833333
1 - uniform.cdf(7, 0, 12)                      # P(wait >= 7 min)
0.4166667
uniform.cdf(7, 0, 12) - uniform.cdf(4, 0, 12)  # P(4 <= wait <= 7)
0.25
Coin flipping
1 = heads, 0 = tails
from scipy.stats import binom
binom.rvs(1, 0.5, size=1)     # a single coin flip
array([1])
binom.rvs(1, 0.5, size=8)     # eight flips of one coin
array([0, 1, 1, 0, 1, 0, 1, 1])
binom.rvs(8, 0.5, size=1)     # number of heads from eight flips
array([5])
binom.rvs(3, 0.5, size=10)    # heads out of three flips, repeated ten times
array([0, 3, 2, 1, 3, 0, 2, 2, 0, 0])
binom.rvs(3, 0.25, size=10)   # e.g. the same with a biased coin
array([1, 1, 1, 1, 0, 0, 2, 0, 1, 0])
The binomial distribution describes the number of successes in a sequence of independent trials; it is described by n (number of trials) and p (probability of success).

binom.pmf(7, 10, 0.5)       # P(exactly 7 heads in 10 flips)
0.1171875
binom.cdf(7, 10, 0.5)       # P(7 or fewer heads)
0.9453125
1 - binom.cdf(7, 10, 0.5)   # P(more than 7 heads)
0.0546875
What is the normal distribution?
A normal distribution is described by its mean and standard deviation: for example, a normal distribution with mean 20 and standard deviation 3, or the standard normal distribution with mean 0 and standard deviation 1.
from scipy.stats import norm
norm.cdf(154, 161, 7)                          # P(height < 154 cm), with mean 161 cm and sd 7 cm
0.158655
1 - norm.cdf(154, 161, 7)                      # P(height > 154 cm)
0.841345
norm.cdf(157, 161, 7) - norm.cdf(154, 161, 7)  # P(154 cm < height < 157 cm)
0.1252
norm.ppf(0.9, 161, 7)                          # height that 90% are shorter than
169.97086
norm.ppf(0.1, 161, 7)                          # height that 90% are taller than
152.029
Rolling the dice 5 times
die = pd.Series([1, 2, 3, 4, 5, 6])
# Roll 5 times
samp_5 = die.sample(5, replace=True)
print(samp_5)
array([3, 1, 4, 1, 1])
np.mean(samp_5)
2.0

Repeating the five rolls gives a different sample mean each time, e.g. 4.4 or 3.8.

Sampling with replacement also allows a sample larger than the population:
sales_team.sample(10, replace=True)

The mean of many such sample means settles near the theoretical value, e.g. 3.48.
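Repeating the experiment many times produces a sampling distribution of the sample mean; a sketch (die is the Series defined above):

import numpy as np

sample_means = []
for i in range(1000):
    # Roll 5 times and record the sample mean
    samp_5 = die.sample(5, replace=True)
    sample_means.append(np.mean(samp_5))

print(np.mean(sample_means))   # close to the theoretical mean of 3.5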
Poisson processes
Events appear to happen at a certain rate,
but completely at random
Examples
Number of animals adopted from an
animal shelter per week
Example: if the average number of adoptions per week is 8 (lambda = 8):

from scipy.stats import poisson
poisson.pmf(5, 8)       # P(# adoptions in a week = 5)
0.09160366
poisson.cdf(5, 8)       # P(# adoptions in a week <= 5)
0.1912361
1 - poisson.cdf(5, 8)   # P(# adoptions in a week > 5)
0.8087639
If the average number of adoptions per week is 10, what is P (# adoptions in a week > 5)?
1 - poisson.cdf(5, 10)
0.932914
Exponential distribution
Probability of time between Poisson events
Examples
Probability of > 1 day between adoptions
Continuous (time)
from scipy.stats import expon
1 - expon.cdf(1, scale=0.5)                      # P(> 1 day between adoptions), assuming 2 adoptions/day (scale = 1/rate)
0.1353352832366127
expon.cdf(1, scale=0.5) - expon.cdf(0.25, scale=0.5)   # P(between 0.25 and 1 day)
0.4711953764760207
Examples:
Length of chess games
Relationships between two variables
x = explanatory/independent variable
y = response/dependent variable
0.751755
msleep['sleep_rem'].corr(msleep['sleep_total'])
0.751755
x̄ = mean of x, ȳ = mean of y
σx = standard deviation of x, σy = standard deviation of y

r = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (σx × σy)
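The formula can be checked against numpy on a toy dataset (the numbers here are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Average product of deviations, divided by the product of the standard deviations
r = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) * x.std() * y.std())
print(r, np.corrcoef(x, y)[0, 1])   # both values agree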
Non-linear relationships
r = 0.18
df['x'].corr(df['y'])
0.081094
0.3119801
sns.lmplot(x='log_bodywt',
y='awake',
data=msleep,
ci=None)
plt.show()
msleep['log_bodywt'].corr(msleep['awake'])
0.5687943
Reciprocal transformation ( 1 / x )
sqrt(x) and 1 / y
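Creating the transformed column used in the plot above is a one-liner; a sketch, assuming a natural log transformation:

import numpy as np

msleep['log_bodywt'] = np.log(msleep['bodywt'])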
Linear regression
Vocabulary
Experiment aims to answer: What is the effect of the treatment on the response?
Treatment: advertisement
Placebo
Resembles treatment, but has no effect
In clinical trials, a sugar pill ensures that the effect of the drug is actually due to the drug
itself and not the idea of receiving the drug
There are ways to control for confounders to get more reliable conclusions about
association
Overview
Chapter 1: What is statistics?
Chapter 2: Measuring chance
Chapter 3: Normal distribution
Chapter 4: Correlation
n_claims 22.904762
total_payment_sek 98.187302
dtype: float64
print(swedish_motor_insurance['n_claims'].corr(swedish_motor_insurance['total_payment_sek']))
0.9128782350234068
Logistic regression
The response variable is logical (binary: True or False).
sns.scatterplot(x="n_claims",
y="total_payment_sek",
data=swedish_motor_insurance)
plt.show()
Chapter 2
Making predictions from linear regression models and understanding model coefficients.
Chapter 3
Assessing the quality of the linear regression model.
Chapter 4
Same again, but with logistic regression models
scikit-learn
Optimized for prediction (focus in other DataCamp courses)
Slope
The amount the y value increases if you increase x by one.
Equation
y = intercept + slope ∗ x
from statsmodels.formula.api import ols
mdl_payment_vs_claims = ols("total_payment_sek ~ n_claims",
                            data=swedish_motor_insurance).fit()
print(mdl_payment_vs_claims.params)
Intercept 19.994486
n_claims 3.413824
dtype: float64
Equation
total_payment_sek = 19.99 + 3.41 ∗ n_claims
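The fitted model can evaluate this equation for new data; a minimal sketch (the claim counts below are hypothetical):

import pandas as pd

explanatory_data = pd.DataFrame({"n_claims": [10, 50, 100]})
print(mdl_payment_vs_claims.predict(explanatory_data))
# Each prediction equals 19.99 + 3.41 * n_claims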
sns.displot(data=fish,
x="mass_g",
col="species",
col_wrap=2,
bins=9)
plt.show()
species
Bream 617.828571
Perch 382.239286
Pike 718.705882
Roach 152.050000
Name: mass_g, dtype: float64
Intercept 617.828571
species[T.Perch] -235.589286
species[T.Pike] 100.877311
species[T.Roach] -465.778571
The coefficients are relative to the intercept: 617.83 − 235.59 = 382.24! With a single categorical explanatory variable (and the intercept dropped via + 0 in the formula), the coefficients are simply the means of each category.
Intercept -1035.347565
length_cm 54.549981
dtype: float64
explanatory_data = pd.DataFrame({"length_cm": np.arange(20, 41)})
print(explanatory_data)
   length_cm
0 20
1 21
2 22
3 23
4 24
5 25
...
print(mdl_mass_vs_length.predict(explanatory_data))
0       55.652054
1 110.202035
2 164.752015
3 219.301996
4 273.851977
...
16 928.451749
17 983.001730
18 1037.551710
19 1092.101691
20 1146.651672
Length: 21, dtype: float64
pred_little_bream = little_bream.assign(
mass_g=mdl_mass_vs_length.predict(little_bream))
print(pred_little_bream)
length_cm mass_g
0 10 -489.847756
Intercept -1035.347565
length_cm 54.549981
dtype: float64
Fitted values: predictions on the original dataset.

print(mdl_mass_vs_length.fittedvalues)
1      273.851977
2      268.396979
3      399.316934
4      410.226930
         ...
30     873.901768
31     873.901768
34    1037.551710
Length: 35, dtype: float64

or equivalently:
print(mdl_mass_vs_length.predict(explanatory_data))

Residuals: actual response values minus predicted response values.
print(bream["mass_g"] - mdl_mass_vs_length.fittedvalues)
Regression to the mean means extreme cases don't persist over time
sns.scatterplot(x="father_height_cm",
y="son_height_cm",
data=father_son)
plt.axline(xy1=(150, 150),
slope=1,
linewidth=2,
color="green")
plt.axis("equal")
plt.show()
sns.regplot(x="father_height_cm",
y="son_height_cm",
data=father_son,
ci = None,
line_kws={"color": "black"})
plt.axis("equal")
plt.show()
Intercept 86.071975
father_height_cm 0.514093
dtype: float64
mdl_son_vs_father.predict(really_tall_father)
183.7
mdl_son_vs_father.predict(really_short_father)
163.2
sns.regplot(x="length_cm_cubed",
y="mass_g",
data=perch,
ci=None)
plt.show()
Intercept -0.117478
length_cm_cubed 0.016796
dtype: float64
prediction_data = explanatory_data.assign(
mass_g=mdl_perch.predict(explanatory_data))
print(prediction_data)
ad_conversion["sqrt_n_impressions"] = np.sqrt(
ad_conversion["n_impressions"])
sns.regplot(x="sqrt_spent_usd",
y="sqrt_n_impressions",
data=ad_conversion,
ci=None)
prediction_data = explanatory_data.assign(sqrt_n_impressions=mdl_ad.predict(explanatory_data),
n_impressions=mdl_ad.predict(explanatory_data) ** 2)
print(prediction_data)
The proportion of the variance in the response variable that is predictable from the
explanatory variable
1 means a perfect fit
print(mdl_bream.summary())
From the summary: R-squared = 0.878

print(mdl_bream.rsquared)
0.8780627095147173
MSE = RSE²
mse = mdl_bream.mse_resid
print("mse: ", mse)
mse: 5498.555084973521
rse = np.sqrt(mse)
print("rse: ", rse)
rse: 74.15224261594197
resid_sum_of_sq = sum(residuals_sq)
deg_freedom = len(bream.index) - 2
rse = np.sqrt(resid_sum_of_sq / deg_freedom)
The difference between predicted bream masses and observed bream masses is typically about 74g.
fig = plt.figure()
sns.regplot(x="length_cm",
y="mass_g",
data=roach,
ci=None)
sns.scatterplot(x="length_cm",
y="mass_g",
hue="extreme_l",
data=roach)
fig = plt.figure()
sns.regplot(x="length_cm",
y="mass_g",
data=roach,
ci=None)
sns.scatterplot(x="length_cm",
y="mass_g",
hue="extreme_l",
style="extreme_m",
data=roach)
Influence measures how much the model would change if you left the observation out of the dataset when modeling.
print(roach.head())
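The cooks_d column used below comes from statsmodels' influence summary; a sketch, assuming mdl_roach is the fitted mass-versus-length model for the roaches:

# summary_frame() includes hat_diag (leverage) and cooks_d (influence)
summary_roach = mdl_roach.get_influence().summary_frame()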
roach["cooks_dist"] = summary_roach["cooks_d"]
print(roach.head())
sns.regplot(x="length_cm",
y="mass_g",
data=roach,
ci=None,
line_kws={"color": "green"})
sns.regplot(x="length_cm",
y="mass_g",
data=roach_not_short,
ci=None,
line_kws={"color": "red"})
1 https://www.rdocumentation.org/packages/bayesQR/topics/Churn
print(mdl_churn_vs_recency_lm.params)
Intercept 0.490780
time_since_last_purchase 0.063783
dtype: float64
plt.axline(xy1=(0, intercept),
slope=slope)
plt.show()
plt.axline(xy1=(0,intercept),
slope=slope)
plt.xlim(-10, 10)
plt.ylim(-0.2, 1.2)
plt.show()
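The logistic regression model itself is fitted with the formula API; a sketch, assuming the churn DataFrame (mdl_recency is the name used in the predictions below):

from statsmodels.formula.api import logit

mdl_recency = logit("has_churned ~ time_since_last_purchase",
                    data=churn).fit()
print(mdl_recency.params)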
Intercept -0.035019
time_since_last_purchase 0.269215
dtype: float64
plt.axline(xy1=(0,intercept),
slope=slope,
color="black")
plt.show()
explanatory_data = pd.DataFrame(
{"time_since_last_purchase": np.arange(-1, 6.25, 0.25)})
prediction_data = explanatory_data.assign(
has_churned = mdl_recency.predict(explanatory_data))
sns.scatterplot(x="time_since_last_purchase",
y="has_churned",
data=prediction_data,
color="red")
plt.show()
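The most likely outcome is the predicted probability rounded to zero or one; a short sketch:

import numpy as np

prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])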
sns.scatterplot(x="time_since_last_purchase",
y="most_likely_outcome",
data=prediction_data,
color="red")
plt.show()
odds_ratio = probability / (1 − probability)

odds_ratio = 0.25 / (1 − 0.25) = 1/3
plt.axhline(y=1,
linestyle="dotted")
plt.show()
plt.axhline(y=1,
linestyle="dotted")
plt.yscale("log")
plt.show()
predicted_response = np.round(mdl_recency.predict())
outcomes = pd.DataFrame({"actual_response": churn["has_churned"],
                         "predicted_response": predicted_response})
print(outcomes.value_counts(sort=False))
actual_response predicted_response
0 0.0 141
1.0 59
1 0.0 111
1.0 89
conf_matrix = mdl_recency.pred_table()
print(conf_matrix)
[[141. 59.]
[111. 89.]]
from statsmodels.graphics.mosaicplot import mosaic
mosaic(conf_matrix)
accuracy = (TN + TP) / (TN + FN + FP + TP)
(141 + 89) / (141 + 111 + 59 + 89) = 0.575
[[141., 59.],
 [111., 89.]]

TN = conf_matrix[0,0]
TP = conf_matrix[1,1]
FN = conf_matrix[1,0]
FP = conf_matrix[0,1]
Transforming variables
Simpson's Paradox
Chapter 3: More explanatory variables
Chapter 4: Multiple logistic regression
print(mdl_mass_vs_both.params)
species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554
dtype: float64
print(mdl_mass_vs_species.params)
species[Bream] 617.828571
species[Perch] 382.239286
species[Pike] 718.705882
species[Roach] 152.050000
sns.regplot(x="length_cm",
y="mass_g",
data=fish,
ci=None)
plt.show()
sns.scatterplot(x="length_cm",
y="mass_g",
hue="species",
data=fish)
sns.scatterplot(x="length_cm",
y="mass_g",
hue="species",
data=fish)
sns.scatterplot(x="length_cm",
y="mass_g",
color="black",
data=prediction_data)
species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554
np.select(conditions, choices)
print(mdl_mass_vs_length.rsquared)
0.8225689502644215
print(mdl_mass_vs_species.rsquared)
0.25814887709499157
print(mdl_mass_vs_both.rsquared)
0.9200433561156649
R̄² = 1 − (1 − R²) × (n_obs − 1) / (n_obs − n_var − 1)
rsq_length: 0.8225689502644215
rsq_adj_length: 0.8211607673300121
rsq_species: 0.25814887709499157
rsq_adj_species: 0.24020086605696722
rsq_both: 0.9200433561156649
rsq_adj_both: 0.9174431400543857
rse_length: 152.12092835414788
rse_species = np.sqrt(mdl_mass_vs_species.mse_resid)
print("rse_species: ", rse_species)
rse_species: 313.5501156682592
rse_both = np.sqrt(mdl_mass_vs_both.mse_resid)
print("rse_both: ", rse_both)
rse_both: 103.35563303966488
sns.scatterplot(x="length_cm",
y="mass_g",
data=prediction_data,
hue="species",
ci=None,
legend=False)
plt.show()
Adjusted coefficient of determination:

print(mdl_fish.rsquared_adj)    # whole dataset
0.917
print(mdl_bream.rsquared_adj)
0.874
print(mdl_perch.rsquared_adj)
0.917
print(mdl_pike.rsquared_adj)
0.941
print(mdl_roach.rsquared_adj)
0.815

Residual standard error:

print(np.sqrt(mdl_fish.mse_resid))    # whole dataset
103
print(np.sqrt(mdl_bream.mse_resid))
74.2
print(np.sqrt(mdl_perch.mse_resid))
100
print(np.sqrt(mdl_pike.mse_resid))
120
print(np.sqrt(mdl_roach.mse_resid))
38.2
The effect of length on the expected mass is different for different species.
More generally
The effect of one explanatory variable on the expected response changes depending on the value of another explanatory variable.
print(mdl_mass_vs_both.params)
Intercept -1035.3476
species[T.Perch] 416.1725
species[T.Pike] -505.4767
species[T.Roach] 705.9714
length_cm 54.5500
length_cm:species[T.Perch] -15.6385
length_cm:species[T.Pike] -1.3551
length_cm:species[T.Roach] -31.2307
print(mdl_mass_vs_both_inter.params)
species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193
sns.scatterplot(x="length_cm",
y="mass_g",
data=prediction_data,
hue="species")
plt.show()
Example rows from the simpsons_paradox dataset (x, y, and a group label):
62.24344  70.60840  D
52.33499  14.70577  B
56.36795  46.39554  C
66.80395  66.17487  D
66.53605  89.24658  E
62.38129  91.45260  E
1 https://www.rdocumentation.org/packages/datasauRus/topics/simpsons_paradox
print(mdl_whole.params)
print(mdl_by_group.params)
Common advice
You can't choose the best model in general – it depends on the dataset and the question you
are trying to answer.
Context is important.
You may see a zero slope rather than a complete change in direction.
print(mdl_mass_vs_both.params)
Intercept -622.150234
length_cm 28.968405
height_cm 26.334804
print(prediction_data)
sns.scatterplot(x="length_cm",
y="height_cm",
data=prediction_data,
hue="mass_g",
legend=False,
marker="s")
plt.show()
print(mdl_mass_vs_both_inter.params)
Intercept 159.107480
length_cm 0.301426
height_cm -78.125178
length_cm:height_cm 3.545435
from itertools import product
p = product(length_cm, height_cm)
explanatory_data = pd.DataFrame(p,
                                columns=["length_cm",
                                         "height_cm"])
prediction_data = explanatory_data.assign(
mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))
sns.scatterplot(x="length_cm",
y="height_cm",
data=prediction_data,
hue="mass_g",
legend=False,
marker="s")
plt.show()
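grid below is a seaborn FacetGrid; a sketch of creating it, assuming one facet per species:

import seaborn as sns

grid = sns.FacetGrid(data=fish, col="species", col_wrap=2)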
grid.map(sns.scatterplot,
"length_cm",
"height_cm")
plt.show()
ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + 0",
    data=fish).fit()

ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + "
    "length_cm:height_cm:species + 0",
    data=fish).fit()
same as
ols(
"mass_g ~ length_cm * height_cm * species + 0",
data=fish).fit()
same as
ols(
"mass_g ~ (length_cm + height_cm + species) ** 2 + 0",
data=fish).fit()
prediction_data = explanatory_data.assign(
mass_g = mdl_mass_vs_all.predict(explanatory_data))
x = np.arange(-4, 5, 0.1)
y = x ** 2 - x + 10
xy_data = pd.DataFrame({"x": x,
"y": y})
sns.lineplot(x="x",
y="y",
data=xy_data)
Setting the derivative to zero: 0 = 2x − 1, so x = 0.5
1 https://www.rdocumentation.org/packages/bayesQR/topics/Churn
conf_matrix = mdl_logit.pred_table()
print(conf_matrix)
[[102. 98.]
[ 53. 147.]]
from itertools import product
explanatory1 = some_values
explanatory2 = some_values
p = product(explanatory1, explanatory2)
explanatory_data = pd.DataFrame(p,
                                columns=["explanatory1",
                                         "explanatory2"])
prediction_data = explanatory_data.assign(
    has_churned = mdl_logit.predict(explanatory_data))
sns.scatterplot(...
data=churn,
hue="has_churned",
...)
sns.scatterplot(...
data=prediction_data,
hue="most_likely_outcome",
...)
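Setup assumed by the distribution plots below (the exact grids of x and p values are guesses):

import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import norm, logistic

x = np.arange(-4, 4.05, 0.05)     # values at which to evaluate the PDFs/CDFs
p = np.arange(0.001, 1, 0.001)    # probabilities for the inverse CDF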
gauss_dist = pd.DataFrame({
"x": x,
"gauss_pdf": norm.pdf(x)}
)
sns.lineplot(x="x",
y="gauss_pdf",
data=gauss_dist)
gauss_dist = pd.DataFrame({
"x": x,
"gauss_pdf": norm.pdf(x),
"gauss_cdf": norm.cdf(x)}
)
sns.lineplot(x="x",
y="gauss_cdf",
data=gauss_dist)
gauss_dist_inv = pd.DataFrame({
"p": p,
"gauss_inv_cdf": norm.ppf(p)}
)
sns.lineplot(x="p",
y="gauss_inv_cdf",
data=gauss_dist_inv)
logistic_dist = pd.DataFrame({
"x": x,
"log_pdf": logistic.pdf(x)}
)
sns.lineplot(x="x",
y="log_pdf",
data=logistic_dist)
y_actual is always 0 or 1.
When y_actual = 1, the log-likelihood contribution is np.log(y_pred).
When y_actual = 0, the log-likelihood contribution is np.log(1 - y_pred).
-np.sum(log_likelihoods)
minimize(
fun=calc_neg_log_likelihood,
x0=[0, 0]
)
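A sketch of the function being minimized (minimize comes from scipy.optimize), assuming x holds the explanatory values and y_actual the observed 0/1 responses:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import logistic

def calc_neg_log_likelihood(coeffs):
    intercept, slope = coeffs
    # Predicted probabilities from the logistic CDF
    y_pred = logistic.cdf(intercept + slope * x)
    # log(y_pred) where y_actual is 1, log(1 - y_pred) where it is 0
    log_likelihoods = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)
    return -np.sum(log_likelihoods)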
Simpson's Paradox
Cross validation
James Chapman
Curriculum Manager, DataCamp
SAMPLING IN PYTHON

Estimating the population of France
A census asks every household how many
people live there.
There are lots of people in France
Censuses are really expensive!
Sampling households
Cheaper to ask a small number of households
and use statistics to estimate the population
Population vs. sample
The population is the complete dataset; the sample is the subset of data you calculate on.
Coffee rating dataset
total_cup_points variety country_of_origin aroma flavor aftertaste body balance
90.58 NA Ethiopia 8.67 8.83 8.67 8.50 8.42
89.92 Other Ethiopia 8.75 8.67 8.50 8.42 8.42
... ... ... ... ... ... ... ...
73.75 NA Vietnam 6.75 6.67 6.5 6.92 6.83
1338 rows
Points vs. flavor: population
pts_vs_flavor_pop = coffee_ratings[["total_cup_points", "flavor"]]
total_cup_points flavor
0 90.58 8.83
1 89.92 8.67
2 89.75 8.50
3 89.00 8.58
4 88.83 8.50
... ... ...
1333 78.75 7.58
1334 78.08 7.67
1335 77.17 7.33
1336 75.08 6.83
1337 73.75 6.67
Points vs. flavor: 10 row sample
pts_vs_flavor_samp = pts_vs_flavor_pop.sample(n=10)
total_cup_points flavor
1088 80.33 7.17
1157 79.67 7.42
1267 76.17 7.33
506 83.00 7.67
659 82.50 7.42
817 81.92 7.50
1050 80.67 7.42
685 82.42 7.50
1027 80.92 7.25
62 85.58 8.17
Python sampling for Series
Use .sample() for pandas DataFrames and Series
cup_points_samp = coffee_ratings['total_cup_points'].sample(n=10)
1088 80.33
1157 79.67
1267 76.17
... ...
685 82.42
1027 80.92
62 85.58
Name: total_cup_points, dtype: float64
Population parameters & point estimates
A population parameter is a calculation made on the population dataset; a point estimate (or sample statistic) is the same calculation made on a sample.
import numpy as np
np.mean(pts_vs_flavor_pop['total_cup_points'])
82.15120328849028
np.mean(cup_points_samp)
81.31800000000001
Point estimates with pandas
pts_vs_flavor_pop['flavor'].mean()
7.526046337817639
pts_vs_flavor_samp['flavor'].mean()
7.485000000000001
Convenience sampling
The Literary Digest election prediction
Finding the mean age of French people
Survey 10 people at Disneyland Paris
Mean age of 24.6 years
How accurate was the survey?
24.6 years is a poor estimate: people who visit Disneyland aren't representative of the whole population.

Year    Average French age
1975    31.6
1985    33.6
1995    36.2
2005    38.9
2015    41.2
Convenience sampling coffee ratings
coffee_ratings["total_cup_points"].mean()
82.15120328849028
coffee_ratings_first10 = coffee_ratings.head(10)
coffee_ratings_first10["total_cup_points"].mean()
89.1
Visualizing selection bias
import matplotlib.pyplot as plt
import numpy as np
coffee_ratings["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()
Distribution of a population and of a convenience sample
[Histograms: the population vs. the convenience sample of the first 10 rows]
Visualizing selection bias for a random sample
coffee_sample = coffee_ratings.sample(n=10)
coffee_sample["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()
Distribution of a population and of a simple random sample
[Histograms: the population vs. a simple random sample of 10 rows]
Pseudo-random number generation
What does random mean?
{adjective} made, done, happening, or chosen without method or conscious decision.
1 Oxford Languages
True random numbers
Generated from physical processes, like flipping coins
Hotbits uses radioactive decay
1 https://www.fourmilab.ch/hotbits 2 https://www.random.org
Pseudo-random number generation
Pseudo-random number generation is cheap and fast
Next "random" number calculated from previous "random" number
Pseudo-random number generation example
seed = 1
calc_next_random(seed)   # hypothetical generator: each "random" number is computed from the previous one
calc_next_random(3)
calc_next_random(2)
Random number generating functions
Prepend with numpy.random , such as numpy.random.beta()
Visualizing random numbers
randoms = np.random.beta(a=2, b=2, size=5000)
randoms
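Visualizing them with a histogram (a sketch):

import matplotlib.pyplot as plt

plt.hist(randoms, bins=np.arange(0, 1.05, 0.05))
plt.show()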
Random number seeds
np.random.seed(20000229)   # the same seed always reproduces the same sequence of numbers
Using a different seed
np.random.seed(20041004)   # a different seed produces a different sequence
Simple random and systematic sampling
Simple random sampling
Simple random sampling of coffees
Simple random sampling with pandas
coffee_ratings.sample(n=5, random_state=19000113)
Systematic sampling
Systematic sampling - defining the interval
sample_size = 5
pop_size = len(coffee_ratings)
print(pop_size)
1338

interval = pop_size // sample_size
print(interval)
267
Systematic sampling - selecting the rows
coffee_ratings.iloc[::interval]
body balance
0 8.50 8.42
267 7.75 7.75
534 7.92 7.83
801 7.50 7.33
1068 7.17 7.25
The trouble with systematic sampling
coffee_ratings_with_id = coffee_ratings.reset_index()
coffee_ratings_with_id.plot(x="index", y="aftertaste", kind="scatter")
plt.show()
Systematic sampling is only safe if we don't see a pattern in this scatter plot
Making systematic sampling safe
shuffled = coffee_ratings.sample(frac=1)
shuffled = shuffled.reset_index(drop=True).reset_index()
shuffled.plot(x="index", y="aftertaste", kind="scatter")
plt.show()
Stratified and weighted random sampling
Coffees by country
top_counts = coffee_ratings['country_of_origin'].value_counts()
top_counts.head(6)
country_of_origin
Mexico 236
Colombia 183
Guatemala 181
Brazil 132
Taiwan 75
United States (Hawaii) 73
dtype: int64
1 The dataset lists Hawaii and Taiwan as countries for convenience, as they are notable coffee-growing regions.
Filtering for 6 countries
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]
top_counted_subset = coffee_ratings['country_of_origin'].isin(top_counted_countries)
coffee_ratings_top = coffee_ratings[top_counted_subset]
Counts of a simple random sample
coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)
coffee_ratings_samp['country_of_origin'].value_counts(normalize=True)
country_of_origin
Mexico 0.250000
Guatemala 0.204545
Colombia 0.181818
Brazil 0.181818
United States (Hawaii) 0.102273
Taiwan 0.079545
dtype: float64
Comparing proportions
[Bar plots: country proportions in the population vs. the 10% simple random sample]
Proportional stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=0.1, random_state=2021)
coffee_ratings_strat['country_of_origin'].value_counts(normalize=True)
Mexico 0.272727
Guatemala 0.204545
Colombia 0.204545
Brazil 0.147727
Taiwan 0.090909
United States (Hawaii) 0.079545
Name: country_of_origin, dtype: float64
Equal counts stratified sampling
coffee_ratings_eq = coffee_ratings_top.groupby("country_of_origin")\
.sample(n=15, random_state=2021)
coffee_ratings_eq['country_of_origin'].value_counts(normalize=True)
Taiwan 0.166667
Brazil 0.166667
United States (Hawaii) 0.166667
Guatemala 0.166667
Mexico 0.166667
Colombia 0.166667
Name: country_of_origin, dtype: float64
Weighted random sampling
Specify weights to adjust the relative probability of a row being sampled
import numpy as np
coffee_ratings_weight = coffee_ratings_top
condition = coffee_ratings_weight['country_of_origin'] == "Taiwan"
coffee_ratings_weight['weight'] = np.where(condition, 2, 1)
Weighted random sampling results
10% weighted sample:
coffee_ratings_weight = coffee_ratings_weight.sample(frac=0.1, weights="weight")
coffee_ratings_weight['country_of_origin'].value_counts(normalize=True)
Brazil 0.261364
Mexico 0.204545
Guatemala 0.204545
Taiwan 0.170455
Colombia 0.090909
United States (Hawaii) 0.068182
Name: country_of_origin, dtype: float64
Cluster sampling
Stratified sampling vs. cluster sampling

Stratified sampling
Split the population into subgroups, then use simple random sampling on every subgroup.

Cluster sampling
Use simple random sampling to pick some subgroups, then use simple random sampling only on those subgroups.
Varieties of coffee
varieties_pop = list(coffee_ratings['variety'].unique())
Stage 1: sampling for subgroups
import random
varieties_samp = random.sample(varieties_pop, k=3)
Stage 2: sampling each group
variety_condition = coffee_ratings['variety'].isin(varieties_samp)
coffee_ratings_cluster = coffee_ratings[variety_condition]
coffee_ratings_cluster['variety'] = coffee_ratings_cluster['variety'].cat.remove_unused_categories()
coffee_ratings_cluster.groupby("variety")\
.sample(n=5, random_state=2021)
Stage 2 output
total_cup_points variety country_of_origin ...
variety
Bourbon 575 82.83 Bourbon Guatemala
560 82.83 Bourbon Guatemala
524 83.00 Bourbon Guatemala
1140 79.83 Bourbon Guatemala
318 83.67 Bourbon Brazil
Hawaiian Kona 1291 73.67 Hawaiian Kona United States (Hawaii)
1266 76.25 Hawaiian Kona United States (Hawaii)
488 83.08 Hawaiian Kona United States (Hawaii)
461 83.17 Hawaiian Kona United States (Hawaii)
117 84.83 Hawaiian Kona United States (Hawaii)
SL28 137 84.67 SL28 Kenya
452 83.17 SL28 Kenya
224 84.17 SL28 Kenya
66 85.50 SL28 Kenya
559 82.83 SL28 Kenya
Multistage sampling
Cluster sampling is a type of multistage sampling
Can have > 2 stages
E.g., countrywide surveys may sample states, counties, cities, and neighborhoods
Comparing sampling methods
Review of sampling techniques - setup
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]
subset_condition = coffee_ratings['country_of_origin'].isin(top_counted_countries)
coffee_ratings_top = coffee_ratings[subset_condition]
coffee_ratings_top.shape
(880, 8)
Review of simple random sampling
coffee_ratings_srs = coffee_ratings_top.sample(frac=1/3, random_state=2021)
coffee_ratings_srs.shape
(293, 8)
Review of stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=1/3, random_state=2021)
coffee_ratings_strat.shape
(293, 8)
Review of cluster sampling
import random
top_countries_samp = random.sample(top_counted_countries, k=2)
top_condition = coffee_ratings_top['country_of_origin'].isin(top_countries_samp)
coffee_ratings_cluster = coffee_ratings_top[top_condition]
coffee_ratings_cluster['country_of_origin'] = coffee_ratings_cluster['country_of_origin']\
.cat.remove_unused_categories()
coffee_ratings_clust = coffee_ratings_cluster.groupby("country_of_origin")\
.sample(n=len(coffee_ratings_top) // 6)
coffee_ratings_clust.shape
(292, 8)
Calculating mean cup points

Population:
coffee_ratings_top['total_cup_points'].mean()
81.94700000000002

Simple random sample:
coffee_ratings_srs['total_cup_points'].mean()
81.95982935153583

Stratified sample:
coffee_ratings_strat['total_cup_points'].mean()
81.92566552901025

Cluster sample:
coffee_ratings_clust['total_cup_points'].mean()
82.03246575342466
Mean cup points by country: simple random

Population:
coffee_ratings_top.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Brazil                    82.405909
Colombia                  83.106557
Guatemala                 81.846575
Mexico                    80.890085
Taiwan                    82.001333
United States (Hawaii)    81.820411
Name: total_cup_points, dtype: float64

Simple random sample:
coffee_ratings_srs.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Brazil                    82.414878
Colombia                  82.925536
Guatemala                 82.045385
Mexico                    81.100714
Taiwan                    81.744333
United States (Hawaii)    82.008000
Name: total_cup_points, dtype: float64
Mean cup points by country: stratified

Population:
coffee_ratings_top.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Brazil                    82.405909
Colombia                  83.106557
Guatemala                 81.846575
Mexico                    80.890085
Taiwan                    82.001333
United States (Hawaii)    81.820411
Name: total_cup_points, dtype: float64

Stratified sample:
coffee_ratings_strat.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Brazil                    82.499773
Colombia                  83.288197
Guatemala                 81.727667
Mexico                    80.994684
Taiwan                    81.846800
United States (Hawaii)    81.051667
Name: total_cup_points, dtype: float64
Mean cup points by country: cluster

Population:
coffee_ratings_top.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Brazil                    82.405909
Colombia                  83.106557
Guatemala                 81.846575
Mexico                    80.890085
Taiwan                    82.001333
United States (Hawaii)    81.820411
Name: total_cup_points, dtype: float64

Cluster sample:
coffee_ratings_clust.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Colombia    83.128904
Mexico      80.936027
Name: total_cup_points, dtype: float64
Relative error of point estimates
Sample size is number of rows
len(coffee_ratings.sample(n=300))
300

len(coffee_ratings.sample(frac=0.25))
334
Various sample sizes
coffee_ratings['total_cup_points'].mean()
82.15120328849028
coffee_ratings.sample(n=10)['total_cup_points'].mean()
83.027
coffee_ratings.sample(n=100)['total_cup_points'].mean()
82.4897
coffee_ratings.sample(n=1000)['total_cup_points'].mean()
82.1186
Relative errors
Population parameter:
population_mean = coffee_ratings['total_cup_points'].mean()
Point estimate:
sample_mean = coffee_ratings.sample(n=sample_size)['total_cup_points'].mean()
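Relative error compares the two as a percentage; a sketch:

rel_error_pct = 100 * abs(population_mean - sample_mean) / population_mean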
Relative error vs. sample size
import matplotlib.pyplot as plt
errors.plot(x="sample_size",
y="relative_error",
kind="line")
plt.show()
Properties: the relative error is noisy, declines steeply at small sample sizes, and approaches zero as the sample size approaches the population size.

Creating a sampling distribution
Same code, different answer
coffee_ratings.sample(n=30)['total_cup_points'].mean()
82.53066666666668
coffee_ratings.sample(n=30)['total_cup_points'].mean()
81.97566666666667
coffee_ratings.sample(n=30)['total_cup_points'].mean()
82.68
coffee_ratings.sample(n=30)['total_cup_points'].mean()
81.675
Same code, 1000 times
mean_cup_points_1000 = []
for i in range(1000):
mean_cup_points_1000.append(
coffee_ratings.sample(n=30)['total_cup_points'].mean()
)
print(mean_cup_points_1000)
Distribution of sample means for size 30
import matplotlib.pyplot as plt
plt.hist(mean_cup_points_1000, bins=30)
plt.show()
Different sample sizes
[Histograms of 1000 sample means for sample sizes 6 and 150: larger samples give a narrower, more bell-shaped distribution]

Approximate sampling distributions
4 dice
dice = expand_grid(
    {'die1': [1, 2, 3, 4, 5, 6],
     'die2': [1, 2, 3, 4, 5, 6],
     'die3': [1, 2, 3, 4, 5, 6],
     'die4': [1, 2, 3, 4, 5, 6]}
)
print(dice)
      die1  die2  die3  die4
0        1     1     1     1
1        1     1     1     2
2        1     1     1     3
3        1     1     1     4
4        1     1     1     5
...    ...   ...   ...   ...
1291     6     6     6     2
1292     6     6     6     3
1293     6     6     6     4
1294     6     6     6     5
1295     6     6     6     6

[1296 rows x 4 columns]
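expand_grid is not part of pandas; a common recipe for it (adapted from the pandas cookbook) looks like this:

from itertools import product
import pandas as pd

def expand_grid(data_dict):
    # Build a DataFrame from the cartesian product of the dict's values
    rows = product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())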
Mean roll
dice['mean_roll'] = (dice['die1'] +
                     dice['die2'] +
                     dice['die3'] +
                     dice['die4']) / 4
print(dice)
      die1  die2  die3  die4  mean_roll
0        1     1     1     1       1.00
1        1     1     1     2       1.25
2        1     1     1     3       1.50
3        1     1     1     4       1.75
4        1     1     1     5       2.00
...    ...   ...   ...   ...        ...
1291     6     6     6     2       5.00
1292     6     6     6     3       5.25
1293     6     6     6     4       5.50
1294     6     6     6     5       5.75
1295     6     6     6     6       6.00

[1296 rows x 5 columns]
Exact sampling distribution
dice['mean_roll'] = dice['mean_roll'].astype('category')
dice['mean_roll'].value_counts(sort=False).plot(kind="bar")
The number of outcomes increases fast
n_dice = list(range(1, 101))
n_outcomes = []
for n in n_dice:
n_outcomes.append(6**n)
outcomes = pd.DataFrame(
{"n_dice": n_dice,
"n_outcomes": n_outcomes})
outcomes.plot(x="n_dice",
y="n_outcomes",
kind="scatter")
plt.show()
Simulating the mean of four dice rolls
import numpy as np
sample_means_1000 = []
for i in range(1000):
sample_means_1000.append(
np.random.choice(list(range(1, 7)), size=4, replace=True).mean()
)
print(sample_means_1000)
[3.25, 3.25, 1.75, 2.0, 2.0, 1.0, 1.0, 2.75, 2.75, 2.5, 3.0, 2.0, 2.75,
...
1.25, 2.0, 2.5, 2.5, 3.75, 1.5, 1.75, 2.25, 2.0, 1.5, 3.25, 3.0, 3.5]
Approximate sampling distribution
plt.hist(sample_means_1000, bins=20)
Standard errors and the Central Limit Theorem
Sampling distribution of mean cup points
[Histograms of sample means for sample sizes 5 and 20]

Consequences of the central limit theorem
Averages of independent samples have approximately normal distributions; as the sample size increases, the sampling distribution of the mean gets closer to normal and its width shrinks.
Population & sampling distribution means
coffee_ratings['total_cup_points'].mean()
82.15120328849028

Use np.mean() on each approximate sampling distribution:

Sample size    Mean sample mean
5              82.18420719999999
20             82.1558634
80             82.14510154999999
320            82.154017925
Population & sampling distribution standard deviations
coffee_ratings['total_cup_points'].std(ddof=0)
2.685858187306438

Sample size    Std dev of sample mean
5              1.1886358227738543
20             0.5940321141669805
80             0.2934024263916487
Population standard deviation divided by the square root of sample size

Sample size    Std dev of sample mean    Calculation                    Result
5              1.1886358227738543        2.685858187306438 / sqrt(5)    1.201
Standard error
Standard deviation of the sampling distribution
Important tool in understanding sampling variability
Introduction to bootstrapping
With or without replacement

Sampling without replacement: each row of the population can appear in the sample at most once.
Sampling with replacement ("resampling"): each row can be picked more than once, so a population row can appear repeatedly in the resample.
Why sample with replacement?
coffee_ratings : a sample of a larger population of all coffees
Each coffee in our sample represents many different hypothetical population coffees
Coffee data preparation
coffee_focus = coffee_ratings[["variety", "country_of_origin", "flavor"]]
coffee_focus = coffee_focus.reset_index()
Resampling with .sample()
coffee_resamp = coffee_focus.sample(frac=1, replace=True)
Repeated coffees
coffee_resamp["index"].value_counts() 658 5
167 4
363 4
357 4
1047 4
..
771 1
770 1
766 1
764 1
0 1
Name: index, Length: 868, dtype: int64
SAMPLING IN PYTHON
Missing coffees
num_unique_coffees = len(coffee_resamp.drop_duplicates(subset="index"))
868
len(coffee_ratings) - num_unique_coffees
470
SAMPLING IN PYTHON
Bootstrapping
The opposite of sampling from a population: building up a theoretical population from the sample.
Bootstrapping process
1. Make a resample of the same size as the original sample
2. Calculate the statistic of interest for this bootstrap sample
3. Repeat steps 1 and 2 many times
The resulting statistics are bootstrap statistics, and they form a bootstrap distribution
Bootstrapping coffee mean flavor
import numpy as np
mean_flavors_1000 = []
for i in range(1000):
mean_flavors_1000.append(
np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
)
Bootstrap distribution histogram
import matplotlib.pyplot as plt
plt.hist(mean_flavors_1000)
plt.show()
Comparing sampling and bootstrap distributions
Coffee focused subset
coffee_sample = coffee_ratings[["variety", "country_of_origin", "flavor"]]\
.reset_index().sample(n=500)
The bootstrap of mean coffee flavors
import numpy as np
mean_flavors_5000 = []
for i in range(5000):
mean_flavors_5000.append(
np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
)
bootstrap_distn = mean_flavors_5000
Mean flavor bootstrap distribution
import matplotlib.pyplot as plt
plt.hist(bootstrap_distn, bins=15)
plt.show()
Sample, bootstrap distribution, population means

Sample mean:
coffee_sample['flavor'].mean()
7.5132200000000005

Estimated population mean (mean of the bootstrap distribution):
np.mean(bootstrap_distn)
7.513357731999999

True population mean:
coffee_ratings['flavor'].mean()
7.526046337817639
Interpreting the means
The bootstrap distribution mean is usually close to the sample mean, but it cannot correct for any bias in the original sample.
Sample sd vs. bootstrap distribution sd

Sample standard deviation:
coffee_sample['flavor'].std()
0.3540883911928703

Standard deviation of the bootstrap distribution — this is an estimate of the standard error of the mean, not of the population standard deviation:
np.std(bootstrap_distn, ddof=1)
0.015768474367958217

Multiplying the standard error by the square root of the sample size recovers an estimate of the population standard deviation:
0.015768474367958217 * np.sqrt(500)   # approximately 0.3526

True population standard deviation:
coffee_ratings['flavor'].std(ddof=0)
0.3525938058821761
Interpreting the standard errors
Estimated standard error → standard deviation of the bootstrap distribution for a sample
statistic
Confidence intervals
Confidence intervals
"Values within one standard deviation of the mean" includes a large number of values from
each of these distributions
Predicting the weather
Rapid City, South Dakota in the United
States has the least predictable weather
Our weather prediction
Point estimate = 47°F (8.3°C)
Range of plausible high temperature values = 40 to 54°F (4.4 to 12.8°C)
We just reported a confidence interval!
40 to 54°F is a confidence interval
Sometimes written as 47 °F (40°F, 54°F) or 47°F [40°F, 54°F]
Bootstrap distribution of mean flavor
import matplotlib.pyplot as plt
plt.hist(coffee_boot_distn, bins=15)
plt.show()
Mean of the resamples
import numpy as np
np.mean(coffee_boot_distn)
7.513452892
Mean plus or minus one standard deviation
np.mean(coffee_boot_distn)
7.513452892

np.mean(coffee_boot_distn) - np.std(coffee_boot_distn, ddof=1)
7.497385709174466
np.mean(coffee_boot_distn) + np.std(coffee_boot_distn, ddof=1)
7.529520074825534
Quantile method for confidence intervals
np.quantile(coffee_boot_distn, 0.025)
7.4817195
np.quantile(coffee_boot_distn, 0.975)
7.5448805
Inverse cumulative distribution function
PDF: the bell curve. Integrating the PDF gives the CDF; flipping the CDF's x and y axes gives the inverse CDF (norm.ppf).
Standard error method for confidence interval

point_estimate = np.mean(coffee_boot_distn)
7.513452892
std_error = np.std(coffee_boot_distn, ddof=1)
0.016067182825533724

from scipy.stats import norm
lower = norm.ppf(0.025, loc=point_estimate, scale=std_error)
upper = norm.ppf(0.975, loc=point_estimate, scale=std_error)
print((lower, upper))
(7.481961792328933, 7.544943991671067)
Congratulations!
Recap
The most important things
The standard deviation of a bootstrap statistic is a good approximation of the standard error
Can assume bootstrap distributions are normally distributed for confidence intervals
What's next?
Experimental Design in Python and Customer Analytics and A/B Testing in Python
Hypothesis Testing in Python
Happy learning!
Hypothesis tests and z-scores
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
A/B testing
In 2013, Electronic Arts (EA) released SimCity 5.
mean_comp_samp = stack_overflow['converted_comp'].mean()
119574.71738168952

mean_comp_hyp = 110000

std_error   # the standard deviation of a bootstrap distribution of sample means
5607.997577378606

z_score = (mean_comp_samp - mean_comp_hyp) / std_error
1.7073326529796957
Determine whether sample statistics are close to or far away from expected (or
"hypothesized" values)
Criminal trials
Two possible true states:
1. Defendant committed the crime
2. Defendant did not commit the crime
Prosecution must present evidence "beyond reasonable doubt" for a guilty verdict
The alternative hypothesis (HA ) is the new "challenger" idea of the researcher
1"Naught" is British English for "zero". For historical reasons, "H-naught" is the international convention for
pronouncing the null hypothesis.
If the evidence from the sample is significantly in favor of HA, reject H0; otherwise, choose H0.
Test Tails
alternative different from null two-tailed
alternative greater than null right-tailed
alternative less than null left-tailed
prop_child_samp = (stack_overflow['age_first_code_cut'] == "child").mean()
0.39141972578505085
prop_child_hyp = 0.35

std_error   # from a bootstrap distribution of the sample proportion
0.010351057228878566

z_score = (prop_child_samp - prop_child_hyp) / std_error
4.001497129152506

p_value = 1 - norm.cdf(z_score)
3.1471479512323874e-05
p-value recap
p-values quantify the strength of evidence against the null hypothesis: small p-values argue for rejecting it
Large p-value → fail to reject null hypothesis
alpha = 0.05
p_value = 3.1471479512323874e-05
p_value <= alpha
True
Reject H0 in favor of HA
import numpy as np
lower = np.quantile(first_code_boot_distn, 0.025)
upper = np.quantile(first_code_boot_distn, 0.975)
print((lower, upper))
(0.37063246351172047, 0.41132242370632466)
If you chose HA when H0 was actually true, that is a false positive (Type I error); if you chose H0 when HA was actually true, that is a false negative (Type II error).

A false positive (Type I) error here: concluding that data scientists started coding as children at a higher rate, when they actually didn't.
A false negative (Type II) error: concluding that they didn't start coding as children at a higher rate, when they actually did.
Two-sample problems
Compare sample statistics across groups of a variable
converted_comp is a numerical variable
Are users who first programmed as a child compensated higher than those that started as
adults?
H0 : μchild = μadult
H0 : μchild − μadult = 0
HA : The mean compensation (in USD) is greater for those that coded first as a child
compared to those that coded first as an adult.
stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()
age_first_code_cut
adult    111313.311047
child    132419.570621
Name: converted_comp, dtype: float64
x̄ - a sample mean
x̄child - sample mean compensation for coding first as a child
x̄adult - sample mean compensation for coding first as an adult
x̄child − x̄adult - a test statistic
z-score - a (standardized) test statistic
xbar = stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()
age_first_code_cut
adult    111313.311047
child    132419.570621
Name: converted_comp, dtype: float64

s = stack_overflow.groupby('age_first_code_cut')['converted_comp'].std()
age_first_code_cut
adult    271546.521729
child    255585.240115
Name: converted_comp, dtype: float64

n = stack_overflow.groupby('age_first_code_cut')['converted_comp'].count()
age_first_code_cut
adult    1376
child     885
Name: converted_comp, dtype: int64
import numpy as np
numerator = xbar_child - xbar_adult
denominator = np.sqrt(s_child ** 2 / n_child + s_adult ** 2 / n_adult)
t_stat = numerator / denominator
1.8699313316221844
t-distributions
t statistic follows a t-distribution
Have a parameter named degrees of
freedom, or df
Look like normal distributions, with fatter
tails
df = nchild + nadult − 2
HA : The mean compensation (in USD) is greater for those that coded first as a child
compared to those that coded first as an adult
If p ≤ α then reject H0 .
SE(x̄_child − x̄_adult) ≈ sqrt( s²_child / n_child + s²_adult / n_adult )
z-statistic: needed when using one sample statistic to estimate a population parameter
t-statistic: needed when using multiple sample statistics to estimate a population parameter
t_stat
1.8699313316221844
degrees_of_freedom = n_child + n_adult - 2
2259
from scipy.stats import t
1 - t.cdf(t_stat, df=degrees_of_freedom)
0.030811302165157595
Evidence that Stack Overflow data scientists who started coding as a child earn more.
US Republican presidents dataset
state county repub_percent_08 repub_percent_12
0 Alabama Hale 38.957877 37.139882
1 Arkansas Nevada 56.726272 58.983452
2 California Lake 38.896719 39.331367
3 California Ventura 42.923190 45.250693
.. ... ... ... ...
96 Wisconsin La Crosse 37.490904 40.577038
97 Wisconsin Lafayette 38.104967 41.675050
98 Wyoming Weston 76.684241 83.983328
99 Alaska District 34 77.063259 40.789626
1 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ
H0 : μ2008 − μ2012 = 0

xbar_diff   # mean of the 2008-minus-2012 differences
-2.877109041242944

df = n_diff − 1
New hypotheses:
H0 : μdiff = 0
HA : μdiff < 0
degrees_of_freedom = n_diff - 1   # 100 counties - 1 = 99
p_value = 9.572537285272411e-08
Selected columns from the pingouin.ttest() output:
         BF10        power
T-test   1.323e+05   1.0

1 Details on Returns from pingouin.ttest() are available in the API docs for pingouin at https://pingouin-stats.org/generated/pingouin.ttest.html#pingouin.ttest

         BF10        power
T-test   1.323e+05   0.696338

The same comparison run as an unpaired t-test reports still lower power:
         power
T-test   0.454972
Unpaired t-tests on paired data increase the chances of false negative errors
Job satisfaction: 5 categories
stack_overflow['job_sat'].value_counts()
alpha = 0.2
pingouin.anova(data=stack_overflow,
dv="converted_comp",
between="job_sat")
0.001315 <α
At least two categories have significantly different compensation
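A natural follow-up is pairwise testing between categories; a sketch using pingouin (the Bonferroni adjustment here is an assumption):

pingouin.pairwise_tests(data=stack_overflow,
                        dv="converted_comp",
                        between="job_sat",
                        padjust="bonf")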
Chapter 1 recap
Is a claim about an unknown population proportion feasible?
3. Calculate a p-value
Now, calculate the test statistic without using the bootstrap distribution
z = (p̂ − p₀) / sqrt( p₀ × (1 − p₀) / n )

Only uses sample information (p̂ and n) and the hypothesized parameter (p₀)
s is calculated from x̄
x̄ estimates the population mean
s estimates the population standard deviation
↑ uncertainty in our estimate of the parameter
t-distribution - fatter tails than a normal distribution
alpha = 0.01
stack_overflow['age_cat'].value_counts(normalize=True)
Under 30 0.535604
At least 30 0.464396
Name: age_cat, dtype: float64
p_hat = (stack_overflow['age_cat'] == 'Under 30').mean()
0.5356037151702786
p_0 = 0.50
n = len(stack_overflow)
2261
import numpy as np
numerator = p_hat - p_0
denominator = np.sqrt(p_0 * (1 - p_0) / n)
z_score = numerator / denominator
3.385911440783663
Two-tailed ("not equal"):
p_value = norm.cdf(-z_score) + 1 - norm.cdf(z_score)
or equivalently:
p_value = 2 * (1 - norm.cdf(z_score))
0.0007094227368100725
p_value <= alpha   # True, so reject H0

Left-tailed ("less than"):
from scipy.stats import norm
p_value = norm.cdf(z_score)

Right-tailed ("greater than"):
p_value = 1 - norm.cdf(z_score)
Comparing two proportions
H0 : Proportion of hobbyist users is the same for those under thirty as those at least thirty
H0 : p≥30 − p<30 = 0
HA : Proportion of hobbyist users is different for those under thirty to those at least thirty
HA : p≥30 − p<30 ≠ 0
alpha = 0.05
p_hats = stack_overflow.groupby("age_cat")['hobbyist'].value_counts(normalize=True)
age_cat      hobbyist
At least 30  Yes         0.773333
             No          0.226667
Under 30     Yes         0.843105
             No          0.156895
Name: hobbyist, dtype: float64
n = stack_overflow.groupby("age_cat")['hobbyist'].count()
age_cat
At least 30 1050
Under 30 1211
Name: hobbyist, dtype: int64
From these, the sample proportions of hobbyists are p̂ = 0.773333 (at least 30) and p̂ = 0.843105 (under 30), with sample sizes n = 1050 and n = 1211. Plugging into the two-sample proportion z-statistic (with a pooled estimate of the common proportion in the standard error) gives:

z_score
-4.223718652693034
stack_overflow.groupby("age_cat")['hobbyist'].value_counts()
age_cat      hobbyist
At least 30  Yes          812
             No           238
Under 30     Yes         1021
             No           190
Name: hobbyist, dtype: int64
(-4.223691463320559, 2.403330142685068e-05)
Revisiting the proportion test
age_by_hobbyist = stack_overflow.groupby("age_cat")['hobbyist'].value_counts()
age_cat hobbyist
At least 30 Yes 812
No 238
Under 30 Yes 1021
No 190
Name: hobbyist, dtype: int64
(-4.223691463320559, 2.403330142685068e-05)
alpha = 0.1
Assuming independence, how far away are the observed results from the expected values?
Degrees of freedom:
(2 − 1) ∗ (5 − 1) = 4
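A sketch of the independence test with pingouin, assuming the two categorical columns are age_cat (2 levels) and job_sat (5 levels):

expected, observed, stats = pingouin.chi2_independence(data=stack_overflow,
                                                       x="age_cat",
                                                       y="job_sat")
print(stats)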
1Left-tailed chi-square tests are used in statistical forensics to detect if a fit is suspiciously good because the
data was fabricated. Chi-square tests of variance can be two-tailed. These are niche uses, though.
Purple links
How do you feel when you discover that you've already visited the top resource?
purple_link_counts = stack_overflow['purple_link'].value_counts()
purple_link_counts = purple_link_counts.rename_axis('purple_link')\
.reset_index(name='n')\
.sort_values('purple_link')
purple_link n
2 Amused 368
3 Annoyed 263
0 Hello, old friend 1225
1 Indifferent 405
H0 : The sample matches the hypothesized distribution.
χ² measures how far the observed results are from the expectations in each group.
purple_link prop n
0 Amused 0.166667 376.833333
1 Annoyed 0.166667 376.833333
2 Hello, old friend 0.500000 1130.500000
3 Indifferent 0.166667 376.833333
plt.bar(purple_link_counts['purple_link'], purple_link_counts['n'],
color='red', label='Observed')
plt.bar(hypothesized['purple_link'], hypothesized['n'], alpha=0.5,
color='blue', label='Hypothesized')
plt.legend()
plt.show()
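scipy runs the goodness-of-fit test from the observed and hypothesized counts above; a sketch:

from scipy.stats import chisquare

chisquare(f_obs=purple_link_counts['n'], f_exp=hypothesized['n'])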
Power_divergenceResult(statistic=44.59840778416629, pvalue=1.1261810719413759e-09)
Randomness
Assumption: randomness
The samples are random subsets of larger populations.
Consequence if violated: the sample is not representative of the population.

Assumption: independence of observations.
Consequence if violated: increased chance of false negative/positive errors.

Assumption: a large enough sample size.
Consequence if violated: wider confidence intervals and increased chance of false negative/positive errors.

Large sample conditions: one sample mean: n ≥ 30; two sample means: n1 ≥ 30 and n2 ≥ 30.
One proportion: n × p̂ ≥ 10 (and n × (1 − p̂) ≥ 10); two proportions: n1 × p̂1 ≥ 10, etc.
Revisit data collection to check for randomness, independence, and sample size
Parametric tests
z-test, t-test, and ANOVA are all parametric tests
Assume a normal distribution
alpha = 0.01
import pingouin
pingouin.ttest(x=repub_votes_potus_08_12_small['repub_percent_08'],
y=repub_votes_potus_08_12_small['repub_percent_12'],
paired=True,
alternative="less")
repub_votes_small['diff'] = (repub_votes_small['repub_percent_08'] -
                             repub_votes_small['repub_percent_12'])
print(repub_votes_small)

repub_votes_small['abs_diff'] = repub_votes_small['diff'].abs()
print(repub_votes_small)
Incorporate the sum of the ranks for negative and positive differences:
T_minus = 1 + 4 + 5 + 2 + 3   # 15
T_plus = 0
W = np.min([T_minus, T_plus])   # the test statistic W = 0
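In practice the test is a single pingouin call; a sketch:

pingouin.wilcoxon(x=repub_votes_small['repub_percent_08'],
                  y=repub_votes_small['repub_percent_12'],
                  alternative="less")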
Wilcoxon-Mann-Whitney test
Also known as the Mann-Whitney U test
A t-test on the ranks of the numeric input
age_vs_comp_wide = age_vs_comp.pivot(columns='age_first_code_cut',
values='converted_comp')
import pingouin
pingouin.mwu(x=age_vs_comp_wide['child'],
y=age_vs_comp_wide['adult'],
alternative='greater')
alpha=0.01
pingouin.kruskal(data=stack_overflow,
dv='converted_comp',
between='job_sat')
Course recap
Bayesian statistics
Bayesian Data Analysis in Python
Applications
Customer Analytics and A/B Testing in Python