Statistics Fundamentals With Python

Statistical knowledge is key to evaluating, interpreting, and reporting findings from your data. In this skill track, you'll learn the four fundamentals of statistics using Python:
✓ Summary statistics and probability
✓ Statistical models such as linear and logistic regression
✓ Techniques for sampling
✓ How to perform hypothesis tests and draw conclusions from a wide variety of data sets


What is statistics?

I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
What is statistics?
The field of statistics - the practice and study of collecting and analyzing data

A summary statistic - a fact about or summary of some data

INTRODUCTION TO STATISTICS IN PYTHON


What can statistics do?


How likely is someone to purchase a product? Are people more likely to purchase it if they
can use a different payment system?

How many occupants will your hotel have? How can you optimize occupancy?

How many sizes of jeans need to be manufactured so they can fit 95% of the population?
Should the same number of each size be produced?

A/B tests: Which ad is more effective in getting people to purchase a product?

INTRODUCTION TO STATISTICS IN PYTHON


What can't statistics do?
Why is Game of Thrones so popular?

Instead... Are series with more violent scenes viewed by more people?

But... even so, this can't tell us if more violent scenes lead to more views

INTRODUCTION TO STATISTICS IN PYTHON


Types of statistics
Descriptive statistics
Describe and summarize data
  50% of friends drive to work
  25% take the bus
  25% bike

Inferential statistics
Use a sample of data to make inferences about a larger population
  What percent of people drive to work?

INTRODUCTION TO STATISTICS IN PYTHON


Types of data
Numeric (Quantitative)
  Continuous (Measured)
    Airplane speed
    Time spent waiting in line
  Discrete (Counted)
    Number of pets
    Number of packages shipped

Categorical (Qualitative)
  Nominal (Unordered)
    Married/unmarried
    Country of residence
  Ordinal (Ordered)

INTRODUCTION TO STATISTICS IN PYTHON


Categorical data can be represented as numbers
Nominal (Unordered)
  Married/unmarried ( 1 / 0 )
  Country of residence ( 1 , 2 , ...)

Ordinal (Ordered)
  Strongly disagree ( 1 )
  Somewhat disagree ( 2 )
  Neither agree nor disagree ( 3 )
  Somewhat agree ( 4 )
  Strongly agree ( 5 )

INTRODUCTION TO STATISTICS IN PYTHON


Why does data type matter?
Summary statistics and plots both depend on the data type. For numeric data:

import numpy as np
np.mean(car_speeds['speed_mph'])

40.09062

INTRODUCTION TO STATISTICS IN PYTHON


Why does data type matter?
For categorical data, counts are more meaningful:
demographics['marriage_status'].value_counts()

single 188
married 143
divorced 124
dtype: int64

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Measures of center
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
Mammal sleep data
print(msleep)

name genus vore order ... sleep_cycle awake brainwt bodywt


1 Cheetah Acinonyx carni Carnivora ... NaN 11.9 NaN 50.000
2 Owl monkey Aotus omni Primates ... NaN 7.0 0.01550 0.480
3 Mountain beaver Aplodontia herbi Rodentia ... NaN 9.6 NaN 1.350
4 Greater short-ta... Blarina omni Soricomorpha ... 0.133333 9.1 0.00029 0.019
5 Cow Bos herbi Artiodactyla ... 0.666667 20.0 0.42300 600.000
.. ... ... ... ... ... ... ... ... ...
79 Tree shrew Tupaia omni Scandentia ... 0.233333 15.1 0.00250 0.104
80 Bottle-nosed do... Tursiops carni Cetacea ... NaN 18.8 NaN 173.330
81 Genet Genetta carni Carnivora ... NaN 17.7 0.01750 2.000
82 Arctic fox Vulpes carni Carnivora ... NaN 11.5 0.04450 3.380
83 Red fox Vulpes carni Carnivora ... 0.350000 14.2 0.05040 4.230

INTRODUCTION TO STATISTICS IN PYTHON


Histograms

INTRODUCTION TO STATISTICS IN PYTHON


How long do mammals in this dataset typically sleep?
What's a typical value?

Where is the center of the data?

Mean

Median

Mode
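As a quick sketch (assuming msleep is loaded as a pandas DataFrame, as in the following slides), all three measures can be computed directly:

import numpy as np
import statistics

# Assumes msleep is a pandas DataFrame with a 'sleep_total' column
np.mean(msleep['sleep_total'])          # mean: arithmetic average
np.median(msleep['sleep_total'])        # median: middle value when sorted
statistics.mode(msleep['sleep_total'])  # mode: most frequent value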

INTRODUCTION TO STATISTICS IN PYTHON


Measures of center: mean
   name              sleep_total
1  Cheetah                  12.1
2  Owl monkey               17.0
3  Mountain beaver          14.4
4  Greater short-t...       14.9
5  Cow                       4.0
.. ...                       ...

import numpy as np
np.mean(msleep['sleep_total'])

10.43373

Mean sleep time = (12.1 + 17.0 + 14.4 + 14.9 + ...) / 83 = 10.43

INTRODUCTION TO STATISTICS IN PYTHON


Measures of center: median
msleep['sleep_total'].sort_values()

29     1.9
30     2.7
22     2.9
9      3.0
23     3.1
      ...
19    18.0
61    18.1
36    19.4
21    19.7
42    19.9

msleep['sleep_total'].sort_values().iloc[41]

10.1

np.median(msleep['sleep_total'])

10.1

INTRODUCTION TO STATISTICS IN PYTHON


Measures of center: mode
Most frequent value

msleep['sleep_total'].value_counts()

12.5    4
10.1    3
14.9    2
11.0    2
8.4     2
       ...
14.3    1
17.0    1
Name: sleep_total, Length: 65, dtype: int64

msleep['vore'].value_counts()

herbi      32
omni       20
carni      19
insecti     5
Name: vore, dtype: int64

import statistics
statistics.mode(msleep['vore'])

'herbi'

INTRODUCTION TO STATISTICS IN PYTHON


Adding an outlier
msleep[msleep['vore'] == 'insecti']

name genus vore order sleep_total


22 Big brown bat Eptesicus insecti Chiroptera 19.7
43 Little brown bat Myotis insecti Chiroptera 19.9
62 Giant armadillo Priodontes insecti Cingulata 18.1
67 Eastern american mole Scalopus insecti Soricomorpha 8.4

INTRODUCTION TO STATISTICS IN PYTHON


Adding an outlier
msleep[msleep['vore'] == "insecti"]['sleep_total'].agg([np.mean, np.median])

mean 16.53
median 18.9
Name: sleep_total, dtype: float64

INTRODUCTION TO STATISTICS IN PYTHON


Adding an outlier
msleep[msleep['vore'] == 'insecti']

name genus vore order sleep_total


22 Big brown bat Eptesicus insecti Chiroptera 19.7
43 Little brown bat Myotis insecti Chiroptera 19.9
62 Giant armadillo Priodontes insecti Cingulata 18.1
67 Eastern american mole Scalopus insecti Soricomorpha 8.4
84 Mystery insectivore ... insecti ... 0.0

INTRODUCTION TO STATISTICS IN PYTHON


Adding an outlier
msleep[msleep['vore'] == "insecti"]['sleep_total'].agg([np.mean, np.median])

mean 13.22
median 18.1
Name: sleep_total, dtype: float64

Mean: 16.5 → 13.2

Median: 18.9 → 18.1

INTRODUCTION TO STATISTICS IN PYTHON


Which measure to use?

INTRODUCTION TO STATISTICS IN PYTHON


Skew
Left-skewed Right-skewed

INTRODUCTION TO STATISTICS IN PYTHON


Which measure to use?

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Measures of spread
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
What is spread?

INTRODUCTION TO STATISTICS IN PYTHON


Variance
Average squared distance from each data point to the data's mean

INTRODUCTION TO STATISTICS IN PYTHON




Calculating variance
1. Subtract mean from each data point

dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
print(dists)

0    1.666265
1    6.566265
2    3.966265
3    4.466265
4   -6.433735
        ...

2. Square each distance

sq_dists = dists ** 2
print(sq_dists)

0     2.776439
1    43.115837
2    15.731259
3    19.947524
4    41.392945
        ...

INTRODUCTION TO STATISTICS IN PYTHON


Calculating variance
3. Sum squared distances

sum_sq_dists = np.sum(sq_dists)
print(sum_sq_dists)

1624.065542

4. Divide by number of data points - 1

variance = sum_sq_dists / (83 - 1)
print(variance)

19.805677

Use np.var():

np.var(msleep['sleep_total'], ddof=1)

19.805677

Without ddof=1, population variance is calculated instead of sample variance:

np.var(msleep['sleep_total'])

19.567055

INTRODUCTION TO STATISTICS IN PYTHON


Standard deviation
np.sqrt(np.var(msleep['sleep_total'], ddof=1))

4.450357

np.std(msleep['sleep_total'], ddof=1)

4.450357

INTRODUCTION TO STATISTICS IN PYTHON


Mean absolute deviation
dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
np.mean(np.abs(dists))

3.566701

Standard deviation vs. mean absolute deviation

Standard deviation squares distances, penalizing longer distances more than shorter ones.

Mean absolute deviation penalizes each distance equally.

One isn't better than the other, but SD is more common than MAD.
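A minimal sketch of the two measures side by side, assuming msleep is available:

# Both measures start from the same distances to the mean
dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
sd = np.sqrt(np.mean(dists ** 2))  # standard deviation: square, average, square root
mad = np.mean(np.abs(dists))       # mean absolute deviation: absolute value, average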

INTRODUCTION TO STATISTICS IN PYTHON


Quantiles
The 0.5 quantile is the median:

np.quantile(msleep['sleep_total'], 0.5)

10.1

Quartiles:

np.quantile(msleep['sleep_total'], [0, 0.25, 0.5, 0.75, 1])

array([ 1.9 , 7.85, 10.1 , 13.75, 19.9 ])

INTRODUCTION TO STATISTICS IN PYTHON


Boxplots use quartiles
import matplotlib.pyplot as plt
plt.boxplot(msleep['sleep_total'])
plt.show()

INTRODUCTION TO STATISTICS IN PYTHON


Quantiles using np.linspace()
np.quantile(msleep['sleep_total'], [0, 0.2, 0.4, 0.6, 0.8, 1])

array([ 1.9 , 6.24, 9.48, 11.14, 14.4 , 19.9 ])

np.linspace(start, stop, num)

np.quantile(msleep['sleep_total'], np.linspace(0, 1, 5))

array([ 1.9 , 7.85, 10.1 , 13.75, 19.9 ])

INTRODUCTION TO STATISTICS IN PYTHON


Interquartile range (IQR)
Height of the box in a boxplot

np.quantile(msleep['sleep_total'], 0.75) - np.quantile(msleep['sleep_total'], 0.25)

5.9

from scipy.stats import iqr


iqr(msleep['sleep_total'])

5.9

INTRODUCTION TO STATISTICS IN PYTHON


Outliers
Outlier: data point that is substantially different from the others

How do we know what a substantial difference is? A data point is an outlier if:

data < Q1 − 1.5 × IQR or


data > Q3 + 1.5 × IQR

INTRODUCTION TO STATISTICS IN PYTHON


Finding outliers
from scipy.stats import iqr
iqr_bodywt = iqr(msleep['bodywt'])  # renamed to avoid shadowing the iqr function
lower_threshold = np.quantile(msleep['bodywt'], 0.25) - 1.5 * iqr_bodywt
upper_threshold = np.quantile(msleep['bodywt'], 0.75) + 1.5 * iqr_bodywt

msleep[(msleep['bodywt'] < lower_threshold) | (msleep['bodywt'] > upper_threshold)]

name vore sleep_total bodywt


4 Cow herbi 4.0 600.000
20 Asian elephant herbi 3.9 2547.000
22 Horse herbi 2.9 521.000
...

INTRODUCTION TO STATISTICS IN PYTHON


All in one go
msleep['bodywt'].describe()

count 83.000000
mean 166.136349
std 786.839732
min 0.005000
25% 0.174000
50% 1.670000
75% 41.750000
max 6654.000000
Name: bodywt, dtype: float64

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
What are the
chances?
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
Measuring chance
What's the probability of an event?

P(event) = (# ways event can happen) / (total # of possible outcomes)

Example: a coin flip

P(heads) = (1 way to get heads) / (2 possible outcomes) = 1/2 = 50%
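As an illustrative sketch (not from the slides), this probability can also be estimated by simulating many flips with numpy:

import numpy as np

# Estimate P(heads) by simulation: the fraction of heads approaches 0.5
np.random.seed(42)
flips = np.random.choice(['heads', 'tails'], size=10000)
print(np.mean(flips == 'heads'))  # roughly 0.5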

INTRODUCTION TO STATISTICS IN PYTHON


Assigning salespeople

INTRODUCTION TO STATISTICS IN PYTHON


Assigning salespeople

P(Brian) = 1/4 = 25%

INTRODUCTION TO STATISTICS IN PYTHON


Sampling from a DataFrame
print(sales_counts)

     name  n_sales
0    Amir      178
1   Brian      128
2  Claire       75
3  Damian       69

sales_counts.sample()

    name  n_sales
1  Brian      128

sales_counts.sample()

     name  n_sales
2  Claire       75

INTRODUCTION TO STATISTICS IN PYTHON


Setting a random seed
np.random.seed(10)
sales_counts.sample()

    name  n_sales
1  Brian      128

np.random.seed(10)
sales_counts.sample()

    name  n_sales
1  Brian      128

INTRODUCTION TO STATISTICS IN PYTHON


A second meeting
Sampling without replacement

INTRODUCTION TO STATISTICS IN PYTHON


A second meeting

P(Claire) = 1/3 = 33%

INTRODUCTION TO STATISTICS IN PYTHON


Sampling twice in Python
sales_counts.sample(2)

name n_sales
1 Brian 128
2 Claire 75

INTRODUCTION TO STATISTICS IN PYTHON


Sampling with replacement

INTRODUCTION TO STATISTICS IN PYTHON


Sampling with replacement

P(Claire) = 1/4 = 25%

INTRODUCTION TO STATISTICS IN PYTHON


Sampling with/without replacement in Python
sales_counts.sample(5, replace = True)

name n_sales
1 Brian 128
2 Claire 75
1 Brian 128
3 Damian 69
0 Amir 178

INTRODUCTION TO STATISTICS IN PYTHON




Independent events
Two events are independent if the probability
of the second event isn't affected by the
outcome of the first event.

Sampling with replacement = each pick is independent

INTRODUCTION TO STATISTICS IN PYTHON



Dependent events
Two events are dependent if the probability
of the second event is affected by the
outcome of the first event.

Sampling without replacement = each pick is dependent
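A small sketch (assuming the sales_counts DataFrame from earlier) showing why picks without replacement are dependent:

# After the first pick, only 3 of the 4 salespeople remain,
# so each remaining person's probability changes from 1/4 to 1/3
first = sales_counts.sample(1)
remaining = sales_counts.drop(first.index)
print(len(remaining))      # 3
print(1 / len(remaining))  # 0.333...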

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Discrete
distributions
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
Rolling the dice

INTRODUCTION TO STATISTICS IN PYTHON




Choosing salespeople

INTRODUCTION TO STATISTICS IN PYTHON


Probability distribution
Describes the probability of each possible outcome in a scenario

Expected value: mean of a probability distribution

Expected value of a fair die roll =
(1 × 1/6) + (2 × 1/6) + (3 × 1/6) + (4 × 1/6) + (5 × 1/6) + (6 × 1/6) = 3.5
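A one-line check of this expected value in code, as a sketch (each outcome weighted by its probability):

import numpy as np

# Expected value = sum of each outcome times its probability
outcomes = np.array([1, 2, 3, 4, 5, 6])
probs = np.full(6, 1 / 6)
print(np.sum(outcomes * probs))  # 3.5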

INTRODUCTION TO STATISTICS IN PYTHON


Visualizing a probability distribution

INTRODUCTION TO STATISTICS IN PYTHON


Probability = area
P(die roll ≤ 2) = ?

INTRODUCTION TO STATISTICS IN PYTHON


Probability = area
P(die roll ≤ 2) = 1/3

INTRODUCTION TO STATISTICS IN PYTHON


Uneven die

Expected value of uneven die roll =
(1 × 1/6) + (2 × 0) + (3 × 1/3) + (4 × 1/6) + (5 × 1/6) + (6 × 1/6) = 3.67

INTRODUCTION TO STATISTICS IN PYTHON


Visualizing uneven probabilities

INTRODUCTION TO STATISTICS IN PYTHON


Adding areas
P(uneven die roll ≤ 2) = ?

INTRODUCTION TO STATISTICS IN PYTHON


Adding areas
P(uneven die roll ≤ 2) = 1/6

INTRODUCTION TO STATISTICS IN PYTHON


Discrete probability distributions
Describe probabilities for discrete outcomes

Fair die: discrete uniform distribution
Uneven die: not uniform

INTRODUCTION TO STATISTICS IN PYTHON


Sampling from discrete distributions
print(die)

   number      prob
0       1  0.166667
1       2  0.166667
2       3  0.166667
3       4  0.166667
4       5  0.166667
5       6  0.166667

np.mean(die['number'])

3.5

rolls_10 = die.sample(10, replace=True)
rolls_10

   number      prob
0       1  0.166667
0       1  0.166667
4       5  0.166667
1       2  0.166667
0       1  0.166667
0       1  0.166667
5       6  0.166667
5       6  0.166667
       ...

INTRODUCTION TO STATISTICS IN PYTHON


Visualizing a sample
rolls_10['number'].hist(bins=np.linspace(1,7,7))
plt.show()

INTRODUCTION TO STATISTICS IN PYTHON


Sample distribution vs. theoretical distribution
Sample of 10 rolls Theoretical probability distribution

np.mean(rolls_10['number']) = 3.0
np.mean(die['number']) = 3.5

INTRODUCTION TO STATISTICS IN PYTHON


A bigger sample
Sample of 100 rolls Theoretical probability distribution

np.mean(rolls_100['number']) = 3.4
np.mean(die['number']) = 3.5

INTRODUCTION TO STATISTICS IN PYTHON


An even bigger sample
Sample of 1000 rolls Theoretical probability distribution

np.mean(rolls_1000['number']) = 3.48
np.mean(die['number']) = 3.5

INTRODUCTION TO STATISTICS IN PYTHON


Law of large numbers
As the size of your sample increases, the sample mean will approach the expected value.

Sample size Mean


10 3.00
100 3.40
1000 3.48
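A minimal simulation sketch of the law of large numbers (assuming the die DataFrame defined earlier):

import numpy as np

# Larger samples give means closer to the expected value of 3.5
for n in [10, 100, 1000, 100000]:
    print(n, np.mean(die['number'].sample(n, replace=True)))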

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Continuous
distributions
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
Waiting for the bus

INTRODUCTION TO STATISTICS IN PYTHON


Continuous uniform distribution

INTRODUCTION TO STATISTICS IN PYTHON


Continuous uniform distribution

INTRODUCTION TO STATISTICS IN PYTHON


Probability still = area
P (4 ≤ wait time ≤ 7) = ?

INTRODUCTION TO STATISTICS IN PYTHON




Probability still = area
P (4 ≤ wait time ≤ 7) = 3 × 1/12 = 3/12

INTRODUCTION TO STATISTICS IN PYTHON


Uniform distribution in Python
P (wait time ≤ 7)

from scipy.stats import uniform


uniform.cdf(7, 0, 12)

0.5833333

INTRODUCTION TO STATISTICS IN PYTHON


"Greater than" probabilities
P (wait time ≥ 7) = 1 − P (wait time ≤ 7)

from scipy.stats import uniform


1 - uniform.cdf(7, 0, 12)

0.4166667

INTRODUCTION TO STATISTICS IN PYTHON



P (4 ≤ wait time ≤ 7)

from scipy.stats import uniform


uniform.cdf(7, 0, 12) - uniform.cdf(4, 0, 12)

0.25

INTRODUCTION TO STATISTICS IN PYTHON


Total area = 1
P (0 ≤ wait time ≤ 12) = ?

INTRODUCTION TO STATISTICS IN PYTHON


Total area = 1
P (0 ≤ wait time ≤ 12) = 12 × 1/12 = 1

INTRODUCTION TO STATISTICS IN PYTHON


Generating random numbers according to uniform
distribution
from scipy.stats import uniform
uniform.rvs(0, 5, size=10)

array([1.89740094, 4.70673196, 0.33224683, 1.0137103 , 2.31641255,
       3.49969897, 0.29688598, 0.92057234, 4.71086658, 1.56815855])

INTRODUCTION TO STATISTICS IN PYTHON


Other continuous distributions

INTRODUCTION TO STATISTICS IN PYTHON




Other special types of distributions
Normal distribution Exponential distribution

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
The binomial
distribution
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
Coin flipping

INTRODUCTION TO STATISTICS IN PYTHON


Binary outcomes

INTRODUCTION TO STATISTICS IN PYTHON


A single flip
# binom.rvs(num coins, probability of heads/success, size=num trials)

1 = heads, 0 = tails

from scipy.stats import binom


binom.rvs(1, 0.5, size=1)

array([1])

INTRODUCTION TO STATISTICS IN PYTHON


One flip many times
binom.rvs(1, 0.5, size=8)

array([0, 1, 1, 0, 1, 0, 1, 1])

INTRODUCTION TO STATISTICS IN PYTHON


Many flips one time
binom.rvs(8, 0.5, size=1)

array([5])

INTRODUCTION TO STATISTICS IN PYTHON


Many flips many times
binom.rvs(3, 0.5, size=10)

array([0, 3, 2, 1, 3, 0, 2, 2, 0, 0])

INTRODUCTION TO STATISTICS IN PYTHON


Other probabilities
binom.rvs(3, 0.25, size=10)

array([1, 1, 1, 1, 0, 0, 2, 0, 1, 0])

INTRODUCTION TO STATISTICS IN PYTHON


Binomial distribution
Probability distribution of the number of successes in a sequence of independent trials

E.g. number of heads in a sequence of coin flips

Described by n and p:
  n: total number of trials
  p: probability of success

INTRODUCTION TO STATISTICS IN PYTHON


What's the probability of 7 heads?
P (heads = 7)

# binom.pmf(num heads, num trials, prob of heads)


binom.pmf(7, 10, 0.5)

0.1171875

INTRODUCTION TO STATISTICS IN PYTHON


What's the probability of 7 or fewer heads?
P (heads ≤ 7)

binom.cdf(7, 10, 0.5)

0.9453125

INTRODUCTION TO STATISTICS IN PYTHON


What's the probability of more than 7 heads?
P (heads > 7)

1 - binom.cdf(7, 10, 0.5)

0.0546875

INTRODUCTION TO STATISTICS IN PYTHON


Expected value
Expected value = n × p

Expected number of heads out of 10 flips = 10 × 0.5 = 5
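A quick sketch checking n × p against a large simulation:

from scipy.stats import binom
import numpy as np

print(10 * 0.5)                                  # theoretical expected value: 5.0
print(np.mean(binom.rvs(10, 0.5, size=100000)))  # simulated mean, close to 5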

INTRODUCTION TO STATISTICS IN PYTHON




Independence
The binomial distribution is a probability
distribution of the number of successes in a
sequence of independent trials

Probabilities of the second trial are altered due to the outcome of the first (e.g., sampling without replacement)

If trials are not independent, the binomial distribution does not apply!

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
The normal
distribution
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
What is the normal distribution?

INTRODUCTION TO STATISTICS IN PYTHON


Symmetrical

INTRODUCTION TO STATISTICS IN PYTHON


Area = 1

INTRODUCTION TO STATISTICS IN PYTHON


Curve never hits 0

INTRODUCTION TO STATISTICS IN PYTHON


Described by mean and standard deviation

Mean: 20

Standard deviation: 3

Standard normal distribution

Mean: 0

Standard deviation: 1

INTRODUCTION TO STATISTICS IN PYTHON




Areas under the normal distribution
68% falls within 1 standard deviation

INTRODUCTION TO STATISTICS IN PYTHON


Areas under the normal distribution
95% falls within 2 standard deviations

INTRODUCTION TO STATISTICS IN PYTHON


Areas under the normal distribution
99.7% falls within 3 standard deviations
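These three areas can be verified with norm.cdf on the standard normal, as a sketch:

from scipy.stats import norm

# P(-k < Z < k) for k = 1, 2, 3 standard deviations
for k in [1, 2, 3]:
    print(k, norm.cdf(k) - norm.cdf(-k))  # 0.6827, 0.9545, 0.9973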

INTRODUCTION TO STATISTICS IN PYTHON


Lots of histograms look normal
Normal distribution vs. women's heights from NHANES

Mean: 161 cm, standard deviation: 7 cm

INTRODUCTION TO STATISTICS IN PYTHON


Approximating data with the normal distribution

INTRODUCTION TO STATISTICS IN PYTHON


What percent of women are shorter than 154 cm?
from scipy.stats import norm
norm.cdf(154, 161, 7)

0.158655

16% of women in the survey are shorter than 154 cm

INTRODUCTION TO STATISTICS IN PYTHON


What percent of women are taller than 154 cm?
from scipy.stats import norm
1 - norm.cdf(154, 161, 7)

0.841345

INTRODUCTION TO STATISTICS IN PYTHON


What percent of women are 154-157 cm?

norm.cdf(157, 161, 7) - norm.cdf(154, 161, 7)

0.1252

INTRODUCTION TO STATISTICS IN PYTHON


What height are 90% of women shorter than?
norm.ppf(0.9, 161, 7)

169.97086

INTRODUCTION TO STATISTICS IN PYTHON


What height are 90% of women taller than?
norm.ppf((1-0.9), 161, 7)

152.029

INTRODUCTION TO STATISTICS IN PYTHON


Generating random numbers
# Generate 10 random heights
norm.rvs(161, 7, size=10)

array([155.5758223 , 155.13133235, 160.06377097, 168.33345778,
       165.92273375, 163.32677057, 165.13280753, 146.36133538,
       149.07845021, 160.5790856 ])

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
The central limit
theorem
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
Rolling the dice 5 times
die = pd.Series([1, 2, 3, 4, 5, 6])
# Roll 5 times
samp_5 = die.sample(5, replace=True)
print(samp_5)

array([3, 1, 4, 1, 1])

np.mean(samp_5)

2.0

INTRODUCTION TO STATISTICS IN PYTHON


Rolling the dice 5 times
# Roll 5 times and take mean
samp_5 = die.sample(5, replace=True)
np.mean(samp_5)

4.4

samp_5 = die.sample(5, replace=True)


np.mean(samp_5)

3.8

INTRODUCTION TO STATISTICS IN PYTHON


Rolling the dice 5 times, 10 times
Repeat 10 times:
  Roll 5 times
  Take the mean

sample_means = []
for i in range(10):
    samp_5 = die.sample(5, replace=True)
    sample_means.append(np.mean(samp_5))
print(sample_means)

[3.8, 4.0, 3.8, 3.6, 3.2, 4.8, 2.6, 3.0, 2.6, 2.0]

INTRODUCTION TO STATISTICS IN PYTHON


Sampling distributions
Sampling distribution of the sample mean

INTRODUCTION TO STATISTICS IN PYTHON


100 sample means
sample_means = []
for i in range(100):
sample_means.append(np.mean(die.sample(5, replace=True)))

INTRODUCTION TO STATISTICS IN PYTHON


1000 sample means
sample_means = []
for i in range(1000):
sample_means.append(np.mean(die.sample(5, replace=True)))

INTRODUCTION TO STATISTICS IN PYTHON


Central limit theorem
The sampling distribution of a statistic becomes closer to the normal distribution as the
number of trials increases.

* Samples should be random and independent

INTRODUCTION TO STATISTICS IN PYTHON


Standard deviation and the CLT
sample_sds = []
for i in range(1000):
sample_sds.append(np.std(die.sample(5, replace=True)))

INTRODUCTION TO STATISTICS IN PYTHON


Proportions and the CLT
sales_team = pd.Series(["Amir", "Brian", "Claire", "Damian"])
sales_team.sample(10, replace=True)

array(['Claire', 'Damian', 'Brian', 'Damian', 'Damian', 'Amir', 'Amir', 'Amir',
       'Amir', 'Damian'], dtype=object)

sales_team.sample(10, replace=True)

array(['Brian', 'Amir', 'Brian', 'Claire', 'Brian', 'Damian', 'Claire', 'Brian',
       'Claire', 'Claire'], dtype=object)

INTRODUCTION TO STATISTICS IN PYTHON


Sampling distribution of proportion

INTRODUCTION TO STATISTICS IN PYTHON


Mean of sampling distribution
# Estimate expected value of die
np.mean(sample_means)

3.48

# Estimate proportion of "Claire"s
np.mean(sample_props)

0.26

Estimate characteristics of an unknown underlying distribution

More easily estimate characteristics of large populations

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
The Poisson
distribution
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
Poisson processes
Events appear to happen at a certain rate, but completely at random

Examples
  Number of animals adopted from an animal shelter per week
  Number of people arriving at a restaurant per hour
  Number of earthquakes in California per year

Time unit is irrelevant, as long as you use the same unit when talking about the same situation

INTRODUCTION TO STATISTICS IN PYTHON


Poisson distribution
Probability of some # of events occurring over a fixed period of time

Examples
Probability of ≥ 5 animals adopted from an animal shelter per week

Probability of 12 people arriving at a restaurant per hour

Probability of < 20 earthquakes in California per year

INTRODUCTION TO STATISTICS IN PYTHON


Lambda (λ)
λ = average number of events per time interval
Average number of adoptions per week = 8

INTRODUCTION TO STATISTICS IN PYTHON


Lambda is the distribution's peak

INTRODUCTION TO STATISTICS IN PYTHON


Probability of a single value
If the average number of adoptions per week is 8, what is P (# adoptions in a week = 5)?

from scipy.stats import poisson


poisson.pmf(5, 8)

0.09160366

INTRODUCTION TO STATISTICS IN PYTHON


Probability of less than or equal to
If the average number of adoptions per week is 8, what is P (# adoptions in a week ≤ 5)?

from scipy.stats import poisson


poisson.cdf(5, 8)

0.1912361

INTRODUCTION TO STATISTICS IN PYTHON


Probability of greater than
If the average number of adoptions per week is 8, what is P (# adoptions in a week > 5)?

1 - poisson.cdf(5, 8)

0.8087639

If the average number of adoptions per week is 10, what is P (# adoptions in a week > 5)?

1 - poisson.cdf(5, 10)

0.932914

INTRODUCTION TO STATISTICS IN PYTHON


Sampling from a Poisson distribution
from scipy.stats import poisson
poisson.rvs(8, size=10)

array([ 9, 9, 8, 7, 11, 3, 10, 6, 8, 14])

INTRODUCTION TO STATISTICS IN PYTHON


The CLT still applies!
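As a sketch of this idea, means of many small Poisson samples pile up in a roughly normal shape around λ = 8:

from scipy.stats import poisson
import numpy as np

# 1000 sample means of 5 draws each; a histogram of these is roughly bell-shaped
sample_means = [np.mean(poisson.rvs(8, size=5)) for i in range(1000)]
print(np.mean(sample_means))  # close to 8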

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
More probability
distributions
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
Exponential distribution
Probability of time between Poisson events
Examples
Probability of > 1 day between adoptions

Probability of < 10 minutes between restaurant arrivals

Probability of 6-8 months between earthquakes

Also uses lambda (rate)

Continuous (time)

INTRODUCTION TO STATISTICS IN PYTHON


Customer service requests
On average, one customer service ticket is created every 2 minutes
λ = 0.5 customer service tickets created each minute

INTRODUCTION TO STATISTICS IN PYTHON


Lambda in exponential distribution

INTRODUCTION TO STATISTICS IN PYTHON


Expected value of exponential distribution
In terms of rate (Poisson):
λ = 0.5 requests per minute

In terms of time between events (exponential):
1/λ = 1/0.5 = 2, i.e., on average 1 request every 2 minutes
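A quick check of this expected value using scipy's expon (which parameterizes the distribution by scale = 1/λ):

from scipy.stats import expon

print(expon.mean(scale=2))  # 2.0 minutes between requests on average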

INTRODUCTION TO STATISTICS IN PYTHON


How long until a new request is created?
P(wait < 1 min):

from scipy.stats import expon
# scale = 1/λ = 1/0.5 = 2
expon.cdf(1, scale=2)

0.3934693402873666

P(wait > 4 min):

1 - expon.cdf(4, scale=2)

0.1353352832366127

P(1 min < wait < 4 min):

expon.cdf(4, scale=2) - expon.cdf(1, scale=2)

0.4711953764760207

INTRODUCTION TO STATISTICS IN PYTHON


(Student's) t-distribution
Similar shape to the normal distribution

INTRODUCTION TO STATISTICS IN PYTHON


Degrees of freedom
Has parameter degrees of freedom (df) which affects the thickness of the tails
Lower df = thicker tails, higher standard deviation

Higher df = closer to normal distribution
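A small sketch comparing tail probabilities using scipy.stats.t and norm; the cutoff of -2 is just an assumed example:

from scipy.stats import t, norm

# P(T < -2) shrinks toward the normal value as df grows
for df in [1, 3, 30]:
    print(df, t.cdf(-2, df))
print("normal", norm.cdf(-2))  # about 0.0228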

INTRODUCTION TO STATISTICS IN PYTHON


Log-normal distribution
Variable whose logarithm is normally distributed

Examples:
  Length of chess games
  Adult blood pressure
  Number of hospitalizations in the 2003 SARS outbreak
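A minimal sketch (with an assumed shape parameter s=1) showing that the log of log-normal data is normal:

import numpy as np
from scipy.stats import lognorm

x = lognorm.rvs(s=1, size=1000)  # log-normally distributed values
log_x = np.log(x)                # approximately normal with mean 0, sd 1
print(np.mean(log_x), np.std(log_x))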

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Correlation
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
Relationships between two variables

x = explanatory/independent variable
y = response/dependent variable

INTRODUCTION TO STATISTICS IN PYTHON


Correlation coefficient
Quantifies the linear relationship between two variables

Number between -1 and 1

Magnitude corresponds to strength of relationship

Sign (+ or -) corresponds to direction of relationship

INTRODUCTION TO STATISTICS IN PYTHON


Magnitude = strength of relationship
0.99 (very strong relationship)

INTRODUCTION TO STATISTICS IN PYTHON


Magnitude = strength of relationship
0.99 (very strong relationship) 0.75 (strong relationship)

INTRODUCTION TO STATISTICS IN PYTHON


Magnitude = strength of relationship
0.56 (moderate relationship)

INTRODUCTION TO STATISTICS IN PYTHON


Magnitude = strength of relationship
0.56 (moderate relationship) 0.21 (weak relationship)

INTRODUCTION TO STATISTICS IN PYTHON


Magnitude = strength of relationship
0.04 (no relationship): knowing the value of x doesn't tell us anything about y

INTRODUCTION TO STATISTICS IN PYTHON


Sign = direction
0.75: as x increases, y increases
-0.75: as x increases, y decreases

INTRODUCTION TO STATISTICS IN PYTHON


Visualizing relationships
import seaborn as sns
sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()

INTRODUCTION TO STATISTICS IN PYTHON


Adding a trendline
import seaborn as sns
sns.lmplot(x="sleep_total", y="sleep_rem", data=msleep, ci=None)
plt.show()

INTRODUCTION TO STATISTICS IN PYTHON


Computing correlation
msleep['sleep_total'].corr(msleep['sleep_rem'])

0.751755

msleep['sleep_rem'].corr(msleep['sleep_total'])

0.751755

INTRODUCTION TO STATISTICS IN PYTHON


Many ways to calculate correlation
Used in this course: Pearson product-moment correlation (r)
Most common

r = (1/n) × Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (σₓ × σᵧ)

x̄ = mean of x, ȳ = mean of y
σₓ = (population) standard deviation of x, σᵧ = (population) standard deviation of y

Variations on this formula:
  Kendall's tau
  Spearman's rho
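A sketch applying this formula directly to the msleep columns used earlier and comparing with pandas (missing pairs dropped first; population standard deviations assumed):

import numpy as np

x = msleep['sleep_total']
y = msleep['sleep_rem']
valid = x.notna() & y.notna()  # keep rows where both values are present
x, y = x[valid], y[valid]

r = np.mean((x - x.mean()) * (y - y.mean())) / (x.std(ddof=0) * y.std(ddof=0))
print(r)                                                 # 0.751755...
print(msleep['sleep_total'].corr(msleep['sleep_rem']))   # same value

The variations are available through pandas' method argument, e.g. x.corr(y, method='spearman').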

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Correlation caveats
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
Non-linear relationships

r = 0.18

INTRODUCTION TO STATISTICS IN PYTHON


Non-linear relationships
What we see: What the correlation coefficient sees:

INTRODUCTION TO STATISTICS IN PYTHON


Correlation only accounts for linear relationships
Correlation shouldn't be used blindly
Always visualize your data

df['x'].corr(df['y'])

0.081094

INTRODUCTION TO STATISTICS IN PYTHON


Mammal sleep data
print(msleep)

name genus vore order ... sleep_cycle awake brainwt bodywt


1 Cheetah Acinonyx carni Carnivora ... NaN 11.9 NaN 50.000
2 Owl monkey Aotus omni Primates ... NaN 7.0 0.01550 0.480
3 Mountain beaver Aplodontia herbi Rodentia ... NaN 9.6 NaN 1.350
4 Greater short-ta... Blarina omni Soricomorpha ... 0.133333 9.1 0.00029 0.019
5 Cow Bos herbi Artiodactyla ... 0.666667 20.0 0.42300 600.000
.. ... ... ... ... ... ... ... ... ...
79 Tree shrew Tupaia omni Scandentia ... 0.233333 15.1 0.00250 0.104
80 Bottle-nosed do... Tursiops carni Cetacea ... NaN 18.8 NaN 173.330
81 Genet Genetta carni Carnivora ... NaN 17.7 0.01750 2.000
82 Arctic fox Vulpes carni Carnivora ... NaN 11.5 0.04450 3.380
83 Red fox Vulpes carni Carnivora ... 0.350000 14.2 0.05040 4.230

INTRODUCTION TO STATISTICS IN PYTHON


Body weight vs. awake time
msleep['bodywt'].corr(msleep['awake'])

0.3119801

INTRODUCTION TO STATISTICS IN PYTHON


Distribution of body weight

INTRODUCTION TO STATISTICS IN PYTHON


Log transformation
msleep['log_bodywt'] = np.log(msleep['bodywt'])

sns.lmplot(x='log_bodywt',
y='awake',
data=msleep,
ci=None)
plt.show()

msleep['log_bodywt'].corr(msleep['awake'])

0.5687943

INTRODUCTION TO STATISTICS IN PYTHON


Other transformations
Log transformation ( log(x) )
Square root transformation ( sqrt(x) )

Reciprocal transformation ( 1 / x )

Combinations of these, e.g.:


log(x) and log(y)

sqrt(x) and 1 / y

INTRODUCTION TO STATISTICS IN PYTHON


Why use a transformation?
Certain statistical methods rely on variables having a linear relationship
Correlation coefficient

Linear regression

Introduction to Linear Modeling in Python

INTRODUCTION TO STATISTICS IN PYTHON


Correlation does not imply causation
x is correlated with y does not mean x causes y

INTRODUCTION TO STATISTICS IN PYTHON


Confounding

INTRODUCTION TO STATISTICS IN PYTHON




Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Design of
experiments
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
Vocabulary
Experiment aims to answer: What is the effect of the treatment on the response?

Treatment: explanatory/independent variable

Response: response/dependent variable

E.g.: What is the effect of an advertisement on the number of products purchased?

Treatment: advertisement

Response: number of products purchased

INTRODUCTION TO STATISTICS IN PYTHON


Controlled experiments
Participants are assigned by researchers to either treatment group or control group
Treatment group sees advertisement

Control group does not

Groups should be comparable so that causation can be inferred

If groups are not comparable, this could lead to confounding (bias)


Treatment group average age: 25

Control group average age: 50

Age is a potential confounder

INTRODUCTION TO STATISTICS IN PYTHON


The gold standard of experiments will use...
Randomized controlled trial
Participants are assigned to treatment/control randomly, not based on any other
characteristics

Choosing randomly helps ensure that groups are comparable

Placebo
Resembles treatment, but has no effect

Participants will not know which group they're in

In clinical trials, a sugar pill ensures that the effect of the drug is actually due to the drug
itself and not the idea of receiving the drug

INTRODUCTION TO STATISTICS IN PYTHON


The gold standard of experiments will use...
Double-blind trial
Person administering the treatment/running the study doesn't know whether the
treatment is real or a placebo

Prevents bias in the response and/or analysis of results

Fewer opportunities for bias = more reliable conclusion about causation

INTRODUCTION TO STATISTICS IN PYTHON


Observational studies
Participants are not assigned randomly to groups
Participants assign themselves, usually based on pre-existing characteristics

Many research questions are not conducive to a controlled experiment


You can't force someone to smoke or have a disease

You can't make someone have certain past behavior

Establish association, not causation


Effects can be confounded by factors that got certain people into the control or
treatment group

There are ways to control for confounders to get more reliable conclusions about
association

INTRODUCTION TO STATISTICS IN PYTHON


Longitudinal vs. cross-sectional studies
Longitudinal study
  Participants are followed over a period of time to examine the effect of treatment on response
  Effect of age on height is not confounded by generation
  More expensive, results take longer

Cross-sectional study
  Data on participants is collected from a single snapshot in time
  Effect of age on height is confounded by generation
  Cheaper, faster, more convenient

INTRODUCTION TO STATISTICS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Congratulations!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N

Maggie Matsui
Content Developer, DataCamp
Overview
Chapter 1
  What is statistics?
  Measures of center
  Measures of spread

Chapter 2
  Measuring chance
  Probability distributions
  Binomial distribution

Chapter 3
  Normal distribution
  Central limit theorem
  Poisson distribution

Chapter 4
  Correlation
  Controlled experiments
  Observational studies
INTRODUCTION TO STATISTICS IN PYTHON


Build on your skills
Introduction to Linear Modeling in Python

INTRODUCTION TO STATISTICS IN PYTHON


Congratulations!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
A tale of two
variables
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Swedish motor insurance data
Each row represents one geographic region in Sweden.
There are 63 rows.

n_claims  total_payment_sek
     108              392.5
      19               46.2
      13               15.7
     124              422.2
      40              119.4
     ...                ...

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Descriptive statistics
import pandas as pd
print(swedish_motor_insurance.mean())

n_claims 22.904762
total_payment_sek 98.187302
dtype: float64

print(swedish_motor_insurance['n_claims'].corr(swedish_motor_insurance['total_payment_sek']))

0.9128782350234068

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


What is regression?
Statistical models to explore the relationship between a response variable and some explanatory variables.

Given values of explanatory variables, you can predict the values of the response variable.

n_claims  total_payment_sek
     108              392.5
      19               46.2
      13               15.7
     124              422.2
      40              119.4
     200                ???

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Jargon
Response variable (a.k.a. dependent variable)
The variable that you want to predict.

Explanatory variables (a.k.a. independent variables)


The variables that explain how the response variable will change.

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Linear regression and logistic regression
Linear regression
The response variable is numeric.

Logistic regression
The response variable is logical.

Simple linear/logistic regression


There is only one explanatory variable.

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Visualizing pairs of variables
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x="n_claims",
y="total_payment_sek",
data=swedish_motor_insurance)

plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Adding a linear trend line
sns.regplot(x="n_claims",
y="total_payment_sek",
data=swedish_motor_insurance,
ci=None)

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Course flow
Chapter 1
Visualizing and fitting linear regression models.

Chapter 2
Making predictions from linear regression models and understanding model coefficients.

Chapter 3
Assessing the quality of the linear regression model.

Chapter 4
Same again, but with logistic regression models

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Python packages for regression
statsmodels
Optimized for insight (focus in this course)

scikit-learn
Optimized for prediction (focus in other DataCamp courses)

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Fitting a linear
regression
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Straight lines are defined by two things
Intercept
The y value at the point when x is zero.

Slope
The amount the y value increases if you increase x by one.

Equation
y = intercept + slope ∗ x

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Estimating the intercept

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON




Estimating the slope

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON




Running a model
from statsmodels.formula.api import ols
mdl_payment_vs_claims = ols("total_payment_sek ~ n_claims",
data=swedish_motor_insurance)

mdl_payment_vs_claims = mdl_payment_vs_claims.fit()
print(mdl_payment_vs_claims.params)

Intercept 19.994486
n_claims 3.413824
dtype: float64

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Interpreting the model coefficients
Intercept 19.994486
n_claims 3.413824
dtype: float64

Equation
total_payment_sek = 19.99 + 3.41 ∗ n_claims
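A worked sketch of using this equation by hand for a hypothetical region with 200 claims (the "???" row from earlier):

# Plug n_claims into the fitted equation
n_claims = 200
total_payment_sek = 19.994486 + 3.413824 * n_claims
print(total_payment_sek)  # about 702.8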

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Categorical
explanatory
variables
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Fish dataset
Each row represents one fish.
There are 128 rows in the dataset.
There are 4 species of fish:
  Common Bream
  European Perch
  Northern Pike
  Common Roach

species  mass_g
Bream     242.0
Perch       5.9
Pike      200.0
Roach      40.0
...         ...

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Visualizing 1 numeric and 1 categorical variable
import matplotlib.pyplot as plt
import seaborn as sns

sns.displot(data=fish,
x="mass_g",
col="species",
col_wrap=2,
bins=9)

plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Summary statistics: mean mass by species
summary_stats = fish.groupby("species")["mass_g"].mean()
print(summary_stats)

species
Bream 617.828571
Perch 382.239286
Pike 718.705882
Roach 152.050000
Name: mass_g, dtype: float64

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Linear regression
from statsmodels.formula.api import ols
mdl_mass_vs_species = ols("mass_g ~ species", data=fish).fit()
print(mdl_mass_vs_species.params)

Intercept 617.828571
species[T.Perch] -235.589286
species[T.Pike] 100.877311
species[T.Roach] -465.778571

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Model with or without an intercept
Model with intercept (from previous slide):

mdl_mass_vs_species = ols(
    "mass_g ~ species", data=fish).fit()
print(mdl_mass_vs_species.params)

Intercept           617.828571
species[T.Perch]   -235.589286
species[T.Pike]     100.877311
species[T.Roach]   -465.778571

The coefficients are relative to the intercept: 617.83 − 235.59 = 382.24!

Model without an intercept:

mdl_mass_vs_species = ols(
    "mass_g ~ species + 0", data=fish).fit()
print(mdl_mass_vs_species.params)

species[Bream]    617.828571
species[Perch]    382.239286
species[Pike]     718.705882
species[Roach]    152.050000

In case of a single, categorical variable, coefficients are the means.

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Making predictions
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
The fish dataset: bream
bream = fish[fish["species"] == "Bream"]
print(bream.head())

species mass_g length_cm


0 Bream 242.0 23.2
1 Bream 290.0 24.0
2 Bream 340.0 23.9
3 Bream 363.0 26.3
4 Bream 430.0 26.5

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Plotting mass vs. length
sns.regplot(x="length_cm",
y="mass_g",
data=bream,
ci=None)

plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Running the model
mdl_mass_vs_length = ols("mass_g ~ length_cm", data=bream).fit()
print(mdl_mass_vs_length.params)

Intercept -1035.347565
length_cm 54.549981
dtype: float64

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Data on explanatory values to predict
If I set the explanatory variables to these values,
what value would the response variable have?

explanatory_data = pd.DataFrame({"length_cm": np.arange(20, 41)})

length_cm
0 20
1 21
2 22
3 23
4 24
5 25
...

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Call predict()
print(mdl_mass_vs_length.predict(explanatory_data))

0 55.652054
1 110.202035
2 164.752015
3 219.301996
4 273.851977
...
16 928.451749
17 983.001730
18 1037.551710
19 1092.101691
20 1146.651672
Length: 21, dtype: float64

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Predicting inside a DataFrame
explanatory_data = pd.DataFrame(
    {"length_cm": np.arange(20, 41)}
)
prediction_data = explanatory_data.assign(
    mass_g=mdl_mass_vs_length.predict(explanatory_data)
)
print(prediction_data)

    length_cm       mass_g
0          20    55.652054
1          21   110.202035
2          22   164.752015
3          23   219.301996
4          24   273.851977
..        ...          ...
16         36   928.451749
17         37   983.001730
18         38  1037.551710
19         39  1092.101691
20         40  1146.651672

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Showing predictions
import matplotlib.pyplot as plt
import seaborn as sns
fig = plt.figure()
sns.regplot(x="length_cm",
y="mass_g",
ci=None,
data=bream,)
sns.scatterplot(x="length_cm",
y="mass_g",
data=prediction_data,
color="red",
marker="s")
plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Extrapolating
Extrapolating means making predictions
outside the range of observed data.

little_bream = pd.DataFrame({"length_cm": [10]})

pred_little_bream = little_bream.assign(
mass_g=mdl_mass_vs_length.predict(little_bream))

print(pred_little_bream)

length_cm mass_g
0 10 -489.847756

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Working with model
objects
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
.params attribute
from statsmodels.formula.api import ols
mdl_mass_vs_length = ols("mass_g ~ length_cm", data = bream).fit()
print(mdl_mass_vs_length.params)

Intercept -1035.347565
length_cm 54.549981
dtype: float64

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


.fittedvalues attribute
Fitted values: predictions on the original dataset

print(mdl_mass_vs_length.fittedvalues)

or equivalently

explanatory_data = bream["length_cm"]
print(mdl_mass_vs_length.predict(explanatory_data))

0      230.211993
1      273.851977
2      268.396979
3      399.316934
4      410.226930
          ...
30     873.901768
31     873.901768
32     939.361745
33    1004.821722
34    1037.551710
Length: 35, dtype: float64

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


.resid attribute
Residuals: actual response values minus predicted response values

print(mdl_mass_vs_length.resid)

0    11.788007
1    16.148023
2    71.603021
3   -36.316934
4    19.773070
       ...

or equivalently

print(bream["mass_g"] - mdl_mass_vs_length.fittedvalues)

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


.summary()
mdl_mass_vs_length.summary()

OLS Regression Results


==============================================================================
Dep. Variable: mass_g R-squared: 0.878
Model: OLS Adj. R-squared: 0.874
Method: Least Squares F-statistic: 237.6
Date: Thu, 29 Oct 2020 Prob (F-statistic): 1.22e-16
Time: 13:23:21 Log-Likelihood: -199.35
No. Observations: 35 AIC: 402.7
Df Residuals: 33 BIC: 405.8
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1035.3476 107.973 -9.589 0.000 -1255.020 -815.676
length_cm 54.5500 3.539 15.415 0.000 47.350 61.750
==============================================================================
Omnibus: 7.314 Durbin-Watson: 1.478
Prob(Omnibus): 0.026 Jarque-Bera (JB): 10.857
Skew: -0.252 Prob(JB): 0.00439
Kurtosis: 5.682 Cond. No. 263.

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON




Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Regression to the
mean
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
The concept
Response value = fitted value + residual

"The stuff you explained" + "the stuff you couldn't explain"

Residuals exist due to problems in the model and fundamental randomness

Extreme cases are often due to randomness

Regression to the mean means extreme cases don't persist over time

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Pearson's father son dataset
1078 father/son pairs
Do tall fathers have tall sons?

father_height_cm  son_height_cm
           165.2          151.8
           160.7          160.6
           165.0          160.9
           167.0          159.5
           155.3          163.3
             ...            ...

1 Adapted from https://www.rdocumentation.org/packages/UsingR/topics/father.son

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Scatter plot
fig = plt.figure()

sns.scatterplot(x="father_height_cm",
y="son_height_cm",
data=father_son)

plt.axline(xy1=(150, 150),
slope=1,
linewidth=2,
color="green")

plt.axis("equal")
plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Adding a regression line
fig = plt.figure()

sns.regplot(x="father_height_cm",
y="son_height_cm",
data=father_son,
ci = None,
line_kws={"color": "black"})

plt.axline(xy1 = (150, 150),


slope=1,
linewidth=2,
color="green")

plt.axis("equal")
plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Running a regression
mdl_son_vs_father = ols("son_height_cm ~ father_height_cm",
data = father_son).fit()
print(mdl_son_vs_father.params)

Intercept 86.071975
father_height_cm 0.514093
dtype: float64

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Making predictions
really_tall_father = pd.DataFrame(
    {"father_height_cm": [190]})
mdl_son_vs_father.predict(really_tall_father)

183.7

really_short_father = pd.DataFrame(
    {"father_height_cm": [150]})
mdl_son_vs_father.predict(really_short_father)

163.2

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Transforming
variables
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Perch dataset
perch = fish[fish["species"] == "Perch"].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
print(perch.head())

species mass_g length_cm


55 Perch 5.9 7.5
56 Perch 32.0 12.5
57 Perch 40.0 13.8
58 Perch 51.5 15.0
59 Perch 70.0 15.7

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


It's not a linear relationship
sns.regplot(x="length_cm",
y="mass_g",
data=perch,
ci=None)

plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Bream vs. perch

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Plotting mass vs. length cubed
perch["length_cm_cubed"] = perch["length_cm"] ** 3

sns.regplot(x="length_cm_cubed",
y="mass_g",
data=perch,
ci=None)
plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Modeling mass vs. length cubed
perch["length_cm_cubed"] = perch["length_cm"] ** 3

mdl_perch = ols("mass_g ~ length_cm_cubed", data=perch).fit()


mdl_perch.params

Intercept -0.117478
length_cm_cubed 0.016796
dtype: float64

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Predicting mass vs. length cubed
explanatory_data = pd.DataFrame({"length_cm_cubed": np.arange(10, 41, 5) ** 3,
"length_cm": np.arange(10, 41, 5)})

prediction_data = explanatory_data.assign(
mass_g=mdl_perch.predict(explanatory_data))
print(prediction_data)

length_cm_cubed length_cm mass_g


0 1000 10 16.678135
1 3375 15 56.567717
2 8000 20 134.247429
3 15625 25 262.313982
4 27000 30 453.364084
5 42875 35 719.994447
6 64000 40 1074.801781

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Plotting mass vs. length cubed
On the transformed scale:

fig = plt.figure()
sns.regplot(x="length_cm_cubed", y="mass_g",
            data=perch, ci=None)
sns.scatterplot(data=prediction_data,
                x="length_cm_cubed", y="mass_g",
                color="red", marker="s")

Back on the original scale:

fig = plt.figure()
sns.regplot(x="length_cm", y="mass_g",
            data=perch, ci=None)
sns.scatterplot(data=prediction_data,
                x="length_cm", y="mass_g",
                color="red", marker="s")

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Facebook advertising dataset
How advertising works
1. Pay Facebook to show ads.
2. People see the ads ("impressions").
3. Some people who see it, click it.

936 rows
Each row represents 1 advert

spent_usd  n_impressions  n_clicks
     1.43           7350         1
     1.82          17861         2
     1.25           4259         1
     1.29           4133         1
     4.77          15615         3
      ...            ...       ...

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Plot is cramped
sns.regplot(x="spent_usd",
y="n_impressions",
data=ad_conversion,
ci=None)

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Square root vs square root
ad_conversion["sqrt_spent_usd"] = np.sqrt(
ad_conversion["spent_usd"])

ad_conversion["sqrt_n_impressions"] = np.sqrt(
ad_conversion["n_impressions"])

sns.regplot(x="sqrt_spent_usd",
y="sqrt_n_impressions",
data=ad_conversion,
ci=None)

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Modeling and predicting
mdl_ad = ols("sqrt_n_impressions ~ sqrt_spent_usd", data=ad_conversion).fit()

explanatory_data = pd.DataFrame({"sqrt_spent_usd": np.sqrt(np.arange(0, 601, 100)),


"spent_usd": np.arange(0, 601, 100)})

prediction_data = explanatory_data.assign(sqrt_n_impressions=mdl_ad.predict(explanatory_data),
n_impressions=mdl_ad.predict(explanatory_data) ** 2)
print(prediction_data)

sqrt_spent_usd spent_usd sqrt_n_impressions n_impressions


0 0.000000 0 15.319713 2.346936e+02
1 10.000000 100 597.736582 3.572890e+05
2 14.142136 200 838.981547 7.038900e+05
3 17.320508 300 1024.095320 1.048771e+06
4 20.000000 400 1180.153450 1.392762e+06
5 22.360680 500 1317.643422 1.736184e+06
6 24.494897 600 1441.943858 2.079202e+06

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Quantifying model
fit
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Bream and perch models
Bream Perch

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Coefficient of determination
Sometimes called "r-squared" or "R-squared".

The proportion of the variance in the response variable that is predictable from the
explanatory variable

1 means a perfect fit

0 means the worst possible fit

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


.summary()
Look at the value titled "R-Squared"

mdl_bream = ols("mass_g ~ length_cm", data=bream).fit()

print(mdl_bream.summary())

# Some lines of output omitted

OLS Regression Results


Dep. Variable: mass_g R-squared: 0.878
Model: OLS Adj. R-squared: 0.874
Method: Least Squares F-statistic: 237.6

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


.rsquared attribute
print(mdl_bream.rsquared)

0.8780627095147174

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


It's just correlation squared
coeff_determination = bream["length_cm"].corr(bream["mass_g"]) ** 2
print(coeff_determination)

0.8780627095147173

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Residual standard error (RSE)

A "typical" di erence between a prediction and an observed response

It has the same unit as the response variable.

MSE = RSE²

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


.mse_resid attribute
mse = mdl_bream.mse_resid
print('mse: ', mse)

mse: 5498.555084973521

rse = np.sqrt(mse)
print("rse: ", rse)

rse: 74.15224261594197

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Calculating RSE: residuals squared
residuals_sq = mdl_bream.resid ** 2
print("residuals sq: \n", residuals_sq)

residuals sq:
0      138.957118
1      260.758635
2     5126.992578
3     1318.919660
4      390.974309
          ...
30    2125.047026
31    6576.923291
32     206.259713
33     889.335096
34    7665.302003
Length: 35, dtype: float64

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Calculating RSE: sum of residuals squared
residuals_sq = mdl_bream.resid ** 2
resid_sum_of_sq = sum(residuals_sq)
print("resid sum of sq :", resid_sum_of_sq)

resid sum of sq : 181452.31780412616

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Calculating RSE: degrees of freedom
residuals_sq = mdl_bream.resid ** 2
resid_sum_of_sq = sum(residuals_sq)
deg_freedom = len(bream.index) - 2
print("deg freedom: ", deg_freedom)

deg freedom: 33

Degrees of freedom equals the number of observations minus the number of model
coefficients.

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Calculating RSE: square root of ratio
residuals_sq = mdl_bream.resid ** 2
resid_sum_of_sq = sum(residuals_sq)
deg_freedom = len(bream.index) - 2
rse = np.sqrt(resid_sum_of_sq/deg_freedom)
print("rse :", rse)

rse : 74.15224261594197

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Interpreting RSE
mdl_bream has an RSE of 74.

The difference between predicted bream masses and observed bream masses is typically
about 74g.

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Root-mean-square error (RMSE)
RSE (using degrees of freedom):

residuals_sq = mdl_bream.resid ** 2
resid_sum_of_sq = sum(residuals_sq)
deg_freedom = len(bream.index) - 2
rse = np.sqrt(resid_sum_of_sq/deg_freedom)
print("rse :", rse)

rse : 74.15224261594197

RMSE (using the number of observations):

residuals_sq = mdl_bream.resid ** 2
resid_sum_of_sq = sum(residuals_sq)
n_obs = len(bream.index)
rmse = np.sqrt(resid_sum_of_sq/n_obs)
print("rmse :", rmse)

rmse : 72.00244396727619

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Visualizing model fit
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Residual properties of a good fit
Residuals are normally distributed

The mean of the residuals is zero
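
As a quick numeric check of these properties, a minimal sketch (assuming mdl_bream is the fitted bream model shown on the next slide; the Shapiro-Wilk test is one common normality check, not the course's own method):

import numpy as np
from scipy.stats import shapiro

residuals = mdl_bream.resid

# The mean of the residuals should be close to zero
print(np.mean(residuals))

# Shapiro-Wilk test: a large p-value is consistent with normality
print(shapiro(residuals))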

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Bream and perch again
Bream: the "good" model

mdl_bream = ols("mass_g ~ length_cm", data=bream).fit()

Perch: the "bad" model

mdl_perch = ols("mass_g ~ length_cm", data=perch).fit()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Residuals vs. fitted
Bream Perch

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Q-Q plot
Bream Perch

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Scale-location plot
Bream Perch

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


residplot()
sns.residplot(x="length_cm", y="mass_g", data=bream, lowess=True)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


qqplot()
from statsmodels.api import qqplot
qqplot(data=mdl_bream.resid, fit=True, line="45")

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Scale-location plot
model_norm_residuals_bream = mdl_bream.get_influence().resid_studentized_internal
model_norm_residuals_abs_sqrt_bream = np.sqrt(np.abs(model_norm_residuals_bream))
sns.regplot(x=mdl_bream.fittedvalues, y=model_norm_residuals_abs_sqrt_bream, ci=None, lowess=True)
plt.xlabel("Fitted values")
plt.ylabel("Sqrt of abs val of stdized residuals")

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Outliers, leverage,
and influence
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Roach dataset
roach = fish[fish['species'] == "Roach"]
print(roach.head())

species mass_g length_cm


35 Roach 40.0 12.9
36 Roach 69.0 16.5
37 Roach 78.0 17.5
38 Roach 87.0 18.2
39 Roach 120.0 18.6

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Which points are outliers?
sns.regplot(x="length_cm",
y="mass_g",
data=roach,
ci=None)
plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Extreme explanatory values
roach["extreme_l"] = ((roach["length_cm"] < 15) |
(roach["length_cm"] > 26))

fig = plt.figure()
sns.regplot(x="length_cm",
y="mass_g",
data=roach,
ci=None)

sns.scatterplot(x="length_cm",
y="mass_g",
hue="extreme_l",
data=roach)

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Response values away from the regression line
roach["extreme_m"] = roach["mass_g"] < 1

fig = plt.figure()
sns.regplot(x="length_cm",
y="mass_g",
data=roach,
ci=None)

sns.scatterplot(x="length_cm",
y="mass_g",
hue="extreme_l",
style="extreme_m",
data=roach)

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Leverage and influence
Leverage is a measure of how extreme the explanatory variable values are.

Influence measures how much the model would change if you left the observation out of the
dataset when modeling.

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


.get_influence() and .summary_frame()
mdl_roach = ols("mass_g ~ length_cm", data=roach).fit()
summary_roach = mdl_roach.get_influence().summary_frame()
roach["leverage"] = summary_roach["hat_diag"]

print(roach.head())

species mass_g length_cm leverage


35 Roach 40.0 12.9 0.313729
36 Roach 69.0 16.5 0.125538
37 Roach 78.0 17.5 0.093487
38 Roach 87.0 18.2 0.076283
39 Roach 120.0 18.6 0.068387

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Cook's distance
Cook's distance is the most common measure of influence.

roach["cooks_dist"] = summary_roach["cooks_d"]
print(roach.head())

species mass_g length_cm leverage cooks_dist


35 Roach 40.0 12.9 0.313729 1.074015
36 Roach 69.0 16.5 0.125538 0.010429
37 Roach 78.0 17.5 0.093487 0.000020
38 Roach 87.0 18.2 0.076283 0.001980
39 Roach 120.0 18.6 0.068387 0.006610

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Most influential roaches
print(roach.sort_values("cooks_dist", ascending = False))

species mass_g length_cm leverage cooks_dist


35 Roach 40.0 12.9 0.313729 1.074015 # really short roach
54 Roach 390.0 29.5 0.394740 0.365782 # really long roach
40 Roach 0.0 19.0 0.061897 0.311852 # roach with zero mass
52 Roach 290.0 24.0 0.099488 0.150064
51 Roach 180.0 23.6 0.088391 0.061209
.. ... ... ... ... ...
43 Roach 150.0 20.4 0.050264 0.000257
44 Roach 145.0 20.5 0.050092 0.000256
42 Roach 120.0 19.4 0.056815 0.000199
47 Roach 160.0 21.1 0.050910 0.000137
37 Roach 78.0 17.5 0.093487 0.000020

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Removing the most influential roach
roach_not_short = roach[roach["length_cm"] != 12.9]

sns.regplot(x="length_cm",
y="mass_g",
data=roach,
ci=None,
line_kws={"color": "green"})

sns.regplot(x="length_cm",
y="mass_g",
data=roach_not_short,
ci=None,
line_kws={"color": "red"})

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Why you need
logistic regression
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Bank churn dataset
has_churned  time_since_first_purchase  time_since_last_purchase
0            0.3993247                  -0.5158691
1            -0.4297957                 0.6780654
0            3.7383122                  0.4082544
0            0.6032289                  -0.6990435
...          ...                        ...
(response)   (length of relationship)   (recency of activity)

1 https://www.rdocumentation.org/packages/bayesQR/topics/Churn

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Churn vs. recency: a linear model
mdl_churn_vs_recency_lm = ols("has_churned ~ time_since_last_purchase",
data=churn).fit()

print(mdl_churn_vs_recency_lm.params)

Intercept 0.490780
time_since_last_purchase 0.063783
dtype: float64

intercept, slope = mdl_churn_vs_recency_lm.params

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Visualizing the linear model
sns.scatterplot(x="time_since_last_purchase",
y="has_churned",
data=churn)

plt.axline(xy1=(0, intercept),
slope=slope)

plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Zooming out
sns.scatterplot(x="time_since_last_purchase",
y="has_churned",
data=churn)

plt.axline(xy1=(0,intercept),
slope=slope)

plt.xlim(-10, 10)
plt.ylim(-0.2, 1.2)
plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


What is logistic regression?
Another type of generalized linear model.

Used when the response variable is logical.

The responses follow a logistic (S-shaped) curve.

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Logistic regression using logit()
from statsmodels.formula.api import logit
mdl_churn_vs_recency_logit = logit("has_churned ~ time_since_last_purchase",
data=churn).fit()
print(mdl_churn_vs_recency_logit.params)

Intercept -0.035019
time_since_last_purchase 0.269215
dtype: float64

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Visualizing the logistic model
sns.regplot(x="time_since_last_purchase",
y="has_churned",
data=churn,
ci=None,
logistic=True)

plt.axline(xy1=(0,intercept),
slope=slope,
color="black")

plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Zooming out

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Predictions and odds
ratios
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
The regplot() predictions
sns.regplot(x="time_since_last_purchase",
y="has_churned",
data=churn,
ci=None,
logistic=True)

plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Making predictions
mdl_recency = logit("has_churned ~ time_since_last_purchase",
data = churn).fit()

explanatory_data = pd.DataFrame(
{"time_since_last_purchase": np.arange(-1, 6.25, 0.25)})

prediction_data = explanatory_data.assign(
has_churned = mdl_recency.predict(explanatory_data))

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Adding point predictions
sns.regplot(x="time_since_last_purchase",
y="has_churned",
data=churn,
ci=None,
logistic=True)

sns.scatterplot(x="time_since_last_purchase",
y="has_churned",
data=prediction_data,
color="red")

plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Getting the most likely outcome
prediction_data = explanatory_data.assign(
has_churned = mdl_recency.predict(explanatory_data))
prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Visualizing most likely outcome
sns.regplot(x="time_since_last_purchase",
y="has_churned",
data=churn,
ci=None,
logistic=True)

sns.scatterplot(x="time_since_last_purchase",
y="most_likely_outcome",
data=prediction_data,
color="red")

plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Odds ratios
Odds ratio is the probability of something
happening divided by the probability that it
doesn't.

odds_ratio = probability / (1 − probability)

For example, with a probability of 0.25:

odds_ratio = 0.25 / (1 − 0.25) = 1/3

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Calculating odds ratio
prediction_data["odds_ratio"] = prediction_data["has_churned"] /
(1 - prediction_data["has_churned"])

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Visualizing odds ratio
sns.lineplot(x="time_since_last_purchase",
y="odds_ratio",
data=prediction_data)

plt.axhline(y=1,
linestyle="dotted")

plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Visualizing log odds ratio
sns.lineplot(x="time_since_last_purchase",
y="odds_ratio",
data=prediction_data)

plt.axhline(y=1,
linestyle="dotted")

plt.yscale("log")

plt.show()

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Calculating log odds ratio
prediction_data["log_odds_ratio"] = np.log(prediction_data["odds_ratio"])

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


All predictions together
time_since_last_purchase  has_churned  most_likely_outcome  odds_ratio  log_odds_ratio
0                         0.491        0                    0.966       -0.035
2                         0.623        1                    1.654       0.503
4                         0.739        1                    2.834       1.042
6                         0.829        1                    4.856       1.580
...                       ...          ...                  ...         ...

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Comparing scales
Scale                Values easy to interpret?  Changes easy to interpret?  Precise?
Probability          ✔                          ✘                           ✔
Most likely outcome  ✔✔                         ✔                           ✘
Odds ratio           ✔                          ✘                           ✔
Log odds ratio       ✘                          ✔                           ✔

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Quantifying logistic
regression fit
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
The four outcomes
               predicted false   predicted true
actual false   correct           false positive
actual true    false negative    correct

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Confusion matrix: counts of outcomes
actual_response = churn["has_churned"]

predicted_response = np.round(mdl_recency.predict())

outcomes = pd.DataFrame({"actual_response": actual_response,
                         "predicted_response": predicted_response})

print(outcomes.value_counts(sort=False))

actual_response predicted_response
0 0.0 141
1.0 59
1 0.0 111
1.0 89

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Visualizing the confusion matrix
conf_matrix = mdl_recency.pred_table()

print(conf_matrix)

[[141. 59.]
[111. 89.]]

[[true negative,  false positive],
 [false negative, true positive]]

from statsmodels.graphics.mosaicplot import mosaic

mosaic(conf_matrix)

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Accuracy
Accuracy is the proportion of correct predictions.

accuracy = (TN + TP) / (TN + FN + FP + TP)

[[141., 59.],
 [111., 89.]]

TN = conf_matrix[0,0]
TP = conf_matrix[1,1]
FN = conf_matrix[1,0]
FP = conf_matrix[0,1]

acc = (TN + TP) / (TN + TP + FN + FP)
print(acc)

0.575

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Sensitivity
Sensitivity is the proportion of true positives.

sensitivity = TP / (FN + TP)

[[141., 59.],
 [111., 89.]]

TN = conf_matrix[0,0]
TP = conf_matrix[1,1]
FN = conf_matrix[1,0]
FP = conf_matrix[0,1]

sens = TP / (FN + TP)
print(sens)

0.445

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Specificity
Specificity is the proportion of true negatives.

specificity = TN / (TN + FP)

[[141., 59.],
 [111., 89.]]

TN = conf_matrix[0,0]
TP = conf_matrix[1,1]
FN = conf_matrix[1,0]
FP = conf_matrix[0,1]

spec = TN / (TN + FP)
print(spec)

0.705
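
Putting the three metrics side by side, a minimal sketch that computes all of them from the confusion matrix above:

TN = conf_matrix[0,0]
TP = conf_matrix[1,1]
FN = conf_matrix[1,0]
FP = conf_matrix[0,1]

acc = (TN + TP) / (TN + TP + FN + FP)
sens = TP / (FN + TP)
spec = TN / (TN + FP)

print(acc, sens, spec)  # 0.575 0.445 0.705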

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Congratulations
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
You learned things
Chapter 1:
Fit a simple linear regression
Interpret coefficients

Chapter 2:
Make predictions
Regression to the mean
Transforming variables

Chapter 3:
Quantifying model fit
Outliers, leverage, and influence

Chapter 4:
Fit a simple logistic regression
Make predictions
Get performance from confusion matrix

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Multiple explanatory variables
Intermediate Regression with statsmodels in Python

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Unlocking advanced skills
Generalized Linear Models in Python

Introduction to Predictive Analytics in Python

Linear Classifiers in Python

INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON


Happy learning!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Parallel slopes linear
regression
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
The previous course
This course assumes knowledge from Introduction to Regression with statsmodels in Python

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


From simple regression to multiple regression
Multiple regression is a regression model with more than one explanatory variable.

More explanatory variables can give more insight and better predictions.

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


The course contents
Chapter 1:
"Parallel slopes" regression

Chapter 2:
Interactions
Simpson's Paradox

Chapter 3:
More explanatory variables
How linear regression works

Chapter 4:
Multiple logistic regression
The logistic distribution
How logistic regression works

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


The fish dataset
mass_g  length_cm  species
242.0   23.2       Bream
5.9     7.5        Perch
200.0   30.0       Pike
40.0    12.9       Roach

Each row represents a fish.
mass_g is the response variable.
1 numeric and 1 categorical explanatory variable.

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


One explanatory variable at a time
from statsmodels.formula.api import ols

mdl_mass_vs_length = ols("mass_g ~ length_cm",
                         data=fish).fit()
print(mdl_mass_vs_length.params)

Intercept   -536.223947
length_cm     34.899245
dtype: float64

1 intercept coefficient
1 slope coefficient

mdl_mass_vs_species = ols("mass_g ~ species + 0",
                          data=fish).fit()
print(mdl_mass_vs_species.params)

species[Bream]    617.828571
species[Perch]    382.239286
species[Pike]     718.705882
species[Roach]    152.050000
dtype: float64

1 intercept coefficient for each category

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Both variables at the same time
mdl_mass_vs_both = ols("mass_g ~ length_cm + species + 0",
data=fish).fit()

print(mdl_mass_vs_both.params)

species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554
dtype: float64

1 slope coefficient

1 intercept coefficient for each category

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Comparing coefficients
print(mdl_mass_vs_length.params)

Intercept   -536.223947
length_cm     34.899245

print(mdl_mass_vs_species.params)

species[Bream]    617.828571
species[Perch]    382.239286
species[Pike]     718.705882
species[Roach]    152.050000

print(mdl_mass_vs_both.params)

species[Bream]    -672.241866
species[Perch]    -713.292859
species[Pike]    -1089.456053
species[Roach]    -726.777799
length_cm           42.568554
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Visualization: 1 numeric explanatory variable
import matplotlib.pyplot as plt
import seaborn as sns

sns.regplot(x="length_cm",
y="mass_g",
data=fish,
ci=None)

plt.show()

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Visualization: 1 categorical explanatory variable
sns.boxplot(x="species",
y="mass_g",
data=fish,
showmeans=True)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Visualization: both explanatory variables
coeffs = mdl_mass_vs_both.params
print(coeffs)

species[Bream]    -672.241866
species[Perch]    -713.292859
species[Pike]    -1089.456053
species[Roach]    -726.777799
length_cm           42.568554

ic_bream, ic_perch, ic_pike, ic_roach, sl = coeffs

sns.scatterplot(x="length_cm",
                y="mass_g",
                hue="species",
                data=fish)

plt.axline(xy1=(0, ic_bream), slope=sl, color="blue")
plt.axline(xy1=(0, ic_perch), slope=sl, color="green")
plt.axline(xy1=(0, ic_pike), slope=sl, color="red")
plt.axline(xy1=(0, ic_roach), slope=sl, color="orange")

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Predicting parallel
slopes
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
The prediction workflow
import pandas as pd
import numpy as np

expl_data_length = pd.DataFrame(
    {"length_cm": np.arange(5, 61, 5)})
print(expl_data_length)

    length_cm
0           5
1          10
2          15
3          20
4          25
5          30
6          35
7          40
8          45
9          50
10         55
11         60

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


The prediction workflow
[A, B, C] x [1, 2] ==> [A1, B1, C1, A2, B2, C2]

from itertools import product
product(["A", "B", "C"], [1, 2])

length_cm = np.arange(5, 61, 5)
species = fish["species"].unique()

p = product(length_cm, species)

expl_data_both = pd.DataFrame(p,
                              columns=['length_cm',
                                       'species'])
print(expl_data_both)

    length_cm species
0           5   Bream
1           5   Roach
2           5   Perch
3           5    Pike
4          10   Bream
5          10   Roach
6          10   Perch
...
41         55   Roach
42         55   Perch
43         55    Pike
44         60   Bream
45         60   Roach
46         60   Perch
47         60    Pike

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


The prediction workflow
Predict mass_g from length_cm only:

prediction_data_length = expl_data_length.assign(
    mass_g = mdl_mass_vs_length.predict(expl_data_length)
)

   length_cm     mass_g
0          5  -361.7277
1         10  -187.2315
2         15   -12.7353
3         20   161.7610
4         25   336.2572
5         30   510.7534
...  # number of rows: 12

Predict mass_g from both explanatory variables:

prediction_data_both = expl_data_both.assign(
    mass_g = mdl_mass_vs_both.predict(expl_data_both)
)

   length_cm species     mass_g
0          5   Bream  -459.3991
1          5   Roach  -513.9350
2          5   Perch  -500.4501
3          5    Pike  -876.6133
4         10   Bream  -246.5563
5         10   Roach  -301.0923
...  # number of rows: 48

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Visualizing the predictions
plt.axline(xy1=(0, ic_bream), slope=sl, color="blue")
plt.axline(xy1=(0, ic_perch), slope=sl, color="green")
plt.axline(xy1=(0, ic_pike), slope=sl, color="red")
plt.axline(xy1=(0, ic_roach), slope=sl, color="orange")

sns.scatterplot(x="length_cm",
y="mass_g",
hue="species",
data=fish)

sns.scatterplot(x="length_cm",
y="mass_g",
color="black",
data=prediction_data)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Manually calculating predictions for linear regression
coeffs = mdl_mass_vs_length.params
print(coeffs)

Intercept   -536.223947
length_cm     34.899245

intercept, slope = coeffs

explanatory_data = pd.DataFrame(
    {"length_cm": np.arange(5, 61, 5)})

prediction_data = explanatory_data.assign(
    mass_g = intercept + slope * explanatory_data["length_cm"])

print(prediction_data)

    length_cm       mass_g
0           5  -361.727721
1          10  -187.231494
2          15   -12.735268
3          20   161.760959
4          25   336.257185
5          30   510.753412
...
9          50  1208.738318
10         55  1383.234545
11         60  1557.730771

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Manually calculating predictions for multiple
regression
coeffs = mdl_mass_vs_both.params
print(coeffs)

species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554

ic_bream, ic_perch, ic_pike, ic_roach, slope = coeffs

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


np.select()
conditions = [
condition_1,
condition_2,
# ...
condition_n
]

choices = [list_of_choices] # same length as conditions

np.select(conditions, choices)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Choosing an intercept with np.select()
conditions = [
    explanatory_data["species"] == "Bream",
    explanatory_data["species"] == "Perch",
    explanatory_data["species"] == "Pike",
    explanatory_data["species"] == "Roach"
]

choices = [ic_bream, ic_perch, ic_pike, ic_roach]

intercept = np.select(conditions, choices)
print(intercept)

[ -672.24  -726.78  -713.29 -1089.46  -672.24  -726.78  -713.29 -1089.46
  -672.24  -726.78  -713.29 -1089.46  -672.24  -726.78  -713.29 -1089.46
  -672.24  -726.78  -713.29 -1089.46  -672.24  -726.78  -713.29 -1089.46
  -672.24  -726.78  -713.29 -1089.46  -672.24  -726.78  -713.29 -1089.46
  -672.24  -726.78  -713.29 -1089.46  -672.24  -726.78  -713.29 -1089.46
  -672.24  -726.78  -713.29 -1089.46  -672.24  -726.78  -713.29 -1089.46]

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


The final prediction step
prediction_data = explanatory_data.assign(
    intercept = np.select(conditions, choices),
    mass_g = intercept + slope * explanatory_data["length_cm"])

print(prediction_data)

    length_cm species  intercept     mass_g
0           5   Bream  -672.2419  -459.3991
1           5   Roach  -726.7778  -513.9350
2           5   Perch  -713.2929  -500.4501
3           5    Pike -1089.4561  -876.6133
4          10   Bream  -672.2419  -246.5563
5          10   Roach  -726.7778  -301.0923
6          10   Perch  -713.2929  -287.6073
7          10    Pike -1089.4561  -663.7705
8          15   Bream  -672.2419   -33.7136
...
40         55   Bream  -672.2419  1669.0286
41         55   Roach  -726.7778  1614.4927
42         55   Perch  -713.2929  1627.9776
43         55    Pike -1089.4561  1251.8144
44         60   Bream  -672.2419  1881.8714
45         60   Roach  -726.7778  1827.3354
46         60   Perch  -713.2929  1840.8204
47         60    Pike -1089.4561  1464.6572

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Compare to .predict()
mdl_mass_vs_both.predict(explanatory_data)

0     -459.3991
1     -513.9350
2     -500.4501
3     -876.6133
4     -246.5563
5     -301.0923
...
43    1251.8144
44    1881.8714
45    1827.3354
46    1840.8204
47    1464.6572

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Assessing model
performance
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Model performance metrics
Coefficient of determination (R-squared): how well the linear regression line fits the
observed values. Larger is better.

Residual standard error (RSE): the typical size of the residuals. Smaller is better.

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Getting the coefficient of determination
print(mdl_mass_vs_length.rsquared)

0.8225689502644215

print(mdl_mass_vs_species.rsquared)

0.25814887709499157

print(mdl_mass_vs_both.rsquared)

0.9200433561156649

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Adjusted coefficient of determination
More explanatory variables increase R². Too many explanatory variables cause overfitting.

The adjusted coefficient of determination penalizes more explanatory variables:

R̄² = 1 − (1 − R²) × (n_obs − 1) / (n_obs − n_var − 1)

The penalty is noticeable when R² is small, or when n_var is a large fraction of n_obs.

In statsmodels, it's contained in the rsquared_adj attribute.
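
As a minimal check of this formula against statsmodels (using the simple length model fitted earlier; nobs and df_model are standard results attributes giving the observation and explanatory-variable counts):

rsq = mdl_mass_vs_length.rsquared
n_obs = mdl_mass_vs_length.nobs      # number of observations
n_var = mdl_mass_vs_length.df_model  # number of explanatory variables

rsq_adj = 1 - (1 - rsq) * (n_obs - 1) / (n_obs - n_var - 1)

print(rsq_adj)                          # 0.8211...
print(mdl_mass_vs_length.rsquared_adj)  # matches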

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Getting the adjusted coefficient of determination
print("rsq_length: ", mdl_mass_vs_length.rsquared)
print("rsq_adj_length: ", mdl_mass_vs_length.rsquared_adj)

rsq_length: 0.8225689502644215
rsq_adj_length: 0.8211607673300121

print("rsq_species: ", mdl_mass_vs_species.rsquared)


print("rsq_adj_species: ", mdl_mass_vs_species.rsquared_adj)

rsq_species: 0.25814887709499157
rsq_adj_species: 0.24020086605696722

print("rsq_both: ", mdl_mass_vs_both.rsquared


print("rsq_adj_both: ", mdl_mass_vs_both.rsquared_adj)

rsq_both: 0.9200433561156649
rsq_adj_both: 0.9174431400543857

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Getting the residual standard error
rse_length = np.sqrt(mdl_mass_vs_length.mse_resid)
print("rse_length: ", rse_length)

rse_length: 152.12092835414788

rse_species = np.sqrt(mdl_mass_vs_species.mse_resid)
print("rse_species: ", rse_species)

rse_species: 313.5501156682592

rse_both = np.sqrt(mdl_mass_vs_both.mse_resid)
print("rse_both: ", rse_both)

rse_both: 103.35563303966488

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Models for each
category
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Four categories
print(fish["species"].unique())

array(['Bream', 'Roach', 'Perch', 'Pike'], dtype=object)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Splitting the dataset
bream = fish[fish["species"] == "Bream"]
perch = fish[fish["species"] == "Perch"]
pike = fish[fish["species"] == "Pike"]
roach = fish[fish["species"] == "Roach"]

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Four models
mdl_bream = ols("mass_g ~ length_cm", data=bream).fit()
print(mdl_bream.params)

Intercept   -1035.3476
length_cm      54.5500

mdl_perch = ols("mass_g ~ length_cm", data=perch).fit()
print(mdl_perch.params)

Intercept   -619.1751
length_cm     38.9115

mdl_pike = ols("mass_g ~ length_cm", data=pike).fit()
print(mdl_pike.params)

Intercept   -1540.8243
length_cm      53.1949

mdl_roach = ols("mass_g ~ length_cm", data=roach).fit()
print(mdl_roach.params)

Intercept   -329.3762
length_cm     23.3193

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Explanatory data
explanatory_data = pd.DataFrame(
    {"length_cm": np.arange(5, 61, 5)})

print(explanatory_data)

    length_cm
0           5
1          10
2          15
3          20
4          25
5          30
6          35
7          40
8          45
9          50
10         55
11         60

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Making predictions
prediction_data_bream = explanatory_data.assign(
    mass_g = mdl_bream.predict(explanatory_data),
    species = "Bream")

prediction_data_perch = explanatory_data.assign(
    mass_g = mdl_perch.predict(explanatory_data),
    species = "Perch")

prediction_data_pike = explanatory_data.assign(
    mass_g = mdl_pike.predict(explanatory_data),
    species = "Pike")

prediction_data_roach = explanatory_data.assign(
    mass_g = mdl_roach.predict(explanatory_data),
    species = "Roach")

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Concatenating predictions
prediction_data = pd.concat([prediction_data_bream,
                             prediction_data_roach,
                             prediction_data_perch,
                             prediction_data_pike])

    length_cm       mass_g species
0           5  -762.597660   Bream
1          10  -489.847756   Bream
2          15  -217.097851   Bream
3          20    55.652054   Bream
4          25   328.401958   Bream
5          30   601.151863   Bream
...
3          20  -476.926955    Pike
4          25  -210.952626    Pike
5          30    55.021703    Pike
6          35   320.996032    Pike
7          40   586.970362    Pike
8          45   852.944691    Pike
9          50  1118.919020    Pike
10         55  1384.893349    Pike
11         60  1650.867679    Pike

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Visualizing predictions
sns.lmplot(x="length_cm",
y="mass_g",
data=fish,
hue="species",
ci=None)
plt.show()

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Adding in your predictions
sns.lmplot(x="length_cm",
y="mass_g",
data=fish,
hue="species",
ci=None)

sns.scatterplot(x="length_cm",
y="mass_g",
data=prediction_data,
hue="species",
ci=None,
legend=False)

plt.show()

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Coefficient of determination
mdl_fish = ols("mass_g ~ length_cm + species",
               data=fish).fit()
print(mdl_fish.rsquared_adj)

0.917

print(mdl_bream.rsquared_adj)

0.874

print(mdl_perch.rsquared_adj)

0.917

print(mdl_pike.rsquared_adj)

0.941

print(mdl_roach.rsquared_adj)

0.815

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Residual standard error
print(np.sqrt(mdl_fish.mse_resid))

103

print(np.sqrt(mdl_bream.mse_resid))

74.2

print(np.sqrt(mdl_perch.mse_resid))

100

print(np.sqrt(mdl_pike.mse_resid))

120

print(np.sqrt(mdl_roach.mse_resid))

38.2

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
One model with an
interaction
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
What is an interaction?
In the fish dataset
Different fish species have different mass to length ratios.

The effect of length on the expected mass is different for different species.

More generally
The effect of one explanatory variable on the expected response changes depending on the
value of another explanatory variable.

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Specifying interactions
No interactions:

response ~ explntry1 + explntry2
mass_g ~ length_cm + species

With interactions (implicit):

response ~ explntry1 * explntry2
mass_g ~ length_cm * species

With interactions (explicit):

response ~ explntry1 + explntry2 + explntry1:explntry2
mass_g ~ length_cm + species + length_cm:species
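
The implicit and explicit forms describe the same model; as a minimal sketch (not from the slide), you can verify this by fitting both on the fish dataset and comparing coefficients:

mdl_implicit = ols("mass_g ~ length_cm * species", data=fish).fit()
mdl_explicit = ols("mass_g ~ length_cm + species + length_cm:species",
                   data=fish).fit()

# Same design matrix, so the fitted coefficients agree
print(np.allclose(mdl_implicit.params, mdl_explicit.params))  # True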

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Running the model
mdl_mass_vs_both = ols("mass_g ~ length_cm * species", data=fish).fit()

print(mdl_mass_vs_both.params)

Intercept -1035.3476
species[T.Perch] 416.1725
species[T.Pike] -505.4767
species[T.Roach] 705.9714
length_cm 54.5500
length_cm:species[T.Perch] -15.6385
length_cm:species[T.Pike] -1.3551
length_cm:species[T.Roach] -31.2307

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Easier to understand coefficients
mdl_mass_vs_both_inter = ols("mass_g ~ species + species:length_cm + 0", data=fish).fit()

print(mdl_mass_vs_both_inter.params)

species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Familiar numbers
print(mdl_mass_vs_both_inter.params)

species[Bream]             -1035.3476
species[Perch]              -619.1751
species[Pike]              -1540.8243
species[Roach]              -329.3762
species[Bream]:length_cm      54.5500
species[Perch]:length_cm      38.9115
species[Pike]:length_cm       53.1949
species[Roach]:length_cm      23.3193

print(mdl_bream.params)

Intercept   -1035.3476
length_cm      54.5500

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Making predictions
with interactions
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
The model with the interaction
mdl_mass_vs_both_inter = ols("mass_g ~ species + species:length_cm + 0",
data=fish).fit()

print(mdl_mass_vs_both_inter.params)

species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


The prediction flow
from itertools import product

length_cm = np.arange(5, 61, 5)
species = fish["species"].unique()

p = product(length_cm, species)

explanatory_data = pd.DataFrame(p,
                                columns=["length_cm",
                                         "species"])

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))

print(prediction_data)

    length_cm species     mass_g
0           5   Bream  -762.5977
1           5   Roach  -212.7799
2           5   Perch  -424.6178
3           5    Pike -1274.8499
4          10   Bream  -489.8478
5          10   Roach   -96.1836
6          10   Perch  -230.0604
7          10    Pike -1008.8756
8          15   Bream  -217.0979
...
40         55   Bream  1964.9014
41         55   Roach   953.1833
42         55   Perch  1520.9556
43         55    Pike  1384.8933
44         60   Bream  2237.6513
45         60   Roach  1069.7796
46         60   Perch  1715.5129
47         60    Pike  1650.8677

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Visualizing the predictions
sns.lmplot(x="length_cm",
y="mass_g",
data=fish,
hue="species",
ci=None)

sns.scatterplot(x="length_cm",
y="mass_g",
data=prediction_data,
hue="species")

plt.show()

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Manually calculating the predictions
coeffs = mdl_mass_vs_both_inter.params

species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193

(ic_bream, ic_perch, ic_pike, ic_roach,
 slope_bream, slope_perch, slope_pike, slope_roach) = coeffs

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Manually calculating the predictions
conditions = [
    explanatory_data["species"] == "Bream",
    explanatory_data["species"] == "Perch",
    explanatory_data["species"] == "Pike",
    explanatory_data["species"] == "Roach"
]

ic_choices = [ic_bream, ic_perch, ic_pike, ic_roach]
intercept = np.select(conditions, ic_choices)

slope_choices = [slope_bream, slope_perch, slope_pike, slope_roach]
slope = np.select(conditions, slope_choices)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Manually calculating the predictions
Manual calculation:

prediction_data = explanatory_data.assign(
    mass_g = intercept + slope * explanatory_data["length_cm"])
print(prediction_data)

Using .predict():

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))
print(prediction_data)

Both print the same result:

    length_cm species     mass_g
0           5   Bream  -762.5977
1           5   Roach  -212.7799
2           5   Perch  -424.6178
3           5    Pike -1274.8499
4          10   Bream  -489.8478
5          10   Roach   -96.1836
...
43         55    Pike  1384.8933
44         60   Bream  2237.6513
45         60   Roach  1069.7796
46         60   Perch  1715.5129
47         60    Pike  1650.8677

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Simpson's Paradox
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
A most ingenious paradox!
Simpson's Paradox occurs when the trend of a model on the whole dataset is very different
from the trends shown by models on subsets of the dataset.

trend = slope coefficient

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Synthetic Simpson data
5 groups of data, labeled "A" to "E"

x         y         group
62.24344  70.60840  D
52.33499  14.70577  B
56.36795  46.39554  C
66.80395  66.17487  D
66.53605  89.24658  E
62.38129  91.45260  E

1 https://www.rdocumentation.org/packages/datasauRus/topics/simpsons_paradox

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Linear regressions
Whole dataset:

mdl_whole = ols("y ~ x",
                data=simpsons_paradox).fit()
print(mdl_whole.params)

Intercept   -38.554
x             1.751

By group:

mdl_by_group = ols("y ~ group + group:x + 0",
                   data=simpsons_paradox).fit()
print(mdl_by_group.params)

groupA     groupB     groupC     groupD     groupE
32.5051    67.3886    99.6333   132.3932   123.8242
groupA:x   groupB:x   groupC:x   groupD:x   groupE:x
-0.6266    -1.0105    -0.9940    -0.9908    -0.5364

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Plotting the whole dataset
sns.regplot(x="x",
y="y",
data=simpsons_paradox,
ci=None)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Plotting by group
sns.lmplot(x="x",
y="y",
data=simpsons_paradox,
hue="group",
ci=None)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Reconciling the difference
Good advice
If possible, try to plot the dataset.

Common advice
You can't choose the best model in general – it depends on the dataset and the question you
are trying to answer.

More good advice


Articulate a question before you start modeling.

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Test score example

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Infectious disease example

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Reconciling the difference
Usually (but not always) the grouped model contains more insight.

Are you missing explanatory variables?

Context is important.

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Simpson's paradox in real datasets
The paradox is usually less obvious.

You may see a zero slope rather than a complete change in direction.

It may not appear in every group.

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Two numeric
explanatory
variables
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Visualizing three numeric variables
3D scatter plot

2D scatter plot with response as color

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Another column for the fish dataset
species mass_g length_cm height_cm
Bream 1000 33.5 18.96
Bream 925 36.2 18.75
Roach 290 24.0 8.88
Roach 390 29.5 9.48
Perch 1100 39.0 12.80
Perch 1000 40.2 12.60
Pike 1250 52.0 10.69
Pike 1650 59.0 10.81

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


3D scatter plot

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


2D scatter plot, color for response
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Modeling with two numeric explanatory variables
mdl_mass_vs_both = ols("mass_g ~ length_cm + height_cm",
data=fish).fit()

print(mdl_mass_vs_both.params)

Intercept -622.150234
length_cm 28.968405
height_cm 26.334804

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


The prediction flow
from itertools import product

length_cm = np.arange(5, 61, 5)
height_cm = np.arange(2, 21, 2)

p = product(length_cm, height_cm)

explanatory_data = pd.DataFrame(p,
                                columns=["length_cm",
                                         "height_cm"])

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_both.predict(explanatory_data))

print(prediction_data)

     length_cm  height_cm       mass_g
0            5          2  -424.638603
1            5          4  -371.968995
2            5          6  -319.299387
3            5          8  -266.629780
4            5         10  -213.960172
..         ...        ...          ...
115         60         12  1431.971694
116         60         14  1484.641302
117         60         16  1537.310909
118         60         18  1589.980517
119         60         20  1642.650125

[120 rows x 3 columns]

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Plotting the predictions
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")

sns.scatterplot(x="length_cm",
y="height_cm",
data=prediction_data,
hue="mass_g",
legend=False,
marker="s")

plt.show()

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Including an interaction
mdl_mass_vs_both_inter = ols("mass_g ~ length_cm * height_cm",
data=fish).fit()

print(mdl_mass_vs_both_inter.params)

Intercept 159.107480
length_cm 0.301426
height_cm -78.125178
length_cm:height_cm 3.545435

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


The prediction flow with an interaction
length_cm = np.arange(5, 61, 5)
height_cm = np.arange(2, 21, 2)

p = product(length_cm, height_cm)

explanatory_data = pd.DataFrame(p,
columns=["length_cm",
"height_cm"])

prediction_data = explanatory_data.assign(
mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Plotting the predictions
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")

sns.scatterplot(x="length_cm",
y="height_cm",
data=prediction_data,
hue="mass_g",
legend=False,
marker="s")

plt.show()

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
More than two
explanatory
variables
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
From last time
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Faceting by species
grid = sns.FacetGrid(data=fish,
col="species",
hue="mass_g",
col_wrap=2,
palette="plasma")

grid.map(sns.scatterplot,
"length_cm",
"height_cm")

plt.show()

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Faceting by species
It's possible to use more than one categorical variable for faceting

Beware of faceting overuse

Plotting becomes harder with an increasing number of variables

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Different levels of interaction
No interactions:

ols("mass_g ~ length_cm + height_cm + species + 0", data=fish).fit()

Two-way interactions between pairs of variables:

ols("mass_g ~ length_cm + height_cm + species "
    "+ length_cm:height_cm + length_cm:species + height_cm:species + 0",
    data=fish).fit()

Three-way interaction between all three variables:

ols("mass_g ~ length_cm + height_cm + species "
    "+ length_cm:height_cm + length_cm:species + height_cm:species "
    "+ length_cm:height_cm:species + 0",
    data=fish).fit()

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


All the interactions
ols("mass_g ~ length_cm + height_cm + species "
    "+ length_cm:height_cm + length_cm:species + height_cm:species "
    "+ length_cm:height_cm:species + 0",
    data=fish).fit()

same as

ols("mass_g ~ length_cm * height_cm * species + 0",
    data=fish).fit()

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Only two-way interactions
ols("mass_g ~ length_cm + height_cm + species "
    "+ length_cm:height_cm + length_cm:species + height_cm:species + 0",
    data=fish).fit()

same as

ols("mass_g ~ (length_cm + height_cm + species) ** 2 + 0",
    data=fish).fit()

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


The prediction flow
mdl_mass_vs_all = ols(
    "mass_g ~ length_cm * height_cm * species + 0",
    data=fish).fit()

length_cm = np.arange(5, 61, 5)
height_cm = np.arange(2, 21, 2)
species = fish["species"].unique()

p = product(length_cm, height_cm, species)

explanatory_data = pd.DataFrame(p,
                                columns=["length_cm",
                                         "height_cm",
                                         "species"])

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_all.predict(explanatory_data))

     length_cm  height_cm species       mass_g
0            5          2   Bream  -570.656437
1            5          2   Roach    31.449145
2            5          2   Perch    43.789984
3            5          2    Pike   271.270093
4            5          4   Bream  -451.127405
..         ...        ...     ...          ...
475         60         18    Pike  2690.346384
476         60         20   Bream  1531.618475
477         60         20   Roach  2621.797668
478         60         20   Perch  3041.931709
479         60         20    Pike  2926.352397

[480 rows x 4 columns]

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
How linear
regression works
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
The standard simple linear regression plot

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Visualizing residuals

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


A metric for the best fit
The simplest idea (which doesn't work)
Take the sum of all the residuals.

Some residuals are negative.

The next simplest idea (which does work)


Take the square of each residual, and add up those squares.

This is called the sum of squares.
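
A minimal illustration with made-up residuals of why the plain sum fails but the sum of squares works:

import numpy as np

residuals = np.array([3.0, -3.0, 1.0, -1.0])

print(np.sum(residuals))       # 0.0: cancellation hides the errors
print(np.sum(residuals ** 2))  # 20.0: reflects the size of the errors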

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


A detour into numerical optimization
A line plot of a quadratic equation

x = np.arange(-4, 5, 0.1)
y = x ** 2 - x + 10

xy_data = pd.DataFrame({"x": x,
"y": y})

sns.lineplot(x="x",
y="y",
data=xy_data)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Using calculus to solve the equation
y = x² − x + 10

∂y/∂x = 2x − 1

Set the derivative to zero and solve:

0 = 2x − 1
x = 0.5
y = 0.5² − 0.5 + 10 = 9.75

Not all equations can be solved like this.

You can let Python figure it out.

Don't worry if this doesn't make sense; you won't need it for the exercises.

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


minimize()
from scipy.optimize import minimize

def calc_quadratic(x):
    y = x ** 2 - x + 10
    return y

minimize(fun=calc_quadratic,
         x0=3)

      fun: 9.75
 hess_inv: array([[0.5]])
      jac: array([0.])
  message: 'Optimization terminated successfully.'
     nfev: 6
      nit: 2
     njev: 3
   status: 0
  success: True
        x: array([0.49999998])

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


A linear regression algorithm
Define a function to calculate the sum of squares metric:

def calc_sum_of_squares(coeffs):
    intercept, slope = coeffs
    # More calculation!

Call minimize() to find coefficients that minimize this function:

minimize(
    fun=calc_sum_of_squares,
    x0=[0, 0]
)
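
Filling in the elided calculation, a minimal runnable sketch; the x and y arrays here are assumed example data (any paired numeric arrays work), not values from the course:

import numpy as np
from scipy.optimize import minimize

# Assumed example data: lengths and masses
x = np.array([23.2, 24.0, 23.9, 26.3, 26.5])
y = np.array([242.0, 290.0, 340.0, 363.0, 430.0])

def calc_sum_of_squares(coeffs):
    intercept, slope = coeffs
    y_pred = intercept + slope * x
    return np.sum((y - y_pred) ** 2)

# One starting value per coefficient
result = minimize(fun=calc_sum_of_squares, x0=[0, 0])
print(result.x)  # fitted intercept and slope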

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Multiple logistic
regression
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Bank churn dataset
has_churned  time_since_first_purchase  time_since_last_purchase
0            0.3993247                  -0.5158691
1            -0.4297957                 0.6780654
0            3.7383122                  0.4082544
0            0.6032289                  -0.6990435
...          ...                        ...
(response)   (length of relationship)   (recency of activity)

1 https://www.rdocumentation.org/packages/bayesQR/topics/Churn

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


logit()
from statsmodels.formula.api import logit

logit("response ~ explanatory", data=dataset).fit()

logit("response ~ explanatory1 + explanatory2", data=dataset).fit()

logit("response ~ explanatory1 * explanatory2", data=dataset).fit()

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


The four outcomes
               predicted false   predicted true
actual false   correct           false positive
actual true    false negative    correct

conf_matrix = mdl_logit.pred_table()

print(conf_matrix)

[[102. 98.]
[ 53. 147.]]

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Prediction flow
from itertools import product

explanatory1 = some_values
explanatory2 = some_values

p = product(explanatory1, explanatory2)
explanatory_data = pd.DataFrame(p,
columns=["explanatory1",
"explanatory2"])
prediction_data = explanatory_data.assign(
    has_churned = mdl_logit.predict(explanatory_data))

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Visualization
prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])

sns.scatterplot(...
data=churn,
hue="has_churned",
...)

sns.scatterplot(...
data=prediction_data,
hue="most_likely_outcome",
...)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
The logistic
distribution
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Gaussian probability density function (PDF)
from scipy.stats import norm

x = np.arange(-4, 4.05, 0.05)

gauss_dist = pd.DataFrame({
"x": x,
"gauss_pdf": norm.pdf(x)}
)

sns.lineplot(x="x",
y="gauss_pdf",
data=gauss_dist)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Gaussian cumulative distribution function (CDF)
x = np.arange(-4, 4.05, 0.05)

gauss_dist = pd.DataFrame({
"x": x,
"gauss_pdf": norm.pdf(x),
"gauss_cdf": norm.cdf(x)}
)

sns.lineplot(x="x",
y="gauss_cdf",
data=gauss_dist)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON



Gaussian inverse CDF
p = np.arange(0.001, 1, 0.001)

gauss_dist_inv = pd.DataFrame({
"p": p,
"gauss_inv_cdf": norm.ppf(p)}
)

sns.lineplot(x="p",
y="gauss_inv_cdf",
data=gauss_dist_inv)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Logistic PDF
from scipy.stats import logistic

x = np.arange(-4, 4.05, 0.05)

logistic_dist = pd.DataFrame({
"x": x,
"log_pdf": logistic.pdf(x)}
)

sns.lineplot(x="x",
y="log_pdf",
data=logistic_dist)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Logistic distribution
The logistic distribution's CDF is also called the logistic function.

cdf(x) = 1 / (1 + exp(−x))

The logistic distribution's inverse CDF is also called the logit function.

inverse_cdf(p) = log(p / (1 − p))
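
A minimal check (not from the slide) that scipy's logistic distribution matches these formulas:

import numpy as np
from scipy.stats import logistic

x = 0.25
print(logistic.cdf(x))         # logistic function
print(1 / (1 + np.exp(-x)))    # same value

p = 0.25
print(logistic.ppf(p))         # logit function (inverse CDF)
print(np.log(p / (1 - p)))     # same value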

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
How logistic
regression works
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
Sum of squares doesn't work
np.sum((y_pred - y_actual) ** 2)

y_actual is always 0 or 1 .

y_pred is between 0 and 1 .

There is a better metric than sum of squares.

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Likelihood
y_pred * y_actual

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Likelihood
y_pred * y_actual + (1 - y_pred) * (1 - y_actual)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Likelihood
np.sum(y_pred * y_actual + (1 - y_pred) * (1 - y_actual))

When y_actual = 1

y_pred * 1 + (1 - y_pred) * (1 - 1) = y_pred

When y_actual = 0

y_pred * 0 + (1 - y_pred) * (1 - 0) = 1 - y_pred

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Log-likelihood
Computing likelihood involves adding many very small numbers, leading to numerical error.

Log-likelihood is easier to compute.

log_likelihood = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)

For each observation, this expression equals the logarithm of the likelihood term, so both metrics lead to the same fitted coefficients.
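
A minimal numeric check of that equality, with made-up predictions and responses:

import numpy as np

y_pred = np.array([0.2, 0.8, 0.6])
y_actual = np.array([0, 1, 1])

likelihood = y_pred * y_actual + (1 - y_pred) * (1 - y_actual)
log_likelihood = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)

print(np.log(likelihood))  # [-0.2231 -0.2231 -0.5108]
print(log_likelihood)      # same values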

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Negative log-likelihood
Maximizing log-likelihood is the same as minimizing negative log-likelihood.

-np.sum(log_likelihoods)

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Logistic regression algorithm
from scipy.optimize import minimize

def calc_neg_log_likelihood(coeffs):
    intercept, slope = coeffs
    # More calculation!

minimize(
    fun=calc_neg_log_likelihood,
    x0=[0, 0]
)
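
For completeness, a minimal runnable sketch of the elided calculation; x and y here are assumed example data (an explanatory array and 0/1 responses), not values from the course:

import numpy as np
from scipy.optimize import minimize

# Assumed example data: recency values and 0/1 churn responses
x = np.array([-0.5, 0.7, 0.4, -0.7, 1.3])
y = np.array([0, 1, 0, 1, 1])

def calc_neg_log_likelihood(coeffs):
    intercept, slope = coeffs
    # Logistic function maps the linear predictor to a probability
    y_pred = 1 / (1 + np.exp(-(intercept + slope * x)))
    log_likelihood = np.log(y_pred) * y + np.log(1 - y_pred) * (1 - y)
    return -np.sum(log_likelihood)

result = minimize(fun=calc_neg_log_likelihood, x0=[0, 0])
print(result.x)  # fitted intercept and slope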

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Congratulations!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N

Maarten Van den Broeck


Content Developer at DataCamp
You learned things
Chapter 1:
Fit/visualize/predict/assess parallel slopes regression

Chapter 2:
Interactions between explanatory variables
Simpson's Paradox

Chapter 3:
Extend to many explanatory variables
Implement linear regression algorithm

Chapter 4:
Logistic regression with multiple explanatory variables
Logistic distribution
Implement logistic regression algorithm

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


There is more to learn
Training and testing sets

Cross validation

P-values and significance

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Advanced regression
Generalized Linear Models in Python

Introduction to Predictive Analytics in Python

Linear Classifiers in Python

Machine Learning with Tree-Based Models in Python

INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON


Have fun regressing!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Sampling and point
estimates
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Estimating the population of France
A census asks every household how many
people live there.

SAMPLING IN PYTHON
There are lots of people in France
Censuses are really expensive!

SAMPLING IN PYTHON
Sampling households
Cheaper to ask a small number of households
and use statistics to estimate the population

Working with a subset of the whole population is called sampling

SAMPLING IN PYTHON
Population vs. sample
The population is the complete dataset

Doesn't have to refer to people

Typically, don't know what the whole population is

The sample is the subset of data you calculate on

SAMPLING IN PYTHON
Coffee rating dataset
total_cup_points variety country_of_origin aroma flavor aftertaste body balance
90.58 NA Ethiopia 8.67 8.83 8.67 8.50 8.42
89.92 Other Ethiopia 8.75 8.67 8.50 8.42 8.42
... ... ... ... ... ... ... ...
73.75 NA Vietnam 6.75 6.67 6.5 6.92 6.83

Each row represents 1 coffee

1338 rows

We'll treat this as the population

SAMPLING IN PYTHON
Points vs. flavor: population
pts_vs_flavor_pop = coffee_ratings[["total_cup_points", "flavor"]]

total_cup_points flavor
0 90.58 8.83
1 89.92 8.67
2 89.75 8.50
3 89.00 8.58
4 88.83 8.50
... ... ...
1333 78.75 7.58
1334 78.08 7.67
1335 77.17 7.33
1336 75.08 6.83
1337 73.75 6.67

[1338 rows x 2 columns]

SAMPLING IN PYTHON
Points vs. flavor: 10 row sample
pts_vs_flavor_samp = pts_vs_flavor_pop.sample(n=10)

total_cup_points flavor
1088 80.33 7.17
1157 79.67 7.42
1267 76.17 7.33
506 83.00 7.67
659 82.50 7.42
817 81.92 7.50
1050 80.67 7.42
685 82.42 7.50
1027 80.92 7.25
62 85.58 8.17

[10 rows x 2 columns]

SAMPLING IN PYTHON
Python sampling for Series
Use .sample() for pandas DataFrames and Series

cup_points_samp = coffee_ratings['total_cup_points'].sample(n=10)

1088 80.33
1157 79.67
1267 76.17
... ...
685 82.42
1027 80.92
62 85.58
Name: total_cup_points, dtype: float64

SAMPLING IN PYTHON
Population parameters & point estimates
A population parameter is a calculation made on the population dataset

import numpy as np
np.mean(pts_vs_flavor_pop['total_cup_points'])

82.15120328849028

A point estimate or sample statistic is a calculation made on the sample dataset

np.mean(cup_points_samp)

81.31800000000001

SAMPLING IN PYTHON
Point estimates with pandas
pts_vs_flavor_pop['flavor'].mean()

7.526046337817639

pts_vs_flavor_samp['flavor'].mean()

7.485000000000001

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Convenience
sampling
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
The Literary Digest election prediction

Prediction: Landon gets 57%; Roosevelt gets 43%

Actual results: Landon got 38%; Roosevelt got 62%

Sample not representative of population, causing sample bias

Collecting data by the easiest method is called convenience sampling

SAMPLING IN PYTHON
Finding the mean age of French people
Survey 10 people at Disneyland Paris
Mean age of 24.6 years

Will this be a good estimate for all of France?

1 Image by Sean MacEntee

SAMPLING IN PYTHON
How accurate was the survey?
Year Average French Age
1975 31.6
1985 33.6
1995 36.2
2005 38.9
2015 41.2

24.6 years is a poor estimate

People who visit Disneyland aren't representative of the whole population

SAMPLING IN PYTHON
Convenience sampling coffee ratings
coffee_ratings["total_cup_points"].mean()

82.15120328849028

coffee_ratings_first10 = coffee_ratings.head(10)

coffee_ratings_first10["total_cup_points"].mean()

89.1

SAMPLING IN PYTHON
Visualizing selection bias
import matplotlib.pyplot as plt
import numpy as np
coffee_ratings["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()

coffee_ratings_first10["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()

SAMPLING IN PYTHON
Distribution of a population and of a convenience
sample
Population: Convenience sample:

SAMPLING IN PYTHON
Visualizing selection bias for a random sample
coffee_sample = coffee_ratings.sample(n=10)
coffee_sample["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()

SAMPLING IN PYTHON
Distribution of a population and of a simple random
sample
Population: Random Sample:

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Pseudo-random
number generation
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
What does random mean?
{adjective} made, done, happening, or chosen without method or conscious decision.

1 Oxford Languages

SAMPLING IN PYTHON
True random numbers
Generated from physical processes, like flipping coins
Hotbits uses radioactive decay

RANDOM.ORG uses atmospheric noise

True randomness is expensive

1 https://www.fourmilab.ch/hotbits 2 https://www.random.org

SAMPLING IN PYTHON
Pseudo-random number generation
Pseudo-random number generation is cheap and fast
Next "random" number calculated from previous "random" number

The first "random" number calculated from a seed

The same seed value yields the same random numbers

SAMPLING IN PYTHON
Pseudo-random number generation example
seed = 1
calc_next_random(seed)  # first "random" number, calculated from the seed: 3

calc_next_random(3)     # next "random" number, calculated from the previous one: 2

calc_next_random(2)     # ...and so on
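One classic pseudo-random scheme is the linear congruential generator. A toy calc_next_random might look like this (the constants are illustrative; they are not what NumPy actually uses):

# Toy linear congruential generator: each number is computed from the previous one
def calc_next_random(previous, a=48271, c=0, m=2**31 - 1):
    return (a * previous + c) % m

seed = 1
x1 = calc_next_random(seed)  # deterministic given the seed
x2 = calc_next_random(x1)    # same seed, same sequence, every time
print(x1, x2)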

SAMPLING IN PYTHON
Random number generating functions
Prepend with numpy.random , such as numpy.random.beta()

function distribution
.beta Beta
.binomial Binomial
.chisquare Chi-squared
.exponential Exponential
.f F
.gamma Gamma
.geometric Geometric
.hypergeometric Hypergeometric
.lognormal Lognormal
.negative_binomial Negative binomial
.normal Normal
.poisson Poisson
.standard_t t
.uniform Uniform

SAMPLING IN PYTHON
Visualizing random numbers
randoms = np.random.beta(a=2, b=2, size=5000)
randoms

array([0.6208281 , 0.73216171, 0.44298403, ...,
       0.13411873, 0.52198411, 0.72355098])

plt.hist(randoms, bins=np.arange(0, 1, 0.05))
plt.show()

SAMPLING IN PYTHON
Random numbers seeds
First run:

np.random.seed(20000229)

np.random.normal(loc=2, scale=1.5, size=2)

array([-0.59030264, 1.87821258])

np.random.normal(loc=2, scale=1.5, size=2)

array([2.52619561, 4.9684949 ])

Second run (same seed, same numbers):

np.random.seed(20000229)

np.random.normal(loc=2, scale=1.5, size=2)

array([-0.59030264, 1.87821258])

np.random.normal(loc=2, scale=1.5, size=2)

array([2.52619561, 4.9684949 ])

SAMPLING IN PYTHON
Using a different seed
Seed 20000229:

np.random.seed(20000229)

np.random.normal(loc=2, scale=1.5, size=2)

array([-0.59030264, 1.87821258])

np.random.normal(loc=2, scale=1.5, size=2)

array([2.52619561, 4.9684949 ])

Seed 20041004 (different seed, different numbers):

np.random.seed(20041004)

np.random.normal(loc=2, scale=1.5, size=2)

array([1.09364337, 4.55285159])

np.random.normal(loc=2, scale=1.5, size=2)

array([2.67038916, 2.36677492])

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Simple random and
systematic sampling
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Simple random sampling

SAMPLING IN PYTHON
Simple random sampling of coffees

SAMPLING IN PYTHON
Simple random sampling with pandas
coffee_ratings.sample(n=5, random_state=19000113)

total_cup_points variety country_of_origin aroma flavor \


437 83.25 None Colombia 7.92 7.75
285 83.83 Yellow Bourbon Brazil 7.92 7.50
784 82.08 None Colombia 7.50 7.42
648 82.58 Caturra Colombia 7.58 7.50
155 84.58 Caturra Colombia 7.42 7.67

aftertaste body balance


437 7.25 7.83 7.58
285 7.33 8.17 7.50
784 7.42 7.67 7.42
648 7.42 7.67 7.42
155 7.75 8.08 7.83

SAMPLING IN PYTHON
Systematic sampling

SAMPLING IN PYTHON
Systematic sampling - defining the interval
sample_size = 5
pop_size = len(coffee_ratings)
print(pop_size)

1338

interval = pop_size // sample_size


print(interval)

267

SAMPLING IN PYTHON
Systematic sampling - selecting the rows
coffee_ratings.iloc[::interval]

total_cup_points variety country_of_origin aroma flavor aftertaste \


0 90.58 None Ethiopia 8.67 8.83 8.67
267 83.92 None Colombia 7.83 7.75 7.58
534 82.92 Bourbon El Salvador 7.50 7.50 7.75
801 82.00 Typica Taiwan 7.33 7.50 7.17
1068 80.50 Other Taiwan 7.17 7.17 7.17

body balance
0 8.50 8.42
267 7.75 7.75
534 7.92 7.83
801 7.50 7.33
1068 7.17 7.25

SAMPLING IN PYTHON
The trouble with systematic sampling
coffee_ratings_with_id = coffee_ratings.reset_index()
coffee_ratings_with_id.plot(x="index", y="aftertaste", kind="scatter")
plt.show()

Systematic sampling is only safe if we don't see a pattern in this scatter plot

SAMPLING IN PYTHON
Making systematic sampling safe
shuffled = coffee_ratings.sample(frac=1)
shuffled = shuffled.reset_index(drop=True).reset_index()
shuffled.plot(x="index", y="aftertaste", kind="scatter")
plt.show()

Shuffling rows + systematic sampling is the same as simple random sampling

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Stratified and
weighted random
sampling
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Coffees by country
top_counts = coffee_ratings['country_of_origin'].value_counts()
top_counts.head(6)

country_of_origin
Mexico 236
Colombia 183
Guatemala 181
Brazil 132
Taiwan 75
United States (Hawaii) 73
dtype: int64

1 The dataset lists Hawaii and Taiwan as countries for convenience, as they are notable coffee-growing regions.

SAMPLING IN PYTHON
Filtering for 6 countries
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]

top_counted_subset = coffee_ratings['country_of_origin'].isin(top_counted_countries)

coffee_ratings_top = coffee_ratings[top_counted_subset]

SAMPLING IN PYTHON
Counts of a simple random sample
coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)

coffee_ratings_samp['country_of_origin'].value_counts(normalize=True)

country_of_origin
Mexico 0.250000
Guatemala 0.204545
Colombia 0.181818
Brazil 0.181818
United States (Hawaii) 0.102273
Taiwan 0.079545
dtype: float64

SAMPLING IN PYTHON
Comparing proportions
Population: 10% simple random sample:

Mexico 0.268182 Mexico 0.250000


Colombia 0.207955 Guatemala 0.204545
Guatemala 0.205682 Colombia 0.181818
Brazil 0.150000 Brazil 0.181818
Taiwan 0.085227 United States (Hawaii) 0.102273
United States (Hawaii) 0.082955 Taiwan 0.079545
Name: country_of_origin, dtype: float64 Name: country_of_origin, dtype: float64

SAMPLING IN PYTHON
Proportional stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=0.1, random_state=2021)

coffee_ratings_strat['country_of_origin'].value_counts(normalize=True)

Mexico 0.272727
Guatemala 0.204545
Colombia 0.204545
Brazil 0.147727
Taiwan 0.090909
United States (Hawaii) 0.079545
Name: country_of_origin, dtype: float64

SAMPLING IN PYTHON
Equal counts stratified sampling
coffee_ratings_eq = coffee_ratings_top.groupby("country_of_origin")\
.sample(n=15, random_state=2021)

coffee_ratings_eq['country_of_origin'].value_counts(normalize=True)

Taiwan 0.166667
Brazil 0.166667
United States (Hawaii) 0.166667
Guatemala 0.166667
Mexico 0.166667
Colombia 0.166667
Name: country_of_origin, dtype: float64

SAMPLING IN PYTHON
Weighted random sampling
Specify weights to adjust the relative probability of a row being sampled

import numpy as np
coffee_ratings_weight = coffee_ratings_top
condition = coffee_ratings_weight['country_of_origin'] == "Taiwan"

coffee_ratings_weight['weight'] = np.where(condition, 2, 1)

coffee_ratings_weight = coffee_ratings_weight.sample(frac=0.1, weights="weight")

SAMPLING IN PYTHON
Weighted random sampling results
10% weighted sample:

coffee_ratings_weight['country_of_origin'].value_counts(normalize=True)

Brazil 0.261364
Mexico 0.204545
Guatemala 0.204545
Taiwan 0.170455
Colombia 0.090909
United States (Hawaii) 0.068182
Name: country_of_origin, dtype: float64

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Cluster sampling
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Stratified sampling vs. cluster sampling
Stratified sampling
Split the population into subgroups

Use simple random sampling on every subgroup

Cluster sampling
Use simple random sampling to pick some subgroups

Use simple random sampling on only those subgroups

SAMPLING IN PYTHON
Varieties of coffee
varieties_pop = list(coffee_ratings['variety'].unique())

[None, 'Other', 'Bourbon', 'Catimor',
 'Ethiopian Yirgacheffe', 'Caturra',
 'SL14', 'Sumatra', 'SL34', 'Hawaiian Kona',
 'Yellow Bourbon', 'SL28', 'Gesha', 'Catuai',
 'Pacamara', 'Typica', 'Sumatra Lintong',
 'Mundo Novo', 'Java', 'Peaberry', 'Pacas',
 'Mandheling', 'Ruiru 11', 'Arusha',
 'Ethiopian Heirlooms', 'Moka Peaberry',
 'Sulawesi', 'Blue Mountain', 'Marigojipe',
 'Pache Comun']

SAMPLING IN PYTHON
Stage 1: sampling for subgroups
import random
varieties_samp = random.sample(varieties_pop, k=3)

['Hawaiian Kona', 'Bourbon', 'SL28']

SAMPLING IN PYTHON
Stage 2: sampling each group
variety_condition = coffee_ratings['variety'].isin(varieties_samp)
coffee_ratings_cluster = coffee_ratings[variety_condition]

coffee_ratings_cluster['variety'] = coffee_ratings_cluster['variety'].cat.remove_unused_categories()

coffee_ratings_cluster.groupby("variety")\
.sample(n=5, random_state=2021)

SAMPLING IN PYTHON
Stage 2 output
total_cup_points variety country_of_origin ...
variety
Bourbon 575 82.83 Bourbon Guatemala
560 82.83 Bourbon Guatemala
524 83.00 Bourbon Guatemala
1140 79.83 Bourbon Guatemala
318 83.67 Bourbon Brazil
Hawaiian Kona 1291 73.67 Hawaiian Kona United States (Hawaii)
1266 76.25 Hawaiian Kona United States (Hawaii)
488 83.08 Hawaiian Kona United States (Hawaii)
461 83.17 Hawaiian Kona United States (Hawaii)
117 84.83 Hawaiian Kona United States (Hawaii)
SL28 137 84.67 SL28 Kenya
452 83.17 SL28 Kenya
224 84.17 SL28 Kenya
66 85.50 SL28 Kenya
559 82.83 SL28 Kenya

SAMPLING IN PYTHON
Multistage sampling
Cluster sampling is a type of multistage sampling
Can have > 2 stages

E.g., countrywide surveys may sample states, counties, cities, and neighborhoods
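A hedged sketch of what such a multistage sample could look like in code; the survey DataFrame and its "state" and "county" columns are hypothetical names, not from this course:

import random

# Stage 1: sample 5 states
states = random.sample(sorted(survey['state'].unique()), k=5)
stage1 = survey[survey['state'].isin(states)]

# Stage 2: sample 10 counties within those states
counties = random.sample(sorted(stage1['county'].unique()), k=10)
stage2 = stage1[stage1['county'].isin(counties)]

# Stage 3: simple random sample of households within each county
final_sample = stage2.groupby('county').sample(n=20)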

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Comparing
sampling methods
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Review of sampling techniques - setup
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]

subset_condition = coffee_ratings['country_of_origin'].isin(top_counted_countries)
coffee_ratings_top = coffee_ratings[subset_condition]

coffee_ratings_top.shape

(880, 8)

SAMPLING IN PYTHON
Review of simple random sampling
coffee_ratings_srs = coffee_ratings_top.sample(frac=1/3, random_state=2021)

coffee_ratings_srs.shape

(293, 8)

SAMPLING IN PYTHON
Review of stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=1/3, random_state=2021)

coffee_ratings_strat.shape

(293, 8)

SAMPLING IN PYTHON
Review of cluster sampling
import random
top_countries_samp = random.sample(top_counted_countries, k=2)
top_condition = coffee_ratings_top['country_of_origin'].isin(top_countries_samp)
coffee_ratings_cluster = coffee_ratings_top[top_condition]
coffee_ratings_cluster['country_of_origin'] = coffee_ratings_cluster['country_of_origin']\
.cat.remove_unused_categories()

coffee_ratings_clust = coffee_ratings_cluster.groupby("country_of_origin")\
.sample(n=len(coffee_ratings_top) // 6)

coffee_ratings_clust.shape

(292, 8)

SAMPLING IN PYTHON
Calculating mean cup points
Population:

coffee_ratings_top['total_cup_points'].mean()

81.94700000000002

Simple random sample:

coffee_ratings_srs['total_cup_points'].mean()

81.95982935153583

Stratified sample:

coffee_ratings_strat['total_cup_points'].mean()

81.92566552901025

Cluster sample:

coffee_ratings_clust['total_cup_points'].mean()

82.03246575342466

SAMPLING IN PYTHON
Mean cup points by country: simple random
Population:

coffee_ratings_top.groupby("country_of_origin")\
    ['total_cup_points'].mean()

country_of_origin
Brazil 82.405909
Colombia 83.106557
Guatemala 81.846575
Mexico 80.890085
Taiwan 82.001333
United States (Hawaii) 81.820411
Name: total_cup_points, dtype: float64

Simple random sample:

coffee_ratings_srs.groupby("country_of_origin")\
    ['total_cup_points'].mean()

country_of_origin
Brazil 82.414878
Colombia 82.925536
Guatemala 82.045385
Mexico 81.100714
Taiwan 81.744333
United States (Hawaii) 82.008000
Name: total_cup_points, dtype: float64

SAMPLING IN PYTHON
Mean cup points by country: stratified
Population:

coffee_ratings_top.groupby("country_of_origin")\
    ['total_cup_points'].mean()

country_of_origin
Brazil 82.405909
Colombia 83.106557
Guatemala 81.846575
Mexico 80.890085
Taiwan 82.001333
United States (Hawaii) 81.820411
Name: total_cup_points, dtype: float64

Stratified sample:

coffee_ratings_strat.groupby("country_of_origin")\
    ['total_cup_points'].mean()

country_of_origin
Brazil 82.499773
Colombia 83.288197
Guatemala 81.727667
Mexico 80.994684
Taiwan 81.846800
United States (Hawaii) 81.051667
Name: total_cup_points, dtype: float64

SAMPLING IN PYTHON
Mean cup points by country: cluster
Population:

coffee_ratings_top.groupby("country_of_origin")\
    ['total_cup_points'].mean()

country_of_origin
Brazil 82.405909
Colombia 83.106557
Guatemala 81.846575
Mexico 80.890085
Taiwan 82.001333
United States (Hawaii) 81.820411
Name: total_cup_points, dtype: float64

Cluster sample:

coffee_ratings_clust.groupby("country_of_origin")\
    ['total_cup_points'].mean()

country_of_origin
Colombia 83.128904
Mexico 80.936027
Name: total_cup_points, dtype: float64

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Relative error of
point estimates
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Sample size is number of rows
len(coffee_ratings.sample(n=300))

300

len(coffee_ratings.sample(frac=0.25))

334

SAMPLING IN PYTHON
Various sample sizes
coffee_ratings['total_cup_points'].mean()

82.15120328849028

coffee_ratings.sample(n=10)['total_cup_points'].mean()

83.027

coffee_ratings.sample(n=100)['total_cup_points'].mean()

82.4897

coffee_ratings.sample(n=1000)['total_cup_points'].mean()

82.1186

SAMPLING IN PYTHON
Relative errors
Population parameter:

population_mean = coffee_ratings['total_cup_points'].mean()

Point estimate:

sample_mean = coffee_ratings.sample(n=sample_size)['total_cup_points'].mean()

Relative error as a percentage:

rel_error_pct = 100 * abs(population_mean-sample_mean) / population_mean
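The errors DataFrame plotted on the next slide isn't constructed on these slides; a minimal sketch of how it could be built (the sample sizes chosen here are arbitrary):

import pandas as pd

population_mean = coffee_ratings['total_cup_points'].mean()
sample_sizes = list(range(10, len(coffee_ratings) + 1, 10))
relative_errors = [
    100 * abs(population_mean -
              coffee_ratings.sample(n=n)['total_cup_points'].mean()) / population_mean
    for n in sample_sizes
]
errors = pd.DataFrame({"sample_size": sample_sizes,
                       "relative_error": relative_errors})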

SAMPLING IN PYTHON
Relative error vs. sample size
import matplotlib.pyplot as plt
errors.plot(x="sample_size",
y="relative_error",
kind="line")
plt.show()

Properties:

Really noisy, particularly for small samples

Amplitude is initially steep, then flattens

Relative error decreases to zero (when the sample size = population)

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Creating a sampling
distribution
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Same code, different answer
coffee_ratings.sample(n=30)['total_cup_points'].mean() coffee_ratings.sample(n=30)['total_cup_points'].mean()

82.53066666666668 81.97566666666667

coffee_ratings.sample(n=30)['total_cup_points'].mean() coffee_ratings.sample(n=30)['total_cup_points'].mean()

82.68 81.675

SAMPLING IN PYTHON
Same code, 1000 times
mean_cup_points_1000 = []
for i in range(1000):
    mean_cup_points_1000.append(
        coffee_ratings.sample(n=30)['total_cup_points'].mean()
    )
print(mean_cup_points_1000)

[82.11933333333333, 82.55300000000001, 82.07266666666668, 81.76966666666667,


...
82.74166666666666, 82.45033333333335, 81.77199999999999, 82.8163333333333]

SAMPLING IN PYTHON
Distribution of sample means for size 30
import matplotlib.pyplot as plt
plt.hist(mean_cup_points_1000, bins=30)
plt.show()

A sampling distribution is a distribution of replicates of point estimates.

SAMPLING IN PYTHON
Different sample sizes
Sample size: 6 Sample size: 150

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Approximate
sampling
distributions
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
4 dice
dice = expand_grid(
    {'die1': [1, 2, 3, 4, 5, 6],
     'die2': [1, 2, 3, 4, 5, 6],
     'die3': [1, 2, 3, 4, 5, 6],
     'die4': [1, 2, 3, 4, 5, 6]}
)

      die1  die2  die3  die4
0        1     1     1     1
1        1     1     1     2
2        1     1     1     3
3        1     1     1     4
4        1     1     1     5
...    ...   ...   ...   ...
1291     6     6     6     2
1292     6     6     6     3
1293     6     6     6     4
1294     6     6     6     5
1295     6     6     6     6

[1296 rows x 4 columns]
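Note that expand_grid isn't part of pandas; it's a small helper used in this course. One plausible implementation (an assumption, not the course's exact code) builds the Cartesian product of the value lists:

import pandas as pd
from itertools import product

def expand_grid(data_dict):
    # One row per combination of the dict's value lists
    rows = list(product(*data_dict.values()))
    return pd.DataFrame(rows, columns=list(data_dict.keys()))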

SAMPLING IN PYTHON
Mean roll
dice['mean_roll'] = (dice['die1'] +
                     dice['die2'] +
                     dice['die3'] +
                     dice['die4']) / 4
print(dice)

      die1  die2  die3  die4  mean_roll
0        1     1     1     1       1.00
1        1     1     1     2       1.25
2        1     1     1     3       1.50
3        1     1     1     4       1.75
4        1     1     1     5       2.00
...    ...   ...   ...   ...        ...
1291     6     6     6     2       5.00
1292     6     6     6     3       5.25
1293     6     6     6     4       5.50
1294     6     6     6     5       5.75
1295     6     6     6     6       6.00

[1296 rows x 5 columns]

SAMPLING IN PYTHON
Exact sampling distribution
dice['mean_roll'] = dice['mean_roll'].astype('category')
dice['mean_roll'].value_counts(sort=False).plot(kind="bar")

SAMPLING IN PYTHON
The number of outcomes increases fast
import pandas as pd

n_dice = list(range(1, 101))
n_outcomes = []
for n in n_dice:
    n_outcomes.append(6**n)

outcomes = pd.DataFrame(
    {"n_dice": n_dice,
     "n_outcomes": n_outcomes})

outcomes.plot(x="n_dice",
y="n_outcomes",
kind="scatter")
plt.show()

SAMPLING IN PYTHON
Simulating the mean of four dice rolls
import numpy as np

np.random.choice(list(range(1, 7)), size=4, replace=True).mean()

SAMPLING IN PYTHON
Simulating the mean of four dice rolls
import numpy as np
sample_means_1000 = []
for i in range(1000):
    sample_means_1000.append(
        np.random.choice(list(range(1, 7)), size=4, replace=True).mean()
    )
print(sample_means_1000)

[3.25, 3.25, 1.75, 2.0, 2.0, 1.0, 1.0, 2.75, 2.75, 2.5, 3.0, 2.0, 2.75,
...
1.25, 2.0, 2.5, 2.5, 3.75, 1.5, 1.75, 2.25, 2.0, 1.5, 3.25, 3.0, 3.5]

SAMPLING IN PYTHON
Approximate sampling distribution
plt.hist(sample_means_1000, bins=20)

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Standard errors and
the Central Limit
Theorem
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Sampling distribution of mean cup points
Sample size: 5 Sample size: 20

Sample size: 80 Sample size: 320

SAMPLING IN PYTHON
Consequences of the central limit theorem

Averages of independent samples have approximately normal distributions.

As the sample size increases,

The distribution of the averages gets closer to being normally distributed

The width of the sampling distribution gets narrower

SAMPLING IN PYTHON
Population & sampling distribution means
coffee_ratings['total_cup_points'].mean()

82.15120328849028

Use np.mean() on each approximate sampling distribution:

Sample size  Mean sample mean
5            82.18420719999999
20           82.1558634
80           82.14510154999999
320          82.154017925

SAMPLING IN PYTHON
Population & sampling distribution standard deviations
coffee_ratings['total_cup_points'].std(ddof=0)

2.685858187306438

Sample size  Std dev sample mean
5            1.1886358227738543
20           0.5940321141669805
80           0.2934024263916487
320          0.13095083089190876

Specify ddof=0 when calling .std() on populations

Specify ddof=1 when calling np.std() on samples or sampling distributions
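A quick illustration of what ddof changes:

import numpy as np

x = [1, 2, 3, 4]
print(np.std(x, ddof=0))  # 1.118..., divides by n (population formula)
print(np.std(x, ddof=1))  # 1.290..., divides by n - 1 (sample formula)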

SAMPLING IN PYTHON
Population mean over square root sample size
Sample size Std dev sample mean Calculation Result
5 1.1886358227738543 2.685858187306438 / sqrt(5) 1.201

20 0.5940321141669805 2.685858187306438 / sqrt(20) 0.601

80 0.2934024263916487 2.685858187306438 / sqrt(80) 0.300

320 0.13095083089190876 2.685858187306438 / sqrt(320) 0.150

SAMPLING IN PYTHON
Standard error
Standard deviation of the sampling distribution
Important tool in understanding sampling variability

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Introduction to
bootstrapping
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
With or without
Sampling without replacement: Sampling with replacement ("resampling"):

SAMPLING IN PYTHON
Simple random sampling without replacement
Population: Sample:

SAMPLING IN PYTHON
Simple random sampling with replacement
Population: Resample:

SAMPLING IN PYTHON
Why sample with replacement?
coffee_ratings : a sample of a larger population of all coffees

Each coffee in our sample represents many different hypothetical population coffees

Sampling with replacement is a proxy

SAMPLING IN PYTHON
Coffee data preparation
coffee_focus = coffee_ratings[["variety", "country_of_origin", "flavor"]]
coffee_focus = coffee_focus.reset_index()

index variety country_of_origin flavor


0 0 None Ethiopia 8.83
1 1 Other Ethiopia 8.67
2 2 Bourbon Guatemala 8.50
3 3 None Ethiopia 8.58
4 4 Other Ethiopia 8.50
... ... ... ... ...
1333 1333 None Ecuador 7.58
1334 1334 None Ecuador 7.67
1335 1335 None United States 7.33
1336 1336 None India 6.83
1337 1337 None Vietnam 6.67

[1338 rows x 4 columns]

SAMPLING IN PYTHON
Resampling with .sample()
coffee_resamp = coffee_focus.sample(frac=1, replace=True)

index variety country_of_origin flavor


1140 1140 Bourbon Guatemala 7.25
57 57 Bourbon Guatemala 8.00
1152 1152 Bourbon Mexico 7.08
621 621 Caturra Thailand 7.50
44 44 SL28 Kenya 8.08
... ... ... ... ...
996 996 Typica Mexico 7.33
1090 1090 Bourbon Guatemala 7.33
918 918 Other Guatemala 7.42
249 249 Caturra Colombia 7.67
467 467 Caturra Colombia 7.50

[1338 rows x 4 columns]

SAMPLING IN PYTHON
Repeated coffees
coffee_resamp["index"].value_counts() 658 5
167 4
363 4
357 4
1047 4
..
771 1
770 1
766 1
764 1
0 1
Name: index, Length: 868, dtype: int64

SAMPLING IN PYTHON
Missing coffees
num_unique_coffees = len(coffee_resamp.drop_duplicates(subset="index"))

868

len(coffee_ratings) - num_unique_coffees

470

SAMPLING IN PYTHON
Bootstrapping
The opposite of sampling from a population

Sampling: going from a population to a smaller sample

Bootstrapping: building up a theoretical population from the sample

Bootstrapping use case:

Develop understanding of sampling variability using a single sample

SAMPLING IN PYTHON
Bootstrapping process
1. Make a resample of the same size as the original sample
2. Calculate the statistic of interest for this bootstrap sample

3. Repeat steps 1 and 2 many times

The resulting statistics are bootstrap statistics, and they form a bootstrap distribution

SAMPLING IN PYTHON
Bootstrapping coffee mean flavor
import numpy as np
mean_flavors_1000 = []
for i in range(1000):
    mean_flavors_1000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )

SAMPLING IN PYTHON
Bootstrap distribution histogram
import matplotlib.pyplot as plt
plt.hist(mean_flavors_1000)
plt.show()

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Comparing
sampling and
bootstrap
distributions
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Coffee focused subset
coffee_sample = coffee_ratings[["variety", "country_of_origin", "flavor"]]\
.reset_index().sample(n=500)

index variety country_of_origin flavor


132 132 Other Costa Rica 7.58
51 51 None United States (Hawaii) 8.17
42 42 Yellow Bourbon Brazil 7.92
569 569 Bourbon Guatemala 7.67
.. ... ... ... ...
643 643 Catuai Costa Rica 7.42
356 356 Caturra Colombia 7.58
494 494 None Indonesia 7.58
169 169 None Brazil 7.81

[500 rows x 4 columns]

SAMPLING IN PYTHON
The bootstrap of mean coffee flavors
import numpy as np
mean_flavors_5000 = []
for i in range(5000):
    mean_flavors_5000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )
bootstrap_distn = mean_flavors_5000

SAMPLING IN PYTHON
Mean flavor bootstrap distribution
import matplotlib.pyplot as plt
plt.hist(bootstrap_distn, bins=15)
plt.show()

SAMPLING IN PYTHON
Sample, bootstrap distribution, population means
Sample mean:

coffee_sample['flavor'].mean()

7.5132200000000005

Estimated population mean:

np.mean(bootstrap_distn)

7.513357731999999

True population mean:

coffee_ratings['flavor'].mean()

7.526046337817639

SAMPLING IN PYTHON
Interpreting the means
Bootstrap distribution mean:

Usually close to the sample mean

May not be a good estimate of the population mean


Bootstrapping cannot correct biases from sampling

SAMPLING IN PYTHON
Sample sd vs. bootstrap distribution sd
Sample standard deviation:

coffee_sample['flavor'].std()

0.3540883911928703

Estimated population standard deviation?

np.std(bootstrap_distn, ddof=1)

0.015768474367958217

SAMPLING IN PYTHON
Sample, bootstrap dist'n, pop'n standard deviations
Sample standard deviation:

coffee_sample['flavor'].std()

0.3540883911928703

True standard deviation:

coffee_ratings['flavor'].std(ddof=0)

0.34125481224622645

Estimated population standard deviation:

standard_error = np.std(bootstrap_distn, ddof=1)

Standard error is the standard deviation of the statistic of interest

standard_error * np.sqrt(500)

0.3525938058821761

Standard error times square root of sample size estimates the population standard deviation

SAMPLING IN PYTHON
Interpreting the standard errors
Estimated standard error → standard deviation of the bootstrap distribution for a sample statistic

Population std. dev ≈ Std. Error × √Sample size
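As a quick numeric check with the values from this lesson, multiplying the bootstrap standard error by the square root of the sample size recovers something close to the population standard deviation:

import numpy as np

standard_error = 0.015768474367958217  # np.std(bootstrap_distn, ddof=1) from above
print(standard_error * np.sqrt(500))   # 0.3525..., close to the population value 0.3412...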

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Confidence intervals
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Confidence intervals
"Values within one standard deviation of the mean" includes a large number of values from
each of these distributions

We'll define a related concept called a confidence interval

SAMPLING IN PYTHON
Predicting the weather
Rapid City, South Dakota in the United
States has the least predictable weather

Our job is to predict the high temperature


there tomorrow

SAMPLING IN PYTHON
Our weather prediction
Point estimate = 47°F (8.3°C)
Range of plausible high temperature values = 40 to 54°F (4.4 to 12.8°C)

SAMPLING IN PYTHON
We just reported a confidence interval!
40 to 54°F is a confidence interval
Sometimes written as 47°F (40°F, 54°F) or 47°F [40°F, 54°F]

... or, 47 ± 7°F

7°F is the margin of error

SAMPLING IN PYTHON
Bootstrap distribution of mean flavor
import matplotlib.pyplot as plt
plt.hist(coffee_boot_distn, bins=15)
plt.show()

SAMPLING IN PYTHON
Mean of the resamples
import numpy as np
np.mean(coffee_boot_distn)

7.513452892

SAMPLING IN PYTHON
Mean plus or minus one standard deviation
np.mean(coffee_boot_distn)

7.513452892

np.mean(coffee_boot_distn) - np.std(coffee_boot_distn, ddof=1)

7.497385709174466

np.mean(coffee_boot_distn) + np.std(coffee_boot_distn, ddof=1)

7.529520074825534

SAMPLING IN PYTHON
Quantile method for confidence intervals
np.quantile(coffee_boot_distn, 0.025)

7.4817195

np.quantile(coffee_boot_distn, 0.975)

7.5448805

SAMPLING IN PYTHON
Inverse cumulative distribution function
PDF: The bell curve

CDF: integrate to get area under bell curve

Inv. CDF: flip x and y axes

Implemented in Python with

from scipy.stats import norm


norm.ppf(quantile, loc=0, scale=1)
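For example, the 97.5% quantile of the standard normal distribution is the familiar 1.96:

from scipy.stats import norm

print(norm.ppf(0.975, loc=0, scale=1))  # 1.959963984540054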

SAMPLING IN PYTHON
Standard error method for confidence interval
point_estimate = np.mean(coffee_boot_distn)

7.513452892

std_error = np.std(coffee_boot_distn, ddof=1)

0.016067182825533724

from scipy.stats import norm


lower = norm.ppf(0.025, loc=point_estimate, scale=std_error)
upper = norm.ppf(0.975, loc=point_estimate, scale=std_error)
print((lower, upper))

(7.481961792328933, 7.544943991671067)

SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Congratulations!
SAMPLING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Recap
Chapter 1

Sampling basics

Selection bias

Pseudo-random numbers

Chapter 2

Simple random sampling

Systematic sampling

Stratified sampling

Cluster sampling

Chapter 3

Sample size and population parameters

Creating sampling distributions

Approximate vs. actual sampling dist'ns

Central limit theorem

Chapter 4

Bootstrapping from a single sample

Standard error

Confidence intervals

SAMPLING IN PYTHON
The most important things

The std. deviation of a bootstrap statistic is a good approximation of the standard error

Can assume bootstrap distributions are normally distributed for confidence intervals

SAMPLING IN PYTHON
What's next?
Experimental Design in Python and Customer Analytics and A/B Testing in Python
Hypothesis Testing in Python

Foundations of Probability in Python and Bayesian Data Analysis in Python

SAMPLING IN PYTHON
Happy learning!
SAMPLING IN PYTHON
Hypothesis tests and
z-scores
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
A/B testing
In 2013, Electronic Arts (EA) released SimCity 5

They wanted to increase pre-orders of the game

They used A/B testing to test different advertising scenarios

This involves splitting users into control and treatment groups

1 Image credit: "Electronic Arts" by majaX1 CC BY-NC-SA 2.0

HYPOTHESIS TESTING IN PYTHON


Retail webpage A/B test
Control: Treatment:

HYPOTHESIS TESTING IN PYTHON


A/B test results
The treatment group (no ad) got 43.4% more purchases than the control group (with ad)
Intuition that "showing an ad would increase sales" was false

Was this result statistically significant or just chance?

Need EA's data to determine this

Techniques from Sampling in Python + this course to do so

HYPOTHESIS TESTING IN PYTHON


Stack Overflow Developer Survey 2020
import pandas as pd
print(stack_overflow)

respondent age_1st_code ... age hobbyist


0 36.0 30.0 ... 34.0 Yes
1 47.0 10.0 ... 53.0 Yes
2 69.0 12.0 ... 25.0 Yes
3 125.0 30.0 ... 41.0 Yes
4 147.0 15.0 ... 28.0 No
... ... ... ... ... ...
2259 62867.0 13.0 ... 33.0 Yes
2260 62882.0 13.0 ... 28.0 Yes

[2261 rows x 8 columns]

HYPOTHESIS TESTING IN PYTHON


Hypothesizing about the mean
A hypothesis:

The mean annual compensation of the population of data scientists is $110,000

The point estimate (sample statistic):

mean_comp_samp = stack_overflow['converted_comp'].mean()

119574.71738168952

HYPOTHESIS TESTING IN PYTHON


Generating a bootstrap distribution
import numpy as np
# Step 3. Repeat steps 1 & 2 many times, appending to a list
so_boot_distn = []
for i in range(5000):
    so_boot_distn.append(
        # Step 2. Calculate point estimate
        np.mean(
            # Step 1. Resample
            stack_overflow.sample(frac=1, replace=True)['converted_comp']
        )
    )

1 Bootstrap distributions are taught in Chapter 4 of Sampling in Python

HYPOTHESIS TESTING IN PYTHON


Visualizing the bootstrap distribution
import matplotlib.pyplot as plt
plt.hist(so_boot_distn, bins=50)
plt.show()

HYPOTHESIS TESTING IN PYTHON


Standard error
std_error = np.std(so_boot_distn, ddof=1)

5607.997577378606

HYPOTHESIS TESTING IN PYTHON


z-scores
standardized value = (value − mean) / standard deviation

z = (sample stat − hypoth. param. value) / standard error

HYPOTHESIS TESTING IN PYTHON


z = (sample stat − hypoth. param. value) / standard error
stack_overflow['converted_comp'].mean()

119574.71738168952

mean_comp_hyp = 110000

std_error

5607.997577378606

z_score = (mean_comp_samp - mean_comp_hyp) / std_error

1.7073326529796957

HYPOTHESIS TESTING IN PYTHON


Testing the hypothesis
Is 1.707 a high or low number?
This is the goal of the course!

HYPOTHESIS TESTING IN PYTHON


Testing the hypothesis
Is 1.707 a high or low number?
This is the goal of the course!

Hypothesis testing use case:

Determine whether sample statistics are close to or far away from expected (or
"hypothesized" values)

HYPOTHESIS TESTING IN PYTHON


Standard normal (z) distribution
Standard normal distribution: a normal distribution with mean = 0 and standard deviation = 1

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
p-values
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Criminal trials
Two possible true states:
1. Defendant committed the crime

2. Defendant did not commit the crime

Two possible verdicts:


1. Guilty

2. Not guilty

Initially the defendant is assumed to be not guilty

Prosecution must present evidence "beyond reasonable doubt" for a guilty verdict

HYPOTHESIS TESTING IN PYTHON


Age of first programming experience
age_first_code_cut classifies when a Stack Overflow user first started programming
"adult" means they started at 14 or older

"child" means they started before 14

Previous research: 35% of software developers started programming as children

Is there evidence that a greater proportion of data scientists started programming as children?

HYPOTHESIS TESTING IN PYTHON


Definitions
A hypothesis is a statement about an unknown population parameter

A hypothesis test is a test of two competing hypotheses

The null hypothesis (H0 ) is the existing idea

The alternative hypothesis (HA ) is the new "challenger" idea of the researcher

For our problem:

H0 : The proportion of data scientists starting programming as children is 35%


HA : The proportion of data scientists starting programming as children is greater than 35%

1"Naught" is British English for "zero". For historical reasons, "H-naught" is the international convention for
pronouncing the null hypothesis.

HYPOTHESIS TESTING IN PYTHON


Criminal trials vs. hypothesis testing
Either HA or H0 is true (not both)
Initially, H0 is assumed to be true

The test ends in either "reject H0 " or "fail to reject H0 "

If the evidence from the sample is "significant" that HA is true, reject H0 , else choose H0

Significance level is "beyond a reasonable doubt" for hypothesis testing

HYPOTHESIS TESTING IN PYTHON


One-tailed and two-tailed tests
Hypothesis tests check if the sample statistics
lie in the tails of the null distribution

Test Tails
alternative different from null two-tailed
alternative greater than null right-tailed
alternative less than null left-tailed

HA : The proportion of data scientists starting


programming as children is greater than 35%

This is a right-tailed test

HYPOTHESIS TESTING IN PYTHON


p-values
p-values: the probability of obtaining a result at least as extreme as the observed one,
assuming the null hypothesis is true

Large p-value, large support for H0


Statistic likely not in the tail of the null
distribution
Small p-value, strong evidence against H0
Statistic likely in the tail of the null
distribution
"p" in p-value → probability

"small" means "close to zero"

HYPOTHESIS TESTING IN PYTHON


Calculating the z-score
prop_child_samp = (stack_overflow['age_first_code_cut'] == "child").mean()

0.39141972578505085

prop_child_hyp = 0.35

std_error = np.std(first_code_boot_distn, ddof=1)

0.010351057228878566

z_score = (prop_child_samp - prop_child_hyp) / std_error

4.001497129152506

HYPOTHESIS TESTING IN PYTHON


Calculating the p-value
norm.cdf() is normal CDF from scipy.stats .

Left-tailed test → use norm.cdf() .

Right-tailed test → use 1 - norm.cdf() .

from scipy.stats import norm


1 - norm.cdf(z_score, loc=0, scale=1)

3.1471479512323874e-05

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
Statistical
significance
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
p-value recap
p-values quantify the evidence against the null hypothesis
Large p-value → fail to reject null hypothesis

Small p-value → reject null hypothesis

Where is the cutoff point?

HYPOTHESIS TESTING IN PYTHON


Significance level
The significance level of a hypothesis test (α) is the threshold point for "beyond a
reasonable doubt"

Common values of α are 0.2, 0.1, 0.05, and 0.01

If p ≤ α, reject H0 , else fail to reject H0


α should be set prior to conducting the hypothesis test

HYPOTHESIS TESTING IN PYTHON


Calculating the p-value
alpha = 0.05
prop_child_samp = (stack_overflow['age_first_code_cut'] == "child").mean()
prop_child_hyp = 0.35
std_error = np.std(first_code_boot_distn, ddof=1)

z_score = (prop_child_samp - prop_child_hyp) / std_error

p_value = 1 - norm.cdf(z_score, loc=0, scale=1)

3.1471479512323874e-05

HYPOTHESIS TESTING IN PYTHON


Making a decision
alpha = 0.05
print(p_value)

3.1471479512323874e-05

p_value <= alpha

True

Reject H0 in favor of HA

HYPOTHESIS TESTING IN PYTHON


Confidence intervals
For a significance level of α, it's common to choose a confidence interval level of 1 - α

α = 0.05 → 95% confidence interval

import numpy as np
lower = np.quantile(first_code_boot_distn, 0.025)
upper = np.quantile(first_code_boot_distn, 0.975)
print((lower, upper))

(0.37063246351172047, 0.41132242370632466)

HYPOTHESIS TESTING IN PYTHON


Types of errors
                     Truly didn't commit crime   Truly committed crime
Verdict not guilty   correct                     they got away with it
Verdict guilty       wrongful conviction         correct

            actual H0        actual HA
chosen H0   correct          false negative
chosen HA   false positive   correct

False positives are Type I errors; false negatives are Type II errors.

HYPOTHESIS TESTING IN PYTHON


Possible errors in our example
If p ≤ α, we reject H0 :

A false positive (Type I) error: data scientists didn't start coding as children at a higher rate

If p > α, we fail to reject H0 :

A false negative (Type II) error: data scientists started coding as children at a higher rate

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
Performing t-tests
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Two-sample problems
Compare sample statistics across groups of a variable
converted_comp is a numerical variable

age_first_code_cut is a categorical variable with levels "child" and "adult"

Are users who first programmed as a child compensated higher than those that started as
adults?

HYPOTHESIS TESTING IN PYTHON


Hypotheses
H0 : The mean compensation (in USD) is the same for those that coded first as a child and
those that coded first as an adult.

H0 : μchild = μadult

H0 : μchild − μadult = 0

HA : The mean compensation (in USD) is greater for those that coded first as a child
compared to those that coded first as an adult.

HA : μchild > μadult

HA : μchild − μadult > 0

HYPOTHESIS TESTING IN PYTHON


Calculating groupwise summary statistics
stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()

age_first_code_cut
adult 111313.311047
child 132419.570621
Name: converted_comp, dtype: float64

HYPOTHESIS TESTING IN PYTHON


Test statistics
Sample mean estimates the population mean

x̄ - a sample mean
x̄child - sample mean compensation for coding first as a child
x̄adult - sample mean compensation for coding first as an adult
x̄child − x̄adult - a test statistic
z-score - a (standardized) test statistic

HYPOTHESIS TESTING IN PYTHON


Standardizing the test statistic
z = (sample stat − population parameter) / standard error

t = (difference in sample stats − difference in population parameters) / standard error

t = ((x̄child − x̄adult) − (μchild − μadult)) / SE(x̄child − x̄adult)

HYPOTHESIS TESTING IN PYTHON


Standard error
SE(x̄child − x̄adult) ≈ √(s²child/nchild + s²adult/nadult)

s is the standard deviation of the variable

n is the sample size (number of observations/rows in sample)

HYPOTHESIS TESTING IN PYTHON


Assuming the null hypothesis is true
t = ((x̄child − x̄adult) − (μchild − μadult)) / SE(x̄child − x̄adult)

H0 : μchild − μadult = 0  →  t = (x̄child − x̄adult) / SE(x̄child − x̄adult)

t = (x̄child − x̄adult) / √(s²child/nchild + s²adult/nadult)

HYPOTHESIS TESTING IN PYTHON


Calculations assuming the null hypothesis is true
xbar = stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()

age_first_code_cut
adult 111313.311047
child 132419.570621
Name: converted_comp, dtype: float64

s = stack_overflow.groupby('age_first_code_cut')['converted_comp'].std()

age_first_code_cut
adult 271546.521729
child 255585.240115
Name: converted_comp, dtype: float64

n = stack_overflow.groupby('age_first_code_cut')['converted_comp'].count()

age_first_code_cut
adult 1376
child 885
Name: converted_comp, dtype: int64
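The next slide works with scalar values such as xbar_child and n_adult; presumably they are pulled out of these groupwise Series along these lines (a sketch; the extraction step isn't shown on the slides):

# Pull scalars out of the groupwise Series by label
xbar_child, xbar_adult = xbar['child'], xbar['adult']
s_child, s_adult = s['child'], s['adult']
n_child, n_adult = n['child'], n['adult']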

HYPOTHESIS TESTING IN PYTHON


Calculating the test statistic
t = (x̄child − x̄adult) / √(s²child/nchild + s²adult/nadult)

import numpy as np
numerator = xbar_child - xbar_adult
denominator = np.sqrt(s_child ** 2 / n_child + s_adult ** 2 / n_adult)
t_stat = numerator / denominator

1.8699313316221844

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
Calculating p-values
from t-statistics
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
t-distributions
t statistic follows a t-distribution
Have a parameter named degrees of
freedom, or df
Look like normal distributions, with fatter
tails

HYPOTHESIS TESTING IN PYTHON


Degrees of freedom
Larger degrees of freedom → t-distribution gets closer to the normal distribution

Normal distribution → t-distribution with infinite df

Degrees of freedom: maximum number of logically independent values in the data sample

HYPOTHESIS TESTING IN PYTHON


Calculating degrees of freedom
Dataset has 5 independent observations
Four of the values are 2, 6, 8, and 5

The sample mean is 5

The last value must be 4

Here, there are 4 degrees of freedom

df = nchild + nadult − 2

HYPOTHESIS TESTING IN PYTHON


Hypotheses
H0 : The mean compensation (in USD) is the same for those that coded first as a child and
those that coded first as an adult

HA : The mean compensation (in USD) is greater for those that coded first as a child
compared to those that coded first as an adult

Use a right-tailed test

HYPOTHESIS TESTING IN PYTHON


Significance level
α = 0.1

If p ≤ α then reject H0 .

HYPOTHESIS TESTING IN PYTHON


Calculating p-values: one proportion vs. a value
from scipy.stats import norm
1 - norm.cdf(z_score)

SE(x̄child − x̄adult) ≈ √(s²child/nchild + s²adult/nadult)

z-statistic: needed when using one sample statistic to estimate a population parameter

t-statistic: needed when using multiple sample statistics to estimate a population parameter

HYPOTHESIS TESTING IN PYTHON


Calculating p-values: two means from different groups
numerator = xbar_child - xbar_adult
denominator = np.sqrt(s_child ** 2 / n_child + s_adult ** 2 / n_adult)
t_stat = numerator / denominator

1.8699313316221844

degrees_of_freedom = n_child + n_adult - 2

2259

HYPOTHESIS TESTING IN PYTHON


Calculating p-values: two means from different groups
Use t-distribution CDF not normal CDF

from scipy.stats import t


1 - t.cdf(t_stat, df=degrees_of_freedom)

0.030811302165157595

Evidence that Stack Overflow data scientists who started coding as a child earn more.

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
Paired t-tests
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
US Republican presidents dataset
state county repub_percent_08 repub_percent_12
0 Alabama Hale 38.957877 37.139882
1 Arkansas Nevada 56.726272 58.983452
2 California Lake 38.896719 39.331367
3 California Ventura 42.923190 45.250693
.. ... ... ... ...
96 Wisconsin La Crosse 37.490904 40.577038
97 Wisconsin Lafayette 38.104967 41.675050
98 Wyoming Weston 76.684241 83.983328
99 Alaska District 34 77.063259 40.789626

[100 rows x 4 columns]

100 rows; each row represents county-level votes in a presidential election.

1 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ

HYPOTHESIS TESTING IN PYTHON


Hypotheses
Question: Was the percentage of Republican candidate votes lower in 2008 than 2012?

H0 : μ2008 − μ2012 = 0

HA : μ2008 − μ2012 < 0

Set α = 0.05 significance level.

Data is paired → each voter percentage refers to the same county


Want to capture voting patterns in model

HYPOTHESIS TESTING IN PYTHON


From two samples to one
sample_data = repub_votes_potus_08_12
sample_data['diff'] = sample_data['repub_percent_08'] - sample_data['repub_percent_12']

import matplotlib.pyplot as plt


sample_data['diff'].hist(bins=20)

HYPOTHESIS TESTING IN PYTHON


Calculate sample statistics of the difference
xbar_diff = sample_data['diff'].mean()

-2.877109041242944

HYPOTHESIS TESTING IN PYTHON


Revised hypotheses
Old hypotheses:

H0 : μ2008 − μ2012 = 0

HA : μ2008 − μ2012 < 0

New hypotheses:

H0 : μdiff = 0

HA : μdiff < 0

Test statistic:

t = (x̄diff − μdiff) / √(s²diff/ndiff)

df = ndiff − 1

HYPOTHESIS TESTING IN PYTHON


Calculating the p-value
t = (x̄diff − μdiff) / √(s²diff/ndiff),  df = ndiff − 1

n_diff = len(sample_data)

100

s_diff = sample_data['diff'].std()

t_stat = (xbar_diff - 0) / np.sqrt(s_diff**2 / n_diff)

-5.601043121928489

degrees_of_freedom = n_diff - 1

99

from scipy.stats import t

p_value = t.cdf(t_stat, df=n_diff-1)

9.572537285272411e-08

HYPOTHESIS TESTING IN PYTHON


Testing differences between two means using ttest()
import pingouin
pingouin.ttest(x=sample_data['diff'],
y=0,
alternative="less")

T dof alternative p-val CI95% cohen-d \


T-test -5.601043 99 less 9.572537e-08 [-inf, -2.02] 0.560104

BF10 power
T-test 1.323e+05 1.0

1Details on Returns from pingouin.ttest() are available in the API docs for pingouin at https://pingouin-
stats.org/generated/pingouin.ttest.html#pingouin.ttest.

HYPOTHESIS TESTING IN PYTHON


ttest() with paired=True
pingouin.ttest(x=sample_data['repub_percent_08'],
y=sample_data['repub_percent_12'],
paired=True,
alternative="less")

T dof alternative p-val CI95% cohen-d \


T-test -5.601043 99 less 9.572537e-08 [-inf, -2.02] 0.217364

BF10 power
T-test 1.323e+05 0.696338

HYPOTHESIS TESTING IN PYTHON


Unpaired ttest()
pingouin.ttest(x=sample_data['repub_percent_08'],
y=sample_data['repub_percent_12'],
paired=False, # The default
alternative="less")

T dof alternative p-val CI95% cohen-d BF10 \


T-test -1.536997 198 less 0.062945 [-inf, 0.22] 0.217364 0.927

power
T-test 0.454972

Unpaired t-tests on paired data increase the chances of false negative errors

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
ANOVA tests
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Job satisfaction: 5 categories
stack_overflow['job_sat'].value_counts()

Very satisfied 879


Slightly satisfied 680
Slightly dissatisfied 342
Neither 201
Very dissatisfied 159
Name: job_sat, dtype: int64

HYPOTHESIS TESTING IN PYTHON


Visualizing multiple distributions
Is mean annual compensation different for
different levels of job satisfaction?

import seaborn as sns


import matplotlib.pyplot as plt
sns.boxplot(x="converted_comp",
y="job_sat",
data=stack_overflow)
plt.show()

HYPOTHESIS TESTING IN PYTHON


Analysis of variance (ANOVA)
A test for differences between groups

alpha = 0.2

pingouin.anova(data=stack_overflow,
dv="converted_comp",
between="job_sat")

Source ddof1 ddof2 F p-unc np2


0 job_sat 4 2256 4.480485 0.001315 0.007882

0.001315 < α
At least two categories have significantly different compensation

HYPOTHESIS TESTING IN PYTHON


Pairwise tests
μvery dissatisfied ≠ μslightly dissatisfied μslightly dissatisfied ≠ μslightly satisfied
μvery dissatisfied ≠ μneither μslightly dissatisfied ≠ μvery satisfied
μvery dissatisfied ≠ μslightly satisfied μneither ≠ μslightly satisfied
μvery dissatisfied ≠ μvery satisfied μneither ≠ μvery satisfied
μslightly dissatisfied ≠ μneither μslightly satisfied ≠ μvery satisfied

Set significance level to α = 0.2.

HYPOTHESIS TESTING IN PYTHON


pairwise_tests()
pingouin.pairwise_tests(data=stack_overflow,
dv="converted_comp",
between="job_sat",
padjust="none")

Contrast A B Paired Parametric ... dof alternative p-unc BF10 hedges


0 job_sat Slightly satisfied Very satisfied False True ... 1478.622799 two-sided 0.000064 158.564 -0.192931
1 job_sat Slightly satisfied Neither False True ... 258.204546 two-sided 0.484088 0.114 -0.068513
2 job_sat Slightly satisfied Very dissatisfied False True ... 187.153329 two-sided 0.215179 0.208 -0.145624
3 job_sat Slightly satisfied Slightly dissatisfied False True ... 569.926329 two-sided 0.969491 0.074 -0.002719
4 job_sat Very satisfied Neither False True ... 328.326639 two-sided 0.097286 0.337 0.120115
5 job_sat Very satisfied Very dissatisfied False True ... 221.666205 two-sided 0.455627 0.126 0.063479
6 job_sat Very satisfied Slightly dissatisfied False True ... 821.303063 two-sided 0.002166 7.43 0.173247
7 job_sat Neither Very dissatisfied False True ... 321.165726 two-sided 0.585481 0.135 -0.058537
8 job_sat Neither Slightly dissatisfied False True ... 367.730081 two-sided 0.547406 0.118 0.055707
9 job_sat Very dissatisfied Slightly dissatisfied False True ... 247.570187 two-sided 0.259590 0.197 0.119131

[10 rows x 11 columns]

HYPOTHESIS TESTING IN PYTHON


As the number of groups increases...
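...the number of pairwise comparisons grows quadratically: k groups require k(k − 1)/2 tests.

# Number of pairwise tests for k groups: k choose 2
def n_pairwise_tests(k):
    return k * (k - 1) // 2

print(n_pairwise_tests(5))   # 10 tests for the 5 job satisfaction levels
print(n_pairwise_tests(20))  # 190 tests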

HYPOTHESIS TESTING IN PYTHON


Bonferroni correction
pingouin.pairwise_tests(data=stack_overflow,
dv="converted_comp",
between="job_sat",
padjust="bonf")

Contrast A B ... p-unc p-corr p-adjust BF10 hedges


0 job_sat Slightly satisfied Very satisfied ... 0.000064 0.000638 bonf 158.564 -0.192931
1 job_sat Slightly satisfied Neither ... 0.484088 1.000000 bonf 0.114 -0.068513
2 job_sat Slightly satisfied Very dissatisfied ... 0.215179 1.000000 bonf 0.208 -0.145624
3 job_sat Slightly satisfied Slightly dissatisfied ... 0.969491 1.000000 bonf 0.074 -0.002719
4 job_sat Very satisfied Neither ... 0.097286 0.972864 bonf 0.337 0.120115
5 job_sat Very satisfied Very dissatisfied ... 0.455627 1.000000 bonf 0.126 0.063479
6 job_sat Very satisfied Slightly dissatisfied ... 0.002166 0.021659 bonf 7.43 0.173247
7 job_sat Neither Very dissatisfied ... 0.585481 1.000000 bonf 0.135 -0.058537
8 job_sat Neither Slightly dissatisfied ... 0.547406 1.000000 bonf 0.118 0.055707
9 job_sat Very dissatisfied Slightly dissatisfied ... 0.259590 1.000000 bonf 0.197 0.119131

[10 rows x 11 columns]

HYPOTHESIS TESTING IN PYTHON


More methods
padjust : string

Method used for testing and adjustment of pvalues.

'none' : no correction [default]

'bonf' : one-step Bonferroni correction

'sidak' : one-step Sidak correction

'holm' : step-down method using Bonferroni adjustments

'fdr_bh' : Benjamini/Hochberg FDR correction

'fdr_by' : Benjamini/Yekutieli FDR correction

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
One-sample
proportion tests
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Chapter 1 recap
Is a claim about an unknown population proportion feasible?

1. Standard error of sample statistic from bootstrap distribution


2. Compute a standardized test statistic

3. Calculate a p-value

4. Decide which hypothesis made most sense

Now, calculate the test statistic without using the bootstrap distribution

HYPOTHESIS TESTING IN PYTHON


Standardized test statistic for proportions
p: population proportion (unknown population parameter)

p̂: sample proportion (sample statistic)

p0 : hypothesized population proportion


z = (p̂ − mean(p̂)) / SE(p̂) = (p̂ − p) / SE(p̂)

Assuming H0 is true, p = p0, so

z = (p̂ − p0) / SE(p̂)

HYPOTHESIS TESTING IN PYTHON


Simplifying the standard error calculations
SE(p̂) = √(p0 × (1 − p0) / n)  →  Under H0, SE(p̂) depends on the hypothesized p0 and sample size n

Assuming H0 is true,

z = (p̂ − p0) / √(p0 × (1 − p0) / n)

Only uses sample information (p̂ and n) and the hypothesized parameter (p0)

HYPOTHESIS TESTING IN PYTHON


Why z instead of t?
t = (x̄child − x̄adult) / √(s²child/nchild + s²adult/nadult)

s is calculated from x̄

x̄ estimates the population mean

s estimates the population standard deviation

↑ uncertainty in our estimate of the parameter

t-distribution - fatter tails than a normal distribution

p̂ only appears in the numerator, so z-scores are fine

HYPOTHESIS TESTING IN PYTHON


Stack Overflow age categories
H0 : Proportion of Stack Overflow users under thirty = 0.5

HA : Proportion of Stack Overflow users under thirty ≠ 0.5

alpha = 0.01

stack_overflow['age_cat'].value_counts(normalize=True)

Under 30 0.535604
At least 30 0.464396
Name: age_cat, dtype: float64

HYPOTHESIS TESTING IN PYTHON


Variables for z
p_hat = (stack_overflow['age_cat'] == 'Under 30').mean()

0.5356037151702786

p_0 = 0.50

n = len(stack_overflow)

2261

HYPOTHESIS TESTING IN PYTHON


Calculating the z-score
z = (p̂ − p0) / √(p0 × (1 − p0) / n)

import numpy as np
numerator = p_hat - p_0
denominator = np.sqrt(p_0 * (1 - p_0) / n)
z_score = numerator / denominator

3.385911440783663

HYPOTHESIS TESTING IN PYTHON


Calculating the p-value
Two-tailed ("not equal"):

p_value = norm.cdf(-z_score) +
1 - norm.cdf(z_score)

p_value = 2 * (1 - norm.cdf(z_score))
Left-tailed ("less than"):
0.0007094227368100725
from scipy.stats import norm
p_value = norm.cdf(z_score)
p_value <= alpha

Right-tailed ("greater than"):


True

p_value = 1 - norm.cdf(z_score)

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
Two-sample
proportion tests
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Comparing two proportions
H0 : Proportion of hobbyist users is the same for those under thirty as those at least thirty

H0 : p≥30 − p<30 = 0

HA : Proportion of hobbyist users is different for those under thirty than for those at least thirty

HA : p≥30 − p<30 ≠ 0

alpha = 0.05

HYPOTHESIS TESTING IN PYTHON


Calculating the z-score
z-score equation for a proportion test:
z = ((p̂≥30 − p̂<30) − 0) / SE(p̂≥30 − p̂<30)

Standard error equation:

SE(p̂≥30 − p̂<30) = √(p̂ × (1 − p̂)/n≥30 + p̂ × (1 − p̂)/n<30)

p̂ → weighted mean of p̂≥30 and p̂<30:

p̂ = (n≥30 × p̂≥30 + n<30 × p̂<30) / (n≥30 + n<30)

Only require p̂≥30, p̂<30, n≥30, n<30 from the sample to calculate the z-score

HYPOTHESIS TESTING IN PYTHON


Getting the numbers for the z-score
p_hats = stack_overflow.groupby("age_cat")['hobbyist'].value_counts(normalize=True)

age_cat hobbyist
At least 30 Yes 0.773333
No 0.226667
Under 30 Yes 0.843105
No 0.156895
Name: hobbyist, dtype: float64

n = stack_overflow.groupby("age_cat")['hobbyist'].count()

age_cat
At least 30 1050
Under 30 1211
Name: hobbyist, dtype: int64

HYPOTHESIS TESTING IN PYTHON


Getting the numbers for the z-score
p_hats = stack_overflow.groupby("age_cat")['hobbyist'].value_counts(normalize=True)

age_cat hobbyist
At least 30 Yes 0.773333
No 0.226667
Under 30 Yes 0.843105
No 0.156895
Name: hobbyist, dtype: float64

p_hat_at_least_30 = p_hats[("At least 30", "Yes")]


p_hat_under_30 = p_hats[("Under 30", "Yes")]
print(p_hat_at_least_30, p_hat_under_30)

0.773333 0.843105

HYPOTHESIS TESTING IN PYTHON


Getting the numbers for the z-score
n = stack_overflow.groupby("age_cat")['hobbyist'].count()

age_cat
At least 30 1050
Under 30 1211
Name: hobbyist, dtype: int64

n_at_least_30 = n["At least 30"]


n_under_30 = n["Under 30"]
print(n_at_least_30, n_under_30)

1050 1211

HYPOTHESIS TESTING IN PYTHON


Getting the numbers for the z-score
p_hat = ((n_at_least_30 * p_hat_at_least_30 + n_under_30 * p_hat_under_30) /
         (n_at_least_30 + n_under_30))

std_error = np.sqrt(p_hat * (1 - p_hat) / n_at_least_30 +
                    p_hat * (1 - p_hat) / n_under_30)

z_score = (p_hat_at_least_30 - p_hat_under_30) / std_error


print(z_score)

-4.223718652693034

HYPOTHESIS TESTING IN PYTHON


Proportion tests using proportions_ztest()
stack_overflow.groupby("age_cat")['hobbyist'].value_counts()

age_cat hobbyist
At least 30 Yes 812
No 238
Under 30 Yes 1021
No 190
Name: hobbyist, dtype: int64

n_hobbyists = np.array([812, 1021])


n_rows = np.array([812 + 238, 1021 + 190])
from statsmodels.stats.proportion import proportions_ztest
z_score, p_value = proportions_ztest(count=n_hobbyists, nobs=n_rows,
alternative="two-sided")

(-4.223691463320559, 2.403330142685068e-05)

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
Chi-square test of
independence
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Revisiting the proportion test
age_by_hobbyist = stack_overflow.groupby("age_cat")['hobbyist'].value_counts()

age_cat hobbyist
At least 30 Yes 812
No 238
Under 30 Yes 1021
No 190
Name: hobbyist, dtype: int64

from statsmodels.stats.proportion import proportions_ztest


n_hobbyists = np.array([812, 1021])
n_rows = np.array([812 + 238, 1021 + 190])
stat, p_value = proportions_ztest(count=n_hobbyists, nobs=n_rows,
alternative="two-sided")

(-4.223691463320559, 2.403330142685068e-05)

HYPOTHESIS TESTING IN PYTHON


Independence of variables
Previous hypothesis test result: evidence that hobbyist and age_cat are associated

Statistical independence - the proportion of successes in the response variable is the same across all categories of the explanatory variable

HYPOTHESIS TESTING IN PYTHON


Test for independence of variables
import pingouin
expected, observed, stats = pingouin.chi2_independence(data=stack_overflow, x='hobbyist',
y='age_cat', correction=False)
print(stats)

test lambda chi2 dof pval cramer power


0 pearson 1.000000 17.839570 1.0 0.000024 0.088826 0.988205
1 cressie-read 0.666667 17.818114 1.0 0.000024 0.088773 0.988126
2 log-likelihood 0.000000 17.802653 1.0 0.000025 0.088734 0.988069
3 freeman-tukey -0.500000 17.815060 1.0 0.000024 0.088765 0.988115
4 mod-log-likelihood -1.000000 17.848099 1.0 0.000024 0.088848 0.988236
5 neyman -2.000000 17.976656 1.0 0.000022 0.089167 0.988694

χ² statistic = 17.839570 = (−4.223691463320559)² = (z-score)²
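This identity is easy to confirm in code, reusing stat from the proportion test shown earlier:

print(stat ** 2)  # ~17.8396, matching the pearson chi2 value up to floating-point error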

HYPOTHESIS TESTING IN PYTHON


Job satisfaction and age category
stack_overflow['age_cat'].value_counts()

Under 30       1211
At least 30    1050
Name: age_cat, dtype: int64

stack_overflow['job_sat'].value_counts()

Very satisfied           879
Slightly satisfied       680
Slightly dissatisfied    342
Neither                  201
Very dissatisfied        159
Name: job_sat, dtype: int64

HYPOTHESIS TESTING IN PYTHON


Declaring the hypotheses
H0 : Age categories are independent of job satisfaction levels

HA : Age categories are not independent of job satisfaction levels

alpha = 0.1

Test statistic denoted χ2

Assuming independence, how far away are the observed results from the expected values?

HYPOTHESIS TESTING IN PYTHON


Exploratory visualization: proportional stacked bar plot
props = stack_overflow.groupby('job_sat')['age_cat'].value_counts(normalize=True)
wide_props = props.unstack()
wide_props.plot(kind="bar", stacked=True)

HYPOTHESIS TESTING IN PYTHON


Exploratory visualization: proportional stacked bar plot

[Figure: stacked bar plot of age category proportions within each job satisfaction level]
HYPOTHESIS TESTING IN PYTHON


Chi-square independence test
import pingouin
expected, observed, stats = pingouin.chi2_independence(data=stack_overflow, x="job_sat", y="age_cat")
print(stats)

test lambda chi2 dof pval cramer power


0 pearson 1.000000 5.552373 4.0 0.235164 0.049555 0.437417
1 cressie-read 0.666667 5.554106 4.0 0.235014 0.049563 0.437545
2 log-likelihood 0.000000 5.558529 4.0 0.234632 0.049583 0.437871
3 freeman-tukey -0.500000 5.562688 4.0 0.234274 0.049601 0.438178
4 mod-log-likelihood -1.000000 5.567570 4.0 0.233854 0.049623 0.438538
5 neyman -2.000000 5.579519 4.0 0.232828 0.049676 0.439419

Degrees of freedom:

(No. of response categories − 1) × (No. of explanatory categories − 1)

(2 − 1) × (5 − 1) = 4
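The same degrees of freedom can be read off the contingency table's shape; a quick sketch:

import pandas as pd

observed_counts = pd.crosstab(stack_overflow['age_cat'], stack_overflow['job_sat'])
dof = (observed_counts.shape[0] - 1) * (observed_counts.shape[1] - 1)
print(dof)  # 4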

HYPOTHESIS TESTING IN PYTHON


Swapping the variables?
props = stack_overflow.groupby('age_cat')['job_sat'].value_counts(normalize=True)
wide_props = props.unstack()
wide_props.plot(kind="bar", stacked=True)

HYPOTHESIS TESTING IN PYTHON


Swapping the variables?

[Figure: stacked bar plot of job satisfaction proportions within each age category]
HYPOTHESIS TESTING IN PYTHON


chi-square both ways
expected, observed, stats = pingouin.chi2_independence(data=stack_overflow, x="age_cat", y="job_sat")
print(stats[stats['test'] == 'pearson'])

test lambda chi2 dof pval cramer power


0 pearson 1.0 5.552373 4.0 0.235164 0.049555 0.437417

Ask: Are the variables X and Y independent?

Not: Is variable X independent from variable Y?

HYPOTHESIS TESTING IN PYTHON


What about direction and tails?
The chi-square statistic is a sum of squared differences between observed and expected counts, so it is never negative

chi-square tests are almost always right-tailed 1

1Left-tailed chi-square tests are used in statistical forensics to detect if a fit is suspiciously good because the
data was fabricated. Chi-square tests of variance can be two-tailed. These are niche uses, though.

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
Chi-square
goodness of fit tests
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Purple links
How do you feel when you discover that you've already visited the top resource?

purple_link_counts = stack_overflow['purple_link'].value_counts()

purple_link_counts = purple_link_counts.rename_axis('purple_link')\
.reset_index(name='n')\
.sort_values('purple_link')

purple_link n
2 Amused 368
3 Annoyed 263
0 Hello, old friend 1225
1 Indifferent 405

HYPOTHESIS TESTING IN PYTHON


Declaring the hypotheses
import pandas as pd

hypothesized = pd.DataFrame({
    'purple_link': ['Amused', 'Annoyed', 'Hello, old friend', 'Indifferent'],
    'prop': [1/6, 1/6, 1/2, 1/6]})

         purple_link      prop
0             Amused  0.166667
1            Annoyed  0.166667
2  Hello, old friend  0.500000
3        Indifferent  0.166667

H0 : The sample matches the hypothesized distribution

HA : The sample does not match the hypothesized distribution

χ² measures how far observed results are from expectations in each group

alpha = 0.01

HYPOTHESIS TESTING IN PYTHON


Hypothesized counts by category
n_total = len(stack_overflow)
hypothesized["n"] = hypothesized["prop"] * n_total

purple_link prop n
0 Amused 0.166667 376.833333
1 Annoyed 0.166667 376.833333
2 Hello, old friend 0.500000 1130.500000
3 Indifferent 0.166667 376.833333

HYPOTHESIS TESTING IN PYTHON


Visualizing counts
import matplotlib.pyplot as plt

plt.bar(purple_link_counts['purple_link'], purple_link_counts['n'],
color='red', label='Observed')
plt.bar(hypothesized['purple_link'], hypothesized['n'], alpha=0.5,
color='blue', label='Hypothesized')

plt.legend()
plt.show()

HYPOTHESIS TESTING IN PYTHON


Visualizing counts

[Figure: bar chart comparing observed and hypothesized counts for each purple_link category]
HYPOTHESIS TESTING IN PYTHON


chi-square goodness of fit test
print(hypothesized)

purple_link prop n
0 Amused 0.166667 376.833333
1 Annoyed 0.166667 376.833333
2 Hello, old friend 0.500000 1130.500000
3 Indifferent 0.166667 376.833333

from scipy.stats import chisquare


chisquare(f_obs=purple_link_counts['n'], f_exp=hypothesized['n'])

Power_divergenceResult(statistic=44.59840778416629, pvalue=1.1261810719413759e-09)
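As a sanity check, the statistic can also be computed by hand from its definition, the sum of (observed − expected)² / expected over the categories; a sketch using the two DataFrames above (both are sorted by purple_link, so the rows align positionally):

observed = purple_link_counts['n'].to_numpy()
expected = hypothesized['n'].to_numpy()
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)  # should match the statistic from chisquare()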

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
Assumptions in
hypothesis testing
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Randomness
Assumption
The samples are random subsets of larger
populations

Consequence
Sample is not representative of population

How to check this


Understand how your data was collected

Speak to the data collector/domain expert

1 Sampling techniques are discussed in "Sampling in Python".

HYPOTHESIS TESTING IN PYTHON


Independence of observations
Assumption
Each observation (row) in the dataset is independent

Consequence
Increased chance of false negative/positive errors

How to check this


Understand how your data was collected

HYPOTHESIS TESTING IN PYTHON


Large sample size
Assumption
The sample is big enough to mitigate uncertainty, so that the Central Limit Theorem applies

Consequence
Wider confidence intervals

Increased chance of false negative/positive errors

How to check this


It depends on the test

HYPOTHESIS TESTING IN PYTHON


Large sample size: t-test
One sample

At least 30 observations in the sample: n ≥ 30 (n: sample size)

Two samples

At least 30 observations in each sample: n1 ≥ 30, n2 ≥ 30 (ni: sample size for group i)

Paired samples

At least 30 pairs of observations across the samples: number of rows in our data ≥ 30

ANOVA

At least 30 observations in each sample: ni ≥ 30 for all values of i

HYPOTHESIS TESTING IN PYTHON


Large sample size: proportion tests
One sample

Number of successes in sample is greater than or equal to 10: n × p̂ ≥ 10

Number of failures in sample is greater than or equal to 10: n × (1 − p̂) ≥ 10

n: sample size; p̂: proportion of successes in sample

Two samples

Number of successes in each sample is greater than or equal to 10: n1 × p̂1 ≥ 10 and n2 × p̂2 ≥ 10

Number of failures in each sample is greater than or equal to 10: n1 × (1 − p̂1) ≥ 10 and n2 × (1 − p̂2) ≥ 10
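These conditions are straightforward to check programmatically; a sketch of a hypothetical helper (not part of any library used in this course):

def proportion_conditions_met(n, p_hat):
    # At least 10 expected successes and 10 expected failures
    return n * p_hat >= 10 and n * (1 - p_hat) >= 10

print(proportion_conditions_met(1050, 0.77))  # True
print(proportion_conditions_met(12, 0.95))    # False: under one expected failure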

HYPOTHESIS TESTING IN PYTHON


Large sample size: chi-square tests
The number of successes in each group is greater than or equal to 5: ni × p̂i ≥ 5 for all values of i

The number of failures in each group is greater than or equal to 5: ni × (1 − p̂i) ≥ 5 for all values of i

ni : sample size for group i

p̂i : proportion of successes in sample group i

HYPOTHESIS TESTING IN PYTHON


Sanity check
If the bootstrap distribution doesn't look normal, assumptions likely aren't valid

Revisit data collection to check for randomness, independence, and sample size

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
Non-parametric
tests
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Parametric tests
z-test, t-test, and ANOVA are all parametric tests
Assume a normal distribution

Require sufficiently large sample sizes

HYPOTHESIS TESTING IN PYTHON


Smaller Republican votes data
print(repub_votes_small)

state county repub_percent_08 repub_percent_12


80 Texas Red River 68.507522 69.944817
84 Texas Walker 60.707197 64.971903
33 Kentucky Powell 57.059533 61.727293
81 Texas Schleicher 74.386503 77.384464
93 West Virginia Morgan 60.857614 64.068711

HYPOTHESIS TESTING IN PYTHON


Results with pingouin.ttest()
5 pairs is not enough to meet the sample size condition for the paired t-test:
At least 30 pairs of observations across the samples.

alpha = 0.01
import pingouin
pingouin.ttest(x=repub_votes_small['repub_percent_08'],
               y=repub_votes_small['repub_percent_12'],
               paired=True,
               alternative="less")

T dof alternative p-val CI95% cohen-d BF10 power


T-test -5.875753 4 less 0.002096 [-inf, -2.11] 0.500068 26.468 0.239034

HYPOTHESIS TESTING IN PYTHON


Non-parametric tests
Non-parametric tests avoid the parametric assumptions and conditions
Many non-parametric tests use ranks of the data

x = [1, 15, 3, 10, 6]

from scipy.stats import rankdata


rankdata(x)

array([1., 5., 2., 4., 3.])

HYPOTHESIS TESTING IN PYTHON


Non-parametric tests
Non-parametric tests are more reliable than parametric tests for small sample sizes and
when data isn't normally distributed

Wilcoxon signed-rank test


Developed by Frank Wilcoxon in 1945

One of the first non-parametric procedures

HYPOTHESIS TESTING IN PYTHON


Wilcoxon signed-rank test (Step 1)
Works on the ranked absolute differences between the pairs of data

repub_votes_small['diff'] = (repub_votes_small['repub_percent_08'] -
                             repub_votes_small['repub_percent_12'])
print(repub_votes_small)

state county repub_percent_08 repub_percent_12 diff


80 Texas Red River 68.507522 69.944817 -1.437295
84 Texas Walker 60.707197 64.971903 -4.264705
33 Kentucky Powell 57.059533 61.727293 -4.667760
81 Texas Schleicher 74.386503 77.384464 -2.997961
93 West Virginia Morgan 60.857614 64.068711 -3.211097

HYPOTHESIS TESTING IN PYTHON


Wilcoxon signed-rank test (Step 2)
Works on the ranked absolute differences between the pairs of data

repub_votes_small['abs_diff'] = repub_votes_small['diff'].abs()
print(repub_votes_small)

state county repub_percent_08 repub_percent_12 diff abs_diff


80 Texas Red River 68.507522 69.944817 -1.437295 1.437295
84 Texas Walker 60.707197 64.971903 -4.264705 4.264705
33 Kentucky Powell 57.059533 61.727293 -4.667760 4.667760
81 Texas Schleicher 74.386503 77.384464 -2.997961 2.997961
93 West Virginia Morgan 60.857614 64.068711 -3.211097 3.211097

HYPOTHESIS TESTING IN PYTHON


Wilcoxon signed-rank test (Step 3)
Works on the ranked absolute differences between the pairs of data

from scipy.stats import rankdata


repub_votes_small['rank_abs_diff'] = rankdata(repub_votes_small['abs_diff'])
print(repub_votes_small)

state county repub_percent_08 repub_percent_12 diff abs_diff rank_abs_diff


80 Texas Red River 68.507522 69.944817 -1.437295 1.437295 1.0
84 Texas Walker 60.707197 64.971903 -4.264705 4.264705 4.0
33 Kentucky Powell 57.059533 61.727293 -4.667760 4.667760 5.0
81 Texas Schleicher 74.386503 77.384464 -2.997961 2.997961 2.0
93 West Virginia Morgan 60.857614 64.068711 -3.211097 3.211097 3.0

HYPOTHESIS TESTING IN PYTHON


Wilcoxon signed-rank test (Step 4)
state county repub_percent_08 repub_percent_12 diff abs_diff rank_abs_diff
80 Texas Red River 68.507522 69.944817 -1.437295 1.437295 1.0
84 Texas Walker 60.707197 64.971903 -4.264705 4.264705 4.0
33 Kentucky Powell 57.059533 61.727293 -4.667760 4.667760 5.0
81 Texas Schleicher 74.386503 77.384464 -2.997961 2.997961 2.0
93 West Virginia Morgan 60.857614 64.068711 -3.211097 3.211097 3.0

Incorporate the sum of the ranks for negative and positive differences

import numpy as np

T_minus = 1 + 4 + 5 + 2 + 3  # sum of ranks for negative differences
T_plus = 0                   # sum of ranks for positive differences (there are none)
W = np.min([T_minus, T_plus])
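The hardcoded rank sums above are read straight off the table; in general they can be computed from the sign of each difference, e.g. (a sketch using the columns built in the previous steps):

ranks = repub_votes_small['rank_abs_diff']
diffs = repub_votes_small['diff']

T_minus = ranks[diffs < 0].sum()  # ranks where repub_percent_08 < repub_percent_12
T_plus = ranks[diffs > 0].sum()   # ranks where repub_percent_08 > repub_percent_12
W = np.min([T_minus, T_plus])
print(W)  # 0.0 here, matching pingouin's W-val on the next slide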

HYPOTHESIS TESTING IN PYTHON


Implementation with pingouin.wilcoxon()
alpha = 0.01
pingouin.wilcoxon(x=repub_votes_small['repub_percent_08'],
                  y=repub_votes_small['repub_percent_12'],
                  alternative="less")

W-val alternative p-val RBC CLES


Wilcoxon 0.0 less 0.03125 -1.0 0.72

Fail to reject H0, since 0.03125 > 0.01

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
Non-parametric
ANOVA and
unpaired t-tests
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Wilcoxon-Mann-Whitney test
Also known as the Mann-Whitney U test
A t-test on the ranks of the numeric input

Works on unpaired data

HYPOTHESIS TESTING IN PYTHON


Wilcoxon-Mann-Whitney test setup
age_vs_comp = stack_overflow[['converted_comp', 'age_first_code_cut']]

age_vs_comp_wide = age_vs_comp.pivot(columns='age_first_code_cut',
values='converted_comp')

age_first_code_cut adult child


0 77556.0 NaN
1 NaN 74970.0
2 NaN 594539.0
... ... ...
2258 NaN 97284.0
2259 NaN 72000.0
2260 NaN 180000.0

[2261 rows x 2 columns]

HYPOTHESIS TESTING IN PYTHON


Wilcoxon-Mann-Whitney test
alpha = 0.01

import pingouin
pingouin.mwu(x=age_vs_comp_wide['child'],
y=age_vs_comp_wide['adult'],
alternative='greater')

U-val alternative p-val RBC CLES


MWU 744365.5 greater 1.902723e-19 -0.222516 0.611258
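An alternative to pivoting first is to slice the long data directly; a sketch that should be equivalent (pingouin drops missing values internally, and direct slicing produces none to begin with):

child = stack_overflow.loc[stack_overflow['age_first_code_cut'] == 'child', 'converted_comp']
adult = stack_overflow.loc[stack_overflow['age_first_code_cut'] == 'adult', 'converted_comp']

pingouin.mwu(x=child, y=adult, alternative='greater')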

HYPOTHESIS TESTING IN PYTHON


Kruskal-Wallis test
Kruskal-Wallis test is to Wilcoxon-Mann-Whitney test as ANOVA is to t-test

alpha = 0.01

pingouin.kruskal(data=stack_overflow,
dv='converted_comp',
between='job_sat')

Source ddof1 H p-unc


Kruskal job_sat 4 72.814939 5.772915e-15
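For comparison, the parametric equivalent uses the same interface in pingouin; a sketch (only appropriate when the ANOVA assumptions hold):

pingouin.anova(data=stack_overflow,
               dv='converted_comp',
               between='job_sat')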

HYPOTHESIS TESTING IN PYTHON


Let's practice!
HYPOTHESIS TESTING IN PYTHON
Congratulations!
HYPOTHESIS TESTING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
Course recap
Chapter 1

Workflow for testing proportions vs. a hypothesized value

False negative/false positive errors

Chapter 2

Testing differences in sample means between two groups using t-tests

Extending this to more than two groups using ANOVA and pairwise t-tests

Chapter 3

Testing differences in sample proportions between two groups using proportion tests

Using chi-square independence/goodness of fit tests

Chapter 4

Reviewing assumptions of parametric hypothesis tests

Examining non-parametric alternatives when assumptions aren't valid

HYPOTHESIS TESTING IN PYTHON


More courses
Inference
Statistics Fundamentals with Python skill track

Bayesian statistics
Bayesian Data Analysis in Python

Applications
Customer Analytics and A/B Testing in Python

HYPOTHESIS TESTING IN PYTHON


Congratulations!
HYPOTHESIS TESTING IN PYTHON
