Lecture Notes
Physics students
Fall 2020

Thomas Nagler

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0) License.
Contents

2 Probability basics
  2.1 Sample spaces and events
  2.2 Probabilities
  2.3 Independence
  2.4 Conditional probability
  2.5 Random variables
  2.6 Distribution functions
  2.7 Probability mass and density functions
  2.8 Bivariate distributions
  2.9 Conditional distributions
  2.10 Independence (ctd')
  2.11 Transforms
  2.12 Expectation
  2.13 Variance and standard deviation
  2.14 Covariance and correlation

5 Parameter estimation
  5.1 Motivation: distributions in times of Corona
  5.2 A general framework
  5.3 The method of moments
  5.4 Maximum likelihood estimation
  5.5 Checking for misspecification and model fit
  5.6 Chi-square fitting

6 Uncertainty quantification
  6.1 The central limit theorem
  6.2 Asymptotic normality of estimators

7 Testing
  7.1 Hypothesis testing: a quick walk through
  7.2 Issues with hypothesis testing
  7.3 Null and alternative hypotheses
  7.4 Test statistics
  7.5 Test errors
  7.6 Significance and p-values
  7.7 Power
  7.8 Multiple testing
  7.9 Some classes of tests
2 Probability basics
Having observed some data X1 , . . . , Xn , what can we say about the
mechanism that generated them?
Definition 2.1. (i) The sample space Ω is the set of all possible outcomes of an experiment. (ii) An event is a subset A ⊆ Ω.

Example. Toss a coin twice. The sample space is

    Ω = {HH, HT, TH, TT},

where H stands for heads and T for tails. The event that the first toss is heads is A = {HH, HT}.
Example 2.4. We ask a random person in the street what his month of birth is. The sample space is Ω = {Jan, Feb, . . . , Dec}. The event that he was born in spring is A = {Mar, Apr, May}.
Now consider two events: an earthquake (A) and a flood (B) defined on the sample space Ω. The set A contains all states of our world ω in which an earthquake happens. The set B contains all ω for which a flood happens.
What is the event of no earthquake happening?
To understand events, visualizations like the Venn diagram are often helpful.
Think of the rectangle as the sample space Ω. That is, all points ω inside the
rectangle together form the set Ω. The disk represents the event A. All points ω
inside the disk form the set A. The event A^c ('A does not happen') is defined as all outcomes in Ω that are not part of the set A (the shaded area).
Example 2.6. We ask a random person in the street what his month of birth is. The sample space is Ω = {Jan, Feb, . . . , Dec}. The event that he was born in spring is A = {Mar, Apr, May}. The event that he was not born in spring is

    A^c = {ω ∈ Ω : ω ∉ A} = {Jan, Feb, Jun, Jul, . . . , Dec}.
There are several ways to connect or disentangle two different events A and B.
First, what is the event that there is an earthquake or a flood?
Definition 2.7. The union of the events A and B, which can be thought of
as an event ‘A or B’, is defined as
A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B or both}.
The ‘or’ here is non-exclusive: A and B may both happen simultaneously, but
we also accept it if just one of them does.
Similarly, for a possibly infinite sequence of events A1, A2, A3, . . . ,

    ⋃_i A_i = {ω ∈ Ω : ω ∈ A_i for at least one i}.
Example 2.8. We ask a random person in the street what his month of birth is. The sample space is Ω = {Jan, Feb, . . . , Dec}. The event that he was born in spring is A = {Mar, Apr, May}. The event that he was born in summer is B = {Jun, Jul, Aug}. The event that he was born in spring or summer is A ∪ B = {Mar, Apr, May, Jun, Jul, Aug}.
The second way to connect the two is: what is the event that there is both an earthquake and a flood?

Definition 2.9. The intersection of the events A and B, which can be thought of as the event 'A and B', is defined as

    A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B simultaneously}.
Example 2.10. We ask a random person in the street what his month of birth is. The sample space is Ω = {Jan, Feb, . . . , Dec}. The event that he was born in winter is A = {Dec, Jan, Feb}. The event that he was born in the first half of a year is B = {Jan, . . . , Jun}. The event that he was born both in the winter and in the first half of the year is A ∩ B = {Jan, Feb}.
Definition 2.11. The difference of the events A and B, the event 'A but not B', is defined as

    A \ B = {ω ∈ Ω : ω ∈ A and ω ∉ B}.
Example 2.12. We ask a random person in the street what his month of birth is. The sample space is Ω = {Jan, Feb, . . . , Dec}. The event that he was born in summer is A = {Jun, Jul, Aug}. The event that he was born in August is B = {Aug}. The event that he was born in summer but not in August is A \ B = {Jun, Jul}.
2.2 Probabilities
Having set a mathematical framework to speak about events, we can move on to
probabilities of events.
Definition 2.13. A probability measure is a function P that assigns each event A a number P(A), such that

    (i) P(A) ≥ 0 for every event A,
    (ii) P(Ω) = 1,
    (iii) for any sequence of disjoint events A1, A2, . . . ,
         P(⋃_i A_i) = Σ_i P(A_i).

The axioms are supposed to reflect what we mean by the concept 'probability': a (i) non-negative number, such that (ii) the set of all possible outcomes has probability 1, and (iii) the probabilities of exclusive events (only one of them can happen) sum up.²
There are two common interpretations of probability: frequencies and degrees of belief.

The axioms have some immediate consequences, for example:
• P(∅) = 0.
• A ⊂ B ⇒ P(A) ≤ P(B).
• 0 ≤ P(A) ≤ 1.
² Admittedly, the third axiom is not that intuitive for non-mathematicians.
• P(A^c) = 1 − P(A).
• A ∩ B = ∅ ⇒ P(A ∪ B) = P(A) + P(B).
The following result is less trivial:

Lemma. For any two events A and B,

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof. The sets A \ B, A ∩ B, and B \ A are disjoint, and their union is A ∪ B. The third axiom therefore gives

    P(A ∪ B) = P(A \ B) + P(A ∩ B) + P(B \ A).

We now use the third axiom again, but in the other direction. The sets (A \ B) and (A ∩ B) are disjoint and (A \ B) ∪ (A ∩ B) = A. Hence,

    P(A \ B) = P(A) − P(A ∩ B),

and similarly, P(B \ A) = P(B) − P(A ∩ B). Substituting these two identities into the first display proves the claim.
Example 2.15. Toss a fair coin twice. Let H1 be the event that head occurs on toss 1 and let H2 be the event that head occurs on toss 2. Then

    P(H1 ∪ H2) = P(H1) + P(H2) − P(H1 ∩ H2) = 1/2 + 1/2 − 1/4 = 3/4.

Example. Toss two dice, so that Ω consists of the 36 ordered pairs of outcomes. If each outcome is equally likely, then P(A) = |A|/36. For instance, the probability that the sum of the dice is 11 is equal to 2/36.
For a finite set A, let |A| denote the number of elements in A. If Ω is finite and each outcome is equally likely, then

    P(A) = |A| / |Ω|.
Example 2.17. We ask a random person in the street what his month of birth is. The sample space is Ω = {Jan, Feb, . . . , Dec}. The probability that he was born in the winter is 3/12 = 1/4, if being born in each month is equally probable.
2.3 Independence
Colloquially, we speak of two independent events, when they have nothing to
do with each other. For example, the events ‘it will be raining tomorrow’ and
‘you read an article on mice yesterday’ are entirely unrelated. There is a formal,
probabilistic definition of such events.
Definition 2.19. Two events A and B are called independent if

    P(A ∩ B) = P(A)P(B).
Example 2.20. Toss a fair coin 10 times. Let A = 'at least one head' and let T_j be the event that a tail occurs on the jth coin toss. Then

    P(A) = 1 − P(A^c)
         = 1 − P(all tails)
         = 1 − P(⋂_{i=1}^{10} T_i)
         = 1 − ∏_{i=1}^{10} P(T_i)    (using independence)
         = 1 − (1/2)^{10} ≈ 0.999.
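The exact number is easy to verify numerically. Here is a minimal sketch (the seed and the number of simulations are arbitrary choices of mine) that also checks the result by Monte Carlo simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Exact probability: 1 - (1/2)^10
exact = 1 - 0.5**10

# Monte Carlo: simulate 10 fair coin tosses many times and
# count how often at least one head occurs.
n_sim = 10**6
tosses = rng.integers(0, 2, size=(n_sim, 10))  # 1 = head, 0 = tail
at_least_one_head = (tosses.sum(axis=1) >= 1).mean()

print(exact)              # 0.9990234375
print(at_least_one_head)  # close to 0.999
```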
2.4 Conditional probability

Definition 2.21. For two events A and B with P(B) > 0, the conditional probability of A given B is

    P(A | B) = P(A ∩ B) / P(B).

The restriction P(B) > 0 is necessary for the fraction to be well defined. This aligns with common sense: if something cannot possibly happen, it's foolish to talk about its consequences. Let's get a visual intuition for the formula.
[Venn diagram: two overlapping disks A and B inside the rectangle Ω.]
We see the unconditional probability P(A) as the area of the disk around A divided by the area of the rectangle (Ω). Now suppose we know that the realization ω lies in the disk B and forget about all the other cases. Then B acts as our new Ω. Now we see the conditional probability P(A | B) as the area covered by A (which, after forgetting everything else, is A ∩ B) relative to the area of the disk around B. We therefore think of P(A | B) as the fraction of times A occurs among those in which B occurs.
For fixed B, P(A | B) is a proper probability measure – it satisfies all three axioms. It does not behave like that as a function of B though. In general, it is not true that

    P(A | B ∪ C) = P(A | B) + P(A | C).

Neither is P(A | B) = P(B | A) true in general (related to the 'prosecutor's fallacy'⁴).
Our intuition often fails us when conditional probabilities are involved. There’s
a huge number of ‘fallacies’ and ‘paradoxes’. That’s why understanding the
concept of conditional probabilities is even more important.
Example 2.22. A medical test for COVID-19 has outcomes + and −. The joint probabilities are

          COVID    healthy
    +     1%       1%
    −     0.2%     97.6%

    P(+ | COVID) = P(+ ∩ COVID) / P(COVID) = 1% / (1% + 0.2%) ≈ 83%,
    P(− | healthy) = P(− ∩ healthy) / P(healthy) = 97.6% / (97.6% + 1%) ≈ 99%.

⁴ https://en.wikipedia.org/wiki/Prosecutor%27s_fallacy
The first equation states that, if someone has the disease, the test detects the disease in 83% of the cases. The second equation states that, if someone does not have the disease, the test will correctly diagnose 'no disease' in 99% of the cases. These conditional probabilities are the two common quality measures for medical tests (called sensitivity and specificity). If you read 'the test is 99% correct' in a newspaper, it refers to one of these conditional probabilities (or both). The above accuracy numbers are in line with what we know about the COVID tests in use today.
Now suppose you go for a test and the test is positive. What is the probability you have the disease?

    P(COVID | +) = P(COVID ∩ +) / P(+) = 1% / (1% + 1%) = 50%.

So if you get a positive test, you have a 50% chance of being healthy anyway. That's not intuitive at all – the test is correct at least 83% of the time!
Indeed, it is correct on 83% of diseased patients and correct on 99% of healthy patients. However, there are way more healthy people (P(healthy) = 98%) than diseased ones (P(COVID) = 2%). Out of the huge number of healthy people, 1% are incorrectly diagnosed with the disease. If the whole population were tested, 98% × 1% ≈ 1% of the entire population would be tested positive despite being healthy. Contrast this with the total number of diseased people: 1% of the population. Hence, most of the people with a positive test are healthy.
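The whole calculation only involves the four joint probabilities from the table above, so it is easy to reproduce. A small sketch (variable names are my own):

```python
# Joint probabilities from the COVID test table
p_pos_covid, p_pos_healthy = 0.010, 0.010
p_neg_covid, p_neg_healthy = 0.002, 0.976

# P(+ | COVID): sensitivity, about 0.83
sensitivity = p_pos_covid / (p_pos_covid + p_neg_covid)
# P(- | healthy): specificity, about 0.99
specificity = p_neg_healthy / (p_neg_healthy + p_pos_healthy)
# P(COVID | +): probability of disease given a positive test
p_covid_given_pos = p_pos_covid / (p_pos_covid + p_pos_healthy)

print(sensitivity, specificity, p_covid_given_pos)  # 0.833..., 0.989..., 0.5
```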
The previous example is not just a brain teaser but important for medical
practice. False positives often have severe consequences (think: quarantine,
mental health issues, expensive medicine, unnecessary surgery). Testing a large
proportion of a population can therefore cause more harm than good if a) the
test is not reliable enough, or b) the disease is too rare. Especially b) is rarely
part of the public discourse, but at least as important as a).
Let’s continue with some useful results on conditional probabilities.
⁵ Here we use that for two events A, B, it holds P(A) = P(A ∩ B) + P(A ∩ B^c). This follows from the third axiom because A can be partitioned into the two disjoint sets A ∩ B and A ∩ B^c.
Lemma 2.23. Let A, B be events with P(A), P(B) > 0. Then:
(i) if A and B are independent, P(A | B) = P(A);
(ii) P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A).

Proof. (i) Recall that independence of A and B is equivalent to P(A ∩ B) = P(A)P(B). By the definition of conditional probabilities (Definition 2.21), it holds

    P(A | B) = P(A ∩ B) / P(B) = P(A)P(B) / P(B) = P(A).

(ii) follows by rearranging Definition 2.21.
Example 2.24. Draw two cards from a deck (of 52 cards), without replacement. Let A be the event that the first draw is the Ace of Clubs and let B be the event that the second draw is the Queen of Diamonds. Then

    P(A ∩ B) = P(A)P(B | A) = (1/52) × (1/51).

Note that A and B are not independent, because if A happens, the second draw cannot be the Ace of Clubs (this card was removed from the deck).
There are two more useful formulas related to conditional probabilities. The first, Bayes' theorem⁶, is a direct consequence of Lemma 2.23.

Theorem 2.25 (Bayes' theorem). Let A, B be events with P(A), P(B) > 0. Then

    P(A | B) = P(B | A)P(A) / P(B).

⁶ Named after Reverend Thomas Bayes, who invented the concept of conditional probability in 1763. (Yes, that long ago!)
Proof. We have P(A | B)P(B) = P(A ∩ B) = P(B | A)P(A) by Lemma 2.23(ii); dividing both sides by P(B) gives the result.

This also tells us that P(A | B) = P(B | A) only if P(A) = P(B). Otherwise, the two conditional probabilities are only proportional up to a factor determined by the relative probabilities.
The last result relates to partitions of the sample space. A partition of Ω is a (finite or infinite) sequence of disjoint events A_i such that ⋃_i A_i = Ω. The next result shows that any unconditional probability can be computed from a sum of weighted conditional probabilities.

Theorem (Law of total probability). Let A1, A2, . . . be a partition of Ω with P(A_i) > 0 for all i. Then for any event B,

    P(B) = Σ_i P(B | A_i)P(A_i).
Example. I divide my incoming email into three categories:
• A1 = 'spam',
• A2 = 'low priority',
• A3 = 'high priority'.
Let B be the event that the email contains the word 'free'. From previous experience, P(A1) = 0.7, P(A2) = 0.2, P(A3) = 0.1, and P(B | A1) = 0.9, P(B | A2) = 0.01, P(B | A3) = 0.01.
Now suppose I receive a new email containing the word 'free'. What is the probability that it is spam? Bayes' theorem and the law of total probability yield

    P(A1 | B) = P(B | A1)P(A1) / P(B)
              = P(B | A1)P(A1) / Σ_{i=1}^3 P(B | A_i)P(A_i)
              = (0.9 × 0.7) / ((0.9 × 0.7) + (0.01 × 0.2) + (0.01 × 0.1))
              = 0.995.
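The same computation in code, a sketch with the priors and likelihoods as given above:

```python
# Law of total probability + Bayes for the spam example
priors = [0.7, 0.2, 0.1]         # P(A1), P(A2), P(A3)
likelihoods = [0.9, 0.01, 0.01]  # P(B | A1), P(B | A2), P(B | A3)

# P(B) by the law of total probability
p_b = sum(l * p for l, p in zip(likelihoods, priors))
# P(A1 | B) by Bayes' theorem
posterior_spam = likelihoods[0] * priors[0] / p_b

print(round(posterior_spam, 3))  # 0.995
```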
2.5 Random variables

Definition. A random variable is a mapping

    X : Ω → R

that assigns a real number X(ω) to each outcome ω.
Example 2.29. Flip a coin 5 times. Let X(ω) be the number of heads in the
sequence ω. For example, if ω = HHT T H, then X(ω) = 3.
Random variables provide the link between sample spaces and events on the one hand and the data on the other. In general, a random variable is any quantity whose actual value is random.
2.6 Distribution functions

Definition 2.31. The cumulative distribution function (CDF) of a random variable X is the function F_X(x) = P(X ≤ x), x ∈ R.

Example 2.32. Suppose we flip a fair coin (so P(H) = P(T) = 1/2) twice and let X be the number of heads. Then P(X = 0) = P(X = 2) = 1/4 and P(X = 1) = 1/2. The corresponding CDF is

    F_X(x) = 0     for x < 0
             1/4   for 0 ≤ x < 1
             3/4   for 1 ≤ x < 2
             1     for x ≥ 2.
[Figure: the CDF F_X of Example 2.32, a right-continuous step function with jumps at x = 0, 1, 2.]
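Since the number of heads in two fair flips is Binomial(2, 1/2), both the PMF and the CDF can be reproduced with scipy; a quick sketch:

```python
import numpy as np
from scipy.stats import binom

X = binom(n=2, p=0.5)  # number of heads in two fair coin flips

print(X.pmf([0, 1, 2]))           # [0.25 0.5  0.25]
print(X.cdf([-1, 0, 0.5, 1, 2]))  # [0.   0.25 0.25 0.75 1.  ]
```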
CDFs have a few useful properties. While Definition 2.31 gives a probabilistic definition of CDFs, such functions can also be characterized analytically: a function F is a CDF if and only if it is non-decreasing, right-continuous, and satisfies lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1. The proof isn't hard but quite boring. Let's turn to more interesting things.
⁷ A set is countable if we can assign each of its elements a natural number. Examples are finite sets, e.g., {0, 1}, or countably infinite sets like N, Z, or Q.
2.7 Probability mass and density functions

Definition. For a discrete random variable X taking values x1, x2, . . . , the probability mass function (PMF) is f_X(x) = P(X = x).

The following properties follow directly from the definition of probabilities and the CDF:
• For all x we have f_X(x) ≥ 0, and Σ_i f_X(x_i) = 1.
• F_X(x) = Σ_{x_i ≤ x} f_X(x_i).
In fact, we implicitly derived the CDF in Example 2.32 from the PMF f_X. Here's another common example:

Example. Let X take the value 1 with probability p and 0 with probability 1 − p (a Bernoulli variable). The CDF is

    F(x) = 0       for x < 0
           1 − p   for 0 ≤ x < 1
           1       for x ≥ 1.
Definition. A random variable X is called continuous if its CDF F_X is a continuous function.

Maybe it's not obvious, but this definition already implies that X can take uncountably many values. (Otherwise the CDF would have to jump somewhere.) In this case, the concept of a PMF is meaningless, because P(X = x) = 0 for all x.⁸ Instead, we use a slightly different concept, a density function.
⁸ The sum of uncountably many strictly positive numbers is always infinite. Hence, P(X = x) > 0 is only possible for countably many x.
Definition. A function f_X is called the probability density function (PDF) of X if
• f_X(x) ≥ 0,
• ∫_{−∞}^{∞} f_X(t) dt = 1,
• F_X(x) = ∫_{−∞}^{x} f_X(t) dt.
Example 2.39 (Uniform distribution). Suppose that a < b and X has PDF

    f_X(x) = 1/(b − a)   for a < x < b,
             0            otherwise.

Clearly, f_X(x) ≥ 0 and ∫_{−∞}^{∞} f_X(x) dx = 1. A random variable with such a PDF is said to have a Uniform(a, b) distribution. The CDF is given by

    F_X(x) = 0                  for x < a
             (x − a)/(b − a)    for a ≤ x ≤ b
             1                  for x > b.
The PDF and CDF of the Uniform(0, 1) distribution are shown below:
[Figure: PDF and CDF of the Uniform(0, 1) distribution.]
The graph of the PDF indicates that all values in the interval (0, 1) are equally
likely, which explains the distribution’s name.
Note that, unlike a PMF, a PDF can be larger than 1 (and even unbounded!).
Example 2.40. Let f(x) = (2/3)x^{−1/3} for 0 < x < 1 and f(x) = 0 otherwise. Then f(x) ≥ 0 and ∫_{−∞}^{∞} f(x) dx = 1, but f is unbounded.
2.8 Bivariate distributions

From a joint distribution, we can also extract the distributions of the individual variables. The latter are called marginal distributions.
Definition 2.45. If (X, Y) has the joint mass function f_{X,Y}, the marginal mass function for X is

    f_X(x) = P(X = x) = Σ_y P(X = x, Y = y) = Σ_y f(x, y).
As you can see, we extract the marginal PMFs by summing the joint PMF over all possible values of the other variable. For example, in Example 2.44 we have f_X(0) = 1/3 and f_Y(1) = 2/3.
This is the PDF of the uniform distribution on the unit square. Suppose we want to compute

    P(X ≤ 1/2, Y ≤ 1/2) = F_{X,Y}(1/2, 1/2) = P(X < 1/2, Y < 1/2).

Since the density equals 1 on the unit square, this probability is just the area of the square [0, 1/2]², i.e. 1/4.
Marginal densities are constructed like in the discrete case, just replacing the
sum by an integral.
Definition 2.48. For continuous random variables (X, Y) with a joint PDF f(x, y), the marginal PDFs are

    f_X(x) = ∫ f(x, y) dy,    f_Y(y) = ∫ f(x, y) dx.
2.9 Conditional distributions

Definition. For continuous random variables (X, Y) with joint PDF f_{X,Y}, the conditional density of X given Y = y is

    f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y),    provided f_Y(y) > 0.

Then

    P(X ∈ A | Y = y) = ∫_A f_{X|Y}(x | y) dx.
2.10 Independence (ctd')

Definition. Two random variables X and Y are independent, written X ⊥ Y, if P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B) for all sets A and B.

The interpretation is the same as for events: two variables are independent if their outcomes are entirely unrelated (do not influence each other). Independence can also be characterized using densities (this follows immediately from their definitions).
Theorem 2.53. Let X and Y have joint PDF (or PMF) fX,Y . Then X ⊥ Y
if and only if fX,Y (x, y) = fX (x)fY (y) for all (x, y).
Example 2.54. Let X and Y have the joint distribution as in the following table (the right column and bottom row contain the marginals):

            Y = 0   Y = 1
    X = 0   1/4     1/4     1/2
    X = 1   1/4     1/4     1/2
            1/2     1/2     1

Then X and Y are independent, which can be verified by the previous theorem. For example, f(0, 0) = 1/4 = f_X(0)f_Y(0), and similarly for the other cases.
Example 2.55. Let X and Y be independent and both have the same PDF

    f(x) = 2x   if 0 ≤ x ≤ 1,
           0    otherwise.

Suppose we want to find P(X + Y ≤ 1). Thanks to independence,

    f(x, y) = f_X(x)f_Y(y) = 4xy   if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1,
                             0     otherwise.

So

    P(X + Y ≤ 1) = ∫∫_{x+y≤1} f(x, y) dx dy = 1/6.
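A quick Monte Carlo check of this integral; a sketch using inverse-transform sampling, which works here because the CDF is F(x) = x² on [0, 1], so X = √U for U uniform:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10**6

# X and Y independent with density f(x) = 2x on [0, 1]:
# F(x) = x^2, so X = sqrt(U) with U ~ Uniform(0, 1).
x = np.sqrt(rng.uniform(size=n))
y = np.sqrt(rng.uniform(size=n))

print((x + y <= 1).mean())  # close to 1/6 ≈ 0.1667
```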
Let’s back up for a second and think about the bigger picture. Our goal in
this course is to learn from data. These data are modeled as random variables
X1 , . . . , Xn . In many circumstances it is reasonable to assume these random
variables are independent. For example, if Xj is the jth flip of a coin, we have no
reason to assume that the outcome of one flip is affecting another. The situation
is similar in other repeated experiments or some measurements taken on distinct
objects (e.g., stars, galaxies, . . . ).
We need to be a little bit more precise here. Independence of X1, . . . , Xn is more than just pairwise independence of all X_i, X_j:

Definition. X1, . . . , Xn are (mutually) independent if

    P(X1 ∈ A1, . . . , Xn ∈ An) = P(X1 ∈ A1) × · · · × P(Xn ∈ An)

for all sets A1, . . . , An.
2.11 Transforms
Sometimes we know the distribution of a random variable X, but are interested in the distribution of a transformation r(X). The following result comes in handy.

Theorem. Let X be continuous with PDF f_X and let Y = r(X), where r is invertible¹⁰ and differentiable. Then the PDF of Y is

    f_Y(y) = f_X(r^{−1}(y)) / |r′(r^{−1}(y))|.

Similar results exist for discrete variables and joint densities. You can look them up yourself¹¹ whenever there's a need. The above result will be good enough for this course.
[Figure: the transformation r(x) = x³.]

Example. Let Y = r(X) = X³. Now r^{−1}(y) = y^{1/3} and dr(y)/dy = 3y². So

    f_Y(y) = f_X(y^{1/3}) / |3(y^{1/3})²| = f_X(y^{1/3}) / (3|y^{2/3}|).
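To convince yourself the formula is right, compare simulated values of Y = X³ with the transformed density. A sketch with X ~ Uniform(0, 1), for which the formula gives f_Y(y) = 1/(3y^{2/3}) on (0, 1) and hence P(Y ≤ t) = t^{1/3}:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(size=10**6)
y = x**3

# Transformation formula with r(x) = x^3 and f_X = 1 on (0, 1):
# f_Y(y) = 1 / (3 * y**(2/3)), so P(Y <= t) = t**(1/3).
t = 0.2
print((y <= t).mean())  # simulated probability
print(t ** (1 / 3))     # from the formula, ≈ 0.585
```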
2.12 Expectation
Distributions are functions and, thus, fairly complex objects. Instead, we can also look at summaries of the distribution. The most important summary is the expected value E[X]. It tells us what value of a random variable X we can expect to see on average. For the formal definition, we once again need to distinguish the discrete and continuous cases.
¹⁰ A function is invertible if and only if it is strictly monotone (increasing or decreasing). Whenever you can compute the derivative, it is differentiable. This will rarely be an issue.
¹¹ https://en.wikibooks.org/wiki/Probability/Transformation_of_Probability_Densities
Definition 2.60. The expected value of a random variable X is

    E[X] = ∫_Ω x dF(x) = Σ_i x_i f_X(x_i)   if X is discrete,
                         ∫ x f_X(x) dx      if X is continuous.

The integral ∫_Ω x dF(x) has a precise measure-theoretic meaning that you don't need to worry about. You may just treat it as shorthand notation for one of the two cases on the right.
In both cases, the expected value is an average over all possible outcomes of X, weighted by the likelihood of occurrence. Another way to think about it is the following approximation: repeat the same experiment many times and average all outcomes; then E[X] ≈ (1/n) Σ_{i=1}^n X_i for a large number of iid draws X1, . . . , Xn with the same distribution as X. While this is only an approximation, it helps to understand the meaning of the number E[X]. Let's see some examples.
Example 2.62. Let X denote the outcome of a single throw of a fair die. Then E[X] = (1 + 2 + · · · + 6) × 1/6 = 3.5.

Example. Let X ~ Uniform(1, 2). Then E[X] = ∫_1^2 x dx = 3/2. Hence, if we draw uniformly from the interval (1, 2) many times, we expect to see a value of 1.5 on average.
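The averaging interpretation is easy to check by simulation; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10**6

die = rng.integers(1, 7, size=n)  # fair die throws (values 1..6)
unif = rng.uniform(1, 2, size=n)  # draws from Uniform(1, 2)

print(die.mean())   # close to E[X] = 3.5
print(unif.mean())  # close to E[X] = 1.5
```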
As you can see, computing the expectation of a random variable can be fairly easy. It's similarly easy to compute the expectation of a transformed random variable.

Theorem 2.64. Let X have PMF/PDF f_X and let Y = r(X). Then

    E[Y] = E[r(X)] = Σ_i r(x_i) f_X(x_i)   in the discrete case,
                     ∫ r(x) f_X(x) dx      in the continuous case.

Example. Let X ~ Bernoulli(p) and Y = X². Then

    E[Y] = E[X²] = 1² × p + 0² × (1 − p) = p.
The same also works when more than one random variable is involved, just that we need to use the joint PMF/PDF as a weight in the sum/integral. For example, for two continuous random variables X1, X2 and Y = r(X1, X2),

    E[Y] = E[r(X1, X2)] = ∫∫ r(x1, x2) f_{X1,X2}(x1, x2) dx1 dx2.
The expectation has the nice property of being linear, which just means that you can pull sums and constants out of it. If you think about our approximation of E[X] as averaging over many draws from an experiment, this makes sense (and it follows immediately from Definition 2.60).

Theorem 2.67 (Linearity of expectation). For random variables X1, . . . , Xn and constants a1, . . . , an,

    E[Σ_{i=1}^n a_i X_i] = Σ_{i=1}^n a_i E[X_i].
Example 2.68. Let Y1, . . . , Yn iid ~ Bernoulli(p) and X = Σ_{i=1}^n Y_i. Then we say X has a Binomial(n, p) distribution. By linearity of the expectation, it holds

    E[X] = Σ_{i=1}^n E[Y_i] = np.
While sums of random variables are easy to handle, this is not in general true for products. A convenient and common special case is when the variables are independent.

Theorem. If X1, . . . , Xn are independent, then E[∏_{i=1}^n X_i] = ∏_{i=1}^n E[X_i].
Proof. Let's just look at the continuous case and n = 2. If X1, X2 are independent, the joint density f_{X1,X2} is just the product of the marginal densities f_{X1} × f_{X2}. Hence, by Theorem 2.64,

    E[X1 X2] = ∫∫ x1 x2 f_{X1,X2}(x1, x2) dx1 dx2
             = ∫∫ x1 x2 f_{X1}(x1) f_{X2}(x2) dx1 dx2
             = ∫ x1 f_{X1}(x1) dx1 × ∫ x2 f_{X2}(x2) dx2
             = E[X1] × E[X2].
2.13 Variance and standard deviation

Definition. The variance of a random variable X is V[X] = E[(X − E[X])²], and the standard deviation is sd[X] = √V[X].

Both variance and standard deviation are one-number summaries of the distribution. They answer the question: "how much does X fluctuate around its mean?" In the extreme case V[X] = 0, there is no variability at all and X is just a constant. While the variance is easier to calculate with, the standard deviation is easier to interpret. Because we square what's inside the expectation, we need to take the square root of the result to bring it back to the original scale/units. Keep that in mind when reporting or reading about these measures.
A useful identity is V[X] = E[X²] − E[X]². You can verify this as an exercise; the key ingredient is linearity of the expectation (Theorem 2.67).

2.14 Covariance and correlation

Definition. The covariance of two random variables X and Y is

    Cov[X, Y] = E[(X − E[X])(Y − E[Y])],

and their correlation is

    ρ_{X,Y} = ρ(X, Y) = Cov[X, Y] / (σ_X σ_Y),

where σ_X, σ_Y denote the standard deviations of X and Y.
Similar to the variance, the covariance is a bit harder to interpret (just think about its units), mainly because it mixes two things: i) the individual variability of X and Y, and ii) the dependence between X and Y. The correlation is a standardized version that takes out the variability part. That makes it a pure measure of dependence, which is typically what we want.

Theorem. (i) Cov[X, Y] = E[XY] − E[X]E[Y]. (ii) −1 ≤ ρ(X, Y) ≤ 1. (iii) |ρ(X, Y)| = 1 if and only if Y = aX + b for some constants a ≠ 0 and b.

The first property is useful mainly for calculations. The second tells us that the correlation is standardized to the interval [−1, 1]. The sign and magnitude of the correlation tell us what kind of dependence we are dealing with. Let's consider the extreme cases. As shown in (iii), the correlation has absolute magnitude 1 when two variables are perfectly linearly related. The sign tells us how:
• ρ(X, Y) > 0: X and Y tend to be both large or both small at the same time.
• ρ(X, Y) < 0: large values of X tend to occur with small values of Y and vice versa.
• ρ(X, Y) = 0: there is no linear relationship between X and Y (the variables are uncorrelated).
Theorem. For any two random variables X1, X2,

    V[X1 + X2] = V[X1] + V[X2] + 2 Cov[X1, X2].

In particular, V[X1 + X2] = V[X1] + V[X2] if X1 and X2 are independent.

We see that, in general, we need an extra correction term for the variance of a sum. To understand why, consider the case where X1 = −X2. Then clearly X1 + X2 = 0, so there is no variability at all. If we just summed up the variances of X1 and X2, we would get 2V[X1] ≠ 0. The covariance term in the theorem above fixes this.
Theorem (Tower rule). For random variables X and Y,

    E[E[X | Y]] = E[X].

The Tower rule¹² is a convenient tool for computations, as we'll see in a minute. Let's first verify that the formula makes sense. Suppose we know the exact function µ(y) stating what body height to expect given every possible value of the body weight y > 0. Now we start drawing random people from the population, check their weight Y, and compute the expected height Z = µ(Y). On average, our guesses Z should equal the average height in the population E[X].

¹² You see what I did there?
There is also a similar result for the variance (the 'law of total variance'). We won't need it for this course, but it's good to know it exists.
3 Descriptive statistics and exploratory data analysis
The last chapter introduced the fundamentals of probability theory to set the stage for the main objective of this course: doing statistics (or 'analyzing data'). Recall the basic problem of statistics:

Having observed some data X1, . . . , Xn, what can we say about the mechanism that generated them?

Definition. A sequence of random variables Y1, Y2, . . . converges in probability to Y, written Yn →p Y, if for every ε > 0,

    lim_{n→∞} P(|Yn − Y| > ε) = 0.

In plain words, Yn →p Y means: as we get more and more data (n → ∞), the probability that Yn is away from Y goes to 0. You could also write the definition the other way around: for every ε > 0,

    lim_{n→∞} P(|Yn − Y| ≤ ε) = 1.
Theorem 3.3 (The law of large numbers, LLN). Let X1, . . . , Xn be iid random variables with E[Xi] = µ < ∞ and define X̄n = (1/n) Σ_{i=1}^n X_i. Then

    X̄n →p µ.
While we can't know µ, X̄n is something we observe. The LLN implies that the sample mean X̄n is a reasonable approximation of µ. Hence, X̄n gives us a "feeling" for what the actual mean µ might be. The LLN makes this intuition mathematically precise. It allows us to learn about the expected value of an unknown random mechanism just from seeing the data.
Example 3.4. Let’s illustrate the LLN with a small experiment: We simulate
X1 , . . . , Xn ∼ Bernoulli(0.5) and compute X̄n for each n. We repeat this exper-
iment three times. By the law of large numbers, we expect the three resulting
sequences to converge to the expected value E[X1 ] = 0.5. The results are shown
in Figure 3.1. Each line (color) corresponds to a sequence X̄1 , X̄2 , X̄3 , . . . , one
Figure 3.1: The law of large numbers in action (Example 3.4). Each line corresponds to a sequence X̄1, X̄2, . . . , X̄n computed from iid Bernoulli(0.5) draws.
line for each repetition of the experiment. We see that for small n, X̄n can be
quite far away from the mean. As we increase the amount of data, all three lines
seem to stabilize around 0.5. However, the three lines are different, reflecting
the randomness of our sample. The green line lies mainly above 0.5, the others
mainly below. The LLN states that, despite this randomness, it becomes less and
less likely that one of the lines ends up away from 0.5.
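The experiment behind Figure 3.1 takes only a few lines of code; a sketch using numpy (and matplotlib for the plot):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n = 3000

for _ in range(3):  # three repetitions of the experiment
    x = rng.binomial(1, 0.5, size=n)                    # Bernoulli(0.5) draws
    running_mean = np.cumsum(x) / np.arange(1, n + 1)   # X̄_1, X̄_2, ..., X̄_n
    plt.plot(running_mean)

plt.axhline(0.5, linestyle="--")  # the expected value E[X_1] = 0.5
plt.xlabel("n")
plt.ylabel("sample mean")
plt.show()
```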
Remark 3.1. Just as a side note: there is a stronger version of the LLN called the strong law of large numbers (SLLN). It involves a different notion of convergence called almost sure convergence. It states that the probability that the sequence X̄n converges to µ is exactly 1: P(lim_{n→∞} X̄n = µ) = 1. Here we're making a probability statement about the limit lim_{n→∞} X̄n. Convergence in probability, in contrast, is a statement about convergence of probabilities. For us, convergence in probability will be enough, but it's good to have heard about almost sure convergence.
Less formally, an estimator is any number that you compute from data.
Definition. An estimator θ̂n of a quantity θ is called consistent if θ̂n →p θ.
Definition 3.8.
• Sample variance: S²n = (1/n) Σ_{i=1}^n (X_i − X̄n)² = (1/n) Σ_{i=1}^n X_i² − (X̄n)².
• Sample standard deviation: Sn = √S²n.
Theorem 3.9. If X1, . . . , Xn are iid samples from a distribution F, it holds S²n →p V[X] and Sn →p √V[X], for a random variable X ~ F.
Proof. By the LLN, X̄n →p E[X] and, thus, (X̄n)² →p E[X]². Similarly (defining Yi = Xi²), the LLN gives that (1/n) Σ_{i=1}^n X_i² →p E[X²]. In combination this yields

    S²n = (1/n) Σ_{i=1}^n X_i² − (X̄n)² →p E[X²] − E[X]² = V[X].

Finally, because the square root is a continuous function, Sn = √S²n →p √V[X].
Similarly, we can define estimators of the covariance and correlation for two-
dimensional iid data (X1 , Y1 ), . . . , (Xn , Yn ):
Definition 3.10.
• Sample covariance: Cn = (1/n) Σ_{i=1}^n (X_i − X̄n)(Y_i − Ȳn).
• Sample correlation: Rn = Cn / (√S²n,X √S²n,Y),
where

    S²n,X = (1/n) Σ_{i=1}^n (X_i − X̄n)²,    S²n,Y = (1/n) Σ_{i=1}^n (Y_i − Ȳn)².
Again using the LLN, we find that Cn and Rn are consistent estimators for
Cov[X, Y ] and ρ(X, Y ), respectively.
Computing these summaries at the very start of a data analysis is a good idea.
They give us a “feeling” of the behavior of certain variables: location, variability,
and dependence.
Let’s see a few examples. Figures 3.2 to 3.4 show scatterplots of two variables
X and Y : each dot represents one sample (Xi , Yi ) in the data set (with Xi drawn
on the x-axis and Yi on the y-axis). The figure captions contain the sample means,
standard deviations, and correlations computed from these data sets. As you
can see from Figure 3.2, X is located around 0.5, Y around 3. The variability
of X is much larger than the variability of Y , which is reflected in the sample
standard deviations. The correlation is around 0, so there’s not much dependence
going on. In Figure 3.3, the sample means and standard deviations remain the
same, but now we have a correlation of 0.62. We can see this dependence in
the scatterplot by the upward trend in the data: when X is small, Y tends to
be small; when X is large, Y tends to be large. This is what we call positive
dependence. In Figure 3.4 the correlation changes to −0.82. So now we have negative dependence, which reflects the downward trend: when X is small, Y tends to be large; when X is large, Y tends to be small.
¹ Everything computed from the actual (unknown) distribution F is called a "population version". Everything computed from data is called a "sample version".
Figure 3.2: Example with X̄n = 0.54, Ȳn = 3.02, Sn,X = 2.85, Sn,Y = 1.02,
Rn = −0.04.
Figure 3.3: Example with X̄n = 0.54, Ȳn = 3.02, Sn,X = 2.85, Sn,Y = 1.02,
Rn = 0.62.
Figure 3.4: Example with X̄n = 0.54, Ȳn = 3.02, Sn,X = 2.85, Sn,Y = 1.02,
Rn = −0.82.
Figure 3.5: The Datasaurus: all scatterplots have the same mean, standard devi-
ation, and correlation.
So how does this help to estimate F? You can verify that F(x) = P(X ≤ x) = E[1(X ≤ x)]. So estimating F isn't much different from estimating an expectation (except that we have to estimate one expectation for every x ∈ R).
Definition. The empirical cumulative distribution function (ECDF) of data X1, . . . , Xn is

    Fn(x) = (1/n) Σ_{i=1}^n 1(X_i ≤ x),

where 1(X_i ≤ x) equals 1 if X_i ≤ x and 0 otherwise.

[Figure 3.6: the ECDF of n = 5 observations.]
(Similarly, we can define Fn(x, y) = (1/n) Σ_{i=1}^n 1(X_i ≤ x, Y_i ≤ y) as the ECDF of a bivariate distribution.)
The formula is very intuitive. Recall that 1(X_i ≤ x) = 1 for exactly those X_i with X_i ≤ x. (Otherwise 1(X_i ≤ x) = 0.) Hence, Σ_{i=1}^n 1(X_i ≤ x) is counting how many observations X_i are less than x. Dividing by n gives us the proportion of samples that are less than x. Intuitively, this proportion should be a good approximation of the probability that X ≤ x.
The ECDF Fn is a function very similar to the CDF of a discrete random
variable. In fact, it is the CDF of a discrete random variable X̃ that puts
probability mass 1/n on each data point. As a result, the ECDF is right continuous
and jumps by 1/n at every observation. You can see this in an example with
n = 5 in Figure 3.6. The crosses represent the location of our five observations
X1 , . . . , X5 . The ECDF starts out at 0 until we encounter the first observation
(coming from the left). At each of the observations the ECDF jumps up by
1/5 = 0.2 until it reaches 1, from where on it remains constant.
We can show that the ECDF Fn (x) is a consistent estimator for F (x). In fact
it is consistent uniformly, i.e. for all values of x at the same time.
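The ECDF is essentially a one-liner with numpy; a sketch (np.searchsorted on the sorted sample counts the observations ≤ x):

```python
import numpy as np

def ecdf(data, x):
    """Evaluate the ECDF of `data` at the points `x`."""
    data = np.sort(data)
    # number of observations <= x, divided by n
    return np.searchsorted(data, x, side="right") / len(data)

rng = np.random.default_rng(5)
sample = rng.uniform(size=1000)
print(ecdf(sample, [0.25, 0.5, 0.9]))  # close to [0.25, 0.5, 0.9], since F(x) = x
```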
[Figure 3.7: ECDFs of simulated Uniform(0, 1) samples with n = 10, 100, 1000, together with the true CDF F(x) = x.]
Theorem 3.12 (Glivenko–Cantelli theorem). Let X1, . . . , Xn iid ~ F. Then

    sup_{x∈R} |Fn(x) − F(x)| →p 0.
This convergence is visualized in Figure 3.7. We simulate data sets from the Uniform(0, 1) distribution of increasing size n. The true CDF is F(x) = x, which is shown as the straight diagonal line. For n = 10 we are fairly close in some regions, but far away in others (around x = 0.7). As n increases, we get closer and closer to the true CDF. And we do so in a way that is uniform, in the sense that there's no region where our approximation remains bad. For n = 1000 the true CDF F and the ECDF Fn are hardly distinguishable.
The histogram is constructed as follows:

(i) For some x0 < xK, divide the interval (x0, xK] into K bins B_k = (x_{k−1}, x_k] of equal size δ = x1 − x0 = x2 − x1 = · · ·.
(ii) Count the number N_k of observations falling into bin B_k.
(iii) For x ∈ B_k, estimate the density by

    h_n(x) = N_k / (n × δ).
This process is visualized in Figure 3.8. The data is shown as crosses in the
top panel. Then the interval (0, 3.5] is divided into 7 bins of equal size (mid
panel). Then we count the number of observations per bin to compute the relative
frequencies in step 3 (bottom panel).
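The three steps translate directly into code. A sketch that computes the histogram heights by hand and checks them against numpy's built-in density histogram (the data and the rule-of-thumb bin number are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.exponential(size=200)

K = int(2 * len(data) ** (1 / 3))           # rule-of-thumb number of bins
edges = np.linspace(0, data.max(), K + 1)   # step (i): bins of equal size
counts, _ = np.histogram(data, bins=edges)  # step (ii): N_k per bin
delta = edges[1] - edges[0]
heights = counts / (len(data) * delta)      # step (iii): h_n = N_k / (n * delta)

# numpy's density-normalized histogram gives the same heights
np_heights, _ = np.histogram(data, bins=edges, density=True)
print(np.allclose(heights, np_heights))  # True
```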
The histogram is extremely powerful. With a single glance we get a good feeling
for the shape of the entire distribution. Two important characteristics to look out
for are skew and modality. Skew is related to symmetry: a distribution is called
right-skewed if the histogram/density is leaning to the left and left-skewed if it is
leaning to the right. (I know that’s confusing, but it is what it is.) Modality tells
us about potential clusters (showing up as bumps in the graph). Each bump is
called a mode; for example, a density with two distinct bumps is called bimodal.
You can see some exemplary graphs in Figure 3.9.
Example 3.13. Figure 3.10 shows a histogram for the metallicity of globular clusters in the Milky Way (relative to the sun) with K = 10 bins. We see that the distribution is slightly right-skewed and potentially bimodal (with one bump around −1.5 and a possible second one at −0.9). But why did I choose 10 bins? In Figure 3.11 we see histograms for the same data, but this time with K = 2 (left) and K = 100 (right). There's not much we can learn from two bins; the graph is hiding most of the information in two huge blocks. On the other hand, the graph with 100 bins is extremely erratic, and we wouldn't expect the true density to have a shape with that many peaks and troughs.
This example illustrates that the number of bins is crucial for getting meaningful information out of a histogram. Choosing this number is, unfortunately, more or less guesswork. A good rule of thumb is K ≈ 2n^{1/3}. (There's some theory behind this, but every data set is different.) In practice, we usually try a few values and see what works best.
Let's conclude with a theorem on the consistency of the histogram. This (almost) follows from the Glivenko–Cantelli theorem.

Theorem 3.14. Let X1, . . . , Xn iid ~ f, and let K → ∞ and K/n → 0 as n → ∞. Then h_n(x) →p f(x) at every point x where f is continuous.
Figure 3.8: Construction of a histogram: the data is shown in the top panel, the interval (0, 3.5] is divided into bins (mid panel), and the relative frequencies are drawn in the bottom panel.
Figure 3.9: Common shapes of a distribution. Histograms with the true density superimposed as an orange line.
Figure 3.10: Histogram for the metallicity of globular clusters in the Milky Way (relative to the sun).
Figure 3.11: Histograms for the metallicity of globular clusters in the Milky Way (relative to the sun) with too few and too many bins.
Definition. The quantile function of a random variable X with CDF F is

    Q(p) = inf{x ∈ R : F(x) ≥ p},    p ∈ (0, 1).

The definition is a bit weird, so some discussion is in order. First consider the case where F is continuous and strictly increasing. Then Q(p) = F^{−1}(p) is just the inverse function of the CDF F. The weirdness only comes in for discrete distributions, where the CDF is not strictly increasing and, thus, no inverse exists. Conceptually, the definition above is equivalent to an inverse function though.
Quantiles answer the question: which value of X is not exceeded with probability p? For example, if Q(0.01) = −5, then the probability that X is less than −5 is 1%. More generally, Q(p) divides the real line into two parts: the first part, X ≤ Q(p), has probability p, and the second part, X > Q(p), has probability 1 − p. Given data X1, . . . , Xn, we define the sample p-quantile by plugging in the ECDF:

    Qn(p) = inf{x ∈ R : Fn(x) ≥ p}.
Let's write this in a more intuitive way. Denote by ⌈a⌉ the smallest integer k with k ≥ a ('rounding up'). Then a proportion of at most p of the data points is less than Qn(p), and a proportion of at most 1 − p is larger than Qn(p). To compute this number, do the following: first sort the data X1, . . . , Xn in ascending order. This gives an ordered data set X(1), . . . , X(n), where X(k) denotes the kth smallest observation. Then set²

    Qn(p) = X(⌈np⌉).
² Most software/books make an adjustment if np is an integer.
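In code, the sample quantile is just sorting plus indexing. A sketch (note that numpy's own np.quantile interpolates between order statistics by default, so the two can differ slightly):

```python
import numpy as np

def sample_quantile(data, p):
    """Q_n(p) = X_(ceil(n*p)), the ceil(n*p)-th smallest observation."""
    x = np.sort(data)
    k = int(np.ceil(len(x) * p))  # rank, 1-based
    return x[k - 1]               # convert to 0-based indexing

rng = np.random.default_rng(7)
data = rng.normal(size=101)
print(sample_quantile(data, 0.5))  # sample median
print(np.quantile(data, 0.5))      # numpy's version, for comparison
```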
Some quantiles have special names:
• Q(1/4): lower quartile,
• Q(1/2): median,
• Q(3/4): upper quartile.
We compute X̄n = 3.1 and Qn(1/2) = 3. What happens to the mean and median if we change X4 to 300? Well, the median doesn't change at all, but the mean increases by an order of magnitude.
[Figure: anatomy of a boxplot, marking the lower quartile, median, upper quartile, and potential outliers.]
2. For each variable, compute and interpret summary statistics for location and
scale.
3. For each variable, plot and interpret boxplots and/or histograms. Consider: skewness, modality, outliers.
Now let’s walk through this process with some real data.
http://cas.sdss.org/dr16/en/tools/search/sql.aspx
Our goal is to get a feeling for what is going on in this data. What follows
will be a very brief version of the process, mainly because I know too little about
astronomy to tell you something interesting. Maybe you see some more interesting
things?
We see that galaxies in this data set tend to be greener than quasars, and also the variability seems slightly larger for galaxies.
[Figure: histogram of u − g (left) and boxplots of u − g by class, GALAXY vs. QSO (right).]
We can clearly identify the two populations in the u − g histogram and boxplot (galaxies and quasars). Above u − g = 1, we find almost no quasars; below this value we find almost no galaxies. The histogram for the quasars is roughly symmetric and unimodal. The histogram for the galaxies is bimodal, indicating that there may be two sub-populations of galaxies. The boxplots show a few potential outliers, but the points don't seem too crazy. It is certainly plausible that a quasar has u − g ≈ 1.5. Without further reasons, we should keep them in the data set.
Variable g − r

              Galaxies              Quasars
              mean    std. dev.     mean    std. dev.
              0.70    0.32          0.21    0.22
We see that galaxies are redder than quasars on average and also more variable.
[Figure: histogram of g − r (left) and boxplots of g − r by class (right).]
In the boxplot there is a clear outlier galaxy with g − r = −6. This looks very suspicious, and it definitely affects our data analysis. Optimally, we would now look this galaxy up online and check whether its color really is that extreme. And if it is, we still need to ask ourselves whether we want to keep it or focus on more 'normal' galaxies. If we exclude it, we should also recompute all summaries and graphs above. But let's move on for now.
Variable z (redshift)

              Galaxies              Quasars
              mean    std. dev.     mean    std. dev.
              0.08    0.05          1.30    0.69
[Figure: histogram of redshift (left) and boxplots of redshift by class (right).]
We see this more clearly in the histogram and boxplot. While all galaxies are fairly close (z < 0.2), the quasars are spread out more widely, with a center around 1.5 and one quasar with a redshift of more than 4. This quasar also shows up as a potential outlier in the boxplot, but z ≈ 4 is certainly a plausible value for a quasar.
[Figure: pairwise scatterplots of u − g, g − r, and redshift, colored by class (GALAXY, QSO). Sample correlations per panel: R(G) = 0.86, R(Q) = 0.37; R(G) = 0, R(Q) = −0.19; R(G) = 0.25, R(Q) = −0.42.]
3.7.5 Wrap up
This was a quick walk through the steps of an EDA. As a statistician, I can compute numbers and draw graphs. But that's only useful in combination with domain knowledge. As an astronomer, you should always try to interpret the numbers and graphs in context. Always ask yourself whether what you see is in line with your expectation. If something seems implausible, dig deeper. Zoom into a graph, compute summaries for interesting subsets of the data, etc.
At the end of this process you should i) have a feeling for what’s going on in
the data, and ii) trust that the data set you continue with is suitable for future
modeling steps. We’ll learn more about that in the next chapters.
4 Parametric statistical models
Let’s once again recall the basic problem of statistics.
Having observed some data X1 , . . . , Xn , what can we say about the
mechanism that generated them?
In the last chapter we learned how to get a ‘feeling’ for this mechanism. We can
now try and come up with plausible mechanisms that could have generated the
data. Since we don’t know the mechanism, what we come up with are just models
of reality. A statistical model involves randomness and is hence characterized
by a probability distribution (or density). A parametric statistical model is a family of distributions {F_θ : θ ∈ Θ} that is characterized by a parameter vector
θ ∈ Θ ⊆ Rp . Once we have a parametric model, the main question is which value
of the parameter θ fits the data best. This will be the topic of the next chapter.
In the current chapter, we shall introduce some essential statistical models.
This includes parametric families for discrete and continuous distributions as
well as multi-dimensional data and prediction problems. At the end of this
chapter, you should have heard about the most common models and know in
which situations they may or may not apply. That’s admittedly a bit boring, but
it’s the last thing we need in preparation for all the exciting things that follow in
the next chapters.
We have already seen the Bernoulli distribution earlier in the course when speaking about coin flips. We can use the (arbitrary) encoding '0 = heads, 1 = tails' to define a random variable X that represents the outcome of a single coin flip. We say that X follows a Bernoulli distribution with parameter p ∈ (0, 1), or X ~ Bernoulli(p). The parameter p = P(X = 1) is called the success probability. Note that the interpretation of this parameter depends heavily on your coding of the categories (1 = heads vs. 1 = tails).
We rarely flip coins in reality, but the distribution is everywhere nevertheless. It's quite common to put study subjects into two categories:
• yes or no answers,
• dead or alive people,
• radio-quiet and radio-loud galaxies,
• red-sequence and blue-sequence galaxies,
• metal-rich or metal-poor globular clusters
All these categories can be recoded (arbitrarily) to a binary variable that only takes values 0 or 1. When checking the category of a random object, we're again faced with the Bernoulli distribution.
If we throw a coin 10 times, how many heads do we get? That's again a toy problem of course, but we can replace the coin with other variables. Out of 100 people receiving cancer treatment, how many survive? Out of 50 random galaxies under study, how many are radio-quiet? In all these questions, we take a sum of Bernoulli variables, so the Binomial distribution arises naturally. It has two parameters: the number of trials n and the success probability p. If X ~ Binomial(n, p), its PMF is

    f(x) = (n choose x) p^x (1 − p)^{n−x},    x = 0, 1, . . . , n.

Example 4.3. Suppose that a proportion p of all galaxies are radio-quiet. Pick 50 galaxies at random and let X be the number of radio-quiet galaxies. Then X ~ Binomial(50, p).
The PMF is visualized for varying parameter choices in Fig. 4.1.¹ Note that all PMFs are zero for x > n and that they have a peak near E[X] = np.
¹ The dashed lines are only added as visual guides. The PMF is only defined where the dots are.
Figure 4.1: Probability mass function of the Binomial distribution for varying
parameter choices.
Definition 4.4. A random variable X has the Poisson(λ) distribution with rate λ > 0 if its PMF is

    f(x) = e^{−λ} λ^x / x!,    x = 0, 1, 2, . . . .

One may check E[X] = λ, Var[X] = λ. Graphs of the PMF are shown in Fig. 4.2. Note that f(x) > 0 for all x ∈ N.
The Poisson distribution can be derived formally as the distribution of the number of events occurring in a fixed period (or area/volume/. . . ), if they occur at a fixed rate and independently of the time since the last event. Such situations arise often when modeling counts of rare events:
• the number of mutations on a strand of DNA per unit length,
• the number of telephone calls arriving in a system,
• the number of radioactive decays in a given time interval.
The distribution has a few interesting properties (see the sketch below for a numerical check of the first one):
• When n is large and p is small, the Binomial(n, p) distribution is well approximated by the Poisson(λ) distribution with λ = np.
• If X1 ~ Poisson(λ1) and X2 ~ Poisson(λ2) are independent, then X1 + X2 ~ Poisson(λ1 + λ2).
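A sketch of the Poisson approximation with scipy; the choice n = 1000, p = 0.01 (so λ = 10) is arbitrary:

```python
import numpy as np
from scipy.stats import binom, poisson

n, p = 1000, 0.01
lam = n * p

# Compare the two PMFs on a range covering most of the mass
x = np.arange(25)
max_diff = np.max(np.abs(binom.pmf(x, n, p) - poisson.pmf(x, lam)))
print(max_diff)  # small: the two PMFs nearly coincide
```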
Figure 4.2: Probability mass function of the Poisson distribution for varying
parameter choices.
Example 4.5. A distant quasar emits 10^{64} photons per second in the X-ray band, but at an earth-orbiting X-ray telescope only ca. 10^{−3} photons arrive per second. In a typical observation period of 10^4 seconds, only around 10^1 of the n ≈ 10^{68} photons emitted during the observation period arrive, giving p ≈ 10^{−67}. The number of photons arriving can be thought of as Binomial(n, p)-distributed, and the latter is well approximated by the Poisson(λ) distribution with λ = np ≈ 10.
Definition. The Exponential(λ) distribution with rate² λ > 0 has PDF

    f(x) = λ e^{−λx}   for x ≥ 0,
           0           otherwise.

² Some authors use a different parametrization with 'scale' parameter α = 1/λ.
Figure 4.3: Probability density function of the Exponential distribution for vary-
ing parameter choices.
Definition. The Gamma distribution with shape k > 0 and scale θ > 0 has PDF

    f(x) = x^{k−1} e^{−x/θ} / (Γ(k) θ^k)   for x > 0.

It generalizes the exponential distribution, which is the special case k = 1. Densities for varying shape and scale choices are shown in Fig. 4.5.
Figure 4.5: Probability density function of the Gamma distribution for varying
parameter choices.
Example 4.8. The Gamma distribution has been studied extensively in extragalactic astronomy with respect to the distribution of luminosities, where it is known as the Schechter luminosity function. According to the Schechter function, the number of stars or galaxies within a luminosity bin of fixed width at luminosity ℓ is proportional to ℓ^α exp(−ℓ/L*).
Definition 4.9. The power law or Pareto distribution with shape α > 0 and truncation point ξ > 0, written Pareto(α, ξ), is defined through the PDF

    f(x) = α ξ^α / x^{α+1}   for x ≥ ξ,
           0                 otherwise.
A graph of the Pareto density is shown in Fig. 4.6. The truncation parameter ξ determines the smallest value that the random variable X can take. In particular, P(X < ξ) = 0. From there on, the density strictly decreases with what is called polynomial decay or a polynomial tail: f(x) ∝ x^{−(α+1)}. The smaller the value of α, the slower the decay. Contrast this with the exponential tails found in the exponential and Gamma distributions, where (approximately) f(x) ∝ exp(−ax), which goes to zero much faster. The type of decay determines how probable very large values of the random variable X are. When the density decays slowly, large values of X occur at relatively high frequency. In fact, we have

    E[X] = ∞              if α ≤ 1,        Var[X] = ∞                          if α ≤ 2,
           αξ/(α − 1)     if α > 1;                 αξ²/((α − 1)²(α − 2))      if α > 2.
So when α is very small, large values of X are so probable that the expectation (or variance) is infinite. This is a bit of a mathematical oddity. The interpretation is that large values are so frequent that taking the average of such numbers doesn't yield a stable result, no matter how many numbers we average.
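You can watch this instability in a simulation. A sketch that draws Pareto samples by inverse transform (the CDF is F(x) = 1 − (ξ/x)^α, so F^{−1}(u) = ξ(1 − u)^{−1/α}) and tracks the running mean for α = 0.9, where E[X] = ∞:

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, xi = 0.9, 1.0  # alpha <= 1: the expectation is infinite
n = 10**6

u = rng.uniform(size=n)
x = xi * (1 - u) ** (-1 / alpha)  # inverse-transform Pareto samples

running_mean = np.cumsum(x) / np.arange(1, n + 1)
print(running_mean[[10**3 - 1, 10**4 - 1, 10**5 - 1, 10**6 - 1]])
# the running mean keeps growing / jumping instead of stabilizing
```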
Typical application domains are similar to the Gamma distribution. But
now we put more probability mass on very large values of the random variable.
Examples are:
Figure 4.6: Probability density function of the Pareto distribution for varying
parameter choices.
Example 4.10. Imagine taking a random sample of stars, which are just entering
the main sequence. The masses of such stars are called initial masses. The
probability density of their masses is called the initial mass function (IMF). Let
us measure mass m in multiples of 1 solar mass. Salpeter discovered that the
number of stars with mass m appears to decrease as a power law (at least for the
larger stars).
Figure 4.7: Probability density function of the Normal distribution for varying
parameter choices.
Definition. A random variable X has the normal (or Gaussian) distribution with mean µ ∈ R and variance σ² > 0, written X ~ N(µ, σ²), if its PDF is

    f(x) = (1 / (σ√(2π))) exp(−(x − µ)² / (2σ²)),    x ∈ R.

As you might expect from the parameter names, it holds E[X] = µ, Var[X] = σ². If µ = 0 and σ = 1, we say that X has a standard normal distribution. The PDF and CDF of the standard normal random variable are conventionally denoted by φ(z) and Φ(z), respectively.
Fig. 4.7 shows graphs of normal density functions. The role of the parameters is quite obvious. The densities have a bell shape, symmetric around a peak at µ. So this parameter shifts the location of the distribution. The spread of the distribution is determined by the variance parameter σ², where larger values spread the probability mass out more.
The normal distribution is considered the most important distribution in statis-
tics. Before the change to Euros, it was even featured prominently on the most common Deutsche Mark bill. (In Fig. 4.8 you can see me during my PhD defence, explaining what a PDF is using the graph on the bill.) It is called ‘normal’
because so many things we observe appear to approximately follow a normal dis-
tribution. This is also the case in astronomy (example: the near-infrared K-band
distribution of globular cluster magnitudes in the Milky Way Galaxy). There’s
even a mathematical reason for that. We will later see that (most) averages of
random variables are approximately normal. This is known as the central limit theorem, the second fundamental ‘law’ in statistics (the first being the law of large numbers).
Figure 4.8: The normal distribution was featured prominently on Deutsche Mark bills.
Many quantities are sums or averages of smaller contributions.
That’s true for both things that we observe and things that we compute (just
look back to the previous chapter). We’ll learn more about that later.
A particularly common application domain for the normal distribution are
measurement errors.
Proposition 4.13. Let X ∼ N(µ, σ²) and Z ∼ N(0, 1). Then: (i) (X − µ)/σ ∼ N(0, 1); (ii) µ + σZ ∼ N(µ, σ²); (iii) if X1 ∼ N(µ1, σ1²) and X2 ∼ N(µ2, σ2²) are independent, then X1 + X2 ∼ N(µ1 + µ2, σ1² + σ2²).
The first property tells us that, by shifting and scaling a normal random variable,
we can transform it to a standard normal one. This also works the other way
around: by shifting and scaling a standard normal variable, we obtain a normal
variable with arbitrary mean and variance (second property). Finally, if we add two independent normal variables, the result is again normal (third property).
The multivariate normal distribution N(µ, Σ) on R^d is parametrized by a mean vector µ ∈ R^d and a covariance matrix Σ ∈ R^(d×d).
You can check that the joint density of the multivariate normal simplifies to
the density of the univariate normal when d = 1. The interpretation of the
parameters is the same. The mean vector µ shifts location, the covariance matrix
determines the spread in every direction. Of course, the covariance matrix also
contains information about the dependence between the components of X.
The first two properties are unsurprising. The third one generalizes the fact that
the sum of two independent normals is again normal. It states that any linear
combination of components of a multivariate normal vector is again normal. The
statement from earlier is recovered for d = 2, a = (1, 1), and Cov[X1 , X2 ] = 0.
But why does Cov[X1 , X2 ] = 0 mean that the variables are independent? For
the multivariate normal distribution, variables are independent if and only if
they are uncorrelated (fourth property). Note that this is a specific feature of
the normal distribution. For most other distributions, zero correlation does not
imply independence.
Y = β0 + β1 X + ε,
where β0, β1 are model parameters and ε ∼ N(0, σ²). This model can be written equivalently as follows: the conditional distribution of Y given X = x is N(β0 + β1 x, σ²). More generally, with a covariate vector X,
Y = β0 + β⊤X + ε, E[ε | X] = 0.
• prediction.
1
This is the key result of an area of statistics called extreme value theory. The theory says that
events exceeding a large threshold approximately follow a generalized Pareto distribution.
It is well established that the generalized model simplifies to the usual Pareto model for
financial returns.
Example 5.4. Let θ∗ = E[X]. Then the sample average θbn = X̄n is a consistent
estimator.
The bias of an estimator θ̂ is defined as
bias[θ̂] = E[θ̂] − θ∗.
Hence, the sample average X̄n is an unbiased estimator for the true mean E[X].
Example 5.8. We already know that the sample average is unbiased: E[X̄n] = E[X] for all n. Furthermore, we can compute V[X̄n] = V[X]/n → 0. Hence, X̄n →p E[X]. (This is one way to prove the law of large numbers.)
Instead of considering bias and variance separately, we can also look at a single
measure for the quality of an estimator.
Definition 5.9 (Mean squared error, MSE). The mean squared error of an estimator θ̂ is defined as
MSE = E[(θ̂ − θ∗)²].
It decomposes as
E[(θ̂ − θ∗)²] = bias[θ̂]² + V[θ̂].
3
The expectation is not a random variable, so this convergence is in the usual, non-probabilistic
sense.
4
To preserve units, we may take the root of MSE.
= E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ∗)²
= V[θ̂] + bias[θ̂]².
µ̂ = X̄n, σ̂² = Sn².
These examples were a bit boring: the parameters are equal to the mean and
variance. It’s often more involved, though.
Example 5.13. Assume X ∼ Gamma(α, β), i.e., θ = (α, β). We know that
E[X] = αβ and V[X] = αβ². We define the MOM estimator θ̂ = (α̂, β̂) by solving the system of equations
α̂ β̂ = X̄n, α̂ β̂² = Sn².
Solving gives
α̂ = X̄n² / Sn², β̂ = Sn² / X̄n.
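As a minimal sketch of how this looks in practice (the data below are simulated stand-ins, not from any real survey):

import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.5, size=500)  # hypothetical sample

xbar = x.mean()           # sample mean, estimates alpha * beta
s2 = x.var(ddof=1)        # sample variance, estimates alpha * beta^2

alpha_hat = xbar**2 / s2  # MOM estimator for the shape
beta_hat = s2 / xbar      # MOM estimator for the scale
print(alpha_hat, beta_hat)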
5.4.1 Motivation
Suppose we have decided on a statistical model F = {fθ : θ ∈ Θ} and want to
construct an estimator for the parameter θ. Consider the following question:
The data are random variables, so it’s adequate to assess them probabilistically.
We then define an estimator θb as the value θ under which the observed data are
most likely: we maximize the likelihood of the data given the parameter.
Convince yourself that the two definitions (5.1) and (5.2) are in fact equivalent (the logarithm is a strictly increasing function).
So what’s the advantage of taking the logarithm? Recall that, to find the
maximum of a function, we equate the first derivative of the function to zero.
The derivative of a product of n terms is unwieldy (think applying the product
rule dozens of times). The derivative of a sum is just the sum of derivatives. This
simplifies computations a lot. If the maximization problem is solved numerically,
a sum also tends to be more stable than a product, but that’s a topic for another
course.
Remark 5.1. Suppose we are not actually interested in the parameter θ∗ itself, but in some transformation τ∗ = g(θ∗) of it. If θ̂ is the MLE for θ∗, then g(θ̂) is the MLE for τ∗. This property is called equivariance of the MLE.
Step 2. Compute the first derivative with respect to all components of θ (θ may
be multidimensional).
Step 3. Equate the derivatives to zero. For k-dimensional θ, we get the system
of equations
∂ℓ(θ)/∂θ1 = 0, . . . , ∂ℓ(θ)/∂θk = 0. (5.3)
To solve the above equation, multiply both sides with p(1 − p), which gives
(1 − p) Σ_{i=1}^n Xi − p Σ_{i=1}^n (1 − Xi) = 0
⇔ Σ_{i=1}^n Xi − p Σ_{i=1}^n Xi − p Σ_{i=1}^n 1 + p Σ_{i=1}^n Xi = 0
⇔ Σ_{i=1}^n Xi − p Σ_{i=1}^n 1 = 0
⇔ Σ_{i=1}^n Xi − pn = 0
⇔ p = (1/n) Σ_{i=1}^n Xi = X̄n.
For the Gaussian linear regression model Y = β0 + β⊤X + ε, maximizing the log-likelihood over β is equivalent to minimizing the sum of squares Σ_{i=1}^n (Yi − β0 − β⊤Xi)².
That’s why the MLE under a Gaussian likelihood is also referred to as (ordinary)
least-squares estimator or OLS for short. The OLS estimator can be computed
theoretically, but let’s reserve that for another time.
When the statistical model is too complicated, it may be hard to derive the
MLE theoretically. If that’s the case (or you just feel lazy), the MLE can
be computed using numerical optimization algorithms (e.g., scipy.optimize).
Theoretical expressions are much faster to compute, however, so they are still
useful in practice (and for exam problems).
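A minimal sketch of the numerical route, using scipy.optimize on simulated (hypothetical) Gamma data, where no closed-form MLE for the shape exists:

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
x = rng.gamma(shape=2.0, scale=1.5, size=500)  # hypothetical sample

def neg_log_lik(theta):
    alpha, beta = theta
    if alpha <= 0 or beta <= 0:   # keep the optimizer inside the parameter space
        return np.inf
    return -np.sum(stats.gamma.logpdf(x, a=alpha, scale=beta))

# Minimizing the negative log-likelihood maximizes the likelihood.
res = optimize.minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
alpha_mle, beta_mle = res.x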
Let’s get back to our Corona crash example from the beginning. We can also
compute the MLE for the Pareto distribution.
Example 5.16. Suppose X1, . . . , Xn iid ∼ Pareto(α, ξ) and ξ is known. Then θ = α and
fα(x) = α ξ^α / x^(α+1), for x > ξ.
The log-likelihood is
ℓ(α) = n ln(α) + nα ln(ξ) − (α + 1) Σ_{i=1}^n ln(Xi).
Now we can fit the parameter and compute a probability for a crash as extreme
as last week. I downloaded data for Dow Jones returns for the last 35 years from
yahoo finance⁵. Let's set ξ = 0.05: every week with a loss larger than 5% is considered extreme. By this definition, 39 of the weeks (≈ 2.1% of the data) had losses larger than ξ, and the MLE gives α̂ ≈ 2.8. In the tweet, the weekly loss was
a whopping 17%. The probability of an event at least as extreme is
Thus, we expect a crash like this every 1/0.0007 ≈ 1400 weeks or every 1/(52 × 0.0007) ≈ 27 years. While this is still unlikely, the event is orders of magnitude more probable than you would expect under a normal distribution. It also aligns well with what we observed over the last 120 years. Seems like we're not that unlucky after all.
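Setting the derivative of ℓ(α) to zero gives the closed-form MLE α̂ = n / Σ_{i=1}^n ln(Xi/ξ). A minimal sketch of the computation, with a simulated stand-in for the 39 extreme weekly losses (not the actual Dow Jones data):

import numpy as np

xi = 0.05                      # truncation point: 5% weekly loss
rng = np.random.default_rng(3)
losses = xi * (1 + rng.pareto(2.8, size=39))  # hypothetical extreme losses

# Closed-form MLE: solving dl/d(alpha) = 0 yields n / sum(log(X_i / xi)).
alpha_hat = len(losses) / np.sum(np.log(losses / xi))

# P(loss >= 0.17) = 0.021 * (xi / 0.17)^alpha under the fitted tail,
# where 0.021 is the empirical probability of exceeding xi at all.
p_crash = 0.021 * (xi / 0.17) ** alpha_hat
print(alpha_hat, 1 / (52 * p_crash))  # shape estimate, years between events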
5.4.5 Consistency
The MLE enjoys several nice theoretical properties. In some sense, it is even the
best possible estimator (think: no other consistent estimator can have a smaller
MSE), but that’s beyond the current scope. For now, we shall content ourselves
with the fact that the MLE is consistent.
Theorem 5.17. Suppose X1, . . . , Xn iid ∼ fθ∗ for some fθ∗ ∈ F. Under some regularity conditions⁶, the MLE θ̂ exists⁷ and is consistent.
5
https://finance.yahoo.com/
6
The ‘regularity conditions’ are mostly unproblematic. Their main purpose is to exclude pathological cases; for example, densities fθ that aren't continuous in θ.
7
The ‘exists’ refers to the fact that the likelihood actually has a maximum.
Remark 5.2. We are not going to prove the above theorem. But in case you’re
interested, here’s the idea: Instead of maximizing `(θ), we could just as well
maximize `(θ)/n (scaling doesn’t change the maximal point). By the law of large
numbers, it holds that
ℓ(θ)/n = (1/n) Σ_{i=1}^n ln fθ(Xi) →p Eθ∗[ln fθ(X1)],
where the expectation on the right is computed under the true model with parameter
θ∗ and density fθ∗ . That is,
Eθ∗[ln fθ(X1)] = ∫ ln fθ(x) fθ∗(x) dx,
which is called the cross-entropy between the densities fθ and fθ∗. Hence, θ̂ is a value that maximizes the cross-entropy asymptotically (as n → ∞). Finally, one can show that the cross-entropy is maximized by the true parameter value θ∗.
All models are wrong, but some are useful. — George E. Box
In that sense, it’s useless to worry about the model being incorrect. But we
should certainly think about whether a statistical model is useful. If what we
observe doesn’t align with the properties of the model, it’s probably not a useful
one.
So how do we check? There's one simple tool, called the quantile-quantile plot or just QQ-plot. Suppose θ̂ is the estimated parameter and Fθ̂ the associated distribution function. The QQ-plot is simply a graph with the theoretical quantiles Fθ̂⁻¹(p) on the x-axis and the empirical quantiles Fn⁻¹(p) on the y-axis. If the model fit is good, all points should lie on the main diagonal x = y.
A QQ-plot for the Dow Jones data is shown in Fig. 5.2. The QQ-plot for the normal distribution is indicated by black dots (parameters were estimated by MOM). The points deviate a lot from the main diagonal. For large losses, the sample quantiles are much larger than the theoretical quantiles. Hence, the normal distribution is a poor model for large losses. The orange triangles are
Figure 5.2: QQ-plots for the normal and Pareto distributions in the Corona crash. Data are weekly losses on the Dow Jones Index. Since we cut off the Pareto at ξ = 0.05, only values above this threshold are shown.
the QQ-pairs for the Pareto distribution. They are generally quite close to the
diagonal, so the Pareto model seems to provide a good fit.
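A minimal sketch of how to draw such a QQ-plot yourself, here for a normal model fitted by MOM to simulated (hypothetical) data:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(4)
x = np.sort(rng.normal(size=200))               # hypothetical sample
p = (np.arange(1, len(x) + 1) - 0.5) / len(x)   # probability levels

# Theoretical quantiles under the fitted normal model (MOM estimates).
q_theory = stats.norm.ppf(p, loc=x.mean(), scale=x.std(ddof=1))

plt.scatter(q_theory, x, s=10)   # empirical vs. theoretical quantiles
plt.axline((0, 0), slope=1)      # main diagonal x = y
plt.xlabel("theoretical quantiles")
plt.ylabel("sample quantiles")
plt.show()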
There are numerous issues with the above procedure: binning causes bias, it's unclear how to choose the number and location of bins, error variances σk² need to be estimated, . . . There are equally many modifications of the above: taking logarithms of Xi first, taking logarithms of ĥ and fθ when computing the criterion, different ways to compute σk, etc. — many things that try to fix issues caused by binning.
In the old days, data were often recorded or shared in the form of binned
counts. Then the only option is to go through these chores. Gladly, this is rare
in modern times and we can just use MLE or MOM.
Uncertainty quantification
6
By now, we have seen many estimators θ̂ of some parameter θ∗:
• The empirical distribution function Fn (x) and quantile Fn−1 (p) as estimators
of a CDF F (x) and quantile F −1 (p).
• The histogram ĥ(x) as an estimator for the PDF/PMF f(x).
All but MOM and MLE are also called nonparametric estimators, because we do
not need to specify a parametric model for them to work.
So if we have computed an estimate θ̂, can we say that θ∗ = θ̂? Of course not! The estimator θ̂ is a random variable, but θ∗ is not. Every time we compute an estimate θ̂, we will make an estimation error θ̂ − θ∗ ≠ 0. We cannot know the exact error without knowing θ∗. If the estimator is consistent, we know that it converges for infinitely many observations. But on finite samples, there is some uncertainty about how close we are to the truth.
In this chapter, we learn how to quantify this uncertainty probabilistically.
We say that a sequence of random variables Yn with CDFs Fn converges in distribution to a random variable Y with CDF F, written
Yn →d Y,
if Fn(y) → F(y), as n → ∞, at every point y where F is continuous.
Theorem 6.2 (Central limit theorem, CLT). Let Y1, . . . , Yn be iid with mean E[Y1] = µ and variance V[Y1] = σ², and define Ȳn = (1/n) Σ_{i=1}^n Yi. Then
(Ȳn − E[Ȳn]) / √V[Ȳn] = √n (Ȳn − µ) / σ →d N(0, 1).
Remark 6.1. The statement of the theorem uses the common short notation √n(Ȳn − µ)/σ →d N(0, 1). The long form is “there is a random variable Y ∼ N(0, 1) such that √n(Ȳn − µ)/σ →d Y.” An alternative way to write it is
√n (Ȳn − µ) →d N(0, σ²).
Remember when we said that (most) averages behave like a Gaussian random
variable? The CLT is the mathematically precise formulation of this fact. The
interpretation is that, for large enough n, the sample average Ȳn behaves approxi-
mately1 like a N (µ, σ 2 /n) random variable. This also explains why the Gaussian
distribution is found everywhere in nature. It is the natural model when many
independent factors contribute to an outcome.
As n → ∞, the variance V[Ȳn] = σ²/n vanishes. Hence, in a probabilistic sense, the difference Ȳn − µ gets closer to 0 (that's the law of large numbers). The scaling with √n allows us to obtain a non-trivial limit. You can think of it this way: multiplying a random variable by √n blows up its variance. The rate √n strikes just the right balance: V[√n Ȳn] = (√n)² V[Ȳn] = σ² ∈ (0, ∞).
1
“Approximately behaves like” refers to probability statements: probability statements con-
cerning Ȳn are approximated by probability statements concerning N (µ, σ 2 /n).
The central limit theorem is quite remarkable. The only assumptions are that
the sequence is iid with finite variance. It is called central because it plays such
a central role in probability and statistics. The name was first used by George
Pólya2 in 1920 (in German, “Zentraler Grenzwertsatz”), but the idea is older
and many other famous mathematicians contributed, including Laplace, Cauchy,
Bessel, Poisson (all part-time astronomers!).
As a side note, let me mention that there are several generalizations of the
CLT. The multivariate CLT states that an average of random vectors behaves
like a multivariate normal random variable. Furthermore, the variables do not
have to be iid. For example, we can allow their distribution to change with n or
for (weak) dependence between observations.
There is a joke about statisticians taking averages all day and, in a sense, this
is true. Many estimators we have seen so far can be expressed as averages (or
functions of averages). We shall see that even when they don’t, they can often be
approximated by a suitable average. The CLT tells us that all these estimators
behave like a Gaussian when properly scaled. How nice is that?
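A minimal simulation sketch of the CLT in action, using exponential variables (so µ = σ = 1) as a stand-in for any iid sequence:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 100, 10_000
y = rng.exponential(scale=1.0, size=(reps, n))    # mu = sigma = 1
z = np.sqrt(n) * (y.mean(axis=1) - 1.0) / 1.0     # standardized averages

# Compare a simulated tail probability to the normal approximation.
print(np.mean(z > 1.96), 1 - stats.norm.cdf(1.96))  # both close to 0.025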
2
You might have been tortured by his ‘urn’ in high school.
Example 6.4. Suppose for simplicity that F is continuous. The histogram for x ∈ (xk−1, xk] is defined as
ĥn(x) = (1 / (n(xk − xk−1))) Σ_{i=1}^n 1(xk−1 < Xi ≤ xk).
With pk = F(xk) − F(xk−1), the sum of indicators is Binomial(n, pk), so
E[ĥn(x)] = pk / (xk − xk−1), V[ĥn(x)] = pk(1 − pk) / (n(xk − xk−1)²),
and therefore
ĥn(x) ≈ N(pk / (xk − xk−1), pk(1 − pk) / (n(xk − xk−1)²)).
Note that E[ĥn(x)] ≠ f(x), so the histogram is biased. (One can check that it is asymptotically unbiased, however.) If we want to compute probabilities like P(|ĥn(x) − f(x)| < ε), we would need to estimate not just the variance of ĥn(x), but also its bias. That's beyond the scope of this course, but be aware that biased estimators complicate things.
Theorem 6.5. Suppose X1, . . . , Xn iid ∼ fθ∗ for some fθ∗ ∈ F. Under some regularity conditions, the MLE θ̂ satisfies
√n (θ̂ − θ∗) →d N(0, I(θ∗)⁻¹),
where
I(θ) = Eθ[(∂ ln f(X; θ) / ∂θ)²].
Thus,
∂ ln f(x; α)/∂α = 1/α + ln(ξ) − ln(x), ∂² ln f(x; α)/∂α² = −1/α².
Hence, I(α) = 1/α² and the MLE satisfies α̂ − α∗ ≈ N(0, (α∗)²/n).
Now apply the central limit theorem to the right-hand side. (This was the interesting part of the proof; you can skip the following details if you want.) Because, by the chain rule,
∂θ ln f(Xi; θ∗) = ∂θ f(Xi; θ∗) / f(Xi; θ∗),
it holds
E[∂θ ln f(Xi; θ∗)] = ∫ (∂θ f(x; θ∗) / f(x; θ∗)) f(x; θ∗) dx
= ∫ ∂θ f(x; θ∗) dx
= ∂θ ∫ f(x; θ∗) dx
= ∂θ 1
= 0.
Further,
V[∂θ ln f(Xi; θ∗) / I(θ∗)] = V[∂θ ln f(Xi; θ∗)] / I(θ∗)²
= (E[(∂θ ln f(Xi; θ∗))²] − E[∂θ ln f(Xi; θ∗)]²) / I(θ∗)²
= E[(∂θ ln f(Xi; θ∗))²] / I(θ∗)²
= 1 / I(θ∗).
Theorem 6.7 (Delta method). Suppose √n(θ̂ − θ∗) →d N(0, σ²) and that g is continuously differentiable. Then,
√n (g(θ̂) − g(θ∗)) →d N(0, g′(θ∗)² σ²).
Example 6.8. Let σ² = V[X]. For the sample variance Sn², one can show √n(Sn² − σ²) →d N(0, µ4 − σ⁴), where µ4 = E[(X − µ)⁴]. Now consider the sample standard deviation Sn = g(Sn²) with g(x) = √x. It holds g′(x) = 1/(2√x) and therefore
√n (Sn − σ) →d N(0, (µ4 − σ⁴) / (4σ²)).
Example 6.9. The delta rule is often useful when computing probabilities from an estimated model. Recall the Corona crash example following Example 5.16. We computed the MLE α̂ and then a probability p(α̂) = 0.021 × (1 − Fξ,α̂(0.17)).⁴
4
Let's treat 0.021 as a fixed number for simplicity. Strictly speaking, it's also a random variable.
You can check that Fξ,α(x) = 1 − (ξ/x)^α for x > ξ. Hence, p(α) = 0.021 × (ξ/0.17)^α, p′(α) = 0.021 × (ξ/0.17)^α ln(ξ/0.17), and
√n (p(α̂) − p(α∗)) →d N(0, p′(α∗)² (α∗)²).
For the expected number of years between such events, t(α) = 1/(52 × p(α)), we get
t′(α) = −p′(α) / (52 × p(α)²) = −0.021 × (ξ/0.17)^α ln(ξ/0.17) / (52 × p(α)²),
and
√n (t(α̂) − t(α∗)) →d N(0, t′(α∗)² (α∗)²).
A γ-confidence interval for θ∗ is a pair of estimators (θ̂l, θ̂u) such that
P(θ∗ ∈ (θ̂l, θ̂u)) ≥ γ.
5
Taken from https://seeing-theory.brown.edu, a beautiful introduction to statistics with
interactive graphics. Check it!
Figure 6.1: Illustration of confidence intervals. The dashed line is the true pa-
rameter θ∗ , intervals are constructed repeatedly from simulated data.
The dashed line indicates the fixed location of the true parameter θ∗. The dots are the estimates θ̂, the bars indicate the intervals (θ̂l, θ̂u). The estimates and intervals are random, so they are different for each of the 14 runs. Some of the intervals cover the true value θ∗, some don't. For γ-confidence intervals, we expect that the long-run proportion⁶ of intervals covering θ∗ is at least γ.
So how do we construct such intervals? Suppose that an estimator θ̂ is asymptotically normal, that is, θ̂ ≈ N(θ∗, se[θ̂]²). The standard error se[θ̂] may not be known, but can be estimated by some ŝe[θ̂] (see, e.g., Example 6.3). Recall that Φ is the CDF of the standard normal distribution. Set γ = 1 − α (α is called the significance level in a related context, but we'll get to that). Define zα/2 as the corresponding (1 − α/2)-quantile, zα/2 = Φ⁻¹(1 − α/2).
6
‘Long-run’ means that we repeat the experiment a large number of times
Example 6.12. Let θ∗ = E[X] and θ̂ = X̄n. The CLT states θ̂ ≈ N(θ∗, V[X]/n), so se[θ̂] = √(V[X]/n), which we can approximate by ŝe[θ̂] = Sn/√n. Hence,
(X̄n − zα/2 Sn/√n, X̄n + zα/2 Sn/√n)
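is an asymptotic γ-confidence interval for E[X]. A minimal sketch of the computation on simulated (hypothetical) data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(loc=2.0, scale=3.0, size=200)  # hypothetical data

gamma = 0.95
z = stats.norm.ppf(1 - (1 - gamma) / 2)       # z_{alpha/2}
se = x.std(ddof=1) / np.sqrt(len(x))          # estimated standard error
ci = (x.mean() - z * se, x.mean() + z * se)   # CLT-based interval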
Example 6.13. Let’s reconsider our Corona crash example. We computed the
MLE for the Pareto shape as 2.87 following Example 5.16. In Example 6.9, we
have shown that
√n (t(α̂) − t(α∗)) →d N(0, t′(α∗)² (α∗)²),
where t(α∗) is the expected number of years between events. Recall that the MLE was α̂ = 2.87, ξ = 0.05, and n = 39. Substituting these values in the expressions derived in Example 6.9 yields
Remark 6.3. Note that the conditions of Theorem 6.11 do not apply to the
histogram because of its bias. The intervals can still be used to guide intuition,
The first is rarely an issue. The second and third require hard work. For complex
statistical models or estimators, the standard error may not be known, difficult
to derive, or difficult to estimate.
Luckily, Bradley Efron came up with an ingenious idea in 1979. The bootstrap is one of the most celebrated and widely used techniques for uncertainty quantification. Recall that to quantify uncertainty, we need to approximate the distribution of the random variable θ̂ − θ∗.⁷ Alas, we only observe a single realization of this variable: the estimate computed from the observed data X1, . . . , Xn.
Suppose for a moment that we can simulate from the true distribution F .
Consider the following bootstrap algorithm:
Step 1. Simulate B new data sets of size n from F.
Step 2. Compute the estimate θ̂b on each of the B data sets, b = 1, . . . , B.
Step 3. Define q̂α/2 and q̂1−α/2 as the α/2 and (1 − α/2) sample quantiles of the ‘observations’ θ̂1, . . . , θ̂B.
7
This distribution is also called the sampling distribution of θ̂.
8
https://en.wiktionary.org/wiki/pull_oneself_up_by_one%27s_bootstraps
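In practice, we cannot simulate from the unknown F; the bootstrap replaces F by the empirical distribution, i.e., resamples the observed data with replacement. A minimal sketch of the resulting percentile interval, here for the median of simulated (hypothetical) data:

import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(size=100)     # hypothetical observed data
B, alpha = 2000, 0.05

# Resample with replacement (the empirical stand-in for simulating
# from F), and re-compute the estimator on each bootstrap sample.
theta_b = np.array([
    np.median(rng.choice(x, size=len(x), replace=True)) for _ in range(B)
])

# Percentile interval from the bootstrap sample quantiles.
ci = (np.quantile(theta_b, alpha / 2), np.quantile(theta_b, 1 - alpha / 2))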
H1 . In our example,
H0 : ∆∗ < 0, H1 : ∆∗ ≥ 0.
Now we want to check whether the data contradicts the hypothesis. Note that I
want to reject H0 to prove that I’m right. This is how statistical tests are usually
set up, but more on that later.
Test statistics
Suppose that we know which galaxies in the data are star-forming and which are not. We have data X1⁽ᴬ⁾, . . . , Xn⁽ᴬ⁾ from active galaxies and data X1⁽ᴾ⁾, . . . , Xm⁽ᴾ⁾ from passive galaxies. The true value of ∆∗ is unknown to us, but can be estimated by
∆̂ = X̄⁽ᴬ⁾ − X̄⁽ᴾ⁾ = (1/n) Σ_{i=1}^n Xi⁽ᴬ⁾ − (1/m) Σ_{i=1}^m Xi⁽ᴾ⁾.
P-values
Note that T̂n is a random variable (because ∆̂ is) from which we see only one realization — the one computed from the data we observed. Let's denote this number by t to make the distinction between the random variable and the realization more clear. Here,
T̂n = ∆̂ / √(σ̂A²/n + σ̂P²/m) →d N(0, 1).
To decide whether or not to reject H0, we compute the p-value: the probability of seeing a value of T̂n at least as large as t, if H0 were true.
The test above is a Wald test, that is, a test constructed from an asymptotically normal estimator. Since you already know many estimators that are asymptotically normal, you should be able to construct Wald tests for other types of hypotheses as well.
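A minimal sketch of the two-sample Wald test above, on simulated (hypothetical) samples standing in for the active and passive galaxies:

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x_a = rng.normal(0.3, 1.0, size=150)   # hypothetical 'active' sample
x_p = rng.normal(0.0, 1.0, size=120)   # hypothetical 'passive' sample

delta_hat = x_a.mean() - x_p.mean()
se = np.sqrt(x_a.var(ddof=1) / len(x_a) + x_p.var(ddof=1) / len(x_p))
t = delta_hat / se                     # Wald statistic, approx. N(0, 1)

p_value = 1 - stats.norm.cdf(t)        # one-sided p-value for H1: delta >= 0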
Significance
Ultimately, we want to make a decision: do we reject H0 or not? We do this by
comparing the p-value against a significance level α. Recall that a small value of p constitutes evidence against H0. Hence, we use the rule
• if p < α: reject H0 ,
• if p ≥ α: don’t reject H0 .2
If p < α, we also say that the result is statistically significant at level α. Similar
to the confidence level γ, choosing the significance level α is up to the researcher.
The most common value is 5%, but this depends on the field and type of research.
But what does it actually mean? The value of α controls the probability of a
false positive: rejecting H0 although it is true. We call this a ‘positive’, because
most tests use H0 as the hypothesis of ‘no effect’. If we want to establish an
effect, we actually want to reject H0 . α = 5% means that, if H0 is true, we expect
it to be rejected in 5% of the cases — just due to chance. This is unacceptable
in many physical experiments, where much smaller levels for α are used.
However, the smaller α, the harder it is to reject H0 (or to ‘detect an effect’).
The probability of detecting a real effect (rejecting H0 when it is false) is also
called power of the test. If a test has little power, we will rarely reject H0 , no
matter if it is true or not. That’s why a large p-value should not be interpreted
as evidence for H0 . Generally, the power of a test increases if we have more data
to base our decision on.
Multiple testing
Let’s assume we found t = −3, such that p ≈ 0.999. Unfortunately, I couldn’t find
evidence against Marius’ hypothesis that star-forming galaxies emit bluer light.
1
Here, worst case means that it is harder to find evidence against ∆∗ < 0 than against ∆∗ < c
for any other c ≤ 0.
2
Again, we never “accept”, we only “not reject”.
That’s a bit embarrassing. Maybe I can at least prove that I’m not a complete
idiot and the difference is small. Let’s test a new hypothesis H0 : ∆∗ < −0.2.
We find p = 0.04 and conclude that I’m not a complete idiot at significance level
α = 5%.
That would be even more embarrassing, because it would mean that I also have
no clue about statistics. By testing two hypotheses, we increase the probability of
a false positive (reject H0 although it is true). It means that, even if we compare
p against 5%, the level of the two tests combined is larger than 5%. This is a
multiple testing problem and for sake of good science, we need to correct for it.
There are two popular ways to do that:
• The Bonferroni correction compares p against α/m, where m is the number
of tests. This correction guarantees that the false positive rate is at most
α. It is generally conservative and safeguards against the worst case.
7.2.1 Overuse
There is a tendency to overuse statistical tests. Very often, estimation and
confidence intervals are better tools. It’s a good idea to ask yourself three
questions:
Q1. Do I have a well-defined and well-motivated hypothesis to test for?
But with α = 5% or lower, how can it be that the false positive rate is so large?
There are several likely reasons for this, some good and some bad. Among the
good ones is that multiple testing issues are ignored across a huge proportion
of the scientific literature — mainly due to a lack of awareness and insufficient
statistics education. If this reason counts as ‘good’, you get an idea of what
comes next.
Academic journals are less likely to publish research with no significant results.
Because scientists know this, many don’t even try to publish insignificant ones.
We don’t really know how often someone failed to reject a hypothesis, we only
see the significant results. This is known as the file drawer effect.
It gets worse. Whether people are aware of it or not, uncorrected multiple testing actually makes it easier to claim ‘scientific discoveries’. If we use α = 5%, we
can test 100 things where there is no effect, but will make significant ‘discoveries’
in 5 of them — just by chance. Scientists acquire fame and secure their job
through ‘discoveries’, so there is an incentive to make as many as possible. As a
consequence, the incentives suggest to not correct for multiple testing.
Much worse. Remember when I was unhappy with the outcome of my test, so
I tested another hypothesis instead? This is known as HARKing or hypothesizing
after results are known and it’s problematic. If the same data is used to form a
hypothesis and to test it, all inferences (like error probabilities) are corrupted.
Unfortunately, this practice is widespread. Intentions don't need to be bad.
For example, data can be expensive or even impossible to collect twice (for
example, when testing hypotheses about a certain time period). What one can do
nevertheless, is to clearly communicate how (and when) a hypothesis was formed
and whether this has implications for inference.
In fact, surveys suggest that many researchers torture their data until they make a ‘discovery’. This can mean coming up with and testing new hypotheses until p < 0.05 for one of the tests. It can also mean changing the data to push
the p-value beyond the significance boundary by, e.g., excluding or including
outliers, control variables, or sub-groups of the data. These practices are known
as p-hacking and are poison to scientific progress.
In the past five years or so, these issues started to attract attention and things
are changing for the better. Luckily, astronomy and physics are fields where
such practices have been less problematic. But they are not immune to these
issues either. Take the above as a cautionary tale. Small violations of the rules
accumulate and corrupt the scientific endeavor. So it’s better to be aware and
avoid corrupting your own field.
H0 : θ∗ ∈ Θ0 , H1 : θ∗ ∈ Θ1 .
• one-sided hypotheses:
– H0 : θ∗ < θ0 and H1 : θ∗ ≥ θ0 ,
– H0 : θ∗ ≤ θ0 and H1 : θ∗ > θ0 ,
– H0 : θ∗ > θ0 and H1 : θ∗ ≤ θ0 ,
– H0 : θ∗ ≥ θ0 and H1 : θ∗ < θ0 ,
Example 7.1. Consider the hypothesis that, on average, stars in the Milky Way and Andromeda galaxies have the same mass. Let µMW be the expected mass of a star in the Milky Way and µA the expected mass of an Andromeda star. Then θ∗ = µMW − µA and
H0 : µM W − µA = 0, H1 : µM W − µA 6= 0.
Example 7.2. Consider the hypothesis that the metallicity of a quasar is inde-
pendent of its age. If they are independent, the theory predicts that they must
be uncorrelated. Denote by ρ the correlation between metallicity and age. Then
θ∗ = ρ and
H0 : ρ = 0, H1 : ρ 6= 0.
Example 7.3. Consider the hypothesis that the luminosity of stars in the Milky Way (in magnitudes) follows a normal distribution. Denote by Φµ,σ² the corresponding CDF and let F = {Φµ,σ² : (µ, σ²) ∈ R × (0, ∞)} be the statistical model.
The parameter of interest is the true CDF F . That is, θ∗ = F 5 and
H0: F ∈ F, H1: F ∉ F.
4
The ‘side’ refers to the alternative hypothesis.
5
Note that here the parameter θ∗ is not just a number, but an entire function.
• if T̂n ∉ R, do not reject the null hypothesis.
Most commonly, the rejection region takes the form {T̂n > c} (for one-sided tests) or {|T̂n| > c} (for two-sided tests), where c ∈ R is a critical value. It is more
common to reformulate the decision rule in terms of p-values, to which we’ll get
in a minute.
Note that we never ‘accept’ the null-hypothesis. If we don’t reject it, this can
have several reasons. The main one is that the test statistic is not informative
enough. That does not mean that we found evidence for H0 , only that we couldn’t
find any against it.
There are two types of errors we can make: convicting someone innocent and
letting the perpetrator go unpunished. The same is true for statistical tests:
I’ve said it before, let me say it again: statisticians are terrible at naming
things.6 A more intuitive terminology comes from medicine. The outcome of a
medical test is termed positive if it indicates disease (as in ‘HIV-positive’) and
negative if not. The type I error corresponds to a false positive: diagnosing a
disease when the patient is healthy. The type II error corresponds to a false
negative: not detecting the disease although the patient is ill. See the table below
for a summary.
6
Confession time: I need to check Wikipedia every time ‘type I/II’ errors are mentioned.
           retain H0                   reject H0
H0 true    correct                     type I / false positive
H0 false   type II / false negative    correct
Figure 7.1: Illustration of p-values: The curve is the density of the test statistic
under the null hypothesis. The p-value is the area under the curve
beyond the observed test statistic Tbn .
This is also called the false positive rate. The subscript θ in Pθ indicates that
the probability is computed under the assumption θ∗ = θ: “If θ∗ was equal to θ,
what is the probability of rejecting H0 ?” In the two-sided case, Θ0 = {θ0 }; in the
one-sided case, θ0 is the worst-case parameter of Θ0 .
Let’s denote FTbn (t) = Pθ0 (Tbn ≤ t) as the CDF of Tbn under H0 . Then
Pθ0 (Tbn > t) = 1 − FTbn (t), Pθ0 (|Tbn | > t) = 1 − FTbn (t) + FTbn (−t).
The p-value is defined as p = 1 − FTbn (Tbn ) and p = 1 − FTbn (|Tbn |) + FTbn (−|Tbn |),
7
The max is actually a sup, but that ship has sailed.
Chapter 7 Testing 99
7.7 Power
The probabilities used above can be generalized as follows.
The value β(θ) is the probability of rejecting the null hypothesis if θ∗ were equal to θ. When the null hypothesis is false (θ∗ ∈ Θ1), we want to reject it with high probability. That is, we want the power β(θ) to be as large as possible for all θ ∈ Θ1.
Recall that
T̂n = (∆̂ − ∆∗) / √(σ̂A²/n + σ̂P²/m) →d N(0, 1),
so
β(∆) = P∆(T̂n > c) ≈ 1 − Φ((c − ∆) / √(σA²/n + σP²/m)).
A few observations:
• As the true difference ∆ grows larger (more positive), the power increases.
That is, we are more likely to reject the hypothesis that ∆∗ < 0. This makes
sense: the more positive the true ∆ is, the easier it is to detect.
8
The probability of a ‘five standard deviation event’ or 5σ-event.
• The larger the critical value c, the stronger the deviation from H0 has to be for us to reject and, consequently, the less powerful the test.
• If c − ∆ < 0 and the sample sizes n, m increase, the test becomes more powerful: if we have more data, it becomes easier to detect deviations from the null.
The observations made in this example hold more generally. Large deviations
from the null are easier to detect, and more data helps.
When the power of a test is low, the probability of rejecting H0 is small, no matter whether it is true or not. That's why T̂n ∉ R should not be interpreted as evidence for H0.
But equality only holds when A1 = · · · = Am. In the worst case, the events A1, . . . , Am are all disjoint. Therefore,
FWER = P(∪_{k=1}^m Ak) ≤ Σ_{k=1}^m P(Ak) = mα.
rejections among all the rejections. A key difference is that we don't assume that all hypotheses are true. Let m be the number of tests, R be the total number of rejections, and R0 be the number of false positives. Then FDR = E[R0/R]. The Benjamini-Hochberg (BH) procedure allows us to control the FDR. To ensure that FDR ≤ α:
Step 1. Sort the p-values in increasing order: p(1) ≤ · · · ≤ p(m).
Step 2. Find the largest k such that p(k) ≤ kα/m.
Step 3. Reject the null hypotheses corresponding to p(1), . . . , p(k).
(A code sketch of this procedure follows below.)
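A minimal sketch of the BH procedure (the example p-values are hypothetical):

import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses (sketch)."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)                          # ascending p-values
    thresholds = alpha * np.arange(1, m + 1) / m   # k * alpha / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest k with p_(k) <= k*alpha/m
        reject[order[: k + 1]] = True
    return reject

# Bonferroni, by contrast, simply compares every p-value to alpha / m.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
print(benjamini_hochberg(pvals))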
• Rank-based tests: Construct test statistics from ranking the data. Among
them are tests for equality of distributions and independence.
9
See, for example, https://en.wikipedia.org/wiki/Category:Statistical_tests.
Regression models
8
Broadly speaking, regression models are statistical models for conditional distributions. The goal is usually to explain some target quantity (Y) with the help of others (X). The models can be used to formalize scientific theories and make predictions. Outside of statistics, the term regression has gone out of fashion. But most methods trading under the names machine learning and artificial intelligence today are fundamentally regression models. We touched on regression models briefly in Chapter 4 and Example 5.15 and will expand on them a bit more in this chapter.
8.1 Terminology
A regression model involves two types of variables:
Y = β⊤X + ε, (8.1)
where
Remark 8.1. The model assumes a linear relationship between the response and
the predictors. Note that we could take, for example, X3 = X22 , so that non-linear
relationships can be represented as well. We will speak more about this later.
In Example 5.15, we separated the intercept from the remaining predictors, but
the current formulation will be more convenient.
Example 8.1. Let V and B be the visual and blue band magnitudes of a star. A linear regression model for the color-magnitude diagram is
V = β1 + β2 × (B−V) + ε,
where B−V is the color index. Fig. 8.1 shows an example of a linear regression model for the color-magnitude diagram of selected stars from the Hipparcos catalog. Each point represents a star; the straight line is the function β1 + β2 × (B−V). We see that, on average, the data exhibit an (almost) linear relationship: the bluer the star, the brighter it tends to be. Of course, not every star falls on the line β1 + β2 × (B−V). The vertical distance to the line is the error term ε. For some stars it is positive, for some negative; for some larger, for some smaller.
The regression coefficients in Fig. 8.1 were not chosen arbitrarily, but estimated
from the data. That’s our next topic.
Figure 8.1: Linear regression model for the color-magnitude diagram of 2655 stars
from the Hipparcos catalog.
It turns out that this criterion also works if ε is not Gaussian (but then β̂ is no longer the MLE). The intuition is that (Yi − β⊤Xi)² is a measure of prediction error. The smaller it is (on average), the better the model is at explaining Yi from Xi. The true parameter β∗ is the one that explains Yi best in the sense that the expected error E[(Y − β⊤X)²] is minimal.
To find an explicit expression for β̂, we equate the derivative of the criterion to zero:
(1/n) Σ_{i=1}^n (Yi − β̂⊤Xi) Xi⊤ = 0
⇔ (1/n) Σ_{i=1}^n Yi Xi⊤ = (1/n) Σ_{i=1}^n β̂⊤ Xi Xi⊤
⇔ β̂ = ((1/n) Σ_{i=1}^n Xi Xi⊤)⁻¹ (1/n) Σ_{i=1}^n Yi Xi. (8.2)
Note that Xi Xi⊤ is a p × p matrix, so that the solution involves matrix inversion. The estimator β̂ above is also called the ordinary least squares (OLS) estimator.
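A minimal sketch of equation (8.2) in code, on simulated (hypothetical) data. Solving the linear system is numerically preferable to explicitly inverting the matrix:

import numpy as np

rng = np.random.default_rng(10)
n = 500
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x2])        # intercept + one covariate
y = 1.0 + 2.0 * x2 + rng.normal(size=n)      # hypothetical linear model

# Solve (sum X_i X_i^T) beta = (sum Y_i X_i), i.e., equation (8.2).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)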
Theorem 8.2. Define β∗ = arg minβ E[(Y − β⊤X)²]. Then the OLS (8.2) is consistent for β∗: β̂ →p β∗.
Proof. Recall that β∗ = arg minβ E[(Y − β⊤X)²]. Using the same arguments as in (8.2), we can show that β∗ = E[X X⊤]⁻¹ E[Y X]. By the law of large numbers, (1/n) Σ_i Xi Xi⊤ →p E[X X⊤] and (1/n) Σ_i Yi Xi →p E[Y X], so β̂ →p β∗.
Remark 8.2. Note that Theorem 8.2 does not assume that the model (8.1) is correctly specified. The OLS converges to the best linear predictor β∗ (the one minimizing E[(Y − β⊤X)²]) in any case. However, if (8.1) does not hold, β̂⊤X does not converge to E[Y | X].
Let us look at the OLS estimator in a bit more detail in the simple case where X = (1, X2). One can show that the formula simplifies to
β̂1 = Ȳn − β̂2 X̄2,n, β̂2 = Rn Sn,Y / Sn,X2,
where Rn is the sample correlation of (Y, X2) and Sn,Y, Sn,X2 are the sample standard deviations of Y and X2, respectively. First note that the correlation is unit-free, while the sample standard deviations have the same units as the variables they're computed from.
Now let’s interpret the coefficients above:
• The intercept β̂1 is (literally) the average value of Yi after the average effect of Xi,2 has been removed. It has the same units as Yi. Its interpretation is sometimes meaningful and sometimes not. Essentially, it is the expected value of Yi if Xi,2 = 0.
Example 8.3. Consider again the color-magnitude diagram in Fig. 8.1. The straight line in the graph is in fact the OLS estimate, which gives
V̂ = 4.7 + 4.59 × (B−V).
The coefficient β̂1 is in the same units as V, i.e., magnitudes. It tells us that a star with (B−V) = 0 is expected to have a V-band magnitude of 4.7. The coefficient β̂2 is unit-free, because V and B−V have the same units. It tells us that, for an increase of 1 mag in B−V, we expect to see an increase of 4.59 mag in V.
Theorem 8.4. Define β∗ = arg minβ E[(Y − β⊤X)²]. Then the OLS β̂ satisfies, for all j = 1, . . . , p,
√n Σ̂⁻¹ᐟ² (β̂ − β∗) →d N(0, Ip×p).
The result follows from the multivariate CLT, but we won't bother with it. Note that Σ̂ is a p × p matrix, not a single number. The standard error for β̂j is computed as the jth diagonal element of Σ̂¹ᐟ²/√n.
Theorem 8.4 can be used for a Wald test for the effect of individual covariates:
H0: βj∗ = 0, H1: βj∗ ≠ 0.
Given the model Yi = β⊤Xi + εi and the OLS estimate β̂, define
Ŷi = β̂⊤Xi,
which are called the fitted values. The regression residuals are defined as
ε̂i = Yi − Ŷi = Yi − β̂⊤Xi.
The residuals approximate the error terms εi and are quite useful to check for model fit and misspecification.
The residual sum of squares (RSS) is defined as
RSS = Σ_{i=1}^n ε̂i² = Σ_{i=1}^n (Yi − β̂⊤Xi)²,
and measures the quality of the fit. We already saw it pop up in the asymptotic variance in Theorem 8.4. If the RSS is small, it means that our model predictions are close to the observed values. However, the RSS depends crucially on the variance of ε. If this variance is large, the RSS will be large, too. (Why?) A standardized version, called R-squared, is
R² = 1 − RSS / Σ_{i=1}^n (Yi − Ȳn)² = 1 − S²n,ε̂ / S²n,Y.
20 Df Model : 3
21 Covariance Type : nonrobust
22 ==============================================================================
23 coef std err t P >| t | [0.025 0.975]
24 ------------------------------------------------------------------------------
25 x1 0.4639 0.162 2.864 0.008 0.132 0.796
26 x2 0.0105 0.019 0.539 0.594 -0.029 0.050
27 x3 0.3786 0.139 2.720 0.011 0.093 0.664
28 const -1.4980 0.524 -2.859 0.008 -2.571 -0.425
29 ==============================================================================
30 Omnibus : 0.176 Durbin - Watson : 2.346
31 Prob ( Omnibus ) : 0.916 Jarque - Bera ( JB ) : 0.167
32 Skew : 0.141 Prob ( JB ) : 0.920
33 Kurtosis : 2.786 Cond . No . 176.
34 ==============================================================================
The first three instructions import the libraries and some data set. The fourth
instruction adds an intercept to the covariates (as the last element, because
prepend=false). The fifth instruction specifies which variables are response and
which are the covariates (Y = spector_data.endog, X = (spector_data.exog,
1)). The sixth instruction computes the OLS and the seventh instruction prints a summary of the fitted model.
There’s more information in the output than you will normally need and more
than what’s covered here. So let me just point you to the important bits.
• Below you’ll find the log-likelihood (assuming Gaussian errors) and the
model selection criteria AIC and BIC, which we’ll cover later.
• In the table below (lines 23–28) you see everything related to parameter
estimates. x1, x2, x3 are the names of the (random) covariates, const
refers to the intercept that we added to the model.
• The first column (coef) contains the estimated parameters βbk followed by
the standard error.
• The fourth column (P>|t|) is the p-value for H0 : βk∗ = 0 — not corrected
for multiple testing!
• The last two columns are the lower and upper bounds of a 95% confidence
interval. If you want another confidence level, you can compute your own
from the standard errors in the second column.
8.2.6 Heteroscedasticity
In the model formulation (8.1), we made no assumption about the variance of
. To derive the OLS criterion from the normal distribution, we assumed that
this variance is constant. Heteroscedasticity (another terrible name) refers to
situations where the variance depends on X, i.e., V[ | X] is not constant.
Consider for example the model in Fig. 8.1. In some B−V regions, the residuals ε̂i tend to be larger (in absolute terms). This is a common phenomenon, especially
Y | X ∼ N(β⊤X, σ²).
• Logistic regression: for a binary response Y ∈ {0, 1}, one models Y | X ∼ Bernoulli(g(β⊤X)) with the logistic link function
g(x) = eˣ / (1 + eˣ).
Instead of regression, we often call this a classification model, because it models the probability of class membership, e.g., radio-loud vs radio-quiet, star-forming or not, etc.
This is only a small sample from an extremely rich class of models. For example,
one can play with other link functions or let both parameters of the Gamma
family vary with X.
F = {f_{g(β⊤X),η} : β ∈ Rᵖ, η ∈ E}
5
6 # Instantiate a gamma family model with the default link function.
7 In [4]: gamma_model = sm.GLM(data.endog, data.exog, family=sm.families.Gamma())
8 In [5]: gamma_results = gamma_model.fit()
9 In [6]: print(gamma_results.summary())
10 Generalized Linear Model Regression Results
11 ==============================================================================
12 Dep . Variable : y No . Observations : 32
13 Model : GLM Df Residuals : 24
14 Model Family : Gamma Df Model : 7
15 Link Function : inverse_power Scale : 0.0035843
16 Method : IRLS Log - Likelihood : -83.017
17 Date : Fri , 21 Feb 2020 Deviance : 0.087389
18 Time : 13:59:13 Pearson chi2 : 0.0860
19 No . Iterations : 6
20 Covariance Type : nonrobust
21 ==============================================================================
22 coef std err z P >| z | [0.025 0.975]
23 ------------------------------------------------------------------------------
24 const -0.0178 0.011 -1.548 0.122 -0.040 0.005
25 x1 4.962 e -05 1.62 e -05 3.060 0.002 1.78 e -05 8.14 e -05
26 x2 0.0020 0.001 3.824 0.000 0.001 0.003
27 x3 -7.181 e -05 2.71 e -05 -2.648 0.008 -0.000 -1.87 e -05
28 x4 0.0001 4.06 e -05 2.757 0.006 3.23 e -05 0.000
29 x5 -1.468 e -07 1.24 e -07 -1.187 0.235 -3.89 e -07 9.56 e -08
30 x6 -0.0005 0.000 -2.159 0.031 -0.001 -4.78 e -05
31 x7 -2.427 e -06 7.46 e -07 -3.253 0.001 -3.89 e -06 -9.65 e -07
32 ==============================================================================
There are only minor differences to the example we’ve seen above. The code
above sets up a Gamma regression model (fourth instruction) instead of a linear
model. The model summary contains a bit less information, but the important
parts are still there. The estimate ν̂ of the fixed parameter is given as Scale in
line 15, right column.
Y = h(X) + ε, or Y | X ∼ F_{g(h(X)),η}.
Figure 8.2: Non-linear regression model for the color-magnitude diagram of 2655
stars from the Hipparcos catalog using polynomial expansions of order
2 (left) and 10 (right).
There is a sweet spot for q, but unfortunately it's hard to know where it is in advance. Later, we'll discuss methods for finding this sweet spot.
Spline functions
Splines are functions that are composed piece-wise from polynomials.
Figure 8.4: Spline basis functions on 5 knots for degrees 0 (left) and 3 (right).
Spline basis
So let’s get back to the statistical problem, estimating the unknown function h
in (8.3). If we assume that h is a spline on m knots, we now need to estimate
10 polynomials of degree q, giving m × q coefficients. That doesn’t sound very
convenient. But hold on: we also have this little side constraint of continuous
derivatives. Quite remarkably, this seemingly innocent assumptions reduces our
degrees of freedom a lot. Even more remarkably, any spline can be represented
conveniently as a linear combination of just m + q + 1 known basis functions Bj :
q+m
X
s(x) = βj Bj (x),
j=0
where
Bj (x) = xj , j = 0, . . . , q,
Bq+1+j (x) = max{0, (x − ξj )}q , j = 0, . . . , m − 1,
and βj are unknown coefficients (to be estimated). Other forms of the basis
exist, but the number of functions stays the same. You can create such a basis in
Python using statsmodels.gam.smooth basis.BSplines. Such basis functions
Bj for q = 1, 3 and m = 4 are shown in Fig. 8.4, where each color corresponds to
a different j. A spline function is simply a linear combination of these functions.
For a single covariate X ∈ [a, b], we can therefore assume
q+m
X
h(X) ≈ s(X) = βj Bj (X),
j=0
Figure 8.5: Spline regression model for the color-magnitude diagram of 2655 stars
from the Hipparcos catalog using cubic splines (q = 3).
or
Y ≈ Σ_{j=0}^{q+m} βj Bj(X) + ε = β⊤Z + ε,
where Z = (B0(X), . . . , B_{q+m}(X)).
Since the basis functions Bj are known, the coefficients β can be estimated easily with OLS or MLE. An example is shown in Fig. 8.5, where we fit cubic splines with m = 1 and m = 6 to the Hipparcos data. We observe that splines tend to be more stable than the polynomials in the previous section. The spline with m = 6 has 10 parameters, but produces a very reasonable model (in contrast to the polynomial of 10th order).
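A minimal sketch of spline regression using the truncated power basis from the formulas above, fitted by OLS on simulated (hypothetical) data:

import numpy as np

def tp_basis(x, knots, q=3):
    """Truncated power spline basis: x^j for j=0..q, plus
    max(0, x - xi_j)^q for each knot xi_j."""
    cols = [x**j for j in range(q + 1)]
    cols += [np.maximum(0.0, x - xi) ** q for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(11)
x = rng.uniform(0, 2, size=300)                      # hypothetical covariate
y = np.sin(3 * x) + rng.normal(scale=0.3, size=300)  # hypothetical response

knots = np.linspace(0, 2, 8)[1:-1]                   # m = 6 interior knots
Z = tp_basis(x, knots, q=3)                          # basis expansion
beta_hat = np.linalg.lstsq(Z, y, rcond=None)[0]      # OLS on the basis
y_fit = Z @ beta_hat                                 # fitted spline values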
Theorem 8.6. Suppose that |ξk − ξk−1| = 1/m for all k = 1, . . . , m. Under some regularity conditions,
E[ĥ(x)] = h(x) + O(m^(−(q+1))), V[ĥ(x)] = O(m/n), for all x ∈ [a, b].
Remark 8.3. The O-symbol is to be read as ‘is of the same order as’. For-
mally, for two sequences an , bn , an = O(bn ) means limn→∞ |an /bn | < ∞. More
intuitively, it says that if bn → 0, then also an → 0 at least as fast.
• We assume that the distance between any two subsequent knots ξk−1 and ξk is the same. That is, the spline is defined on intervals of the same length. (This is only to simplify the result; it's by no means necessary in practice.)
• The ‘regularity conditions’ are very mild. The main assumption is that
the true function h is a few times continuously differentiable. This just
excludes cases where h goes completely wild.
The last two bullets describe a ubiquitous phenomenon in function estimation (as
opposed to parameter estimation). We have a tuning parameter (here m) that
controls a trade-off between bias and variance. If we decrease the variance, we
increase the bias; if we decrease the bias, we increase the variance.
Gladly, large sample sizes n reduce the variance. We can therefore afford larger values of m when there's a lot of data. Again, there's a sweet spot that balances bias and variance in an optimal way. One can show that the mean squared error is optimal if m increases at the order n^(1/(2q+3)), but that's not helpful in practice. We'll see how to solve this shortly.
Remark 8.4. One can mathematically prove that the bias-variance trade-off is
unavoidable when estimating regression or density functions. For example, the
same phenomenon appears for the histogram, where more bins decrease bias, but
increase variance.
Remark 8.5. Just so you know: there is another popular way to control the
bias-variance tradeoff for splines. Here, we take a large number of knots m, but
put a penalty on the magnitude of coefficients βj . This is called a penalized spline
and the strength of the penalty is controlled by a parameter α ≥ 0. For α = 0
there is no penalty, and for α = ∞, all coefficients are βj = 0.
Y | X ∼ Fg(h(X)),η ,
where h is as in (8.4) and g is an appropriate link. Such models are still quite
easy to estimate and interpret, because the functions hk can be treated sepa-
rately. Don’t worry about the details of estimation, these models have great
implementations (e.g., statsmodels.gam).
where B_{jp,q} are the basis functions from before. Tensor product splines can approximate all continuous, p-dimensional functions with arbitrary accuracy. However, the model has (m + q + 1)ᵖ parameters to estimate and is much harder to interpret. As a rule of thumb, tensor product splines are only useful when p ≤ 3.
The principles we learned in this course also apply to these models, but require
more advanced mathematics. If you want to learn more about splines or GAMs,
there is an excellent book by Simon Wood “Generalized Additive Models: An
Introduction with R”.
M1(β1, ν): F^Γ_{g(β1⊤X),ν}, M2(β2, σ²): F^N_{g(β2⊤X),σ²}.
Example 8.8. Consider the linear regression model Y = β⊤X + ε. Suppose you want to use only one covariate, but you don't know which. Define
Example 8.9. Consider the spline GLM Fg(sm (X)),η , where m is the number of
knots the spline function sm is defined on. To choose the tuning parameter m,
define
Of course, all the examples above can be combined: you might want to choose
between different types of GAMs, the covariates to include, and the smoothing
parameter at the same time.
The two most popular criteria for model selection are Akaike’s information
criterion (AIC) and the Bayesian information criterion (BIC). Both are based
on the likelihood of a model. For model Mk, let ℓk denote the maximized log-likelihood and pk the number of parameters; then AICk = −2ℓk + 2pk and BICk = −2ℓk + pk ln(n), and the model with the smallest criterion value is selected. As a broad characterization:
• AIC selects the best predictive model among a number of possibly misspec-
ified models.
• BIC selects the true model (with minimal number of necessary parameters)
if it is included in the candidate set.
So as a general rule of thumb: if the main goal is prediction, use AIC; if the main
goal is identification of the truth, use BIC.
The two criteria can be used to select arbitrary statistical models with a
likelihood, not just regression models. Furthermore, for the linear model, one
commonly replaces `k by the residual sum of squares, which is equivalent to
assuming Gaussian errors.
Remark 8.6. There’s one caveat when model parameters are not estimated by
plain maximum-likelihood (like penalized splines). Then there is something called
effective number of parameters or effective degrees of freedom that needs to be
substituted for pk in the formulas above. Software usually takes care of that for
you.
Missing data
9
Missing data refers to situations where some of the objects or quantities that we
measure are not or only partially observed. Missing data can be a problem if
the observed sample gives a biased view on the whole population. Quite often, however, missingness can be accounted for by careful statistical modeling. This chapter gives an overview of different types of missingness and methods to address them.
Example 9.2. There’s a large data base of nearby stellar objects and the survey
is known to be complete. To get a sense of the data, you extract a random subset
of the observations. All objects not in this subset are missing.
This turns out to be the (rare) best case scenario. If data is MCAR, the usual
statistical procedures remain valid.
As an instructive example, suppose we want to estimate the mean E[Y]. The complete-data estimator would be just the sample mean Ȳ = (1/n) Σ_{i=1}^n Yi. Let's assume only m = Σ_{i=1}^n Ii < n of the data are actually observed. If we apply the sample mean to the incomplete data set, we get
Ỹ = Σ_{i=1}^n Yi Ii / Σ_{i=1}^n Ii = ((1/n) Σ_{i=1}^n Yi Ii) / ((1/n) Σ_{i=1}^n Ii) →p E[Yi Ii] / E[Ii], (9.1)
where the last step follows from applying the law of large numbers to the numerator and denominator separately. Because Yi and Ii are independent, E[Yi Ii] = E[Yi] E[Ii] and, thus, Ỹ →p E[Y].
The same holds true for essentially all statistical methods, including tests and
regression models. If we believe data is MCAR, there’s no reason to worry about
it any further. Unfortunately, that’s rarely the case.
Example 9.3. Suppose a survey is measuring stellar mass, but cannot detect
masses smaller than 1/100 of the sun’s mass. Less massive stars are missing
from the survey and massive stars are overrepresented.
Under MNAR, Yi and Ii are dependent, so that in general
E[Yi Ii] / E[Ii] ≠ E[Yi].
For example, when large Yi are less likely to be observed, we will underestimate
the true mean.
MNAR is called non-ignorable because we have to do something about it to obtain valid inferences. More precisely, we have to come up with a model for the mechanism leading to missing data. In general, this mechanism is not identifiable, meaning that it cannot be estimated from the observed data. Optimally, we know a thing or two about how data are collected and can use domain knowledge to model the mechanism. If that's not the case, the best we can do is make educated guesses and be very careful in drawing conclusions from the results.
Example 9.5. Suppose a study estimates exoplanet masses from optical photometry data. If the planet doesn't emit enough light for accurate photometric measurements, its mass is marked as missing. Here, only the photometric measurements determine missingness in mass. Thus, conditionally on the photometric outcome, the value of the mass and whether we observe it are independent.
Example 9.6. Censoring occurs most commonly when Yi is a time of some event
of interest; for example the time until light is reflected back to earth. We can wait
only a finite amount of time until this happens. If the light has not been reflected
at that time, all we know is that (i) it has not yet been reflected back (Ii = 0),
(ii) that the actual time Yi must be larger than Ỹi = time between emission from
earth and the end of the study.
We shall only consider the third category here, because it is both easy to implement
and very general.
Let’s start with a general setup. Suppose we have a model for the probability
π(Yi , Xi ) = P(Ii = 1 | Yi , Xi ). Here, Ii = 1(Yi is fully observed) such that Ii = 0
if Yi is not or only partially observed. If no covariates Xi are available, they can
be omitted in the formulas, i.e., π(Yi , Xi ) = π(Yi ).
The idea is as follows: first, we throw away all incomplete observations (including partially observed ones). Now for all y, x with π(y, x) < 1, observations with (Yi, Xi) = (y, x) are underrepresented in the remaining data; a complete data set would contain 1/π(y, x) times more of such observations. We correct for this by up-weighting the complete observations by this factor. This assumes that the quantity we compute is a sum or average. But as I've said earlier in the semester: almost everything in statistics is an average or well approximated by one.
To make this more concrete, reconsider the example of estimating the mean
E[Yi] from incomplete (MNAR or MAR) data. The IPW version of the sample
mean is
$$\bar{Y}^{IPW} = \frac{1}{n} \sum_{i=1}^n \frac{Y_i I_i}{\pi(Y_i, X_i)}.$$
The Ii in the numerator is responsible for “throwing away all incomplete data”.
The π(Yi, Xi) in the denominator is up-weighting the complete cases. By the law
of large numbers and the law of total expectation, we have
$$\bar{Y}^{IPW} \to_p E\left[\frac{Y_i I_i}{\pi(Y_i, X_i)}\right] = E\left[E\left[\frac{Y_i I_i}{\pi(Y_i, X_i)} \,\Big|\, Y_i, X_i\right]\right] = E\left[\frac{Y_i\, E[I_i \mid Y_i, X_i]}{\pi(Y_i, X_i)}\right] = E\left[\frac{Y_i\, \pi(Y_i, X_i)}{\pi(Y_i, X_i)}\right] = E[Y_i].$$
The first equality is the law of total expectation. The second equality is due to the
fact that Yi, Xi are fixed numbers if we condition on Yi, Xi; the third uses
E[Ii | Yi, Xi] = π(Yi, Xi). Hence, the IPW version of the sample mean is a
consistent estimator.
The same arguments apply whenever we rely on estimating one or more
expectations of the form E[g(Yi)] for some function g. That covers almost
everything we learned in this course, including empirical CDFs, histograms,
sample quantiles, sample variances, maximum-likelihood estimators, etc. In practice,
the probabilities π(Yi , Xi ) are rarely known. If we observe the indicator Ii , we
can estimate them with a regression model. If the indicator Ii is unobserved, we
have to use domain knowledge and EDA to postulate a plausible model.
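As an illustration, here is a minimal sketch of the IPW sample mean (not from the notes; the names and the MAR data-generating mechanism are made up). The missingness probabilities π(x) = P(Ii = 1 | Xi = x) are estimated by a logistic regression of the observed indicator on a fully observed covariate:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=n)
Y = 2.0 + X + rng.normal(size=n)              # true mean E[Y] = 2

# MAR: missingness depends only on the fully observed covariate X.
pi_true = 1 / (1 + np.exp(-(0.5 + X)))        # P(I = 1 | X)
I = rng.binomial(1, pi_true)

# Estimate pi(x) from the observed indicator via logistic regression.
model = LogisticRegression().fit(X.reshape(-1, 1), I)
pi_hat = model.predict_proba(X.reshape(-1, 1))[:, 1]

print("complete-case mean:", Y[I == 1].mean())   # biased upwards here
print("IPW mean:", np.mean(Y * I / pi_hat))      # approximately 2

Because Y and the missingness both depend on X, the complete-case mean is biased; the IPW correction removes the bias (up to estimation error in π).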
9.5 Takeaways
1. Missing data problems are common and it is important to think about them
carefully. What type of missingness do you face? What is the reason/mechanism
for incomplete observations?
procedures are then designed to update our belief optimally after seeing some
data.
As you can see, frequentism vs. Bayesianism is a matter of philosophy. There
has been considerable dispute over which view is the right one over the last decades,
and some hold strong opinions. Nowadays, the majority of statisticians take a rather
neutral stance and use whatever is most convenient in a given situation.
1. Choose a prior probability density π(θ) that expresses our beliefs about the
unknown parameter θ.
Note that step 1 only makes sense if we view the unknown parameter as a random
variable Θ. In that view, the true parameter θ is the realization of this random
variable in our universe.1 Accordingly we write the statistical model in step 2 as
f (x | θ). It is our model for the data conditional on the event Θ = θ. In practice,
this model is formulated just as in the frequentist paradigm.
and Θ are discrete. Then Bayes theorem (Theorem 2.25) and the law of total
probability (Theorem 2.26) give
$$f(\theta \mid x) = P(\Theta = \theta \mid X = x) = \frac{P(X = x \mid \Theta = \theta)\, P(\Theta = \theta)}{P(X = x)} = \frac{P(X = x \mid \Theta = \theta)\, P(\Theta = \theta)}{\sum_{\theta'} P(X = x \mid \Theta = \theta')\, P(\Theta = \theta')} = \frac{P(X = x \mid \Theta = \theta)\, \pi(\theta)}{\sum_{\theta'} P(X = x \mid \Theta = \theta')\, \pi(\theta')}.$$
If X and Θ are continuous, the same reasoning with densities gives
$$f(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta)}{\int f(x \mid \theta')\, \pi(\theta')\, d\theta'}.$$
The above rule gives the optimal update of our belief π(θ) having seen a single
observation X = x. If we see multiple iid observations X1 , . . . , Xn , we replace
the likelihood of a single observation f (x | θ) by the joint likelihood
$$L_n(\theta) = f(X_1, \dots, X_n \mid \theta) = \prod_{i=1}^n f(X_i \mid \theta).$$
This gives
$$f(\theta \mid X_1, \dots, X_n) = \frac{L_n(\theta)\, \pi(\theta)}{\int L_n(\theta)\, \pi(\theta)\, d\theta}.$$
The denominator $\int L_n(\theta)\, \pi(\theta)\, d\theta$ is called the marginal likelihood of the data. It
is a normalizing constant not depending on θ and usually irrelevant for inference.
Thus, the main takeaway is
$$f(\theta \mid X_1, \dots, X_n) \propto L_n(\theta)\, \pi(\theta).$$
[Figure 10.1: prior density and posterior densities after 20 and 50 observations, plotted as functions of p.]
A seasoned statistician would immediately realize that this is proportional to the
density of a $\mathrm{Beta}\bigl(\sum_{i=1}^n X_i + 1,\; n - \sum_{i=1}^n X_i + 1\bigr)$ random variable (‘proportional’
only because we threw away the normalizing constant). Hence, our posterior belief
about the unknown parameter p is expressed as
$$p \mid X_1, \dots, X_n \sim \mathrm{Beta}\left(\sum_{i=1}^n X_i + 1,\; n - \sum_{i=1}^n X_i + 1\right).$$
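Numerically, this posterior is directly available in scipy. A quick sketch (not from the notes; the data here are simulated with true p = 0.3, matching the figures):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, p_true = 50, 0.3
X = rng.binomial(1, p_true, size=n)        # X_1, ..., X_n in {0, 1}
S = X.sum()

posterior = stats.beta(S + 1, n - S + 1)   # p | X_1, ..., X_n
print("posterior mean:", posterior.mean()) # equals (S + 1) / (n + 2)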
The set C is called a credible interval instead of a confidence interval to emphasize the
difference in paradigm. The above is a probability statement about the unknown
parameter θ (which would be meaningless under the frequentist paradigm): given
the data we have, our belief that θ ∈ C is described by the probability 1 − α.
If the posterior density has a simple form, posterior means and credible sets
are easy to extract.
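For instance, continuing the scipy sketch above, an equal-tailed 95% credible interval falls straight out of the Beta quantile function:

alpha = 0.05
C = posterior.ppf([alpha / 2, 1 - alpha / 2])  # equal-tailed credible interval
print("95% credible interval:", C)             # our belief that p is in C is 0.95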
Hence, the posterior mean is a weighted average of the sample mean X̄n
(which is the frequentist MLE) and the prior mean, 1/2. As n → ∞, the weight
for the sample mean, n/(n + 2), tends to one. This is a general phenomenon:
for large samples, Bayesian and frequentist estimates are very similar. In small
samples, however, the prior contribution matters.
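For completeness, here is the one-line computation behind this claim, writing $S = \sum_{i=1}^n X_i$ and using that a Beta(α, β) variable has mean α/(α + β):
$$E[p \mid X_1, \dots, X_n] = \frac{S + 1}{n + 2} = \frac{n}{n + 2}\,\bar{X}_n + \frac{2}{n + 2} \cdot \frac{1}{2}.$$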
Figure 10.2: The effect of a biased prior on the posterior densities in Example 10.2
Well, no matter the evidence, if they apply Bayes theorem to update their
beliefs, they will end up with P(A is best | data) = 1. (You can do the calculation
yourself.) They will vote for the idiot, no matter how outrageous his actions. This
is a very extreme example, but a weaker phenomenon occurs if a voter's prior
is strongly biased towards one candidate. The stronger the bias, the more
contradictory evidence the voter must see before changing their opinion.
So what does that mean for Bayesian procedures? First, if we assign prior
probability 0 to some region C of the parameter space, the posterior probability
for this region will inevitably be 0. Thus, we should make sure that the prior
density assigns positive mass to all possible outcomes. Second, if our prior heavily
favors a certain region, this bias will only gradually fade out from the posterior.
This is illustrated in Fig. 10.2, where we chose a prior that is heavily biased
towards large values of p. Given the same data as in Fig. 10.1, the posteriors
preserve the bias for large values of p and only gradually move towards the true
value p = 0.3.
model by g(θ) instead of θ, the flat prior for θ turns into a non-flat prior for
g(θ) (this follows from the transformation-of-densities theorem, Theorem 2.58).
A better choice for non-informative priors is Jeffreys' prior
$$\pi(\theta) \propto I(\theta)^{1/2},$$
where I(θ) is the Fisher information from Section 6.3. This prior can indeed be
shown to be transformation invariant, but it is often hard to compute.
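As a standard example (not worked out in the notes): for the Bernoulli model with success probability p, the Fisher information is I(p) = 1/(p(1 − p)), so Jeffreys' prior is
$$\pi(p) \propto \frac{1}{\sqrt{p(1 - p)}},$$
which is exactly the Beta(1/2, 1/2) density up to normalization.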
(The flat prior π(θ) ≡ 1 is a special case with α = β = 1.) Conjugate priors are
nice because they allow us to do everything in closed form. However, there are only
a handful of rather simple statistical models for which conjugate priors are known.
2 We need to switch to the frequentist paradigm to objectively assess the quality of estimators.
The Bayesian view is always subjective through the prior.
implement this yourself, there are excellent libraries (like emcee and PyStan for
Python). So we’ll just quickly brush over the main idea.
The goal is to simulate from the posterior density f(θ | X1, . . . , Xn) ∝
Ln(θ)π(θ), ideally without knowing the normalizing constant. We do this
by constructing a stationary Markov chain Θ1, . . . , ΘT. The sequence of random
variables Θ1, . . . , ΘT is called stationary if all Θt have the same distribution. In
our case, we want this distribution to be the posterior. However, random variables
in a Markov chain are not independent. They only need to satisfy the Markov
property
$$f(\Theta_t \mid \Theta_{t-1}, \dots, \Theta_1) = f(\Theta_t \mid \Theta_{t-1}).$$
Hence, the distribution of Θt depends on the past realizations Θ1 , . . . , Θt−1 , but
only through the most recent element Θt−1 .
The dependence in a stationary Markov chain is weak enough for the law of
large numbers to hold. So if we are able to simulate such a stationary Markov
chain Θ1 , . . . , ΘT , the posterior mean can be approximated by
$$\hat{\theta} = \frac{1}{T} \sum_{t=1}^T \Theta_t.$$
1. Pick some density q(· | Θt−1) that we can easily simulate from, called the proposal
distribution. For example, we may take q(· | Θt−1) to be the N(Θt−1, σ²) density if
θ ∈ R, or a suitable Beta density if θ ∈ (0, 1).
3. For t = 1, . . . , T:
(i) Simulate a proposal value Θt ∼ q(· | Θt−1).
(ii) Compute
$$R = \frac{L_n(\Theta_t)\, \pi(\Theta_t)\, q(\Theta_{t-1} \mid \Theta_t)}{L_n(\Theta_{t-1})\, \pi(\Theta_{t-1})\, q(\Theta_t \mid \Theta_{t-1})}.$$
Markov chain if q assigns positive probability to all values in the parameter range
of θ. How much dependence there is between Θt and Θt−1 depends on two factors:
how much the proposal density q(· | Θt−1) concentrates around Θt−1, and how
often we set Θt = Θt−1 in step 3(iii). The dependence will be weaker the closer
q(· | Θt−1) is to the posterior density f(θ | X1, . . . , Xn).
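To make the algorithm concrete, here is a minimal random-walk Metropolis–Hastings sketch (not part of the notes) for the Bernoulli example: the target is f(p | X1, . . . , Xn) ∝ Ln(p)π(p) with a flat prior on (0, 1), and the Gaussian proposal is symmetric, so the q-terms in R cancel.

import numpy as np

rng = np.random.default_rng(7)
X = rng.binomial(1, 0.3, size=50)       # simulated data, true p = 0.3
S, n = X.sum(), len(X)

def log_post(p):
    # log L_n(p) + log pi(p) with a flat prior; -inf encodes zero prior mass.
    if not 0.0 < p < 1.0:
        return -np.inf
    return S * np.log(p) + (n - S) * np.log(1 - p)

T, sigma = 20_000, 0.1
theta = np.empty(T)
theta[0] = 0.5                          # starting value
for t in range(1, T):
    proposal = theta[t - 1] + sigma * rng.normal()       # step 3(i)
    log_R = log_post(proposal) - log_post(theta[t - 1])  # step 3(ii), log scale
    if np.log(rng.uniform()) < log_R:   # accept with probability min(1, R)
        theta[t] = proposal
    else:
        theta[t] = theta[t - 1]         # reject: keep the previous value

print("posterior mean estimate:", theta[1000:].mean())   # discard burn-in

The if/else corresponds to the acceptance step 3(iii); working on the log scale avoids numerical underflow of Ln(p) for larger samples.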
3 I heard ‘Statistical rethinking’ by Richard McElreath is nice for applications.