
1 Probability theory

Subjects                                  Sections
Random variables                          1.1, 1.4, 1.5, 1.6
Order statistics                          5.4
Moments & moment generating function      2.2, 2.3
Multivariate variables                    4.1, 4.2
Law of total expectation                  4.4.3, 4.4.7
Statistical inequalities                  4.7.7
Convergence concepts                      5.5

Statistics is a mathematical and abstract subject, especially when encountered for
the first time. Its foundation is built on probability theory, which provides a means of
modeling the world around us.

1.1 Repetition
To succeed in this course you should brush up on your knowledge of basic probability:
key distributions and how to calculate with random variables. We start with a quick
review of a few facts. Note that a list of popular distributions, their moments and
other properties can be found in Chapter 3 and the Appendix of Casella and Berger.

1.1.1 Random variables and probability structures


The main objective of a statistician is to draw conclusions about the world around us
by studying an experiment.

Definition 1.1 (1.1.1). The set S of possible outcomes of an experiment is called the
sample space for the experiment.

Suppose the experiment consists of tossing a coin two times. Then there are only four
possible outcomes, and thus our sample space is given by S = {hh, ht, th, tt}. If we are
measuring the amount of time till the next time it rains, then the sample space is chosen
large enough to encompass all possible outcomes and thus we choose S = (0, ∞).

Definition 1.2 (1.1.2). An event A is a collection of possible outcomes of the experiment,
that is, A ⊆ S.

We obtain a probability structure once we define a probability function P from the
power set of S to [0, 1] that tells us what probability each event has. If the coin in our
previous example is fair, then we would have P({hh}) = P({ht}) = P({th}) = P({tt}) =
1/4. Usually one is interested in a particular question concerning the experiment. In
such a case it is easier to deal with a summary variable than with the original probability
structure.

Definition 1.3 (1.4.1). A random variable is a function from a sample space S to the
real numbers R.

Suppose we are only interested in the total number of heads. Then we can define X as
such and let X(hh) = 2, . . . , X(tt) = 0. We can then extend the probability structure
to the real numbers by defining for each A ⊆ R the probability function PX by

P(X ∈ A) = PX (A) = P({s ∈ S | X(s) ∈ A}).

It is a lot of work to fully define the experiment with probability structure and then go on
to infer the probability structure for a random variable. We can also work directly with
random variables without describing the underlying sample space.
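To make this concrete, here is a minimal Python sketch (our own illustration; the names are not from Casella and Berger) that computes the induced probability PX for the two-coin experiment directly from the definition.

```python
from fractions import Fraction

# Sample space and fair-coin probability function for two tosses.
S = ["hh", "ht", "th", "tt"]
P = {s: Fraction(1, 4) for s in S}

# X counts the number of heads in an outcome.
def X(s):
    return s.count("h")

# P(X in A) = P({s in S : X(s) in A}), the induced probability P_X.
def P_X(A):
    return sum(P[s] for s in S if X(s) in A)

print(P_X({2}))      # 1/4, the probability of two heads
print(P_X({0, 1}))   # 3/4, at most one head
```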

Definition 1.4 (1.5.1). The cumulative distribution function (cdf) of a random variable
X is defined by FX (x) = P(X ≤ x) for all x ∈ R.

It turns out that the cdf of a random variable fully describes its probability structure:
two random variables X and Y are identically distributed, that is P(X ∈ A) = P(Y ∈ A)
for all A ⊆ R, if and only if FX(x) = FY(x) for all x ∈ R. Therefore we can disregard the
underlying experiment and just talk about random variables having a certain cdf.

Definition 1.5 (1.6.1). The probability mass function (pmf) of a discrete random variable
X is given by f (x) = P(X = x).

Definition 1.6 (1.6.3). The probability density function (pdf) of a continuous random
variable X is given by f(x) = (d/dx) FX(x).
For discrete distributions we have P(X ∈ A) = ∑_{x∈A} f(x), while for continuous random
variables we have P(X ∈ A) = ∫_A f(x) dx. We will assume that the pdf always exists
and therefore we can readily swap between the different descriptions of the probability
structure for a random variable.

1.1.2 Transformations
Let X be a random variable taking values in X . Any function of X, say Y = g(X), is again
a random variable.¹ The distribution of Y can be expressed in terms of the distribution
of X, by noting that

P(Y ≤ y) = P(g(X) ≤ y) = P({x ∈ X | g(x) ≤ y}) = ∫_{x∈X : g(x)≤y} fX(x) dx.

The region over which to integrate, {x ∈ X | g(x) ≤ y}, is not always easy to identify. An
important special case, however, is when g(x) is a monotonic function. For example,
{x ∈ X | g(x) ≤ y} = {x ∈ X | x ≤ g^{−1}(y)}   if g(x) is monotonically increasing,
{x ∈ X | g(x) ≤ y} = {x ∈ X | x ≥ g^{−1}(y)}   if g(x) is monotonically decreasing.

Example 1.7 (2.1.4). Suppose X ∼ Uniform(0, 1) and let Y = g(X) = − log X. We
derive the pdf of Y via

FY(y) = P(Y ≤ y) = P(− log X ≤ y) = P(X ≥ e^{−y}) = 1 − P(X ≤ e^{−y}) = 1 − FX(e^{−y}) = 1 − e^{−y}.

Therefore we get fY(y) = (d/dy) FY(y) = e^{−y} and thus Y ∼ Exponential(1).
¹ Formally, this holds only for “measurable” functions g, but an understanding of this nuance
requires elements of measure theory that fall outside the scope of this course.
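A quick simulation (our own sketch, assuming numpy is available) confirms the conclusion of Example 1.7: minus the logarithm of a uniform draw behaves like an Exponential(1) random variable.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=100_000)
y = -np.log(u)  # Y = g(X) = -log X with X ~ Uniform(0, 1)

# The Exponential(1) distribution has mean 1 and cdf 1 - exp(-y).
print(y.mean())                              # approx 1.0
print(np.mean(y <= 1.0), 1 - np.exp(-1.0))   # empirical vs exact F_Y(1)
```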

Definition 1.8 (5.4.1). Let X = (X1 , . . . , Xn ) be an iid vector and order X1 , . . . , Xn
from smallest to largest. Then the k’th element in the new ordering is called the k’th order
statistic, denoted X(k) . Special order statistics are X(1) = min{X1 , . . . , Xn } and X(n) =
max{X1 , . . . , Xn }.
Example 1.9. Suppose X1 , . . . , Xn ∼ Exponential(λ) are iid. We derive the distribution
of X(1) via

P(X(1) ≤ x) = 1 − P(X(1) > x) = 1 − P(X1 > x, . . . , Xn > x)
            = 1 − ∏_{i=1}^{n} P(Xi > x) = 1 − ∏_{i=1}^{n} e^{−λx} = 1 − e^{−nλx}.

Therefore fX(1)(x) = (d/dx)(1 − e^{−nλx}) = nλ e^{−nλx} and thus X(1) ∼ Exponential(nλ).
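The same conclusion can be checked by simulation; the sketch below (our own, with the illustrative values n = 5 and λ = 2) compares the empirical mean of the minimum with the theoretical mean 1/(nλ).

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, reps = 5, 2.0, 100_000

# Each row holds X_1, ..., X_n ~ Exponential(lambda); take the row minimum.
x = rng.exponential(scale=1.0 / lam, size=(reps, n))
x_min = x.min(axis=1)

# X_(1) should be Exponential(n * lambda), i.e. have mean 1 / (n * lambda).
print(x_min.mean(), 1.0 / (n * lam))  # both approx 0.1
```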

1.1.3 Moments
The expected value of a random variable X is its average value, where the average is
weighted according to the pdf. It can be interpreted as a measure of the center of a
distribution, as we think of averages as being middle values. By weighting values with
their probability of occurrence, we hope to obtain a number that summarises a typical or
expected value of an observation of a random variable.
Definition 1.10 (2.2.1). The expected value or mean of a random variable g(X) is defined
as

E(g(X)) = ∑_{x∈X} g(x) fX(x)       if X is discrete,
E(g(X)) = ∫_{x∈X} g(x) fX(x) dx    if X is continuous.

If E|g(X)| = ∞, we say that E(g(X)) does not exist.


Example 1.11 (2.2.3). Let X ∼ Bernoulli(p). Then

E(X) = ∑_{x∈X} x f(x) = ∑_{x=0}^{1} x P(X = x) = 1 × P(X = 1) + 0 × P(X = 0) = P(X = 1) = p.

Example 1.12. Suppose X ∼ N(µ, σ²). Then

E(X) = ∫_{−∞}^{∞} x · (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx = ∫_{−∞}^{∞} (y + µ) · (1/√(2πσ²)) e^{−y²/(2σ²)} dy
     = ∫_{−∞}^{∞} y · (1/√(2πσ²)) e^{−y²/(2σ²)} dy + µ ∫_{−∞}^{∞} (1/√(2πσ²)) e^{−y²/(2σ²)} dy
     = 0 + µ = µ.

We used the substitution y = x − µ to obtain the sum of two integrals. The first integrand
is an odd function, so its integral equals zero, while the second integral is that of a normal
pdf and thus equals one.
Other expectations that will often be used are:
Definition 1.13 (2.3.2). Let X be a random variable. The variance is given by Var(X) =
E(X − E(X))². Note that expanding the square gives Var(X) = E(X² − 2X E(X) + (E X)²) =
E(X²) − 2(E X)² + (E X)² = E(X²) − (E X)².
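The shortcut formula is easy to verify numerically; the following small sketch (our own, on an arbitrary three-point distribution) computes the variance both ways.

```python
import numpy as np

# A small discrete distribution: values and their probabilities.
x = np.array([0.0, 1.0, 3.0])
p = np.array([0.5, 0.3, 0.2])

ex = np.sum(x * p)                   # E(X)
ex2 = np.sum(x**2 * p)               # E(X^2)
var_def = np.sum((x - ex) ** 2 * p)  # E(X - E(X))^2
var_short = ex2 - ex**2              # E(X^2) - (E X)^2

print(var_def, var_short)  # identical, as the identity promises
```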

Since summation and integration are linear operators, it is straightforward to show
that the expectation is a linear operator as well. In particular, the following useful rules
can be derived.

Lemma 1.14. Let X and Y be random variables and let a, b ∈ R be real numbers. Then

• E(aX + bY ) = a E(X) + b E(Y ).

• Var(aX + bY ) = a² Var(X) + b² Var(Y ) + 2ab Cov(X, Y ).

The expectation and variance of X are functions of E(X) and E(X 2 ), which are referred
to as the first and second moments of X, respectively. In general, we define the k-th
moment as follows.

Definition 1.15 (2.3.1). Let k ∈ N, then we define the k’th moment of X as E(X k ).

While the moments of a random variable are easy to define, they may be difficult to
derive analytically. Moreover, depending on the underlying probability distribution, not
all moments may exist. In fact, as illustrated by the following example, there exist random
variables for which no moments exist at all.

Example 1.16 (2.2.4). Let X be a Cauchy random variable, whose pdf is given by

fX(x) = (1/π) · 1/(1 + x²),   −∞ < x < ∞.

Then, it follows that

E|X| = (1/π) ∫_{−∞}^{∞} |x|/(1 + x²) dx = (2/π) ∫_{0}^{∞} x/(1 + x²) dx = lim_{K→∞} (2/π) ∫_{0}^{K} x/(1 + x²) dx
     = lim_{K→∞} (1/π) [log(1 + x²)]_{x=0}^{x=K} = lim_{K→∞} (1/π) log(1 + K²) = ∞.

Hence, we conclude that E(X) does not exist.
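The practical consequence of a nonexistent mean is easy to see in simulation: running averages of Cauchy draws never settle down. The sketch below is our own illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_cauchy(size=100_000)

# Running averages: (X_1 + ... + X_n) / n for n = 1, ..., N.
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

# With no finite mean, the averages never stabilise; occasional huge
# observations keep dragging them around.
print(running_mean[[999, 9999, 99998]])
```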

For some random variables, it is possible to define a unique characterization of the
distribution via the so-called Moment Generating Function (MGF).

Definition 1.17 (2.3.6). Let X be a random variable with cdf FX(x). The Moment
Generating Function (MGF) of X, denoted by MX(t), is defined as

MX(t) = E(e^{tX}),

provided that the expectation exists for some t in a neighbourhood of zero. That is, there
is an h > 0 such that E(e^{tX}) < ∞ for all −h < t < h.

The MGF is most often used to establish the equality in distribution between (limits
of) random variables. Sometimes, however, the MGF can be used to simplify the process
of deriving the moments of a random variable.

Theorem 1.18 (2.3.7). If X has MGF MX(t), then

E(X^n) = (d^n/dt^n) MX(t) |_{t=0}.

4
Proof. We only cover the proof for a continuous random variable. Assuming that we
can interchange integration and differentiation, it holds that

(d^n/dt^n) MX(t) = (d^n/dt^n) ∫ e^{tx} fX(x) dx = ∫ (∂^n/∂t^n) e^{tx} fX(x) dx = ∫ x^n e^{tx} fX(x) dx = E(X^n e^{tX}).

Hence,

(d^n/dt^n) MX(t) |_{t=0} = E(X^n e^{0·X}) = E(X^n). ■
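As a sanity check (our own sketch, using sympy; that the Exponential(λ) MGF is MX(t) = λ/(λ − t) for t < λ is a standard fact), we can differentiate an MGF symbolically and recover the known moments E(X^n) = n!/λ^n.

```python
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)

# MGF of the Exponential(lambda) distribution, valid for t < lambda.
M = lam / (lam - t)

# E(X^n) is the n-th derivative of M at t = 0.
for n in (1, 2, 3):
    moment = sp.diff(M, t, n).subs(t, 0)
    print(n, sp.simplify(moment))  # 1/lambda, 2/lambda**2, 6/lambda**3
```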

We end this section with a useful statistical inequality for the expectation of convex or
concave transformations of random variables. First, let us recall the definition of convexity.

Definition 1.19. A function g(x) is convex if it holds that

g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)

for all x, y and 0 < λ < 1. The function g(x) is concave if −g(x) is convex.

The geometric interpretation of a convex function is that each line segment connecting
two points on the graph will lie above the graph itself, as illustrated in Figure XXX.

Theorem 1.20 (4.7.7, Jensen’s inequality). For any random variable X, it holds that

• E(g(X)) ≥ g(E(X)) if g(x) is convex,

• E(g(X)) ≤ g(E(X)) if g(x) is concave.

When g(x) is strictly convex/concave and X is not almost surely constant, the above
inequality becomes strict as well.

Proof. We prove the version of Jensen’s inequality for strictly convex functions, while
leaving the proof of the other cases as an individual exercise. Take any point x0 and
recall that the equation of the line tangent to g(x) at the point (x0 , g(x0 )) is given by
h(x) = g(x0 ) + b(x − x0 ), where b is the slope of the tangent line. Strict convexity implies
that the graph of g lies strictly above h for all x ≠ x0 , i.e.

g(x) > g(x0) + b(x − x0),   x ≠ x0.

Since this holds for all x, x0 with x ≠ x0 , we may choose x0 = E(X) and x = X. Then,

g(X) > g(E(X)) + b(X − E(X)) ⇒ E(g(X)) > E(g(E(X))) + b E(X − E(X)) = g(E(X)) + b · 0
                             ⇒ E(g(X)) > g(E(X)),

where the case X = E(X) almost surely is excluded by the assumption that X is not
almost surely constant. ■
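Jensen’s inequality is easy to see numerically; in the sketch below (our own illustration, with X ∼ Exponential(1)) we compare E(g(X)) and g(E(X)) for a convex and a concave choice of g.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=100_000)  # E(X) = 1

# Convex g(x) = x^2: E(g(X)) should exceed g(E(X)).
print(np.mean(x**2), np.mean(x) ** 2)          # approx 2 > approx 1

# Concave g(x) = log(x): E(g(X)) should fall below g(E(X)).
print(np.mean(np.log(x)), np.log(np.mean(x)))  # approx -0.577 < approx 0
```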

Throughout this course we will make extensive use of Jensen’s inequality. One valuable
consequence of Jensen’s inequality can already be stated now.

Lemma 1.21. If E |X|k < ∞, then E |X|m < ∞ for all k, m ∈ N with m < k.

Proof. Note that g(y) = y^{k/m} is convex on [0, ∞), because k/m > 1. Then, by
Jensen’s inequality applied to the random variable |X|^m, it holds that

∞ > E|X|^k = E((|X|^m)^{k/m}) ≥ (E|X|^m)^{k/m},

which implies E|X|^m < ∞. Hence, we have shown that if the k-th moment exists, all lower
moments exist as well. ■

1.1.4 Random vectors and dependence


Often we are interested in multiple random variables and how they behave together.
Definition 1.22 (4.1.1). An n-dimensional random vector X = (X1 , . . . , Xn ) is a vector
where each element is a random variable.
Let X and Y be two random variables. Then the random vector (X, Y ) has a joint
pdf denoted by f (x, y). If X and Y are discrete, then f (x, y) = P (X = x, Y = y). From
now on we assume X and Y are continuous, but all the results also work in the discrete
case; in that situation the integrals have to be replaced by summation signs.
Theorem 1.23 (4.1.6). Let X and Y be two random variables and let f (x, y) denote their
joint density. Then the univariate densities can be obtained via
fX(x) = ∫_{−∞}^{∞} f(x, y) dy,      fY(y) = ∫_{−∞}^{∞} f(x, y) dx.

Oftentimes we are interested in conditional probabilities, e.g. what is the probability
that X is in a given region given that Y = 3.
Definition 1.24 (4.2.3). Let X and Y be two random variables with joint density f(x, y).
For any x with fX(x) > 0, the conditional pdf of Y given that X = x is the function of y
defined by

f(y | x) = f(x, y)/fX(x).

The conditional densities can be used to calculate conditional expectations. Specifically
we have

E(Y | X = x) = ∫_{−∞}^{∞} y f(y | x) dy.

Note that this gives a number that depends on the value x that we chose for the random
variable X. We thus get a whole family of expectations that belong to a whole family
of distributions, one for each x ∈ X . When we want to talk about this whole family of
distributions we write about “the distribution of Y | X”. The expectation E(Y | X) is
now a random variable whose value depends on the value of X.
Example 1.25 (4.2.4). Suppose X and Y have joint density

f (x, y) = e−y if 0 < x < y < ∞.

Then we can derive the marginal density for X by

fX(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_{x}^{∞} e^{−y} dy = [−e^{−y}]_{y=x}^{∞} = 0 − (−e^{−x}) = e^{−x}.
Thus X has an Exponential(1) marginal distribution. We can also calculate the conditional
distribution of Y given X = x via

f(y | x) = f(x, y)/fX(x) = e^{−y}/e^{−x} = e^{−(y−x)} if 0 < x < y < ∞.
We continue and derive the conditional expectation

E(Y | X = x) = ∫_{−∞}^{∞} y f(y | x) dy = ∫_{x}^{∞} y e^{−(y−x)} dy = [−y e^{−(y−x)} − e^{−(y−x)}]_{y=x}^{∞}
             = (0 + 0) − (−x − 1) = 1 + x.

We finally conclude that E(Y | X) = 1 + X.
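This conclusion can be checked by Monte Carlo; the sketch below (our own, using the facts derived above that X ∼ Exponential(1) and that Y − x given X = x is Exponential(1)) estimates E(Y | X ≈ x0) by averaging Y over draws with X close to x0.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# From the example: X ~ Exponential(1) and, given X = x,
# Y - x ~ Exponential(1), since f(y | x) = exp(-(y - x)) for y > x.
x = rng.exponential(size=n)
y = x + rng.exponential(size=n)

# Estimate E(Y | X near 1.5) by averaging Y over draws with X near 1.5.
x0 = 1.5
band = np.abs(x - x0) < 0.05
print(y[band].mean(), 1 + x0)  # approx 2.5 = 1 + x0
```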


Conditional expectations and variances can be very useful to find univariate expectations
and variances. The next result is called the law of total expectation.

Theorem 1.26 (4.4.3 & 4.4.7). Let X and Y be random variables. Then

• E X = E(E(X | Y )).

• Var X = E(Var(X | Y )) + Var(E(X | Y )).
Proof. We only prove the first result. Denote the joint pdf of X and Y as f(x, y) and
the univariate density of X as fX(x). Then

E(X) = ∫_{−∞}^{∞} x fX(x) dx = ∫_{−∞}^{∞} x (∫_{−∞}^{∞} f(x, y) dy) dx = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f(x, y) dx dy
     = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f(x | y) fY(y) dx dy = ∫_{−∞}^{∞} (∫_{−∞}^{∞} x f(x | y) dx) fY(y) dy
     = ∫_{−∞}^{∞} E(X | Y = y) fY(y) dy = E(E(X | Y )). ■
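Both parts of the theorem are easy to check by simulation on a toy hierarchy (our own illustrative choice: Y ∼ Exponential(1) and X | Y ∼ Normal(Y, 1), so that E(X) = E(Y) = 1 and Var(X) = 1 + Var(Y) = 2).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000

# A toy hierarchy: Y ~ Exponential(1) and X | Y ~ Normal(Y, 1).
y = rng.exponential(size=n)
x = rng.normal(loc=y, scale=1.0)

# E(X) = E(E(X | Y)) = E(Y) = 1.
print(x.mean())
# Var(X) = E(Var(X | Y)) + Var(E(X | Y)) = 1 + Var(Y) = 2.
print(x.var())
```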


Probability theory becomes increasingly complicated as the dimension of the random
vector increases. Complications can be avoided, though, if the random variables in question
have little to do with each other.
Definition 1.27 (4.2.5). Let X and Y be random variables with joint pdf f(x, y). Then
X and Y are called independent if f(x, y) = fX(x)fY(y).
Let X = (X1 , . . . , Xn ) be a vector of random variables. If X1 , . . . , Xn are all indepen-
dent, then the pdf splits into a product. Let gi be the pdf of Xi , then

fX(x) = fX(x1 , . . . , xn ) = ∏_{i=1}^{n} gi(xi) = g1(x1) · · · gn(xn).

If moreover all the Xi are identically distributed, then gi = g for some pdf g and all i and
thus

fX(x) = fX(x1 , . . . , xn ) = ∏_{i=1}^{n} g(xi) = g(x1) · · · g(xn).

A vector of independent and identically distributed (iid) random variables will appear very
often during this course.

Lemma 1.28. Let X and Y be two independent random variables. Then

• E(XY ) = E(X)E(Y ).

• Var(X + Y ) = Var(X) + Var(Y ).

1.2 Convergence
Let X = (X1 , . . . , Xn ) be a vector of iid random variables. Statistical arguments for
accuracy of approximations are often of the asymptotic type, that is, we assume that the
amount of data n goes to infinity and analyse what happens.

1.2.1 The law of large numbers


In this section we study the sample average

X̄n = (1/n) ∑_{i=1}^{n} Xi

and its asymptotic properties. As was mentioned in the section on moments, the expec-
tation E(X1 ) is a measure of the center of a distribution. The average of n numbers can
similarly be seen as a measure of the center of those n numbers. It thus might seem
natural to think that X̄n converges to E(X1 ) as n goes to infinity. However, what do we
mean by convergence? It is not immediately clear how to extend the results of conver-
gence in calculus to probability theory. The closest form to convergence as we know it in
a deterministic setting is the following.

Definition 1.29 (5.5.6). An infinite sequence of random variables X1 , X2 , . . . is said to
converge almost surely to X̃ if for all ε > 0 we have

P( lim_{n→∞} |Xn − X̃| < ε ) = 1.

Recall that every random variable X is a function from the sample space to the real
numbers, S → R. Almost sure convergence says that Xn(s) → X̃(s) for almost all s ∈ S.
We can now state one of the most important results in probability theory and statistics,
the strong law of large numbers (SLLN).

Theorem 1.30 (5.5.9). Suppose X1 , X2 , . . . is a sequence of iid random variables with
E|X1| < ∞. Then almost surely

lim_{n→∞} (1/n) ∑_{i=1}^{n} Xi = E(X1 ).

The law of large numbers is a great, and actually very intuitive result. Most people,
even those without a background in probability theory, answer “one half” to the question
“If I throw this coin one thousand times, what fraction of the throws will show heads?”.
This intuitive answer is a direct application of the SLLN. Let X1 , . . . , Xn ∼
Bernoulli(p), then

X̄n → E(X1 ) = p almost surely.

An unloaded coin has p = P({h}) = 1/2, thus the average indeed converges to one half.
The same quick calculations can be used to show that the average of dice throws will
indeed converge to 3.5 and many other results.
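The coin example is also a two-line simulation; the sketch below (our own) shows the running average of fair-coin flips settling down to 1/2.

```python
import numpy as np

rng = np.random.default_rng(6)
flips = rng.binomial(1, 0.5, size=100_000)  # X_i ~ Bernoulli(1/2)

# The running average should settle down to p = 1/2.
for n in (100, 10_000, 100_000):
    print(n, flips[:n].mean())
```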
Almost sure convergence is the convergence concept that is most similar to convergence
in a deterministic setting. In practice it is often hard to work with, because it is a very
strong concept. An easier version of convergence to work with is the following.

Definition 1.31 (5.5.1). An infinite sequence of random variables X1 , X2 , . . . is said to
converge in probability to X̃ if for all ε > 0 we have

lim_{n→∞} P( |Xn − X̃| < ε ) = 1.

Note that almost sure convergence and convergence in probability look very similar, but
that almost sure convergence is much stronger. By taking the limit out of the probability
measure we allow ourselves to talk about limits of probabilities instead of probabilities
of limits. See examples 5.5.7 and 5.5.8 in Casella and Berger for detailed examples that
highlight the differences between the two concepts.

1.2.2 The central limit theorem


The law of large numbers tells us that the sample average converges to the mean, but it
does not provide information about the whole distribution for large values of n. In Figure
1 we plot X̄n against n and show how it converges. The Xi are t distributed with three
degrees of freedom and it seems indeed that the average converges to zero. The red line is
given by the function 1/√n; we can see in the figure that the average seems to converge
at a rate very similar to this function. Therefore, if we scale X̄n by the inverse of this
rate, that is, consider √n X̄n, the scaled average no longer converges to zero as n → ∞.
To discuss the distributional behaviour of this scaled average we introduce the following
concept.

[Figure 1: the sample average X̄n of t₃ observations plotted against n, together with the
reference curve 1/√n in red.]

Definition 1.32 (5.5.10). A sequence of random variables X1 , X2 , . . . converges in
distribution to X̃ if lim_{n→∞} FXn(x) = FX̃(x) for all x ∈ R at which FX̃ is continuous.

We can now state another important result in probability theory and statistics, the
central limit theorem (CLT).

Theorem 1.33 (5.5.16). Let X1 , X2 , . . . be a sequence of iid random variables with E(X1 ) =
µ < ∞ and Var(X1 ) = σ² < ∞. Then

√n (X̄n − µ)/σ → Normal(0, 1) in distribution.

Note how remarkable this theorem is. Even though we impose close to no assumptions
(any distribution with a finite variance is allowed for the Xi), we always end up with the
normal distribution. The central limit theorem is the reason that the normal distribution
plays such a vital role in statistics. In Figure 2 we plot the histogram of 2000 realisations
of √n(X̄n − µ)/σ for a fixed large n. The red line is the Normal(0, 1) density. We see that
the shape of the histogram and the red line indeed look very much alike.

[Figure 2: histogram of 2000 realisations of √n(X̄n − µ)/σ, with the Normal(0, 1) density
in red.]
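Figure 2 can be reproduced along the following lines (our own sketch; we take n = 1000 and use that the t distribution with 3 degrees of freedom has mean 0 and variance 3).

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 1_000, 2_000

# t distribution with 3 degrees of freedom: mean 0, variance df/(df-2) = 3.
x = rng.standard_t(df=3, size=(reps, n))
z = np.sqrt(n) * x.mean(axis=1) / np.sqrt(3.0)

# If the CLT holds, z should look standard normal.
print(z.mean(), z.std())           # approx 0 and 1
print(np.mean(np.abs(z) <= 1.96))  # approx 0.95
```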

The final result of this chapter will be very useful later on. It is called the continuous
mapping theorem and allows us to transfer convergence results from one sequence of
random variables to another.

Theorem 1.34 (5.5.4). Let X1 , X2 , . . . be a sequence of random variables and let h be a
continuous real function. Then

• Xn → X̃ almost surely implies that h(Xn) → h(X̃) almost surely.

• Xn → X̃ in probability implies that h(Xn) → h(X̃) in probability.

• Xn → X̃ in distribution implies that h(Xn) → h(X̃) in distribution.
