ST202 Lecture Notes
Matteo Barigozzi
Michaelmas Term 2015-2016
This version October 7, 2015
Introductory Material
Course Aims and Objectives
The first part of the course aims to convey a thorough understanding of probability and
distribution theory. A range of methods, related to distributions of practical importance,
are taught. The course builds on material in ST102 Elementary Statistical Theory and
provides the foundation for further courses in statistics and actuarial science.
The following list gives you an idea of the sort of things you will be able to do by the
end of this term; it does not come close to covering everything. By the end of the first
part of the course you should:
- be able to work out probabilities associated with simple (and not so simple) experiments;
- know the distinction between a random variable and an instance of a random variable;
- for any given distribution, be able to select suitable methods and use them to work out moments;
- be familiar with a large number of distributions;
- understand relationships between variables, conditioning, independence and correlation;
- feel at ease with joint distributions and conditional distributions;
- know the law of large numbers and the central limit theorem and their implications;
- be able to put together all the theory and techniques you have learned to solve practical problems.
Office: COL7.11
Pre-requisites
Most of you will have done MA100 Mathematical Methods and ST102 Elementary Statistical Theory. If you have not taken both of these courses (or if the relevant memory has
been deleted) you should (re-)familiarize yourself with their content. Below is a far-from-exhaustive list of mathematical tools that we will be using.
Sets: union, intersection, complement.
Series: arithmetic, geometric, Taylor.
Differentiation: standard differentials, product rule, function of a function rule.
Integration: standard integrals, integration by parts.
Problem sets: collection boxes are per GROUP NUMBER and are located in the entrance hall
on the ground floor of Columbia House. After being marked, problem sets are handed
back in the seminar the following week, where we will go through some of the solutions. You are encouraged to solve and discuss the exercises with your colleagues
and cooperate in finding solutions. Marks will be registered on LSE for You so that
your academic advisors can keep track of your efforts. Failing to submit homework
may bar you from exams.
Help sessions: will be held on
Thursday 17:00-18:00 in room TW1.G.01 in weeks 1 to 5 and 7 to 10 (teachers
Mr. Cheng Li and Mr. Baojun Dou);
you are encouraged to attend them as they provide the opportunity to work on the
exercises with the one-to-one help of two assistant teachers.
Class tests: two tests will take place in class on
Thursday 17:00-18:00 in room TW1.G.01 in weeks 6 and 11;
you are strongly encouraged to participate in order to verify your degree of
preparation. As for homework, test marks will be registered on LSE for You in
order to allow advisors to monitor your attendance and progress.
Exam: the course is assessed by a three-hour written exam in the Summer term
which covers the material taught during both terms; previous years' exams with
solutions are available from the LSE library website. Homework and the mid-term tests do
not count towards the final exam mark, but the more effort you put into solving exercises
and studying during the year, the more likely you are to pass the exam.
A Guide to Content
The following is a guide to the content of the course rather than a definitive syllabus.
Throughout the course examples, with varying degrees of realism, will be used to illustrate
the theory. Here is an approximate list of the topics I plan to teach; however, the material
that goes into the exam will be determined by what is actually covered during the lectures
(I will provide you with an updated list of topics at the end of the term).
1. Events and their Probabilities:
sample space;
elementary set theory;
events;
probability;
counting;
conditional probability;
independence.
2. Random Variables and their Distributions:
random variables;
distributions;
discrete random variables and probability mass function;
continuous random variables and probability density function;
support and indicator functions;
expectations (mean and variance);
moments;
inequalities (Chebyshev, Markov, Jensen);
moment generating functions;
survival and hazard.
3. The Distribution Zoo:
discrete distributions;
degenerate;
Bernoulli;
binomial;
negative binomial;
geometric;
hypergeometric;
uniform;
Poisson and its approximation of binomial;
continuous distributions;
uniform;
normal;
gamma;
chi-squared;
exponential;
beta;
log-normal;
4. Multivariate Distributions:
joint and marginal distributions;
dependence;
joint moments;
inequalities (Holder, Cauchy-Schwarz, Minkowski);
conditional distributions.
5. Multivariate Applications:
sums of random variables;
mixtures and random sums;
random vectors;
multivariate normal distribution;
modes of convergence;
limit theorems for Bernoulli sums;
law of large numbers;
central limit theorem.
Books
There are a large number of books that cover at least part of the material in the course.
Finding a useful book is partly a question of personal taste. I suggest you look at what is
available in the library and find a text that covers the material in a way that you find appealing and intelligible. Reproduced below is the reading list along with some additional
texts that may be worth looking at.
Casella, G. and R. L. Berger. Statistical inference. [QA276 C33]
(Nearly all material covered in the course can be found in this book.)
Larson, H. J. Introduction to probability theory and statistical inference. [QA273.A5 L33]
Hogg, R. V. and A. T. Craig. Introduction to mathematical statistics. [QA276.A2 H71]
Freund, J. E. Mathematical statistics. [QA276 F88]
Hogg, R. V. and E. A. Tanis. Probability and statistical inference. [QA273 H71]
Meyer, P. L. Introductory probability and statistical applications. [QA273.A5 M61]
Mood, A. M., F. A. Graybill and D. C. Boes. Introduction to the theory of statistics.
[QA276.A2 M81]
Bartoszyński, R. and M. Niewiadomska-Bugaj. Probability and statistical inference. [QA273 B29]
Cox, D. R. and D. V. Hinkley. Theoretical statistics. [QA276.A2 C87]
(Not great to learn from but a good reference source.)
Stuart, A. and J. K. Ord. Kendall's advanced theory of statistics 1, Distribution theory.
[QA276.A2 K31]
(A bit arcane but covers just about everything.)
Practical information
I will post all material related to the course on Moodle. Lecture notes and lecture recordings will also be made available in due course. Each week I will upload the problem sets
on Thursday, and after you return them on Wednesday I will also upload solutions.
If you have questions related to the course you have different ways to ask me:
1. ask questions in class, feel free to interrupt me at any time;
2. come and see me in my office COL.7.11 in Columbia House during my office hours
on Thursday 13:30 - 15:00. Please make an appointment through LSE for You in
advance to avoid queues;
3. take advantage of the help sessions;
4. use the forum on Moodle, where all of you can post both questions and answers to
your colleagues; if necessary I will also post my answers there; this is a way to
stimulate discussion and will save me repeating the same answer many times;
5. if you think you need to speak with me personally but cannot come to my
office hours, send me an email and we will fix an appointment; please try to avoid
emails for questions that might be of interest to the whole class and use the Moodle
forum instead.
Probability Space

1.1 Sample space
Experiment: a procedure which can be repeated any number of times and has a well-defined set of possible outcomes.
Sample outcome: a potential eventuality of the experiment. The notation ω is used for an outcome.
Sample space: the set of all possible outcomes. The notation Ω is used for the sample space of an experiment. An outcome ω is a member of the sample space Ω, that is, ω ∈ Ω.
Example: a fair six-sided die is thrown once. The outcomes are numbers between 1 and
6, i.e. the sample space is given by Ω = {1, . . . , 6}.
Example: a fair six-sided die is thrown twice. The outcomes are pairs of numbers between 1 and 6. For example, (3, 5) denotes a 3 on the first throw and 5 on the second. The
sample space is given by Ω = {(i, j) : i = 1, . . . , 6, j = 1, . . . , 6}. In this example the
sample space is finite so can be written out in full:
(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)
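This equally likely sample space is easy to enumerate on a computer. A small Python sketch listing all 36 pairs and computing the probability of an illustrative event (the two throws summing to 7):

```python
from itertools import product

# The 36 equally likely outcomes of two throws of a six-sided die.
omega = list(product(range(1, 7), repeat=2))

# Illustrative event: the two throws sum to 7.
event = [w for w in omega if w[0] + w[1] == 7]
p = len(event) / len(omega)
print(p)  # 6/36 = 1/6
```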
Example: the measurement of people's height has the positive real numbers as sample
space, if we allow for infinite precision in the measurement.
Example: assume we have an experiment with a given sample space Ω; then the experiment corresponding to n replications of the underlying experiment has sample space Ωⁿ.
Notice that, in principle, we can repeat an experiment infinitely many times.
1.2 Elementary set theory

Complement: Aᶜ
Intersection: A ∩ B
Union: A ∪ B
Difference: A\B
Inclusion: A ⊆ B
Empty set: ∅ (the impossible event)
Whole space: Ω (the certain event)
De Morgan's law: (⋃ᵢ₌₁ⁿ Aᵢ)ᶜ = ⋂ᵢ₌₁ⁿ Aᵢᶜ.
1.3 Events

For any experiment, the events form a collection F of subsets of Ω with the following properties:
1. Ω ∈ F,
2. if A ∈ F then Aᶜ ∈ F,
3. if A₁, A₂, . . . ∈ F, then ⋃ᵢ₌₁^∞ Aᵢ ∈ F.
A collection F with these properties is called a σ-algebra.
1.4 Probability

In an experiment the intuitive definition of probability is the ratio of the number of
favorable outcomes to the total number of possible outcomes; with the above notation,
the probability of an event A ⊆ Ω, such that A ∈ F, is:

P(A) = (#elements in A) / (#elements in Ω).
Example: if we toss a fair coin the sample space is Ω = {H, T}; then the event A = {H}
has probability

P({H}) = (#elements in A) / (#elements in Ω) = 1/2.

Alternatively, we could compute this probability by tossing the coin n times, where n is
large, and counting the number of times we get head, say kₙ. If the coin is fair, we should
get

P({H}) = lim_{n→∞} kₙ/n = 1/2.
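The relative-frequency idea can be illustrated by simulation. A Python sketch (the seed is arbitrary, chosen only for reproducibility):

```python
import random

random.seed(42)  # arbitrary seed, for reproducibility only

n = 100_000
k_n = sum(random.random() < 0.5 for _ in range(n))  # number of "heads"
freq = k_n / n
print(freq)  # close to 1/2 for large n
```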
We here adopt a more mathematical definition of probability, based on the Kolmogorov
axioms.

Probability measure: a function P : F → [0, 1] such that
1. P(A) ≥ 0 for any A ∈ F,
2. P(Ω) = 1,
3. if A₁, A₂, . . . is an infinite collection of mutually exclusive members of F, then

P(⋃ᵢ₌₁^∞ Aᵢ) = Σᵢ₌₁^∞ P(Aᵢ).

This in turn implies that for any finite collection A₁, A₂, . . . , Aₙ of mutually exclusive
members of F

P(⋃ᵢ₌₁ⁿ Aᵢ) = Σᵢ₌₁ⁿ P(Aᵢ).
Properties of probability measures: for events A, B and A₁, . . . , Aₙ:
1. P(Aᶜ) = 1 − P(A).
2. P(A) ≤ 1.
3. P(∅) = 0.
4. P(B ∩ Aᶜ) = P(B) − P(A ∩ B).
5. P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
6. If A ⊆ B then P(B) = P(A) + P(B\A) ≥ P(A).
7. More generally, if A₁, . . . , Aₙ are events then (inclusion-exclusion)

P(⋃ᵢ₌₁ⁿ Aᵢ) = Σᵢ P(Aᵢ) − Σ_{i<j} P(Aᵢ ∩ Aⱼ) + Σ_{i<j<k} P(Aᵢ ∩ Aⱼ ∩ Aₖ) + . . . + (−1)ⁿ⁺¹ P(A₁ ∩ . . . ∩ Aₙ).

8. If A₁, . . . , Aₙ partition Ω, then P(B) = Σᵢ₌₁ⁿ P(B ∩ Aᵢ) (law of total probability).
9. P(⋃ᵢ₌₁ⁿ Aᵢ) ≤ Σᵢ₌₁ⁿ P(Aᵢ) (Boole's inequality).
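Inclusion-exclusion is easy to verify numerically for concrete sets. The events below are illustrative choices under the uniform measure on Ω = {1, . . . , 12}:

```python
from itertools import combinations

omega = set(range(1, 13))                 # uniform measure on {1,...,12}
A = [{x for x in omega if x % 2 == 0},    # illustrative events:
     {x for x in omega if x % 3 == 0},    # multiples of 2, multiples of 3,
     {x for x in omega if x >= 9}]        # and numbers >= 9
P = lambda S: len(S) / len(omega)

lhs = P(A[0] | A[1] | A[2])
rhs = (sum(P(a) for a in A)
       - sum(P(a & b) for a, b in combinations(A, 2))
       + P(A[0] & A[1] & A[2]))
assert abs(lhs - rhs) < 1e-12
print(lhs)  # 0.75
```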
For equally likely outcomes, probabilities therefore reduce to counting:

P(A) = |A| / |Ω| = (#elements in A) / (#elements in Ω),

where |A| denotes the number of elements (cardinality) of A.
Consider the following problem: k balls are distributed among n distinguishable boxes
in such a manner that all configurations are equally likely, or analogously (from the modelling point of view) we extract k balls out of n. We need to define the sample space Ω
and its cardinality |Ω|, i.e. the number of its elements. The balls can be distinguishable or
indistinguishable, which is analogous to saying that the order of extraction matters or
not. Moreover, the extraction can be with or without replacement, i.e. the choice of a ball
is or is not independent of the balls previously chosen. In terms of balls and boxes this
means that we can put as many balls as we want in each box (with replacement) or only
one ball can fit in each box (without replacement).
There are four possible cases (three of which are named after famous physicists).
Ordered (distinct), without replacement (dependent): in this case we must have k ≤ n
and the sample space is

Ω = {(ω₁, . . . , ω_k) : 1 ≤ ωᵢ ≤ n ∀i, ωᵢ ≠ ωⱼ for i ≠ j},

where ωᵢ is the box where ball i is located. All the possible permutations of k balls that
can be formed from n distinct elements, i.e. not allowing for repetition, are

|Ω| = n · (n − 1) · (n − 2) · . . . · (n − k + 1) = n!/(n − k)!.
Ordered (distinct), with replacement (independent) - Maxwell-Boltzmann: the sample space is Ω = {(ω₁, . . . , ω_k) : 1 ≤ ωᵢ ≤ n ∀i} and, since each of the k balls can go in any of the n boxes, |Ω| = nᵏ.

Unordered (not distinct), without replacement (dependent) - Fermi-Dirac: the sample space is

Ω = {(ω₁, . . . , ωₙ) : ωᵢ ∈ {0, 1} ∀i and Σᵢ₌₁ⁿ ωᵢ = k},

with box i occupied if and only if ωᵢ = 1. Starting from the case of distinct balls, we have
to divide out the redundant outcomes and we obtain the total number of outcomes:

|Ω| = (n · (n − 1) · (n − 2) · . . . · (n − k + 1)) / (1 · 2 · . . . · k) = n!/(k!(n − k)!) = (n choose k).
Unordered (not distinct), with replacement (independent) - Bose-Einstein: the sample
space is

Ω = {(ω₁, . . . , ωₙ) : 0 ≤ ωᵢ ≤ k ∀i and Σᵢ₌₁ⁿ ωᵢ = k},

with ωᵢ the number of balls in box i. This is the most difficult case to count. The easiest
way is to think in terms of k balls and n boxes. We can put as many balls as we want in
each box and balls are identical. To find all the possible outcomes it is enough to keep
track of the balls and of the walls separating the boxes. Excluding the 2 external walls,
we have n + 1 − 2 = n − 1 walls and k balls, hence we have n − 1 + k objects that can
be arranged in (n − 1 + k)! ways. However, since the balls and the walls are identical we
need to divide out the redundant orderings, which are k!(n − 1)!, so

|Ω| = (n − 1 + k)!/(k!(n − 1)!) = (n − 1 + k choose k).
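All four counting formulas can be cross-checked against brute-force enumeration with Python's itertools (n = 5 and k = 3 are arbitrary small values):

```python
from itertools import (product, permutations, combinations,
                       combinations_with_replacement)
from math import comb, perm

n, k = 5, 3
# Ordered, without replacement: n!/(n-k)!
assert len(list(permutations(range(n), k))) == perm(n, k) == 60
# Ordered, with replacement (Maxwell-Boltzmann): n^k
assert len(list(product(range(n), repeat=k))) == n ** k == 125
# Unordered, without replacement (Fermi-Dirac): n!/(k!(n-k)!)
assert len(list(combinations(range(n), k))) == comb(n, k) == 10
# Unordered, with replacement (Bose-Einstein): (n-1+k)!/(k!(n-1)!)
assert len(list(combinations_with_replacement(range(n), k))) == comb(n - 1 + k, k) == 35
```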
Example: in a lottery 5 numbers are extracted without replacement out of {1, . . . , 90}.
What is the probability of extracting the exact sequence of numbers (1, 2, 3, 4, 5)?
The possible outcomes of this lottery are all the 5-tuples ω = (ω₁, . . . , ω₅) such that
ωᵢ ∈ {1, . . . , 90}. We can extract the first number in 90 ways, the second in 89 ways and
so on, so

|Ω| = 90 · 89 · 88 · 87 · 86 = 90!/85!.

Since all the outcomes are equally likely, the probability we are looking for is 85!/90! ≈ 1/(5 · 10⁹).

Example: if in the previous example the order of extraction does not matter, i.e. we look
for the probability of extracting the first 5 numbers independently of their ordering, then
Ω contains all the combinations of 5 numbers extracted from 90 numbers:

|Ω| = (90 choose 5).

Since all the outcomes are equally likely, the probability we are looking for is 1/(90 choose 5) ≈ 1/(4 · 10⁷), so as expected it is greater than before, although still very small!
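The two lottery counts can be checked directly:

```python
from math import comb, factorial

ordered = factorial(90) // factorial(85)     # 90!/85! ordered extractions
unordered = comb(90, 5)                      # (90 choose 5) combinations

assert ordered == 90 * 89 * 88 * 87 * 86
assert ordered == unordered * factorial(5)   # each combination has 5! orderings
print(1 / ordered, 1 / unordered)            # about 1/(5*10^9) and 1/(4*10^7)
```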
Example: what is the probability that, out of n people randomly chosen, at least two
were born on the same day of the year? We can define a generic outcome of the sample space
as ω = (ω₁, . . . , ωₙ) such that ωᵢ ∈ {1, . . . , 365}. Each of the n birth dates can take 365 values,
so

|Ω| = 365 · 365 · . . . · 365 (n times) = 365ⁿ.

Now we have to compute the number of elements contained in the event A = {ω ∈ Ω :
ω has at least two identical entries}. It is easier to compute the number of elements of the
complement set Aᶜ = {ω ∈ Ω : ω has all entries distinct}. Indeed Aᶜ is made of all
n-tuples of numbers that are extracted out of 365 numbers without replacement, so the
first entry can be selected in 365 ways, the second in 364 ways and so on; then

|Aᶜ| = 365!/(365 − n)!.

If we assume that the outcomes of Ω are all equally likely (which is not completely correct,
as we know that birth rates are not equally distributed throughout the year), then

P(A) = 1 − 365!/(365ⁿ (365 − n)!).
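The formula is easily evaluated; n = 23 is the classic threshold at which the probability first exceeds 1/2:

```python
from math import prod

def p_shared_birthday(n):
    # P(at least two of n people share a birthday) = 1 - 365!/(365^n (365-n)!)
    return 1 - prod((365 - i) / 365 for i in range(n))

print(p_shared_birthday(23))  # just above 1/2
```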
Reading
Casella and Berger, Sections 1.1, 1.2.
Conditional probability

Let A and B be events with P(B) > 0. The conditional probability of A given B is the
probability that A will occur given that B has occurred:

P(A|B) = P(A ∩ B)/P(B).

In particular, if A ⊆ B then A ∩ B = A and P(A|B) = P(A)/P(B).
Bayes' rule: let A₁, . . . , Aₙ be a partition of Ω with P(Aᵢ) > 0 for each i, and let B be an event with P(B) > 0; then

P(Aⱼ|B) = P(B|Aⱼ)P(Aⱼ)/P(B) = P(B|Aⱼ)P(Aⱼ) / Σᵢ₌₁ⁿ P(B|Aᵢ)P(Aᵢ).

If the Aⱼ do not cover Ω we can complete them by defining A₀ = (⋃ⱼ₌₁ⁿ Aⱼ)ᶜ.
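Bayes' rule, with the law of total probability in the denominator, can be illustrated numerically; the diagnostic-test numbers below are hypothetical, chosen only for illustration:

```python
# Hypothetical numbers: a condition affecting 1% of a population,
# a test with P(+|ill) = 0.99 and P(+|healthy) = 0.05.
priors = {"ill": 0.01, "healthy": 0.99}
likelihood = {"ill": 0.99, "healthy": 0.05}

# Denominator via the law of total probability, then Bayes' rule.
p_pos = sum(likelihood[a] * priors[a] for a in priors)
posterior_ill = likelihood["ill"] * priors["ill"] / p_pos
print(round(posterior_ill, 3))  # 0.167
```

Despite the accurate test, the posterior probability of being ill given a positive result is only 1/6, because the condition is rare.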
Independence

Events A and B are independent if P(A ∩ B) = P(A)P(B). Equivalently, if P(A) > 0,

P(B|A) = P(A ∩ B)/P(A) = P(A)P(B)/P(A) = P(B),

so knowing that A has occurred does not change the probability of B. A collection A₁, A₂, . . . is mutually independent if P(Aⱼ₁ ∩ . . . ∩ Aⱼₖ) = P(Aⱼ₁) · . . . · P(Aⱼₖ) for every finite subcollection.

Example: toss a fair coin three times and let Hᵢ be the event of a head on toss i. Then

P(H₁ ∩ H₂ ∩ H₃) = 1/8 = (1/2)(1/2)(1/2) = P(H₁)P(H₂)P(H₃),

P(H₁ ∩ H₃) = 2/8 = (1/2)(1/2) = P(H₁)P(H₃).
Example: consider tossing a tetrahedron (i.e. a die with just four faces) with a red, a green,
a blue face, and a face with all three colours. Each face has equal probability 1/4 of being
selected.¹ We want to see if the events: red (R), green (G), blue (B) are independent. The
probability of selecting any colour is P(R) = P(G) = P(B) = 1/2, since each colour
appears on two of the four faces. Consider the conditional probability

P(R|G) = P(R ∩ G)/P(G) = (1/4)/(1/2) = 1/2 = P(R),

so the event R is independent of the event G; by repeating the same reasoning with all
couples of colours we see that the colours are pairwise independent. However, we do not
have mutual independence; indeed, for example,

P(R|G ∩ B) = P(R ∩ G ∩ B)/P(G ∩ B) = (1/4)/(1/4) = 1 ≠ P(R) = 1/2.

¹ Due to its geometry, in this case the selected face is the bottom one once the tetrahedron is tossed.
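The tetrahedron example can be verified by enumerating the four faces:

```python
from itertools import combinations

# The four equally likely faces, described by the colours each shows.
faces = [{"R"}, {"G"}, {"B"}, {"R", "G", "B"}]
P = lambda *cols: sum(all(c in f for c in cols) for f in faces) / len(faces)

for c1, c2 in combinations("RGB", 2):
    assert P(c1, c2) == P(c1) * P(c2) == 0.25   # pairwise independent
assert P("R", "G", "B") == 0.25                 # but 1/4 != (1/2)^3 = 1/8
assert P("R") * P("G") * P("B") == 0.125
```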
Example. Consider the following game: your ST202 lecturer shows you three cups and
tells you that under one of these there is a squashball while under the other two there is
nothing. The aim of the Monty Squashball problem² is to win the squashball by picking
the right cup. Assume you choose one of the three cups, without lifting it. At this point
one of the remaining cups for sure does not contain the ball, and your lecturer lifts
it, showing emptiness (selecting one at random if there is a choice). With two cups still
candidates to hide the squashball, you are given a second chance of choosing a cup: will
you stick to the original choice or will you switch to the other cup?
We can model and solve the problem by using conditional probability and Bayes' rule.
The probability of getting the ball is identical for any cup, so

P(ball is in k) = 1/3,  k = 1, 2, 3.

Once you choose a cup (say i), your ST202 lecturer can lift only a cup with no ball that you
have not chosen; he will lift cup j (different from i and k) with probability

P(ST202 lecturer lifts j | you choose i and ball is in k) = 1/2 if i = k; 1 if i ≠ k.
Let us call the cup you pick number 1 (we can always relabel the cups). Using Bayes'
rule we compute, for j ≠ 1, the denominator

P(lecturer lifts j) = Σₖ₌₁³ P(lifts j | ball is in k)P(ball is in k) = (1/2)(1/3) + 0 · (1/3) + 1 · (1/3) = 1/2,

and hence

P(ball is in 1 | lecturer lifts j) = ((1/2)(1/3))/(1/2) = 1/3,

while for the remaining cup k (k ≠ 1, k ≠ j)

P(ball is in k | lecturer lifts j) = (1 · (1/3))/(1/2) = 2/3,

so you should switch.
² This is an eco-friendly version of the famous Monty Hall problem, which has doors for cups, goats
for nothing and a car for the squashball; no animals are harmed in the Monty Squashball problem. It
is also closely related to Bertrand's box paradox and the Prisoners' paradox (not to be confused with the
Prisoners' dilemma).
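A simulation confirms the 1/3 versus 2/3 answer. A Python sketch of the game (seeded only for reproducibility):

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility only

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        ball = random.randrange(3)   # cup hiding the squashball
        pick = random.randrange(3)   # your first choice
        # The lecturer lifts an empty cup other than your pick,
        # choosing at random when there is a choice (pick == ball).
        lifted = random.choice([c for c in range(3)
                                if c != pick and c != ball])
        if switch:
            pick = next(c for c in range(3) if c != pick and c != lifted)
        wins += (pick == ball)
    return wins / trials

p_stick = play(switch=False)
p_switch = play(switch=True)
print(p_stick, p_switch)  # close to 1/3 and 2/3
```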
Reading
Casella and Berger, Section 1.3.
Random variables

We use random variables to summarize in a more convenient way the structure of experiments.

Borel σ-algebra: the σ-algebra B(R) on Ω = R generated by (i.e. the smallest σ-algebra
containing) the intervals (a, b], where we allow for a = −∞ and b = +∞.
We could equally have taken intervals [a, b] (think about this for a while!).

Random variable: a real-valued function defined on the sample space,

X : Ω → R,

with the property that, for every B ∈ B(R), X⁻¹(B) ∈ F.
Define, for all x ∈ R, the set of outcomes

Aₓ = {ω ∈ Ω : X(ω) ≤ x};

then Aₓ ∈ F. Thus Aₓ is an event, for every real x.
The function X defines a new sample space (its range) and creates a correspondence
between events in the probability space (Ω, F, P) and events in the probability space
(R, B(R), P_X), which allows for easier mathematical computations. We need to define
the probability measure P_X on the Borel σ-algebra.
Example: consider the experiment of tossing a coin n times; the sample space is made of
all the n-tuples ω = (ω₁, . . . , ωₙ) such that ωᵢ = 1 if we get head and ωᵢ = 0 if we get
tail. An example of a random variable is the function "number of heads in n tosses", which
we can define as

X(ω) = Σᵢ₌₁ⁿ ωᵢ.

Consider the case in which we get head m times, with m < n. Then, for every number m
we can define the event Aₘ = {ω = (ω₁, . . . , ωₙ) : X(ω) = Σᵢ₌₁ⁿ ωᵢ = m}.
Notice that in this example the random variable takes only integer values, which are a
subset of the real line. Notice also that the original sample space is made of 2ⁿ elements,
while the new sample space is made of the integer numbers {0, 1, . . . , n}, which is a smaller
space.
Example: consider the random walk, i.e. a sequence of n steps ω = (ω₁, . . . , ωₙ) such
that the i-th step can be to the left or to the right. We can represent the i-th step by the
random variable Xᵢ(ω) = ±1, taking the value 1 if the step is to the left and −1 if the
step is to the right. We can also introduce the random variable that represents the position
of the random walk after k steps: Yₖ(ω) = Σᵢ₌₁ᵏ Xᵢ(ω).
Distributions

We must check that the probability measure P defined on the original sample space is
still valid as a probability measure defined on R. If the sample space is Ω = {ω₁, . . . , ωₙ}
and the range of X is {x₁, . . . , xₘ}, we say that we observe X = xᵢ if and only if the
outcome of the experiment is ωⱼ such that X(ωⱼ) = xᵢ.

Induced probability: we have two cases.
1. Finite or countable sample spaces: given a random variable X, the associated probability measure P_X is such that, for any xᵢ ∈ R,

P_X(X = xᵢ) = P({ωⱼ : X(ωⱼ) = xᵢ}).

2. Uncountable sample spaces: given a random variable X, the associated probability
measure P_X is such that, for any B ∈ B(R),

P_X(X ∈ B) = P({ω ∈ Ω : X(ω) ∈ B}).

Hereafter, given the above equivalences, we denote P_X simply as P.
Cumulative distribution function (cdf): given a random variable X, it is the function

F : R → [0, 1]  s.t.  F(x) = P(X ≤ x), ∀x ∈ R.

Properties of cdfs: F is a cdf if and only if
1. Limits: lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.
2. Non-decreasing: if x < y then F(x) ≤ F(y).
3. Right-continuous: lim_{h→0⁺} F(x + h) = F(x).
Probabilities from distribution functions:
1. P(X > x) = 1 − F(x);
2. P(x < X ≤ y) = F(y) − F(x);
3. P(X < x) = lim_{h→0⁺} F(x − h) = F(x⁻);
4. P(X = x) = F(x) − F(x⁻).
Identically distributed random variables: the random variables X and Y are identically
distributed if, for any set A ∈ B(R), P(X ∈ A) = P(Y ∈ A). This is equivalent to
saying that F_X(x) = F_Y(x) for every x ∈ R.

Example: in the random walk the step random variable Xᵢ is distributed as

P(Xᵢ = −1) = 1/2,  P(Xᵢ = 1) = 1/2,

while

F_X(−1) = 1/2,  F_X(1) = 1.

The random variables Xᵢ are identically distributed. Moreover, they are also independent,
so

P(ω) = P(X₁ = ω₁, . . . , Xₙ = ωₙ) = Πᵢ₌₁ⁿ P(Xᵢ = ωᵢ),

for any choice of ω₁, . . . , ωₙ ∈ {−1, 1}. Therefore, all n-tuples are equally probable with
probability

P(ω) = P(X₁ = ω₁, . . . , Xₙ = ωₙ) = Πᵢ₌₁ⁿ (1/2) = 1/2ⁿ.

Consider the random variable Z that counts the steps to the right; then the probability of
having k steps to the right and n − k steps to the left is

P(Z = k) = f_Z(k) = (# of ways of extracting k 1s out of n) × (prob. of a generic ω) = (n choose k) · (1/2ⁿ).

We say that Xᵢ follows a Bernoulli distribution and Z follows a Binomial distribution.
The previous example of a fair coin can be modeled in exactly the same way, but this time
by defining Xᵢ(ω) = 0 or 1.
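The pmf of Z can be computed directly from the formula (n = 10 is an arbitrary choice):

```python
from math import comb

n = 10   # arbitrary number of steps
# P(Z = k) = (n choose k) / 2^n for k = 0, ..., n
pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]

assert abs(sum(pmf) - 1) < 1e-12   # the pmf sums to one
print(pmf[5])  # 252/1024 = 0.24609375, the most likely value
```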
A random variable X is discrete if it only takes values in some countable subset {x₁, x₂, . . .}
of R; then F(x) is a step function of x, but still right-continuous.
Probability mass function (pmf): given a discrete random variable X, it is the function

f : R → [0, 1]  s.t.  f(x) = P(X = x), ∀x ∈ R.

Properties of pmfs:
1. f(x) = F(x) − F(x⁻);
2. F(x) = Σ_{i : xᵢ ≤ x} f(xᵢ);
3. Σᵢ f(xᵢ) = 1;
4. f(x) = 0 if x ∉ {x₁, x₂, . . .}.
Probability density function (pdf): a random variable X is continuous if its cdf can be written as

F(x) = ∫_{−∞}^x f(u)du, ∀x ∈ R,

for some integrable function f : R → [0, ∞), called the probability density function of X.
If a function h satisfies h(x) ≥ 0 for x ∈ A, h(x) = 0 elsewhere, and ∫_A h(x)dx = K
for some constant K > 0, then h(x)/K is a pdf of a random variable with values in A.

Unified notation: given a random variable X,

P(a < X ≤ b) = ∫_a^b dF(x) = Σ_{i : a < xᵢ ≤ b} f(xᵢ) if X discrete; ∫_a^b f(u)du if X continuous.
Reading
Casella and Berger, Sections 1.4 - 1.5 - 1.6.
Expectations

Mean: the mean (or expected value) of a random variable X is

μ = E[X] = ∫ x dF(x) = Σᵢ xᵢ f(xᵢ) if X discrete; ∫_{−∞}^{+∞} x f(x)dx if X continuous,

where f is either the pmf or the pdf. The definition holds provided that ∫_{−∞}^{+∞} |x| dF(x) < ∞.

Variance: the variance of X is

σ² = Var[X] = E[(X − μ)²] = Σᵢ (xᵢ − μ)² f(xᵢ) if X discrete; ∫_{−∞}^{+∞} (x − μ)² f(x)dx if X continuous,

where f is either the pmf or the pdf. The standard deviation is defined as σ = √Var[X].
The definition holds provided that ∫_{−∞}^{+∞} (x − μ)² dF(x) < ∞.

Expectations: for an integrable function g : R → R such that ∫ |g(x)| dF(x) < ∞,
the expectation of the random variable g(X) is

E[g(X)] = ∫ g(x) dF(x) = Σᵢ g(xᵢ) f(xᵢ) if X discrete; ∫_{−∞}^{+∞} g(x) f(x)dx if X continuous.

Note that we have cheated a bit here, since we need to show in fact that g(X) is a random
variable and also that the given expression corresponds to the one given above for the
random variable g(X). This can be done but is beyond the scope of ST202. Feel free to
ask me if you would like to hear more about this.

Properties of expectations: for any constant a, integrable functions g₁ and g₂, and random variables X and Y:
1. E[a] = a;
2. E[a g₁(X)] = a E[g₁(X)];
3. E[g₁(X) + g₂(Y)] = E[g₁(X)] + E[g₂(Y)];
4. if g₁(x) ≤ g₂(x) for all x, then E[g₁(X)] ≤ E[g₂(X)].
10 Moments

Moments are expectations of powers of a random variable. They characterise the distribution of a random variable. Said differently (and somewhat informally), the more moments
of X we can compute, the more precise is our knowledge of the distribution of X.
Moment: given a random variable X, for r a positive integer the r-th moment, μ′ᵣ, of
X is

μ′ᵣ = E[Xʳ] = ∫ xʳ dF(x) = Σᵢ xᵢʳ f(xᵢ) if X discrete; ∫_{−∞}^{+∞} xʳ f(x)dx if X continuous,

where f is either the pmf or the pdf. The definition holds provided that ∫_{−∞}^{+∞} |x|ʳ dF(x) < ∞.
Central moment: given a random variable X, the r-th central moment, mᵣ, is

mᵣ = E[(X − μ′₁)ʳ].

The definition holds provided that ∫_{−∞}^{+∞} |x|ʳ dF(x) < ∞, so if the r-th moment
exists then the r-th central moment also exists.

Properties of moments:
1. mean: μ′₁ = E[X] = μ and m₁ = 0;
2. variance: m₂ = E[(X − μ′₁)²] = Var[X] = σ².
11 Inequalities

Markov's inequality: for a non-negative function g and any a > 0,

P(g(X) ≥ a) ≤ E[g(X)]/a.
12 Generating functions

These are functions that help to compute moments of a distribution and are also useful
to characterise the distribution. However, it can be shown that the moments do not characterise the distribution uniquely (if you would like to know more about this, check the
log-normal distribution).

Moment generating function (mgf): given a random variable X, it is the function

M : R → [0, ∞)  s.t.  M(t) = E[e^{tX}],

where it is assumed that M(t) < ∞ for |t| < h and some h > 0, i.e. the expectation exists in
a neighborhood of 0. Therefore,

M(t) = ∫ e^{tx} dF(x) = Σᵢ e^{txᵢ} f(xᵢ) if X discrete; ∫_{−∞}^{+∞} e^{tx} f(x)dx if X continuous.
Properties of mgfs:
1. Taylor expansion:

M(t) = 1 + tE[X] + (t²/2!)E[X²] + . . . + (tʳ/r!)E[Xʳ] + . . . = Σⱼ₌₀^∞ (E[Xʲ]/j!) tʲ;

2. the r-th moment is the coefficient of tʳ/r! in the Taylor expansion;
3. derivatives at zero:

μ′ᵣ = E[Xʳ] = M⁽ʳ⁾(0) = (dʳ/dtʳ) M(t) |_{t=0}.

Indeed M′(t) = E[X e^{tX}], and in general

(dʳ/dtʳ) M(t) = E[Xʳ e^{tX}];

imposing t = 0 we get the desired result.
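The derivatives-at-zero property can be checked numerically, here with the mgf of a fair six-sided die and finite-difference approximations of the first two derivatives:

```python
import math

# mgf of a fair six-sided die: M(t) = (1/6) * sum_{k=1}^{6} e^{t k}
M = lambda t: sum(math.exp(t * k) for k in range(1, 7)) / 6

h = 1e-4
d1 = (M(h) - M(-h)) / (2 * h)              # numerical M'(0)
d2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2    # numerical M''(0)

print(d1, d2)  # approximately E[X] = 3.5 and E[X^2] = 91/6
```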
Uniqueness: let F_X and F_Y be two cdfs with all moments defined; then:
1. if X and Y have bounded support, then F_X(x) = F_Y(x) for all x ∈ R if and only
if E[Xʳ] = E[Yʳ] for all r ∈ N;
2. if the mgfs exist and M_X(t) = M_Y(t) for all |t| < h and some h > 0, then
F_X(x) = F_Y(x) for all x ∈ R.

Cumulant generating function (cgf): given a random variable X with moment generating function M(t), it is defined as

K(t) = log M(t).

Cumulant: the r-th cumulant, cᵣ, is the coefficient of tʳ/r! in the Taylor expansion of the
cumulant generating function K(t):

cᵣ = K⁽ʳ⁾(0) = (dʳ/dtʳ) K(t) |_{t=0}.

Properties of cgfs: the first two cumulants are the mean and the variance, c₁ = E[X] and c₂ = Var[X].
Reading
Casella and Berger, Sections 2.2 - 2.3.
13 Discrete distributions

Degenerate: a random variable taking a single value a with certainty.

f(x) = 1 for x = a.
M(t) = e^{at}, K(t) = at.
μ = a, σ² = 0.

Bernoulli: trials with two, and only two, possible outcomes, here labeled X = 0 (failure)
and X = 1 (success).

f(x) = pˣ(1 − p)¹⁻ˣ for x = 0, 1.
M(t) = 1 − p + pe^t.
μ = p, σ² = p(1 − p).

Binomial: counts the number of successes in n independent Bernoulli trials with success probability p. Notation Bin(n, p).

f(x) = (n!/(x!(n − x)!)) pˣ(1 − p)ⁿ⁻ˣ = (n choose x) pˣ(1 − p)ⁿ⁻ˣ for x = 0, 1, . . . , n.
M(t) = (1 − p + pe^t)ⁿ.
μ = np, σ² = np(1 − p).
The binomial random variable arises as the sum of n independent Bernoulli random variables, X = Σᵢ₌₁ⁿ Xᵢ, and its mgf follows from the binomial theorem:

M(t) = Σₓ₌₀ⁿ e^{tx} (n choose x) pˣ(1 − p)ⁿ⁻ˣ = Σₓ₌₀ⁿ (n choose x) (pe^t)ˣ(1 − p)ⁿ⁻ˣ = (1 − p + pe^t)ⁿ.
Negative binomial: counts the number of Bernoulli trials needed to obtain r successes.

f(x) = (x − 1 choose r − 1) pʳ(1 − p)ˣ⁻ʳ for x = r, r + 1, . . ..
M(t) = (pe^t)ʳ/(1 − (1 − p)e^t)ʳ, K(t) = −r log{(1 − p⁻¹) + p⁻¹e⁻ᵗ}.
μ = r/p, σ² = r(1 − p)/p².

It can also be defined in terms of the number of failures before the r-th success.
Geometric: counts the number of Bernoulli trials until the first success occurs. Equivalent to a negative binomial with r = 1.

f(x) = (1 − p)ˣ⁻¹p for x = 1, 2, . . ..
M(t) = pe^t/(1 − (1 − p)e^t), K(t) = −log{(1 − p⁻¹) + p⁻¹e⁻ᵗ}.
μ = 1/p, σ² = (1 − p)/p².

Memoryless property: given that we have observed t failures, we observe an additional
s − t failures with the same probability as we would observe s − t failures at the beginning
of the experiment. The only thing that counts is the length of the sequence of failures, not
its position.
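The memoryless property follows from the tail formula P(X > k) = (1 − p)ᵏ (k initial failures) and is easy to verify; the values of p, s, t below are arbitrary:

```python
p = 0.3                          # arbitrary success probability
tail = lambda k: (1 - p) ** k    # P(X > k): k initial failures

s, t = 9, 4
cond = tail(s) / tail(t)         # P(X > s | X > t)
assert abs(cond - tail(s - t)) < 1e-12   # memorylessness
```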
Hypergeometric: usually explained with the example of the urn model. Assume an urn
containing a total of N balls, made up of N₁ balls of type 1 and N₂ = N − N₁ balls of
type 2; we count the number of type 1 balls obtained when selecting n < N balls without
replacement from the urn.

f(x) = (N₁ choose x)(N − N₁ choose n − x)/(N choose n) for x ∈ {0, . . . , n} ∩ {n − (N − N₁), . . . , N₁}.
μ = n N₁/N, σ² = n (N₁/N)((N − N₁)/N)((N − n)/(N − 1)).

Uniform (discrete): each of the values 1, . . . , N is equally likely.

f(x) = 1/N for x = 1, 2, . . . , N.
μ = (N + 1)/2, σ² = (N² − 1)/12.
Poisson: counts the number of events which occur in an interval of time. The assumption
is that for small time intervals the probability of an occurrence is proportional to the length
of the interval. We consider the random variable X which counts the number of occurrences
of a given event in a given unit time interval; it depends on a parameter λ which is the
intensity of the process considered. Notation Pois(λ).

f(x) = λˣe^{−λ}/x! for x = 0, 1, . . ..
M(t) = e^{λ(e^t − 1)}.
μ = λ, σ² = λ.

The intensity λ is the average number of occurrences in a given unit time interval. Notice
that the Poisson distribution can also be used for the number of events in other specified
intervals such as distance, area or volume.

Example: think of crossing a busy street with an average of 300 cars per hour passing. In
order to cross we need to know the probability that in the next minute no car passes. In
a given minute we have an average of λ = 300/60 = 5 cars passing through. If X is the
number of cars passing in one minute we have

P(X = 0) = e⁻⁵5⁰/0! = 6.7379 × 10⁻³;

maybe it is better to cross the street somewhere else. Notice that λ has to be the intensity
per unit of time. If we are interested in no cars passing in one hour then λ = 300 and
clearly the probability would be even smaller. If we want to know the average number of
cars passing in 5 minutes, then just define a new random variable which counts the
cars passing in 5 minutes, which is distributed as Poisson with λ = 300/12 = 25, and this
is also the expected value.
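The street-crossing computation in Python, with the values from the example above:

```python
from math import exp, factorial

# Poisson pmf with intensity lam
pois = lambda x, lam: lam ** x * exp(-lam) / factorial(x)

lam = 300 / 60                 # 5 cars per minute, from the example
p_no_car = pois(0, lam)
print(p_no_car)  # about 6.7379e-3
```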
The Poisson approximation: if X ~ Bin(n, p) and Y ~ Pois(λ) with λ = np, then for
large n and small p we have P(X = x) ≈ P(Y = x). More rigorously, we have to prove
that for fixed λ = np

lim_{n→∞} F_X(x; n, p) = F_Y(x; λ).

Proof: we use the following result: given a sequence of real numbers such that aₙ → a for
n → ∞, then

lim_{n→∞} (1 + aₙ/n)ⁿ = eᵃ.

Now

lim_{n→∞} M_X(t; n, p) = lim_{n→∞} (1 − p + pe^t)ⁿ = lim_{n→∞} (1 + (1/n)(e^t − 1)np)ⁿ
= lim_{n→∞} (1 + (1/n)λ(e^t − 1))ⁿ = e^{λ(e^t − 1)},

which is the mgf of Pois(λ).
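The quality of the approximation can be checked pointwise; n and p below are arbitrary values chosen so that λ = np = 5:

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def pois_pmf(x, lam):
    return lam ** x * exp(-lam) / factorial(x)

n, p = 10_000, 0.0005        # large n, small p
lam = n * p                  # = 5, held fixed
max_gap = max(abs(binom_pmf(x, n, p) - pois_pmf(x, lam))
              for x in range(30))
print(max_gap)  # very small
```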
14 Continuous distributions

Uniform: a random number chosen from a given closed interval [a, b]. Notation U(a, b).

f(x) = 1/(b − a) for a ≤ x ≤ b.
M(t) = (e^{tb} − e^{ta})/(t(b − a)) for t ≠ 0, and M(0) = 1.
μ = (a + b)/2, σ² = (b − a)²/12.
Normal: notation N(μ, σ²); if Z ~ N(0, 1), then X = μ + σZ is N(μ, σ²).

f(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}.
M(t) = e^{μt + σ²t²/2}.
E[X] = μ, Var[X] = Var[μ + σZ] = σ².

skewness = m₃/σ³, kurtosis = m₄/σ⁴.

The skewness coefficient measures asymmetry, and indeed is zero for the normal; the
kurtosis coefficient measures the flatness of the tails. Usually we are interested in the
coefficient of excess kurtosis (with respect to the normal case), i.e. m₄/σ⁴ − 3.
Computing moments and mgf of the standard normal distribution: the mgf is computed as

M(t) = ∫_{−∞}^{+∞} (1/√(2π)) e^{−z²/2 + tz} dz
     = ∫_{−∞}^{+∞} (1/√(2π)) e^{−(z² − 2tz + t²)/2} e^{t²/2} dz
     = e^{t²/2} ∫_{−∞}^{+∞} (1/√(2π)) e^{−(z−t)²/2} dz = e^{t²/2},

since the last integrand is the N(t, 1) density, which integrates to one. The Taylor expansion of M(t) is

M(t) = 1 + 0 + t²/2 + 0 + t⁴/(2² 2!) + . . . = Σⱼ₌₀^∞ t²ʲ/(2ʲ j!),

and comparing with the general expansion

M(t) = 1 + μ′₁t + μ′₂ t²/2! + μ′₃ t³/3! + μ′₄ t⁴/4! + . . . = Σᵣ₌₀^∞ μ′ᵣ tʳ/r!,

the moments of Z (which in this case are equal to the central moments) are, for r = 0, 1, 2, . . .,

μ′₂ᵣ₊₁ = E[Z²ʳ⁺¹] = 0,  μ′₂ᵣ = E[Z²ʳ] = (2r)!/(2ʳ r!).
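The closed-form moments of Z can be checked against numerical integration of the standard normal density (midpoint rule on [−8, 8]; the truncation error is negligible at that range):

```python
from math import exp, factorial, pi, sqrt

def z_moment(r, a=-8.0, b=8.0, steps=100_000):
    # midpoint-rule approximation of E[Z^r] for Z ~ N(0, 1)
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        z = a + (i + 0.5) * h
        total += z ** r * exp(-z * z / 2)
    return total * h / sqrt(2 * pi)

# Odd moments vanish; even moments of order m equal m!/(2^{m/2} (m/2)!).
for r in (1, 2, 3, 4):
    exact = 0 if r % 2 else factorial(r) // (2 ** (r // 2) * factorial(r // 2))
    assert abs(z_moment(r) - exact) < 1e-6
```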
Properties of the gamma function: Γ(t) = (t − 1)Γ(t − 1) for t > 1, and Γ(n) = (n − 1)!
for positive integer n. Notation for the gamma distribution: Gamma(α, β) or G(α, β).

f(x) = (βᵅ/Γ(α)) xᵅ⁻¹e^{−βx} for 0 ≤ x < ∞.
M(t) = 1/(1 − t/β)ᵅ for t < β.
μ = α/β, σ² = α/β².
Chi-squared: notation χ²(r), with r degrees of freedom; equivalent to Gamma(r/2, 1/2).

f(x) = (1/(Γ(r/2) 2^{r/2})) x^{r/2−1}e^{−x/2} for 0 ≤ x < ∞.
M(t) = 1/(1 − 2t)^{r/2} for t < 1/2.
μ = r, σ² = 2r.
Exponential: waiting time between events distributed as Poisson with intensity λ. Notation Exp(λ) (somewhat ambiguous). Equivalent to a gamma distribution with α = 1 and β = λ.

f(x) = λe^{−λx} for 0 ≤ x < ∞.
M(t) = λ/(λ − t) for t < λ.
μ = 1/λ, σ² = 1/λ².
Lognormal: X such that log X is normally distributed.

\[ f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}}\, e^{-(\log x-\mu)^2/(2\sigma^2)} \quad \text{for } 0 < x < \infty, \qquad E[X] = e^{\mu+\sigma^2/2}. \]

Notice that in this case M(t) is not defined (see ex. 2.36 in Casella & Berger). Examples are the distributions of income or consumption. This choice allows us to model the logs of income and consumption by means of the normal distribution, which is the distribution predicted by economic theory.
15
Hazard function: for a continuous random variable X with cdf F and pdf f, consider

\[ \lim_{\delta\to 0^+} \frac{P(X \le x+\delta \mid X > x)}{\delta} = \lim_{\delta\to 0^+} \frac{F(x+\delta)-F(x)}{\delta\,(1-F(x))}; \]

it is then defined as

\[ h(x) = \frac{f(x)}{1-F(x)} = \frac{F'(x)}{1-F(x)}. \]

Equivalently,

\[ h(x) = -\frac{d}{dx}\log\bigl(1-F(x)\bigr), \qquad \text{so that} \qquad 1-F(x) = \exp\Bigl(-\int_0^x h(u)\,du\Bigr). \]
Reading
Casella and Berger, Sections 3.1 - 3.2 - 3.3.
16
For simplicity we first give the definitions for the bivariate case and then we generalise to
the n-dimensional setting.
Joint cumulative distribution function: for two random variables X and Y the joint cdf is a function F_{X,Y} : ℝ × ℝ → [0, 1] such that

\[ F_{X,Y}(x, y) = P(X \le x, Y \le y). \]
Properties of joint cdf:
Marginal cdfs are generated from the joint cdf, but the reverse is not true. The joint cdf
contains information that is not captured in the marginals. In particular it tells us about
the dependence structure among the random variables, i.e. how they are associated.
17
Joint probability mass function: for two discrete random variables X and Y it is a function f_{X,Y} : ℝ × ℝ → [0, 1] such that

\[ f_{X,Y}(x, y) = P(X = x, Y = y) \quad \forall x, y \in \mathbb{R}. \]
In general

\[ P(x_1 < X \le x_2,\; y_1 < Y \le y_2) = \sum_{x_1 < x \le x_2}\;\sum_{y_1 < y \le y_2} f_{X,Y}(x, y). \]
Marginal probability mass functions: for two discrete random variables X and Y, with range {x₁, x₂, ...} and {y₁, y₂, ...} respectively, the marginal pmfs are

\[ f_X(x) = \sum_{y \in \{y_1, y_2, \ldots\}} f_{X,Y}(x, y), \qquad f_Y(y) = \sum_{x \in \{x_1, x_2, \ldots\}} f_{X,Y}(x, y). \]
Joint probability density function: for two jointly continuous random variables X and Y, it is an integrable function f_{X,Y} : ℝ × ℝ → [0, +∞) such that

\[ F_{X,Y}(x, y) = \int_{-\infty}^{y}\!\int_{-\infty}^{x} f_{X,Y}(u, v)\,du\,dv \quad \forall x, y \in \mathbb{R}, \]

so that, in particular,

\[ P(x_1 < X \le x_2,\; y_1 < Y \le y_2) = \int_{y_1}^{y_2}\!\int_{x_1}^{x_2} f_{X,Y}(u, v)\,du\,dv, \]

and for any (Borel) set B ⊆ ℝ² the probability that (X, Y) takes values in B is

\[ P(B) = \iint_B f_{X,Y}(x, y)\,dx\,dy. \]
In the one-dimensional case events are usually intervals of ℝ and their probability is proportional to their length; in two dimensions events are regions of the plane ℝ² and their probability is proportional to their area; in three dimensions events are regions of the space ℝ³ and their probability is proportional to their volume. Lengths, areas and volumes are weighted by the frequencies of the outcomes belonging to the events considered, hence they become areas, volumes and 4-d volumes under the pdfs. Probability is the measure of an event relative to the measure of the whole sample space, which is 1 by definition.
Marginal probability density functions: for two jointly continuous random variables X and Y, they are integrable functions f_X : ℝ → [0, +∞) and f_Y : ℝ → [0, +∞) such that

\[ f_X(x) = \int_{-\infty}^{+\infty} f_{X,Y}(x, y)\,dy, \quad x \in \mathbb{R}, \qquad f_Y(y) = \int_{-\infty}^{+\infty} f_{X,Y}(x, y)\,dx, \quad y \in \mathbb{R}. \]
18
Reading
Casella and Berger, Section 4.1.
19
Besides the usual univariate measures of location (mean) and scale (variance), in the multivariate case we are interested in measuring the dependence among random variables.
Joint cdf of independent random variables: two random variables X and Y are independent if and only if the events {X ≤ x} and {Y ≤ y} are independent for all choices of x and y, i.e., for all x, y ∈ ℝ,

\[ P(X \le x, Y \le y) = P(X \le x)P(Y \le y), \qquad \text{i.e.} \qquad F_{X,Y}(x, y) = F_X(x)F_Y(y). \]
Joint pmf or pdf of independent random variables: two random variables X and Y are independent if and only if, for all x, y ∈ ℝ,

\[ f_{X,Y}(x, y) = f_X(x)f_Y(y). \]
The two conditions above are necessary and sufficient, while the following is only a necessary condition, not a sufficient one (see also below the distinction between independence and uncorrelatedness).
Expectation and independence: if X and Y are independent then
E[XY ] = E[X]E[Y ].
Moreover, if g1 and g2 are well-behaved functions then also g1 (X) and g2 (Y ) are independent random variables, hence
E[g1 (X)g2 (Y )] = E[g1 (X)]E[g2 (Y )].
20

Multivariate generalization: n random variables X₁, ..., Xₙ are (mutually) independent if and only if

\[ F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_{j=1}^{n} F_{X_j}(x_j), \qquad \text{equivalently} \qquad f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_{j=1}^{n} f_{X_j}(x_j). \]

In that case

\[ E\Bigl[\prod_{j=1}^{n} X_j\Bigr] = \prod_{j=1}^{n} E[X_j], \qquad E\Bigl[\prod_{j=1}^{n} g_j(X_j)\Bigr] = \prod_{j=1}^{n} E[g_j(X_j)], \]

for well-behaved functions g₁, ..., gₙ.
21
Example: consider the joint pmf

\[ f_{X,Y}(x, y) = \begin{cases} 1/4 & \text{if } x = 0 \text{ and } y = 1, \\ 1/4 & \text{if } x = 0 \text{ and } y = -1, \\ 1/4 & \text{if } x = 1 \text{ and } y = 0, \\ 1/4 & \text{if } x = -1 \text{ and } y = 0, \\ 0 & \text{otherwise.} \end{cases} \]

Now, E[XY] = 0 and E[X] = E[Y] = 0, thus Cov(X, Y) = 0: the variables are uncorrelated. If we now choose g₁(X) = X² and g₂(Y) = Y² we have E[g₁(X)g₂(Y)] = E[X²Y²] = 0, but

\[ E[g_1(X)]\,E[g_2(Y)] = \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4} \ne 0. \]

So X and Y are not independent.
Example: suppose X is a standard normal random variable, i.e. with E[X k ] = 0 for
k odd, and let Y = X 2 . Clearly X and Y are not independent: if you know X, you also
know Y . And if you know Y , you know the absolute value of X. The covariance of X
and Y is
\[ Cov(X, Y) = E[XY] - E[X]E[Y] = E[X^3] - 0\cdot E[Y] = E[X^3] = 0. \]
Thus Corr(X, Y ) = 0, and we have a situation where the variables are not independent,
yet they have no linear dependence. A linear correlation coefficient does not encapsulate
anything about the quadratic dependence of Y upon X.
22
the (r, s)th joint moment is obtained from the joint mgf as

\[ \mu_{r,s} = E[X^r Y^s] = M^{(r,s)}_{X,Y}(0, 0) = \left.\frac{\partial^{r+s}}{\partial t^r\,\partial u^s}\, M_{X,Y}(t, u)\right|_{t=0,\,u=0}; \]
4. moment generating function for marginals: MX (t) = E[etX ] = MX,Y (t, 0),
MY (u) = E[euY ] = MX,Y (0, u);
5. if X and Y independent:
MX,Y (t, u) = MX (t)MY (u).
Joint cumulants: let K_{X,Y}(t, u) = log M_{X,Y}(t, u) be the joint cumulant generating function; then we define the (r, s)th joint cumulant c_{r,s} as the coefficient of (t^r u^s)/(r!s!) in the Taylor expansion of K_{X,Y}. Thus,

\[ Cov(X, Y) = c_{1,1} \qquad \text{and} \qquad Corr(X, Y) = \frac{c_{1,1}}{\sqrt{c_{2,0}\,c_{0,2}}}. \]

23

Multivariate generalization: for n independent random variables X₁, ..., Xₙ,

\[ M_{X_1,\ldots,X_n}(t_1, \ldots, t_n) = \prod_{j=1}^{n} M_{X_j}(t_j). \]
24 Inequalities
Hölder's inequality: let p, q > 1 with 1/p + 1/q = 1; if X belongs to L^p and Y belongs to L^q, then

\[ E[|XY|] \le E[|X|^p]^{1/p}\, E[|Y|^q]^{1/q}. \]

The case p = q = 2 is the Cauchy–Schwarz inequality, which applied to X − E[X] and Y − E[Y] gives

\[ |Cov(X, Y)| \le \sigma_X \sigma_Y, \]

which means |Corr(X, Y)| ≤ 1.

Minkowski's inequality: let p ≥ 1; if X and Y belong to L^p, then X + Y belongs to L^p and

\[ E[|X + Y|^p]^{1/p} \le E[|X|^p]^{1/p} + E[|Y|^p]^{1/p}. \]
Reading
Casella and Berger, Sections 4.2 - 4.5 - 4.7.
25 Conditional distributions
When we observe more than one random variable their values may be related. By considering conditional probabilities we can improve our knowledge of a given random variable
by exploiting the information we have about the other.
Conditional cumulative distribution function: given random variables X and Y with P(X = x) > 0, the distribution of Y conditional on (given) X = x is defined as

\[ F_{Y|X}(y|x) = P(Y \le y \mid X = x). \]

It is a possibly different distribution for every value of X: we have a family of distributions.
Conditional probability mass function: given X and Y discrete random variables with P(X = x) > 0, the conditional pmf of Y given X = x is

\[ f_{Y|X}(y|x) = P(Y = y \mid X = x) = \frac{f_{X,Y}(x, y)}{f_X(x)}, \]

and the conditional cdf is

\[ F_{Y|X}(y|x) = \sum_{y_i \le y} f_{Y|X}(y_i|x). \]
Conditional probability density function: given X and Y jointly continuous random variables with f_X(x) > 0, the conditional pdf and cdf of Y given X = x are

\[ f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}, \qquad F_{Y|X}(y|x) = \int_{-\infty}^{y} \frac{f_{X,Y}(x, v)}{f_X(x)}\,dv. \]

In both cases the marginal in the denominator can be recovered from the joint:

\[ f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \begin{cases} \dfrac{f_{X,Y}(x, y)}{\sum_y f_{X,Y}(x, y)}, & \text{discrete case,} \\[2ex] \dfrac{f_{X,Y}(x, y)}{\int_{-\infty}^{+\infty} f_{X,Y}(x, y)\,dy}, & \text{continuous case.} \end{cases} \]

Conversely, by the law of total probability,

\[ f_Y(y) = \begin{cases} \sum_x f_{Y|X}(y|x)\, f_X(x), & \text{discrete case,} \\ \int_{-\infty}^{+\infty} f_{Y|X}(y|x)\, f_X(x)\,dx, & \text{continuous case,} \end{cases} \]

and Bayes' theorem for pmfs/pdfs reads

\[ f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{f_X(x)}{f_Y(y)}\, f_{Y|X}(y|x). \]
26
If we consider all possible values taken by X then we have a new random variable which
is the conditional expectation of Y given X and it is written as E[Y |X]. It is the best
guess of Y given the knowledge of X. All properties of expectations still hold.
Law of iterated expectations: since E[Y |X] is a random variable we can take its expectation:
E[E[Y |X]] = E[Y ].
Indeed, in the continuous case

\[
E\bigl[E[Y|X]\bigr] = \int_{-\infty}^{+\infty} E[Y|X = x]\, f_X(x)\,dx = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} y\, f_{Y|X}(y|x)\, f_X(x)\,dy\,dx = \int_{-\infty}^{+\infty} y\, f_Y(y)\,dy = E[Y].
\]

A useful consequence is that we can compute E[Y] without having to refer to the marginal pmf or pdf of Y:

\[ E[Y] = \begin{cases} \sum_x E[Y|X = x]\, f_X(x), & \text{discrete case,} \\ \int_{-\infty}^{+\infty} E[Y|X = x]\, f_X(x)\,dx, & \text{continuous case.} \end{cases} \]
Conditional expectations of functions of random variables: if g is a well-behaved, real-valued function, the expectation of g(Y) given X = x is defined as

\[ E[g(Y)|X = x] = \begin{cases} \sum_y g(y)\, f_{Y|X}(y|x), & \text{discrete case,} \\ \int_{-\infty}^{+\infty} g(y)\, f_{Y|X}(y|x)\,dy, & \text{continuous case.} \end{cases} \]
The conditional expectation of g(Y ) given X is written as E[g(Y )|X] and it is also a random variable.
As a consequence any function of X can be treated as constant with respect to expectations conditional on X. In general for well-behaved functions g1 and g2
E[g1 (X)g2 (Y )|X] = g1 (X)E[g2 (Y )|X].
Notice that also E[Y |X] is a function of X so
E[E[Y |X]Y |X] = E[Y |X]E[Y |X] = (E[Y |X])2 .
Conditional variance: for random variables X and Y, it is defined as

\[ Var[Y|X = x] = E\bigl[(Y - E[Y|X = x])^2 \bigm| X = x\bigr] = \begin{cases} \sum_y \bigl[y - E[Y|X = x]\bigr]^2 f_{Y|X}(y|x), & \text{discrete case,} \\ \int_{-\infty}^{+\infty} \bigl[y - E[Y|X = x]\bigr]^2 f_{Y|X}(y|x)\,dy, & \text{continuous case.} \end{cases} \]
The conditional variance of Y given X is written Var[Y|X]; it is a random variable, a function of X. Moreover,

\[ Var[Y|X] = E[Y^2|X] - (E[Y|X])^2. \]

By using the law of iterated expectations,

\[
Var[Y] = E[Y^2] - (E[Y])^2
= E\bigl[E[Y^2|X]\bigr] - \bigl(E[E[Y|X]]\bigr)^2
= E\bigl[Var[Y|X] + (E[Y|X])^2\bigr] - \bigl(E[E[Y|X]]\bigr)^2 =
\]
\[
= E\bigl[Var[Y|X]\bigr] + E\bigl[(E[Y|X])^2\bigr] - \bigl(E[E[Y|X]]\bigr)^2
= E\bigl[Var[Y|X]\bigr] + Var\bigl[E[Y|X]\bigr].
\]

Finally, if X and Y are independent, then f_{Y|X}(y|x) = f_Y(y) and E[Y|X] = E[Y].
Conditional moment generating function: given X = x, it is the function defined as

\[ M_{Y|X}(u|x) = E[e^{uY}|X = x] = \begin{cases} \sum_y e^{uy}\, f_{Y|X}(y|x), & \text{discrete case,} \\ \int_{-\infty}^{+\infty} e^{uy}\, f_{Y|X}(y|x)\,dy, & \text{continuous case.} \end{cases} \]

This is a conditional expectation, so M_{Y|X}(u|X) is a random variable. We can calculate the joint mgf and marginal mgfs from the conditional mgf:

\[ M_{X,Y}(t, u) = E[e^{tX+uY}] = E\bigl[e^{tX} M_{Y|X}(u|X)\bigr], \qquad M_Y(u) = M_{X,Y}(0, u) = E\bigl[M_{Y|X}(u|X)\bigr]. \]
Example: suppose that X is the number of hurricanes that form in the Atlantic basin in a given year and Y is the number making landfall. We assume we know that each hurricane has a probability p of making landfall, independently of the other hurricanes. If we know the number of hurricanes that form, say x, we can view Y as the number of successes in x independent Bernoulli trials, i.e. Y|X = x ∼ Bin(x, p). If we also know that X ∼ Pois(λ), then we can compute the distribution of Y (notice that X ≥ Y):

\[
f_Y(y) = \sum_{x=y}^{+\infty} f_{Y|X}(y|x)\, f_X(x) = \sum_{x=y}^{+\infty} \frac{x!}{y!\,(x-y)!}\, p^y (1-p)^{x-y}\, \frac{\lambda^x e^{-\lambda}}{x!} =
\]
\[
= \frac{\lambda^y p^y e^{-\lambda}}{y!} \sum_{x=y}^{+\infty} \frac{[\lambda(1-p)]^{x-y}}{(x-y)!}
= \frac{\lambda^y p^y e^{-\lambda}}{y!} \sum_{j=0}^{+\infty} \frac{[\lambda(1-p)]^{j}}{j!}
= \frac{\lambda^y p^y e^{-\lambda}}{y!}\, e^{\lambda(1-p)}
= \frac{(\lambda p)^y e^{-\lambda p}}{y!},
\]

thus Y ∼ Pois(λp). So E[Y] = λp and Var[Y] = λp, but we could find these results without the need of the marginal pmf. Since Y|X = x ∼ Bin(x, p), then

\[ E[Y|X] = Xp, \qquad Var[Y|X] = Xp(1-p), \qquad M_{Y|X}(u|X) = (1-p+pe^u)^X, \]

therefore

\[
M_Y(u) = E\bigl[M_{Y|X}(u|X)\bigr] = E\bigl[(1-p+pe^u)^X\bigr] = E\bigl[e^{X\log(1-p+pe^u)}\bigr] = M_X\bigl(\log(1-p+pe^u)\bigr) = e^{\lambda(1-p+pe^u-1)} = e^{\lambda p(e^u-1)},
\]

which is again the mgf of a Pois(λp) random variable.
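The hierarchical structure above is easy to simulate. The sketch below (λ = 5 and p = 0.3 are arbitrary illustrative values; the Poisson sampler uses Knuth's multiplication method, an assumption since the standard library provides none) checks that the simulated landfall counts have mean and variance close to λp, as Y ∼ Pois(λp) predicts.

```python
import random
from math import exp

random.seed(0)

def poisson(lam):
    # Knuth's method: count uniforms until their running product drops below e^{-lam}
    L, k, prod = exp(-lam), 0, random.random()
    while prod > L:
        k += 1
        prod *= random.random()
    return k

lam, p, reps = 5.0, 0.3, 100_000

def landfalls():
    # hierarchical model: X ~ Pois(lam) hurricanes form,
    # each makes landfall independently with probability p
    x = poisson(lam)
    return sum(random.random() < p for _ in range(x))

ys = [landfalls() for _ in range(reps)]
mean = sum(ys) / reps
var = sum((y - mean) ** 2 for y in ys) / reps
print(round(mean, 2), round(var, 2))  # both should be close to lam * p = 1.5
```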
27
Example: consider the joint pdf f_{X,Y}(x, y) = x + y for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and 0 otherwise. It integrates to one:

\[
\int_0^1\!\!\int_0^1 (x+y)\,dx\,dy = \int_0^1 \Bigl[\frac{x^2}{2} + xy\Bigr]_{x=0}^{x=1} dy = \int_0^1 \Bigl(\frac{1}{2} + y\Bigr) dy = \Bigl[\frac{y}{2} + \frac{y^2}{2}\Bigr]_0^1 = 1.
\]
The joint cdf, for 0 ≤ x, y ≤ 1, is

\[
F_{X,Y}(x, y) = \int_0^y\!\!\int_0^x (u+v)\,du\,dv = \int_0^y \Bigl(\frac{x^2}{2} + xv\Bigr) dv = \Bigl[\frac{x^2 v}{2} + \frac{x v^2}{2}\Bigr]_0^y = \frac{1}{2}\, xy(x+y),
\]

and the marginal cdfs follow by setting the other argument to 1:

\[ F_X(x) = F_{X,Y}(x, 1) = \frac{1}{2}\,x(x+1), \qquad F_Y(y) = F_{X,Y}(1, y) = \frac{1}{2}\,y(y+1). \]
The marginal pdf of X is

\[ f_X(x) = \int_0^1 (x+y)\,dy = x + \frac{1}{2}, \]

and, by symmetry, f_Y(y) = y + 1/2.
The probability of the event {Y > 2X}, i.e. {X < Y/2}, is

\[
P(Y > 2X) = \int_0^1\!\!\int_0^{y/2} (x+y)\,dx\,dy = \int_0^1 \Bigl(\frac{y^2}{8} + \frac{y^2}{2}\Bigr) dy = \Bigl[\frac{y^3}{24} + \frac{y^3}{6}\Bigr]_0^1 = \frac{5}{24}.
\]
The joint moments are, for r, s = 0, 1, 2, ...,

\[
E[X^r Y^s] = \int_0^1\!\!\int_0^1 x^r y^s (x+y)\,dx\,dy = \int_0^1 \Bigl(\frac{1}{r+2}\, y^s + \frac{1}{r+1}\, y^{s+1}\Bigr) dy =
\]
\[
= \Bigl[\frac{1}{(r+2)(s+1)}\, y^{s+1} + \frac{1}{(r+1)(s+2)}\, y^{s+2}\Bigr]_0^1 = \frac{1}{(r+2)(s+1)} + \frac{1}{(r+1)(s+2)}.
\]
Thus, E[XY] = 1/3, E[X] = E[Y] = 7/12, E[X²] = 5/12, so Var[X] = 5/12 − (7/12)² = 11/144, and finally

\[ Cov(X, Y) = \frac{1}{3} - \frac{49}{144} = -\frac{1}{144}, \]

and Corr(X, Y) = −1/11, so X and Y are not independent.
We find this result also by noticing that, given the marginal and the joint pdfs, we have

\[ f_X(x)\,f_Y(y) = \Bigl(x+\frac{1}{2}\Bigr)\Bigl(y+\frac{1}{2}\Bigr) = xy + \frac{x+y}{2} + \frac{1}{4}, \]

therefore f_X(x)f_Y(y) ≠ f_{X,Y}(x, y), so X and Y are not independent.
The conditional pdf of Y given X = x is

\[ f_{Y|X}(y|x) = \begin{cases} \dfrac{x+y}{x+\frac{1}{2}} & \text{if } 0 < y < 1, \\[1ex] 0 & \text{otherwise,} \end{cases} \]

and the conditional expectation is

\[
E[Y|X = x] = \int_0^1 y\, \frac{x+y}{x+\frac{1}{2}}\,dy = \frac{1}{x+\frac{1}{2}} \Bigl[\frac{xy^2}{2} + \frac{y^3}{3}\Bigr]_0^1 = \frac{\frac{x}{2} + \frac{1}{3}}{x+\frac{1}{2}} = \frac{3x+2}{6x+3}.
\]
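The closed form E[Y|X = x] = (3x + 2)/(6x + 3) can be verified by numerical integration (a midpoint-rule sketch; the grid size is an arbitrary choice):

```python
def cond_exp(x, steps=10_000):
    # E[Y | X=x] = (1/(x + 1/2)) * ∫_0^1 y (x + y) dy, midpoint rule
    h = 1.0 / steps
    total = sum((i + 0.5) * h * (x + (i + 0.5) * h) for i in range(steps)) * h
    return total / (x + 0.5)

for x in (0.1, 0.5, 0.9):
    closed = (3 * x + 2) / (6 * x + 3)
    assert abs(cond_exp(x) - closed) < 1e-7
print("numerical and closed form agree")
```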
Reading
Casella and Berger, Sections 4.2 - 4.4 - 4.5.
28
and, by using the linearity of expectations and the binomial expansion, we have for r ∈ ℕ

\[ E[(X+Y)^r] = \sum_{j=0}^{r} \binom{r}{j}\, E[X^j Y^{r-j}]. \]
Probability mass/density function of a sum: if X and Y are independent random variables and we define Z = X + Y, then the pmf/pdf of Z is

\[ f_Z(z) = \begin{cases} \sum_u f_X(u)\, f_Y(z-u), & \text{discrete case,} \\ \int_{-\infty}^{+\infty} f_X(u)\, f_Y(z-u)\,du, & \text{continuous case.} \end{cases} \]

This operation is known as convolution:

\[ f_Z = f_X * f_Y, \qquad (f_X * f_Y)(z) = \int_{-\infty}^{+\infty} f_X(u)\, f_Y(z-u)\,du. \]

Convolution is commutative, so f_X * f_Y = f_Y * f_X.
Moment generating function of the sum of independent random variables: if X and
Y are independent random variables and we define Z = X + Y then the mgf of Z is
MZ (t) = MX (t)MY (t),
and the cumulant generating function is
KZ (t) = KX (t) + KY (t).
Example: suppose that X and Y are independent exponentially distributed r.v., X ∼ Exp(λ) and Y ∼ Exp(μ), with λ ≠ μ; then the pdf of Z = X + Y is

\[
f_Z(z) = \int_{-\infty}^{+\infty} f_X(u)\, f_Y(z-u)\,du = \int_0^z \lambda e^{-\lambda u}\, \mu e^{-\mu(z-u)}\,du =
\]
\[
= \lambda\mu\, e^{-\mu z} \Bigl[-\frac{1}{\lambda-\mu}\, e^{-(\lambda-\mu)u}\Bigr]_0^z = \frac{\lambda\mu}{\lambda-\mu}\bigl(e^{-\mu z} - e^{-\lambda z}\bigr), \qquad 0 \le z < +\infty.
\]

Note the domain of integration [0, z]: since both X and Y are positive r.v., both U and Z − U have to be positive, thus we need 0 ≤ u ≤ z.

In theory, we could also use mgfs, but in this case we get a function of t that does not have an expression that resembles that of a known distribution.
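The closed form can be checked against a direct numerical evaluation of the convolution integral (λ = 2, μ = 0.5 and z = 1.3 are arbitrary illustrative values):

```python
from math import exp

lam, mu, z = 2.0, 0.5, 1.3

# closed form: f_Z(z) = lam*mu/(lam - mu) * (exp(-mu z) - exp(-lam z))
closed = lam * mu / (lam - mu) * (exp(-mu * z) - exp(-lam * z))

# midpoint-rule evaluation of the convolution integral over [0, z]
steps = 100_000
h = z / steps
numeric = sum(lam * exp(-lam * ((i + 0.5) * h))
              * mu * exp(-mu * (z - (i + 0.5) * h))
              for i in range(steps)) * h

print(abs(numeric - closed) < 1e-8)  # True: the two expressions agree
```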
Example: suppose that X and Y are independent normally distributed r.v., X ∼ N(μ_X, σ_X²) and Y ∼ N(μ_Y, σ_Y²); then to compute the pdf of Z = X + Y we use the cumulant generating functions

\[ K_X(t) = \mu_X t + \frac{\sigma_X^2 t^2}{2}, \qquad K_Y(t) = \mu_Y t + \frac{\sigma_Y^2 t^2}{2}, \]

and

\[ K_Z(t) = (\mu_X + \mu_Y)\,t + \frac{(\sigma_X^2 + \sigma_Y^2)\,t^2}{2}; \]

by uniqueness of cumulant generating functions Z ∼ N(μ_X + μ_Y, σ_X² + σ_Y²).
Multivariate generalization: for n independent random variables X₁, ..., Xₙ let S = Σ_{j=1}^n X_j; then

1. the pmf/pdf of S is f_S = f_{X₁} * ... * f_{Xₙ};

2. the mgf of S is M_S(t) = M_{X₁}(t) ... M_{Xₙ}(t);

3. if X₁, ..., Xₙ are also identically distributed they have a common pmf/pdf f and mgf M_X(t), thus

\[ f_S = \underbrace{f * f * \ldots * f}_{n\ \text{times}}, \qquad M_S(t) = [M_X(t)]^n. \]

To indicate independent and identically distributed random variables we use the notation i.i.d.
Example: given n i.i.d. Bernoulli r.v. X₁, ..., Xₙ with probability p and mgf

\[ M_X(t) = 1 - p + pe^t, \]

the sum S = Σ_{j=1}^n X_j has mgf M_S(t) = (1 − p + pe^t)^n, i.e. S ∼ Bin(n, p). The same reasoning identifies sums within other families:

1. Normal: X ∼ N(μ₁, σ₁²), Y ∼ N(μ₂, σ₂²) ⇒ Z = X + Y ∼ N(μ₁ + μ₂, σ₁² + σ₂²); and X_j ∼ i.i.d. N(μ, σ²), j = 1, ..., n ⇒ S = Σ_{j=1}^n X_j ∼ N(nμ, nσ²);

2. Gamma: X ∼ Gamma(r₁, λ), Y ∼ Gamma(r₂, λ) ⇒ Z ∼ Gamma(r₁ + r₂, λ); and X_j ∼ i.i.d. Exp(λ), j = 1, ..., n ⇒ S ∼ Gamma(n, λ);

3. Binomial: X ∼ Bin(n₁, p), Y ∼ Bin(n₂, p) ⇒ Z ∼ Bin(n₁ + n₂, p); and X_j ∼ i.i.d. Bin(k, p), j = 1, ..., n ⇒ S ∼ Bin(nk, p).
29
Law of Large Numbers (stated precisely and proved below): for i.i.d. random variables X₁, X₂, ... with mean μ and Sₙ = Σ_{i=1}^n Xᵢ, the sample mean Sₙ/n converges to μ, in probability (weak law: for every ε > 0, P(|Sₙ/n − μ| > ε) → 0) and almost surely (strong law: Sₙ/n →a.s. μ), as n → +∞.
Example: when tossing a coin, Xᵢ is 1 if we get heads and 0 if we get tails (a Bernoulli trial), and Sₙ is the number of heads we get in n independent tosses. The frequency of heads Sₙ/n will converge to 1/2, which is the value of p in this particular case.
The following result is a special case of the Central Limit Theorem which we shall
see in due course.
De Moivre–Laplace Limit Theorem: as n → +∞, and for Z ∼ N(0, 1),

\[ \lim_{n\to\infty} P\left( \frac{\sqrt{n}\,(S_n/n - p)}{\sqrt{p(1-p)}} \le \alpha \right) = P(Z \le \alpha) = \int_{-\infty}^{\alpha} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz, \quad \alpha \in \mathbb{R}, \]

which implies

\[ \frac{\sqrt{n}\,(S_n/n - p)}{\sqrt{p(1-p)}} \xrightarrow{d} Z. \]
We are saying that the sample mean (which is a random variable) of the Bernoulli trials converges in distribution to, or is asymptotically distributed as, a normal random variable with mean p (this we already know from the law of large numbers) and variance p(1−p)/n; thus the more trials we observe, the smaller the uncertainty about the expected value of the sample mean, the rate of convergence being √n. This result contains useful information not only on the point-wise estimate of the population mean but also on the uncertainty and the speed with which we have convergence.
Finally, remember that Sₙ ∼ Bin(n, p), with mean np and variance np(1−p); then, by rearranging the terms, we have

\[ \frac{S_n - np}{\sqrt{np(1-p)}} \xrightarrow{d} Z, \]

i.e. the Binomial distribution can be approximated by a normal distribution with mean np and variance np(1−p).
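The approximation can be checked numerically. The sketch below compares the exact Bin(400, 1/2) cdf with the normal cdf, adding the usual continuity correction of 1/2 (a refinement not discussed above); the choices n = 400, p = 1/2, k = 210 are arbitrary.

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, p):
    # exact P(S_n <= k) for S_n ~ Bin(n, p)
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

def norm_cdf(t):
    # standard normal cdf via the error function
    return 0.5 * (1 + erf(t / sqrt(2)))

n, p, k = 400, 0.5, 210
exact = binom_cdf(k, n, p)
approx = norm_cdf((k + 0.5 - n * p) / sqrt(n * p * (1 - p)))
print(round(exact, 3), round(approx, 3))  # nearly identical
```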
Reading
Casella and Berger, Section 5.2.
30
Hierarchies and mixtures: suppose we are interested in a random variable Y which has a distribution that depends on another random variable, say X. This is called a hierarchical model and Y has a mixture distribution. In the first instance we do not know the marginal distribution of Y directly, but we know the conditional distribution of Y given X = x and the marginal distribution of X (see the example on hurricanes of Section 26).
The key results, which are necessary for characterising Y, are

\[ E[Y] = E\bigl[E[Y|X]\bigr], \qquad Var[Y] = E\bigl[Var[Y|X]\bigr] + Var\bigl[E[Y|X]\bigr], \]
\[ f_Y(y) = E\bigl[f_{Y|X}(y|X)\bigr] \qquad \text{and} \qquad M_Y(t) = E\bigl[M_{Y|X}(t|X)\bigr]. \]

Example: Poisson mixing. If Y|Λ = λ ∼ Pois(λ), for some positive r.v. Λ, then

\[ E[Y|\Lambda] = Var[Y|\Lambda] = \Lambda. \]

Therefore,

\[ E[Y] = E[\Lambda], \qquad Var[Y] = E[\Lambda] + Var[\Lambda]. \]
Marginal results for random sums: suppose that {X_j} is a sequence of i.i.d. random variables with mean E[X] and variance Var[X], for any j, and suppose that N is a random variable taking only positive integer values, independent of the X_j. Define Y = Σ_{j=1}^N X_j; then

\[ E[Y] = E[N]\,E[X], \qquad Var[Y] = E[N]\,Var[X] + Var[N]\,\{E[X]\}^2, \]
\[ M_Y(t) = M_N\bigl(\log M_X(t)\bigr) \qquad \text{and} \qquad K_Y(t) = K_N\bigl(K_X(t)\bigr). \]
Example: each year the value of claims made by the owner of a health insurance policy is distributed exponentially with mean λ, independently of previous years. At the end of each year, with probability p, the individual will cancel her policy. We want the distribution of the total cost of the health insurance policy for the insurer. The value of claims in year j is X_j and the number of years in which the policy is held is N, thus

\[ X_j \sim \text{i.i.d. } Exp\Bigl(\frac{1}{\lambda}\Bigr), \qquad N \sim \text{Geometric}(p). \]

The total cost for the insurer is Y = Σ_{j=1}^N X_j. Therefore E[Y] = E[N]E[X] = λ/p. To get the distribution we use the cumulant generating functions

\[ K_X(t) = -\log(1-\lambda t), \qquad K_N(t) = \log\left(\frac{p\,e^{t}}{1-(1-p)e^{t}}\right), \]

and

\[
M_Y(t) = M_N\bigl(\log M_X(t)\bigr) = \frac{p\,M_X(t)}{1-(1-p)M_X(t)} = \frac{p}{(1-\lambda t)-(1-p)} = \frac{1}{1-(\lambda/p)t},
\]

so Y is again exponentially distributed, with mean λ/p. Similarly, if instead N ∼ Pois(ν), then

\[ M_Y(t) = M_N\bigl(\log M_X(t)\bigr) = e^{\nu(M_X(t)-1)}, \qquad K_Y(t) = \nu\bigl(M_X(t)-1\bigr), \]

which give E[Y] = νE[X] and Var[Y] = νE[X²].
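A simulation sketch of the insurer's total cost (the values λ = 2 and p = 0.25 are arbitrary illustrative choices): the sample mean should be close to λ/p = 8.

```python
import random

random.seed(1)
lam, p, reps = 2.0, 0.25, 100_000

def total_cost():
    # N ~ Geometric(p) on {1, 2, ...}: the policy is held at least one year
    years = 1
    while random.random() > p:
        years += 1
    # each year's claims are exponential with mean lam (expovariate takes the rate 1/lam)
    return sum(random.expovariate(1 / lam) for _ in range(years))

ys = [total_cost() for _ in range(reps)]
mean = sum(ys) / reps
print(round(mean, 1))  # close to lam / p = 8
```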
Reading
Casella and Berger, Section 4.4
31 Random vectors
This is just a way to simplify notation when we consider n random variables. Expectations are element-wise, and we have to remember that the variance of a vector is a matrix.
Random vector: an n-dimensional vector of random variables, i.e. a function

\[ \mathbf{X} = (X_1, \ldots, X_n)^T : \Omega \to \mathbb{R}^n. \]
The cdf, pmf or pdf, and mgf of a random vector are the joint cdf, pmf or pdf, and mgf of
X1 , . . . , Xn so, for any x = (x1 , . . . , xn ), t = (t1 , . . . , tn ) Rn ,
FX (x) = FX1 ,...,Xn (x1 , . . . , xn ),
fX (x) = fX1 ,...,Xn (x1 , . . . , xn ),
MX (t) = MX1 ,...,Xn (t1 , . . . , tn ).
The expectation is taken element-wise,

\[ E[\mathbf{X}] = \begin{pmatrix} E[X_1] \\ \vdots \\ E[X_n] \end{pmatrix}. \]

For jointly continuous random variables we have

\[ E[\mathbf{X}] = \int_{\mathbb{R}^n} \mathbf{x}\, f_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x} = \int_{-\infty}^{+\infty}\!\!\ldots\int_{-\infty}^{+\infty} \mathbf{x}\, f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_1 \ldots dx_n. \]
The variance-covariance matrix is

\[ \Sigma = \begin{pmatrix}
Var[X_1] & Cov(X_1, X_2) & \ldots & Cov(X_1, X_n) \\
Cov(X_2, X_1) & Var[X_2] & \ldots & Cov(X_2, X_n) \\
\vdots & \vdots & \ddots & \vdots \\
Cov(X_n, X_1) & \ldots & \ldots & Var[X_n]
\end{pmatrix}. \]

The matrix Σ is symmetric, and if the variables are uncorrelated then it is a diagonal matrix. If the variables are uncorrelated and also identically distributed then Σ = σ²Iₙ, where σ² is the variance of each random variable and Iₙ is the n-dimensional identity matrix. Finally, as the univariate variance is always positive, in this case we have that Σ is a non-negative definite matrix, i.e.

\[ \mathbf{b}^T \Sigma\, \mathbf{b} \ge 0 \quad \forall\, \mathbf{b} \in \mathbb{R}^n. \]
Example: if n = 2 and we assume E[X] = E[Y] = 0, then

\[
\Sigma = E\left[\begin{pmatrix} X \\ Y \end{pmatrix}\begin{pmatrix} X & Y \end{pmatrix}\right] = E\begin{pmatrix} X^2 & XY \\ YX & Y^2 \end{pmatrix} = \begin{pmatrix} E[X^2] & E[XY] \\ E[YX] & E[Y^2] \end{pmatrix} = \begin{pmatrix} Var[X] & Cov(X, Y) \\ Cov(X, Y) & Var[Y] \end{pmatrix}.
\]
Conditioning for random vectors: if X and Y are random vectors, and if f_X(x) > 0, we can define the conditional pdf/pmf as

\[ f_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}) = \frac{f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y})}{f_{\mathbf{X}}(\mathbf{x})}, \qquad \text{or} \qquad f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) = f_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x})\, f_{\mathbf{X}}(\mathbf{x}). \]
where the random vector X₋ⱼ is the random vector X without its j-th element.
Example: consider 3 r.v. X1 , X2 and X3 , we can group them in different ways and we
get for example
fX1 ,X2 ,X3 (x1 , x2 , x3 ) = fX3 |X1 ,X2 (x3 |x1 , x2 )fX1 ,X2 (x1 , x2 ),
and applying again the definition above to the joint pdf/pmf of X1 and X2 we have
fX1 ,X2 ,X3 (x1 , x2 , x3 ) = fX3 |X1 ,X2 (x3 |x1 , x2 )fX2 |X1 (x2 |x1 )fX1 (x1 ).
32
We start with the bivariate case. We want a bivariate version of the normal distribution.
Given two standard normal random variables, we can build a bivariate normal that depends only on their correlation.
Standard bivariate normal: given U and V i.i.d. standard normal random variables, and for some number |ρ| < 1, define X = U and Y = ρU + √(1−ρ²) V; then we can prove that

1. X ∼ N(0, 1) and Y ∼ N(0, 1);

2. Corr(X, Y) = ρ;

3. the joint pdf is that of a standard bivariate normal random variable and depends only on the parameter ρ:

\[ f_{X,Y}(x, y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{x^2 - 2\rho xy + y^2}{2(1-\rho^2)} \right). \]
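The construction above is direct to simulate; the sketch below (ρ = 0.6 and the sample size are arbitrary choices) checks that the sample correlation of X = U and Y = ρU + √(1−ρ²)V comes out close to ρ.

```python
import random
from math import sqrt

random.seed(2)
rho, reps = 0.6, 200_000

xs, ys = [], []
for _ in range(reps):
    u, v = random.gauss(0, 1), random.gauss(0, 1)   # independent standard normals
    xs.append(u)                                    # X = U
    ys.append(rho * u + sqrt(1 - rho**2) * v)       # Y = rho*U + sqrt(1-rho^2)*V

mx, my = sum(xs) / reps, sum(ys) / reps
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / reps
vx = sum((x - mx) ** 2 for x in xs) / reps
vy = sum((y - my) ** 2 for y in ys) / reps
corr = cov / sqrt(vx * vy)
print(round(corr, 2))   # close to rho = 0.6
```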
Bivariate normal for independent random variables: if the random variables U and V are independent and standard normal, the joint pdf and mgf are

\[ f_{U,V}(u, v) = \frac{1}{2\pi}\, e^{-(u^2+v^2)/2}, \qquad M_{U,V}(s, t) = e^{(s^2+t^2)/2}. \]

The random vector (U, V) is normally distributed with variance-covariance matrix

\[ \Sigma_{U,V} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. \]
Computing the joint pdf: given X = U and Y = ρU + √(1−ρ²) V, we have to compute f_{X,Y}(x, y) given f_{U,V}(u, v). Given the function h : ℝ² → ℝ² such that h(X, Y) = (U, V), where the domain of h is C ⊆ ℝ² and is in one-to-one correspondence with the support of (U, V), we have the rule

\[ f_{X,Y}(x, y) = \begin{cases} f_{U,V}(h(x, y))\,|J_h(x, y)| & \text{for } (x, y) \in C, \\ 0 & \text{otherwise,} \end{cases} \]

where

\[ J_h(x, y) = \det \begin{pmatrix} \frac{\partial}{\partial x} h_1(x, y) & \frac{\partial}{\partial y} h_1(x, y) \\ \frac{\partial}{\partial x} h_2(x, y) & \frac{\partial}{\partial y} h_2(x, y) \end{pmatrix}. \]

In this case, C = ℝ²,

\[ u = h_1(x, y) = x, \qquad v = h_2(x, y) = \frac{y - \rho x}{\sqrt{1-\rho^2}}, \]

and |J_h(x, y)| = 1/√(1−ρ²), thus

\[ f_{X,Y}(x, y) = \frac{1}{\sqrt{1-\rho^2}}\, f_{U,V}\left(x,\, \frac{y-\rho x}{\sqrt{1-\rho^2}}\right), \]

which simplifies to the standard bivariate normal pdf. For the general bivariate normal the conditional distribution is

\[ Y|X = x \sim N\left( \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\,(x - \mu_X),\; \sigma_Y^2(1-\rho^2) \right). \]

It is obtained by using the joint and the marginal pdfs.
Multivariate case

1. Multivariate normal density: let X₁, ..., Xₙ be random variables and define the n × 1 random vector X = (X₁, ..., Xₙ)ᵀ. If X₁, ..., Xₙ are jointly normal then X ∼ N(μ, Σ), where the mean μ = E[X] is an n × 1 vector and the covariance matrix Σ = Var[X] is an n × n matrix whose (i, j)th entry is Cov(Xᵢ, Xⱼ). The joint density function is

\[ f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\,\det(\Sigma)^{1/2}}\, \exp\bigl( -(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})/2 \bigr). \]
Reading
Casella and Berger, Definition 4.5.10
33
This section starts off somewhat more abstract but concludes with the most important and widely-used theorem in probability, the Central Limit Theorem. Along the way we also state and prove two laws of large numbers.

To get started, as an example, consider a sequence of independent Bernoulli random variables Xᵢ ∼ X with p = 1/2 and let

\[ Y_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (2X_i - 1). \]

Note that we have normalised the Xᵢ so that E[Yₙ] = 0 and Var(Yₙ) = 1. In particular, the mean and variance of Yₙ do not depend on n. A gambler could think of Yₙ as their (rescaled) earnings in case they win 1 each time a fair coin ends up heads and lose 1 each time it ends up tails. Astonishingly, even though Yₙ is constructed from a humble Bernoulli distribution, as n gets large the distribution of Yₙ approaches the normal distribution. Indeed,
using moment generating functions (and M_{aX+b}(t) = e^{bt} M_X(at) for a, b ∈ ℝ), we get

\[
M_{Y_n}(t) = \left[ e^{-t/\sqrt{n}}\, M_X\bigl(2t/\sqrt{n}\bigr) \right]^n
= \left[ e^{-t/\sqrt{n}} \Bigl(\frac{1}{2} + \frac{1}{2}\, e^{2t/\sqrt{n}}\Bigr) \right]^n
= \left[ \frac{1}{2}\, e^{-t/\sqrt{n}} + \frac{1}{2}\, e^{t/\sqrt{n}} \right]^n \approx
\]
\[
\approx \left[ \frac{1}{2}\Bigl(1 - \frac{t}{\sqrt{n}} + \frac{t^2}{2n}\Bigr) + \frac{1}{2}\Bigl(1 + \frac{t}{\sqrt{n}} + \frac{t^2}{2n}\Bigr) \right]^n \quad \text{(Taylor)}
= \left(1 + \frac{t^2}{2n}\right)^n \to e^{t^2/2},
\]

which is the mgf of a standard normal random variable.
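Since 2Xᵢ − 1 = ±1 with probability 1/2 each, its mgf is cosh(t), so M_{Yₙ}(t) = cosh(t/√n)ⁿ exactly, and the convergence to e^{t²/2} can be watched directly (t = 1 is an arbitrary evaluation point):

```python
from math import cosh, exp, sqrt

def mgf_Yn(t, n):
    # exact mgf of Y_n = n^{-1/2} * (sum of n independent +-1 signs)
    return cosh(t / sqrt(n)) ** n

t = 1.0
target = exp(t ** 2 / 2)           # limiting value e^{1/2}
for n in (10, 100, 10_000):
    print(n, round(mgf_Yn(t, n), 5))
print(round(target, 5))
```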
34 Modes of convergence

Almost sure convergence: the sequence {Xₙ} converges to X almost surely, written Xₙ →a.s. X, if

\[ P\Bigl( \Bigl\{\omega : \lim_{n\to+\infty} X_n(\omega) = X(\omega)\Bigr\} \Bigr) = 1. \]
Note that whenever we write P(A) we should check that A is in our sigma-algebra. Indeed, with A := {ω : lim_{n→+∞} Xₙ(ω) = X(ω)} we have that ω ∈ A if and only if

\[ \forall k \in \mathbb{N}\;\; \exists N \in \mathbb{N} \text{ s.t. } \forall n \ge N, \quad |X_n(\omega) - X(\omega)| < \frac{1}{k}, \]

and hence

\[ A = \bigcap_{k \in \mathbb{N}} \bigcup_{N \in \mathbb{N}} \bigcap_{n \ge N} \Bigl\{ \omega : |X_n(\omega) - X(\omega)| < \frac{1}{k} \Bigr\}. \]

An equivalent definition is: Xₙ → X almost surely if, for every ε > 0,

\[ \lim_{n\to\infty} P\left( \bigcup_{m=n}^{\infty} \{ |X_m - X| > \varepsilon \} \right) = 0. \]
To see that the latter two definitions are equivalent, first consider an increasing sequence of events Bₙ, meaning that Bᵢ ⊆ Bᵢ₊₁ for each i. Using countable additivity it follows that (with B₀ = ∅)

\[ P\left( \bigcup_n B_n \right) = P\left( \bigcup_n (B_n \setminus B_{n-1}) \right) = \sum_n P(B_n \setminus B_{n-1}) = \lim_n P(B_n). \]

A diagram might help here to see why the above is true, and the final equality is an example of a so-called telescoping series. This is called the continuity property of probability. Next, note that ⋃_{m=n}^∞ {|Xₘ − X| > ε} is a decreasing sequence of sets (in n) and by taking complements the equivalence now follows (try filling in the details).
The remaining three modes of convergence are somewhat more straightforward.

Convergence in probability: the sequence {Xₙ} converges to X in probability (Xₙ →P X) if, for every ε > 0,

\[ \lim_{n\to\infty} P(|X_n - X| < \varepsilon) = 1. \]
Convergence in mean square: the sequence {Xₙ} converges to X in mean square (Xₙ →m.s. X) if

\[ \lim_{n\to\infty} E[(X_n - X)^2] = 0. \]

Convergence in distribution: the sequence {Xₙ} converges to X in distribution (Xₙ →d X) if

\[ \lim_{n\to\infty} F_{X_n}(t) = F_X(t) \]

at every point t at which F_X is continuous.
Relationships between the modes of convergence:

1. if Xₙ →a.s. X then Xₙ →P X;

2. if Xₙ →m.s. X then Xₙ →P X;

3. if Xₙ →P X then Xₙ →d X.
Proof:

1. If Xₙ converges to X almost surely, this means that for any ε > 0

\[ \lim_{n\to\infty} P\left( \bigcup_{m=n}^{\infty} \{|X_m - X| > \varepsilon\} \right) = 0. \]

Since {|Xₙ − X| > ε} ⊆ ⋃_{m=n}^∞ {|Xₘ − X| > ε}, it follows that P(|Xₙ − X| > ε) → 0, so Xₙ converges to X in probability.
2. From Chebyshev's inequality we know that for any ε > 0

\[ P(|X_n - X| > \varepsilon) \le \frac{E[(X_n - X)^2]}{\varepsilon^2} \to 0 \]

as n → ∞, and hence mean-square convergence indeed implies convergence in probability.
3. Suppose for simplicity that Xₙ and X are continuous random variables and assume that Xₙ →P X. From the bounds

\[ P(X \le t - \varepsilon) \le P(X_n \le t) + P(|X_n - X| \ge \varepsilon) \]

and

\[ P(X_n \le t) \le P(X \le t + \varepsilon) + P(|X_n - X| \ge \varepsilon) \]

it follows, by letting ε > 0 be arbitrarily small, that

\[ P(X_n \le t) \to P(X \le t) \]

as n → ∞. This argument can be adapted to the case when Xₙ or X are not continuous random variables, as long as t is a point of continuity of F_X.
Note that it follows that convergence in distribution is implied by any of the other modes
of convergence. None of the other implications hold in general. For some of the examples
and also for the proof of (a special case of) the Strong Law of Large Numbers the so-called
Borel Cantelli Lemmas are incredibly useful.
35
The Borel–Cantelli Lemmas are two fundamental lemmas in probability theory. Let Aₙ be a sequence of events and denote by

\[ A := \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m \]

the event that infinitely many of the Aₙ occur. The Borel–Cantelli Lemmas give sufficient conditions on the Aₙ under which either P(A) = 0 or P(A) = 1.

Borel–Cantelli 1: Suppose Σ_{n=1}^∞ P(Aₙ) < ∞. Then P(A) = 0.
Proof: Note that since by definition A ⊆ ⋃_{m=n}^∞ Aₘ for each n, it follows that

\[ P(A) \le P\left( \bigcup_{m=n}^{\infty} A_m \right) \le \sum_{m=n}^{\infty} P(A_m) \to 0 \quad \text{as } n \to \infty, \]

since Σ_{n=1}^∞ P(Aₙ) < ∞.
Borel–Cantelli 2: Suppose the events Aₙ are independent and Σ_{n=1}^∞ P(Aₙ) = ∞. Then P(A) = 1.

Proof:

\[
P(A^c) = P\left( \bigcup_n \bigcap_{m=n}^{\infty} A_m^c \right)
= \lim_n P\left( \bigcap_{m=n}^{\infty} A_m^c \right) \quad \Bigl(\text{as } \bigcap_{m=n}^{\infty} A_m^c \text{ is increasing in } n\Bigr)
\]
\[
= \lim_n \prod_{m=n}^{\infty} \bigl(1 - P(A_m)\bigr) \quad \text{(independence)}
\le \lim_n \prod_{m=n}^{\infty} e^{-P(A_m)} \quad \bigl(\text{since } 1-x \le e^{-x}\bigr)
= \lim_n e^{-\sum_{m=n}^{\infty} P(A_m)} = 0
\]

whenever Σ_{n=1}^∞ P(Aₙ) = ∞.

36

Example: convergence in probability does not imply almost sure convergence. Let Xₙ be independent random variables with P(Xₙ = n) = 1/n and P(Xₙ = 0) = 1 − 1/n, and let Aₙ = {Xₙ = n}. Then

\[ \sum_{n=1}^{\infty} P(A_n) = \sum_{n=1}^{\infty} \frac{1}{n} = \infty \]
as the harmonic series diverges³. Now, from the second Borel–Cantelli Lemma it follows that P(Xₙ = n for infinitely many n) = 1, so Xₙ does not converge to 0 almost surely (even though P(|Xₙ| > ε) = 1/n → 0, so it does converge to 0 in probability).
Example: convergence in probability does not imply convergence in mean square. Let Xₙ be defined as in the previous example. Then

\[ E[(X_n - 0)^2] = 0 \cdot \Bigl(1 - \frac{1}{n}\Bigr) + n^2 \cdot \frac{1}{n} = n \]

does not converge to zero, so the random variables Xₙ do not converge to 0 in mean square.
Example: almost sure convergence does not imply convergence in mean square. If we tweak Xₙ and define the sequence now with P(Xₙ = 0) = 1 − 1/n² and P(Xₙ = n) = 1/n², we have that for any ε > 0

\[ P(|X_n - 0| > \varepsilon) = \frac{1}{n^2}. \]

Since Σ_{n=1}^∞ n^{-2} < ∞ (in fact⁴, it is π²/6), it now follows from the first Borel–Cantelli Lemma that

\[ P(|X_n - 0| > \varepsilon \text{ for infinitely many } n) = 0, \]

so Xₙ → 0 a.s.; yet E[(Xₙ − 0)²] = n² · (1/n²) = 1 does not converge to 0, so there is no convergence in mean square.
³ For example, this follows from the fact that the harmonic series 1 + 1/2 + (1/3 + 1/4) + (1/5 + 1/6 + 1/7 + 1/8) + ... has lower bound 1 + 1/2 + (1/4 + 1/4) + (1/8 + 1/8 + 1/8 + 1/8) + ... = 1 + 1/2 + 1/2 + 1/2 + ... = ∞.

⁴ For various proofs of this surprising result see http://empslocal.ex.ac.uk/people/staff/rjchapma/etc/zeta2.pdf
Example: let X be a random variable with |X| < 1 almost surely, and let Xₙ = Xⁿ. For every ε ∈ (0, 1),

\[ P(|X^n - 0| > \varepsilon) = P\bigl(X < -\varepsilon^{1/n} \,\cup\, X > \varepsilon^{1/n}\bigr) \to P(X \le -1 \,\cup\, X \ge 1) = 0, \]

so Xₙ →P 0.
Example (the "typewriter" sequence): on Ω = [0, 1] with the uniform probability, let X(ω) = ω and define

\[ X_1(\omega) = \omega + I_{[0,1]}(\omega), \qquad X_2(\omega) = \omega + I_{[0,1/2]}(\omega), \qquad X_3(\omega) = \omega + I_{[1/2,1]}(\omega), \]
\[ X_4(\omega) = \omega + I_{[0,1/3]}(\omega), \qquad X_5(\omega) = \omega + I_{[1/3,2/3]}(\omega), \qquad X_6(\omega) = \omega + I_{[2/3,1]}(\omega), \qquad \ldots \]

The length of the intervals tends to zero, so Xₙ → X in probability (and in mean square); but every ω falls in infinitely many of the intervals, so Xₙ(ω) does not converge to X(ω) for any ω, and there is no almost sure convergence.
Example: if Mₙ is the maximum of n i.i.d. U(0, 1) random variables, then for every ε ∈ (0, 1)

\[ P(|M_n - 1| > \varepsilon) = P(M_n < 1-\varepsilon) = (1-\varepsilon)^n \to 0, \]

so Mₙ →P 1.
37
Weak Law of Large Numbers: let X₁, X₂, ... be i.i.d. random variables with mean μ and variance σ² < ∞, and let Sₙ = Σ_{i=1}^n Xᵢ; then, for every ε > 0,

\[ \lim_{n\to\infty} P(|S_n/n - \mu| > \varepsilon) = 0, \]

that is Sₙ/n →P μ.

Proof: for every ε > 0, we use Chebyshev's inequality:

\[ P(|S_n/n - \mu| > \varepsilon) \le \frac{E[(S_n/n - \mu)^2]}{\varepsilon^2} = \frac{Var[S_n/n]}{\varepsilon^2} = \frac{Var[S_n]}{n^2\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0. \]
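The Chebyshev bound in the proof can be compared with simulated exceedance frequencies for, say, U(0, 1) summands (μ = 1/2, σ² = 1/12; ε, the sample sizes and the number of repetitions are all arbitrary illustrative choices):

```python
import random

random.seed(3)
mu, sigma2, eps, reps = 0.5, 1 / 12, 0.05, 5_000

freqs, bounds = [], []
for n in (10, 100, 1000):
    # empirical frequency of {|S_n/n - mu| > eps}
    exceed = sum(
        abs(sum(random.random() for _ in range(n)) / n - mu) > eps
        for _ in range(reps)
    )
    freqs.append(exceed / reps)
    bounds.append(sigma2 / (n * eps ** 2))   # Chebyshev bound sigma^2/(n eps^2)

print(freqs)    # decreasing towards 0
print(bounds)   # roughly [3.33, 0.33, 0.033]
```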
Strong Law of Large Numbers: under the same assumptions, Sₙ/n →a.s. μ, that is Sₙ/n converges to μ almost surely.
Proof: here we give the proof under the additional assumption that E[Xᵢ⁴] < ∞. In that case, note that

\[ E[(S_n/n - \mu)^4] = \frac{1}{n^4}\, E\left[ \left( \sum_{i=1}^{n} (X_i - \mu) \right)^4 \right]. \]

Note that this is a rather humongous sum. Justify (exercise) that it is equal to

\[ \frac{1}{n^4} \left\{ n\, E[(X - \mu)^4] + 3n(n-1) \bigl( E[(X - \mu)^2] \bigr)^2 \right\}. \]

Note that this expression can be bounded by C n^{-2} for some C > 0 which does not depend on n. Using Chebyshev's inequality with g(x) = x⁴ we have that for ε > 0

\[ P(|S_n/n - \mu| \ge \varepsilon) \le \frac{E[(S_n/n - \mu)^4]}{\varepsilon^4} \le \frac{C}{\varepsilon^4 n^2}. \]

Since 1/n² is summable, we deduce from Borel–Cantelli 1 that Sₙ/n →a.s. μ (to see why, reconsider the example above of almost sure convergence but not convergence in mean square).
On the assumptions: for the proofs of the weak and strong law above we have used the assumption of finite second and fourth moments, respectively. This is in fact stronger than what is needed. A sufficient condition is the weaker assumption E[|X|] < ∞; the proof is much more demanding though.

The Strong Law of Large Numbers implies the Weak Law of Large Numbers and also convergence in distribution Sₙ/n →d μ, which can be interpreted as convergence to the degenerate distribution with all of the mass concentrated at the single value μ. We shall soon see that, just as in the case of the sum of Bernoulli random variables at the beginning, we can say a lot more about the limiting distribution of Sₙ by proper rescaling. To be more specific, since Sₙ/n − μ converges to zero and since Var(Sₙ/n − μ) = σ²/n, a scaling with factor √n, i.e. √n(Sₙ/n − μ), seems promising. This is the subject of the next section.
38
In this section we state and prove the fundamental result in probability and statistics, namely that the normalised sample mean from an i.i.d. sample (with finite variance) converges to a standard normal distribution. We shall make use of moment generating functions and the following result from the theory of so-called Laplace transforms.

Convergence of mgfs (Theorem 2.3.12 in Casella & Berger): if Xₙ is a sequence of random variables with moment generating functions satisfying

\[ \lim_{n\to\infty} M_{X_n}(t) = M_X(t) \]

for all t in a neighbourhood of 0, where M_X is the mgf of a random variable X, then Xₙ →d X.
Central Limit Theorem: let X₁, X₂, ... be i.i.d. random variables whose mgf exists in a neighbourhood of 0, with mean μ and variance σ², and let Sₙ = Σ_{i=1}^n Xᵢ. We can state the convergence in distribution as

\[ \lim_{n\to\infty} P\left( \frac{\sqrt{n}\,(S_n/n - \mu)}{\sigma} \le \alpha \right) = \int_{-\infty}^{\alpha} \frac{e^{-z^2/2}}{\sqrt{2\pi}}\,dz, \qquad \alpha \in \mathbb{R}. \]

Notice that both μ and σ² exist and are finite since the mgf exists in a neighbourhood of 0.
Proof: Define Yᵢ = (Xᵢ − μ)/σ; then

\[ \frac{\sqrt{n}\,(S_n/n - \mu)}{\sigma} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i, \]

and the mgf of Yᵢ exists for t in some neighbourhood of 0 (we shall take t sufficiently small from now on). Given that the Yᵢ are i.i.d.,

\[
M_{\sqrt{n}(S_n/n-\mu)/\sigma}(t) = M_{n^{-1/2}\sum_{i=1}^n Y_i}(t) = E\left[ \prod_{i=1}^{n} \exp\bigl(tY_i/\sqrt{n}\bigr) \right] = \prod_{i=1}^{n} E\Bigl[ \exp\bigl(tY_i/\sqrt{n}\bigr) \Bigr] = \left[ M_{Y_i}\Bigl(\frac{t}{\sqrt{n}}\Bigr) \right]^n.
\]
By expanding in Taylor series around t = 0, we have

\[ M_{Y_i}\Bigl(\frac{t}{\sqrt{n}}\Bigr) = \sum_{k=0}^{\infty} E[Y_i^k]\, \frac{(t/\sqrt{n})^k}{k!}, \]

and, since E[Yᵢ] = 0 and E[Yᵢ²] = 1,

\[ M_{Y_i}\Bigl(\frac{t}{\sqrt{n}}\Bigr) = 1 + \frac{(t/\sqrt{n})^2}{2} + o\left( \frac{t^2}{n} \right), \]
where the last term is the remainder term in the Taylor expansion such that

\[ \lim_{n\to\infty} \frac{o[(t/\sqrt{n})^2]}{(t/\sqrt{n})^2} = 0. \]

Since t is fixed we also have

\[ \lim_{n\to\infty} \frac{o[(t/\sqrt{n})^2]}{(1/\sqrt{n})^2} = \lim_{n\to\infty} n\, o\left[ \Bigl(\frac{t}{\sqrt{n}}\Bigr)^2 \right] = 0, \]

thus

\[
\lim_{n\to\infty} M_{\sqrt{n}(S_n/n-\mu)/\sigma}(t) = \lim_{n\to\infty} \left[ M_{Y_i}\Bigl(\frac{t}{\sqrt{n}}\Bigr) \right]^n = \lim_{n\to\infty} \left\{ 1 + \frac{1}{n}\left( \frac{t^2}{2} + n\, o\left[ \Bigl(\frac{t}{\sqrt{n}}\Bigr)^2 \right] \right) \right\}^n = e^{t^2/2},
\]

which is the mgf of a standard normal random variable. Therefore, by uniqueness of the moment generating function, √n(Sₙ/n − μ)/σ converges in distribution to a standard normal random variable.
On the assumptions:

1. we can relax the assumption of finite variances; it is enough to have Xᵢ that are small with respect to Sₙ, which can be assured by imposing the conditions of asymptotic negligibility due to Lyapunov and Lindeberg;

2. independence can also be relaxed by asking for asymptotic independence;

3. the assumption on the existence of moment generating functions can be dropped, and a similar proof can be given in terms of the so-called characteristic function, defined similarly to the moment generating function by

\[ \phi(t) := E[e^{itX}] \quad \text{for } t \in \mathbb{R}. \]
Reading
Casella and Berger, Section 5.5.