Module 2 (Updated)
2 Probability Distributions
Learning objectives: At the end of this module, the student should be able to:
4. define and illustrate expectations, variance and standard deviation of a random variable
and discuss their properties,
7. discuss raw moments, central moments and moment generating functions (mgf ),
8. write the distributions of discrete and continuous random variables in notation form
indicating the appropriate parameters,
9. state the pmf/pdf, expectation, variance and mgf of random variables with discrete and
continuous distributions, and
10. solve problems in Computer Science involving discrete and continuous distributions.
In the previous module, you learned the constructs of probability: its foundations, its
properties, and how to apply them in solving probabilistic problems. Here, you will learn
different probability distributions, their properties, and how they are used in practice.
The concept of probability distributions always begins with the idea of random variables. You
can take its meaning literally and you would probably be right, but as with every mathematical
concept, we have to define it formally: a random variable X is a function that assigns a real
number X(ω) to each outcome ω of an experiment.
The domain of a random variable is the sample space Ω. Its range can be the set of all
real numbers R, or its subsets such as (0, +∞), the integers Z, or the interval (0, 1), depending
on what possible values the random variable can potentially take.
Once an experiment is completed and the outcome ω is known, the value of the random
variable X(ω) becomes determined.
Example:
3. Consider an experiment of tossing 3 fair coins and counting the number of heads.
Certainly, the same model suits the number of girls in a family with 3 children, the
number of 1s in a random binary code consisting of 3 characters, etc. Let X3 be the
number of heads. Prior to an experiment, its value is not known. All we can say is
that X3 has to be an integer between 0 and 3. Since assuming each value is an event,
P(X3 = 0) = P(three tails) = P({TTT}) = (1/2)(1/2)(1/2) = 1/8
P(X3 = 1) = P({HTT}) + P({THT}) + P({TTH}) = 3/8
P(X3 = 2) = P({HHT}) + P({HTH}) + P({THH}) = 3/8
P(X3 = 3) = P({HHH}) = 1/8.
Summarizing,
x P (X3 = x)
0 1/8
1 3/8
2 3/8
3 1/8
Total 1
This table contains everything that is known about the random variable X3 prior to the
experiment. Generally, before we know the outcome ω, we cannot tell what X equals.
However, we can list all possible values of X and determine the corresponding probabilities.
With this example in mind, we can now formally define a probability distribution.
Example:
1. Fair coin toss experiment. Let X1 be the number of heads in a single toss of a fair coin.
x P (X1 = x)
0 1/2
1 1/2
Total 1
2. Biased coin toss experiment. The coin is biased in such a way that the tail is twice as
likely to occur as the head.
x P (X4 = x)
0 1/3
1 2/3
Total 1
3. Fair die roll. Let X2 be the number of dots on the top face of a fair die.
x P (X2 = x)
1 1/6
2 1/6
3 1/6
4 1/6
5 1/6
6 1/6
Total 1
The examples shown above are actually collections of all probabilities related to a random
variable X, referred to as the distribution of X. Moreover, based on the examples, we can
also ask the question: what is the probability that X is less than or equal to a particular x?
We simply take the "cumulative" sum of all probabilities for values less than or equal to x.
For example, the probability that X2 is less than or equal to 4 is
F(4) := P(X2 ⩽ 4) = 4/6 = 2/3.
x P (X2 ⩽ x)
1 1/6
2 1/3
3 1/2
4 2/3
5 5/6
6 1
For every outcome ω, the variable X takes one and only one value x. This makes the events
{X = x} mutually exclusive and exhaustive, therefore
∑_x P(X = x) = 1.
We can also conclude that the cdf F(x) is a non-decreasing function of x, always between
0 and 1, with
lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.
Between any two subsequent values of X, F(x) is constant. It jumps by P(X = x) at each
possible value x of X.
Recall that one way to compute the probability of an event is to add the probabilities of
all the outcomes in it. Hence, for any set A,
P(X ∈ A) = ∑_{x∈A} P(X = x).
When A is an interval, its probability can also be computed directly from the cdf F(x); for
example, P(a < X ⩽ b) = F(b) − F(a).
Example: A program consists of two modules. The number of errors X1 in the first module
has the pmf p1(x), and the number of errors X2 in the second module has the pmf p2(x),
independently of X1, where
x p1 (x) p2 (x)
0 0.5 0.7
1 0.3 0.2
2 0.1 0.1
3 0.1 0
Find the pmf of Y = X1 + X2, the total number of errors in the program.
Answer: Since the modules are independent, we add the probabilities of all pairs (x1, x2)
with x1 + x2 = y. For example,
pY(0) = p1(0)p2(0) = (0.5)(0.7) = 0.35
Similarly, pY(1) = 0.31, pY(2) = 0.18, pY(3) = 0.12, pY(4) = 0.03, and pY(5) = 0.01.
Now check:
∑_{y=0}^{5} pY(y) = 0.35 + 0.31 + 0.18 + 0.12 + 0.03 + 0.01 = 1,
thus we probably counted all the possibilities and did not miss any (we just wanted to
emphasize that simply getting ∑ pY(y) = 1 does not guarantee that we made no mistake in
our solution; however, if this equality is not satisfied, we have a mistake for sure).
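The same convolution is easy to verify numerically. The short R sketch below assumes the pmfs p1 and p2 from the table above.
```r
p1 <- c(0.5, 0.3, 0.1, 0.1)   # p1(0), ..., p1(3)
p2 <- c(0.7, 0.2, 0.1, 0.0)   # p2(0), ..., p2(3)
pY <- rep(0, 7)               # Y = X1 + X2 can range over 0, ..., 6
for (x1 in 0:3) {
  for (x2 in 0:3) {
    pY[x1 + x2 + 1] <- pY[x1 + x2 + 1] + p1[x1 + 1] * p2[x2 + 1]
  }
}
round(pY, 2)   # 0.35 0.31 0.18 0.12 0.03 0.01 0.00
sum(pY)        # 1
```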
2. Draw a graph of its probability mass function and cumulative distribution function.
Recall that a random variable is a function whose range is contained in R. Since R contains
other number systems as subsets, some random variables take values only in such a subset,
for example N or N ∪ {0}. There are two classifications of random variables: discrete and
continuous. So far, we have been dealing with discrete random variables.
Definition 2.4. Discrete random variables are random variables whose range is finite
or countable.
In particular, it means that their values can be listed, or arranged in a sequence. Examples
include the number of jobs submitted to a printer, the number of errors, the number of
error-free modules, the number of failed components, and so on. Discrete variables don’t
have to be integers. For example, the proportion of defective components in a lot of 100 can
be 0, 1/100, . . . , 99/100, or 1. This variable assumes 101 different values, so it is discrete,
although not an integer. On the other hand, if the range is uncountable, the random variable
is said to be continuous.
In addition, the name probability mass function is solely for probability functions of
discrete random variables.
Definition 2.5. For a discrete random variable, the probability mass function for x denoted
by fX (x) is defined as fX (x) = P (X = x).
∑_{x=−∞}^{∞} fX(x) = 1.
P(X ⩽ k) = ∑_{x=−∞}^{⌊k⌋} fX(x).
P(a < X ⩽ b) = P(X ⩽ b) − P(X ⩽ a)
= ∑_{x=−∞}^{⌊b⌋} fX(x) − ∑_{x=−∞}^{⌊a⌋} fX(x)
= ∑_{x=a+1}^{⌊b⌋} fX(x)
P(a ⩽ X ⩽ b) = P(X ⩽ b) − P(X ⩽ a − 1).
P(X ⩾ x) = 1 − P(X ⩽ x − 1).
Example:
1. Show that f(x) = (1/2)^x, x = 1, 2, 3, . . . , is a valid pmf, then find P(X ⩽ 10).
Answer: For |r| < 1, the geometric series gives
∑_{k=1}^{∞} r^k = r / (1 − r).
Hence,
∑_{x=1}^{∞} (1/2)^x = (1/2) / (1 − 1/2) = 1.
Therefore, f is a pmf.
Next, to find P(X ⩽ 10), recall that a geometric series with a finite upper limit sums to
∑_{k=1}^{n} r^k = (r − r^(n+1)) / (1 − r).
Hence,
P(X ⩽ 10) = ∑_{x=1}^{10} (1/2)^x = [(1/2) − (1/2)^11] / (1 − 1/2) = 1023/1024.
2. A tetrahedron (4-sided die) is rolled twice. Let X be the larger of the 2 outcomes if
they are different and the common value if they are the same. Find its pmf and its cdf.
Answer: Counting the 16 equally likely ordered outcomes of the two rolls gives the pmf
x f (x)
1 1/16
2 3/16
3 5/16
4 7/16
or, in closed form,
fX(x) = (2x − 1)/16, x = 1, 2, 3, 4.
The cdf is
x F (x)
1 1/16
2 4/16
3 9/16
4 1
or, in closed form,
FX(x) = x²/16, x = 1, 2, 3, 4.
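For a quick numerical check, one can enumerate the 16 equally likely ordered outcomes in R; this is just a sketch of the counting argument above.
```r
outcomes <- expand.grid(first = 1:4, second = 1:4)   # all 16 ordered pairs
x <- pmax(outcomes$first, outcomes$second)           # larger of the two rolls
pmf <- table(x) / 16
pmf           # 1/16, 3/16, 5/16, 7/16
cumsum(pmf)   # 1/16, 4/16, 9/16, 1
```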
Exercise: A fair coin is tossed 3 times. A player wins $1 if the first toss is a head, but loses $1
if the first toss is a tail. Similarly, the player wins $2 if the second toss is a head and loses
$2 if the second toss is a tail, and the same rule applies to the third toss. Let the random
variable X be the total winnings after 3 tosses. Find the pmf and the cdf.
On the other hand, another type of random variable arises when the range is uncountable.
Definition 2.6. Continuous random variables are random variables whose range is
uncountable.
Probability functions of continuous random variables are called probability distribution
functions, or pdfs.
Definition 2.7. For a continuous random variable X, the probability distribution function
fX(x) is the function satisfying P(X ∈ (a, b)) = ∫_a^b fX(x) dx for any interval (a, b).
The definition means that probability for a crv is defined on an interval, not at a point.
The following are the properties of a probability distribution function:
1. The total probability over all non-overlapping intervals of values of X is 1, that is,
∫_{−∞}^{∞} fX(x) dx = 1.
This also means that the entire area under the pdf curve f is 1.
2. P(X ⩽ k) = ∫_{−∞}^{k} fX(x) dx.
3. P(X > x) = 1 − P(X < x) = 1 − ∫_{−∞}^{x} fX(x) dx.
Such random variables assume a whole interval of values. This could be a bounded
interval (a, b), or an unbounded interval such as (a, ∞), (−∞, b), or (−∞, ∞). Sometimes,
it may be a union of several such intervals. Intervals are uncountable, therefore, all values
of a random variable cannot be listed in this case. Examples of continuous variables include
various times (software installation time, code execution time, connection time, waiting
time, lifetime), also physical variables like weight, height, voltage, temperature, distance,
the number of miles per gallon, etc.
Example:
1. For comparison, observe that a long jump is formally a continuous random variable
because an athlete can jump any distance within some range. Results of a high jump,
however, are discrete because the bar can only be placed on a finite number of heights.
2. A job is sent to a printer. Let X be the waiting time before the job starts printing.
With some probability, this job appears first in line and starts printing immediately,
X = 0. It is also possible that the job is first in line but it takes 20 seconds for the
printer to warm up, in which case X = 20. So far, the variable has a discrete behavior
with a positive pmf P(x) at x = 0 and x = 20. However, if there are other jobs in
the queue, then X depends on the time it takes to print them, which is a continuous
random variable. Using popular jargon, besides the "point masses" at x = 0 and x = 20,
the variable has a continuous component, taking values in (0, ∞).
Often we deal with several random variables simultaneously. We may look at the size of a
RAM and the speed of a CPU, the price of a computer and its capacity, temperature and
humidity, technical and artistic performance, etc.
Definition 2.8. If X and Y are random variables, then the pair (X, Y ) is a random vector.
Its distribution is called the joint distribution of X and Y . Individual distributions of X
and Y are then called marginal distributions.
Although we talk about two random variables here, all the concepts extend to a vector
(X1 , X2 , . . . , Xn ) of n components and its joint distribution. Similarly to a single variable,
the joint distribution of a vector is a collection of probabilities for a vector (X, Y ) to take a
value (x, y). Recall that two vectors are equal, (X, Y) = (x, y), if X = x and Y = y. Note that
this "and" means the intersection of events, therefore the joint probability mass function
(jpmf) of X and Y gives the probabilities P(X = x, Y = y) = P({X = x} ∩ {Y = y}).
For ease of notation and without loss of generality, let us discuss joint distributions in
the discrete sense, and we begin with this definition of the joint probability mass function.
Definition 2.9. The joint probability mass function denoted by fX,Y (x, y) is given by
probability values
pij := P(X = xi, Y = yj) ⩾ 0 and ∑_i ∑_j pij = 1.
To illustrate, we can show the jpmf of X and Y by using a table. In the univariate sense,
x     P(X = xi)
x1    f(x1)
x2    f(x2)
...   ...
We know that ∑_i f(xi) = 1. Extending this to the jpmf, we have
            Y
X     y1    y2    ...   yn    ...
x1    p11   p12   ...   p1n   ...
x2    p21   p22   ...   p2n   ...
...   ...   ...   ...   ...   ...
xm    pm1   pm2   ...   pmn   ...
...   ...   ...   ...   ...   ...
Definition 2.10. The joint cumulative distribution function denoted by FX,Y (x, y) is
given by
F(x, y) = ∑_{i: xi ⩽ x} ∑_{j: yj ⩽ y} pij.
The definition is then again an extension from the definition of cdf of the univariate case.
Next, we find what we call the marginal distributions of X and Y from the jpmf fX,Y (x, y).
Definition 2.11. Let F(x, y) and f(x, y) be the jcdf and jpmf of the random variables X
and Y. The marginal pmfs of X and Y are given by
P(X = xi) =: pi+ = ∑_j pij and P(Y = yj) =: p+j = ∑_i pij.
This means that pi+ is simply the sum of the ith row and p+j is the sum of the j th column.
To illustrate, the marginal distribution of X is
x     fX(x) = pi+
x1    p1+
x2    p2+
...   ...
and the marginal distribution of Y is
y     fY(y) = p+j
y1    p+1
y2    p+2
...   ...
Since fX(x) and fY(y) are pmfs themselves, it must be that
∑_i pi+ = ∑_j p+j = 1.
Remark: The joint distribution cannot be computed from marginal distributions because
they carry no information about interrelations between random variables. For example,
marginal distributions cannot tell whether variables X and Y are independent or dependent.
fZ(1) = P(X = 0 ∩ Y = 1) + P(X = 1 ∩ Y = 0)
It is a good check to verify that ∑_z fZ(z) = 1.
Answer: To decide on the independence of X and Y, check if their joint pmf factors into
a product of marginal pmfs. We see that fX,Y(0, 0) = 0.2 indeed equals fX(0)fY(0) =
(0.5)(0.4). Keep checking. Next, fX,Y(0, 1) = 0.2, whereas fX(0)fY(1) = (0.5)(0.3) =
0.15. There is no need to check further. We found a pair of x and y that violates the
formula for independent random variables. Therefore, the numbers of errors in the two
modules are dependent.
Since a jpmf involves two random variables X and Y, whose values correspond to events, it
also makes sense to ask for the conditional distribution of X given Y, or vice versa.
p_{i|Y=yj} := P(X = xi, Y = yj) / P(Y = yj).
Note that this equation is derived from the definition of the conditional probability of A given
B, which is P(A|B) = P(A ∩ B)/P(B).
Example: Two cards are drawn without replacement from an ordinary deck; the random
variable X measures the number of hearts drawn and Y the number of clubs drawn. Find
the joint pmf of X and Y, the conditional distribution of X given Y = 1, and determine
whether X and Y are independent.
Answer:
        Y = 0                                Y = 1                                Y = 2
X = 0   (26C2)(13C0)(13C0)/52C2 = 25/102     (26C1)(13C0)(13C1)/52C2 = 13/51      (26C0)(13C0)(13C2)/52C2 = 1/17
X = 1   (26C1)(13C1)(13C0)/52C2 = 13/51      (26C0)(13C1)(13C1)/52C2 = 13/102     0
X = 2   (26C0)(13C2)(13C0)/52C2 = 1/17       0                                    0
Answer: First, P(Y = 1) = 13/51 + 13/102 + 0 = 13/34. Then
x    p_{i|Y=1}
0    P(X = 0, Y = 1)/P(Y = 1) = (13/51)/(13/34) = 2/3
1    P(X = 1, Y = 1)/P(Y = 1) = (13/102)/(13/34) = 1/3
2    P(X = 2, Y = 1)/P(Y = 1) = 0
Answer: We take p11 . Is it equal to p1+ p+1 ? Since 13/102 ̸= 13/34 × 13/34, then X
and Y are not independent.
Exercise: Consider two random variables X and Y with jpmf given in the table below.
        Y = 0   Y = 1   Y = 2
X = 0   1/6     1/4     1/8
X = 1   1/8     1/6     1/6
1. Find P (X = 0, Y ⩽ 1).
The distribution of a random variable or a random vector, the full collection of related
probabilities, contains the entire information about its behavior. This detailed information
can be summarized in a few vital characteristics describing the average value, the most
likely value of a random variable, its spread, variability, etc. The most commonly used are
the expectation, variance, standard deviation, covariance, and correlation, introduced in this
section.
We know that X can take different values with different probabilities. For this reason,
its average value is not just the average of all its values. Rather, it is a weighted average.
Example:
1. Consider a variable that takes values 0 and 1 with probabilities P (0) = P (1) = 0.5.
That is,
X = 0 with probability 1/2; X = 1 with probability 1/2.
Observing this variable many times, we shall see X = 0 about 50% of times and X = 1
about 50% of times. The average value of X will then be close to 0.5, so it is reasonable
to have E[X] = 0.5.
2. Suppose that P(0) = 0.75 and P(1) = 0.25. Then, in the long run, X equals 1 only
1/4 of the time; otherwise it equals 0. Suppose we earn $1 every time we see X = 1. On
average, we earn $1 every four observations, or $0.25 per observation. Therefore, in
this case E[X] = 0.25.
In a certain sense, expectation is the best forecast of X. The variable itself is random:
it takes different values with different probabilities P(x). At the same time, it has just one
expectation E[X], which is a non-random number.
Definition 2.15 (Mathematical Expectation). Let X be a drv with pmf fX(x), and let h(·)
be a real-valued function. Then the (mathematical) expectation of h(X), denoted by
E[h(X)], is given by
E[h(X)] = ∑_{x=−∞}^{∞} h(x) fX(x).
Remark: Indeed, if h is a one-to-one function, then Y = h(X) takes each value y = h(x) with
probability fX(x), and the formula for E[Y] can be applied directly. If h is not one-to-one,
then some values of h(x) will be repeated in Definition 2.15. However, they are still multiplied
by the corresponding probabilities, and when we add them in Definition 2.15, these probabilities
are also added; thus each value y of h(X) is still multiplied by its total probability fY(y).
We now define special consequences of Definition 2.15.
Definition 2.16. Taking h(X) = X^k gives the kth raw moment of X,
E[X^k] = ∑_{x=−∞}^{∞} x^k fX(x).
Definition 2.17. Setting k = 1 gives the expectation (or mean) of X,
E[X] = ∑_{x=−∞}^{∞} x fX(x).
This is the formula to use to get the expected values in the previous examples.
Definition 2.18. The kth central moment of X is
E[(X − E[X])^k] = ∑_{x=−∞}^{∞} (x − E[X])^k fX(x).
Note that E[X] is not a random variable but a constant. We can set E[X] = µ so that we have
E[(X − µ)^k] = ∑_{x=−∞}^{∞} (x − µ)^k fX(x)
instead.
Definition 2.19. From Definition 2.18, if we set k = 2, then we have the 2nd central moment,
also called the variance of X, given by
Var[X] := E[(X − E[X])²] = ∑_{x=−∞}^{∞} (x − E[X])² fX(x).
Example: Take for example the tetrahedron problem discussed earlier. Let us get the mean
and the variance of the random variable.
E[X] = ∑_{x=1}^{4} x fX(x)
= 1·(1/16) + 2·(3/16) + 3·(5/16) + 4·(7/16)
= 1/16 + 3/8 + 15/16 + 7/4
= 25/8 = 3 1/8.
Answer: We use Definition 2.19 and the answer we got from (1):
Var[X] = (1 − 3 1/8)²(1/16) + (2 − 3 1/8)²(3/16) + (3 − 3 1/8)²(5/16) + (4 − 3 1/8)²(7/16)
= 289/1024 + 243/1024 + 5/1024 + 343/1024
= 880/1024 = 55/64.
Now we discuss relevant properties of expectation. Note that these properties will only
hold true if such expectation E exists.
Property 2.3. If h1 and h2 are functions, then E[h1 (X) + h2 (X)] = E[h1 (X)] + E[h2 (X)]
Example:
1. Let X be a number selected at random from the first 10 positive integers. Assuming
equally likely outcomes, compute E[X(11 − X)].
Answer: Since all outcomes are equally likely to occur, it should be that
f(x) = 1/10, x = 1, 2, . . . , 10.
Then
E[X(11 − X)] = E[11X − X²] = 11E[X] − E[X²].
Hence, we can solve for 11E[X] and E[X²] separately and just get their difference
after.
(a)
E[X] = 1·(1/10) + 2·(1/10) + · · · + 10·(1/10)
= (1 + 2 + · · · + 10)/10
= 5.5
(b)
E[X²] = 1²·(1/10) + 2²·(1/10) + · · · + 10²·(1/10)
= [(10)(11)(21)/6] · (1/10)
= 38.5
Therefore, E[X(11 − X)] = 11(5.5) − 38.5 = 60.5 − 38.5 = 22.
2. Let X be a random variable with pmf
f(x) = (|x| + 1)²/9, x = −1, 0, 1.
Find:
(a) E[X]
(b) E[X²]
(c) E[3X² − 2X + 4]
Property 2.5. Suppose Var[X] = E[(X − µ)²] where E[X] = µ. Then Var[X] = E[X²] −
(E[X])².
Proof.
Var[X] = E[(X − µ)²]
= E[X² − 2µX + µ²]
= E[X²] − 2µE[X] + µ²
= E[X²] − 2µ² + µ²
= E[X²] − µ²
= E[X²] − (E[X])².
Let’s have more properties involving variance. For the following properties, we begin
with two random variables X and Y whose expectations exist and a constant c ∈ R.
Property 2.6. For any constant c, Var[c] = 0.
Proof.
Var[c] = E[(c − E[c])²]
= E[(c − c)²]
= 0.
Property 2.7. Var[cX] = c²Var[X].
Proof.
Var[cX] = E[(cX − E[cX])²]
= E[(cX − cE[X])²]
= E[(c(X − E[X]))²]
= E[c²(X − E[X])²]
= c²E[(X − E[X])²]
= c²Var[X].
The variance Var[X] of a random variable X is often denoted by σ², and its square root σ
is called the standard deviation:
σ =: Std[X] = √Var[X].
If X is measured in some units, then its mean µ has the same measurement unit as X.
Variance σ 2 is measured in squared units, and therefore, it cannot be compared with X or µ.
No matter how funny it sounds, it is rather normal to measure variance of profit in squared
dollars, variance of class enrollment in squared students, and variance of available disk space
in squared gigabytes. When the square root is taken, the resulting standard deviation σ is
again measured in the same units as X. This is the main reason for introducing yet another
measure of variability, σ.
Exercises:
1. There is one error in one of five blocks of a program. To find the error, we test
three randomly selected blocks. Let X be the number of errors in these three blocks.
Compute E[X] and Var[X].
2. Tossing a fair die is an experiment that can result in any integer number from 1 to
6 with equal probabilities. Let X be the number of dots on the top face of a die.
Compute E[X] and Var[X].
3. The number of home runs scored by a certain team in one baseball game is a random
variable with the distribution
x 0 1 2
The team plays 2 games. The number of home runs scored in one game is independent
of the number of home runs in the other game. Let Y be the total number of home
runs. Find E[Y ] and Var[Y ].
4. A computer user tries to recall her password. She knows it can be one of 4 possible
passwords. She tries her passwords until she finds the right one. Let X be the number
of wrong passwords she uses before she finds the right one. Find E[X] and Var[X].
Expectation, variance, and standard deviation characterize the distribution of a single ran-
dom variable. Now we introduce measures of association of two random variables.
The covariance of X and Y is defined as
Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = ∑_x ∑_y f(x, y)(x − E[X])(y − E[Y]).
Moreover, covariance is the expected product of deviations of X and Y from their respective
expectations. If Cov[X, Y ] > 0, then positive deviations X − E[X] are more likely to be
multiplied by positive Y − E[Y ], and negative X − E[X] are more likely to be multiplied by
negative Y − E[Y ]. In short, large X imply large Y , and small X imply small Y .
The following are properties of the covariance.
Property 2.8. Cov[X, X] = Var[X].
Proof.
Cov[X, X] = E[(X − E[X])(X − E[X])]
= E[(X − E[X])²]
= Var[X].
Property 2.9. Cov[X, Y] = E[XY] − E[X]E[Y].
Proof.
Cov[X, Y] = E[(X − E[X])(Y − E[Y])]
= E[XY − X E[Y] − Y E[X] + E[X]E[Y]]
= E[XY] − E[X]E[Y].
Property 2.10. If X and Y are independent, then Cov[X, Y] = 0.
Proof. For independent X and Y, E[XY] = E[X]E[Y], so
Cov[X, Y] = E[XY] − E[X]E[Y]
= E[X]E[Y] − E[X]E[Y]
= 0.
In Property 2.10, we say that X and Y are uncorrelated. We see that independent
variables are always uncorrelated. The reverse is not always true. There exist some variables
that are uncorrelated but not independent.
The correlation coefficient of X and Y is defined as
ρ := Cov[X, Y] / (Std[X] · Std[Y]).
Further, values of ρ near 1 indicate strong positive correlation, values near −1 show
strong negative correlation, and values near 0 show weak correlation or no correlation.
Example: Let us continue the previous example on the numbers of errors in two modules,
and compute
µX = 0.5 and σX² = 0.25.
As a result, we have Var[X] = 0.25, Var[Y] = 2.25 − 1.05² = 1.1475, Std[X] = √0.25 =
0.5, and Std[Y] = √1.1475 = 1.0712. Also,
E[XY] = ∑_x ∑_y x y f(x, y) = (1)(1)(0.1) + (1)(2)(0.1) + (1)(3)(0.1) = 0.6
and
Cov[X, Y] = E[XY] − E[X]E[Y] = 0.6 − (0.5)(1.05) = 0.075,
so that
ρ = Cov[X, Y] / (Std[X] · Std[Y]) = 0.075 / ((0.5)(1.0712)) = 0.14.
Exercises:
Knowing just the expectation and variance, one can find the range of values most likely
taken by a variable. Russian mathematician Pafnuty Chebyshev (1821–1894) showed that
any random variable X with expectation µ = E[X] and variance σ² = Var[X] belongs to
the interval µ ± ε = [µ − ε, µ + ε] with probability of at least 1 − σ²/ε² for any ε > 0.
Theorem 2.1. Let X be a random variable with expectation µ and variance σ². Then for
any ε > 0,
P(|X − µ| ⩾ ε) ⩽ σ²/ε².
Chebyshev’s inequality shows that only a large variance may allow a variable X to differ
significantly from its expectation µ. In this case, the risk of seeing an extremely low or
extremely high value of X increases. For this reason, risk is often measured in terms of a
variance or standard deviation.
There are several ways to prove Chebyshev's inequality, and one of them is via Markov's
inequality.
Theorem 2.2 (Markov’s Inequality). Let X be a nonnegative random variable. Then for
ε > 0,
P(X ⩾ ε) ⩽ E[X]/ε.
Proof. For an integer-valued nonnegative random variable X,
E[X] = ∑_{x=−∞}^{∞} x fX(x)
= ∑_{x=−∞}^{ε−1} x fX(x) + ∑_{x=ε}^{∞} x fX(x)
⩾ ∑_{x=ε}^{∞} x fX(x).
Since x ⩾ ε in the remaining sum,
∑_{x=ε}^{∞} x fX(x) ⩾ ∑_{x=ε}^{∞} ε fX(x) = ε ∑_{x=ε}^{∞} fX(x) = ε · P(X ⩾ ε).
Therefore E[X] ⩾ ε P(X ⩾ ε), which gives P(X ⩾ ε) ⩽ E[X]/ε.
Proof. Let Y = (X − E[X])². Then Y is a nonnegative random variable with expected value
E[Y] = Var[X]. By Markov's Inequality,
P(Y ⩾ ε²) ⩽ E[Y]/ε² = Var[X]/ε².
Since Y ⩾ ε² if and only if |X − µ| ⩾ ε, this is equivalent to
P(|X − µ| ⩾ ε) ⩽ σ²/ε².
Example:
1. Suppose the number of errors in a new software has expectation µ = 20 and a standard
deviation of 2. According to Chebyshev’s Inequality, there are more than 30 errors with
probability
P(X > 30) ⩽ P(|X − 20| > 10) ⩽ (2/10)² = 0.04.
However, if the standard deviation is 5 instead of 2, then the probability of more than
30 errors can only be bounded by (5/10)² = 0.25.
2. Chebyshev’s inequality shows that in general, higher variance implies higher probabili-
ties of large deviations, and this increases the risk for a random variable to take values
far from its expectation. This finds a number of immediate applications. Here we
focus on evaluating risks of financial deals, allocating funds, and constructing optimal
portfolios. This application is intuitively simple. The same methods can be used for
the optimal allocation of computer memory, CPU time, customer support, or other
resources.
As an example, suppose we would like to invest $10, 000 into shares of companies
XX and YY. Shares of XX cost $20 per share. The market analysis shows that their
expected return is $1 per share with a standard deviation of $0.5. Shares of YY cost
$50 per share, with an expected return of $2.50 and a standard deviation of $1 per
share, and returns from the two companies are independent. In order to maximize the
expected return and minimize the risk (standard deviation or variance), is it better
to invest (A) all $10, 000 into XX, (B) all $10, 000 into YY, or (C) $5, 000 in each
company?
Answer: Let X be the actual (random) return from each share of XX, and Y be the
actual return from each share of YY. Compute the expectation and variance of the
return for each of the proposed portfolios (A, B, and C).
(a) At $20 apiece, we can use $10,000 to buy 500 shares of XX, collecting a profit of
A = 500X. Using Property 2.2 and Property 2.7,
E[A] = 500 E[X] = (500)(1) = 500 and Var[A] = 500² Var[X] = 500²(0.5)² = 62,500.
(b) Investing all $10,000 into YY, we buy 10,000/50 = 200 shares of it and collect a
profit of B = 200Y, so that
E[B] = 200 E[Y] = (200)(2.50) = 500 and Var[B] = 200² Var[Y] = 200²(1)² = 40,000.
(c) Investing $5,000 into each company makes a portfolio consisting of 250 shares
of XX and 100 shares of YY; the profit in this case will be C = 250X + 100Y.
Since X and Y are independent, the variances add, so that
E[C] = 250 E[X] + 100 E[Y] = 250 + 250 = 500 and
Var[C] = 250² Var[X] + 100² Var[Y] = 250²(0.5)² + 100²(1)² = 25,625.
The expected return is the same for each of the proposed three portfolios because each
share of each company is expected to return 1/20 or 2.50/50, which is 5%. In terms of
the expected return, all three portfolios are equivalent. Portfolio C, where investment
is split between two companies, has the lowest variance; therefore, it is the least risky.
This supports one of the basic principles in finance: to minimize the risk, diversify the
portfolio.
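A small R sketch of the comparison, assuming the per-share returns and standard deviations quoted above, makes the variance ranking explicit.
```r
EX <- 1;   VarX <- 0.5^2    # XX: expected return and variance per share
EY <- 2.5; VarY <- 1^2      # YY: expected return and variance per share
# (A) 500 shares of XX, (B) 200 shares of YY, (C) 250 of XX and 100 of YY
c(E = 500 * EX,            Var = 500^2 * VarX)                   # 500, 62500
c(E = 200 * EY,            Var = 200^2 * VarY)                   # 500, 40000
c(E = 250 * EX + 100 * EY, Var = 250^2 * VarX + 100^2 * VarY)    # 500, 25625
```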
Next, we introduce the most commonly used families of discrete distributions. Amazingly,
absolutely different phenomena can be adequately described by the same mathematical model,
or family of distributions. For example, as we shall see below, the number of virus attacks,
received e-mails, error messages, network blackouts, telephone calls, traffic accidents,
earthquakes, and so on can all be modeled by the same Poisson family of distributions.
Perhaps the easiest discrete distribution to understand, the discrete uniform distribution
assumes that all outcomes are equally likely to occur. It means that all outcomes have equal
chances of being chosen.
Definition 2.23. Let X represent an outcome of an experiment with n outcomes, all with
equal chances of occurrence. Then X follows a discrete uniform distribution, denoted by
X ∼ DU(n), with pmf given by
fX(x) = 1/n, x = 1, 2, . . . , n.
Example: Suppose we throw an unbiased die. Let X represent the outcome of throwing
the die. Then x = 1, 2, 3, 4, 5, 6 and f (x) = 1/6 for all values of x.
If X ∼ DU(n), does it satisfy ∑_x fX(x) = 1? We need to show that the sum of the
probabilities equals 1:
∑_{x=−∞}^{∞} fX(x) = ∑_{x=1}^{n} 1/n
= 1/n + 1/n + · · · + 1/n   (n terms)
= n(1/n)
= 1.
Hence, if X ∼ DU(n), then ∑_x fX(x) = 1.
Property 2.11. If X ∼ DU(n), then E[X] = (n + 1)/2.
Property 2.12. If X ∼ DU(n), then Var[X] = (n² − 1)/12.
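As a quick check of Properties 2.11 and 2.12 for the fair-die example, one can compute the mean and variance directly in R.
```r
n <- 6
x <- 1:n
EX <- sum(x * (1/n));            EX     # 3.5    = (n + 1)/2
VarX <- sum((x - EX)^2 * (1/n)); VarX   # 2.9167 = (n^2 - 1)/12
```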
Suppose we have an experiment with only two outcomes: success and failure. In that
experiment, you are allowed to do n trials. Let p denote the probability of success and
q := 1 − p the probability of failure. Among the n trials, suppose k are successes, where
k ⩽ n; the remaining n − k trials are then failures. By the Multiplication Principle, the
probability that a particular such sequence occurs is
p^k q^(n−k).
However, there are nCk ways in which the k successes can be arranged among the n trials,
hence the probability that there are k successes after n trials is given by
nCk p^k q^(n−k).
Definition 2.24. Let X represent the number of successes after n trials, where p is the
probability of success and q the probability of failure. Then X follows a binomial
distribution, denoted by X ∼ Bi(n, p), with pmf given by
fX(x) = nCx p^x q^(n−x)
where x = 0, 1, . . . , n.
Example: Suppose an unbiased coin will be tossed 5 times. What is the probability of
having 2 heads?
Answer: The values of the parameters are n = 5 and p = 1/2, and the experiment is
denoted by X ∼ Bi(5, 1/2). Hence,
fX(2) = 5C2 (1/2)² (1/2)³ = 10/32 = 5/16.
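The same value is returned by R's built-in binomial pmf and cdf functions.
```r
dbinom(2, size = 5, prob = 1/2)   # P(X = 2)  = 5/16 = 0.3125
pbinom(2, size = 5, prob = 1/2)   # P(X <= 2) = 0.5, for comparison
```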
Property 2.13. If X ∼ Bi(n, p), then E[X] = np.
Proof.
E[X] = ∑_{x=0}^{n} x · nCx p^x q^(n−x)
= ∑_{x=1}^{n} x · nCx p^x q^(n−x)
= ∑_{x=1}^{n} x · [n!/((n − x)! x!)] p^x q^(n−x)
= ∑_{x=1}^{n} [n!/((n − x)!(x − 1)!)] p^x q^(n−x)
= np ∑_{x=1}^{n} [(n − 1)!/((n − x)!(x − 1)!)] p^(x−1) q^(n−x)
Writing n − x = (n − 1) − (x − 1),
= np ∑_{x=1}^{n} [(n − 1)!/(((n − 1) − (x − 1))! (x − 1)!)] p^(x−1) q^((n−1)−(x−1))
Letting m = n − 1 and y = x − 1,
= np ∑_{y=0}^{m} [m!/((m − y)! y!)] p^y q^(m−y)
= np ∑_{y=0}^{m} mCy p^y q^(m−y).
By the Binomial Theorem, the last sum equals (p + q)^m = [p + (1 − p)]^m = 1^m = 1, therefore
E[X] = np.
Property 2.14. If X ∼ Bi(n, p), then Var[X] = npq.
Proof. Note that Var[X] = E[X²] − (E[X])². Since we already know E[X], what is left for
us to find is E[X²]. Here, we use a trick: we first find E[X(X − 1)] and use it to find E[X²],
since
E[X(X − 1)] = E[X² − X] = E[X²] − E[X], so that E[X²] = E[X(X − 1)] + E[X].
Solving,
E[X(X − 1)] = ∑_{x=0}^{n} x(x − 1) · nCx p^x q^(n−x)
= ∑_{x=2}^{n} x(x − 1) · [n!/((n − x)! x!)] p^x q^(n−x)
= ∑_{x=2}^{n} [n!/((n − x)!(x − 2)!)] p^x q^(n−x)
= n(n − 1)p² ∑_{x=2}^{n} [(n − 2)!/((n − x)!(x − 2)!)] p^(x−2) q^(n−x)
Writing n − x = (n − 2) − (x − 2) and letting m = n − 2, y = x − 2,
= n(n − 1)p² ∑_{y=0}^{m} [m!/((m − y)! y!)] p^y q^(m−y)
= n(n − 1)p² ∑_{y=0}^{m} mCy p^y q^(m−y)
= n(n − 1)p².
Therefore,
Var[X] = E[X(X − 1)] + E[X] − (E[X])²
= n(n − 1)p² + np − (np)²
= np(np − p + 1 − np)
= np(1 − p)
= npq.
Definition 2.25. Let X represent the number of successes after performing the experiment a
single time. Then X is said to follow a Bernoulli distribution, denoted by X ∼ Be(p),
whose pmf is given by
fX(x) = p^x q^(1−x), x = 0, 1.
Answer:
fX(1) = (2/3)^1 (1/3)^(1−1) = 2/3.
Property 2.15. The sum of n independent Bernoulli trials has a binomial distribution, i.e.,
if X1, X2, . . . , Xn is a sequence of independent Be(p) random variables, then
X1 + X2 + · · · + Xn ∼ Bi(n, p).
The next distribution is related to a concept of rare events, or Poissonian events. Essentially
it means that two such events are extremely unlikely to occur simultaneously or within a
very short period of time. Arrivals of jobs, telephone calls, e-mail messages, traffic accidents,
network blackouts, virus attacks, errors in software, floods, and earthquakes are examples of
rare events.
Definition 2.26. Let X represent the number of occurrences per unit of space/time, and let
λ be the average number of occurrences per unit of space/time. Then X follows a Poisson
distribution, denoted by X ∼ Po(λ), with pmf
fX(x) = e^(−λ) λ^x / x!
where x = 0, 1, 2, . . .
To show that X ∼ Po(λ) satisfies ∑_x fX(x) = 1, we first introduce a special series.
Property 2.17. For any real number x,
e^x = ∑_{n=0}^{∞} x^n / n!.
By Property 2.17, ∑_x fX(x) = 1 is satisfied when X ∼ Po(λ), since
∑_{x=0}^{∞} e^(−λ) λ^x / x! = e^(−λ) ∑_{x=0}^{∞} λ^x / x!
= e^(−λ) e^λ
= 1.
Example: Customers of an internet service provider initiate new accounts at the average
rate of 10 accounts per day.
1. What is the probability that more than 8 new accounts will be initiated today?
Answer: New account initiations qualify as rare events because no two customers open
accounts simultaneously. Then the number X of today’s new accounts has Poisson
distribution with parameter λ = 10. Hence,
P(X > 8) = 1 − P(X ⩽ 8) = 1 − 0.3328 = 0.6672.
2. What is the probability that more than 16 accounts will be initiated within 2 days?
Answer: The number of accounts, Y , opened within 2 days does not equal 2X. Rather,
Y is another Poisson random variable whose parameter equals 20. Indeed, the param-
eter is the average number of rare events, which, over the period of two days, doubles
the one-day average. Hence, with λ = 20,
P(Y > 16) = 1 − P(Y ⩽ 16) ≈ 1 − 0.221 = 0.779.
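Both probabilities can be obtained from R's Poisson cdf; this sketch assumes the rates λ = 10 and λ = 20 used above.
```r
1 - ppois(8, lambda = 10)    # P(X > 8)  with X ~ Po(10), about 0.667
1 - ppois(16, lambda = 20)   # P(Y > 16) with Y ~ Po(20), about 0.779
```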
Property 2.18. If X ∼ Po(λ), then E[X] = λ.
Proof.
E[X] = ∑_{x=0}^{∞} x e^(−λ) λ^x / x!
= ∑_{x=1}^{∞} x e^(−λ) λ^x / x!
= λ e^(−λ) ∑_{x=1}^{∞} x λ^(x−1) / x!
= λ e^(−λ) ∑_{x=1}^{∞} λ^(x−1) / (x − 1)!
Letting y = x − 1,
= λ e^(−λ) ∑_{y=0}^{∞} λ^y / y!
= λ e^(−λ) e^λ
= λ.
Property 2.19. If X ∼ Po(λ), then Var[X] = λ.
Proof. The proof is left as an exercise. Hint: solve for E[X(X − 1)] first to get E[X²].
Definition 2.27. Let X be the number of independent Bernoulli trials performed until (and
including) the first success. Then X is said to follow the geometric distribution, denoted by
X ∼ Ge(p), with pmf
fX(x) = q^(x−1) p
where x = 1, 2, . . .
Example: Find the probability that it will take 6 tosses to get a head in a coin toss, where
a head is 5 times less likely to occur than a tail.
Answer: Here p = 1/6, so
fX(6) = (5/6)^5 (1/6) = 3125/46656 ≈ 0.067.
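The same probability can be checked in R. Note that R's geometric functions count the number of failures before the first success, so "first head on toss 6" corresponds to 5 failures.
```r
p <- 1/6
(1 - p)^5 * p        # direct formula: 3125/46656 = 0.067
dgeom(5, prob = p)   # same value from R's parameterization
```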
To get the expectation and the variance of a geometric distribution, we first introduce a
special series.
Property 2.20. For |x| < 1, the function 1/(1 − x)³ can be expanded as the series
1 + 3x + 6x² + 10x³ + . . .
Property 2.21. If X ∼ Ge(p), then E[X] = 1/p.
Proof.
E[X] = ∑_{x=1}^{∞} x q^(x−1) p
= p + 2qp + 3q²p + . . .   (1)
Multiplying (1) by q gives qE[X] = qp + 2q²p + 3q³p + . . . , and subtracting this from (1),
pE[X] = (1 − q)E[X] = p + qp + q²p + q³p + . . .
E[X] = 1 + q + q² + q³ + . . .
= ∑_{x=0}^{∞} q^x   (geometric series)
= 1/(1 − q)
= 1/p.
Property 2.22. If X ∼ Ge(p), then Var[X] = q/p².
Proof.
E[X²] = ∑_{x=1}^{∞} x² q^(x−1) p
= p + 4qp + 9q²p + 16q³p + . . .
= p(1 + 4q + 9q² + 16q³ + . . . )
= p[(1 + 3q + 6q² + 10q³ + . . . ) + (q + 3q² + 6q³ + . . . )]
= p(1 − q)^(−3)(1 + q)
= p · p^(−3)(2 − p)
= (2 − p)/p².
Therefore,
Var[X] = E[X²] − (E[X])² = (2 − p)/p² − 1/p² = (1 − p)/p² = q/p².
In a sequence of independent Bernoulli trials, the number of trials needed to obtain n
successes has a negative binomial distribution.
Definition 2.28. Let X be the number of trials until the nth success. Then X follows a
negative binomial distribution denoted by X ∼ N B(n, p) with pmf
fX(x) = (x−1)C(n−1) p^n q^(x−n)
where x = n, n + 1, n + 2, . . .
Example:
1. Find the probability that it will take 5 tosses to get 2 heads in a coin toss.
Answer:
fX(5) = (5−1)C(2−1) (1/2)² (1/2)³ = 4(1/32) = 1/8.
2. Suppose that, in a recent production, 5% of certain electronic components are defective,
independently of one another. Components are tested one by one until 12 non-defective
components are found. What is the probability that more than 15 components will have
to be tested?
Answer: Let X be the number of components tested until 12 non-defective ones are
found. It is the number of trials needed to see 12 successes, hence X has a negative
binomial distribution with n = 12 and p = 0.95.
We need P(X > 15) = ∑_{x=16}^{∞} fX(x), or 1 − F(15); however, applying the formula for
fX(x) directly is rather cumbersome. What would be a quick solution?
The event {X > 15} means that more than 15 trials are needed to obtain 12 successes,
i.e., there are fewer than 12 non-defective components among the first 15 tested. Hence,
letting Y ∼ Bi(15, 0.95) be the number of non-defective components among the first 15,
P(X > 15) = P(Y < 12) = P(Y ⩽ 11) ≈ 0.0055.
This technique, expressing a probability about one random variable in terms of another
random variable, is rather useful. Soon it will help us relate Gamma and Poisson
distributions and simplify computations significantly.
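Both routes to this probability are available in R; note that R's negative binomial functions count failures before the n-th success rather than total trials, so {X > 15} corresponds to more than 3 failures.
```r
pbinom(11, size = 15, prob = 0.95)       # P(Y <= 11), about 0.0055
1 - pnbinom(3, size = 12, prob = 0.95)   # same probability via the negative binomial
```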
Property 2.23. The sum of n independent geometric random variables has a negative binomial
distribution, i.e., if X1, X2, . . . , Xn is a sequence of independent Ge(p) random variables, then
X1 + X2 + · · · + Xn ∼ NB(n, p).
Consequently, if X ∼ NB(n, p), then
E[X] = E[X1] + · · · + E[Xn]
= 1/p + · · · + 1/p   (n terms)
= n/p.
Definition 2.29. Suppose a sample of n items is selected at random, without replacement,
from a population of M items of which k are of a special kind. Let X be the number of
special items in the sample. Then X is said to follow a hypergeometric distribution with pmf
fX(x) = [kCx · (M−k)C(n−x)] / MCn
for the admissible values of x.
Example: A producer supplies microprocessors to a manufacturer (consumer) of electronic
equipment. The microprocessors are supplied in batches of 50. The consumer regards a
batch as acceptable provided that there are not more than 5 defective microprocessors in
the batch. Rather than test all of the microprocessors in the batch, 10 are selected at
random and tested.
1. Find the probability that out of a sample of 10, d = 0, 1, 2, 3, 4, 5 are defective when
there are actually 5 defective microprocessors in the batch.
Answer: Let X be the number of defectives in the sample of 10. Then
P(X = d) = (45C(10−d) × 5Cd) / 50C10.
Hence,
P(X = 0) = (45C10 × 5C0)/50C10 = 0.311     P(X = 1) = (45C9 × 5C1)/50C10 = 0.431
P(X = 2) = (45C8 × 5C2)/50C10 = 0.210      P(X = 3) = (45C7 × 5C3)/50C10 = 0.044
P(X = 4) = (45C6 × 5C4)/50C10 = 0.004      P(X = 5) = (45C5 × 5C5)/50C10 = 0.0001
2. Suppose that the consumer will accept the batch provided that not more than m
defectives are found in the sample of 10.
(a) Find the probability that the batch is accepted when there are 5 defectives in the
batch.
Answer:
∑_{d=0}^{m} P(X = d) = ∑_{d=0}^{m} (45C(10−d) × 5Cd) / 50C10,   m ⩽ 5
(b) Find the probability that the batch is rejected when there are 3 defectives.
Answer:
1 − ∑_{d=0}^{m} P(X = d) = 1 − ∑_{d=0}^{m} (47C(10−d) × 3Cd) / 50C10,   m ⩽ 3
Exercises:
1. The number of computer shutdowns during any month has a Poisson distribution,
averaging 0.25 shutdowns per month.
(a) What is the probability of at least 3 computer shutdowns during the next year?
(b) During the next year, what is the probability that there are at least 3 months (out of
12) with exactly 1 computer shutdown in each?
2. A lab network consisting of 20 computers was attacked by a computer virus. This virus
enters each computer with probability 0.4, independently of other computers. Find the
probability that it entered at least 10 computers.
3. Suppose a gambler draws a bridge hand (13 cards) at random from an ordinary deck.
(b) If the gambler has k black cards, what is the probability that there are at least 3
red cards?
4. Based on a basketball player's record, the probability that he makes a 3-point shot
is 0.65.
(a) Find the probability that it will take him at most 10 attempts to make 5 3-point
shots.
(b) Find the probability that he can shoot at least 6 3-point shots after 10 shots.
The probability density functions of these distributions are described by formulas that
depend on some parameter values. The expectations and variances of the distributions are
specified in terms of these parameters. The probability values associated with these
continuous distributions are sometimes straightforward to calculate, although some
distributions require the use of a software package.
As in the discrete case, varieties of phenomena can be described by relatively few families
of continuous distributions. Here, we shall discuss Uniform, Exponential, Gamma, and
Normal families.
Uniform distribution plays a unique role in stochastic modeling. A random variable with
any thinkable distribution can be generated from a Uniform random variable. Many com-
puter languages and software are equipped with a random number generator that produces
random variates. Users can convert them into variables with desired distributions and use
them for computer simulation of various events and processes.
Definition 2.30. The uniform distribution has a constant density. On the interval
(a, b) ⊂ R, its density equals
f(x) = 1/(b − a),   a < x < b.
Note that |b − a| has to be a finite number. Hence, there does not exist a uniform
distribution on the entire real line. In other words, if you are asked to choose a random
number from (−∞, ∞), you cannot do it uniformly.
Examples:
2. Let X denote the waiting time at a bus stop, which is uniformly distributed between
1 and 12 minutes. What is the probability of waiting at most 8 minutes? Here
f(x) = 1/(12 − 1) = 1/11,   1 ⩽ x ⩽ 12,
and since the density is constant, the probability is the area of a rectangle:
P(X ⩽ 8) = base · height = (8 − 1)(1/11) = 7/11 ≈ 0.6364.
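The same probability, together with the mean and variance of this uniform distribution, can be obtained in R.
```r
punif(8, min = 1, max = 12)        # P(X <= 8) = 7/11 = 0.6364
c((1 + 12) / 2, (12 - 1)^2 / 12)   # E[X] = 6.5, Var[X] = 10.08
```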
Property 2.26. If X ∼ U(a, b), then E[X] = (a + b)/2 and Var[X] = (b − a)²/12.
Exercises:
(a) E[X]
(c) P (0 ⩽ X ⩽ 4)
2. A new battery supposedly with a charge of 1.5 volts actually has a voltage with a
uniform distribution between 1.43 and 1.60 volts.
(c) What is the probability that a battery has a voltage less than 1.48 volts?
(a) If 20 random numbers are generated, what are the expectation and variance of the
number that lie in each of the four intervals [0.00, 0.30), [0.30, 0.50), [0.50, 0.75),
and [0.75, 1.00)?
(b) What is the probability that exactly five numbers lie in each of the four intervals?
Definition 2.31. The exponential distribution has state space x ⩾ 0 and is often used
to model failure times, waiting times, and interarrival times. If X follows such a distribution,
it is denoted by X ∼ ExP(λ), and it has probability distribution function
f(x) = λe^(−λx)
and cumulative distribution function
F(x) = 1 − e^(−λx).
Property 2.27. If X ∼ ExP(λ), then E[X] = 1/λ and Var[X] = 1/λ².
Example:
1. An engineer examines the edges of steel girders for hairline fractures. The girders are
10 m long, and it is discovered that they have an average of 42 fractures each. If a
girder has 42 fractures, then there are 43 “gaps” between fractures or between the ends
of the girder and the adjacent fractures. The average length of these gaps is therefore
10/43 = 0.23 m. The fractures appear to be randomly spaced on the girders, so the
engineer proposes that the location of fractures on a particular girder can be modeled
by a Poisson process with
λ = 1/0.23 = 4.3.
According to this model, the length of a gap between any two adjacent fractures has
an exponential distribution with λ = 4.3. In this case, the probability that a gap is
less than 10 cm long is
P(X ⩽ 0.10) = 1 − e^(−4.3×0.10) = 1 − e^(−0.43) ≈ 0.35.
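This probability matches R's exponential cdf with rate λ = 4.3.
```r
pexp(0.10, rate = 4.3)   # P(gap < 0.10 m), about 0.349
1 - exp(-4.3 * 0.10)     # same value from the cdf formula
```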
2. The engineer in charge of the car panel manufacturing process pays particular attention
to the arrival of metal sheets at the beginning of the panel construction lines. These
metal sheets are brought one by one from other parts of the factory floor, where they
have been cut into the required sizes. On average, about 96 metal sheets are delivered
to the panel construction lines in 1 hour. The engineer decides to model the arrival of
the metal sheets with a Poisson process. The average waiting time between arrivals is
60/96 = 0.625 minute, so a value of
λ = 1/0.625 = 1.6
is used. This model assumes that the waiting times between arrivals of metal sheets
are independently distributed as exponential distributions with λ = 1.6. For example,
the probability that there is a wait of more than 3 minutes between arrivals is
P(X > 3) = e^(−1.6×3) = e^(−4.8) ≈ 0.008.
Exercise: Suppose that you are waiting for a friend to call you and that the time you wait
in minutes has an exponential distribution with parameter λ = 0.1.
2. What is the probability that you will wait longer than 10 minutes?
3. What is the probability that you will wait less than 5 minutes?
4. Suppose that after 5 minutes you are still waiting for the call. What is the distribution
of your additional waiting time? In this case, what is the probability that your total
waiting time is longer than 15 minutes?
5. Suppose now that the time you wait in minutes for the call has a U (0, 20) distribution.
What is the expectation of your waiting time? If after 5 minutes you are still waiting
for the call, what is the distribution of your additional waiting time?
In this section, we discuss the normal or Gaussian distribution. It is the most important of all
continuous probability distributions and is used extensively as the basis for many statistical
inference methods. Its importance stems from the fact that it is a natural probability
distribution for directly modeling error distributions and many other naturally occurring
phenomena.
Definition 2.32. The normal (or Gaussian) distribution has probability density function
f(x) = [1/(σ√(2π))] exp(−(x − µ)²/(2σ²))
for x ∈ (−∞, ∞), depending upon two parameters, the mean and the variance
of the distribution. The probability density function is a bell-shaped curve that is symmetric
about µ. The notation
X ∼ N (µ, σ 2 )
denotes that the random variable X has a normal distribution with mean µ and variance σ 2 .
In addition, the random variable X can be referred to as being “normally distributed”.
The probability density function of a normal random variable is symmetric about the
value µ and has what is known as a “bell-shaped” curve. The figure above shows the
probability density functions of normal distributions with different values for µ and σ and
notice how the shape changes when the parameters µ and σ change.
Definition 2.33. A normal distribution with mean µ = 0 and variance σ² = 1 is known as the
standard normal distribution. Its probability density function has the notation ϕ(x) and
is given by
ϕ(x) = (1/√(2π)) exp(−x²/2)
for x ∈ (−∞, ∞). The notation Φ(x) is used for the cdf of a standard normal distribution,
which is calculated from the expression
Φ(x) = ∫_{−∞}^{x} ϕ(y) dy.
The symmetry of the standard normal distribution about 0 implies that if the random
variable Z has a standard normal distribution, then
Φ(x) + Φ(−x) = 1.
A very important general result is that if X ∼ N (µ, σ 2 ) then the transformed random
variable
Z = (X − µ)/σ
has a standard normal distribution. This result indicates that any normal distribution can
be related to the standard normal distribution by appropriate scaling and location changes.
Notice that the transformation operates by first subtracting the mean value µ and then by
dividing by the standard deviation σ. The random variable Z is known as the “standardized”
version of the random variable X.
A consequence of this result is that the probability values of any normal distribution
can be related to the probability values of a standard normal distribution and, in particular,
to the cdf Φ(x). For example,
P(a ⩽ X ⩽ b) = P((a − µ)/σ ⩽ (X − µ)/σ ⩽ (b − µ)/σ)
= P((a − µ)/σ ⩽ Z ⩽ (b − µ)/σ)
= Φ((b − µ)/σ) − Φ((a − µ)/σ).
Example: Suppose that X ∼ N(3, 4), so that µ = 3 and σ = 2.
1.
P(X ⩽ 6) = P(−∞ < X ⩽ 6)
= Φ((6 − 3)/2) − Φ(−∞)
= Φ(1.5) − 0
= 0.9332
2.
P(2 ⩽ X ⩽ 5.4) = Φ((5.4 − 3.0)/2.0) − Φ((2.0 − 3.0)/2.0)
= Φ(1.2) − Φ(−0.5)
= 0.8849 − 0.3085
= 0.5764.
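Both computations can be reproduced with R's normal cdf pnorm, either by standardizing first or by passing the mean and standard deviation directly.
```r
pnorm((6 - 3) / 2)                                          # Phi(1.5) = 0.9332
pnorm(6, mean = 3, sd = 2)                                  # same, without standardizing
pnorm(5.4, mean = 3, sd = 2) - pnorm(2, mean = 3, sd = 2)   # 0.5764
```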
Definition 2.34 (Empirical Rule). For a normally distributed random variable:
• There is a probability of about 68% that it takes a value within one standard deviation
of its mean.
• There is a probability of about 95% that it takes a value within two standard deviations
of its mean.
• There is a probability of about 99.7% that it takes a value within three standard
deviations of its mean.
Percentiles of a normal distribution can also be obtained from standard normal percentiles
zα, since
P(X ⩽ µ + σzα) = P(Z ⩽ zα) = 1 − α.
For example, since the 95th percentile of the standard normal distribution is z0.05 = 1.645,
the 95th percentile of a N(3, 4) distribution is
µ + σ z0.05 = 3 + 2(1.645) = 6.29.
Example:
1. A company manufactures concrete blocks that are used for construction purposes.
Suppose that the weights of the individual concrete blocks are normally distributed
with a mean value of µ = 11.0 kg and a standard deviation of σ = 0.3 kg. The
probability that a concrete block weighs less than 10.5 kg is
P(X ⩽ 10.5) = Φ((10.5 − 11.0)/0.3) − Φ(−∞)
= Φ(−1.67) − 0
= 0.0475.
Consequently, only about 1 in 20 concrete blocks weighs less than 10.5 kg.
2. A Wall Street analyst estimates that the annual return from the stock of company A
can be considered to be an observation from a normal distribution with mean µ = 8.0%
and standard deviation σ = 1.5%. The analyst’s investment choices are based upon the
considerations that any return greater than 5% is “satisfactory” and a return greater
than 10% is “excellent”. The probability that company A’s stock will prove to be
“unsatisfactory” is
P(X ⩽ 5.0) = Φ((5.0 − 8.0)/1.5) − Φ(−∞)
= Φ(−2.0) − 0
= 0.0228
and the probability that company A’s stock will prove to be “excellent” is
P(10.0 ⩽ X) = P(10.0 ⩽ X < ∞)
= Φ((∞ − µ)/σ) − Φ((10.0 − µ)/σ)
= Φ(∞) − Φ((10.0 − 8.0)/1.5)
= 1 − Φ(1.33)
= 1 − 0.9082
= 0.0918.
Exercises:
(a) P (X ⩽ 10.34)
(b) P (X ⩾ 11.98)
2. The amount of sugar contained in 1-kg packets is actually normally distributed with a
mean of µ = 1.03 kg and a standard deviation of σ = 0.014 kg.
(b) If an alternative package-filling machine is used for which the weights of the pack-
ets are normally distributed with a mean µ = 1.05 kg and a standard deviation
of σ = 0.016 kg, does this result in an increase or a decrease on the proportion of
underweight packets?
(c) In each case, what is the expected value of the excess package weight above the
advertised level of 1 kg?
Definition 2.35. The gamma distribution has applications in reliability theory and is
also used in the analysis of Poisson processes. The parameters of this distribution are λ and
k, meaning that if X is a random variable with a gamma distribution, then it is denoted by
X ∼ Ga(λ, k). Its probability density function is
f(x) = λ^k x^(k−1) e^(−λx) / Γ(k),   x > 0.
The function Γ(k) is known as the gamma function. It provides the correct scaling to
ensure that the total area under the probability density function is equal to 1.
Γ(k) = ∫_0^∞ x^(k−1) e^(−x) dx.
Some special cases are Γ(1) = 1 and Γ(1/2) = √π, and in general, Γ(k) = (k − 1)Γ(k − 1).
And if k ∈ N, then Γ(k) = (k − 1)!. Also, notice that if k = 1, the gamma distribution
simplifies to the exponential distribution with parameter λ. The expectation and variance
of a gamma distribution are given in the following property.
Property 2.28. If X ∼ Ga(λ, k), then E[X] = k/λ and Var[X] = k/λ².
The parameter k is often referred to as the shape parameter of the gamma distribution,
and λ is referred to as the scale parameter. Another important property of a gamma
distribution with an integer value of the parameter k is that it can be obtained as the sum
of a set of independent exponential random variables.
This property implies that for a Poisson process with parameter λ, the time taken for
k events to occur has a gamma distribution with parameters k and λ, since the time taken
until the first event occurs, and the times between subsequent events, each have independent
exponential distributions with parameter λ.
Examples:
1. Suppose that the random variable X measures the length between one end of a girder
and the fifth fracture along the girder, as shown in the figure below.
If the fracture locations are modeled by a Poisson process, X has a gamma distribution
with parameters k = 5 and λ = 4.3. The expected distance to the fifth fracture is
therefore
E[X] = k/λ = 5/4.3 = 1.16 m.
We use R to show that the 0.05 quantile point of this distribution is x = 0.458m, so
that
F (0.458) = 0.05.
Consequently, the engineer can be 95% sure that the fifth fracture is at least 46 cm
away from the end of the girder. A software package can also be used to calculate the
probability that the fifth fracture is within 1 m of the end of the girder, which is
F (1) = 0.4296.
It is interesting to note that this latter probability can also be obtained using the
Poisson distribution. The number of fractures within a 1-m section of the girder has a
Poisson distribution with mean
λ × 1 = 4.3.
The probability that the fifth fracture is within 1 m of the end of the girder is the
probability that there are at least five fractures within the first 1 m section, which is
therefore
P (Y ⩾ 5) = 0.4296
where Y ∼ P o(4.3).
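The quantile and the two probabilities quoted in this example come from R's gamma and Poisson functions (R parameterizes the gamma distribution by shape k and rate λ).
```r
qgamma(0.05, shape = 5, rate = 4.3)   # 0.05 quantile, about 0.458 m
pgamma(1, shape = 5, rate = 4.3)      # F(1) = 0.4296
1 - ppois(4, lambda = 4.3)            # P(Y >= 5) with Y ~ Po(4.3): also 0.4296
```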
2. Suppose that the engineer in charge of the car panel manufacturing process is interested
in how long it will take for 20 metal sheets to be delivered to the panel construction
lines. Under the Poisson process model, this time X has a gamma distribution with
parameters k = 20 and λ = 1.6. The expected waiting time is consequently
E[X] = k/λ = 20/1.6 = 12.5 minutes,
and the variance is
Var[X] = k/λ² = 20/1.6² = 7.81,
so that the standard deviation is σ = √7.81 = 2.80 minutes. We use R to show that the
0.95 quantile point of this distribution is about 17.4 minutes.
The engineer can therefore be 95% confident that 20 metal sheets will have arrived
within 18 minutes, say. Furthermore, there is a probability of about 0.82 that they
will all arrive within 15 minutes.
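The corresponding R calls, again using shape k = 20 and rate λ = 1.6, give the quoted quantile and probability.
```r
qgamma(0.95, shape = 20, rate = 1.6)   # 0.95 quantile, about 17.4 minutes
pgamma(15, shape = 20, rate = 1.6)     # P(X <= 15), about 0.82
```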
Exercises:
1. A day’s sales in $1000 units at a gas station have a gamma distribution with parameters
k = 5 and λ = 0.9.
(c) What are the upper and lower quartiles of a day’s sales?
(d) What is the probability that a day’s sales are more than $6000?
2. Suppose that the time in minutes taken by a worker on an assembly line to complete
a particular task has a gamma distribution with parameters k = 44 and λ = 0.7.
(a) What are the expectation and standard deviation of the time taken to complete
the task?
(b) Use a software package to find the probability that the task is completed within
an hour.