Chapter 3
Probability Theory
Probability theory is the branch of mathematics that deals with the analysis of random events. Probability is commonly used to describe the mind's attitude toward statements that we are not sure of. Statements usually take the form "Will a particular event occur?" and the attitude of our minds takes the form "How confident are we that this event will occur?" Our confidence can be described numerically by a value between 0 and 1 that we call probability. The more likely an event is to occur, the more confident we are that it will occur. The focus of this chapter is mainly on probability spaces, random variables, multidimensional probability distributions, expected value, variance, and covariance.
3.1 Probability Space
In this section, we present some important definitions that form the basis of probability theory.
Definition 3.1 A random experiment is an experiment whose outcome is not known
in advance.
Definition 3.2 An event that may or may not occur as a result of a random
experiment is called a random event.
For example, tossing a coin is a random experiment because its outcome is not known in advance. Coming up heads is a random event because it may or may not happen.
Definition 3.3 The set of all possible outcomes of a random experiment associated
with the phenomenon is called the sample space, represented by the symbol Ω. An
individual element ω of Ω is called a sample point.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Amirteimoori et al., Stochastic Benchmarking, International Series in Operations
Research & Management Science 317,
https://doi.org/10.1007/978-3-030-89869-4_3
Definition 3.4 The sample space Ω is discrete if Ω is a finite set or a countably infinite set.
Definition 3.5 The sample space Ω is continuous if Ω is an uncountably infinite set.
Example 3.1
(i) Coin tossing: Ω = {H, T}.
(ii) Rolling one die: Ω = {1, 2, 3, 4, 5, 6}.
(iii) Picking one card at random in a pack of 52: Ω = {1, 2, 3, . . ., 52}.
(iv) An integer-valued random outcome: Ω = {0, 1, 2, . . .}.
(v) A non-negative, real-valued outcome: Ω = ℝ+.
(vi) A random continuous parameter (such as time, weather, price or wealth, temperature, ...): Ω = ℝ.
An event is a collection of outcomes, which is represented by a subset of Ω. We can consider a class F of events, i.e., a class F of subsets of Ω [not necessarily all of the power set of Ω, P(Ω)], as a σ-algebra according to the following definition:
Definition 3.6 A collection F of events is a σ-algebra if it satisfies the following conditions:
(i) ∅ ∈ F;
(ii) For all countable sequences (An)n≥1 such that An ∈ F, n ≥ 1, we have ⋃_{n≥1} An ∈ F;
(iii) A ∈ F ⇒ (Ω∖A) ∈ F.
Since A is an event, we expect its complement A′ to be an event. On the other hand, Ω will also be an event because Ω ⊆ Ω, so we expect ∅ to be an event. Ω is called a "certain event" and ∅ is an "impossible event." Consequently, since F is a collection of subsets of Ω, we call it an "event space."
Example 3.2 Rolling one die: Ω = {1, 2, 3, 4, 5, 6}.
The event A = {1, 3, 5} corresponds to "The result of the experiment is an odd number."
F ≔ {Ω, ∅, {1, 3, 5}, {2, 4, 6}} defines a σ-algebra on Ω which corresponds to the knowledge of the parity of an integer picked at random from 1 to 6.
G ≔ {Ω, ∅, {2, 4, 6}, {2, 4}, {6}, {1, 2, 3, 4, 5}, {1, 3, 5, 6}, {1, 3, 5}} defines a σ-algebra on Ω which is bigger than F and corresponds to the parity information contained in F, completed by the knowledge of whether the outcome is equal to 6 or not.
Definition 3.7 Probability Measure. A probability measure is a mapping ℙ : F → [0, 1] that for each event A ∈ F assigns a value in the interval [0, 1], with the following three axioms:
(a) For any event A ∈ F, ℙ(A) ≥ 0;
(b) ℙ(Ω) = 1;
(c) ℙ(⋃_{n≥1} An) = Σ_{n≥1} ℙ(An), whenever Ai ∩ Ak = ∅, i ≠ k.
These principles were put forward by the Russian mathematician and statistician Andrey Kolmogorov in 1933 as the principles of the probability function and have since been known as the Kolmogorov Principles. The triple (Ω, F, ℙ) is called a probability space. This setting is generally referred to as the Kolmogorov framework. Based on the Kolmogorov Principles, many theorems have been established for the probability function. In the following, some important theorems for the probability function are given:
Theorem 3.1 Let (Ω, F, ℙ) be a probability space. Then, we have
(a) The probability of the event ∅ is zero, i.e. ℙ(∅) = 0.
(b) The probability of the union of two disjoint events is the sum of their probabilities, i.e., ℙ(A ∪ B) = ℙ(A) + ℙ(B), where A and B are two disjoint events.
(c) The probability value for the complement of an event A is equal to ℙ(A′) = 1 − ℙ(A).
(d) If event A is a subset of event B, then ℙ(A) ≤ ℙ(B).
(e) If C is an event created by the union of two events A and B, then ℙ(C) = ℙ(A ∪ B) = ℙ(A) + ℙ(B) − ℙ(A ∩ B).
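The rules of Theorem 3.1 can be checked directly on a small finite sample space. The sketch below uses the uniform measure on a single die roll; the die example and the particular events A and B are our own illustrative choices, not from the text:

```python
from fractions import Fraction

# Equally likely outcomes of one die roll (uniform probability measure).
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return Fraction(len(event & omega), len(omega))

A = {1, 3, 5}   # "odd outcome"
B = {5, 6}      # "outcome is at least 5"

# (c) complement rule: P(A') = 1 - P(A)
assert prob(omega - A) == 1 - prob(A)
# (d) monotonicity: {5} is a subset of B, so P({5}) <= P(B)
assert prob({5}) <= prob(B)
# (e) inclusion-exclusion: P(A u B) = P(A) + P(B) - P(A n B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
print(prob(A | B))  # 2/3
```

Using exact fractions rather than floats makes each identity hold exactly rather than up to rounding error.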
Definition 3.8 Let A and B be two events in a sample space Ω, where ℙ(A) ≠ 0; then the conditional probability of B given A is ℙ(B|A) = ℙ(A ∩ B)/ℙ(A).
It follows immediately that if A and B are two events in the sample space Ω, where ℙ(A) ≠ 0, then ℙ(A ∩ B) = ℙ(A)·ℙ(B|A). Moreover, given that ℙ(B) ≠ 0, since ℙ(A|B) = ℙ(A ∩ B)/ℙ(B), then ℙ(A ∩ B) = ℙ(B)·ℙ(A|B).
Definition 3.9 Two events A and B are independent if and only if ℙ(A ∩ B) = ℙ(A)·ℙ(B).
In other words, two events A and B are independent if the occurrence or
nonoccurrence of one event does not affect the probability of the occurrence or
nonoccurrence of the other event. In many cases, the result of an experiment depends
on what happened at different intermediate stages. The following theorem on the law
of total probability deals with this issue.
Theorem 3.3 Law of Total Probability. Let the events A1, A2, . . ., An be events that constitute a partition of the sample space Ω, where ℙ(Ai) ≠ 0, for i = 1, 2, . . ., n. Then, for any event B in Ω, ℙ(B) = Σ_{i=1}^n ℙ(Ai)ℙ(B|Ai).
In probability theory and statistics, the Bayes theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. For example, if the probability that someone has cancer is related to his/her age, then using the Bayes theorem, the age can be used to assess the probability of cancer more accurately than would be possible without knowledge of the age.
Theorem 3.4 Let the events A1, A2, . . ., An constitute a partition of the sample space Ω, where ℙ(Ai) ≠ 0, for i = 1, 2, . . ., n. Then, for any event B in Ω with ℙ(B) ≠ 0,

ℙ(Ar|B) = ℙ(Ar)·ℙ(B|Ar) / Σ_{i=1}^n ℙ(Ai)ℙ(B|Ai),  r = 1, . . ., n.
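Theorems 3.3 and 3.4 combine naturally in computation: the total-probability sum is exactly the denominator of the Bayes formula. The sketch below uses a hypothetical three-group partition with made-up prior and conditional probabilities, purely for illustration:

```python
# Hypothetical partition A1, A2, A3 of Omega (e.g., three age groups) and an
# event B; all numbers are invented for illustration.
prior = [0.3, 0.5, 0.2]            # P(A1), P(A2), P(A3); sums to 1
likelihood = [0.01, 0.05, 0.20]    # P(B | Ai)

# Law of total probability (Theorem 3.3): P(B) = sum_i P(Ai) P(B | Ai)
p_b = sum(p * l for p, l in zip(prior, likelihood))

# Bayes' theorem (Theorem 3.4): P(Ar | B) = P(Ar) P(B | Ar) / P(B)
posterior = [p * l / p_b for p, l in zip(prior, likelihood)]

assert abs(sum(posterior) - 1.0) < 1e-12   # posteriors form a distribution
print(round(p_b, 3), [round(p, 3) for p in posterior])
# 0.068 [0.044, 0.368, 0.588]
```

Note how conditioning on B shifts most of the probability to A3, the group with the largest likelihood, even though its prior is smallest.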
3.2 Random Variable
Definition 3.10 Let (Ω, F, ℙ) be a probability space. A real-valued random variable on this probability space is a measurable mapping X : (Ω, F) → (ℝ, B(ℝ)), i.e. X⁻¹(G) ∈ F for all G ∈ B(ℝ).
Note that X is a random variable from a probability space Ω into the state space ℝ, which maps each ω ∈ Ω to X(ω) ∈ ℝ.
Definition 3.11 A random variable X is continuous when Ω is continuous and is
discrete when Ω is discrete.
Example 3.3 When we roll two dice, Ω ≔ {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6} = {(1, 1), (1, 2), . . ., (6, 6)}. Consider X : Ω → ℝ with (k, l) ↦ k + l. Then, X is a random variable that gives the sum of the two numbers appearing on the dice.
Definition 3.12 If X is a discrete random variable on (Ω, F, ℙ), the function f(x) = P(X = x) for each x in the range of X is called the probability distribution of X.
Proposition 3.1 Let X be a discrete random variable on (Ω, F, ℙ). Then, f is a probability distribution of X if and only if it satisfies the following conditions:
(a) f(x) ≥ 0, for each value of x;
(b) Σ_x f(x) = 1.
Example 3.4 (Freund et al., 2004) Check whether the function given by f(x) = (x + 2)/25, for x = 1, 2, 3, 4, 5, can serve as the probability distribution of a discrete random variable.
Substituting the possible values into the function gives f(1) = 3/25, f(2) = 4/25, f(3) = 5/25, f(4) = 6/25, f(5) = 7/25. Since these values are all non-negative, the first condition of Proposition 3.1 is satisfied, and since f(1) + f(2) + f(3) + f(4) + f(5) = 3/25 + 4/25 + 5/25 + 6/25 + 7/25 = 1, the second condition is also satisfied. Thus, the given function can serve as the probability distribution of a random variable having the range {1, 2, 3, 4, 5}.
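The two conditions of Proposition 3.1 can be verified mechanically. A minimal sketch for the distribution of Example 3.4, using exact rational arithmetic so the check is not blurred by floating-point rounding:

```python
from fractions import Fraction

# f(x) = (x + 2)/25 for x = 1..5, as in Example 3.4.
f = {x: Fraction(x + 2, 25) for x in range(1, 6)}

assert all(p >= 0 for p in f.values())   # Proposition 3.1, condition (a)
assert sum(f.values()) == 1              # Proposition 3.1, condition (b)
print(sorted(f.items()))
```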
Definition 3.13 Distribution Function. If X is a discrete random variable on (Ω, F, ℙ), the function FX(x) defined by FX(x) = P(X ≤ x) = Σ_{t≤x} f(t), −∞ < x < ∞, is called the distribution function or the cumulative distribution of X, in which f(t) is the value of the probability distribution of X at t.
Proposition 3.2 Let FX be a cumulative distribution function of a discrete random variable X. Then,
(a) FX is a non-decreasing function;
(b) FX(−∞) = 0 and FX(+∞) = 1.
Definition 3.14 Probability density function. A function fX : ℝ → ℝ+ is called a probability density function of the continuous random variable X if and only if P(a ≤ X ≤ b) = ∫_a^b fX(x) dx, for any real constants a and b with a ≤ b.
Probability density functions are also referred to as probability densities, density functions, densities, or pdfs.
Theorem 3.5 If X is a continuous random variable on (Ω, F, ℙ) and a and b are real constants with a ≤ b, then P(a ≤ X ≤ b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a < X < b).
Proposition 3.3 Let X be a continuous random variable on (Ω, F, ℙ). Then fX is a probability density of X if and only if it satisfies the following conditions:
(a) fX(x) ≥ 0, for −∞ < x < ∞;
(b) ∫_{−∞}^{+∞} fX(x) dx = 1.
Example 3.5 (Freund et al., 2004) If X has the probability density

fX(x) = { k·e^{−3x}, for x > 0;  0, otherwise },

find k and P(0.5 ≤ X ≤ 1).
Solution: To satisfy the second condition of Proposition 3.3, we must have ∫_{−∞}^{+∞} fX(x) dx = ∫_0^∞ k·e^{−3x} dx = k·(−e^{−3x}/3)|_0^∞ = k/3 = 1. It follows that k = 3. Moreover, P(0.5 ≤ X ≤ 1) = ∫_{0.5}^1 3e^{−3x} dx = −e^{−3x}|_{0.5}^1 = −e^{−3} + e^{−1.5} = 0.173.
Definition 3.15 Let X be a continuous random variable on (Ω, F, ℙ). The distribution function or the cumulative distribution function (CDF) of X, FX : ℝ → [0, 1], is defined by FX(x) = P(X ≤ x) = ∫_{−∞}^x fX(t) dt, for x ∈ ℝ, where fX(t) is the value of the probability density of X at t.
That is, FX(x) is the probability that the random variable X takes a value less than or equal to x.
Proposition 3.4 Let FX be a cumulative distribution function. Then:
(a) 0 ≤ FX(x) ≤ 1,
(b) FX is a non-decreasing function,
(c) FX(−∞) = lim_{x→−∞} FX(x) = 0 and FX(+∞) = lim_{x→+∞} FX(x) = 1.
Theorem 3.6 If fX(x) and FX(x) are the probability density and cumulative distribution functions of the random variable X, respectively, then for any real constants a and b with a ≤ b, we have P(a < X ≤ b) = FX(b) − FX(a) and fX(x) = dFX(x)/dx.
Remark 3.1 For a continuous random variable X, P(X = a) = 0.
Example 3.6 Find the distribution function of the random variable X in Example 3.5 and use it to re-evaluate P(0.5 ≤ X ≤ 1).
For x > 0, FX(x) = ∫_{−∞}^x fX(t) dt = ∫_0^x 3e^{−3t} dt = −e^{−3t}|_0^x = 1 − e^{−3x}, and since F(x) = 0 for x ≤ 0, we can write

F(x) = { 0, for x ≤ 0;  1 − e^{−3x}, for x > 0 }.

Now, to determine P(0.5 ≤ X ≤ 1), we use Theorem 3.6. So, P(0.5 ≤ X ≤ 1) = F(1) − F(0.5) = (1 − e^{−3}) − (1 − e^{−1.5}) = 0.173.
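Both routes to P(0.5 ≤ X ≤ 1) — the closed-form CDF of Example 3.6 and direct numerical integration of the density of Example 3.5 — can be compared in a few lines. The midpoint-rule integration is our own illustrative check, not part of the text:

```python
import math

# Density and CDF from Examples 3.5 and 3.6: f(x) = 3 e^{-3x} for x > 0.
def f(x):
    return 3 * math.exp(-3 * x) if x > 0 else 0.0

def F(x):
    return 1 - math.exp(-3 * x) if x > 0 else 0.0

# P(0.5 <= X <= 1) via the CDF (Theorem 3.6) ...
p_cdf = F(1) - F(0.5)

# ... and via a simple midpoint-rule integration of the density.
n = 100_000
h = 0.5 / n
p_num = sum(f(0.5 + (i + 0.5) * h) for i in range(n)) * h

assert abs(p_cdf - p_num) < 1e-8
print(round(p_cdf, 3))  # 0.173
```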
In probability theory, a multivariate random variable or random vector is a list of random variables whose values are unknown. For example, while a given person has a specific age, height, and weight, the representation of these features for an unspecified person in a group would be a random vector. Analogously to the one-dimensional random variable, we now define multivariate random variables, or random vectors, as multivariate functions.
Definition 3.16 An n-dimensional random variable or vector X is a (measurable) function from the probability space Ω to ℝⁿ, i.e. X : Ω → ℝⁿ.
Definition 3.17 Joint Probability Distribution. Let X1, X2, . . ., Xn be discrete random variables. The joint probability distribution of X1, X2, . . ., Xn is defined as f_{X1,X2,...,Xn}(x1, x2, . . ., xn) = P(X1 = x1, X2 = x2, . . ., Xn = xn), where xi ∈ Range(Xi), i = 1, . . ., n.
Theorem 3.7 A multi-dimensional function can serve as the joint probability distribution of the discrete random variables X1, X2, . . ., Xn if and only if its values, f_{X1,X2,...,Xn}(x1, x2, . . ., xn), satisfy the following conditions:
(a) f_{X1,X2,...,Xn}(x1, x2, . . ., xn) ≥ 0 for each (x1, x2, . . ., xn) within its domain;
(b) Σ_{x1} Σ_{x2} . . . Σ_{xn} f_{X1,X2,...,Xn}(x1, x2, . . ., xn) = 1, where the multiple summations extend over all possible x1, x2, . . ., xn within its domain.
Definition 3.18 Joint Cumulative Distribution. Let X1, X2, . . ., Xn be discrete random variables. The joint distribution function or the joint cumulative distribution of X1, X2, . . ., Xn is defined as

F_{X1,X2,...,Xn}(x1, x2, . . ., xn) = P(X1 ≤ x1, X2 ≤ x2, . . ., Xn ≤ xn) = Σ_{t1≤x1} Σ_{t2≤x2} . . . Σ_{tn≤xn} f_{X1,X2,...,Xn}(t1, t2, . . ., tn),

for all (x1, x2, . . ., xn) within the range of X1, X2, . . ., Xn.
Definition 3.19 Joint Probability Density Function. A function f_{X1,...,Xn} : ℝⁿ → ℝ+ is called a joint probability density function of the continuous random variables X1, X2, . . ., Xn if and only if

P((X1, X2, . . ., Xn) ∈ A) = ∫∫. . .∫_A f_{X1,...,Xn}(x1, x2, . . ., xn) dx1 dx2 . . . dxn,

for any region A in ℝⁿ.
Theorem 3.8 Let X1, X2, . . ., Xn be continuous random variables on (Ω, F, ℙ). Then, f_{X1,...,Xn} is a joint probability density function of X1, X2, . . ., Xn if and only if it satisfies the following conditions:
(a) f_{X1,X2,...,Xn}(x1, x2, . . ., xn) ≥ 0, for x1, x2, . . ., xn ∈ ℝ;
(b) ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} . . . ∫_{−∞}^{+∞} f_{X1,X2,...,Xn}(x1, x2, . . ., xn) dx1 dx2 . . . dxn = 1.
Definition 3.20 Joint Distribution Function. If X1, X2, . . ., Xn are continuous random variables on (Ω, F, ℙ), then the function given by

F_{X1,X2,...,Xn}(x1, x2, . . ., xn) = P(X1 ≤ x1, X2 ≤ x2, . . ., Xn ≤ xn) = ∫_{−∞}^{x1} ∫_{−∞}^{x2} . . . ∫_{−∞}^{xn} f_{X1,X2,...,Xn}(t1, t2, . . ., tn) dt1 dt2 . . . dtn, for x1, x2, . . ., xn ∈ ℝ,

is called the joint distribution function of X1, X2, . . ., Xn, in which f_{X1,X2,...,Xn}(t1, t2, . . ., tn) is the joint probability density of X1, X2, . . ., Xn at (t1, t2, . . ., tn).
Theorem 3.9
(a) F_{X1,X2,...,Xn}(−∞, −∞, . . ., −∞) = 0;
(b) F_{X1,X2,...,Xn}(∞, ∞, . . ., ∞) = 1;
(c) If a1 < b1, a2 < b2, . . ., an < bn, then F_{X1,X2,...,Xn}(a1, a2, . . ., an) ≤ F_{X1,X2,...,Xn}(b1, b2, . . ., bn);
(d) f_{X1,X2,...,Xn}(x1, x2, . . ., xn) = ∂ⁿ/∂x1 . . . ∂xn F_{X1,X2,...,Xn}(x1, x2, . . ., xn).
Theorem 3.10 Let Xi : i = 1, . . ., n be random variables with probability distributions f_{Xi}(xi), i = 1, . . ., n, and let f_{X1,X2,...,Xn}(x1, x2, . . ., xn) be the joint probability distribution function of X1, X2, . . ., Xn. Then, X1, X2, . . ., Xn are independent if and only if f_{X1,X2,...,Xn}(x1, x2, . . ., xn) = f_{X1}(x1)·f_{X2}(x2)· . . . ·f_{Xn}(xn), for all (x1, x2, . . ., xn) within their range.
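For discrete variables, the factorization criterion of Theorem 3.10 can be checked exhaustively. A small sketch with two fair dice (our own example): the joint pmf assigns 1/36 to every pair, and this equals the product of the two uniform marginals at every point, so the dice are independent:

```python
from fractions import Fraction
from itertools import product

# Joint pmf of two fair dice: every pair (k, l) has probability 1/36.
joint = {(k, l): Fraction(1, 36) for k, l in product(range(1, 7), repeat=2)}

def marginal(axis):
    """Marginal pmf of the first (axis=0) or second (axis=1) die."""
    m = {}
    for (k, l), p in joint.items():
        v = (k, l)[axis]
        m[v] = m.get(v, 0) + p
    return m

mX, mY = marginal(0), marginal(1)

# Theorem 3.10: independence <=> joint = product of marginals everywhere.
assert all(joint[(k, l)] == mX[k] * mY[l] for k, l in joint)
print(mX[3], mY[6])  # 1/6 1/6
```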
3.3 Mathematical Expectation
The expectation, or expected value, of a random variable X is the mean or average
value of X. In practice, expectations can be even more useful than probabilities.
For example, bank deposits or the prices of inputs of the DMUs can be considered
as random variables. In such cases, we usually refer to their expected values rather
than their actual values.
Definition 3.21 Expected Value. If X is a discrete random variable and fX(x) is its probability distribution function, the expected value of X is E(X) = Σ_x x·fX(x). Correspondingly, if X is a continuous random variable and fX(x) is its probability density, the expected value of X is E(X) = ∫_{−∞}^{+∞} x·fX(x) dx.
Theorem 3.11 If X is a discrete random variable and fX(x) is its probability distribution function, the expected value of g(X), as a function of X, is given by E[g(X)] = Σ_x g(x)·fX(x). Correspondingly, if X is a continuous random variable and fX(x) is its probability density function, then the expected value of g(X), as a function of X, is given by E[g(X)] = ∫_{−∞}^{+∞} g(x)·fX(x) dx.
Theorem 3.12 If a and b are constants, then E(aX + b) = aE(X) + b.
Theorem 3.13 If c1, c2, . . ., cn are constants, then E[Σ_{i=1}^n ci gi(X)] = Σ_{i=1}^n ci E[gi(X)].
Theorem 3.14 If X1, X2, . . ., Xn are independent random variables, then E(X1 X2 . . . Xn) = E(X1)E(X2) . . . E(Xn).
Definition 3.22 Variance. Let X be a random variable with a finite expected value, μ. Then, the variance of X, denoted by σ², σX², or var(X), is defined by σ² = E[(X − μ)²].
The positive square root of the variance, σ, is called the standard deviation of X.
Theorem 3.15 σ² = E(X²) − [E(X)]².
Theorem 3.16 If X has the variance σ² and a and b are constants, then var(aX + b) = a²σ².
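Definition 3.21 and the shortcut formula of Theorem 3.15 can be exercised on the distribution of Example 3.4, f(x) = (x + 2)/25 for x = 1, . . ., 5. The sketch below computes the mean and variance exactly and confirms that the shortcut agrees with the defining formula:

```python
from fractions import Fraction

# Discrete distribution from Example 3.4: f(x) = (x + 2)/25, x = 1..5.
f = {x: Fraction(x + 2, 25) for x in range(1, 6)}

mean = sum(x * p for x, p in f.items())                   # E(X)
second_moment = sum(x * x * p for x, p in f.items())      # E(X^2)
var = second_moment - mean ** 2                           # Theorem 3.15

# Definition 3.22: var(X) = E[(X - mu)^2] must give the same value.
assert var == sum((x - mean) ** 2 * p for x, p in f.items())
print(mean, var)  # 17/5 46/25
```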
Here we present Chebyshev's inequality, which enables us to derive bounds on probabilities when only the mean, or the mean and the variance, of the probability distribution are known; it is valid for every distribution of a random variable.
Theorem 3.17 Chebyshev's inequality. If X is a random variable with finite mean μ and variance σ², then for any value k > 0, we have ℙ(|X − μ| ≥ k) ≤ σ²/k².
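A quick Monte Carlo sketch of Chebyshev's inequality, with the uniform distribution on [0, 1] as our own illustrative choice (μ = 0.5, σ² = 1/12): the simulated tail probability must stay below the bound σ²/k², and typically sits well below it, since the bound holds for every distribution:

```python
import random

# Chebyshev's inequality for U ~ uniform[0, 1]: mu = 0.5, sigma^2 = 1/12.
random.seed(0)
mu, var = 0.5, 1 / 12
k = 0.4   # illustrative deviation threshold

n = 200_000
hits = sum(abs(random.random() - mu) >= k for _ in range(n))
empirical = hits / n   # estimates P(|U - mu| >= k); exact value is 0.2

# The bound sigma^2 / k^2 must dominate the true tail probability.
assert empirical <= var / k**2
print(round(empirical, 3), round(var / k**2, 3))
```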
Definition 3.23 Covariance. Let X and Y be two random variables with finite expected values μX and μY, respectively. Then, the covariance of X and Y, denoted by σXY or cov(X, Y), is defined by cov(X, Y) = E[(X − μX)(Y − μY)].
Theorem 3.18 cov(X, Y) = E(XY) − E(X)E(Y).
It can easily be shown that if X and Y are two independent random variables, then cov(X, Y) = 0.
Theorem 3.19 If X1, X2, . . ., Xn are random variables and Y = Σ_{i=1}^n ai Xi, where a1, a2, . . ., an are constants, then

var(Y) = Σ_{i=1}^n ai²·var(Xi) + 2 ΣΣ_{i<j} ai aj·cov(Xi, Xj).

Proof. See Freund et al. (2004).
Corollary 3.1 If X1, X2, . . ., Xn are independent random variables and Y = Σ_{i=1}^n ai Xi, then var(Y) = Σ_{i=1}^n ai²·var(Xi).
Theorem 3.20 If X1, X2, . . ., Xn are random variables and Y1 = Σ_{i=1}^n ai Xi and Y2 = Σ_{i=1}^n bi Xi, where a1, a2, . . ., an, b1, b2, . . ., bn are constants, then

cov(Y1, Y2) = Σ_{i=1}^n ai bi·var(Xi) + ΣΣ_{i<j} (ai bj + aj bi)·cov(Xi, Xj).
Corollary 3.2 If the random variables X1, X2, . . ., Xn are independent, Y1 = Σ_{i=1}^n ai Xi and Y2 = Σ_{i=1}^n bi Xi, then cov(Y1, Y2) = Σ_{i=1}^n ai bi·var(Xi).
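Corollary 3.2 can be verified exactly for two independent dice. In the sketch below, the coefficient vectors a = (2, 3) and b = (1, −1) are our own illustrative choice; the covariance computed from the definition matches Σᵢ aᵢbᵢ·var(Xᵢ):

```python
from fractions import Fraction
from itertools import product

# Two independent fair dice X1, X2 and linear combinations
# Y1 = 2*X1 + 3*X2, Y2 = X1 - X2 (illustrative coefficients).
pmf = {(k, l): Fraction(1, 36) for k, l in product(range(1, 7), repeat=2)}

def E(g):
    """Expectation of g(X1, X2) under the joint pmf."""
    return sum(g(k, l) * p for (k, l), p in pmf.items())

a, b = (2, 3), (1, -1)
y1 = lambda k, l: a[0] * k + a[1] * l
y2 = lambda k, l: b[0] * k + b[1] * l

# Direct covariance via Theorem 3.18: cov = E(Y1 Y2) - E(Y1) E(Y2).
cov_direct = E(lambda k, l: y1(k, l) * y2(k, l)) - E(y1) * E(y2)

var_die = E(lambda k, l: k * k) - E(lambda k, l: k) ** 2   # 35/12
# Corollary 3.2 (independent Xi): cov(Y1, Y2) = sum_i a_i b_i var(Xi).
cov_formula = (a[0] * b[0] + a[1] * b[1]) * var_die

assert cov_direct == cov_formula
print(cov_direct)  # -35/12
```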
Definition 3.24 Correlation Coefficient. The correlation coefficient of two random variables X and Y, denoted by ρ(X, Y), is defined by ρ(X, Y) = cov(X, Y) / (√var(X)·√var(Y)).
It can be shown that −1 ≤ ρ(X, Y) ≤ 1. The correlation coefficient is a measure of the degree of linearity between X and Y. A value of ρ(X, Y) near +1 or −1 indicates a high degree of linearity between X and Y, whereas a value near 0 indicates that such linearity is absent. A positive value of ρ(X, Y) indicates that Y tends to increase when X does, whereas a negative value indicates that Y tends to decrease when X increases. If ρ(X, Y) = 0, then X and Y are said to be uncorrelated.
3.4 Discrete Distributions
In this section, we introduce some of the most commonly used probability distributions. Owing to the importance of the normal and chi-square distributions, these two distributions are studied in more detail.
Definition 3.25 Discrete uniform distribution. A random variable X has a discrete uniform distribution, and is referred to as a discrete uniform random variable, if and only if its probability distribution is given by f(x) = 1/k, for x = x1, x2, . . ., xk.
A discrete uniform random variable takes each of its values with equal probability.
Definition 3.26 Bernoulli distribution. A random variable X has a Bernoulli distribution and it is referred to as a Bernoulli random variable if and only if its probability distribution is given by f(x; p) = p^x (1 − p)^{1−x}, for x = 0, 1.
Theorem 3.21 Let X be a Bernoulli random variable. The mean and variance of X are then μ = p and σ² = p(1 − p), respectively.
Definition 3.27 Binomial distribution. A random variable X has a binomial distribution and it is referred to as a binomial random variable if and only if its probability distribution is given by b(x; n, p) = (n choose x) p^x (1 − p)^{n−x}, for x = 0, 1, . . ., n.
Theorem 3.22 Let X be a binomial random variable. The mean and variance of X are then μ = np and σ² = np(1 − p), respectively.
Definition 3.28 Poisson distribution. A random variable X has a Poisson distribution and it is referred to as a Poisson random variable if and only if its probability distribution is given by p(x; λ) = λ^x e^{−λ}/x!, for x = 0, 1, 2, . . ., where λ is the mean number of successes in the given time interval or region.
Theorem 3.23 Let X be a Poisson random variable. The mean and variance of X are then μ = λ and σ² = λ, respectively.
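The moment formulas of Theorems 3.22 and 3.23 can be checked numerically from the pmfs themselves. The parameter values n = 10, p = 0.3, λ = 4 below are arbitrary illustrative choices:

```python
from math import comb, exp, factorial

# Binomial pmf (Definition 3.27) and Poisson pmf (Definition 3.28).
def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)

n, p = 10, 0.3
mean_b = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
var_b = sum(x * x * binom_pmf(x, n, p) for x in range(n + 1)) - mean_b**2
assert abs(mean_b - n * p) < 1e-9            # Theorem 3.22: mu = np
assert abs(var_b - n * p * (1 - p)) < 1e-9   # Theorem 3.22: np(1-p)

lam = 4.0
# Truncate the infinite Poisson support; the tail beyond 100 is negligible.
mean_p = sum(x * poisson_pmf(x, lam) for x in range(100))
assert abs(mean_p - lam) < 1e-9              # Theorem 3.23: mu = lambda
print(round(mean_b, 6), round(mean_p, 6))
```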
3.5 Continuous Distributions
Definition 3.29 Continuous uniform distribution. A random variable X has a uniform distribution and it is referred to as a continuous uniform random variable if and only if its probability density is given by

f(x) = { 1/(b − a), for a ≤ x ≤ b;  0, elsewhere }.

In other words, the random variable X is uniformly distributed over the interval [a, b].
Theorem 3.24 Let X be a uniform random variable. The mean and variance of X are then μ = (a + b)/2 and σ² = (1/12)(b − a)², respectively.
Definition 3.30 Gamma function. The gamma function of α, denoted by Γ(α), is defined as Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx.
Corollary 3.3 For a positive integer α, Γ(α) = (α − 1)!.
Definition 3.31 Gamma distribution. A random variable X has a gamma distribution and it is referred to as a gamma random variable if and only if its probability density function is given by

g(x; α, β) = { (1/(β^α Γ(α))) x^{α−1} e^{−x/β}, for x > 0;  0, elsewhere },

where α > 0 and β > 0.
Theorem 3.25 Let X be a gamma random variable. The mean and variance of X are then μ = αβ and σ² = αβ², respectively.
Definition 3.32 Exponential distribution. A random variable X has an exponential distribution and it is referred to as an exponential random variable if and only if its probability density is represented by

g(x; θ) = { (1/θ) e^{−x/θ}, for x > 0;  0, elsewhere },

where θ > 0.
Note: The exponential distribution is a special case of the gamma distribution with α = 1 and β = θ.
Theorem 3.26 Let X be an exponential random variable. The mean and variance of X are then μ = θ and σ² = θ², respectively.
Definition 3.33 Weibull distribution. A random variable X has a Weibull distribution and it is referred to as a Weibull random variable if and only if its probability density is represented by

f(x) = { k x^{β−1} e^{−αx^β}, for x > 0;  0, elsewhere },

where α > 0, β > 0, and k = αβ is the constant that makes the density integrate to 1.
Note: The exponential distribution is a special case of the Weibull distribution with β = 1.
Definition 3.34 Beta distribution. A random variable X has a beta distribution and it is referred to as a beta random variable if and only if its probability density is represented by

f(x; α, β) = { (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}, for 0 < x < 1;  0, elsewhere },

where α > 0 and β > 0.
Theorem 3.27 Let X be a beta random variable. The mean and variance of X are then μ = α/(α + β) and σ² = αβ/((α + β)²(α + β + 1)), respectively.
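The mean formula in Theorem 3.27 can be checked by integrating x·f(x; α, β) numerically over (0, 1). The midpoint-rule integration and the parameter choice α = 2, β = 3 below are our own illustrative setup:

```python
from math import gamma

# Beta density (Definition 3.34), using the gamma function Gamma(.).
def beta_pdf(x, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

a, b = 2.0, 3.0
n = 100_000
h = 1.0 / n
# Midpoint-rule approximation of E(X) = integral of x * f(x) over (0, 1).
mean = sum((i + 0.5) * h * beta_pdf((i + 0.5) * h, a, b) for i in range(n)) * h

assert abs(mean - a / (a + b)) < 1e-6   # Theorem 3.27: mu = alpha/(alpha+beta)
print(round(mean, 4))  # 0.4
```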
3.6 The Normal Distribution
In probability theory, the normal distribution is one of the most important statistical
distributions. This distribution is sometimes referred to as the Gaussian distribution
or the Laplace–Gauss distribution. In this section, we briefly summarize the properties of this distribution.
Fig. 3.1 Graph of normal distribution
Definition 3.35 Normal Distribution. A random variable X has a normal distribution with expectation μ and variance σ², i.e., X ~ N(μ, σ²), if and only if its probability density is given by

f(x) = (1/(σ√(2π))) e^{−(1/2)((x−μ)/σ)²}, for x ∈ ℝ,

where σ > 0.
The graph of a normal distribution is shown in Fig. 3.1. It is shaped like the cross-section of a bell. μ and σ are the two parameters that play a key role in the shape of the normal distribution.
Definition 3.36 Standard Normal Distribution. The normal random variable with μ = 0 and σ = 1 is referred to as the standard normal random variable.
Theorem 3.28 If X has a normal distribution with the mean μ and the standard deviation σ, then Z = (X − μ)/σ has a standard normal distribution.
Proof See Freund et al. (2004).
Example 3.7 If X is a normal random variable with μ = 3 and σ = 4, find P(4 ≤ X ≤ 8).
Solution:

P(4 ≤ X ≤ 8) = P((4 − 3)/4 ≤ (X − 3)/4 ≤ (8 − 3)/4)
             = P(0.25 ≤ Z ≤ 1.25)
             = P(Z ≤ 1.25) − P(Z ≤ 0.25)
             = 0.8944 − 0.5987
             = 0.2957.
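Instead of reading Φ(1.25) and Φ(0.25) from a table, the standard normal CDF can be evaluated with the error function, Φ(z) = (1 + erf(z/√2))/2. A sketch re-computing Example 3.7 (the small discrepancy from 0.2957 comes from the four-decimal table rounding):

```python
from math import erf, sqrt

# Standard normal CDF via the error function.
def phi(z):
    return (1 + erf(z / sqrt(2))) / 2

mu, sigma = 3, 4
# Standardize with Theorem 3.28: P(4 <= X <= 8) = Phi(1.25) - Phi(0.25).
p = phi((8 - mu) / sigma) - phi((4 - mu) / sigma)
assert abs(p - 0.2957) < 5e-4
print(round(p, 4))
```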
Definition 3.37 Multivariate Normal Random Variable. Let X = (X1, . . ., Xk) be a k-dimensional random variable. X has a multivariate normal distribution with mean μ and variance–covariance matrix Σ, i.e., X ~ N(μ, Σ), if and only if its probability density is represented by

fX(x1, . . ., xk) = (1/√((2π)^k |Σ|)) e^{−(1/2)(X − μ)^T Σ^{−1} (X − μ)}, for xi ∈ ℝ, i = 1, . . ., k.

Note that μ = E(X) = [E(X1), E(X2), . . ., E(Xk)] and Σ = [Cov(Xi, Xj); 1 ≤ i, j ≤ k].

Fig. 3.2 PDFs of the normal distribution (mean zero) and half-normal distribution
Example 3.8 Suppose Z = (X, Y) is a bivariate normal random variable. Let the mean vector and variance–covariance matrix of Z be μ = (μX, μY) and Σ = [[σX², ρσXσY], [ρσXσY, σY²]], respectively. Then,

f_{X,Y}(x, y) = (1/(2πσXσY√(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) [ (x − μX)²/σX² + (y − μY)²/σY² − 2ρ(x − μX)(y − μY)/(σXσY) ] }.

Note that ρ is the correlation coefficient between X and Y.
Definition 3.38 Half-normal distribution. A random variable X has a half-normal distribution and it is referred to as a half-normal random variable if and only if its probability density function is represented by f(x; σ) = (√2/(σ√π)) exp(−x²/(2σ²)), x > 0.
Let Z follow a normal distribution, i.e. Z ~ N(0, σ²). Then, X = |Z| follows a half-normal distribution. The graph of a normal distribution along with a half-normal distribution is shown in Fig. 3.2.
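The explicit bivariate density of Example 3.8 is exactly the matrix form of Definition 3.37 specialized to k = 2, which can be confirmed numerically. The parameter values and evaluation point below are our own illustrative choices:

```python
from math import exp, pi, sqrt

# Illustrative bivariate normal parameters.
mx, my, sx, sy, rho = 3.0, 4.0, 2.0, 1.0, 0.5

def f_explicit(x, y):
    """Explicit bivariate density of Example 3.8."""
    q = ((x - mx) ** 2 / sx**2 + (y - my) ** 2 / sy**2
         - 2 * rho * (x - mx) * (y - my) / (sx * sy))
    return exp(-q / (2 * (1 - rho**2))) / (2 * pi * sx * sy * sqrt(1 - rho**2))

def f_matrix(x, y):
    """Matrix form of Definition 3.37; the 2x2 Sigma is inverted by hand."""
    a, b2, d = sx**2, rho * sx * sy, sy**2   # Sigma = [[a, b2], [b2, d]]
    det = a * d - b2 * b2
    dx, dy = x - mx, y - my
    quad = (d * dx * dx - 2 * b2 * dx * dy + a * dy * dy) / det
    return exp(-quad / 2) / (2 * pi * sqrt(det))

assert abs(f_explicit(2.0, 4.5) - f_matrix(2.0, 4.5)) < 1e-12
print(f_explicit(2.0, 4.5))
```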
3.7 The Chi-Square Distribution
In probability theory and statistics, the chi-square distribution (χ²-distribution) with k degrees of freedom is the distribution of the sum of the squares of k independent standard normal random variables. Due to the importance of this distribution, we will discuss it here.
Definition 3.39 Chi-Square Distribution. A random variable X has a chi-square distribution, i.e., X ~ χ²(k), and it is referred to as a chi-square random variable if and only if its probability density function is represented by

f(x; k) = { (1/(2^{k/2} Γ(k/2))) x^{k/2 − 1} e^{−x/2}, for x > 0;  0, otherwise },

where k is referred to as the degrees of freedom.
Corollary 3.4 The mean and variance of the chi-square distribution with k degrees of freedom are μ = k and σ² = 2k, respectively.
Theorem 3.29 Suppose X1, X2, . . ., Xn are n independent random variables having standard normal distributions. Then, Y = Σ_{i=1}^n Xi² has a chi-square distribution with n degrees of freedom.
Proof See Freund et al. (2004).
Theorem 3.30 Suppose that X is a k-dimensional random vector, X ~ N(μ, Σ), where Σ is positive definite. Then (X − μ)^T Σ^{−1} (X − μ) follows a chi-square distribution with k degrees of freedom.
Proof See Flury (2013).
Remark 3.2 The curves of the form (X − μ)^T Σ^{−1} (X − μ) = constant > 0 are ellipses, or ellipsoids in higher dimensions.
From Example 3.8, to find an ellipse within which X falls with probability α, we need to set (X − μ)^T Σ^{−1} (X − μ) equal to the α-quantile of the chi-square distribution with two degrees of freedom. If U is a chi-square random variable with two degrees of freedom, then its distribution function is represented by

F(u) = P(U ≤ u) = { 0, for u ≤ 0;  1 − e^{−u/2}, for u > 0 }.

And, therefore, the α-quantile is computed as follows:
Fig. 3.3 Ellipses of the constant density of a bivariate normal distribution. The ellipses represent the regions within which X falls with probability α = 0.1, 0.2, . . ., 0.9
F(u) = α ⇒ 1 − e^{−u/2} = α ⇒ u = −2 log(1 − α).

By choosing different values for α ∈ (0, 1), we will obtain different values for u. Therefore, for each α, we have a quadratic curve in the (x1, x2)-plane. For instance, for μ = (3, 4)^T and Σ = [[2, 1], [1, 1]], we have

(X − μ)^T Σ^{−1} (X − μ) = (x1 − 3, x2 − 4) [[1, −1], [−1, 2]] (x1 − 3, x2 − 4)^T = c²,

where c² = −2 log(1 − α). Figure 3.3 shows the various curves for α = 0.1, 0.2, . . ., 0.9.
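The quantile formula u = −2 log(1 − α) can be checked by a round trip through the two-degree-of-freedom chi-square CDF F(u) = 1 − e^{−u/2}:

```python
from math import log, exp

# alpha-quantile of the chi-square distribution with 2 degrees of freedom.
def quantile(alpha):
    return -2 * log(1 - alpha)

# CDF of the same distribution: F(u) = 1 - e^{-u/2} for u > 0.
def F(u):
    return 1 - exp(-u / 2) if u > 0 else 0.0

# Round trip: F(quantile(alpha)) must recover alpha.
for alpha in [0.1, 0.2, 0.5, 0.9]:
    assert abs(F(quantile(alpha)) - alpha) < 1e-12
print(round(quantile(0.9), 3))  # 4.605
```

These are the c² values that scale the successive ellipses in Fig. 3.3.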
3.8 Sampling Distributions
As we know, the events of a statistical experiment are determined numerically by random variables. The total set of observations surveyed is called the population, and the number of population members is called the size of the population. The observations are values of a random variable, and since each random variable has a probability distribution, each statistical population can be assigned a random variable and thus has a probability distribution. For example, if the random variable corresponding to the observations of a population is a normal random variable, that population is called a normal population.
Since we define a population as a set, it can have a finite or infinite number of members. On the other hand, since we do not have access to all observations in an infinite population, we do not know the distribution of the population, nor its mean and variance. Therefore, we refer to the mean and variance of the population as population parameters, which must be estimated. For this purpose, we need a sample of the population.
Definition 3.40 Random Sample. If X1, X2, . . ., Xn are independent and identically distributed random variables, we say that they constitute a random sample from the infinite population given by their common distribution.
Note: If X1, X2, . . ., Xn are n independent and identically distributed random variables with the same probability function, then the probability distribution of this random sample is f(x1, x2, . . ., xn) = f(x1)f(x2). . .f(xn).
Now, to estimate the population parameters, we first introduce the mean and variance of the sample. Values calculated from a random sample are called statistics, and since these values depend on the sample, and many random samples can be drawn from the population, each statistic is itself a random variable.
Definition 3.41 Statistic. Any function of random sample members that does not
contain unknown parameters is called a statistic.
Definition 3.42 Sample mean and sample variance. If X1, X2, . . ., Xn constitute a random sample, then the sample mean is given by X̄ = Σ_{i=1}^n Xi / n, and the sample variance is given by S² = Σ_{i=1}^n (Xi − X̄)² / (n − 1).
Note: X̄ and S² are both statistics.
Theorem 3.31 If X1, X2, . . ., Xn constitute a random sample from an infinite population with the mean μ and the variance σ², then E(X̄) = μ and var(X̄) = σ²/n.
Theorem 3.32 If X1, X2, . . ., Xn constitute a random sample from an infinite population with the mean μ and the variance σ², then E(S²) = σ².
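Theorems 3.31 and 3.32 can be illustrated by simulation: across many independent samples, the sample mean averages to μ and the (n − 1)-divisor sample variance averages to σ². The population choice (uniform on [0, 1], so μ = 0.5 and σ² = 1/12) is our own illustrative setup:

```python
import random
import statistics

random.seed(1)
n, reps = 10, 20_000

means, variances = [], []
for _ in range(reps):
    sample = [random.random() for _ in range(n)]
    means.append(statistics.fmean(sample))
    variances.append(statistics.variance(sample))   # divides by n - 1

# Theorem 3.31: E(X-bar) = mu = 0.5; Theorem 3.32: E(S^2) = sigma^2 = 1/12.
assert abs(statistics.fmean(means) - 0.5) < 0.01
assert abs(statistics.fmean(variances) - 1 / 12) < 0.01
print(round(statistics.fmean(variances), 3))
```

Dividing by n instead of n − 1 in `statistics.variance` would bias the average low by the factor (n − 1)/n, which is why the n − 1 divisor appears in Definition 3.42.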
3.8.1 Limit Theorems
Limit theorems are among the most important theoretical results in probability theory. These theorems concern the convergence of sequences of random variables or of their distribution functions. Since random variables are functions with random influences, different modes of convergence are involved in a sequence of random variables. The central limit theorem and the law of large numbers are the most important limit theorems, and we introduce them in this section.
Theorem 3.33 The law of large numbers. If X1, X2, . . ., Xn constitute a random sample from an infinite population with the mean μ and variance σ², then for any ε > 0, ℙ(|X̄ − μ| ≥ ε) → 0 as n → ∞.
The law of large numbers states that the sample average converges in probability
toward the expected value. In fact, as the sample size increases, the sample mean gets
closer to the population mean.
Theorem 3.34 The central limit theorem. Let X1, X2, . . ., Xn constitute a random sample from an infinite population with the mean μ and variance σ². Then the distribution of Z = (X̄ − μ)/(σ/√n) as n → ∞ is the standard normal distribution. That is, for −∞ < a < ∞,

ℙ((X̄ − μ)/(σ/√n) ≤ a) → (1/√(2π)) ∫₋∞ᵃ e^(−x²/2) dx as n → ∞.
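The central limit theorem can likewise be checked by simulation. The sketch below is illustrative; the uniform(0, 1) population with μ = 1/2, σ² = 1/12 and the choice n = 30 are arbitrary. It standardizes many sample means and compares the empirical probability ℙ(Z ≤ 1) with Φ(1):

```python
# Standardized sample means of a uniform population are approximately N(0, 1).
import math
import random

random.seed(2)
mu, sigma, n, reps = 0.5, math.sqrt(1 / 12), 30, 20000

zs = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    zs.append((xbar - mu) / (sigma / math.sqrt(n)))  # Z = (Xbar - mu)/(sigma/sqrt(n))

p_hat = sum(z <= 1.0 for z in zs) / reps        # empirical P(Z <= 1)
phi_1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))  # Phi(1) = 0.8413...
```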
3.9 Estimation Theory
In this section, we want to analyze and estimate population parameters using the
statistics presented in the previous section. It is possible to examine population
parameters in several ways. The most common of these is the classical estimation method, which estimates the population parameters directly from a sample of the population.
The parameters of a population can be estimated in two ways: point estimation and interval estimation. Generally, the parameter of the population and the statistic that will be used to examine this parameter are denoted by θ and Θ̂, respectively. The statistic used for the point estimation is called the estimator. Since one parameter may have multiple estimators, we need to know which estimator is better.
Definition 3.43 Unbiased estimator. A statistic Θ̂ is an unbiased estimator of the parameter θ of a given distribution if and only if E(Θ̂) = θ for all possible values of θ.
Note: Θ̂ is called a biased estimator if E(Θ̂) ≠ θ. The bias is then defined as the difference between E(Θ̂) and θ, i.e. bias = E(Θ̂) − θ.
Each parameter may have several unbiased estimators. If we must choose one of
them, we usually take the one whose sampling distribution has the smallest variance.
Example 3.9 If X1, X2, . . ., Xn constitute a random sample from a Bernoulli distribution with success parameter p, then the statistic X̄ = (∑ᵢ₌₁ⁿ Xi)/n is an unbiased estimator of p.

Solution. E(X̄) = E((∑ᵢ₌₁ⁿ Xi)/n) = (1/n) ∑ᵢ₌₁ⁿ E(Xi) = (1/n)(np) = p.
Definition 3.44 Minimum variance unbiased estimator. The estimator for the parameter θ of a given distribution that has the smallest variance of all unbiased estimators for θ is called the minimum variance unbiased estimator, or the best unbiased estimator for θ.
Theorem 3.35 Let Θ̂ be an unbiased estimator of θ with

var(Θ̂) = 1 / (n E[(∂ ln f(X)/∂θ)²]);

then Θ̂ is the unbiased estimator of θ with minimum variance.
The quantity in the denominator is referred to as the information about θ. Thus,
the smaller the variance is, the greater the information.
Note: If Θ̂1 and Θ̂2 are two unbiased estimators of θ, where the variance of Θ̂1 is smaller than the variance of Θ̂2, then we say Θ̂1 is relatively more efficient than Θ̂2. The efficiency of Θ̂1 relative to Θ̂2, denoted eff(Θ̂1, Θ̂2), is defined to be the ratio:

eff(Θ̂1, Θ̂2) = var(Θ̂2)/var(Θ̂1).
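As a numerical illustration of relative efficiency (a sketch, not from the text): for samples from a normal population, both the sample mean and the sample median are unbiased estimators of μ, and the efficiency of the mean relative to the median approaches π/2 ≈ 1.57 in large samples. The population N(0, 1), the sample size, and the replication count below are arbitrary choices.

```python
# Compare the sampling variances of two unbiased estimators of mu:
# eff(mean, median) = var(median) / var(mean), which exceeds 1 for normal data.
import random
import statistics

random.seed(3)
mu, sigma, n, reps = 0.0, 1.0, 101, 5000

means, medians = [], []
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.fmean(x))
    medians.append(statistics.median(x))

var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)
efficiency = var_median / var_mean  # approaches pi/2 for large n
```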
Definition 3.45 The mean square error of a point estimator Θ̂ is MSE(Θ̂) = E[(Θ̂ − θ)²]. MSE(Θ̂) is also called the risk function of an estimator.
Definition 3.46 Consistent estimator. The statistic Θ̂ is a consistent estimator of the parameter θ of a given distribution if and only if for each c > 0,

limₙ→∞ ℙ(|Θ̂ − θ| ≤ c) = 1 or, equivalently, limₙ→∞ ℙ(|Θ̂ − θ| > c) = 0.
The previous definition says that when the size of the random sample is sufficiently large, we can be practically certain that the error made with a consistent
estimator will be less than any small pre-assigned positive constant.
Theorem 3.36 If Θ̂ is an unbiased estimator of the parameter θ and var(Θ̂) → 0 as n → ∞, then Θ̂ is a consistent estimator of θ.
Definition 3.47 Sufficient estimator. The statistic Θ̂ is a sufficient estimator of the parameter θ of a given distribution if and only if, for each value of Θ̂, the conditional probability distribution or density of the random sample X1, X2, . . ., Xn, given Θ̂ = θ̂, is independent of θ.
The statistic Θ̂ of a parameter θ which gives as much information about θ as is possible from the sample is called a sufficient estimator.
Definition 3.48 Minimal sufficient statistic. The sufficient statistic Θ̂ is called minimal sufficient if, for any other sufficient statistic Θ̃ and an arbitrary function f, we have Θ̂ = f(Θ̃).
It should be noted that the sufficient statistic is not unique. By Definition 3.48, a sufficient statistic that is a function of all other sufficient statistics is called minimal sufficient. Thus, the minimal sufficient statistic can be considered the most effective sufficient statistic for the parameter θ, since it is simpler than all the other sufficient statistics.
Theorem 3.37 The Rao–Blackwell theorem. Let θ̂ be an unbiased estimator for θ such that var(θ̂) < ∞. If Θ̂ is a sufficient statistic for θ, define θ̂* = E(θ̂ | Θ̂). Then, for all θ, E(θ̂*) = θ and var(θ̂*) ≤ var(θ̂).
The Rao–Blackwell theorem says that, if θ̂ is an unbiased estimator for θ and if Θ̂ is a sufficient statistic for θ, then there is a function of Θ̂ that is also an unbiased estimator for θ and has variance no larger than that of θ̂. This theorem can be used to find an unbiased estimator with the least variance (minimum variance unbiased estimator, MVUE). To find it, it is enough to take any unbiased estimator and compute its conditional expectation given a sufficient statistic for the parameter. The resulting estimator will have a variance no larger than that of the initial estimator. To determine the best estimator in the class or set of unbiased estimators, the Lehmann–Scheffé theorem should be used. This theorem ensures that the unbiased estimator created by the Rao–Blackwell theorem has less variance than any other unbiased estimator.
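As a numerical sketch of the Rao–Blackwell theorem (not from the text): for a Bernoulli(p) sample, θ̂ = X1 is a crude unbiased estimator of p, and T = ∑Xi is a sufficient statistic. Conditioning gives E(X1 | T) = T/n = X̄, which remains unbiased but has variance p(1 − p)/n instead of p(1 − p). The values p = 0.3 and n = 20 are arbitrary.

```python
# Rao-Blackwellization: condition a crude unbiased estimator on a sufficient
# statistic; the result is still unbiased and has a much smaller variance.
import random
import statistics

random.seed(4)
p, n, reps = 0.3, 20, 20000

crude, improved = [], []
for _ in range(reps):
    x = [1 if random.random() < p else 0 for _ in range(n)]
    crude.append(x[0])           # theta-hat = X1, unbiased but noisy
    improved.append(sum(x) / n)  # E(X1 | sum Xi) = Xbar, the improved estimator

bias_crude = statistics.fmean(crude) - p
bias_improved = statistics.fmean(improved) - p
var_crude = statistics.pvariance(crude)        # approx p(1-p) = 0.21
var_improved = statistics.pvariance(improved)  # approx p(1-p)/n = 0.0105
```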
Definition 3.49 The statistic Θ̂ is complete for its distribution family if the following relation holds for any measurable function g and any parameter value θ:

E[g(Θ̂)] = 0 → ℙ(g(Θ̂) = 0) = 1.
With a sufficient statistic, we look for a statistic that retains the most information about the unknown parameter. We also know that a sufficient statistic that is a function of all other sufficient statistics is called a minimal sufficient statistic, and it is best to use this statistic for estimation. The goal of requiring completeness is to select a statistic that carries no superfluous information about the parameter: even a minimal sufficient statistic may contain additional information that is not relevant for drawing inference about the population parameter. Selecting a complete minimal sufficient statistic results in a statistic that stores information only about the population parameter and has no superfluous information.
Theorem 3.38 Lehmann–Scheffé. Let Θ̂ be a complete sufficient statistic. If there are unbiased estimators, then there exists a unique MVUE. We can obtain the MVUE as Θ̂* = E(θ̂ | Θ̂) for any unbiased θ̂. The MVUE can also be characterized as the unique unbiased function Θ̂* = φ(Θ̂) of the complete sufficient statistic Θ̂.
This theorem helps to identify the best estimator from the class of unbiased estimators and is a complement to the Rao–Blackwell theorem. Using this theorem, we can show under what conditions the unbiased estimator with the least variance is unique. In this way, a uniformly minimum variance unbiased estimator (UMVUE) can be obtained. "Uniformly" means that this estimator has the least variance in the class of unbiased estimators for all points in the parameter space.
3.9.1 The Method of Maximum Likelihood
One of the most popular methods for estimating parameters is the method of maximum likelihood. The advantages of this method are that it yields sufficient estimators and that maximum likelihood estimators are often minimum variance unbiased estimators.
Definition 3.50 Maximum likelihood estimator (MLE). Let x1, x2, . . ., xn be the values of a random sample from a population with the parameter θ. The likelihood function of the sample is then represented by L(θ) = f(x1, x2, . . ., xn | θ) = ∏ᵢ₌₁ⁿ f(xi | θ) for values of θ within a given domain.
Note that f(x1, x2, . . ., xn; θ) is the value of the joint probability distribution or joint density of the random variables X1, X2, . . ., Xn at X1 = x1, X2 = x2, . . ., Xn = xn. We refer to the value of θ that maximizes L(θ) as the maximum likelihood estimator of θ. It is usually customary and easier to maximize the logarithm of the likelihood, ln L(θ). To
maximize ln L(θ), we take the derivative of ln L(θ) with respect to θ and set the expression equal to 0, i.e. ∂ ln L(θ)/∂θ = 0.
Example 3.10 If x1, x2, . . ., xn are the values of a random sample from an exponential population, find the maximum likelihood estimator of its parameter θ.

Solution According to the definition of the likelihood function, we have L(θ) = ∏ᵢ₌₁ⁿ f(xi | θ) = (1/θⁿ) e^(−(1/θ) ∑ᵢ₌₁ⁿ xi). Differentiating ln L(θ) with respect to θ yields d ln L(θ)/dθ = −n/θ + (1/θ²) ∑ᵢ₌₁ⁿ xi = 0. By solving this equation, we get the maximum likelihood estimate as θ = (1/n) ∑ᵢ₌₁ⁿ xi = x̄. Hence, the maximum likelihood estimator is Θ̂ = X̄.
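Example 3.10 can also be verified numerically: maximizing ln L(θ) over a fine grid of candidate values recovers θ̂ = x̄. The sketch below uses a small hypothetical data set (the values are made up for illustration):

```python
# Grid-search maximization of the exponential log-likelihood
# ln L(theta) = -n ln(theta) - (1/theta) * sum(x); the maximizer is xbar.
import math

x = [0.8, 2.1, 0.4, 3.3, 1.2, 0.9, 2.7, 1.6]  # hypothetical sample
n = len(x)
xbar = sum(x) / n  # 1.625

def log_likelihood(theta):
    return -n * math.log(theta) - sum(x) / theta

grid = [0.01 * k for k in range(1, 1001)]  # candidate theta values in (0, 10]
theta_hat = max(grid, key=log_likelihood)  # lands on the grid point nearest xbar
```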
3.9.2 Linear Regression Model
One of the popular methods for studying the causal relationship between independent and dependent variables is the linear regression method. There are two types of relationships between variables: deterministic and probabilistic. In the deterministic form, the relationship between the two variables is exact. For example, we might have Y = βX, where the value of Y is completely determined by X. In the probabilistic form, on the other hand, the relationship between variables involves a random component, or random error. For example, we might have Y = βX + E, containing two components: a deterministic component βX plus a random error E.
3.9.3 General Linear Model
Let the model for linear regression be represented by

Y = Xβ + E with E ~ N(0, σ²I),

where

Y = [Y1, Y2, . . ., Yn]ᵀ,  β = [β0, β1, . . ., βp]ᵀ,  E = [E1, E2, . . ., En]ᵀ,

and X is the n × (p + 1) design matrix

X = [ 1  x12  ⋯  x1p
      1  x22  ⋯  x2p
      ⋮   ⋮   ⋱   ⋮
      1  xn2  ⋯  xnp ].

Note that E(Ei Ej) = 0 for i ≠ j and E(Ei Ej) = σ² for i = j.

Our problem is to choose an estimated linear regression of the form Y = Xβ̂ + e, where β̂ is a (p + 1) column vector serving as an estimate of the vector β and e is an n column vector of residuals.
3.9.4 Ordinary Least Squares Method (OLS)
The problem of classical linear model estimation requires estimation of the unknown parameters β0, β1, . . ., βp and σ². The ordinary least squares method selects the values of β0, β1, . . ., βp that minimize the sum of squares of the errors, i.e.

S = ∑ᵢ₌₁ⁿ (Yi − β0 − β1 xi1 − . . . − βp xip)².

This can also be represented as
S = (Y − Xβ)ᵀ(Y − Xβ)
  = YᵀY − βᵀXᵀY − YᵀXβ + βᵀXᵀXβ
  = YᵀY − 2βᵀXᵀY + βᵀXᵀXβ.
Now, we differentiate S with respect to β and set the result equal to zero:

∂S/∂β = −2XᵀY + 2XᵀXβ̂ = 0.

Solving this equation yields β̂ = (XᵀX)⁻¹XᵀY.
Note that since ∂²S/∂β² = 2XᵀX is a positive definite matrix, β̂ will minimize S.
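The closed form β̂ = (XᵀX)⁻¹XᵀY can be checked on a small synthetic example. The sketch below (not from the text) fits the simple model Y = β0 + β1x + E by solving the 2 × 2 normal equations directly; the data are generated from β0 = 1, β1 = 2 with small fixed "errors" that happen to sum to zero and be orthogonal to x, so OLS recovers the coefficients exactly:

```python
# Ordinary least squares for Y = b0 + b1*x + E via the normal equations
# (X^T X) beta = X^T Y, solved by Cramer's rule for the 2x2 case.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
e = [0.1, -0.2, 0.0, 0.2, -0.1]  # fixed "errors", orthogonal to the columns of X
y = [1.0 + 2.0 * xi + ei for xi, ei in zip(x, e)]
n = len(x)

# Entries of X^T X and X^T Y for the design matrix X = [1 x].
sx, sxx = sum(x), sum(xi * xi for xi in x)
sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))

# Solve [n  sx ] [b0]   [sy ]
#       [sx sxx] [b1] = [sxy].
det = n * sxx - sx * sx
b0 = (sy * sxx - sx * sxy) / det  # = 1.0
b1 = (n * sxy - sx * sy) / det    # = 2.0
```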
Definition 3.51 The best linear unbiased estimator (BLUE) of a parameter β based on the data Y

1. is a linear function of Y, i.e. the estimator can be written as β̂ = AY for some matrix A;
2. is unbiased, i.e. E(β̂) = β; and
3. has the smallest variance among all the unbiased linear estimators.
Finally, we end this chapter by presenting an important theorem below.
Theorem 3.39 Gauss–Markov Theorem. If β̂ is the ordinary least squares estimator of β in the classical linear regression model, and if β̃ is any other linear unbiased estimator of β, then var(cᵀβ̂) ≤ var(cᵀβ̃), where c is any constant vector of the appropriate order.
In statistics, the Gauss–Markov theorem states that in a linear model whose errors have zero expectation, are uncorrelated, and have equal variances, the BLUE for the coefficients is the least squares estimator.
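The Gauss–Markov theorem can be illustrated by simulation (a sketch, not from the text). For the simple model y = β0 + β1x + E, the OLS slope is compared with another linear unbiased estimator of the slope, the two-endpoint estimator (y_n − y_1)/(x_n − x_1). Both are unbiased, but the OLS slope has the smaller variance; the parameter values below are arbitrary.

```python
# Compare the OLS slope with an alternative linear unbiased slope estimator.
import random
import statistics

random.seed(5)
beta0, beta1, sigma = 1.0, 2.0, 1.0
x = [float(i) for i in range(10)]
n, reps = len(x), 10000
sx, sxx = sum(x), sum(xi * xi for xi in x)

ols_slopes, endpoint_slopes = [], []
for _ in range(reps):
    y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
    sy = sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    ols_slopes.append((n * sxy - sx * sy) / (n * sxx - sx * sx))  # OLS slope
    endpoint_slopes.append((y[-1] - y[0]) / (x[-1] - x[0]))       # endpoint slope

var_ols = statistics.pvariance(ols_slopes)            # approx sigma^2 / 82.5
var_endpoint = statistics.pvariance(endpoint_slopes)  # approx 2 sigma^2 / 81
```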