
Probability Theory

Stochastic Benchmarking


Chapter 3 Probability Theory

Probability theory is the branch of mathematical science that deals with the mathematical analysis of random events. Probability is commonly used to describe the mind's attitude toward statements that we are not sure of. Such statements usually take the form "Will a particular event occur?", and the attitude of our minds takes the form "How confident are we that this event will occur?" Our confidence can be described numerically by a value between 0 and 1 that we call a probability. The more likely an event is to occur, the more confident we are that it will occur. The focus of this chapter is mainly on probability spaces, random variables, multidimensional probability distributions, expected value, variance, and covariance.

3.1 Probability Space

In this section, we present some important definitions that form the basis of probability theory.

Definition 3.1 A random experiment is an experiment whose outcome is not known in advance.

Definition 3.2 An event that may or may not occur as a result of a random experiment is called a random event. For example, tossing a coin is a random experiment because its outcome is not known in advance; the coin coming up heads is a random event because it may or may not happen.

Definition 3.3 The set of all possible outcomes of a random experiment associated with the phenomenon is called the sample space, represented by the symbol Ω. An individual element ω of Ω is called a sample point.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. A. Amirteimoori et al., Stochastic Benchmarking, International Series in Operations Research & Management Science 317, https://doi.org/10.1007/978-3-030-89869-4_3

Definition 3.4 The sample space Ω is discrete if Ω is a finite set or a countably infinite set.

Definition 3.5 The sample space Ω is continuous if Ω is an uncountably infinite set.

Example 3.1
(i) Coin tossing: Ω = {H, T}.
(ii) Rolling one die: Ω = {1, 2, 3, 4, 5, 6}.
(iii) Picking one card at random from a pack of 52: Ω = {1, 2, . . ., 52}.
(iv) An integer-valued random outcome: Ω = {0, 1, 2, . . .}.
(v) A non-negative, real-valued outcome: Ω = ℝ₊.
(vi) A random continuous parameter (such as time, weather, price, wealth, or temperature): Ω = ℝ.

An event is a collection of outcomes, represented by a subset of Ω. We can consider a class F of events, i.e., a class F of subsets of Ω (not necessarily all of the power set of Ω, P(Ω)), as a σ-algebra according to the following definition:

Definition 3.6 A collection F of events is a σ-algebra if it satisfies the following conditions:
(i) ∅ ∈ F;
(ii) For all countable sequences (Aₙ)ₙ≥₁ such that Aₙ ∈ F, n ≥ 1, we have ∪ₙ≥₁ Aₙ ∈ F;
(iii) A ∈ F ⟹ (Ω∖A) ∈ F.

Since A is an event, we expect its complement A′ to be an event. On the other hand, Ω is also an event because Ω ⊆ Ω, so we expect ∅ to be an event. Ω is called the "certain event" and ∅ the "impossible event." Since F is a subset of the set of all subsets of Ω, we call it an "event space."

Example 3.2 Rolling one die: Ω = {1, 2, 3, 4, 5, 6}. The event A = {1, 3, 5} corresponds to "the result of the experiment is an odd number."

F ≔ {Ω, ∅, {1, 3, 5}, {2, 4, 6}} defines a σ-algebra on Ω which corresponds to knowing the parity of an integer picked at random from 1 to 6.

G ≔ {Ω, ∅, {2, 4, 6}, {2, 4}, {6}, {1, 2, 3, 4, 5}, {1, 3, 5, 6}, {1, 3, 5}} defines a σ-algebra on Ω which is bigger than F and corresponds to the parity information contained in F, completed by the knowledge of whether or not the outcome equals 6.

Definition 3.7 Probability Measure. A probability measure is a mapping ℙ : F → [0, 1] that assigns to each event A ∈ F a value in the interval [0, 1] and satisfies the following three axioms:
(a) For any event A ∈ F, ℙ(A) ≥ 0;
(b) ℙ(Ω) = 1;
(c) ℙ(∪ₙ≥₁ Aₙ) = ∑ₙ≥₁ ℙ(Aₙ), whenever Aᵢ ∩ Aₖ = ∅, i ≠ k.
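For a finite Ω, the conditions of Definition 3.6 can be checked mechanically, since closure under countable unions reduces to closure under pairwise unions. A minimal Python sketch (the helper name is our own), applied to the families F and G of Example 3.2:

```python
from itertools import combinations

OMEGA = frozenset({1, 2, 3, 4, 5, 6})

def is_sigma_algebra(family, omega):
    """Check conditions (i)-(iii) of Definition 3.6 for a finite family."""
    fam = {frozenset(a) for a in family}
    has_empty = frozenset() in fam                              # condition (i)
    closed_compl = all(omega - a in fam for a in fam)           # condition (iii)
    # condition (ii): on a finite sample space, pairwise unions suffice
    closed_union = all(a | b in fam for a, b in combinations(fam, 2))
    return has_empty and closed_compl and closed_union

F = [set(), {1, 3, 5}, {2, 4, 6}, OMEGA]
G = [set(), {2, 4, 6}, {2, 4}, {6}, {1, 2, 3, 4, 5}, {1, 3, 5, 6}, {1, 3, 5}, OMEGA]
H = [set(), {1}, OMEGA]  # not a sigma-algebra: the complement {2,...,6} is missing

print(is_sigma_algebra(F, OMEGA))  # True
print(is_sigma_algebra(G, OMEGA))  # True
print(is_sigma_algebra(H, OMEGA))  # False
```

Note that G passes because it is exactly the σ-algebra generated by the partition {1, 3, 5}, {2, 4}, {6}, with its 2³ = 8 members.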
These axioms were put forward by the Russian mathematician and statistician Andrey Kolmogorov in 1933 as the axioms of the probability function and have since been known as the Kolmogorov axioms. The triple (Ω, F, ℙ) is called a probability space; this setting is generally referred to as the Kolmogorov framework. Based on the Kolmogorov axioms, many theorems have been established for the probability function. Some important ones are given below.

Theorem 3.1 Let (Ω, F, ℙ) be a probability space. Then:
(a) The probability of the event ∅ is zero, i.e., ℙ(∅) = 0.
(b) The probability of the union of two disjoint events is the sum of their probabilities, i.e., ℙ(A ∪ B) = ℙ(A) + ℙ(B), where A and B are disjoint events.
(c) The probability of the complement of an event A is ℙ(A′) = 1 − ℙ(A).
(d) If event A is a subset of event B, then ℙ(A) ≤ ℙ(B).
(e) If C is the union of two events A and B, then ℙ(C) = ℙ(A ∪ B) = ℙ(A) + ℙ(B) − ℙ(A ∩ B).

Definition 3.8 Let A and B be two events in a sample space Ω, where ℙ(A) ≠ 0; then the conditional probability of B given A is ℙ(B|A) = ℙ(A ∩ B)/ℙ(A). It follows immediately that if A and B are two events in the sample space Ω with ℙ(A) ≠ 0, then ℙ(A ∩ B) = ℙ(A)·ℙ(B|A). Moreover, given that ℙ(B) ≠ 0, since ℙ(A|B) = ℙ(A ∩ B)/ℙ(B), we also have ℙ(A ∩ B) = ℙ(B)·ℙ(A|B).

Definition 3.9 Two events A and B are independent if and only if ℙ(A ∩ B) = ℙ(A)·ℙ(B). In other words, two events A and B are independent if the occurrence or non-occurrence of one event does not affect the probability of the occurrence or non-occurrence of the other.

In many cases, the result of an experiment depends on what happened at different intermediate stages. The following theorem on the law of total probability deals with this issue.

Theorem 3.3 Law of Total Probability. Let the events A₁, A₂, . . ., Aₙ constitute a partition of the sample space Ω, where ℙ(Aᵢ) ≠ 0 for i = 1, 2, . . ., n. Then, for any event B in Ω,

ℙ(B) = ∑_{i=1}^{n} ℙ(Aᵢ)ℙ(B|Aᵢ).

In probability theory and statistics, Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. For example, if the probability that someone has cancer is related to his or her age, then using Bayes' theorem the age can be used to assess the probability of cancer more accurately than could be done without knowledge of the age.

Theorem 3.4 Let the events A₁, A₂, . . ., Aₙ constitute a partition of the sample space Ω, where ℙ(Aᵢ) ≠ 0 for i = 1, 2, . . ., n. Then, for any event B in Ω with ℙ(B) ≠ 0,

ℙ(Aᵣ|B) = ℙ(Aᵣ)·ℙ(B|Aᵣ) / ∑_{i=1}^{n} ℙ(Aᵢ)ℙ(B|Aᵢ),  r = 1, . . ., n.

3.2 Random Variable

Definition 3.10 Let (Ω, F, ℙ) be a probability space. A real-valued random variable on this probability space is a measurable mapping X : (Ω, F) → (ℝ, B(ℝ)), i.e., X⁻¹(G) ∈ F for all G ∈ B(ℝ). Note that X is a random variable from a probability space Ω into the state space ℝ, which maps each ω ∈ Ω to X(ω) ∈ ℝ.

Definition 3.11 A random variable X is continuous when Ω is continuous and discrete when Ω is discrete.

Example 3.3 When we roll two dice, Ω ≔ {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6} = {(1, 1), (1, 2), . . ., (6, 6)}. Consider X : Ω → ℝ with (k, l) ↦ k + l. Then X is a random variable that gives the sum of the two numbers appearing on the dice.

Definition 3.12 If X is a discrete random variable on (Ω, F, ℙ), the function f(x) = P(X = x), for each x in the range of X, is called the probability distribution of X.

Proposition 3.1 Let X be a discrete random variable on (Ω, F, ℙ). Then f is a probability distribution of X if and only if it satisfies the following conditions:
(a) f(x) ≥ 0, for each value of x;
(b) ∑ₓ f(x) = 1.
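The conditions of Proposition 3.1 lend themselves to a mechanical check. A quick Python sketch (the helper name and the candidate distributions are our own illustrations, kept exact with `fractions.Fraction`):

```python
from fractions import Fraction

def is_probability_distribution(f, support):
    """Check conditions (a) and (b) of Proposition 3.1 on a finite support."""
    values = [f(x) for x in support]
    return all(v >= 0 for v in values) and sum(values) == 1

# A made-up candidate: f(x) = x/10 for x = 1, 2, 3, 4; the values sum to 10/10 = 1.
def f(x):
    return Fraction(x, 10)

# The same rule on x = 1, 2, 3 fails condition (b): the values sum to 6/10.
print(is_probability_distribution(f, range(1, 5)))  # True
print(is_probability_distribution(f, range(1, 4)))  # False
```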
Example 3.4 (Freund et al., 2004) Check whether the function given by f(x) = (x + 2)/25, for x = 1, 2, 3, 4, 5, can serve as the probability distribution of a discrete random variable.

Substituting the possible values into the function gives f(1) = 3/25, f(2) = 4/25, f(3) = 5/25, f(4) = 6/25, f(5) = 7/25. Since these values are all non-negative, the first condition of Proposition 3.1 is satisfied, and since f(1) + f(2) + f(3) + f(4) + f(5) = 3/25 + 4/25 + 5/25 + 6/25 + 7/25 = 1, the second condition is also satisfied. Thus, the given function can serve as the probability distribution of a random variable having the range {1, 2, 3, 4, 5}.

Definition 3.13 Distribution Function. If X is a discrete random variable on (Ω, F, ℙ), the function F_X defined by

F_X(x) = P(X ≤ x) = ∑_{t≤x} f(t), for −∞ < x < ∞,

is called the distribution function, or the cumulative distribution, of X, in which f(t) is the value of the probability distribution of X at t.

Proposition 3.2 Let F_X be the cumulative distribution function of a discrete random variable X. Then:
(a) F_X is a non-decreasing function;
(b) F_X(−∞) = 0 and F_X(+∞) = 1.

Definition 3.14 Probability density function. A function f_X : ℝ → ℝ₊ is called a probability density function of the continuous random variable X if and only if

P(a ≤ X ≤ b) = ∫_{a}^{b} f_X(x) dx,

for any real constants a and b with a ≤ b. Probability density functions are also referred to as probability densities, density functions, densities, or pdfs.

Theorem 3.5 If X is a continuous random variable on (Ω, F, ℙ) and a and b are real constants with a ≤ b, then P(a ≤ X ≤ b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a < X < b).

Proposition 3.3 Let X be a continuous random variable on (Ω, F, ℙ). Then f_X is a probability density of X if and only if it satisfies the following conditions:
(a) f_X(x) ≥ 0, for −∞ < x < ∞;
(b) ∫_{−∞}^{∞} f_X(x) dx = 1.

Example 3.5 (Freund et al., 2004) If X has the probability density

f_X(x) = k·e^{−3x} for x > 0; 0 otherwise,

find k and P(0.5 ≤ X ≤ 1).

Solution: To satisfy the second condition of Proposition 3.3, we must have ∫_{−∞}^{∞} f_X(x) dx = ∫_{0}^{∞} k·e^{−3x} dx = k·[−e^{−3x}/3]_{0}^{∞} = k/3 = 1. It follows that k = 3. Moreover, P(0.5 ≤ X ≤ 1) = ∫_{0.5}^{1} 3e^{−3x} dx = [−e^{−3x}]_{0.5}^{1} = −e^{−3} + e^{−1.5} = 0.173.

Definition 3.15 Let X be a continuous random variable on (Ω, F, ℙ). The distribution function, or cumulative distribution function (CDF), of X, F_X : ℝ → [0, 1], is defined by

F_X(x) ≔ P(X ≤ x) = ∫_{−∞}^{x} f_X(t) dt, for x ∈ ℝ,

where f_X(t) is the value of the probability density of X at t. That is, F_X(x) is the probability that the random variable X takes a value less than or equal to x.

Proposition 3.4 Let F_X be a cumulative distribution function. Then:
(a) 0 ≤ F_X(x) ≤ 1;
(b) F_X is a non-decreasing function;
(c) F_X(−∞) = lim_{x→−∞} F_X(x) = 0 and F_X(+∞) = lim_{x→+∞} F_X(x) = 1.

Theorem 3.6 If f_X(x) and F_X(x) are the probability density and cumulative distribution functions of the random variable X, respectively, then for any real constants a and b with a ≤ b, we have P(a < X ≤ b) = F_X(b) − F_X(a) and f_X(x) = dF_X(x)/dx.

Remark 3.1 For a continuous random variable X, P(X = a) = 0.

Example 3.6 Find the distribution function of the random variable X in Example 3.5 and use it to re-evaluate P(0.5 ≤ X ≤ 1).

For x > 0, F_X(x) = ∫_{−∞}^{x} f_X(t) dt = ∫_{0}^{x} 3e^{−3t} dt = [−e^{−3t}]_{0}^{x} = 1 − e^{−3x}, and since F_X(x) = 0 for x ≤ 0, we can write

F_X(x) = 0 for x ≤ 0; 1 − e^{−3x} for x > 0.

Now, to determine P(0.5 ≤ X ≤ 1), we use Theorem 3.6: P(0.5 ≤ X ≤ 1) = F_X(1) − F_X(0.5) = (1 − e^{−3}) − (1 − e^{−1.5}) = 0.173.

In probability theory, a multivariate random variable, or random vector, is a list of random variables, the values of which are unknown.
For example, while a given person has a specific age, height, and weight, the representation of these features for an unspecified person in a group would be a random vector. In analogy with the one-dimensional case, we now define multivariate random variables, or random vectors, as multivariate functions.

Definition 3.16 An n-dimensional random variable, or random vector, X is a (measurable) function from the probability space Ω to ℝⁿ, i.e., X : Ω → ℝⁿ.

Definition 3.17 Joint Probability Distribution. Let X₁, X₂, . . ., Xₙ be discrete random variables. The joint probability distribution of X₁, X₂, . . ., Xₙ is defined as

f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) = P(X₁ = x₁, X₂ = x₂, . . ., Xₙ = xₙ),

where xᵢ ∈ Range(Xᵢ), i = 1, . . ., n.

Theorem 3.7 A multi-dimensional function can serve as the joint probability distribution of the discrete random variables X₁, X₂, . . ., Xₙ if and only if its values f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) satisfy the following conditions:
(a) f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) ≥ 0 for each (x₁, x₂, . . ., xₙ) within its domain;
(b) ∑_{x₁} ∑_{x₂} . . . ∑_{xₙ} f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) = 1, where the multiple summations extend over all possible x₁, x₂, . . ., xₙ within its domain.

Definition 3.18 Joint Cumulative Distribution. Let X₁, X₂, . . ., Xₙ be discrete random variables. The joint distribution function, or joint cumulative distribution, of X₁, X₂, . . ., Xₙ is defined as

F_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) = P(X₁ ≤ x₁, X₂ ≤ x₂, . . ., Xₙ ≤ xₙ) = ∑_{t₁≤x₁} ∑_{t₂≤x₂} . . . ∑_{tₙ≤xₙ} f_{X₁,X₂,...,Xₙ}(t₁, t₂, . . ., tₙ),

for x₁, x₂, . . ., xₙ ∈ ℝ, for all (x₁, x₂, . . ., xₙ) within the range of X₁, X₂, . . ., Xₙ.

Definition 3.19 Joint Probability Density Function. A function f_{X₁,...,Xₙ} : ℝⁿ → ℝ₊ is called a joint probability density function of the continuous random variables X₁, X₂, . . ., Xₙ if and only if

P((X₁, X₂, . . ., Xₙ) ∈ A) = ∫∫. . .∫_A f_{X₁,...,Xₙ}(x₁, x₂, . . ., xₙ) dx₁ dx₂ . . . dxₙ,

for any region A in ℝⁿ.

Theorem 3.8 Let X₁, X₂, . . ., Xₙ be continuous random variables on (Ω, F, ℙ). Then f_{X₁,...,Xₙ} is a joint probability density function of X₁, X₂, . . ., Xₙ if and only if it satisfies the following conditions:
(a) f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) ≥ 0, for x₁, x₂, . . ., xₙ ∈ ℝ;
(b) ∫_{−∞}^{∞} ∫_{−∞}^{∞} . . . ∫_{−∞}^{∞} f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) dx₁ dx₂ . . . dxₙ = 1.

Definition 3.20 Joint Distribution Function. If X₁, X₂, . . ., Xₙ are continuous random variables on (Ω, F, ℙ), then the function given by

F_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) = P(X₁ ≤ x₁, X₂ ≤ x₂, . . ., Xₙ ≤ xₙ) = ∫_{−∞}^{x₁} ∫_{−∞}^{x₂} . . . ∫_{−∞}^{xₙ} f_{X₁,X₂,...,Xₙ}(t₁, t₂, . . ., tₙ) dt₁ dt₂ . . . dtₙ, for x₁, x₂, . . ., xₙ ∈ ℝ,

is called the joint distribution function of X₁, X₂, . . ., Xₙ, in which f_{X₁,X₂,...,Xₙ}(t₁, t₂, . . ., tₙ) is the joint probability density of X₁, X₂, . . ., Xₙ at (t₁, t₂, . . ., tₙ).

Theorem 3.9
(a) F_{X₁,X₂,...,Xₙ}(−∞, −∞, . . ., −∞) = 0;
(b) F_{X₁,X₂,...,Xₙ}(∞, ∞, . . ., ∞) = 1;
(c) If a₁ < b₁, a₂ < b₂, . . ., aₙ < bₙ, then F_{X₁,X₂,...,Xₙ}(a₁, a₂, . . ., aₙ) ≤ F_{X₁,X₂,...,Xₙ}(b₁, b₂, . . ., bₙ);
(d) f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) = ∂ⁿ/(∂x₁ . . . ∂xₙ) F_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ).

Theorem 3.10 Let Xᵢ, i = 1, . . ., n, be random variables with probability distributions f_{Xᵢ}(xᵢ), i = 1, . . ., n, and let f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) be the joint probability distribution function of X₁, X₂, . . ., Xₙ. Then X₁, X₂, . . ., Xₙ are independent if and only if

f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) = f_{X₁}(x₁)·f_{X₂}(x₂)· . . . ·f_{Xₙ}(xₙ),

for all (x₁, x₂, . . ., xₙ) within their range.

3.3 Mathematical Expectation

The expectation, or expected value, of a random variable X is the mean or average value of X. In practice, expectations can be even more useful than probabilities. For example, bank deposits or the prices of the inputs of the DMUs can be considered as random variables.
In such cases, we usually refer to their expected values rather than their actual values.

Definition 3.21 Expected Value. If X is a discrete random variable and f_X(x) is its probability distribution function, the expected value of X is E(X) = ∑ₓ x·f_X(x). Correspondingly, if X is a continuous random variable and f_X(x) is its probability density, the expected value of X is E(X) = ∫_{−∞}^{∞} x·f_X(x) dx.

Theorem 3.11 If X is a discrete random variable and f_X(x) is its probability distribution function, the expected value of g(X), as a function of X, is given by E[g(X)] = ∑ₓ g(x)·f_X(x). Correspondingly, if X is a continuous random variable and f_X(x) is its probability density function, then the expected value of g(X), as a function of X, is given by E[g(X)] = ∫_{−∞}^{∞} g(x)·f_X(x) dx.

Theorem 3.12 If a and b are constants, then E(aX + b) = aE(X) + b.

Theorem 3.13 If c₁, c₂, . . ., cₙ are constants, then E[∑_{i=1}^{n} cᵢgᵢ(X)] = ∑_{i=1}^{n} cᵢE[gᵢ(X)].

Theorem 3.14 If X₁, X₂, . . ., Xₙ are independent random variables, then E(X₁X₂ . . . Xₙ) = E(X₁)E(X₂) . . . E(Xₙ).

Definition 3.22 Variance. Let X be a random variable with a finite expected value μ. Then the variance of X, denoted by σ², σ²_X, or var(X), is defined by σ² = E[(X − μ)²]. The positive square root of the variance, σ, is called the standard deviation of X.

Theorem 3.15 σ² = E(X²) − E(X)².

Theorem 3.16 If X has variance σ² and a and b are constants, then var(aX + b) = a²σ².

Here we present Chebyshev's inequality, which enables us to derive bounds on probabilities when only the mean, or both the mean and the variance, of the probability distribution are known; it is valid for all distributions of a random variable.

Theorem 3.17 Chebyshev's inequality. If X is a random variable with finite mean μ and variance σ², then for any value k > 0,

ℙ(|X − μ| ≥ k) ≤ σ²/k².

Definition 3.23 Covariance. Let X and Y be two random variables with finite expected values μ_X and μ_Y, respectively. Then the covariance of X and Y, denoted by σ_{XY} or cov(X, Y), is defined by cov(X, Y) = E[(X − μ_X)(Y − μ_Y)].

Theorem 3.18 cov(X, Y) = E(XY) − E(X)E(Y). It can easily be shown that if X and Y are two independent random variables, then cov(X, Y) = 0.

Theorem 3.19 If X₁, X₂, . . ., Xₙ are random variables and Y = ∑_{i=1}^{n} aᵢXᵢ, where a₁, a₂, . . ., aₙ are constants, then

var(Y) = ∑_{i=1}^{n} aᵢ²·var(Xᵢ) + 2∑∑_{i<j} aᵢaⱼ·cov(Xᵢ, Xⱼ).

Proof. See Freund et al. (2004).

Corollary 3.1 If X₁, X₂, . . ., Xₙ are independent random variables and Y = ∑_{i=1}^{n} aᵢXᵢ, then var(Y) = ∑_{i=1}^{n} aᵢ²·var(Xᵢ).

Theorem 3.20 If X₁, X₂, . . ., Xₙ are random variables and Y₁ = ∑_{i=1}^{n} aᵢXᵢ and Y₂ = ∑_{i=1}^{n} bᵢXᵢ, where a₁, a₂, . . ., aₙ, b₁, b₂, . . ., bₙ are constants, then

cov(Y₁, Y₂) = ∑_{i=1}^{n} aᵢbᵢ·var(Xᵢ) + ∑∑_{i<j} (aᵢbⱼ + aⱼbᵢ)·cov(Xᵢ, Xⱼ).

Corollary 3.2 If the random variables X₁, X₂, . . ., Xₙ are independent, Y₁ = ∑_{i=1}^{n} aᵢXᵢ and Y₂ = ∑_{i=1}^{n} bᵢXᵢ, then cov(Y₁, Y₂) = ∑_{i=1}^{n} aᵢbᵢ·var(Xᵢ).

Definition 3.24 Correlation Coefficient. The correlation coefficient of two random variables X and Y, denoted by ρ(X, Y), is defined by

ρ(X, Y) = cov(X, Y) / (√var(X)·√var(Y)).

It can be shown that −1 ≤ ρ(X, Y) ≤ 1. The correlation coefficient is a measure of the degree of linearity between X and Y. A value of ρ(X, Y) near +1 or −1 indicates a high degree of linearity between X and Y, whereas a value near 0 indicates that such linearity is absent. A positive value of ρ(X, Y) indicates that Y tends to increase when X does, whereas a negative value indicates that Y tends to decrease when X increases. If ρ(X, Y) = 0, then X and Y are said to be uncorrelated.
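The identities of Theorems 3.15 and 3.18 and Definition 3.24 can be verified exactly on a small discrete joint distribution. A Python sketch (the joint distribution of (X, Y) below is a made-up illustration, kept exact with `fractions.Fraction`):

```python
from fractions import Fraction
import math

# A hypothetical joint distribution of (X, Y); the four probabilities sum to 1.
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def E(g):
    """Expected value of g(X, Y) under the joint distribution (Theorem 3.11)."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

mu_x, mu_y = E(lambda x, y: x), E(lambda x, y: y)
var_x = E(lambda x, y: (x - mu_x) ** 2)
var_y = E(lambda x, y: (y - mu_y) ** 2)

# Two routes to the covariance: Definition 3.23 and Theorem 3.18.
cov_def = E(lambda x, y: (x - mu_x) * (y - mu_y))
cov_thm = E(lambda x, y: x * y) - mu_x * mu_y

# Definition 3.24: the correlation coefficient.
rho = float(cov_def) / math.sqrt(float(var_x) * float(var_y))
print(cov_def, cov_thm, round(rho, 4))
```

For this distribution both routes give cov(X, Y) = 1/16 exactly, and ρ lies in [−1, 1] as Definition 3.24 promises.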
3.4 Discrete Distributions

In this section, we introduce some of the most commonly used probability distributions. Due to the importance of the normal and chi-square distributions, these two distributions are studied in more detail.

Definition 3.25 Discrete uniform distribution. A random variable X has a discrete uniform distribution, and is referred to as a discrete uniform random variable, if and only if its probability distribution is given by f(x) = 1/k, for x = x₁, x₂, . . ., x_k. A discrete uniform random variable takes each of its values with equal probability.

Definition 3.26 Bernoulli distribution. A random variable X has a Bernoulli distribution, and is referred to as a Bernoulli random variable, if and only if its probability distribution is given by f(x; p) = pˣ(1 − p)^{1−x}, for x = 0, 1.

Theorem 3.21 Let X be a Bernoulli random variable. The mean and variance of X are then μ = p and σ² = p(1 − p), respectively.

Definition 3.27 Binomial distribution. A random variable X has a binomial distribution, and is referred to as a binomial random variable, if and only if its probability distribution is given by

b(x; n, p) = C(n, x)·pˣ(1 − p)^{n−x}, for x = 0, 1, . . ., n,

where C(n, x) denotes the binomial coefficient.

Theorem 3.22 Let X be a binomial random variable. The mean and variance of X are then μ = np and σ² = np(1 − p), respectively.

Definition 3.28 Poisson distribution. A random variable X has a Poisson distribution, and is referred to as a Poisson random variable, if and only if its probability distribution is given by

p(x; λ) = λˣe^{−λ}/x!, for x = 0, 1, 2, . . .,

where λ is the mean number of successes in the given time interval or region.

Theorem 3.23 Let X be a Poisson random variable. The mean and variance of X are then μ = λ and σ² = λ, respectively.

3.5 Continuous Distributions

Definition 3.29 Continuous uniform distribution. A random variable X has a continuous uniform distribution, and is referred to as a continuous uniform random variable, if and only if its probability density is given by

f(x) = 1/(b − a) for a ≤ x ≤ b; 0 elsewhere.

In other words, the random variable X is uniformly distributed over the interval [a, b].

Theorem 3.24 Let X be a uniform random variable. The mean and variance of X are then μ = (a + b)/2 and σ² = (b − a)²/12, respectively.

Definition 3.30 Gamma function. The gamma function of α, denoted by Γ(α), is defined as

Γ(α) = ∫_{0}^{∞} x^{α−1}e^{−x} dx.

Corollary 3.3 For a positive integer α, Γ(α) = (α − 1)!.

Definition 3.31 Gamma distribution. A random variable X has a gamma distribution, and is referred to as a gamma random variable, if and only if its probability density function is given by

g(x; α, β) = (1/(β^α Γ(α)))·x^{α−1}e^{−x/β} for x > 0; 0 elsewhere,

where α > 0 and β > 0.

Theorem 3.25 Let X be a gamma random variable. The mean and variance of X are then μ = αβ and σ² = αβ², respectively.

Definition 3.32 Exponential distribution. A random variable X has an exponential distribution, and is referred to as an exponential random variable, if and only if its probability density is given by

g(x; θ) = (1/θ)e^{−x/θ} for x > 0; 0 elsewhere,

where θ > 0.

Note: The exponential distribution is a special case of the gamma distribution with α = 1 and β = θ.

Theorem 3.26 Let X be an exponential random variable. The mean and variance of X are then μ = θ and σ² = θ², respectively.

Definition 3.33 Weibull distribution. A random variable X has a Weibull distribution, and is referred to as a Weibull random variable, if and only if its probability density is given by

f(x) = kx^{β−1}e^{−αx^β} for x > 0; 0 elsewhere,

where α > 0, β > 0, and k = αβ so that the density integrates to 1.

Note: The exponential distribution is a special case of the Weibull distribution with β = 1.
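As a concrete check of Definition 3.27 and Theorem 3.22, the binomial mean and variance can be computed exactly from the probability distribution itself. A Python sketch (the parameters n = 10 and p = 3/10 are arbitrary choices for illustration):

```python
import math
from fractions import Fraction

def binomial_pmf(x: int, n: int, p: Fraction) -> Fraction:
    """b(x; n, p) from Definition 3.27, evaluated exactly."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, Fraction(3, 10)
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                                            # condition (b): 1
mean = sum(x * pmf[x] for x in range(n + 1))                # Theorem 3.22: n p
var = sum((x - mean) ** 2 * pmf[x] for x in range(n + 1))   # n p (1 - p)
print(total, mean, var)
```

With exact rational arithmetic, the probabilities sum to 1, the mean equals np = 3, and the variance equals np(1 − p) = 21/10.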
Definition 3.34 Beta distribution. A random variable X has a beta distribution, and is referred to as a beta random variable, if and only if its probability density is given by

f(x; α, β) = (Γ(α + β)/(Γ(α)Γ(β)))·x^{α−1}(1 − x)^{β−1} for 0 < x < 1; 0 elsewhere,

where α > 0 and β > 0.

Theorem 3.27 Let X be a beta random variable. The mean and variance of X are then μ = α/(α + β) and σ² = αβ/((α + β)²(α + β + 1)), respectively.

3.6 The Normal Distribution

In probability theory, the normal distribution is one of the most important statistical distributions. This distribution is sometimes referred to as the Gaussian distribution or the Laplace–Gauss distribution. In this section, we briefly summarize the properties of this distribution.

Definition 3.35 Normal Distribution. A random variable X has a normal distribution with expectation μ and variance σ², i.e., X ~ N(μ, σ²), if and only if its probability density is given by

f(x) = (1/(σ√(2π)))·e^{−(1/2)((x−μ)/σ)²}, for x ∈ ℝ,

where σ > 0.

The graph of a normal distribution is shown in Fig. 3.1 (graph of the normal distribution). It is shaped like the cross-section of a bell. μ and σ are the two parameters that play a key role in the shape of the normal distribution.

Definition 3.36 Standard Normal Distribution. The normal random variable with μ = 0 and σ = 1 is referred to as the standard normal random variable.

Theorem 3.28 If X has a normal distribution with mean μ and standard deviation σ, then Z = (X − μ)/σ has a standard normal distribution.

Proof See Freund et al. (2004).

Example 3.7 If X is a normal random variable with μ = 3 and σ = 4, find P(4 ≤ X ≤ 8).

Solution:

P(4 ≤ X ≤ 8) = P((4 − 3)/4 ≤ (X − 3)/4 ≤ (8 − 3)/4) = P(0.25 ≤ Z ≤ 1.25) = P(Z ≤ 1.25) − P(Z ≤ 0.25) = 0.8944 − 0.5987 = 0.2957.

Definition 3.37 Multivariate Normal Random Variable. Let X = (X₁, . . ., X_k) be a k-dimensional random variable. X has a multivariate normal distribution with mean vector μ and variance–covariance matrix Σ, i.e., X ~ N(μ, Σ), if and only if its probability density is given by

f_X(x₁, . . ., x_k) = (1/√((2π)^k |Σ|))·e^{−(1/2)(X−μ)ᵀΣ⁻¹(X−μ)}, for xᵢ ∈ ℝ, i = 1, . . ., k.

Note that μ = E(X) = [E(X₁), E(X₂), . . ., E(X_k)] and Σ = [cov(Xᵢ, Xⱼ); 1 ≤ i, j ≤ k].

Example 3.8 Suppose Z = (X, Y) is a bivariate normal random variable. Let the mean vector and variance–covariance matrix of Z be μ = (μ_X, μ_Y) and

Σ = [σ²_X, ρσ_Xσ_Y; ρσ_Xσ_Y, σ²_Y],

respectively. Then

f_{X,Y}(x, y) = (1/(2πσ_Xσ_Y√(1 − ρ²)))·exp{−(1/(2(1 − ρ²)))·[(x − μ_X)²/σ²_X + (y − μ_Y)²/σ²_Y − 2ρ(x − μ_X)(y − μ_Y)/(σ_Xσ_Y)]}.

Note that ρ is the correlation coefficient between X and Y.

Definition 3.38 Half-normal distribution. A random variable X has a half-normal distribution, and is referred to as a half-normal random variable, if and only if its probability density function is given by

f(x; σ) = (√2/(σ√π))·exp(−x²/(2σ²)), for x > 0.

Let Z follow a normal distribution, i.e., Z ~ N(0, σ²). Then X = |Z| follows a half-normal distribution. The graph of a normal distribution along with a half-normal distribution is shown in Fig. 3.2 (PDFs of the normal distribution with mean zero and the half-normal distribution).

3.7 The Chi-Square Distribution

In probability theory and statistics, the chi-square distribution (χ²-distribution) with k degrees of freedom is the distribution of the sum of the squares of k independent standard normal random variables. Due to the importance of this distribution, we will discuss it here.
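Example 3.7 can be reproduced numerically: the standard normal CDF Φ can be written in terms of the error function as Φ(z) = (1 + erf(z/√2))/2, which Python's standard library supports via `math.erf`. A sketch:

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example 3.7: X ~ N(mu = 3, sigma^2 = 16); find P(4 <= X <= 8).
mu, sigma = 3.0, 4.0

# Theorem 3.28: standardize with Z = (X - mu) / sigma.
z_lo, z_hi = (4.0 - mu) / sigma, (8.0 - mu) / sigma   # 0.25 and 1.25
p = phi(z_hi) - phi(z_lo)
print(p)
```

The result agrees with the table-based value 0.2957 to about four decimal places (tables round Φ(1.25) and Φ(0.25) before subtracting).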
Definition 3.39 Chi-Square Distribution. A random variable X has a chi-square distribution with k degrees of freedom, i.e., X ~ χ²(k), and is referred to as a chi-square random variable, if and only if its probability density function is given by

f(x; k) = (1/(2^{k/2}Γ(k/2)))·x^{k/2−1}e^{−x/2} for x > 0; 0 otherwise,

where k is referred to as the degrees of freedom.

Corollary 3.4 The mean and variance of the chi-square distribution with k degrees of freedom are μ = k and σ² = 2k, respectively.

Theorem 3.29 Suppose X₁, X₂, . . ., Xₙ are n independent random variables having standard normal distributions. Then Y = ∑_{i=1}^{n} Xᵢ² has a chi-square distribution with n degrees of freedom.

Proof See Freund et al. (2004).

Theorem 3.30 Suppose that X is a k-dimensional random vector, X ~ N(μ, Σ), where Σ is positive definite. Then (X − μ)ᵀΣ⁻¹(X − μ) follows a chi-square distribution with k degrees of freedom.

Proof See Flury (2013).

Remark 3.2 The curves of the form (X − μ)ᵀΣ⁻¹(X − μ) = constant > 0 are ellipses, or ellipsoids in higher dimensions.

Returning to Example 3.8, to find an ellipse within which X falls with probability α, we need to set (X − μ)ᵀΣ⁻¹(X − μ) equal to the α-quantile of the chi-square distribution with two degrees of freedom. If U is a chi-square random variable with two degrees of freedom, then its distribution function is

F(u) = P(U ≤ u) = 0 for u ≤ 0; 1 − e^{−u/2} for u > 0.

Therefore, the α-quantile is computed as follows:

F(u) = α ⟹ 1 − e^{−u/2} = α ⟹ u = −2 log(1 − α).

By choosing different values for α ∈ (0, 1), we obtain different values for u. Therefore, for each α, we have a quadratic curve in the (x₁, x₂)-plane. For instance, for μ = (3, 4)ᵀ and Σ = [2, 1; 1, 1], we have

(X − μ)ᵀΣ⁻¹(X − μ) = (x₁ − 3, x₂ − 4)·[1, −1; −1, 2]·(x₁ − 3, x₂ − 4)ᵀ = c²,

where c² = −2 log(1 − α). Figure 3.3 shows the resulting curves for α = 0.1, 0.2, . . ., 0.9 (Fig. 3.3: Ellipses of constant density of a bivariate normal distribution; the ellipses represent the regions within which X falls with probability α = 0.1, 0.2, . . ., 0.9).

3.8 Sampling Distributions

As we know, the events of a statistical experiment are determined numerically by random variables. The total set of observations surveyed is called the population, and the number of population members is called the size of the population. The observations are values of a random variable, and since each random variable has a probability distribution, each statistical population can be assigned a random variable and thus has a probability distribution. For example, if the random variable corresponding to the observations of a population is a normal random variable, that population is called a normal population. Since we define a population as a set, it can have a finite or infinite number of members. On the other hand, since we do not have access to all observations in an infinite population, we do not know the distribution of the population, nor do we know its mean and variance. Therefore, we call the mean and variance of the population the population parameters, and they must be estimated. For this purpose, we need a sample from the population.

Definition 3.40 Random Sample. If X₁, X₂, . . ., Xₙ are independent and identically distributed random variables, we say that they constitute a random sample from the infinite population given by their common distribution.

Note: If X₁, X₂, . . ., Xₙ are n independent and identically distributed random variables with the same probability function, then the probability distribution of this random sample is f(x₁, x₂, . . ., xₙ) = f(x₁)f(x₂). . .f(xₙ).

Now, to estimate the population parameters, we first introduce the mean and variance of the sample. These values calculated from a random sample are called statistics, and since they depend on the sample, and many different random samples can be drawn from the population, each statistic is a random variable.
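The α-quantile computation above is easy to check numerically: plugging u = −2 log(1 − α) back into the two-degrees-of-freedom CDF must return α. A Python sketch (the function names are our own):

```python
import math

def chi2_2dof_cdf(u: float) -> float:
    """Chi-square CDF with two degrees of freedom: F(u) = 1 - e^{-u/2}, u > 0."""
    return 1.0 - math.exp(-0.5 * u) if u > 0 else 0.0

def chi2_2dof_quantile(alpha: float) -> float:
    """The alpha-quantile u = -2 log(1 - alpha), used as c^2 for the ellipses."""
    return -2.0 * math.log(1.0 - alpha)

# Round-trip check for the probability levels of Fig. 3.3.
for alpha in (0.1, 0.5, 0.9):
    u = chi2_2dof_quantile(alpha)
    print(alpha, round(u, 4), round(chi2_2dof_cdf(u), 4))
```

For α = 0.9, for instance, u = −2 log(0.1) ≈ 4.6052, and F(4.6052) recovers 0.9.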
Definition 3.41 Statistic. Any function of the members of a random sample that does not contain unknown parameters is called a statistic.

Definition 3.42 Sample mean and sample variance. If X₁, X₂, …, Xₙ constitute a random sample, then the sample mean is given by

X̄ = (1/n) Σ_{i=1}^n Xᵢ,

and the sample variance is given by

S² = Σ_{i=1}^n (Xᵢ − X̄)² / (n − 1).

Note: X̄ and S² are two statistics.

Theorem 3.31 If X₁, X₂, …, Xₙ constitute a random sample from an infinite population with mean μ and variance σ², then E(X̄) = μ and var(X̄) = σ²/n.

Theorem 3.32 If X₁, X₂, …, Xₙ constitute a random sample from an infinite population with mean μ and variance σ², then E(S²) = σ².

3.8.1 Limit Theorems

Limit theorems are among the most important theoretical results in probability theory. These theorems give conditions for the convergence of sequences of random variables or of their distribution functions. Since random variables are functions subject to random influences, different modes of convergence arise for a sequence of random variables. The central limit theorem and the law of large numbers are the most important limit theorems, and we introduce them in this section.

Theorem 3.33 The law of large numbers. If X₁, X₂, …, Xₙ constitute a random sample from an infinite population with mean μ and variance σ², then for any ε > 0, ℙ(|X̄ − μ| ≥ ε) → 0 as n → ∞.

The law of large numbers states that the sample average converges in probability toward the expected value. In fact, as the sample size increases, the sample mean gets closer to the population mean.

Theorem 3.34 The central limit theorem. Let X₁, X₂, …, Xₙ constitute a random sample from an infinite population with mean μ and variance σ². Then the distribution of Z = (X̄ − μ) / (σ/√n) converges to the standard normal distribution as n → ∞. That is, for −∞ < a < ∞,

ℙ( (X̄ − μ) / (σ/√n) ≤ a ) → (1/√(2π)) ∫_{−∞}^{a} e^(−x²/2) dx  as n → ∞.
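Both limit theorems are easy to observe in simulation. A minimal sketch (standard library only; the sample sizes, repetition counts, and seed are arbitrary choices) uses Uniform(0, 1) draws, for which μ = 0.5 and σ² = 1/12:

```python
import random
import statistics

random.seed(42)
mu, sigma2 = 0.5, 1.0 / 12.0   # mean and variance of Uniform(0, 1)

# Law of large numbers: the sample mean approaches mu for large n.
xs = [random.random() for _ in range(100_000)]
assert abs(statistics.fmean(xs) - mu) < 0.01

# Central limit theorem: standardized sample means are approximately
# standard normal, so about 95% fall inside (-1.96, 1.96).
n, reps = 50, 2_000
inside = 0
for _ in range(reps):
    xbar = statistics.fmean(random.random() for _ in range(n))
    z = (xbar - mu) / ((sigma2 / n) ** 0.5)
    if abs(z) < 1.96:
        inside += 1
coverage = inside / reps
assert 0.92 < coverage < 0.98
```

The observed coverage hovers near the nominal 95%, even though the underlying uniform distribution is far from normal.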
3.9 Estimation Theory

In this section, we analyze and estimate population parameters using the statistics presented in the previous section. Population parameters can be examined in several ways. The most common of these is the classical estimation method, which infers the population parameters directly from a sample of the population. The parameters of a population can be estimated in two ways: point estimation and interval estimation. Generally, the parameter of the population and the statistic that will be used to examine this parameter are denoted by θ and Θ̂, respectively. The statistic used for point estimation is called an estimator. Since one parameter may have multiple estimators, we need to know which estimator is better.

Definition 3.43 Unbiased estimator. A statistic Θ̂ is an unbiased estimator of the parameter θ of a given distribution if and only if E(Θ̂) = θ for all possible values of θ.

Note: Θ̂ is called a biased estimator if E(Θ̂) ≠ θ. The bias is then defined as the difference between E(Θ̂) and θ, i.e., bias = E(Θ̂) − θ.

Each parameter may have several unbiased estimators. If we must choose one of them, we usually take the one whose sampling distribution has the smallest variance.

Example 3.9 If X₁, X₂, …, Xₙ have a Bernoulli distribution with success parameter p, then the statistic X̄ = (1/n) Σ_{i=1}^n Xᵢ is an unbiased estimator of p.

Solution. E(X̄) = E( (1/n) Σ_{i=1}^n Xᵢ ) = (1/n) Σ_{i=1}^n E(Xᵢ) = (1/n)(np) = p.

Definition 3.44 Minimum variance unbiased estimator. The estimator of the parameter θ of a given distribution that has the smallest variance among all unbiased estimators of θ is called the minimum variance unbiased estimator, or the best unbiased estimator, of θ.

Theorem 3.35 Let Θ̂ be an unbiased estimator of θ. If

var(Θ̂) = 1 / ( n · E[ (∂ ln f(X) / ∂θ)² ] ),

then Θ̂ is the unbiased estimator of θ with minimum variance.

The quantity in the denominator is referred to as the information about θ. Thus, the smaller the variance is, the greater the information.
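For the Bernoulli case of Example 3.9, the bound of Theorem 3.35 can be evaluated explicitly: the information per observation is 1/(p(1 − p)), so the bound equals p(1 − p)/n, which is exactly var(X̄). A simulation sketch (standard library only; p, n, the repetition count, and the seed are arbitrary choices) illustrates both unbiasedness and attainment of the bound:

```python
import random
import statistics

random.seed(0)
p, n, reps = 0.3, 25, 4_000

# Sampling distribution of X-bar over many Bernoulli(p) samples.
xbars = []
for _ in range(reps):
    xbars.append(sum(1 if random.random() < p else 0 for _ in range(n)) / n)

# Unbiasedness (Example 3.9): E(X-bar) = p.
assert abs(statistics.fmean(xbars) - p) < 0.01

# The information per observation is 1/(p(1-p)), so the bound of
# Theorem 3.35 is p(1-p)/n -- and var(X-bar) attains it.
bound = p * (1.0 - p) / n
assert abs(statistics.pvariance(xbars) - bound) < 0.002
```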
Note: If Θ̂₁ and Θ̂₂ are two unbiased estimators of θ, where the variance of Θ̂₁ is smaller than the variance of Θ̂₂, then we say Θ̂₁ is relatively more efficient than Θ̂₂. The efficiency of Θ̂₁ relative to Θ̂₂, denoted eff(Θ̂₁, Θ̂₂), is defined to be the ratio

eff(Θ̂₁, Θ̂₂) = var(Θ̂₂) / var(Θ̂₁).

Definition 3.45 The mean square error of a point estimator Θ̂ is MSE(Θ̂) = E[(Θ̂ − θ)²]. MSE(Θ̂) is also called the risk function of an estimator.

Definition 3.46 Consistent estimator. The statistic Θ̂ is a consistent estimator of the parameter θ of a given distribution if and only if for each c > 0,

lim_{n→∞} ℙ(|Θ̂ − θ| ≤ c) = 1,  or, equivalently,  lim_{n→∞} ℙ(|Θ̂ − θ| > c) = 0.

The previous definition says that when the size of the random sample is sufficiently large, we can be practically certain that the error made with a consistent estimator will be less than any small pre-assigned positive constant.

Theorem 3.36 If Θ̂ is an unbiased estimator of the parameter θ and var(Θ̂) → 0 as n → ∞, then Θ̂ is a consistent estimator of θ.

Definition 3.47 Sufficient estimator. The statistic Θ̂ is a sufficient estimator of the parameter θ of a given distribution if and only if, for each value θ̂ of Θ̂, the conditional probability distribution or density of the random sample X₁, X₂, …, Xₙ, given Θ̂ = θ̂, is independent of θ.

A statistic Θ̂ that gives as much information about θ as is possible from the sample is called a sufficient estimator.

Definition 3.48 Minimal sufficient statistic. The sufficient statistic Θ̂ is called minimal sufficient if, for any other sufficient statistic Θ̂′, there is a function f such that Θ̂ = f(Θ̂′).

It should be noted that the sufficient statistic is not unique. By Definition 3.48, a sufficient statistic that is a function of every other sufficient statistic is called minimal sufficient. Thus, the minimal sufficient statistic can be considered the most effective sufficient statistic for the parameter θ: it is simpler than all the other sufficient statistics.
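Definitions 3.45 and 3.46 above can be illustrated with the two common variance estimators: S² with divisor n − 1 (unbiased) and the divisor-n version (biased, but still consistent). A simulation sketch (standard library only; the normal data, sample size, and seed are arbitrary choices) also verifies the decomposition MSE = variance + bias²:

```python
import random
import statistics

random.seed(1)
sigma2, n, reps = 4.0, 10, 5_000

s2_unbiased, s2_biased = [], []
for _ in range(reps):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    s2_unbiased.append(ss / (n - 1))   # divisor n-1: E = sigma^2
    s2_biased.append(ss / n)           # divisor n: E = (n-1)/n * sigma^2

bias_unbiased = statistics.fmean(s2_unbiased) - sigma2
bias_biased = statistics.fmean(s2_biased) - sigma2

# The divisor-n estimator is biased downward by sigma^2 / n = 0.4.
assert abs(bias_unbiased) < 0.15
assert abs(bias_biased - (-sigma2 / n)) < 0.15

# MSE decomposes exactly as variance plus squared bias.
mse_biased = statistics.fmean((v - sigma2) ** 2 for v in s2_biased)
decomposed = statistics.pvariance(s2_biased) + bias_biased ** 2
assert abs(mse_biased - decomposed) < 1e-9
```

Both estimators are consistent, since their bias and variance vanish as n grows; the simulation only exhibits the finite-sample difference.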
Theorem 3.37 The Rao–Blackwell theorem. Let θ̂ be an unbiased estimator for θ such that var(θ̂) < ∞. If Θ̂ is a sufficient statistic for θ, define θ̂* = E(θ̂ | Θ̂). Then, for all θ, E(θ̂*) = θ and var(θ̂*) ≤ var(θ̂).

The Rao–Blackwell theorem says that if θ̂ is an unbiased estimator for θ and Θ̂ is a sufficient statistic for θ, then there is a function of Θ̂ that is also an unbiased estimator for θ and has variance no larger than that of θ̂. This theorem can be used to find an unbiased estimator with the least variance (minimum variance unbiased estimator, MVUE): it suffices to take any unbiased estimator and compute its conditional expectation given a sufficient statistic for the parameter. The resulting estimator has variance no larger than that of the initial estimator. To determine the best estimator in the class of unbiased estimators, the Lehmann–Scheffé theorem should be used. This theorem ensures that the unbiased estimator created by the Rao–Blackwell theorem has variance no larger than that of any other unbiased estimator.

Definition 3.49 The statistic Θ̂ is complete for its distribution family if, for any measurable function g and any parameter value θ,

E[g(Θ̂)] = 0  →  ℙ(g(Θ̂) = 0) = 1.

With a sufficient statistic, we seek a statistic that retains all the information about the unknown parameter. We also know that a sufficient statistic that is a function of every other sufficient statistic is called a minimal sufficient statistic, and it is best to use this statistic for estimation. The goal of completeness, in turn, is to select a statistic that carries no superfluous information about the parameter.
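The Rao–Blackwell construction above can be demonstrated concretely for Bernoulli(p): the single observation X₁ is unbiased for p, and conditioning on the sufficient statistic T = ΣXᵢ gives E(X₁ | T) = T/n = X̄, which is also unbiased but has far smaller variance. A simulation sketch (standard library only; p, n, the repetition count, and the seed are arbitrary choices):

```python
import random
import statistics

random.seed(7)
p, n, reps = 0.4, 20, 5_000

crude, improved = [], []
for _ in range(reps):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    crude.append(xs[0])            # X1 alone: unbiased, variance p(1-p)
    improved.append(sum(xs) / n)   # E(X1 | sum Xi) = X-bar: variance p(1-p)/n

# Both estimators are unbiased for p ...
assert abs(statistics.fmean(crude) - p) < 0.03
assert abs(statistics.fmean(improved) - p) < 0.03
# ... but conditioning on the sufficient statistic cuts the variance
# by roughly a factor of n.
assert statistics.pvariance(improved) < statistics.pvariance(crude) / 5.0
```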
A minimal sufficient statistic may, however, still contain additional information that is not relevant for drawing inferences about the population parameter. Selecting a complete minimal sufficient statistic yields a statistic that stores information only about the population parameter and carries no superfluous information.

Theorem 3.38 Lehmann–Scheffé. Let Θ̂ be a complete sufficient statistic. If unbiased estimators exist, then there exists a unique MVUE. We can obtain the MVUE as Θ̂* = E(θ̂ | Θ̂) for any unbiased estimator θ̂. The MVUE can also be characterized as the unique unbiased function Θ̂* = φ(Θ̂) of the complete sufficient statistic Θ̂.

This theorem helps to identify the best estimator in the class of unbiased estimators and complements the Rao–Blackwell theorem. Using this theorem, we can show under what conditions the unbiased estimator with the least variance is unique. In this way, a uniformly minimum variance unbiased estimator (UMVUE) can be obtained. "Uniformly" means that this estimator has the least variance in the class of unbiased estimators at every point of the parameter space.

3.9.1 The Method of Maximum Likelihood

One of the most popular methods for estimating parameters is the method of maximum likelihood. The advantages of this method are that it yields sufficient estimators whenever they exist, and that maximum likelihood estimators are, under regularity conditions, asymptotically minimum variance unbiased estimators.

Definition 3.50 Maximum likelihood estimator (MLE). Let x₁, x₂, …, xₙ be the values of a random sample from a population with the parameter θ. The likelihood function of the sample is then represented by

L(θ) = f(x₁, x₂, …, xₙ | θ) = Π_{i=1}^n f(xᵢ | θ)

for values of θ within a given domain. Note that f(x₁, x₂, …, xₙ; θ) is the value of the joint probability distribution or joint density of the random variables X₁, X₂, …, Xₙ at X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ. We refer to the value of θ that maximizes L(θ) as the maximum likelihood estimator of θ.
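Maximizing L(θ), or equivalently ln L(θ), can also be done numerically when no closed form is convenient. A sketch (standard library only; the data values and the search grid are arbitrary choices) for an exponential sample, where the maximizer should land on the sample mean x̄, as Example 3.10 derives analytically:

```python
import math

data = [0.8, 2.1, 1.3, 0.5, 3.2]   # hypothetical exponential observations

def log_likelihood(theta):
    # ln L(theta) = -n ln(theta) - (1/theta) * sum(x_i) for the
    # exponential density f(x | theta) = (1/theta) exp(-x/theta).
    return -len(data) * math.log(theta) - sum(data) / theta

# Crude grid search over theta in (0, 10).
grid = [0.01 * k for k in range(1, 1000)]
theta_hat = max(grid, key=log_likelihood)

# The numerical maximizer lands on the sample mean, matching the closed form.
xbar = sum(data) / len(data)
assert abs(theta_hat - xbar) < 0.01
```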
It is usually customary, and easier, to maximize the logarithm of the likelihood, ln L(θ). To maximize ln L(θ), we take the derivative of ln L(θ) with respect to θ and set the expression equal to zero, i.e., ∂ ln L(θ) / ∂θ = 0.

Example 3.10 If x₁, x₂, …, xₙ are the values of a random sample from an exponential population, find the maximum likelihood estimator of its parameter θ.

Solution According to the definition of the likelihood function, we have

L(θ) = Π_{i=1}^n f(xᵢ | θ) = (1/θ)ⁿ e^(−(1/θ) Σ_{i=1}^n xᵢ).

Differentiating ln L(θ) with respect to θ yields

d ln L(θ) / dθ = −n/θ + (1/θ²) Σ_{i=1}^n xᵢ = 0.

By solving this equation, we get the maximum likelihood estimate θ̂ = (1/n) Σ_{i=1}^n xᵢ = x̄. Hence, the maximum likelihood estimator is Θ̂ = X̄.

3.9.2 Linear Regression Model

One of the popular methods for studying the causal relationship between independent and dependent variables is the linear regression method. There are two types of relationships between variables: deterministic and probabilistic. In the deterministic form, the relationship between the two variables is exact. For example, we might have Y = βX, where the value of Y is determined by X. In the probabilistic form, on the other hand, the relationship between variables involves a random component, or random error. For example, we might have Y = βX + ε, containing two components: a deterministic component βX plus a random error ε.

3.9.3 General Linear Model

Let the model for linear regression be represented by

Y = Xβ + ε  with  ε ~ N(0, σ²I),

where

Y = (Y₁, Y₂, …, Yₙ)ᵀ,  β = (β₀, β₁, …, β_p)ᵀ,  ε = (ε₁, ε₂, …, εₙ)ᵀ,

and X is the n × (p + 1) design matrix whose i-th row is (1, x_{i1}, x_{i2}, …, x_{ip}).

Note that E(εᵢεⱼ) = 0 for i ≠ j and E(εᵢεⱼ) = σ² for i = j.

Our problem is to choose an estimate of the linear regression of the form Y = Xβ̂ + e, where β̂ is a (p + 1)-column vector estimating the vector β and e is an n-column vector of residuals.
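The model above can be fitted by solving the normal equations β̂ = (XᵀX)⁻¹XᵀY, derived in the next subsection. A small self-contained sketch (pure Python, a single covariate; the data are fabricated for illustration) solves the 2 × 2 system directly:

```python
# Fit Y = X beta + e for one covariate, i.e. design-matrix rows (1, x_i),
# by solving the 2x2 normal equations (X^T X) beta = X^T Y directly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly 2x plus small "errors"

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# X^T X = [[n, sx], [sx, sxx]] and X^T Y = [sy, sxy]; invert the 2x2 system.
det = n * sxx - sx * sx
b0 = (sxx * sy - sx * sxy) / det   # intercept beta_0
b1 = (n * sxy - sx * sy) / det     # slope beta_1

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
assert abs(sum(residuals)) < 1e-9   # OLS residuals sum to zero
assert 1.8 < b1 < 2.2               # slope near the generating value 2
```

For this data the solution is b0 = 0.05 and b1 = 1.99, and the residual sum vanishes, a general property of least squares whenever the design matrix includes an intercept column.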
3.9.4 Ordinary Least Squares Method (OLS)

The classical linear model estimation problem requires estimating the unknown parameters β₀, β₁, …, β_p and σ². The ordinary least squares method selects the values of β₀, β₁, …, β_p that minimize the sum of squared errors, i.e.,

S = Σ_{i=1}^n (Yᵢ − β₀ − β₁x_{i1} − ⋯ − β_p x_{ip})².

This can also be represented as

S = (Y − Xβ)ᵀ(Y − Xβ) = YᵀY − βᵀXᵀY − YᵀXβ + βᵀXᵀXβ = YᵀY − 2βᵀXᵀY + βᵀXᵀXβ.

Now, we differentiate S with respect to β and set the result to zero:

∂S/∂β = −2XᵀY + 2XᵀXβ̂ = 0.

Solving this equation yields β̂ = (XᵀX)⁻¹XᵀY.

Note that since ∂²S/∂β² = 2XᵀX is a positive definite matrix, β̂ minimizes S.

Definition 3.51 The best linear unbiased estimator (BLUE) of a parameter β based on data X

1. is a linear function of the observations;
2. is unbiased, i.e., E(β̂) = β; and
3. has the smallest variance among all unbiased linear estimators.

Finally, we end this chapter by presenting an important theorem below.

Theorem 3.39 Gauss–Markov theorem. If β̂ is the ordinary least squares estimator of β in the classical linear regression model, and if β̃ is any other linear unbiased estimator of β, then var(c′β̂) ≤ var(c′β̃), where c is any constant vector of the appropriate order.

In statistics, the Gauss–Markov theorem states that in a linear model whose errors have zero expectation, are uncorrelated, and have equal variances, the BLUE of the system coefficients is the least squares estimator.

References

Flury, B. (2013). A first course in multivariate statistics. Springer Science & Business Media.

Freund, J. E., Miller, I., & Miller, M. (2004). John E. Freund's mathematical statistics: With applications. Pearson Education India.