
Probability Theory

Stochastic Benchmarking


Chapter 3 Probability Theory

Probability theory is the branch of mathematical science that deals with the mathematical analysis of random events. Probability is commonly used to describe the mind's attitude toward statements that we are not sure of. Such statements usually take the form "Will a particular event occur?", and the attitude of our minds takes the form "How confident are we that this event will occur?" Our confidence can be described numerically by a value between 0 and 1 that we call a probability. The more likely an event is to occur, the more confident we are that it will occur. The focus of this chapter is mainly on probability spaces, random variables, multidimensional probability distributions, expected value, variance, and covariance.

3.1 Probability Space

In this section, we present some important definitions that form the basis of probability theory.

Definition 3.1 A random experiment is an experiment whose outcome is not known in advance.

Definition 3.2 An event that may or may not occur as a result of a random experiment is called a random event. For example, tossing a coin is a random experiment because its outcome is not known in advance; the coin coming up heads is a random event because it may or may not happen.

Definition 3.3 The set of all possible outcomes of a random experiment associated with the phenomenon is called the sample space, represented by the symbol Ω. An individual element ω of Ω is called a sample point.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. A. Amirteimoori et al., Stochastic Benchmarking, International Series in Operations Research & Management Science 317, https://doi.org/10.1007/978-3-030-89869-4_3

Definition 3.4 The sample space Ω is discrete if Ω is a finite set or a countably infinite set.

Definition 3.5 The sample space Ω is continuous if Ω is an uncountably infinite set.

Example 3.1
(i) Coin tossing: Ω = {H, T}.
(ii) Rolling one die: Ω = {1, 2, 3, 4, 5, 6}.
(iii) Picking one card at random from a pack of 52: Ω = {1, 2, . . ., 52}.
(iv) An integer-valued random outcome: Ω = {0, 1, 2, . . .}.
(v) A non-negative, real-valued outcome: Ω = ℝ₊.
(vi) A random continuous parameter (such as time, weather, price, wealth, or temperature): Ω = ℝ.

An event is a collection of outcomes, represented by a subset of Ω. We can consider a class F of events, i.e., a class F of subsets of Ω (not necessarily all of the power set of Ω, P(Ω)), as a σ-algebra according to the following definition:

Definition 3.6 A collection F of events is a σ-algebra if it satisfies the following conditions:
(i) ∅ ∈ F;
(ii) For all countable sequences (Aₙ)ₙ≥₁ such that Aₙ ∈ F, n ≥ 1, we have ∪ₙ≥₁ Aₙ ∈ F;
(iii) A ∈ F ⟹ (Ω∖A) ∈ F.

Since A is an event, we expect its complement A′ to be an event. On the other hand, Ω is also an event because Ω ⊆ Ω, so we expect ∅ to be an event. Ω is called the "certain event" and ∅ the "impossible event." Since F is a subset of the set of all subsets of Ω, we call it an "event space."

Example 3.2 Rolling one die: Ω = {1, 2, 3, 4, 5, 6}. The event A = {1, 3, 5} corresponds to "the result of the experiment is an odd number."

F ≔ {Ω, ∅, {1, 3, 5}, {2, 4, 6}} defines a σ-algebra on Ω which corresponds to knowing the parity of an integer picked at random from 1 to 6.

G ≔ {Ω, ∅, {2, 4, 6}, {2, 4}, {6}, {1, 2, 3, 4, 5}, {1, 3, 5, 6}, {1, 3, 5}} defines a σ-algebra on Ω which is bigger than F and corresponds to the parity information contained in F, completed by the knowledge of whether or not the outcome equals 6.

Definition 3.7 Probability Measure. A probability measure is a mapping ℙ : F → [0, 1] that assigns to each event A ∈ F a value in the interval [0, 1] and satisfies the following three axioms:
(a) For any event A ∈ F, ℙ(A) ≥ 0;
(b) ℙ(Ω) = 1;
(c) ℙ(∪ₙ≥₁ Aₙ) = ∑ₙ≥₁ ℙ(Aₙ), whenever Aᵢ ∩ Aₖ = ∅, i ≠ k.
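For a finite Ω, the conditions of Definition 3.6 can be checked mechanically, since closure under countable unions reduces to closure under pairwise unions. A minimal Python sketch (the helper name is our own), applied to the families F and G of Example 3.2:

```python
from itertools import combinations

OMEGA = frozenset({1, 2, 3, 4, 5, 6})

def is_sigma_algebra(family, omega):
    """Check conditions (i)-(iii) of Definition 3.6 for a finite family."""
    fam = {frozenset(a) for a in family}
    has_empty = frozenset() in fam                              # condition (i)
    closed_compl = all(omega - a in fam for a in fam)           # condition (iii)
    # condition (ii): on a finite sample space, pairwise unions suffice
    closed_union = all(a | b in fam for a, b in combinations(fam, 2))
    return has_empty and closed_compl and closed_union

F = [set(), {1, 3, 5}, {2, 4, 6}, OMEGA]
G = [set(), {2, 4, 6}, {2, 4}, {6}, {1, 2, 3, 4, 5}, {1, 3, 5, 6}, {1, 3, 5}, OMEGA]
H = [set(), {1}, OMEGA]  # not a sigma-algebra: the complement {2,...,6} is missing

print(is_sigma_algebra(F, OMEGA))  # True
print(is_sigma_algebra(G, OMEGA))  # True
print(is_sigma_algebra(H, OMEGA))  # False
```

Note that G passes because it is exactly the σ-algebra generated by the partition {1, 3, 5}, {2, 4}, {6}, with its 2³ = 8 members.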
These axioms were put forward by the Russian mathematician and statistician Andrey Kolmogorov in 1933 as the axioms of the probability function and have since been known as the Kolmogorov axioms. The triple (Ω, F, ℙ) is called a probability space; this setting is generally referred to as the Kolmogorov framework. Based on the Kolmogorov axioms, many theorems have been established for the probability function. Some important ones are given below.

Theorem 3.1 Let (Ω, F, ℙ) be a probability space. Then:
(a) The probability of the event ∅ is zero, i.e., ℙ(∅) = 0.
(b) The probability of the union of two disjoint events is the sum of their probabilities, i.e., ℙ(A ∪ B) = ℙ(A) + ℙ(B), where A and B are disjoint events.
(c) The probability of the complement of an event A is ℙ(A′) = 1 − ℙ(A).
(d) If event A is a subset of event B, then ℙ(A) ≤ ℙ(B).
(e) If C is the union of two events A and B, then ℙ(C) = ℙ(A ∪ B) = ℙ(A) + ℙ(B) − ℙ(A ∩ B).

Definition 3.8 Let A and B be two events in a sample space Ω, where ℙ(A) ≠ 0; then the conditional probability of B given A is ℙ(B|A) = ℙ(A ∩ B)/ℙ(A). It follows immediately that if A and B are two events in the sample space Ω with ℙ(A) ≠ 0, then ℙ(A ∩ B) = ℙ(A)·ℙ(B|A). Moreover, given that ℙ(B) ≠ 0, since ℙ(A|B) = ℙ(A ∩ B)/ℙ(B), we also have ℙ(A ∩ B) = ℙ(B)·ℙ(A|B).

Definition 3.9 Two events A and B are independent if and only if ℙ(A ∩ B) = ℙ(A)·ℙ(B). In other words, two events A and B are independent if the occurrence or non-occurrence of one event does not affect the probability of the occurrence or non-occurrence of the other.

In many cases, the result of an experiment depends on what happened at different intermediate stages. The following theorem on the law of total probability deals with this issue.

Theorem 3.3 Law of Total Probability. Let the events A₁, A₂, . . ., Aₙ constitute a partition of the sample space Ω, where ℙ(Aᵢ) ≠ 0 for i = 1, 2, . . ., n. Then, for any event B in Ω,

ℙ(B) = ∑_{i=1}^{n} ℙ(Aᵢ)ℙ(B|Aᵢ).

In probability theory and statistics, Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. For example, if the probability that someone has cancer is related to his or her age, then using Bayes' theorem the age can be used to assess the probability of cancer more accurately than could be done without knowledge of the age.

Theorem 3.4 Let the events A₁, A₂, . . ., Aₙ constitute a partition of the sample space Ω, where ℙ(Aᵢ) ≠ 0 for i = 1, 2, . . ., n. Then, for any event B in Ω with ℙ(B) ≠ 0,

ℙ(Aᵣ|B) = ℙ(Aᵣ)·ℙ(B|Aᵣ) / ∑_{i=1}^{n} ℙ(Aᵢ)ℙ(B|Aᵢ),  r = 1, . . ., n.

3.2 Random Variable

Definition 3.10 Let (Ω, F, ℙ) be a probability space. A real-valued random variable on this probability space is a measurable mapping X : (Ω, F) → (ℝ, B(ℝ)), i.e., X⁻¹(G) ∈ F for all G ∈ B(ℝ). Note that X is a random variable from a probability space Ω into the state space ℝ, which maps each ω ∈ Ω to X(ω) ∈ ℝ.

Definition 3.11 A random variable X is continuous when Ω is continuous and discrete when Ω is discrete.

Example 3.3 When we roll two dice, Ω ≔ {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6} = {(1, 1), (1, 2), . . ., (6, 6)}. Consider X : Ω → ℝ with (k, l) ↦ k + l. Then X is a random variable that gives the sum of the two numbers appearing on the dice.

Definition 3.12 If X is a discrete random variable on (Ω, F, ℙ), the function f(x) = P(X = x), for each x in the range of X, is called the probability distribution of X.

Proposition 3.1 Let X be a discrete random variable on (Ω, F, ℙ). Then f is a probability distribution of X if and only if it satisfies the following conditions:
(a) f(x) ≥ 0, for each value of x;
(b) ∑ₓ f(x) = 1.
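The conditions of Proposition 3.1 lend themselves to a mechanical check. A quick Python sketch (the helper name and the candidate distributions are our own illustrations, kept exact with `fractions.Fraction`):

```python
from fractions import Fraction

def is_probability_distribution(f, support):
    """Check conditions (a) and (b) of Proposition 3.1 on a finite support."""
    values = [f(x) for x in support]
    return all(v >= 0 for v in values) and sum(values) == 1

# A made-up candidate: f(x) = x/10 for x = 1, 2, 3, 4; the values sum to 10/10 = 1.
def f(x):
    return Fraction(x, 10)

# The same rule on x = 1, 2, 3 fails condition (b): the values sum to 6/10.
print(is_probability_distribution(f, range(1, 5)))  # True
print(is_probability_distribution(f, range(1, 4)))  # False
```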
Example 3.4 (Freund et al., 2004) Check whether the function given by f(x) = (x + 2)/25, for x = 1, 2, 3, 4, 5, can serve as the probability distribution of a discrete random variable.

Substituting the possible values into the function gives f(1) = 3/25, f(2) = 4/25, f(3) = 5/25, f(4) = 6/25, f(5) = 7/25. Since these values are all non-negative, the first condition of Proposition 3.1 is satisfied, and since f(1) + f(2) + f(3) + f(4) + f(5) = 3/25 + 4/25 + 5/25 + 6/25 + 7/25 = 1, the second condition is also satisfied. Thus, the given function can serve as the probability distribution of a random variable having the range {1, 2, 3, 4, 5}.

Definition 3.13 Distribution Function. If X is a discrete random variable on (Ω, F, ℙ), the function F_X defined by

F_X(x) = P(X ≤ x) = ∑_{t≤x} f(t), for −∞ < x < ∞,

is called the distribution function, or the cumulative distribution, of X, in which f(t) is the value of the probability distribution of X at t.

Proposition 3.2 Let F_X be the cumulative distribution function of a discrete random variable X. Then:
(a) F_X is a non-decreasing function;
(b) F_X(−∞) = 0 and F_X(+∞) = 1.

Definition 3.14 Probability density function. A function f_X : ℝ → ℝ₊ is called a probability density function of the continuous random variable X if and only if

P(a ≤ X ≤ b) = ∫_{a}^{b} f_X(x) dx,

for any real constants a and b with a ≤ b. Probability density functions are also referred to as probability densities, density functions, densities, or pdfs.

Theorem 3.5 If X is a continuous random variable on (Ω, F, ℙ) and a and b are real constants with a ≤ b, then P(a ≤ X ≤ b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a < X < b).

Proposition 3.3 Let X be a continuous random variable on (Ω, F, ℙ). Then f_X is a probability density of X if and only if it satisfies the following conditions:
(a) f_X(x) ≥ 0, for −∞ < x < ∞;
(b) ∫_{−∞}^{∞} f_X(x) dx = 1.

Example 3.5 (Freund et al., 2004) If X has the probability density

f_X(x) = k·e^{−3x} for x > 0; 0 otherwise,

find k and P(0.5 ≤ X ≤ 1).

Solution: To satisfy the second condition of Proposition 3.3, we must have ∫_{−∞}^{∞} f_X(x) dx = ∫_{0}^{∞} k·e^{−3x} dx = k·[−e^{−3x}/3]_{0}^{∞} = k/3 = 1. It follows that k = 3. Moreover, P(0.5 ≤ X ≤ 1) = ∫_{0.5}^{1} 3e^{−3x} dx = [−e^{−3x}]_{0.5}^{1} = −e^{−3} + e^{−1.5} = 0.173.

Definition 3.15 Let X be a continuous random variable on (Ω, F, ℙ). The distribution function, or cumulative distribution function (CDF), of X, F_X : ℝ → [0, 1], is defined by

F_X(x) ≔ P(X ≤ x) = ∫_{−∞}^{x} f_X(t) dt, for x ∈ ℝ,

where f_X(t) is the value of the probability density of X at t. That is, F_X(x) is the probability that the random variable X takes a value less than or equal to x.

Proposition 3.4 Let F_X be a cumulative distribution function. Then:
(a) 0 ≤ F_X(x) ≤ 1;
(b) F_X is a non-decreasing function;
(c) F_X(−∞) = lim_{x→−∞} F_X(x) = 0 and F_X(+∞) = lim_{x→+∞} F_X(x) = 1.

Theorem 3.6 If f_X(x) and F_X(x) are the probability density and cumulative distribution functions of the random variable X, respectively, then for any real constants a and b with a ≤ b, we have P(a < X ≤ b) = F_X(b) − F_X(a) and f_X(x) = dF_X(x)/dx.

Remark 3.1 For a continuous random variable X, P(X = a) = 0.

Example 3.6 Find the distribution function of the random variable X in Example 3.5 and use it to re-evaluate P(0.5 ≤ X ≤ 1).

For x > 0, F_X(x) = ∫_{−∞}^{x} f_X(t) dt = ∫_{0}^{x} 3e^{−3t} dt = [−e^{−3t}]_{0}^{x} = 1 − e^{−3x}, and since F_X(x) = 0 for x ≤ 0, we can write

F_X(x) = 0 for x ≤ 0; 1 − e^{−3x} for x > 0.

Now, to determine P(0.5 ≤ X ≤ 1), we use Theorem 3.6: P(0.5 ≤ X ≤ 1) = F_X(1) − F_X(0.5) = (1 − e^{−3}) − (1 − e^{−1.5}) = 0.173.

In probability theory, a multivariate random variable, or random vector, is a list of random variables, the values of which are unknown.
For example, while a given person has a specific age, height, and weight, the representation of these features for an unspecified person in a group would be a random vector. In analogy with the one-dimensional case, we now define multivariate random variables, or random vectors, as multivariate functions.

Definition 3.16 An n-dimensional random variable, or random vector, X is a (measurable) function from the probability space Ω to ℝⁿ, i.e., X : Ω → ℝⁿ.

Definition 3.17 Joint Probability Distribution. Let X₁, X₂, . . ., Xₙ be discrete random variables. The joint probability distribution of X₁, X₂, . . ., Xₙ is defined as

f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) = P(X₁ = x₁, X₂ = x₂, . . ., Xₙ = xₙ),

where xᵢ ∈ Range(Xᵢ), i = 1, . . ., n.

Theorem 3.7 A multi-dimensional function can serve as the joint probability distribution of the discrete random variables X₁, X₂, . . ., Xₙ if and only if its values f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) satisfy the following conditions:
(a) f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) ≥ 0 for each (x₁, x₂, . . ., xₙ) within its domain;
(b) ∑_{x₁} ∑_{x₂} . . . ∑_{xₙ} f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) = 1, where the multiple summations extend over all possible x₁, x₂, . . ., xₙ within its domain.

Definition 3.18 Joint Cumulative Distribution. Let X₁, X₂, . . ., Xₙ be discrete random variables. The joint distribution function, or joint cumulative distribution, of X₁, X₂, . . ., Xₙ is defined as

F_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) = P(X₁ ≤ x₁, X₂ ≤ x₂, . . ., Xₙ ≤ xₙ) = ∑_{t₁≤x₁} ∑_{t₂≤x₂} . . . ∑_{tₙ≤xₙ} f_{X₁,X₂,...,Xₙ}(t₁, t₂, . . ., tₙ),

for x₁, x₂, . . ., xₙ ∈ ℝ, for all (x₁, x₂, . . ., xₙ) within the range of X₁, X₂, . . ., Xₙ.

Definition 3.19 Joint Probability Density Function. A function f_{X₁,...,Xₙ} : ℝⁿ → ℝ₊ is called a joint probability density function of the continuous random variables X₁, X₂, . . ., Xₙ if and only if

P((X₁, X₂, . . ., Xₙ) ∈ A) = ∫∫. . .∫_A f_{X₁,...,Xₙ}(x₁, x₂, . . ., xₙ) dx₁ dx₂ . . . dxₙ,

for any region A in ℝⁿ.

Theorem 3.8 Let X₁, X₂, . . ., Xₙ be continuous random variables on (Ω, F, ℙ). Then f_{X₁,...,Xₙ} is a joint probability density function of X₁, X₂, . . ., Xₙ if and only if it satisfies the following conditions:
(a) f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) ≥ 0, for x₁, x₂, . . ., xₙ ∈ ℝ;
(b) ∫_{−∞}^{∞} ∫_{−∞}^{∞} . . . ∫_{−∞}^{∞} f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) dx₁ dx₂ . . . dxₙ = 1.

Definition 3.20 Joint Distribution Function. If X₁, X₂, . . ., Xₙ are continuous random variables on (Ω, F, ℙ), then the function given by

F_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) = P(X₁ ≤ x₁, X₂ ≤ x₂, . . ., Xₙ ≤ xₙ) = ∫_{−∞}^{x₁} ∫_{−∞}^{x₂} . . . ∫_{−∞}^{xₙ} f_{X₁,X₂,...,Xₙ}(t₁, t₂, . . ., tₙ) dt₁ dt₂ . . . dtₙ, for x₁, x₂, . . ., xₙ ∈ ℝ,

is called the joint distribution function of X₁, X₂, . . ., Xₙ, in which f_{X₁,X₂,...,Xₙ}(t₁, t₂, . . ., tₙ) is the joint probability density of X₁, X₂, . . ., Xₙ at (t₁, t₂, . . ., tₙ).

Theorem 3.9
(a) F_{X₁,X₂,...,Xₙ}(−∞, −∞, . . ., −∞) = 0;
(b) F_{X₁,X₂,...,Xₙ}(∞, ∞, . . ., ∞) = 1;
(c) If a₁ < b₁, a₂ < b₂, . . ., aₙ < bₙ, then F_{X₁,X₂,...,Xₙ}(a₁, a₂, . . ., aₙ) ≤ F_{X₁,X₂,...,Xₙ}(b₁, b₂, . . ., bₙ);
(d) f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) = ∂ⁿ/(∂x₁ . . . ∂xₙ) F_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ).

Theorem 3.10 Let Xᵢ, i = 1, . . ., n, be random variables with probability distributions f_{Xᵢ}(xᵢ), i = 1, . . ., n, and let f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) be the joint probability distribution function of X₁, X₂, . . ., Xₙ. Then X₁, X₂, . . ., Xₙ are independent if and only if

f_{X₁,X₂,...,Xₙ}(x₁, x₂, . . ., xₙ) = f_{X₁}(x₁)·f_{X₂}(x₂)· . . . ·f_{Xₙ}(xₙ),

for all (x₁, x₂, . . ., xₙ) within their range.

3.3 Mathematical Expectation

The expectation, or expected value, of a random variable X is the mean or average value of X. In practice, expectations can be even more useful than probabilities. For example, bank deposits or the prices of the inputs of the DMUs can be considered as random variables.
In such cases, we usually refer to their expected values rather than their actual values.

Definition 3.21 Expected Value. If X is a discrete random variable and f_X(x) is its probability distribution function, the expected value of X is E(X) = ∑ₓ x·f_X(x). Correspondingly, if X is a continuous random variable and f_X(x) is its probability density, the expected value of X is E(X) = ∫_{−∞}^{∞} x·f_X(x) dx.

Theorem 3.11 If X is a discrete random variable and f_X(x) is its probability distribution function, the expected value of g(X), as a function of X, is given by E[g(X)] = ∑ₓ g(x)·f_X(x). Correspondingly, if X is a continuous random variable and f_X(x) is its probability density function, then the expected value of g(X), as a function of X, is given by E[g(X)] = ∫_{−∞}^{∞} g(x)·f_X(x) dx.

Theorem 3.12 If a and b are constants, then E(aX + b) = aE(X) + b.

Theorem 3.13 If c₁, c₂, . . ., cₙ are constants, then E[∑_{i=1}^{n} cᵢgᵢ(X)] = ∑_{i=1}^{n} cᵢE[gᵢ(X)].

Theorem 3.14 If X₁, X₂, . . ., Xₙ are independent random variables, then E(X₁X₂ . . . Xₙ) = E(X₁)E(X₂) . . . E(Xₙ).

Definition 3.22 Variance. Let X be a random variable with a finite expected value μ. Then the variance of X, denoted by σ², σ²_X, or var(X), is defined by σ² = E[(X − μ)²]. The positive square root of the variance, σ, is called the standard deviation of X.

Theorem 3.15 σ² = E(X²) − E(X)².

Theorem 3.16 If X has variance σ² and a and b are constants, then var(aX + b) = a²σ².

Here we present Chebyshev's inequality, which enables us to derive bounds on probabilities when only the mean, or both the mean and the variance, of the probability distribution are known; it is valid for all distributions of a random variable.

Theorem 3.17 Chebyshev's inequality. If X is a random variable with finite mean μ and variance σ², then for any value k > 0,

ℙ(|X − μ| ≥ k) ≤ σ²/k².

Definition 3.23 Covariance. Let X and Y be two random variables with finite expected values μ_X and μ_Y, respectively. Then the covariance of X and Y, denoted by σ_{XY} or cov(X, Y), is defined by cov(X, Y) = E[(X − μ_X)(Y − μ_Y)].

Theorem 3.18 cov(X, Y) = E(XY) − E(X)E(Y). It can easily be shown that if X and Y are two independent random variables, then cov(X, Y) = 0.

Theorem 3.19 If X₁, X₂, . . ., Xₙ are random variables and Y = ∑_{i=1}^{n} aᵢXᵢ, where a₁, a₂, . . ., aₙ are constants, then

var(Y) = ∑_{i=1}^{n} aᵢ²·var(Xᵢ) + 2∑∑_{i<j} aᵢaⱼ·cov(Xᵢ, Xⱼ).

Proof. See Freund et al. (2004).

Corollary 3.1 If X₁, X₂, . . ., Xₙ are independent random variables and Y = ∑_{i=1}^{n} aᵢXᵢ, then var(Y) = ∑_{i=1}^{n} aᵢ²·var(Xᵢ).

Theorem 3.20 If X₁, X₂, . . ., Xₙ are random variables and Y₁ = ∑_{i=1}^{n} aᵢXᵢ and Y₂ = ∑_{i=1}^{n} bᵢXᵢ, where a₁, a₂, . . ., aₙ, b₁, b₂, . . ., bₙ are constants, then

cov(Y₁, Y₂) = ∑_{i=1}^{n} aᵢbᵢ·var(Xᵢ) + ∑∑_{i<j} (aᵢbⱼ + aⱼbᵢ)·cov(Xᵢ, Xⱼ).

Corollary 3.2 If the random variables X₁, X₂, . . ., Xₙ are independent, Y₁ = ∑_{i=1}^{n} aᵢXᵢ and Y₂ = ∑_{i=1}^{n} bᵢXᵢ, then cov(Y₁, Y₂) = ∑_{i=1}^{n} aᵢbᵢ·var(Xᵢ).

Definition 3.24 Correlation Coefficient. The correlation coefficient of two random variables X and Y, denoted by ρ(X, Y), is defined by

ρ(X, Y) = cov(X, Y) / (√var(X)·√var(Y)).

It can be shown that −1 ≤ ρ(X, Y) ≤ 1. The correlation coefficient is a measure of the degree of linearity between X and Y. A value of ρ(X, Y) near +1 or −1 indicates a high degree of linearity between X and Y, whereas a value near 0 indicates that such linearity is absent. A positive value of ρ(X, Y) indicates that Y tends to increase when X does, whereas a negative value indicates that Y tends to decrease when X increases. If ρ(X, Y) = 0, then X and Y are said to be uncorrelated.
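The identities of Theorems 3.15 and 3.18 and Definition 3.24 can be verified exactly on a small discrete joint distribution. A Python sketch (the joint distribution of (X, Y) below is a made-up illustration, kept exact with `fractions.Fraction`):

```python
from fractions import Fraction
import math

# A hypothetical joint distribution of (X, Y); the four probabilities sum to 1.
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def E(g):
    """Expected value of g(X, Y) under the joint distribution (Theorem 3.11)."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

mu_x, mu_y = E(lambda x, y: x), E(lambda x, y: y)
var_x = E(lambda x, y: (x - mu_x) ** 2)
var_y = E(lambda x, y: (y - mu_y) ** 2)

# Two routes to the covariance: Definition 3.23 and Theorem 3.18.
cov_def = E(lambda x, y: (x - mu_x) * (y - mu_y))
cov_thm = E(lambda x, y: x * y) - mu_x * mu_y

# Definition 3.24: the correlation coefficient.
rho = float(cov_def) / math.sqrt(float(var_x) * float(var_y))
print(cov_def, cov_thm, round(rho, 4))
```

For this distribution both routes give cov(X, Y) = 1/16 exactly, and ρ lies in [−1, 1] as Definition 3.24 promises.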
3.4 Discrete Distributions

In this section, we introduce some of the most commonly used probability distributions. Due to the importance of the normal and chi-square distributions, these two distributions are studied in more detail.

Definition 3.25 Discrete uniform distribution. A random variable X has a discrete uniform distribution, and is referred to as a discrete uniform random variable, if and only if its probability distribution is given by f(x) = 1/k, for x = x₁, x₂, . . ., x_k. A discrete uniform random variable takes each of its values with equal probability.

Definition 3.26 Bernoulli distribution. A random variable X has a Bernoulli distribution, and is referred to as a Bernoulli random variable, if and only if its probability distribution is given by f(x; p) = pˣ(1 − p)^{1−x}, for x = 0, 1.

Theorem 3.21 Let X be a Bernoulli random variable. The mean and variance of X are then μ = p and σ² = p(1 − p), respectively.

Definition 3.27 Binomial distribution. A random variable X has a binomial distribution, and is referred to as a binomial random variable, if and only if its probability distribution is given by

b(x; n, p) = C(n, x)·pˣ(1 − p)^{n−x}, for x = 0, 1, . . ., n,

where C(n, x) denotes the binomial coefficient.

Theorem 3.22 Let X be a binomial random variable. The mean and variance of X are then μ = np and σ² = np(1 − p), respectively.

Definition 3.28 Poisson distribution. A random variable X has a Poisson distribution, and is referred to as a Poisson random variable, if and only if its probability distribution is given by

p(x; λ) = λˣe^{−λ}/x!, for x = 0, 1, 2, . . .,

where λ is the mean number of successes in the given time interval or region.

Theorem 3.23 Let X be a Poisson random variable. The mean and variance of X are then μ = λ and σ² = λ, respectively.

3.5 Continuous Distributions

Definition 3.29 Continuous uniform distribution. A random variable X has a continuous uniform distribution, and is referred to as a continuous uniform random variable, if and only if its probability density is given by

f(x) = 1/(b − a) for a ≤ x ≤ b; 0 elsewhere.

In other words, the random variable X is uniformly distributed over the interval [a, b].

Theorem 3.24 Let X be a uniform random variable. The mean and variance of X are then μ = (a + b)/2 and σ² = (b − a)²/12, respectively.

Definition 3.30 Gamma function. The gamma function of α, denoted by Γ(α), is defined as

Γ(α) = ∫_{0}^{∞} x^{α−1}e^{−x} dx.

Corollary 3.3 For a positive integer α, Γ(α) = (α − 1)!.

Definition 3.31 Gamma distribution. A random variable X has a gamma distribution, and is referred to as a gamma random variable, if and only if its probability density function is given by

g(x; α, β) = (1/(β^α Γ(α)))·x^{α−1}e^{−x/β} for x > 0; 0 elsewhere,

where α > 0 and β > 0.

Theorem 3.25 Let X be a gamma random variable. The mean and variance of X are then μ = αβ and σ² = αβ², respectively.

Definition 3.32 Exponential distribution. A random variable X has an exponential distribution, and is referred to as an exponential random variable, if and only if its probability density is given by

g(x; θ) = (1/θ)e^{−x/θ} for x > 0; 0 elsewhere,

where θ > 0.

Note: The exponential distribution is a special case of the gamma distribution with α = 1 and β = θ.

Theorem 3.26 Let X be an exponential random variable. The mean and variance of X are then μ = θ and σ² = θ², respectively.

Definition 3.33 Weibull distribution. A random variable X has a Weibull distribution, and is referred to as a Weibull random variable, if and only if its probability density is given by

f(x) = kx^{β−1}e^{−αx^β} for x > 0; 0 elsewhere,

where α > 0, β > 0, and k = αβ so that the density integrates to 1.

Note: The exponential distribution is a special case of the Weibull distribution with β = 1.
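As a concrete check of Definition 3.27 and Theorem 3.22, the binomial mean and variance can be computed exactly from the probability distribution itself. A Python sketch (the parameters n = 10 and p = 3/10 are arbitrary choices for illustration):

```python
import math
from fractions import Fraction

def binomial_pmf(x: int, n: int, p: Fraction) -> Fraction:
    """b(x; n, p) from Definition 3.27, evaluated exactly."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, Fraction(3, 10)
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                                            # condition (b): 1
mean = sum(x * pmf[x] for x in range(n + 1))                # Theorem 3.22: n p
var = sum((x - mean) ** 2 * pmf[x] for x in range(n + 1))   # n p (1 - p)
print(total, mean, var)
```

With exact rational arithmetic, the probabilities sum to 1, the mean equals np = 3, and the variance equals np(1 − p) = 21/10.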
Definition 3.34 Beta distribution. A random variable X has a beta distribution, and is referred to as a beta random variable, if and only if its probability density is given by

f(x; α, β) = (Γ(α + β)/(Γ(α)Γ(β)))·x^{α−1}(1 − x)^{β−1} for 0 < x < 1; 0 elsewhere,

where α > 0 and β > 0.

Theorem 3.27 Let X be a beta random variable. The mean and variance of X are then μ = α/(α + β) and σ² = αβ/((α + β)²(α + β + 1)), respectively.

3.6 The Normal Distribution

In probability theory, the normal distribution is one of the most important statistical distributions. This distribution is sometimes referred to as the Gaussian distribution or the Laplace–Gauss distribution. In this section, we briefly summarize the properties of this distribution.

Definition 3.35 Normal Distribution. A random variable X has a normal distribution with expectation μ and variance σ², i.e., X ~ N(μ, σ²), if and only if its probability density is given by

f(x) = (1/(σ√(2π)))·e^{−(1/2)((x−μ)/σ)²}, for x ∈ ℝ,

where σ > 0.

The graph of a normal distribution is shown in Fig. 3.1 (graph of the normal distribution). It is shaped like the cross-section of a bell. μ and σ are the two parameters that play a key role in the shape of the normal distribution.

Definition 3.36 Standard Normal Distribution. The normal random variable with μ = 0 and σ = 1 is referred to as the standard normal random variable.

Theorem 3.28 If X has a normal distribution with mean μ and standard deviation σ, then Z = (X − μ)/σ has a standard normal distribution.

Proof See Freund et al. (2004).

Example 3.7 If X is a normal random variable with μ = 3 and σ = 4, find P(4 ≤ X ≤ 8).

Solution:

P(4 ≤ X ≤ 8) = P((4 − 3)/4 ≤ (X − 3)/4 ≤ (8 − 3)/4) = P(0.25 ≤ Z ≤ 1.25) = P(Z ≤ 1.25) − P(Z ≤ 0.25) = 0.8944 − 0.5987 = 0.2957.

Definition 3.37 Multivariate Normal Random Variable. Let X = (X₁, . . ., X_k) be a k-dimensional random variable. X has a multivariate normal distribution with mean vector μ and variance–covariance matrix Σ, i.e., X ~ N(μ, Σ), if and only if its probability density is given by

f_X(x₁, . . ., x_k) = (1/√((2π)^k |Σ|))·e^{−(1/2)(X−μ)ᵀΣ⁻¹(X−μ)}, for xᵢ ∈ ℝ, i = 1, . . ., k.

Note that μ = E(X) = [E(X₁), E(X₂), . . ., E(X_k)] and Σ = [cov(Xᵢ, Xⱼ); 1 ≤ i, j ≤ k].

Example 3.8 Suppose Z = (X, Y) is a bivariate normal random variable. Let the mean vector and variance–covariance matrix of Z be μ = (μ_X, μ_Y) and

Σ = [σ²_X, ρσ_Xσ_Y; ρσ_Xσ_Y, σ²_Y],

respectively. Then

f_{X,Y}(x, y) = (1/(2πσ_Xσ_Y√(1 − ρ²)))·exp{−(1/(2(1 − ρ²)))·[(x − μ_X)²/σ²_X + (y − μ_Y)²/σ²_Y − 2ρ(x − μ_X)(y − μ_Y)/(σ_Xσ_Y)]}.

Note that ρ is the correlation coefficient between X and Y.

Definition 3.38 Half-normal distribution. A random variable X has a half-normal distribution, and is referred to as a half-normal random variable, if and only if its probability density function is given by

f(x; σ) = (√2/(σ√π))·exp(−x²/(2σ²)), for x > 0.

Let Z follow a normal distribution, i.e., Z ~ N(0, σ²). Then X = |Z| follows a half-normal distribution. The graph of a normal distribution along with a half-normal distribution is shown in Fig. 3.2 (PDFs of the normal distribution with mean zero and the half-normal distribution).

3.7 The Chi-Square Distribution

In probability theory and statistics, the chi-square distribution (χ²-distribution) with k degrees of freedom is the distribution of the sum of the squares of k independent standard normal random variables. Due to the importance of this distribution, we will discuss it here.
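Example 3.7 can be reproduced numerically: the standard normal CDF Φ can be written in terms of the error function as Φ(z) = (1 + erf(z/√2))/2, which Python's standard library supports via `math.erf`. A sketch:

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example 3.7: X ~ N(mu = 3, sigma^2 = 16); find P(4 <= X <= 8).
mu, sigma = 3.0, 4.0

# Theorem 3.28: standardize with Z = (X - mu) / sigma.
z_lo, z_hi = (4.0 - mu) / sigma, (8.0 - mu) / sigma   # 0.25 and 1.25
p = phi(z_hi) - phi(z_lo)
print(p)
```

The result agrees with the table-based value 0.2957 to about four decimal places (tables round Φ(1.25) and Φ(0.25) before subtracting).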
Definition 3.39 Chi-Square Distribution. A random variable X has a chi-square distribution with k degrees of freedom, i.e., X ~ χ²(k), and is referred to as a chi-square random variable, if and only if its probability density function is given by

f(x; k) = (1/(2^{k/2}Γ(k/2)))·x^{k/2−1}e^{−x/2} for x > 0; 0 otherwise,

where k is referred to as the degrees of freedom.

Corollary 3.4 The mean and variance of the chi-square distribution with k degrees of freedom are μ = k and σ² = 2k, respectively.

Theorem 3.29 Suppose X₁, X₂, . . ., Xₙ are n independent random variables having standard normal distributions. Then Y = ∑_{i=1}^{n} Xᵢ² has a chi-square distribution with n degrees of freedom.

Proof See Freund et al. (2004).

Theorem 3.30 Suppose that X is a k-dimensional random vector, X ~ N(μ, Σ), where Σ is positive definite. Then (X − μ)ᵀΣ⁻¹(X − μ) follows a chi-square distribution with k degrees of freedom.

Proof See Flury (2013).

Remark 3.2 The curves of the form (X − μ)ᵀΣ⁻¹(X − μ) = constant > 0 are ellipses, or ellipsoids in higher dimensions.

Returning to Example 3.8, to find an ellipse within which X falls with probability α, we need to set (X − μ)ᵀΣ⁻¹(X − μ) equal to the α-quantile of the chi-square distribution with two degrees of freedom. If U is a chi-square random variable with two degrees of freedom, then its distribution function is

F(u) = P(U ≤ u) = 0 for u ≤ 0; 1 − e^{−u/2} for u > 0.

Therefore, the α-quantile is computed as follows:

F(u) = α ⟹ 1 − e^{−u/2} = α ⟹ u = −2 log(1 − α).

By choosing different values for α ∈ (0, 1), we obtain different values for u. Therefore, for each α, we have a quadratic curve in the (x₁, x₂)-plane. For instance, for μ = (3, 4)ᵀ and Σ = [2, 1; 1, 1], we have

(X − μ)ᵀΣ⁻¹(X − μ) = (x₁ − 3, x₂ − 4)·[1, −1; −1, 2]·(x₁ − 3, x₂ − 4)ᵀ = c²,

where c² = −2 log(1 − α). Figure 3.3 shows the resulting curves for α = 0.1, 0.2, . . ., 0.9 (Fig. 3.3: Ellipses of constant density of a bivariate normal distribution; the ellipses represent the regions within which X falls with probability α = 0.1, 0.2, . . ., 0.9).

3.8 Sampling Distributions

As we know, the events of a statistical experiment are determined numerically by random variables. The total set of observations surveyed is called the population, and the number of population members is called the size of the population. The observations are values of a random variable, and since each random variable has a probability distribution, each statistical population can be assigned a random variable and thus has a probability distribution. For example, if the random variable corresponding to the observations of a population is a normal random variable, that population is called a normal population. Since we define a population as a set, it can have a finite or infinite number of members. On the other hand, since we do not have access to all observations in an infinite population, we do not know the distribution of the population, nor do we know its mean and variance. Therefore, we call the mean and variance of the population the population parameters, and they must be estimated. For this purpose, we need a sample from the population.

Definition 3.40 Random Sample. If X₁, X₂, . . ., Xₙ are independent and identically distributed random variables, we say that they constitute a random sample from the infinite population given by their common distribution.

Note: If X₁, X₂, . . ., Xₙ are n independent and identically distributed random variables with the same probability function, then the probability distribution of this random sample is f(x₁, x₂, . . ., xₙ) = f(x₁)f(x₂). . .f(xₙ).

Now, to estimate the population parameters, we first introduce the mean and variance of the sample. These values calculated from a random sample are called statistics, and since they depend on the sample, and many different random samples can be drawn from the population, each statistic is a random variable.
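The α-quantile computation above is easy to check numerically: plugging u = −2 log(1 − α) back into the two-degrees-of-freedom CDF must return α. A Python sketch (the function names are our own):

```python
import math

def chi2_2dof_cdf(u: float) -> float:
    """Chi-square CDF with two degrees of freedom: F(u) = 1 - e^{-u/2}, u > 0."""
    return 1.0 - math.exp(-0.5 * u) if u > 0 else 0.0

def chi2_2dof_quantile(alpha: float) -> float:
    """The alpha-quantile u = -2 log(1 - alpha), used as c^2 for the ellipses."""
    return -2.0 * math.log(1.0 - alpha)

# Round-trip check for the probability levels of Fig. 3.3.
for alpha in (0.1, 0.5, 0.9):
    u = chi2_2dof_quantile(alpha)
    print(alpha, round(u, 4), round(chi2_2dof_cdf(u), 4))
```

For α = 0.9, for instance, u = −2 log(0.1) ≈ 4.6052, and F(4.6052) recovers 0.9.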
Definition 3.41 Statistic. Any function of the members of a random sample that does not contain unknown parameters is called a statistic.

Definition 3.42 Sample mean and sample variance. If X₁, X₂, …, Xₙ constitute a random sample, then the sample mean is given by

X̄ = (1/n) Σ_{i=1}^n Xᵢ,

and the sample variance is given by

S² = Σ_{i=1}^n (Xᵢ − X̄)² / (n − 1).

Note: X̄ and S² are two statistics.

Theorem 3.31 If X₁, X₂, …, Xₙ constitute a random sample from an infinite population with mean μ and variance σ², then E(X̄) = μ and var(X̄) = σ²/n.

Theorem 3.32 If X₁, X₂, …, Xₙ constitute a random sample from an infinite population with mean μ and variance σ², then E(S²) = σ².

3.8.1 Limit Theorems

Limit theorems are among the most important theoretical results in probability theory. These theorems give conditions for the convergence of sequences of random variables or of their distribution functions. Since random variables are functions subject to random influences, different modes of convergence arise for a sequence of random variables. The central limit theorem and the law of large numbers are the most important limit theorems, and we introduce them in this section.

Theorem 3.33 The law of large numbers. If X₁, X₂, …, Xₙ constitute a random sample from an infinite population with mean μ and variance σ², then for any ε > 0, ℙ(|X̄ − μ| ≥ ε) → 0 as n → ∞.

The law of large numbers states that the sample average converges in probability toward the expected value. In fact, as the sample size increases, the sample mean gets closer to the population mean.

Theorem 3.34 The central limit theorem. Let X₁, X₂, …, Xₙ constitute a random sample from an infinite population with mean μ and variance σ². Then the distribution of Z = (X̄ − μ) / (σ/√n) converges to the standard normal distribution as n → ∞. That is, for −∞ < a < ∞,

ℙ( (X̄ − μ) / (σ/√n) ≤ a ) → (1/√(2π)) ∫_{−∞}^{a} e^(−x²/2) dx  as n → ∞.
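Both limit theorems are easy to observe in simulation. A minimal sketch (standard library only; the sample sizes, repetition counts, and seed are arbitrary choices) uses Uniform(0, 1) draws, for which μ = 0.5 and σ² = 1/12:

```python
import random
import statistics

random.seed(42)
mu, sigma2 = 0.5, 1.0 / 12.0   # mean and variance of Uniform(0, 1)

# Law of large numbers: the sample mean approaches mu for large n.
xs = [random.random() for _ in range(100_000)]
assert abs(statistics.fmean(xs) - mu) < 0.01

# Central limit theorem: standardized sample means are approximately
# standard normal, so about 95% fall inside (-1.96, 1.96).
n, reps = 50, 2_000
inside = 0
for _ in range(reps):
    xbar = statistics.fmean(random.random() for _ in range(n))
    z = (xbar - mu) / ((sigma2 / n) ** 0.5)
    if abs(z) < 1.96:
        inside += 1
coverage = inside / reps
assert 0.92 < coverage < 0.98
```

The observed coverage hovers near the nominal 95%, even though the underlying uniform distribution is far from normal.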
3.9 Estimation Theory

In this section, we analyze and estimate population parameters using the statistics presented in the previous section. Population parameters can be examined in several ways. The most common of these is the classical estimation method, which infers the population parameters directly from a sample of the population. The parameters of a population can be estimated in two ways: point estimation and interval estimation. Generally, the parameter of the population and the statistic that will be used to examine this parameter are denoted by θ and Θ̂, respectively. The statistic used for point estimation is called an estimator. Since one parameter may have multiple estimators, we need to know which estimator is better.

Definition 3.43 Unbiased estimator. A statistic Θ̂ is an unbiased estimator of the parameter θ of a given distribution if and only if E(Θ̂) = θ for all possible values of θ.

Note: Θ̂ is called a biased estimator if E(Θ̂) ≠ θ. The bias is then defined as the difference between E(Θ̂) and θ, i.e., bias = E(Θ̂) − θ.

Each parameter may have several unbiased estimators. If we must choose one of them, we usually take the one whose sampling distribution has the smallest variance.

Example 3.9 If X₁, X₂, …, Xₙ have a Bernoulli distribution with success parameter p, then the statistic X̄ = (1/n) Σ_{i=1}^n Xᵢ is an unbiased estimator of p.

Solution. E(X̄) = E( (1/n) Σ_{i=1}^n Xᵢ ) = (1/n) Σ_{i=1}^n E(Xᵢ) = (1/n)(np) = p.

Definition 3.44 Minimum variance unbiased estimator. The estimator of the parameter θ of a given distribution that has the smallest variance among all unbiased estimators of θ is called the minimum variance unbiased estimator, or the best unbiased estimator, of θ.

Theorem 3.35 Let Θ̂ be an unbiased estimator of θ. If

var(Θ̂) = 1 / ( n · E[ (∂ ln f(X) / ∂θ)² ] ),

then Θ̂ is the unbiased estimator of θ with minimum variance.

The quantity in the denominator is referred to as the information about θ. Thus, the smaller the variance is, the greater the information.
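For the Bernoulli case of Example 3.9, the bound of Theorem 3.35 can be evaluated explicitly: the information per observation is 1/(p(1 − p)), so the bound equals p(1 − p)/n, which is exactly var(X̄). A simulation sketch (standard library only; p, n, the repetition count, and the seed are arbitrary choices) illustrates both unbiasedness and attainment of the bound:

```python
import random
import statistics

random.seed(0)
p, n, reps = 0.3, 25, 4_000

# Sampling distribution of X-bar over many Bernoulli(p) samples.
xbars = []
for _ in range(reps):
    xbars.append(sum(1 if random.random() < p else 0 for _ in range(n)) / n)

# Unbiasedness (Example 3.9): E(X-bar) = p.
assert abs(statistics.fmean(xbars) - p) < 0.01

# The information per observation is 1/(p(1-p)), so the bound of
# Theorem 3.35 is p(1-p)/n -- and var(X-bar) attains it.
bound = p * (1.0 - p) / n
assert abs(statistics.pvariance(xbars) - bound) < 0.002
```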
Note: If Θ̂₁ and Θ̂₂ are two unbiased estimators of θ, where the variance of Θ̂₁ is smaller than the variance of Θ̂₂, then we say Θ̂₁ is relatively more efficient than Θ̂₂. The efficiency of Θ̂₁ relative to Θ̂₂, denoted eff(Θ̂₁, Θ̂₂), is defined to be the ratio

eff(Θ̂₁, Θ̂₂) = var(Θ̂₂) / var(Θ̂₁).

Definition 3.45 The mean square error of a point estimator Θ̂ is MSE(Θ̂) = E[(Θ̂ − θ)²]. MSE(Θ̂) is also called the risk function of an estimator.

Definition 3.46 Consistent estimator. The statistic Θ̂ is a consistent estimator of the parameter θ of a given distribution if and only if for each c > 0,

lim_{n→∞} ℙ(|Θ̂ − θ| ≤ c) = 1,  or, equivalently,  lim_{n→∞} ℙ(|Θ̂ − θ| > c) = 0.

The previous definition says that when the size of the random sample is sufficiently large, we can be practically certain that the error made with a consistent estimator will be less than any small pre-assigned positive constant.

Theorem 3.36 If Θ̂ is an unbiased estimator of the parameter θ and var(Θ̂) → 0 as n → ∞, then Θ̂ is a consistent estimator of θ.

Definition 3.47 Sufficient estimator. The statistic Θ̂ is a sufficient estimator of the parameter θ of a given distribution if and only if, for each value θ̂ of Θ̂, the conditional probability distribution or density of the random sample X₁, X₂, …, Xₙ, given Θ̂ = θ̂, is independent of θ.

A statistic Θ̂ that gives as much information about θ as is possible from the sample is called a sufficient estimator.

Definition 3.48 Minimal sufficient statistic. The sufficient statistic Θ̂ is called minimal sufficient if, for any other sufficient statistic Θ̂′, there is a function f such that Θ̂ = f(Θ̂′).

It should be noted that the sufficient statistic is not unique. By Definition 3.48, a sufficient statistic that is a function of every other sufficient statistic is called minimal sufficient. Thus, the minimal sufficient statistic can be considered the most effective sufficient statistic for the parameter θ: it is simpler than all the other sufficient statistics.
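Definitions 3.45 and 3.46 above can be illustrated with the two common variance estimators: S² with divisor n − 1 (unbiased) and the divisor-n version (biased, but still consistent). A simulation sketch (standard library only; the normal data, sample size, and seed are arbitrary choices) also verifies the decomposition MSE = variance + bias²:

```python
import random
import statistics

random.seed(1)
sigma2, n, reps = 4.0, 10, 5_000

s2_unbiased, s2_biased = [], []
for _ in range(reps):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    s2_unbiased.append(ss / (n - 1))   # divisor n-1: E = sigma^2
    s2_biased.append(ss / n)           # divisor n: E = (n-1)/n * sigma^2

bias_unbiased = statistics.fmean(s2_unbiased) - sigma2
bias_biased = statistics.fmean(s2_biased) - sigma2

# The divisor-n estimator is biased downward by sigma^2 / n = 0.4.
assert abs(bias_unbiased) < 0.15
assert abs(bias_biased - (-sigma2 / n)) < 0.15

# MSE decomposes exactly as variance plus squared bias.
mse_biased = statistics.fmean((v - sigma2) ** 2 for v in s2_biased)
decomposed = statistics.pvariance(s2_biased) + bias_biased ** 2
assert abs(mse_biased - decomposed) < 1e-9
```

Both estimators are consistent, since their bias and variance vanish as n grows; the simulation only exhibits the finite-sample difference.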
Theorem 3.37 The Rao–Blackwell theorem. Let θ̂ be an unbiased estimator for θ such that var(θ̂) < ∞. If Θ̂ is a sufficient statistic for θ, define θ̂* = E(θ̂ | Θ̂). Then, for all θ, E(θ̂*) = θ and var(θ̂*) ≤ var(θ̂).

The Rao–Blackwell theorem says that if θ̂ is an unbiased estimator for θ and Θ̂ is a sufficient statistic for θ, then there is a function of Θ̂ that is also an unbiased estimator for θ and has variance no larger than that of θ̂. This theorem can be used to find an unbiased estimator with the least variance (minimum variance unbiased estimator, MVUE): it suffices to take any unbiased estimator and compute its conditional expectation given a sufficient statistic for the parameter. The resulting estimator has variance no larger than that of the initial estimator. To determine the best estimator in the class of unbiased estimators, the Lehmann–Scheffé theorem should be used. This theorem ensures that the unbiased estimator created by the Rao–Blackwell theorem has variance no larger than that of any other unbiased estimator.

Definition 3.49 The statistic Θ̂ is complete for its distribution family if, for any measurable function g and any parameter value θ,

E[g(Θ̂)] = 0  →  ℙ(g(Θ̂) = 0) = 1.

With a sufficient statistic, we seek a statistic that retains all the information about the unknown parameter. We also know that a sufficient statistic that is a function of every other sufficient statistic is called a minimal sufficient statistic, and it is best to use this statistic for estimation. The goal of completeness, in turn, is to select a statistic that carries no superfluous information about the parameter.
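The Rao–Blackwell construction above can be demonstrated concretely for Bernoulli(p): the single observation X₁ is unbiased for p, and conditioning on the sufficient statistic T = ΣXᵢ gives E(X₁ | T) = T/n = X̄, which is also unbiased but has far smaller variance. A simulation sketch (standard library only; p, n, the repetition count, and the seed are arbitrary choices):

```python
import random
import statistics

random.seed(7)
p, n, reps = 0.4, 20, 5_000

crude, improved = [], []
for _ in range(reps):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    crude.append(xs[0])            # X1 alone: unbiased, variance p(1-p)
    improved.append(sum(xs) / n)   # E(X1 | sum Xi) = X-bar: variance p(1-p)/n

# Both estimators are unbiased for p ...
assert abs(statistics.fmean(crude) - p) < 0.03
assert abs(statistics.fmean(improved) - p) < 0.03
# ... but conditioning on the sufficient statistic cuts the variance
# by roughly a factor of n.
assert statistics.pvariance(improved) < statistics.pvariance(crude) / 5.0
```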
A minimal sufficient statistic may, however, still contain additional information that is not relevant for drawing inferences about the population parameter. Selecting a complete minimal sufficient statistic yields a statistic that stores information only about the population parameter and carries no superfluous information.

Theorem 3.38 Lehmann–Scheffé. Let Θ̂ be a complete sufficient statistic. If unbiased estimators exist, then there exists a unique MVUE. We can obtain the MVUE as Θ̂* = E(θ̂ | Θ̂) for any unbiased estimator θ̂. The MVUE can also be characterized as the unique unbiased function Θ̂* = φ(Θ̂) of the complete sufficient statistic Θ̂.

This theorem helps to identify the best estimator in the class of unbiased estimators and complements the Rao–Blackwell theorem. Using this theorem, we can show under what conditions the unbiased estimator with the least variance is unique. In this way, a uniformly minimum variance unbiased estimator (UMVUE) can be obtained. "Uniformly" means that this estimator has the least variance in the class of unbiased estimators at every point of the parameter space.

3.9.1 The Method of Maximum Likelihood

One of the most popular methods for estimating parameters is the method of maximum likelihood. The advantages of this method are that it yields sufficient estimators whenever they exist, and that maximum likelihood estimators are, under regularity conditions, asymptotically minimum variance unbiased estimators.

Definition 3.50 Maximum likelihood estimator (MLE). Let x₁, x₂, …, xₙ be the values of a random sample from a population with the parameter θ. The likelihood function of the sample is then represented by

L(θ) = f(x₁, x₂, …, xₙ | θ) = Π_{i=1}^n f(xᵢ | θ)

for values of θ within a given domain. Note that f(x₁, x₂, …, xₙ; θ) is the value of the joint probability distribution or joint density of the random variables X₁, X₂, …, Xₙ at X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ. We refer to the value of θ that maximizes L(θ) as the maximum likelihood estimator of θ.
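Maximizing L(θ), or equivalently ln L(θ), can also be done numerically when no closed form is convenient. A sketch (standard library only; the data values and the search grid are arbitrary choices) for an exponential sample, where the maximizer should land on the sample mean x̄, as Example 3.10 derives analytically:

```python
import math

data = [0.8, 2.1, 1.3, 0.5, 3.2]   # hypothetical exponential observations

def log_likelihood(theta):
    # ln L(theta) = -n ln(theta) - (1/theta) * sum(x_i) for the
    # exponential density f(x | theta) = (1/theta) exp(-x/theta).
    return -len(data) * math.log(theta) - sum(data) / theta

# Crude grid search over theta in (0, 10).
grid = [0.01 * k for k in range(1, 1000)]
theta_hat = max(grid, key=log_likelihood)

# The numerical maximizer lands on the sample mean, matching the closed form.
xbar = sum(data) / len(data)
assert abs(theta_hat - xbar) < 0.01
```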
It is usually customary, and easier, to maximize the logarithm of the likelihood, ln L(θ). To maximize ln L(θ), we take the derivative of ln L(θ) with respect to θ and set the expression equal to zero, i.e., ∂ ln L(θ) / ∂θ = 0.

Example 3.10 If x₁, x₂, …, xₙ are the values of a random sample from an exponential population, find the maximum likelihood estimator of its parameter θ.

Solution According to the definition of the likelihood function, we have

L(θ) = Π_{i=1}^n f(xᵢ | θ) = (1/θ)ⁿ e^(−(1/θ) Σ_{i=1}^n xᵢ).

Differentiating ln L(θ) with respect to θ yields

d ln L(θ) / dθ = −n/θ + (1/θ²) Σ_{i=1}^n xᵢ = 0.

By solving this equation, we get the maximum likelihood estimate θ̂ = (1/n) Σ_{i=1}^n xᵢ = x̄. Hence, the maximum likelihood estimator is Θ̂ = X̄.

3.9.2 Linear Regression Model

One of the popular methods for studying the causal relationship between independent and dependent variables is the linear regression method. There are two types of relationships between variables: deterministic and probabilistic. In the deterministic form, the relationship between the two variables is exact. For example, we might have Y = βX, where the value of Y is determined by X. In the probabilistic form, on the other hand, the relationship between variables involves a random component, or random error. For example, we might have Y = βX + ε, containing two components: a deterministic component βX plus a random error ε.

3.9.3 General Linear Model

Let the model for linear regression be represented by

Y = Xβ + ε  with  ε ~ N(0, σ²I),

where

Y = (Y₁, Y₂, …, Yₙ)ᵀ,  β = (β₀, β₁, …, β_p)ᵀ,  ε = (ε₁, ε₂, …, εₙ)ᵀ,

and X is the n × (p + 1) design matrix whose i-th row is (1, x_{i1}, x_{i2}, …, x_{ip}).

Note that E(εᵢεⱼ) = 0 for i ≠ j and E(εᵢεⱼ) = σ² for i = j.

Our problem is to choose an estimate of the linear regression of the form Y = Xβ̂ + e, where β̂ is a (p + 1)-column vector estimating the vector β and e is an n-column vector of residuals.
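The model above can be fitted by solving the normal equations β̂ = (XᵀX)⁻¹XᵀY, derived in the next subsection. A small self-contained sketch (pure Python, a single covariate; the data are fabricated for illustration) solves the 2 × 2 system directly:

```python
# Fit Y = X beta + e for one covariate, i.e. design-matrix rows (1, x_i),
# by solving the 2x2 normal equations (X^T X) beta = X^T Y directly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly 2x plus small "errors"

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# X^T X = [[n, sx], [sx, sxx]] and X^T Y = [sy, sxy]; invert the 2x2 system.
det = n * sxx - sx * sx
b0 = (sxx * sy - sx * sxy) / det   # intercept beta_0
b1 = (n * sxy - sx * sy) / det     # slope beta_1

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
assert abs(sum(residuals)) < 1e-9   # OLS residuals sum to zero
assert 1.8 < b1 < 2.2               # slope near the generating value 2
```

For this data the solution is b0 = 0.05 and b1 = 1.99, and the residual sum vanishes, a general property of least squares whenever the design matrix includes an intercept column.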
3.9.4 Ordinary Least Squares Method (OLS)

The classical linear model estimation problem requires estimating the unknown parameters β₀, β₁, …, β_p and σ². The ordinary least squares method selects the values of β₀, β₁, …, β_p that minimize the sum of squared errors, i.e.,

S = Σ_{i=1}^n (Yᵢ − β₀ − β₁x_{i1} − ⋯ − β_p x_{ip})².

This can also be represented as

S = (Y − Xβ)ᵀ(Y − Xβ) = YᵀY − βᵀXᵀY − YᵀXβ + βᵀXᵀXβ = YᵀY − 2βᵀXᵀY + βᵀXᵀXβ.

Now, we differentiate S with respect to β and set the result to zero:

∂S/∂β = −2XᵀY + 2XᵀXβ̂ = 0.

Solving this equation yields β̂ = (XᵀX)⁻¹XᵀY.

Note that since ∂²S/∂β² = 2XᵀX is a positive definite matrix, β̂ minimizes S.

Definition 3.51 The best linear unbiased estimator (BLUE) of a parameter β based on data X

1. is a linear function of the observations;
2. is unbiased, i.e., E(β̂) = β; and
3. has the smallest variance among all unbiased linear estimators.

Finally, we end this chapter by presenting an important theorem below.

Theorem 3.39 Gauss–Markov theorem. If β̂ is the ordinary least squares estimator of β in the classical linear regression model, and if β̃ is any other linear unbiased estimator of β, then var(c′β̂) ≤ var(c′β̃), where c is any constant vector of the appropriate order.

In statistics, the Gauss–Markov theorem states that in a linear model whose errors have zero expectation, are uncorrelated, and have equal variances, the BLUE of the system coefficients is the least squares estimator.

References

Flury, B. (2013). A first course in multivariate statistics. Springer Science & Business Media.

Freund, J. E., Miller, I., & Miller, M. (2004). John E. Freund's mathematical statistics: With applications. Pearson Education India.