Contents
1. Comments on expected values
2. Expected values of some common random variables
3. Covariance and correlation
4. Indicator variables and the inclusion-exclusion formula
5. Conditional expectations
1 COMMENTS ON EXPECTED VALUES

(a) Recall that E[X] is well defined unless both sums ∑_{x: x<0} x p_X(x) and ∑_{x: x>0} x p_X(x) are infinite. Furthermore, E[X] is well-defined and finite if and only if both sums are finite. This is the same as requiring that
E[|X|] = ∑_x |x| p_X(x) < ∞.
Random variables that satisfy this condition are called integrable.

(b) Note that for any random variable X, E[X²] is always well-defined (whether finite or infinite), because all the terms in the sum ∑_x x² p_X(x) are nonnegative. If we have E[X²] < ∞, we say that X is square integrable.

(c) Using the inequality |x| ≤ 1 + x², we have E[|X|] ≤ 1 + E[X²], which shows that a square integrable random variable is always integrable.

(d) Because of the formula var(X) = E[X²] − (E[X])², we see that:
(i) if X is square integrable, the variance is finite;
(ii) if X is integrable, but not square integrable, the variance is infinite;
(iii) if X is not integrable, the variance is undefined.
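For instance, a PMF proportional to 1/k³ on the positive integers gives a random variable that is integrable but not square integrable, so its variance is infinite, as in case (d)(ii). A minimal sketch of this check, assuming the sympy library is available (the particular PMF is our own choice):

# A minimal sketch, assuming sympy: the PMF p(k) = c / k**3 on k = 1, 2, ...
# is integrable (finite mean) but not square integrable (infinite second moment).
from sympy import Sum, symbols, oo, zeta

k = symbols('k', positive=True, integer=True)
c = 1 / zeta(3)                       # normalizing constant, since sum 1/k**3 = zeta(3)

mean = Sum(k * c / k**3, (k, 1, oo))              # sum of c / k**2  -> converges
second_moment = Sum(k**2 * c / k**3, (k, 1, oo))  # sum of c / k     -> diverges

print(mean.is_convergent())            # True: E[X] = zeta(2)/zeta(3) is finite
print(second_moment.is_convergent())   # False: E[X**2] is infinite, so var(X) is infinite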
2 EXPECTED VALUES OF SOME COMMON RANDOM VARIABLES

In this section, we use either the definition or the properties of expectations to calculate the mean and variance of a few common discrete random variables.

(a) Bernoulli(p). Let X be a Bernoulli random variable with parameter p. Then,
E[X] = 1 · p + 0 · (1 − p) = p,
var(X) = E[X²] − (E[X])² = 1² · p + 0² · (1 − p) − p² = p(1 − p).

(b) Binomial(n, p). Let X be a binomial random variable with parameters n and p. We note that X can be expressed in the form X = ∑_{i=1}^n Xi, where X1, ..., Xn are independent Bernoulli random variables with a common parameter p. It follows that
E[X] = ∑_{i=1}^n E[Xi] = np.

(c) Geometric(p). Let X be a geometric random variable with parameter p. We will use the formula E[X] = ∑_{n=0}^∞ P(X > n). We observe that
P(X > n) = ∑_{j=n+1}^∞ (1 − p)^{j−1} p = (1 − p)^n,
which implies that
E[X] = ∑_{n=0}^∞ (1 − p)^n = 1/p.
The variance of X is given by
var(X) = (1 − p)/p²;
a derivation, based on the total expectation theorem, is given in the example in Section 5.1.

(d) Poisson(λ). Let X be a Poisson random variable with parameter λ. A direct calculation yields
E[X] = ∑_{n=0}^∞ n e^{−λ} λ^n/n! = e^{−λ} ∑_{n=1}^∞ n λ^n/n! = e^{−λ} ∑_{n=1}^∞ λ^n/(n − 1)! = λ e^{−λ} ∑_{n=0}^∞ λ^n/n! = λ.
The variance of X turns out to satisfy var(X) = λ, but we defer the derivation to a later section. We note, however, that the mean and the variance of a Poisson random variable are exactly what one would expect, on the basis of the formulae for the mean and variance of a binomial random variable, and taking the limit as n → ∞, p → 0, while keeping np fixed at λ.

(e) Power(α). Let X be a random variable with a power law distribution with parameter α. We have
E[X] = ∑_{k=0}^∞ P(X > k) = ∑_{k=0}^∞ 1/(k + 1)^α.
If α ≤ 1, the expected value is seen to be infinite. For α > 1, the sum is finite, but a closed form expression is not available; it is known as the Riemann zeta function, and is denoted by ζ(α).
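The formulas above are easy to sanity-check numerically. The following is a minimal sketch, using only the Python standard library; the parameter values and the truncation points of the infinite sums are arbitrary choices of ours:

# Minimal numerical check of the mean/variance formulas above, using truncated PMF sums.
import math

def pmf_moments(pmf, support):
    """Return (mean, variance) of a PMF given as a function on a finite support."""
    mean = sum(k * pmf(k) for k in support)
    second = sum(k**2 * pmf(k) for k in support)
    return mean, second - mean**2

p, n, lam = 0.3, 10, 2.5   # arbitrary parameter choices for the check

# Bernoulli(p): mean p, variance p(1-p)
print(pmf_moments(lambda k: p if k == 1 else 1 - p, [0, 1]), (p, p * (1 - p)))

# Binomial(n, p): mean np
binom = lambda k: math.comb(n, k) * p**k * (1 - p)**(n - k)
print(pmf_moments(binom, range(n + 1))[0], n * p)

# Geometric(p): mean 1/p, variance (1-p)/p**2 (infinite support truncated)
geom = lambda k: (1 - p)**(k - 1) * p
print(pmf_moments(geom, range(1, 2000)), (1 / p, (1 - p) / p**2))

# Poisson(lam): mean lam, variance lam (infinite support truncated)
pois = lambda k: math.exp(-lam) * lam**k / math.factorial(k)
print(pmf_moments(pois, range(60)), (lam, lam))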
3 COVARIANCE AND CORRELATION

3.1 Covariance

The covariance of two square integrable random variables, X and Y, is denoted by cov(X, Y), and is defined by
cov(X, Y) = E[(X − E[X])(Y − E[Y])].
When cov(X, Y) = 0, we say that X and Y are uncorrelated.
Note that, under the square integrability assumption, the covariance is always well-defined and finite. This is a consequence of the fact that |XY| ≤ (X² + Y²)/2, which implies that XY, as well as (X − E[X])(Y − E[Y]), are integrable.

Roughly speaking, a positive or negative covariance indicates that the values of X − E[X] and Y − E[Y] obtained in a single experiment tend to have the same or the opposite sign, respectively. Thus, the sign of the covariance provides an important qualitative indicator of the relation between X and Y.

We record a few properties of the covariance, which are immediate consequences of its definition:
(a) cov(X, X) = var(X);
(b) cov(X, Y + a) = cov(X, Y);
(c) cov(X, Y) = cov(Y, X);
(d) cov(X, aY + bZ) = a cov(X, Y) + b cov(X, Z).

An alternative formula for the covariance is
cov(X, Y) = E[XY] − E[X] E[Y],
as can be verified by a simple calculation. Note that if X and Y are independent, we have E[XY] = E[X] E[Y], which implies that cov(X, Y) = 0. Thus, if X and Y are independent, they are also uncorrelated. However, the reverse is not true, as illustrated by the following example.
Example. Suppose that the pair of random variables (X, Y) takes the values (1, 0), (0, 1), (−1, 0), and (0, −1), each with probability 1/4. Thus, the marginal PMFs of X and Y are symmetric around 0, and E[X] = E[Y] = 0. Furthermore, for all possible value pairs (x, y), either x or y is equal to 0, which implies that XY = 0 and E[XY] = 0. Therefore, cov(X, Y) = E[XY] − E[X] E[Y] = 0, and X and Y are uncorrelated. However, X and Y are not independent since, for example, a nonzero value of X fixes the value of Y to zero.
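A minimal sketch of this computation, with the joint PMF of the example hard-coded:

# Check that the example's (X, Y) is uncorrelated but not independent.
from collections import defaultdict

joint = {(1, 0): 0.25, (0, 1): 0.25, (-1, 0): 0.25, (0, -1): 0.25}

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

ex = sum(x * p for x, p in px.items())
ey = sum(y * p for y, p in py.items())
exy = sum(x * y * p for (x, y), p in joint.items())

print("cov(X, Y) =", exy - ex * ey)                 # 0.0 -> uncorrelated
print("P(X=1, Y=0) =", joint[(1, 0)],
      "vs P(X=1)P(Y=0) =", px[1] * py[0])           # 0.25 vs 0.125 -> not independent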
3.2 Variance of the sum of random variables

The covariance can be used to obtain a formula for the variance of the sum of several (not necessarily independent) random variables. In particular, if X1, X2, ..., Xn are random variables with finite variance, we have
var(X1 + X2) = var(X1) + var(X2) + 2 cov(X1, X2),
and, more generally,
var(∑_{i=1}^n Xi) = ∑_{i=1}^n var(Xi) + 2 ∑_{i=1}^n ∑_{j=i+1}^n cov(Xi, Xj).
This can be seen from the following calculation, where for brevity, we denote X̃i = Xi − E[Xi]:
var(∑_{i=1}^n Xi) = E[(∑_{i=1}^n X̃i)²] = E[∑_{i=1}^n ∑_{j=1}^n X̃i X̃j] = ∑_{i=1}^n ∑_{j=1}^n E[X̃i X̃j] = ∑_{i=1}^n var(Xi) + 2 ∑_{i=1}^n ∑_{j=i+1}^n cov(Xi, Xj).
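A minimal numerical sketch of this identity, assuming numpy, with arbitrarily chosen dependent random variables:

# Verify var(sum X_i) = sum var(X_i) + 2 * sum_{i<j} cov(X_i, X_j) on simulated,
# deliberately dependent random variables.
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(size=(3, 100_000))
x = np.vstack([z[0], 0.5 * z[0] + z[1], -0.3 * z[1] + z[2]])   # three correlated variables

lhs = x.sum(axis=0).var(ddof=1)
c = np.cov(x)                                    # sample covariance matrix (ddof=1)
rhs = np.trace(c) + 2 * np.sum(np.triu(c, k=1))  # sum of variances + 2 * sum of covariances (i < j)

print(lhs, rhs)   # the two values agree up to floating-point error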
3.3 Correlation coefficient

The correlation coefficient ρ(X, Y) of two random variables X and Y that have nonzero and finite variances is defined as
ρ(X, Y) = cov(X, Y)/√(var(X) var(Y)).
(The simpler notation ρ will also be used when X and Y are clear from the context.) It may be viewed as a normalized version of the covariance cov(X, Y).

Theorem 1. Let X and Y be discrete random variables with positive variance, and correlation coefficient equal to ρ.
(a) We have −1 ≤ ρ ≤ 1.
(b) We have ρ = 1 (respectively, ρ = −1) if and only if there exists a positive (respectively, negative) constant a such that Y − E[Y] = a(X − E[X]), with probability 1.

The proof of Theorem 1 relies on the Schwarz (or Cauchy-Schwarz) inequality, given below.

Proposition 1. (Cauchy-Schwarz inequality) For any two random variables, X and Y, with finite variance, we have
(E[XY])² ≤ E[X²] E[Y²].

Proof: Let us assume that E[Y²] ≠ 0; otherwise, we have Y = 0 with probability 1, and hence E[XY] = 0, so the inequality holds. We have
0 ≤ E[(X − (E[XY]/E[Y²]) Y)²]
= E[X² − 2 (E[XY]/E[Y²]) XY + (E[XY]/E[Y²])² Y²]
= E[X²] − 2 (E[XY]/E[Y²]) E[XY] + (E[XY]/E[Y²])² E[Y²]
= E[X²] − (E[XY])²/E[Y²],
i.e., (E[XY])² ≤ E[X²] E[Y²].

Proof of Theorem 1:
(a) Let X̃ = X − E[X] and Ỹ = Y − E[Y]. Using the Schwarz inequality, we get
ρ(X, Y)² = (E[X̃ Ỹ])²/(E[X̃²] E[Ỹ²]) ≤ 1,
and hence |ρ(X, Y)| ≤ 1.

(b) One direction is straightforward. If Ỹ = aX̃, then
ρ(X, Y) = E[X̃ · aX̃]/√(E[X̃²] E[(aX̃)²]) = a/|a|,
which equals 1 or −1 depending on whether a is positive or negative.
To establish the reverse direction, let us assume that ρ(X, Y)² = 1, which implies that E[X̃²] E[Ỹ²] = (E[X̃ Ỹ])². Using the inequality established in the proof of Proposition 1, we conclude that the random variable
X̃ − (E[X̃ Ỹ]/E[Ỹ²]) Ỹ
is equal to zero, with probability 1. It follows that, with probability 1,
X̃ = (E[X̃ Ỹ]/E[Ỹ²]) Ỹ = √(E[X̃²]/E[Ỹ²]) ρ(X, Y) Ỹ.
Note that the sign of the constant ratio of X̃ and Ỹ is determined by the sign of ρ(X, Y), as claimed.
Example. Consider n independent tosses of a coin with probability of a head equal to p. Let X and Y be the numbers of heads and of tails, respectively, and let us look at the correlation coefficient of X and Y. Here, we have X + Y = n, and also E[X] + E[Y] = n. Thus,
X − E[X] = −(Y − E[Y]).
We will calculate the correlation coefficient of X and Y, and verify that it is indeed equal to −1. We have
cov(X, Y) = E[(X − E[X])(Y − E[Y])] = −E[(X − E[X])²] = −var(X).
Hence, the correlation coefficient is
ρ(X, Y) = cov(X, Y)/√(var(X) var(Y)) = −var(X)/√(var(X) var(X)) = −1.
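A minimal simulation sketch of this example, assuming numpy, with arbitrary choices of n and p:

# Correlation between the number of heads and the number of tails in n coin tosses.
import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 20, 0.3, 50_000

heads = rng.binomial(n, p, size=trials)   # X: number of heads in each experiment
tails = n - heads                         # Y: number of tails, so X + Y = n exactly

rho = np.corrcoef(heads, tails)[0, 1]
print(rho)   # -1.0 (up to floating-point error), as derived above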
4 INDICATOR VARIABLES AND THE INCLUSION-EXCLUSION FORMULA

Indicator functions are special discrete random variables that can be useful in simplifying certain derivations or proofs. In this section, we develop the inclusion-exclusion formula and apply it to a matching problem.

Recall that with every event A, we can associate its indicator function, which is a discrete random variable I_A: Ω → {0, 1}, defined by I_A(ω) = 1 if ω ∈ A, and I_A(ω) = 0 otherwise. Note that I_{A^c} = 1 − I_A and that E[I_A] = P(A). These simple observations, together with the linearity of expectations, turn out to be quite useful.

4.1 The inclusion-exclusion formula

Note that I_{A∩B} = I_A I_B, for every A, B ∈ F. Therefore,
I_{A∪B} = 1 − I_{(A∪B)^c} = 1 − I_{A^c ∩ B^c} = 1 − I_{A^c} I_{B^c} = 1 − (1 − I_A)(1 − I_B) = I_A + I_B − I_A I_B.
Taking expectations of both sides, we obtain
P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
an already familiar formula.

We now derive a generalization, known as the inclusion-exclusion formula. Suppose we have a collection of events Aj, j = 1, ..., n, and that we are interested in the probability of the event B = ∪_{j=1}^n Aj. Note that
I_B = 1 − ∏_{j=1}^n (1 − I_{Aj}).
We begin with the easily verifiable fact that for any real numbers a1, ..., an, we have
∏_{j=1}^n (1 − aj) = 1 − ∑_{1≤j≤n} aj + ∑_{1≤i<j≤n} ai aj − ∑_{1≤i<j<k≤n} ai aj ak + ··· + (−1)^n a1 ··· an.
We replace aj by I_{Aj}, and then take expectations of both sides, to obtain
P(B) = ∑_{1≤j≤n} P(Aj) − ∑_{1≤i<j≤n} P(Ai ∩ Aj) + ∑_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak) − ··· + (−1)^{n+1} P(A1 ∩ ··· ∩ An).
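A minimal sketch verifying the inclusion-exclusion formula on a small uniform sample space, with arbitrarily chosen events:

# Check of the inclusion-exclusion formula on a finite, uniform sample space.
from itertools import combinations

omega = range(20)                      # uniform probability 1/20 on each outcome
prob = lambda event: len(event) / len(omega)

A = [set(range(0, 8)), set(range(5, 12)), set(range(10, 20, 2)), {1, 3, 15}]  # arbitrary events
n = len(A)

lhs = prob(set().union(*A))            # P(A1 u ... u An), computed directly

rhs = 0.0
for k in range(1, n + 1):              # k-fold intersections enter with sign (-1)**(k+1)
    for idx in combinations(range(n), k):
        inter = set(omega).intersection(*(A[i] for i in idx))
        rhs += (-1) ** (k + 1) * prob(inter)

print(lhs, rhs)                        # the two values coincide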
4.2 The matching problem

Suppose that n people throw their hats in a box, where n ≥ 2, and then each person picks one hat at random. (Each hat will be picked by exactly one person.) We interpret "at random" to mean that every permutation of the n hats is equally likely, and therefore has probability 1/n!. In an alternative model, we can visualize the experiment sequentially: the first person picks one of the n hats, with all hats being equally likely; then, the second person picks one of the n − 1 remaining hats, with every remaining hat being equally likely, etc. It can be verified that the second model is equivalent to the first, in the sense that all permutations are again equally likely.

We are interested in the mean, variance, and PMF of a random variable X, defined as the number of people that get back their own hat.¹ This problem is best approached using indicator variables. For the ith person, we introduce a random variable Xi that takes the value 1 if the person selects his/her own hat, and takes the value 0 otherwise. Note that
X = X1 + X2 + ··· + Xn.
Since P(Xi = 1) = 1/n and P(Xi = 0) = 1 − 1/n, the mean of Xi is
E[Xi] = 1 · (1/n) + 0 · (1 − 1/n) = 1/n,
which implies that
E[X] = E[X1] + E[X2] + ··· + E[Xn] = n · (1/n) = 1.
In order to find the variance of X, we first find the variance and covariances of the random variables Xi. We have
var(Xi) = (1/n)(1 − 1/n).
¹ For more results on various extensions of the matching problem, see L.A. Zager and G.C. Verghese, "Caps and robbers: what can you expect?", College Mathematics Journal, v. 38, n. 3, 2007, pp. 185-191.
For i ≠ j, we have
cov(Xi, Xj) = E[(Xi − E[Xi])(Xj − E[Xj])]
= E[Xi Xj] − E[Xi] E[Xj]
= P(Xi = 1 and Xj = 1) − P(Xi = 1) P(Xj = 1)
= P(Xi = 1) P(Xj = 1 | Xi = 1) − P(Xi = 1) P(Xj = 1)
= (1/n) · (1/(n − 1)) − 1/n²
= 1/(n²(n − 1)).
Therefore,
var(X) = var(∑_{i=1}^n Xi)
= ∑_{i=1}^n var(Xi) + 2 ∑_{i=1}^n ∑_{j=i+1}^n cov(Xi, Xj)
= n · (1/n)(1 − 1/n) + 2 · (n(n − 1)/2) · (1/(n²(n − 1)))
= 1.
Finding the PMF of X is a little harder. Let us first dispense with some easy cases. We have P(X = n) = 1/n!, because there is only one (out of the n! possible) permutation under which every person receives their own hat. Furthermore, the event X = n − 1 is impossible: if n − 1 persons have received their own hat, the remaining person must also have received their own hat.

Let us continue by finding the probability that X = 0. Let Ai be the event that the ith person gets their own hat, i.e., Xi = 1. Note that the event X = 0 is the same as the event ∩_{i=1}^n Ai^c. Thus, P(X = 0) = 1 − P(∪_{i=1}^n Ai). Using the inclusion-exclusion formula, we have
P(∪_{i=1}^n Ai) = ∑_i P(Ai) − ∑_{i<j} P(Ai ∩ Aj) + ∑_{i<j<k} P(Ai ∩ Aj ∩ Ak) − ···.
Observe that for every fixed choice of distinct indices i1, i2, ..., ik, we have
P(A_{i1} ∩ A_{i2} ∩ ··· ∩ A_{ik}) = (1/n) · (1/(n − 1)) ··· (1/(n − k + 1)) = (n − k)!/n!.    (1)
Since there are (n choose k) ways of choosing the indices i1 < i2 < ··· < ik, the kth term in the inclusion-exclusion formula has magnitude (n choose k)(n − k)!/n! = 1/k!, and we obtain
P(∪_{i=1}^n Ai) = 1 − 1/2! + 1/3! − ··· + (−1)^{n+1}/n!,
so that
P(X = 0) = 1 − 1 + 1/2! − 1/3! + ··· + (−1)^n/n!.
Note that P(X = 0) → e^{−1}, as n → ∞.

To conclude, let us now fix some integer r, with 0 < r ≤ n − 2, and calculate P(X = r). The event {X = r} can only occur as follows: for some subset S of {1, ..., n}, of cardinality r, the following two events, BS and CS, occur:
BS: for every i ∈ S, person i receives their own hat;
CS: for every i ∉ S, person i does not receive their own hat.
We then have
{X = r} = ∪_{S: |S|=r} BS ∩ CS.
The events BS ∩ CS for different subsets S are disjoint. Furthermore, by symmetry, P(BS ∩ CS) is the same for every S of cardinality r. Thus,
P(X = r) = ∑_{S: |S|=r} P(BS ∩ CS) = (n choose r) P(BS) P(CS | BS).
Note that
P(BS) = (n − r)!/n!,
by the same argument as in Eq. (1). Conditioned on the event that the r persons in the set S have received their own hats, the event CS will materialize if and only if none of the remaining n − r persons receive their own hat. But this is the same situation as the one analyzed when we calculated the probability that X = 0, except that n needs to be replaced by n − r. We conclude that
P(CS | BS) = 1 − 1 + 1/2! − 1/3! + ··· + (−1)^{n−r}/(n − r)!.
Putting everything together, and using the identity (n choose r)(n − r)!/n! = 1/r!, we obtain
P(X = r) = (1/r!)(1 − 1 + 1/2! − 1/3! + ··· + (−1)^{n−r}/(n − r)!),
which converges to e^{−1}/r! as n → ∞.
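For small n, the mean, variance, and PMF of X can be checked by enumerating all n! permutations. A minimal sketch, with n = 7 chosen arbitrarily:

# Exhaustive check of the matching problem for small n: mean, variance, and PMF of the
# number of fixed points of a uniformly random permutation.
import math
from itertools import permutations
from collections import Counter

n = 7
counts = Counter(sum(1 for i, hat in enumerate(perm) if i == hat)
                 for perm in permutations(range(n)))
total = math.factorial(n)

mean = sum(r * c for r, c in counts.items()) / total
second = sum(r**2 * c for r, c in counts.items()) / total
print("E[X] =", mean, " var(X) =", second - mean**2)        # both equal 1

for r in range(n + 1):
    exact = counts.get(r, 0) / total
    poisson_limit = math.exp(-1) / math.factorial(r)         # limiting value e^{-1}/r!
    print(r, round(exact, 6), round(poisson_limit, 6))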
5 CONDITIONAL EXPECTATIONS

We have already defined the notion of a conditional PMF, p_{X|Y}(· | y), given the value of a random variable Y. Similarly, given an event A, we can define a conditional PMF p_{X|A}, by letting p_{X|A}(x) = P(X = x | A). In either case, the conditional PMF, as a function of x, is a bona fide PMF (a nonnegative function that sums to one). As such, it is natural to associate a (conditional) expectation to the (conditional) PMF.

Definition 1. Given an event A, such that P(A) > 0, and a discrete random variable X, the conditional expectation of X given A is defined as
E[X | A] = ∑_x x p_{X|A}(x),
provided that the sum is well-defined.

Note that the preceding also provides a definition for a conditional expectation of the form E[X | Y = y], for any y such that p_Y(y) > 0: just let A be the event {Y = y}, which yields
E[X | Y = y] = ∑_x x p_{X|Y}(x | y).
We note that the conditional expectation is always well defined when either the random variable X is nonnegative, or when the random variable X is integrable. In particular, whenever E[|X|] < ∞, we also have E[|X| | Y = y] < ∞, for every y such that p_Y(y) > 0. To verify the latter assertion, note that for every y such that p_Y(y) > 0, we have
∑_x |x| p_{X|Y}(x | y) = ∑_x |x| p_{X,Y}(x, y)/p_Y(y) ≤ (1/p_Y(y)) ∑_x |x| p_X(x) = E[|X|]/p_Y(y) < ∞.
The converse, however, is not true: it is possible that E[|X| | Y = y] is finite for every y that has positive probability, while E[|X|] = ∞.

The conditional expectation is essentially the same as an ordinary expectation, except that the original PMF is replaced by the conditional PMF. As such, the conditional expectation inherits all the properties of ordinary expectations (cf. Proposition 4 in the notes for Lecture 6).

5.1 The total expectation theorem

A simple calculation yields
∑_y E[X | Y = y] p_Y(y) = ∑_y ∑_x x p_{X|Y}(x | y) p_Y(y) = ∑_y ∑_x x p_{X,Y}(x, y) = E[X].
Note that this calculation is rigorous if X is nonnegative or integrable.

Suppose now that {Ai} is a countable family of disjoint events that forms a partition of the probability space Ω. Define a random variable Y by letting Y = i if and only if Ai occurs. Then, p_Y(i) = P(Ai), and E[X | Y = i] = E[X | Ai], which yields
E[X] = ∑_i E[X | Ai] P(Ai).
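A minimal sketch of the total expectation theorem on an arbitrary finite joint PMF (the PMF values below are our own choice):

# Check E[X] = sum_y E[X | Y = y] p_Y(y) for a small joint PMF.
from collections import defaultdict

joint = {(0, 'a'): 0.1, (1, 'a'): 0.2, (2, 'a'): 0.1,
         (0, 'b'): 0.3, (5, 'b'): 0.3}          # p_{X,Y}(x, y), sums to 1

p_y = defaultdict(float)
for (x, y), p in joint.items():
    p_y[y] += p

ex = sum(x * p for (x, y), p in joint.items())   # E[X] computed directly

total = 0.0
for y, py in p_y.items():                        # sum over y of E[X | Y = y] p_Y(y)
    cond_mean = sum(x * p / py for (x, yy), p in joint.items() if yy == y)
    total += cond_mean * py

print(ex, total)                                 # identical, up to floating-point error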
Example. (The mean of the geometric.) Let X be a geometric random variable with parameter p, so that p_X(k) = (1 − p)^{k−1} p, for k ∈ N. We first observe that the geometric distribution is memoryless: for k ∈ N, we have
P(X − 1 = k | X > 1) = P(X = k + 1, X > 1)/P(X > 1)
= P(X = k + 1)/P(X > 1)
= (1 − p)^k p/(1 − p)
= (1 − p)^{k−1} p
= P(X = k).
In words, in a sequence of repeated i.i.d. trials, given that the first trial was a failure, the distribution of the number of remaining trials, X − 1, until the first success is the same as the unconditional distribution of the number of trials, X, until the first success. In particular, E[X − 1 | X > 1] = E[X]. Using the total expectation theorem, we can write
E[X] = E[X | X > 1] P(X > 1) + E[X | X = 1] P(X = 1) = (1 + E[X])(1 − p) + 1 · p.
We solve for E[X], and find that E[X] = 1/p. Similarly,
E[X²] = E[X² | X > 1] P(X > 1) + E[X² | X = 1] P(X = 1).
Note that
E[X² | X > 1] = E[(X − 1)² | X > 1] + E[2(X − 1) + 1 | X > 1] = E[X²] + (2/p) + 1.
Thus, E[X²] = (1 − p)(E[X²] + (2/p) + 1) + p, which yields
E[X²] = 2/p² − 1/p.
We conclude that
var(X) = E[X²] − (E[X])² = 2/p² − 1/p − 1/p² = (1 − p)/p².

Example. Suppose we flip a biased coin N times, independently, where N is a Poisson random variable with parameter λ. The probability of heads at each flip is p. Let X be the number of heads, and let Y be the number of tails. Then,
E[X | N = n] = ∑_{m=0}^n m P(X = m | N = n) = ∑_{m=0}^n m (n choose m) p^m (1 − p)^{n−m}.
But this is just the expected number of heads in n independent trials, so that E[X | N = n] = np.
Let us now calculate E[N | X = m]. We have
E[N | X = m] = ∑_{n=m}^∞ n P(N = n | X = m) = ∑_{n=m}^∞ n P(N = n, X = m)/P(X = m).
Recall that X is Poisson with parameter λp, so that P(X = m) = e^{−λp}(λp)^m/m!. Thus, after some cancellations, we obtain
P(N = n | X = m) = e^{−λ(1−p)} (λ(1 − p))^{n−m}/(n − m)!, for n ≥ m,
and therefore
E[N | X = m] = ∑_{n=m}^∞ (n − m) e^{−λ(1−p)} (λ(1 − p))^{n−m}/(n − m)! + m ∑_{n=m}^∞ e^{−λ(1−p)} (λ(1 − p))^{n−m}/(n − m)!
= λ(1 − p) + m.
A faster way of obtaining this result is as follows. From Theorem 3 in the notes for Lecture 6, we have that X and Y are independent, and that Y is Poisson with parameter λ(1 − p). Therefore,
E[N | X = m] = E[X | X = m] + E[Y | X = m] = m + E[Y] = m + λ(1 − p).
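A minimal simulation sketch of this example, assuming numpy, with arbitrary choices of λ and p:

# N ~ Poisson(lam), X | N = n ~ Binomial(n, p), Y = N - X.
# Checks E[X | N = n] = n*p and E[N | X = m] = lam*(1 - p) + m on sampled data.
import numpy as np

rng = np.random.default_rng(2)
lam, p, trials = 6.0, 0.4, 200_000

N = rng.poisson(lam, size=trials)
X = rng.binomial(N, p)                        # number of heads among the N flips

n0, m0 = 6, 2                                 # arbitrary values at which to condition
print(X[N == n0].mean(), n0 * p)              # E[X | N = n0] vs n0 * p
print(N[X == m0].mean(), lam * (1 - p) + m0)  # E[N | X = m0] vs lam*(1-p) + m0
print(X.mean(), X.var(), lam * p)             # X behaves like Poisson(lam*p): mean = variance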
5.2 The conditional expectation as a random variable

Let X and Y be two discrete random variables. For any fixed value of y, the expression E[X | Y = y] is a real number, which however depends on y, and can be used to define a function φ: R → R, by letting φ(y) = E[X | Y = y]. Consider now the random variable φ(Y); this random variable takes the value E[X | Y = y] whenever Y takes the value y, which happens with probability P(Y = y). This random variable will be denoted as E[X | Y]. (Strictly speaking, one needs to verify that φ is a measurable function, which is left as an exercise.)
Example. Let us return to the last example and find E[X | N] and E[N | X]. We found that E[X | N = n] = np. Thus, E[X | N] = Np, i.e., it is a random variable that takes the value np with probability P(N = n) = e^{−λ} λ^n/n!. We found that E[N | X = m] = λ(1 − p) + m. Thus, E[N | X] = λ(1 − p) + X. Note further that
E[E[X | N]] = E[Np] = λp = E[X],
and
E[E[N | X]] = λ(1 − p) + E[X] = λ(1 − p) + λp = λ = E[N].
This is not a coincidence; the equality E[E[X | Y]] = E[X] is always true, as we shall now see. In fact, this is just the total expectation theorem, written in more abstract notation.
Theorem 2. Let g: R → R be a measurable function such that Xg(Y) is either nonnegative or integrable. Then,
E[E[X | Y] g(Y)] = E[X g(Y)].    (3)
In particular, by letting g(y) = 1 for all y, we obtain E[E[X | Y]] = E[X].

Proof: We have
E[E[X | Y] g(Y)] = ∑_y E[X | Y = y] g(y) p_Y(y)
= ∑_y ∑_x x p_{X|Y}(x | y) g(y) p_Y(y)
= ∑_{x,y} x g(y) p_{X,Y}(x, y)
= E[X g(Y)].

Here is an interpretation. We can think of E[X | Y] as an estimate of X, on the basis of Y, and E[X | Y] − X as an estimation error. The above formula says that the estimation error is uncorrelated with every function of the original data; a numerical illustration is given after the two remarks below.

Equation (3) can be used as the basis for an abstract definition of conditional expectations. Namely, we define the conditional expectation as a random variable of the form φ(Y), where φ is a measurable function, that has the property
E[(φ(Y) − X) g(Y)] = 0,
for every measurable function g. The merit of this definition is that it can be used for all kinds of random variables (discrete, continuous, mixed, etc.). However, for this definition to be sound, there are two facts that need to be verified:

(a) Existence: It turns out that as long as X is integrable, a function φ with the above properties is guaranteed to exist. We already know that this is the case for discrete random variables: the conditional expectation as defined in the beginning of this section does have the desired properties. For general random variables, this is a nontrivial and deep result. It will be revisited later in this course.
(b) Uniqueness: It turns out that there is essentially only one function with the above properties. More precisely, any two functions with the above properties are equal with probability 1.
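A minimal sketch checking the property E[(E[X | Y] − X) g(Y)] = 0 on an arbitrary finite joint PMF, with an arbitrary choice of g (both are our own choices, for illustration only):

# Check E[(E[X|Y] - X) * g(Y)] = 0 for a finite joint PMF and an arbitrary g.
joint = {(1, 0): 0.2, (2, 0): 0.1, (4, 1): 0.4, (0, 1): 0.3}   # p_{X,Y}(x, y)

def p_y(y):
    return sum(p for (x, yy), p in joint.items() if yy == y)

def cond_exp(y):                                  # E[X | Y = y]
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y(y)

g = lambda y: 3 * y**2 - 1                        # an arbitrary function of the data Y

error_term = sum((cond_exp(y) - x) * g(y) * p     # E[(E[X|Y] - X) g(Y)]
                 for (x, y), p in joint.items())
print(error_term)                                 # 0.0, up to floating-point error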