ECON335 – Statistics 10-17-21

INTRODUCTION & STATISTICS REVIEW

This course is intended to give you a working knowledge of some basic econometric tools.

A reasonable first question to ask is “why do I have to take this course?”

To give you an idea of why we need and use econometrics let's consider some examples from
Principles of Microeconomics.

Don't worry if you don't remember or have not heard about the economic concepts I'm
discussing. These are simply examples.

(I) Take a look at the left-hand graph below:

Can anyone identify the line I drew in the graph?

It's a demand curve ("D"). We derive demand functions in microeconomic theory courses.

The shape of the curve conveys the idea that as the price of rice drops more people will demand
rice.

Does economic theory say that D curves will always be downward sloping? As it turns out, the
answer is “No.”

While not likely, it is possible that a demand curve is upward sloping.

So, our theory says that D functions might be downward or upward sloping.

How can we determine what is true of the rice market we are studying? Look to the real world.

The analysis does not end there. Our discussion of economic theory also indicated that the steepness
(or elasticity) of the demand curve is important. The steeper the curve, the smaller the change in
the equilibrium quantity in the market when the supply (S) curve shifts or a tax is imposed on the market;
i.e., the more inelastic the demand function, the smaller the change in equilibrium quantity.

Policymakers often are interested in the elasticity of the D function. Our econometric analysis
also provides some insight into the elasticity of the D function.

Define Econometrics

Ok, that’s why we take econometrics. Let’s turn to econometrics itself.

I like starting off the course with definitions of econometrics.

“Econometrics is based upon the development of statistical methods for estimating economic
relationships, testing economic theories, and evaluating and implementing government and
business policy.”

Wooldridge, Jeffrey M. (7th: 2) Introductory Econometrics: A Modern Approach. Thomson-Southwestern: 1.

------------------ skip-----------------------------------------------
“Econometrics uses economic theory, as embodied in an econometric model; facts, as
summarized by relevant data; and statistical theory, as refined in econometric techniques, to
measure and to test empirically certain relationships among economic variables, thereby
giving empirical content to economic reasoning.” Intriligator et al., p.1.
----------------------end skip---------------------------------------

This quotation really embodies what we will do in this course.

(A) It is the theoretical economics which drives use of the econometrics; we have a theory and
wish to test it.

(B) To test it, we translate it into an econometric model.

(C) We then estimate the model using data and test the results.

Economic Theory → Econometric Model → Estimator & Testing

☺ Statistics: both definitions refer to statistical theory in defining econometrics. Statistical ideas
underlie econometric theory. Because of the importance of statistics, I will review basic statistical
concepts in the first several classes of this course. The review will also serve as a refresher on
the statistics course you have taken. It will reveal the general approach used in
econometrics, but in a simpler context. We will then turn to econometrics.

In order to get into statistics, we must remind ourselves of the summation operator and its
properties. Let’s do that now.

PROPERTIES OF THE SUMMATION OPERATOR (7th: 666. 6th: 628)

You will recall that the summation operator is designed to represent the summation of several
variables. Thus, if we wished to represent the idea that we are summing the numbers 1 to 100,
we could write out the whole summation, or we could write

$\sum_{i=1}^{100} i$

How do we interpret it? Go from i=1 to i=100 and sum the variable after the operator.

Ex: Now, suppose that we obtain data on how much money households spend in a year on food and
that we represent the consumption of food with the variable $y_i$.

And, suppose that we get this information for 100 households.

If we wish to represent summing the random variable for all 100 households, we could write

$\sum_{i=1}^{100} y_i$

We include a subscript to identify each of the 100 variables.

☞ We can generalize this representation somewhat by allowing the number of variables
summed to be indeterminate. We refer to a number of variables - n - being summed.

We get

$\sum_{i=1}^{n} y_i$
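As a quick check on the notation, the sums above can be computed directly. This is only a minimal Python sketch; the food-spending figures in `y` are made up for illustration and are not course data.

```python
# Sum of the integers 1 through 100: sum_{i=1}^{100} i
total_1_to_100 = sum(i for i in range(1, 101))
print(total_1_to_100)  # 5050

# Hypothetical yearly food spending for n = 5 households (illustrative values only)
y = [12.0, 7.5, 9.0, 15.25, 6.25]
n = len(y)

# sum_{i=1}^{n} y_i: add up y_i for i = 1, ..., n
total_y = sum(y[i] for i in range(n))
print(total_y)  # 50.0
```

The index i in the notation plays the same role as the loop variable here: it simply walks through the observations one at a time.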

Properties

(1) Property Sum 1: the variable is a constant: c

$\sum_{i=1}^{n} c = c + c + \dots + c = n \cdot c$

In other words, we sum n variables, each of which has the same value: c. The sum must = n∙c.

Ex: Suppose that the “variable” is Age and that we have 25 people (n=25), all of whom are 19
years old. Here, we are summing 25 variables, each of which has value 19;

sum = n∙c = 25∙19 = 475.

(2) Property Sum 2: a constant times a non-constant variable (7th: 666)

$\sum_{i=1}^{n} a \cdot y_i = a \cdot \sum_{i=1}^{n} y_i$

You can pull the constant out of the summation.

Ex: Suppose that Y is measured in hundreds of thousands of dong, so that y = 5 represents
500,000 dong. Suppose that we want to represent it in thousands of dong, so that y = 500 when
we have 500,000 dong. What would we have to do to the original variable to change the units?
Multiply it by 100. In this example, a = 100.

We have y1, y2, & y3. ⇒ 100∙y1 + 100∙y2 + 100∙y3 = 100(y1 + y2 + y3)

Ex: a variable minus (or plus) a constant:

$\sum_{i=1}^{n} (y_i - a) = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a = \sum_{i=1}^{n} y_i - n \cdot a$

This rule reflects a basic fact about the summation operator. When we have linear functions like
this one, we can represent the summation of the function as the summation of each element in the
function. Of course, the n∙a reflects property (1).

Ex: with n = 3,

$\sum_{i=1}^{3} (y_i - a) = (y_1 - a) + (y_2 - a) + (y_3 - a) = y_1 + y_2 + y_3 - a - a - a = \sum_{i=1}^{3} y_i - 3 \cdot a$
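The linear properties so far can be verified numerically. The sketch below uses three made-up values of y and the constant a = 100 from the units example; the specific numbers are assumptions chosen only to make the arithmetic easy to follow.

```python
# Illustrative values (assumptions, not course data)
y = [3, 8, 4]
a = 100
n = len(y)

# Property Sum 1: summing a constant n times gives n*a
sum_const = sum(a for _ in range(n))
print(sum_const == n * a)  # True

# Property Sum 2: a constant can be pulled out of the summation
lhs = sum(a * yi for yi in y)
rhs = a * sum(y)
print(lhs, rhs)  # 1500 1500

# A variable minus a constant: sum(y_i - a) = sum(y_i) - n*a
lhs_minus = sum(yi - a for yi in y)
rhs_minus = sum(y) - n * a
print(lhs_minus, rhs_minus)  # -285 -285
```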

(3) Property Sum 3: a variable minus (or plus) a variable (7th: 667)

$\sum_{i=1}^{n} (a \cdot x_i - b \cdot y_i) = \sum_{i=1}^{n} a \cdot x_i - \sum_{i=1}^{n} b \cdot y_i$

Ex: x1,x2,x3 & y1,y2,y3

(4) Property Sum 4: a variable times a variable (I created this.)

What it is:

$\sum_{i=1}^{n} (x_i \cdot y_i)$

Ex: n = 3. It = x1∙y1 + x2∙y2 + x3∙y3.

What it is not:

$\sum_{i=1}^{n} x_i \cdot \sum_{i=1}^{n} y_i$, which for n = 3 is $(x_1 + x_2 + x_3)(y_1 + y_2 + y_3)$

Note that the summation operator does not move through non-linear relations.

Another example:

$\sum_{i=1}^{n} (x_i - y_i^2) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} y_i^2$

It does not equal

$\sum_{i=1}^{n} x_i - \left( \sum_{i=1}^{n} y_i \right)^2$

Ex: a (variable minus (or plus) a constant) squared (7th: 667. 6th: 630)

$\sum_{i=1}^{n} (Y_i - a)^2$

What it is:

$\sum_{i=1}^{n} (Y_i - a)^2 = \sum_{i=1}^{n} (Y_i - a)(Y_i - a) = \sum_{i=1}^{n} (Y_i^2 - 2aY_i + a^2) = \sum_{i=1}^{n} Y_i^2 - 2a \sum_{i=1}^{n} Y_i + n a^2$

Note that this is not

$\left[ \sum_{i=1}^{n} (Y_i - a) \right]^2$

In this formula we do the summation and then square it, while in the formula we’re considering
we square and then sum the squares.

Ex: a random variable divided by another random variable

$\sum_{i=1}^{n} (x_i / y_i) = \sum_{i=1}^{n} x_i \cdot (1/y_i) \neq \sum_{i=1}^{n} x_i \cdot \sum_{i=1}^{n} (1/y_i)$
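To see that the summation operator does not pass through non-linear relations, it helps to try small numbers. The following Python sketch uses made-up values of x and y (n = 3) and a constant a = 2; none of these numbers come from the course.

```python
x = [1, 2, 3]
y = [4, 5, 6]

# Sum of products vs. product of sums
sum_of_products = sum(xi * yi for xi, yi in zip(x, y))   # 1*4 + 2*5 + 3*6
product_of_sums = sum(x) * sum(y)                        # (1+2+3)*(4+5+6)
print(sum_of_products, product_of_sums)  # 32 90

# Sum of squares vs. square of the sum
sum_of_squares = sum(yi ** 2 for yi in y)   # 16 + 25 + 36
square_of_sum = sum(y) ** 2                 # 15**2
print(sum_of_squares, square_of_sum)  # 77 225

# Expanding sum((y_i - a)^2) = sum(y_i^2) - 2a*sum(y_i) + n*a^2
a = 2
n = len(y)
lhs = sum((yi - a) ** 2 for yi in y)
rhs = sum_of_squares - 2 * a * sum(y) + n * a ** 2
print(lhs, rhs)  # 29 29
```

The first two comparisons differ, while the expansion of the squared term matches exactly, because that expansion only uses the linear properties on each term.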

STATISTICS REVIEW

We will start the course by reviewing some basic concepts learned in statistics courses. The
review will set the stage for our econometric analysis & it will describe ideas which we will use
in our econometric analysis.

I will also approach the statistics review by referring often to one example. The review will
allow me to identify basic principles that underlie our statistical and econometric analysis.

Example: Suppose that we're interested in the consumption of households in Vietnam [the U.S.].
We will consider how we might approach analysis of that consumption.

(1)(A) Random Variables (7th: 684. 6th: 645)

If we consider household consumption, the first thing we will note is that it varies across
households. Some households spend a lot - e.g., Jeff Bezos's household, while other households
don't spend much - e.g., households in rural areas.

We call the concept we are considering - "Household Consumption" - a random variable.

Define a random variable. Define it in two parts.

A variable represents "a quantity that takes on different values for different persons or things."

- Households will have different levels of consumption; so, household consumption is a
variable.

- On the other hand, the speed of light emitted from flashlights does not vary from
flashlight to flashlight. It is a constant: about 186,000 miles per second.

A random variable is a variable whose value cannot be predicted with certainty. It can take on
many values.

Ex. We cannot predict with certainty a household’s (hh's) consumption. Likewise, my income
differs from your income, and your incomes likely differ from one another.

☞ There are several sources of randomness. The primary reason cited by economists for
randomness in data: we do not have complete information about the factors which affect a
variable.

☞ Reasons for not having complete information:

(1) Even though it's possible to get all relevant data, we do not have all such data.

e.g., the household consumption of food example. It will depend on many factors: education, #
people in the hh, family background, who your family knows, your preferences, etc.

If we do not have all of the information affecting a variable, then the variable will appear random.

e.g., 2 households with 4 people and one wage earner. One household has 30 million dong per
month in consumption of food and the other 5 million dong per month in food consumption.

If we add the fact that the first wage earner is a doctor who has been in practice for 20 years & the
second wage earner has an elementary school education and works as a construction laborer
during the day & as a janitor at night, then we might explain the difference in income levels, and hence in food consumption.

(2) We are not sure of all of the variables affecting what we are interested in.

Ex: advertising or whether you woke up happy versus waking up sad.

(3) Some variables are inherently non-quantifiable or non-observable.

Ex. innate ability & how it affects income

Ex. education - we might use years of schooling, but we need more than that

(1)(B) Population vs. Sample

Now, we have to distinguish a population from a sample.

A "population" is the whole group one is considering (7th: 714. 6th: 674). (Any well-defined
group of subjects.)

In the example we are considering (consumption of households (hh’s) in the country), it is all of
the hh's in the country in a given year.

A "sample" is a subset of the population.

Ex. the 2009 Vietnamese Census samples 15% of the country’s households

[ http://www.gso.gov.vn/default_en.aspx?tabid=515&idmid=5&ItemID=9813 ].

[2009: 22,638,167 households in Vietnam]

Aside: Each data point (or element) in the sample is called an "observation." An observation is
simply one of the households in the sample. Thus, if we have 15% of 22 million hh's, we have
3.3 million hh's (or observations) in the sample.

(1)(C) Characteristics of the population

So, we're interested in some random variable: in our example, household consumption in
Vietnam [the U.S.] in some year (e.g., 2017).

Suppose that we would like to gain some insight into that variable.

We must now ask "what characteristic of household consumption are we interested in?"

We might be interested in consumption of the "typical hh" in Vietnam [the U.S.]


Standard measures used are the mean and the median.

Or, we might be interested in the range of consumption across hh's; say, the distribution of that
consumption. In statistics, we saw that the variance measures how spread out a random variable
is.

Aside: we call characteristics like the mean and variance population parameters.

☺ Here’s an important point: we, as a rule, do not know the values of these parameters.

✌ An important question that arises is "how can we gain insights into those characteristics?"

We might determine those parameters by obtaining information on consumption of every
household in the country. If we had that information then we would be able to calculate any
characteristic of the population.

A problem with this approach, however, is that it would be costly to obtain information on all
households in a population.

As a result, we obtain information on samples of a population & hope to infer characteristics
of the population from that sample.
This is a Key Point.

Note that a key implication of using a sample to make inferences about a population is that the
population and its relevant characteristics remain unknown to the economist (researcher).

Ex. We do not know the mean consumption of hh's in Vietnam (the US) in a certain year.

Thus, we seek to use the sample to make statements (or inferences) about an unknown
parameter (characteristic).

Ex. use the sample mean to make a statement about the population mean.

(1)(D) Discuss Samples

The type of sample you choose is important. Return to the consumption ex. Suppose that your
sample was obtained by determining consumption levels of households in more wealthy parts of
Hanoi (in the U.S.: Cherry Creek, Beverly Hills or Beacon Hill).

We would not believe that the information obtained from those samples was "representative" of
households (hh’s) in the whole country.

Because of such possible biases, we must be careful in sampling. We want a sample that will be
representative of the population.

The type of sample we focus on primarily in this course and in statistics is called a

"simple random sample (SRS)" (7th: 715. 6th: 674)

It is a subset of the population with each member of the population having the same probability
of being included in the sample.

Ex: returning to the consumption example. If we're interested in the population "all
households in Vietnam in a year" and if there are 22 million hh's in VN, a simple
random sample would place a 1/22m probability on choosing any given household on a single draw.

Hereafter, we will assume that we have data from a SRS.
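The contrast between a simple random sample and a non-representative sample can be sketched in a few lines of Python. Everything here is hypothetical: the population is simulated with made-up consumption values, and the population and sample sizes are chosen only to keep the example small.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of household consumption (made-up numbers)
population = [random.uniform(5, 30) for _ in range(10_000)]
population_mean = sum(population) / len(population)

# Simple random sample: every household is equally likely to be included
srs = random.sample(population, 500)
srs_mean = sum(srs) / len(srs)

# Non-representative sample: only the 500 highest-consumption households
wealthy = sorted(population)[-500:]
wealthy_mean = sum(wealthy) / len(wealthy)

print(round(population_mean, 2), round(srs_mean, 2), round(wealthy_mean, 2))
```

In a run like this, the SRS mean should land close to the population mean, while the mean of the "wealthy neighborhood" sample sits far above it. In practice, of course, we never get to compute the population mean; that is exactly why the sampling method matters.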

Transition

So, suppose that we have information on consumption levels of 1 million hh's in the country
obtained from a SRS. We may legitimately ask whether the info obtained about those hh's will
allow us to gain insight into characteristics of the full population; i.e., can we use information
obtained from 1 million households to make statements about all 22 million households?

In other words, we ask

"can we make statements about the whole population when all we have is information on a
subset (a SRS) of the population?"

In order to answer this question we have to link the sample to the population & determine if
we can use the sample to gain insights into the population.

SAMPLE → POPULATION

We link samples to populations by building a theory of probability. We then use that theory to
show how the sample allows us to make inferences about the population.

(2)(a) Fundamentals of Probability (7th: 684. 6th: 645)

Let's review some of the basic principles of probability theory.

The probability theory derived in statistics is intended to describe random variables. It starts
from fundamental ideas such as a sample space, an outcome and an event.

The basic point of the analysis is to describe all of the possible values a random variable might
take and associate probabilities with each possible value.

Ex: consumption may be $5k, $100k or $1m. There will be a specific probability that a household has
$5k in consumption, another for $100k, and so on.

The analysis builds to a point where we describe random variables and the probabilities
associated w/ them w/ probability distributions. That is where I will start my analysis.

Probability Distributions

Required Characteristics of Probability Distributions

Probability distributions associate probabilities with the possible values of a random variable.

The exact way in which we define and calculate a probability depends on the type of random
variable we are considering.
First, we must make two definitions:

Y will represent the random variable (and, loosely, the set of possible outcomes it can take), and

y will be a specific outcome in the set Y.

Three Types of Random Variables: not all variables are the same. We must distinguish
between 3 types of random variables:

(1) discrete, (2) continuous and (3) categorical random variables.

(1) Discrete Random Variables (7th: 685. 6th: 646) can take on only a fixed set of values.

Ex. Years of education completed. Can take on values from zero to ...? H.S. = 12; college = 16;

In this example, Y is the random variable with certain distinct values;
y is a distinct value of Y, such as 10 years (or 9 years or …).

We call each possible value of Y a mass point.

f(y) is called a probability mass function ("pmf") (Wooldridge calls it a pdf: 7th: 686. 6th: 647)

it is a function which assigns probabilities to each possible outcome of Y

Note that f(y) is the probability of observing a given value of the random variable Y.

Ex: f(12) = 0.25 implies that the probability of observing someone who obtained a high school
diploma is 25%,

f(16) = 0.1 implies that 10% of all individuals have 16 years of education … & so on.

A pmf must possess the following properties (7th: 685. 6th: 647)

(1) f(y) ≥ 0 for all possible values of Y.

(2) $\sum_{y} f(y) = 1$, where the sum runs over all possible values y of Y.

(1) states that all probabilities must be non-negative.

(2) requires all probabilities to sum to one. That's reasonable.
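The two requirements are easy to check mechanically. In the Python sketch below, only f(12) = 0.25 and f(16) = 0.10 come from the education example; the other mass points and probabilities are invented so that the pmf is complete, and the helper name `is_valid_pmf` is my own.

```python
# Hypothetical pmf for years of education; only f(12) = 0.25 and f(16) = 0.10
# come from the notes, the other mass points and probabilities are invented.
pmf = {8: 0.15, 10: 0.20, 12: 0.25, 14: 0.20, 16: 0.10, 18: 0.10}

def is_valid_pmf(f, tol=1e-9):
    """Check the two pmf requirements: non-negativity and summing to one."""
    non_negative = all(prob >= 0 for prob in f.values())
    sums_to_one = abs(sum(f.values()) - 1.0) < tol
    return non_negative and sums_to_one

print(is_valid_pmf(pmf))                 # True
print(is_valid_pmf({0: 0.7, 1: 0.4}))    # False: probabilities sum to 1.1
```

The small tolerance is there because probabilities stored as decimals on a computer rarely add up to exactly 1.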

☺ We might ask what a graph of a pmf might look like. Let's do it for the education example.
The graph will have spikes at the values the variable (education) can take (the mass points of the
variable), with the height of the spike representing the probability of observing the value of that
variable.

Example of a discrete distribution which is used for a discrete (i.e., non-categorical) variable.

Distributions for Continuous Variables (7th: 687. 6th: 648)

We have seen that continuous variables lie w/in a specified range; e.g., between [0,1] or (-∞,∞)

We use a probability density function (pdf) to identify the probabilities for continuous
variables.

Because it is assigning probabilities for a continuous variable, a pdf is continuous. So, it will
look like, e.g.,

When dealing with continuous variables we must identify probabilities in a different way.

We can see this best by looking at the requirements for a pdf.

pdfs must satisfy the following conditions:

(1) f(y) ≥ 0 everywhere and

(2) the whole area under the pdf must equal one (in mathematical terms - you don’t have to
worry about this - $\int_{-\infty}^{\infty} f(y)\,dy = 1$).

(1) is similar to the (1) requirement for pmf’s.

Requirement (2) is analogous to the sum of probabilities = 1 requirement of pmf's.

Requirement (2) suggests that we obtain probabilities for continuous variables by calculating an
area under a probability density function (pdf) between two values of the random variable.

As a result, we obtain probabilities for continuous variables only for ranges of values of the
random variable; not for specific values (4th: 419). I won’t go over how we calculate the
probabilities exactly – that involves mathematics not needed for this course.

☞ It is necessary to note, however, that the probability is not the value of f(y) at a specific
point or the difference between f(y) at two different points. You do not obtain probabilities by
sticking values of y into f(y).
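A small numerical illustration may help here. The sketch below assumes a continuous uniform variable on [0, 0.5], which is my choice of example and not a distribution discussed in the course; its density equals 2 everywhere on that interval, so the density itself clearly cannot be a probability. Probabilities come from areas, approximated here with a simple midpoint sum rather than formal integration.

```python
def uniform_pdf(y, low=0.0, high=0.5):
    """Density of a continuous uniform variable on [low, high] (illustrative choice)."""
    return 1.0 / (high - low) if low <= y <= high else 0.0

def area_under_pdf(pdf, a, b, steps=10_000):
    """Approximate Pr(a <= Y <= b) with a midpoint Riemann sum."""
    width = (b - a) / steps
    return sum(pdf(a + (k + 0.5) * width) for k in range(steps)) * width

print(uniform_pdf(0.2))                                  # 2.0, a density, not a probability
print(round(area_under_pdf(uniform_pdf, 0.1, 0.3), 6))   # 0.4
print(round(area_under_pdf(uniform_pdf, -1.0, 1.0), 6))  # 1.0, the whole area under the pdf
```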

(3) Attribute or categorical variables place individuals into categories.

Exs, whether one is male or female, whether one lives in an urban area and whether one has a
college degree.

We represent these variables using what are called dummy variables.

A dummy variable identifies whether one is in one of two categories: it takes on a value of zero
or one.

Ex. 1: Gender: male vs female. y = 1 if female, and y = 0 if male.

Ex. 2: Education: y = 1 if a person has a college degree, and y = 0 if they have less than a college degree.

You see that these variables are discrete in the sense that they take on a limited # of values.

☞ The #'s attributed to them, however, are meaningless.

Ex we could just as easily define y = 1 if Male and y = 0 if female.

On the other hand, for a discrete variable like those described above, #'s make a difference.
Three doctor’s visits is different than 1 or 0 visits.

So, we treat categorical variables as if they were discrete, with the variable values having no
quantitative meaning.

We will discuss probability distributions for discrete and continuous variables.

Ex Bernoulli distribution with parameter p. (7th: 685. 6th: 646)

This is a distribution for a random variable which takes on only two values: 0 or 1.

Here $f(y) = p^y (1-p)^{1-y}$ for $y \in \{0, 1\}$, with $0 \le p \le 1$,
and $f(y) = 0$ otherwise.

We may note that p is called a parameter of the distribution.

How do we use the distribution? To answer the question, let's ask

"What's the probability that y=1."

Plug in y = 1 and see what we get: $f(1) = p^1 (1-p)^{1-1} = p$.

☼ So, does the Bernoulli pmf satisfy the requirements of a pmf?

Is it + or 0 for all points in Y? Here Y can take 2 values: 0 & 1. The pmf = p for y =1 and = (1-p)
for y = 0.

Is it 0 for values not in Y? Yes.

Do the probabilities sum to one? Yes: p + (1-p) = 1.
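The same checks can be run in code. This is a minimal Python sketch of the Bernoulli pmf; the function name `bernoulli_pmf` and the value p = 0.25 are assumptions made for illustration.

```python
def bernoulli_pmf(y, p):
    """f(y) = p**y * (1 - p)**(1 - y) for y in {0, 1}; zero otherwise."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    if y not in (0, 1):
        return 0.0
    return p ** y * (1 - p) ** (1 - y)

p = 0.25  # illustrative value, not from the notes
print(bernoulli_pmf(1, p))   # 0.25
print(bernoulli_pmf(0, p))   # 0.75
print(bernoulli_pmf(2, p))   # 0.0
print(bernoulli_pmf(0, p) + bernoulli_pmf(1, p))  # 1.0
```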

What will it look like? ...[show it graphically!]

When might this distribution be used?

Ex., drawing one person from the pop "everyone in Vietnam [the U.S.] over 25" and determining
whether they have a college degree.

Let y = 1 if "they have a degree" and y = 0 if "they do not have a degree."

What would p be? The probability that the person has a college degree.

(There are many other discrete distributions. You will only have to know about the Bernoulli
distribution.)

_______________________________ skip ______________________________________

Standard continuous distributions

There are a lot of different continuous distributions. In fact, we will see several of them in this course.
For now I will focus only on two distributions: (1) the General Normal probability density
function, and (2) the Standard Normal probability density function.

General Normal PDF (7th: 704. 6th: )

We will not look at the specific formula for the pdf. We will simply look at graphical examples
of it.

We represent the idea that the variable has a general normal pdf as follows: $y \sim N(\mu, \sigma^2)$

where $\mu$ and $\sigma^2$ are parameters of the distribution.

Here are some graphical examples: [GRAPHS on 4th: 465]

☞ What can we say about the distribution

(1) This is the familiar bell-shaped distribution that many people talk about.

(2) The distribution is symmetric. By symmetric, we mean that if we draw a line down the center,
the right half will look like the left half.

(3) The shape of the distribution depends on mu & sigma (called parameters).

μ determines where the peak (or center) of the distribution will be, while

σ determines how it is spread out.

We can see how the size of mu and sigma affect the distribution in the following OH.

(4) The distribution with the dashed line has a μ = 0 & σ = 1. It is called the standard normal
distribution. (7th: 705. 6th: 666)

You can see that it is centered over 0. We may see how mu & sigma affect the distribution by
contrasting other general normal distributions with the standard normal.

The distribution with the dashed and dotted line has μ = 2 and σ = 1.

So, we've simply changed the value of mu. We see that changing it causes the distribution to
shift to the right. The spread of the distribution has not changed.

The distribution with the solid line has a mean = 0 and a σ = 2.

We see that it has the same center (peak) as the 1st distribution but it is more spread out.

☞ We obtain the following general results:

(1) a positive μ implies that the distribution shifts right, as compared w/ the standard normal
distribution. A negative μ implies a shift to the left.

(2) a smaller σ implies that the spread of the distribution is not as great.

☞ Transformations

I noted above that when $\mu = 0$ and $\sigma^2 = 1$ we obtain a standard normal distribution.

It is important to note that we can transform any variable with a general normal distribution to
one with a standard normal distribution.

Suppose that $y \sim N(\mu, \sigma^2)$.

Property Normal 1: (7th: 706) if we define the variable $Z = (y - \mu)/\sigma$,

we can show that $Z \sim N(0, 1)$.

Note that we divide by $\sigma$, not $\sigma^2$.

This result follows from the fact that a linear transformation of a normally distributed r.v. has a
normal distribution. W/ the foregoing transformation, we get a standard normal variable.

Ex: $y \sim N(1, 4) \Rightarrow Z = (y - 1)/2 \sim N(0, 1)$
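The transformation can be checked numerically with Python's standard library. This sketch uses `statistics.NormalDist` for the N(1, 4) example; the cutoff value 3 is an arbitrary choice of mine, picked so that the standardized value is exactly 1.

```python
from statistics import NormalDist

# y ~ N(1, 4): mean 1, variance 4, so the standard deviation is 2
mu, sigma = 1.0, 2.0
y_dist = NormalDist(mu=mu, sigma=sigma)
z_dist = NormalDist(mu=0.0, sigma=1.0)  # standard normal

def standardize(y):
    """Z = (y - mu) / sigma; note the division by sigma, not sigma squared."""
    return (y - mu) / sigma

# Pr(Y <= 3) should match Pr(Z <= (3 - 1)/2) = Pr(Z <= 1)
y_value = 3.0
prob_y = y_dist.cdf(y_value)
prob_z = z_dist.cdf(standardize(y_value))
print(standardize(y_value))                 # 1.0
print(round(prob_y, 4), round(prob_z, 4))   # 0.8413 0.8413
```

The two probabilities agree, which is the practical content of the property: any question about a general normal variable can be restated as a question about Z.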

The standard normal distribution plays a major role in our econometric analysis.

Standard Normal Tables: because the actual calculation of the probabilities is difficult (it
involves integration) econometrics and statistics texts contain tables which identify the
probability of falling between two points in a standard normal distribution.

Table G-1 (7th: 784. 6th: 743) contains such probabilities.

Ex: suppose that we want to determine the probability that the value of a standard normal
variable falls between 0 and 0.5.

How do we calculate the probability? Note that the table gives us the probability that the random
variable lies between 0 and some + point. ...

Examples: …. [0,1] What about [1,2]? Pr(1 ≤ Z ≤ 2) = Pr(0 ≤ Z ≤ 2) - Pr(0 ≤ Z ≤ 1)

☺ We might ask why textbooks only report probabilities for the standard normal distribution.

Why do we not have them for all general normal distributions?

We would need an infinite # of tables and, anyway, we can calculate the probabilities for general
normal distributions using the standard normal distribution. We just undertake the transformation
described above.
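For readers who want to check table lookups, the probabilities can be reproduced in Python. This is a sketch using the standard library's `statistics.NormalDist` in place of the printed table; the helper name `prob_between` is my own.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def prob_between(lower, upper, dist=Z):
    """Pr(lower <= Z <= upper), computed as a difference of cdf values."""
    return dist.cdf(upper) - dist.cdf(lower)

print(round(prob_between(0, 0.5), 4))  # 0.1915
print(round(prob_between(0, 1), 4))    # 0.3413
# Pr(1 <= Z <= 2) = Pr(0 <= Z <= 2) - Pr(0 <= Z <= 1)
print(round(prob_between(0, 2) - prob_between(0, 1), 4))  # 0.1359
```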

I will leave our discussion of the normal distribution here & go on.
___________________________ end skip ____________________________

Features of Probability Distributions (7th: 691. 6th: 652)

So, our probability theory allowed us to describe random variables with probability distributions.

While it is nice to be able to describe a random variable in this way we are usually not interested
in the whole distribution but only certain characteristics of the distribution.

I alluded to several characteristics earlier: mean, median, variance … standard deviation

For our purposes, the characteristics in which we are interested fall under two broad categories:

(i) measures of central tendency & (ii) measures of dispersion.

We will consider each.

B-3a: A Measure of Central Tendency: The Expected Value (7th: 691)

The mean, median, and mode are measures of central tendency (MOCT). As I noted earlier, we often look to them when we
wish to describe the "typical" person or household or observation.

The most popular measure and the measure that will dominate our analysis of econometrics is
the mean of a random variable. We will focus on it.

The mean (or average) of a random variable is also called the expected value of a
distribution.

The expected value of a random variable is calculated differently for discrete and for continuous
random variables.

Discrete: if we let f(y) represent the probability mass function of a random variable Y, then the
expected value of the random variable Y is

$E(Y) = \sum_{y \in Y} y\, f(y)$ (B.17 7th: 691. 6th: 652)

where y ∈ Y requires that y be a value that the random variable can take [i.e., a mass point].

In other words, we multiply each possible value the random variable can take (each mass point)
by its probability & we sum up the products.

I will note that E(Y) is a standard way in which the expected value of a random variable is
represented.
We might also represent it as $\mu_y$.

Ex:

Y         10     20     30     40     E(Y)
f(y)      0.20   0.50   0.20   0.10
y∙f(y)    2      10     6      4      22

Let’s consider an example using a standard probability mass function.

Ex: Bernoulli

Here $f(y) = p^y (1-p)^{1-y}$ for $y \in \{0, 1\}$.

How would we calculate the expected value?

Using the above formula, it's 0∙f(0) + 1∙f(1).

If we put the specific functional forms for f(0) and f(1) into the formula, we get

$0 \cdot p^0 (1-p)^{1-0} + 1 \cdot p^1 (1-p)^{1-1} = p$

So, the expected value in this case is simply the probability of observing y = 1.
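Both expected-value calculations can be reproduced with a short Python sketch. The helper name `expected_value` and the Bernoulli parameter p = 0.25 are assumptions for illustration; the four-point pmf is the one from the table example.

```python
def expected_value(pmf):
    """E(Y) = sum over mass points y of y * f(y)."""
    return sum(y * prob for y, prob in pmf.items())

# Table example from the notes
table_pmf = {10: 0.20, 20: 0.50, 30: 0.20, 40: 0.10}
print(round(expected_value(table_pmf), 10))  # 22.0

# Bernoulli with an illustrative p (the value 0.25 is an assumption)
p = 0.25
bernoulli = {0: 1 - p, 1: p}
print(expected_value(bernoulli))  # 0.25
```

The rounding in the first print only hides tiny floating-point error; the arithmetic is the same multiply-and-sum as in the table.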

Continuous Case:

The expected value of continuous random variables is calculated in a similar fashion except that
we integrate rather than sum. We won’t worry about integration here.

While we won’t integrate, I will identify the expected value of continuous random variables.

I will make one observation with respect to the general normal distribution, however. It can be
shown that the expected value of a general normal variable = μ.

It follows that the expected value of the standard normal variable is 0.

Properties of the Expected Value (7th: 692. 6th: 653)

Our discussion of the expected value (or mean) of a random variable uses the expectation
operator. Generally, it applies to functions of y - g(Y). You’ll see what I mean by functions of y
through the examples we will consider.

(1) Property E1: if b is a constant E(b) = b

We will discuss the result in the context of a discrete random variable.

We have $E(b) = \sum_{y \in Y} b\, f(y)$.

Because b is a constant, we can pull it out of the summation and get $E(b) = b \sum_{y \in Y} f(y)$.

Since the summation of the probabilities = 1, E(b) = b.

A related result for a constant times a random variable: E(bY) = bE(Y)

Makes sense: if we multiply a random variable by a constant - say, 500 - each possible value it
can take will be multiplied by 500. We can - as in any summation - factor out the constant.

Ex: Y is the Bernoulli described above. Let b = 5, so that g(Y) = 5Y.

We saw that the expected value of the Bernoulli was p. The above formula implies that the
expected value of the function g(Y) will be 5p.

Let's see if that holds.

$E(g(Y)) = E(5Y) = 5 \cdot 0 \cdot (1-p) + 5 \cdot 1 \cdot p = 5\,(0 \cdot (1-p) + 1 \cdot p) = 5E(Y) = 5p$

(2) Property E2: E(a + bY) = a + bE(Y) (7th: 693. 6th: 653)

Note that this property indicates that the expectation operator (like the summation operator)
passes through linear transformations.

Ex. The above Bernoulli example with a = 10 and b = 5: E(10 + 5Y) = 10 + 5E(Y) = 10 + 5p.
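Properties E1 and E2 can be checked on the Bernoulli example with a small Python sketch. The helper `expectation`, which computes E[g(Y)] for a discrete pmf, and the value p = 0.25 are my own illustrative choices; a = 10 and b = 5 follow the examples above.

```python
def expectation(pmf, g=lambda y: y):
    """E[g(Y)] = sum of g(y) * f(y) over the mass points of Y."""
    return sum(g(y) * prob for y, prob in pmf.items())

p = 0.25                      # illustrative Bernoulli parameter (assumption)
bernoulli = {0: 1 - p, 1: p}
a, b = 10, 5

print(expectation(bernoulli, lambda y: b))          # 5.0   -> E(b) = b
print(expectation(bernoulli, lambda y: b * y))      # 1.25  -> b*E(Y) = 5p
print(expectation(bernoulli, lambda y: a + b * y))  # 11.25 -> a + b*E(Y)
```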


Property E3: (7th: 693. 6th: 653) Consider a1•X1 + … + an•Xn. Its expected value is ….

Ex. E(Y²) ≠ (E(Y))²

With respect to the left-hand side, E(Y²) is the square of each value of y multiplied
by the probability of observing that value of y, with all of the products summed.

It is generally true that it does not equal the square of the expected value of the variable.

Ex: Bernoulli: we know E(Y) = p, which implies (E(Y))² = p².

What's E(Y²)?
0²∙f(0) + 1²∙f(1) = 0²∙(1-p) + 1²∙p = p, which differs from p² whenever 0 < p < 1.
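
The Bernoulli comparison above can be reproduced in a few lines. This is only an illustrative sketch; p = 0.3 is a hypothetical value, not one used in the notes.

```python
p = 0.3                       # hypothetical success probability
bernoulli = {0: 1 - p, 1: p}  # pmf of a Bernoulli(p) random variable

e_y = sum(y * f for y, f in bernoulli.items())               # E(Y)
e_y_squared = sum(y ** 2 * f for y, f in bernoulli.items())  # E(Y^2)
square_of_e_y = e_y ** 2                                     # (E(Y))^2

print(e_y_squared, square_of_e_y)  # p versus p squared
```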

Median: (7th: 694) …

B-3d: Measures of Variability (7th: 695. 6th: 656)

These measures give us a feel for how a variable is spread out. They do not describe a point in
the distribution (as do the mean & median).

We will focus on the variance of a distribution and its associated measure the standard deviation.
Variance & Standard Deviation

Variance (7th: 695)

We represent the variance as E(Y-μy)² = E[(Y-μy)²]

Where μy is the population mean of the random variable Y.

It is the expected value of the squared difference between the random variable & its mean.

Notation: typically represent it as follows: σy².

☺ The formula for the variance of a discrete random variable is

E[(Y-μy)²] = ∑_{y∈Y} (y-μy)²∙f(y)


For a continuous random variable, we calculate it in a similar way except that we integrate (don’t
worry about the integration).

[Remember that parameters – μy - are constants]

Note that the pmf can take any form.

We can show that E[(Y-μy)²] = E(Y²) - μy² = E(Y²) - [E(Y)]². (7th: 695: B.24)

Ex: do it for the earlier example

             Y       10      20      30      40     Sum
          f(y)     0.20    0.50    0.20    0.10
        y∙f(y)        2      10       6       4      22  = E(Y)
        y - μy      -12      -2       8      18
  (y-μy)²∙f(y)     28.8       2    12.8    32.4      76  = variance

HW: show that E(Y²) - μy² = 76.

☞ While the variance does describe the spread of a distribution, it is not comparable with the
mean because it is in terms of squared values of the random variable.

Ex. Consumption. The mean is in terms of dong [dollars]. The variance is in terms of squared
dong [dollars].

Because we would like to be able to compare the dispersion & the mean, we must translate the
variance into the same units as the mean. How would we do it? Take the square root of the var.

We call the square root of the variance the standard deviation (7th: 696)
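
The worked table and the two variance formulas can be checked with a short script. This is a sketch using the pmf from the example above; the variable names are my own.

```python
pmf = {10: 0.20, 20: 0.50, 30: 0.20, 40: 0.10}  # values and probabilities from the example

mean = sum(y * f for y, f in pmf.items())                        # E(Y)
var_definition = sum((y - mean) ** 2 * f for y, f in pmf.items())  # E[(Y - mu)^2]
second_moment = sum(y ** 2 * f for y, f in pmf.items())          # E(Y^2)
var_shortcut = second_moment - mean ** 2                         # E(Y^2) - mu^2
std_dev = var_definition ** 0.5                                  # same units as the mean

print(mean, var_definition, var_shortcut, std_dev)
```

Both routes give the same variance, and the standard deviation is back in the units of Y.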

Ex: general normal distribution Y ∼ N(μ,σ²). Can show that the standard deviation of the
general normal distribution is the sigma parameter; i.e., it's σ. Thus, the variance is σ².


Ex: Bernoulli: Have seen that the expected value = p

E[(Y-μy)²] = ∑_i (yi-μy)²∙f(yi) = (0-p)²∙(1-p) + (1-p)²∙p

= p²∙(1-p) + (1-p)²∙p = p∙(1-p)∙[p + (1-p)] = p∙(1-p).

Do it with the other formula: E(Y²) = p & E(Y) = p ⇒ E(Y²) - [E(Y)]² = p - p² = p(1-p).

Properties of the Variance (7th: 696)

We can talk about the properties of the variance.

Go through each of the following:

(1) Property Var.1: variance of a constant equals zero.

(2) If b is a constant, V(Y+b) = V(Y). The constant does not affect the variance.

(3) Property Var.2: if “a” is a constant V(aY) = a²V(Y).

The properties that we may be interested in may be summarized by considering the variance of
the following linear transformation of the rv Y:

Z = b + a∙Y

We can show that V(Z) = a²∙V(Y).

Two points to note:

(1) The addition of a constant - b (e.g., 10) - to a random variable does not change its variance.
It simply shifts the distribution (& the mean). [Saw this in the handout with respect to the
general normal distribution.]

(2) The variance of the original random variable is multiplied by the square of a.

Ex: Y is Bernoulli. We know that V(Y) = p(1-p).

What is the variance of Z = 0.5 + 0.5Y? V(Z) = (0.5)²∙V(Y) = 0.25∙p(1-p).
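
A quick numerical check of V(b + a∙Y) = a²∙V(Y) for the Bernoulli example. This is a sketch: the helper names variance and transform and the value p = 0.3 are assumptions made for illustration.

```python
def variance(pmf):
    """Variance of a discrete pmf given as {value: probability}."""
    mean = sum(y * f for y, f in pmf.items())
    return sum((y - mean) ** 2 * f for y, f in pmf.items())

def transform(pmf, a, b):
    """pmf of Z = b + a*Y (assumes a != 0, so distinct y give distinct z)."""
    return {b + a * y: f for y, f in pmf.items()}

p = 0.3                       # hypothetical success probability
bernoulli = {0: 1 - p, 1: p}

var_y = variance(bernoulli)                       # p(1-p)
var_z = variance(transform(bernoulli, 0.5, 0.5))  # Z = 0.5 + 0.5Y
var_shifted = variance(transform(bernoulli, 1, 10))  # adding 10 only shifts

print(var_y, var_z, var_shifted)
```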


Summary

The foregoing is our basic theoretical analysis of the distribution of a single random variable:

we first described random variables with probability distributions & then talked about certain
characteristics of those distributions.

You will recall that we undertook a review of probability distributions because we wanted to link
a sample of data with the population of interest.

In order to make that link, we must discuss some aspects of probability distributions of more
than one random variable & the associated expectations. So, let’s turn to them.

Joint Distributions: (7th: 688) Functions of More Than One Random Variable

We will first consider a function of two random variables. Suppose that Y and X are two random
variables which may be discrete or continuous. We will focus on the situation in which they are
discrete.

These random variables will have some probability distribution which describes the probabilities
that the two variables jointly take on specified values.

Represent the joint distribution as f(y,x).

Ex. PCs sold and printers sold. Suppose that we are focusing on the population of people
who purchase computers and we’re interested in whether they also purchase a printer. Let Y
represent the number of printers sold in a day and X represent the # of PCs sold in a day.

This population includes 2 random variables: (1) # of PCs sold & (2) # of printers sold.

We call the probability distribution for these two random variables a joint probability mass
function. It's a mass function because our variables are discrete.

The joint pmf is below. The joint pmf describes the probability that we get any combination of (#
of PCs sold in a day, # of printers sold in that day).


                          PCs (x)
                  0     1     2     3     4     fy(y)
            0   .03   .03   .02   .02   .01     0.11
Printers    1   .02   .05   .06   .02   .01     0.16
  (y)       2   .01   .02   .10   .05   .05     0.23
            3   .01   .01   .05   .10   .10     0.27
            4   .01   .01   .01   .05   .15     0.23

        fx(x)   .08   .12   .24   .24   .32

How do we interpret the values in the table?

Each element in the table describes the (joint) probability that the two random variables take on
the values identified in the column & the row.

Examples: …

☺ You will note that a joint pmf should satisfy all requirements of a pmf. Thus, all
probabilities should sum to one & there should be no negative probabilities. [confirm for the
above]

Marginal Probabilities (7th: 688)

We can obtain from the joint pmf what are called marginal probabilities and marginal pmf's.

A marginal pmf, in this context, is the pmf of one of the 2 random variables. The phrase marginal
pmf is really descriptive. It provides us with the probabilities we'd obtain if we summed across
rows or down the cols; i.e., the probabilities that would be contained in a margin.

Ex. Suppose that we're interested only in the number of PCs sold (X) and we're wondering
about the likelihood that a certain number is sold in a day. How would we calculate that
probability? For a given # of PCs sold, sum down all possible # of printers sold.

We can do the same for the # of Printers sold.

So, from this joint pmf we can obtain two marginal pmf's.

We represent generally the marginal probability functions of y and x as fy(y) and fx(x).

Ex: in the example they are the fY(∙) column and fX(∙) row.

You should note [and confirm] that the marginal pmf's should satisfy all requirements of a pmf.
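
The marginal pmf's for the PC and printer example can be computed by summing rows and columns of the joint pmf. This is a minimal sketch; the nested-list layout joint[y][x] is my own choice of representation.

```python
# Joint pmf from the PC/printer table: rows are printers y = 0..4, columns are PCs x = 0..4.
joint = [
    [.03, .03, .02, .02, .01],
    [.02, .05, .06, .02, .01],
    [.01, .02, .10, .05, .05],
    [.01, .01, .05, .10, .10],
    [.01, .01, .01, .05, .15],
]

f_y = [sum(row) for row in joint]        # sum across each row: marginal pmf of printers
f_x = [sum(col) for col in zip(*joint)]  # sum down each column: marginal pmf of PCs
total = sum(f_y)                         # all joint probabilities together

print([round(v, 2) for v in f_y])
print([round(v, 2) for v in f_x])
print(round(total, 2))
```

Both marginals contain no negative entries and each sums to one, as a pmf must.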

Statistical Independence (7th: 688)

Let's turn to another concept regarding two random variables; namely, statistical independence.

Independence has a specific definition: 2 variables are independent when their joint probability
distribution is the product of the marginal pd's of the two variables.

In other words, f(x,y) = fx(x)fy(y) for every pair of values (x,y).

We may interpret independence as saying that the two variables are not related.

Alternatively, knowing the value of 1 variable provides no insight into the likelihood of
obtaining a certain value of the other var.

Ex: PCs sold and Printers sold. Independence implies that knowing the number of PCs sold in a
day provides no insight into the number of printers sold in that day.

Would we expect statistical independence (SI) in this example? Probably not.

Ex: my consumption & the consumption of Joe Biden? Yes.

Ex. Do we have it in the PC – Printer example?

f(0,0) = 0.03, but fx(0)fy(0) = (0.08)(0.11) = 0.0088. Since 0.03 ≠ 0.0088, they’re not SI.

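Example of Statistical Independence: X = 1 if age < 25, = 0 if age ≥ 25;
Y = 1 if have a cell phone, = 0 otherwise.

Pr(X = 1) = 0.8. Pr(Y = 1) = 0.4.

                  X
              0       1      fy(y)
  Y     0   0.12    0.48     0.6
        1   0.08    0.32     0.4
    fx(x)   0.20    0.80

Every cell equals the product of its two marginal probabilities (e.g., 0.12 = 0.20 × 0.6).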
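
The definition f(x,y) = fx(x)fy(y) can be tested cell by cell. The sketch below applies it to both tables; the function name is_independent and the small numerical tolerance are assumptions of this illustration.

```python
def is_independent(joint, tol=1e-9):
    """True if every cell joint[y][x] equals the product of its marginals."""
    f_y = [sum(row) for row in joint]
    f_x = [sum(col) for col in zip(*joint)]
    return all(
        abs(joint[i][j] - f_y[i] * f_x[j]) <= tol
        for i in range(len(f_y))
        for j in range(len(f_x))
    )

# Age / cell phone table: rows are Y = 0, 1; columns are X = 0, 1.
phone = [[0.12, 0.48],
         [0.08, 0.32]]

# PC / printer table: rows are printers y, columns are PCs x.
pcs = [
    [.03, .03, .02, .02, .01],
    [.02, .05, .06, .02, .01],
    [.01, .02, .10, .05, .05],
    [.01, .01, .05, .10, .10],
    [.01, .01, .01, .05, .15],
]

print(is_independent(phone), is_independent(pcs))
```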

Features of Joint and Conditional Distributions (7th: 697)

We can use the marginal pmf's to obtain the expected values of each random variable (r.v.).

The expected value is E(Y) = ∑_{y∈Y} y∙fy(y).

Ex. Do for PCs (x).

Expected values and variances of functions of the two random variables (4th: 437)

Suppose that we're interested in the expected value of the r.v. Z = Y + X.

Does the fact that we have a joint distribution affect our conclusions with respect to
expectations?

Not in the case of the E(Z). E(Z) = E(Y) + E(X).

The foregoing will also be true for general linear transformations: Z = a + bY + cX.

E(Z) = a + bE(Y) + cE(X)

The foregoing is a nice property of the expectation operator. It holds generally: the operator
moves through linear transformations of random variables.
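
Using the PC and printer joint pmf, the expected value of a linear combination can be computed directly from the joint distribution and compared with the formula. This is a sketch; the constants a, b, c are hypothetical.

```python
# Joint pmf from the PC/printer table: joint[y][x], printers y in rows, PCs x in columns.
joint = [
    [.03, .03, .02, .02, .01],
    [.02, .05, .06, .02, .01],
    [.01, .02, .10, .05, .05],
    [.01, .01, .05, .10, .10],
    [.01, .01, .01, .05, .15],
]

def expect(g):
    """E[g(X, Y)] computed directly from the joint pmf."""
    return sum(g(x, y) * joint[y][x] for y in range(5) for x in range(5))

e_x = expect(lambda x, y: x)        # expected PCs sold per day
e_y = expect(lambda x, y: y)        # expected printers sold per day
e_sum = expect(lambda x, y: x + y)  # E(X + Y), no marginals used

a, b, c = 1, 2, 3                   # hypothetical constants
e_linear = expect(lambda x, y: a + b * y + c * x)

print(e_x, e_y, e_sum, e_linear)
```

Here e_sum matches e_x + e_y and e_linear matches a + b∙E(Y) + c∙E(X), even though X and Y are not independent.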

Finally, we may note that the operator does not, in general, move through non-linear transformations of
variables:

ex. Z = X∙Y. E(Z) ≠ E(X)E(Y) in general, with one exception I'll talk about later.

ex. Z = X² + Y. Linear? It is linear in X² and Y, so E(Z) = E(X²) + E(Y); but E(X²) ≠ [E(X)]² in general.

Variance of Z = X + Y

What about the variance of Z = X + Y? In order to determine the variance of the sum of two variables we
must introduce the covariance.


B-4b Covariance (7th: 697)

The formula for the covariance between two random variables is

COV(X,Y) = E[(X - μx)(Y - μy)],

where μx = E(X) and μy = E(Y).

We can show that E[(X - μx)(Y - μy)] = E(X●Y) - E(X)E(Y). This makes calculations easier.

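Ex:               X
              0       1
  Y     0    0.4     0.2
        1    0.1     0.3

Writing f(x,y) with x first:

E(X●Y) = 0●0●f(0,0) + 1●0●f(1,0) + 0●1●f(0,1) + 1●1●f(1,1)
       = 0●0●(0.4) + 1●0●(0.2) + 0●1●(0.1) + 1●1●(0.3) = .3

E(X) = .2 + .3 = .5 and E(Y) = .1 + .3 = .4, so COV(X,Y) = .3 - (.5)●(.4) = .3 - .2 = .1.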
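
The covariance in this 2×2 example can be computed both from the definition and from the shortcut formula. A minimal sketch, with the joint pmf stored as a dictionary keyed by (x, y):

```python
# Joint pmf from the example; keys are (x, y).
joint = {(0, 0): 0.4, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.3}

def expect(g):
    """E[g(X, Y)] under the joint pmf."""
    return sum(g(x, y) * f for (x, y), f in joint.items())

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
cov_definition = expect(lambda x, y: (x - mu_x) * (y - mu_y))  # E[(X-mu_x)(Y-mu_y)]
cov_shortcut = expect(lambda x, y: x * y) - mu_x * mu_y        # E(XY) - E(X)E(Y)

print(mu_x, mu_y, cov_definition, cov_shortcut)
```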

Properties of Covariance (7th: 698)

Prop Cov 1: Impact of independence on covariance.

If X and Y are statistically independent Cov(X,Y) = 0.

When discussing the expectation operator earlier, I noted that, in general, E(X∙Y) ≠ E(X)E(Y).

When the variables are independent, however, the two are equal: E(X∙Y) = E(X)E(Y).

What does this imply with respect to covariance? I have noted that

COV(X,Y) = E(X●Y) - E(X)E(Y)

With the observation with respect to independence, we get COV(X,Y) = E(X)E(Y) - E(X)E(Y) = 0.

Thus, independence implies a zero covariance.


I will note that the opposite is not true: zero covariance does not imply independence.

☞ Let's return to formula (VI) (and change definition of the rv’s).

Suppose that Z = Y1 + Y2 & suppose that the 2 variables are independent.

What's V(Z)? Formula (VI) is V(Z) = V(Y1) + V(Y2) + 2∙COV(Y1,Y2)

Since the covariance = 0 we get, V(Z) = V(Y1) + V(Y2)

which implies that when variables are independent the variance of their sum is the sum of their
variances.

Property COV 2: Let Z1= a + bX and Z2= c + dY

then cov(Z1,Z2) = bdCov(X,Y).

cov(Z1,Z2) = cov(a + bX, c + dY) = cov(a,c) + cov(a,dY) + cov(bX,c) + cov(bX,dY)

= cov(bX,dY) = bdcov(X,Y)

because the covariance of a constant w/ a variable = 0 (the constant does not vary).

Ex: Let X = Experience (in years) & Y = income (in 1,000s); let b = 12 & d = 1000.

Interpretation of the change: the data are now in months and in dong, and the covariance is multiplied by b∙d = 12,000.
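
Property COV 2 can be illustrated with any joint pmf. The sketch below reuses the 2×2 joint pmf from the covariance example (not actual experience or income data) and hypothetical shift constants a and c, to show that only b and d matter.

```python
# Joint pmf reused from the 2x2 covariance example; keys are (x, y).
joint = {(0, 0): 0.4, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.3}

def expect(g):
    return sum(g(x, y) * f for (x, y), f in joint.items())

def cov(g, h):
    """Covariance between g(X, Y) and h(X, Y) from the definition."""
    mu_g, mu_h = expect(g), expect(h)
    return expect(lambda x, y: (g(x, y) - mu_g) * (h(x, y) - mu_h))

a, b = 3, 12     # a is a hypothetical shift; b = 12 as in the example
c, d = -5, 1000  # c is a hypothetical shift; d = 1000 as in the example

cov_xy = cov(lambda x, y: x, lambda x, y: y)
cov_z = cov(lambda x, y: a + b * x, lambda x, y: c + d * y)

print(cov_xy, cov_z, b * d * cov_xy)
```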

Note that cov(X,X) = var(X). Plug X in for Y in the general formula at (V) & consider what other
formula it looks like.

B-4c: Correlation Coefficient (7th: 698)

☞ There is a problem with the covariance: its size depends on the units used for a variable.

Thus, if you measured income in dollars you'd get a different covariance than if you measured it
in terms of hundreds of dollars.


Because of this problem with units, we often focus on the correlation coefficient to obtain some
insight into the degree to which two variables are related.

The correlation between two random variables is ρ = COV(X,Y)/[SD(X)∙SD(Y)].

You will note that we divide the covariance by the product of the standard deviations of the 2 variables.

How does this get rid of the units problem? The product of the standard deviations in the denominator is
measured in the same units as the covariance in the numerator; so, the units cancel out.

☞ Ex: X & Y measured in dong. Get dong squared in the numerator & the
denominator.

If measured in 1000s of dong, get (1000s of dong) squared in the numerator & the denominator.

☞ The correlation coefficient will lie between -1 and 1. It measures the extent to which
variables are related linearly.

☞ Ex. Consumption & income. Would expect increased income to produce greater
consumption. Thus, would expect them to be positively related. Expect a + covariance.

☞ Ex. Price of a product & demand. We would expect a higher price to result in lower D: a
negative covariance.

A correlation coefficient = 1 implies that the two variables are perfectly linearly related in a
positive manner.

I.e., they will lie on a straight line with a positive slope.

A correlation coefficient = -1 implies that the 2 variables are perfectly linearly related in a negative
manner; i.e., they will lie on a straight line with a negative slope.

If the correlation coefficient = 0, then the 2 variables are not linearly related.
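
To see that the correlation coefficient does not depend on units while the covariance does, the sketch below rescales both variables. It reuses the 2×2 joint pmf from the covariance example; the scale factors 1000 and 12 are hypothetical unit changes.

```python
# Joint pmf reused from the 2x2 covariance example; keys are (x, y).
joint = {(0, 0): 0.4, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.3}

def cov_and_corr(scale_x=1.0, scale_y=1.0):
    """Covariance and correlation of (scale_x*X, scale_y*Y), for positive scales."""
    points = [(scale_x * x, scale_y * y, f) for (x, y), f in joint.items()]
    mu_x = sum(x * f for x, _, f in points)
    mu_y = sum(y * f for _, y, f in points)
    cov = sum((x - mu_x) * (y - mu_y) * f for x, y, f in points)
    sd_x = sum((x - mu_x) ** 2 * f for x, _, f in points) ** 0.5
    sd_y = sum((y - mu_y) ** 2 * f for _, y, f in points) ** 0.5
    return cov, cov / (sd_x * sd_y)

cov_raw, rho_raw = cov_and_corr()
cov_scaled, rho_scaled = cov_and_corr(scale_x=1000, scale_y=12)

print(cov_raw, rho_raw)
print(cov_scaled, rho_scaled)
```

The covariance changes by the factor 12,000, while the correlation is unchanged.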


Variance of Sums of Random Variables (7th: 699)

Now, let's return to the r.v. Z = X + Y.

What is Var(Z)? It turns out to be V(Z) = V(X) + V(Y) + 2∙COV(X,Y).

It is not simply the sum of the variances. Must account for the covariance.

We can generalize this result: suppose that Z = a + bX + cY.

V(Z) = b²∙V(X) + c²∙V(Y) + 2bc∙COV(X,Y)

The first two elements reflect the properties of the variance we have talked about.

Note that the constant does not affect the variance.

Ex. 2: Z = X - Y. Here b = 1 & c = -1, so V(Z) = V(X) + V(Y) - 2∙COV(X,Y).

Statistical Independence: no covariance, so V(Z) = V(X) + V(Y) in both cases.
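
The variance formulas for X + Y and X - Y can be checked against a direct calculation. This sketch reuses the 2×2 joint pmf from the covariance example; the helper names are my own.

```python
# Joint pmf reused from the 2x2 covariance example; keys are (x, y).
joint = {(0, 0): 0.4, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.3}

def expect(g):
    return sum(g(x, y) * f for (x, y), f in joint.items())

def var(g):
    """Variance of g(X, Y), computed directly as E[g^2] - (E[g])^2."""
    return expect(lambda x, y: g(x, y) ** 2) - expect(g) ** 2

var_x = var(lambda x, y: x)
var_y = var(lambda x, y: y)
cov_xy = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

var_sum_direct = var(lambda x, y: x + y)   # V(X + Y) without the formula
var_diff_direct = var(lambda x, y: x - y)  # V(X - Y) without the formula

print(var_sum_direct, var_x + var_y + 2 * cov_xy)
print(var_diff_direct, var_x + var_y - 2 * cov_xy)
```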

Fundamentals of Mathematical Statistics (7th: 714)

☞ Next, let's extend our analysis to greater than two random variables.

[Turn to the reason we considered functions of greater than one variable.]

Suppose we draw n random variables - Y1, Y2, ..., Yn - and that each is drawn from the
same probability distribution function (either continuous or discrete) with mean μY & standard
deviation σY, which are known.

Suppose further that each variable is independent. Note that the independence assumption is very
important!

How might we interpret this group of random variables?

As a SRS of size n from a known population. Each random variable represents a draw from the
population.

☞ Consider the following function: 𝑦̅ = (1/n)∑_{i=1}^{n} yi = (1/n)(y1 + y2 + ... + yn) (7th: 715).

Does anyone recognize this function?

It's the mean of the random variables drawn: the sample mean.

You should note that because 𝑦̅ is a function of random variables it is a rv. Thus, the sample
mean is a rv.

!! This is a key point: the sample mean is a random variable!!!

The foregoing implies that the sample mean has its own probability distribution.

☺ We might ask what are the mean and variance of the random variable called the sample mean.

What is the expected value of 𝑦̅? E(𝑦̅) = (1/n) [E(y1) + E(y2) + ... + E(yn)] (7th: 717)

If we note that each E(yi) = μY because the yi are drawn from the same probability
distribution and that we have n of them, we see that

E(𝑦̅) = (1/n)(n∙μY) = μY. [End Class 2 2017 VN]

So, the expected value of the mean of the random sample is the population mean. That's nice.

☺ We may next ask "what is the variance of the sample mean?"

To answer that question, note that we may represent 𝑦̅ as

𝑦̅ = (1/n)(y1) + (1/n)(y2) + ... + (1/n)(yn ).

Recalling that independence implies no covariances between the variables & the rules regarding
variances we just discussed, we see that

V(𝑦̅) = (1/n)²V(y1) + (1/n)²V(y2) + ... + (1/n)²V(yn) (7th: 718)

= (1/n²)[σY² + σY² + ... + σY²]


Because there are n σY² terms, we see that the variance of the sample mean is V(𝑦̅) = (1/n²)(n∙σY²) = (1/n)σY².

So, we have seen that the probability distribution of the sample mean
has an expected value equal to the mean of the underlying population distribution & a variance
equal to (1/n)σY².
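
For a small sample these two results can be verified exactly by listing every possible sample. The sketch below does this for n = 3 independent draws from the earlier discrete distribution (values 10, 20, 30, 40, with mean 22 and variance 76); the choice n = 3 is only for illustration.

```python
from itertools import product

pmf = {10: 0.20, 20: 0.50, 30: 0.20, 40: 0.10}  # population with mean 22, variance 76
n = 3                                           # hypothetical sample size

mean_of_ybar = 0.0
second_moment_of_ybar = 0.0
for draws in product(pmf.items(), repeat=n):    # every possible ordered sample
    prob = 1.0
    total = 0
    for y, f in draws:
        prob *= f    # independence: joint probability is the product
        total += y
    ybar = total / n
    mean_of_ybar += ybar * prob
    second_moment_of_ybar += ybar ** 2 * prob

var_of_ybar = second_moment_of_ybar - mean_of_ybar ** 2

print(mean_of_ybar, var_of_ybar, 76 / n)
```

The enumeration relies on the independence assumption when it multiplies the probabilities of the individual draws.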

(4) Law of Large Numbers & Central Limit Theorem

Now, consider what happens as n increases. The variance of the sample mean, (1/n)σY², shrinks
toward zero. Thus, we expect the distribution of the sample mean
to collapse on the population mean as the sample size approaches ∞.

Graphically, we have ... (7th: 719)


This result is known as the Law of Large Numbers (7th: 724).

This is a nice result. It says that the expected value of the mean of a SRS is the mean of the underlying
population & that as n increases our sample mean is more & more likely to be close to that
population mean.
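
A small simulation gives a feel for the Law of Large Numbers. This is a sketch only: the population is the earlier discrete distribution (mean 22), and the seed and sample sizes are arbitrary choices.

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable

values = [10, 20, 30, 40]
probs = [0.20, 0.50, 0.20, 0.10]  # population mean is 22

def sample_mean(n):
    """Mean of a simple random sample of size n from the population."""
    draws = random.choices(values, weights=probs, k=n)
    return sum(draws) / n

means = {n: sample_mean(n) for n in (10, 1_000, 100_000)}

for n, ybar in means.items():
    print(n, ybar)
```

Any single run is random, so a small sample may happen to land close to 22; the point is that large samples are very unlikely to land far from it.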

Central Limit Theorem (7th: 724) we'll turn next to this

Here’s one definition of the CLT:

"If [random variable] Y has any distribution with mean μY & variance σY², then the distribution
of
(𝑦̅ - μY)/(σY²/n)^0.5

approaches the standard normal distribution as sample size n increases. Therefore the distribution
of 𝑦̅ in large samples is approximately normal with mean μY & variance σY²/n."

You should note that the random variable can have any distribution. The distribution can be
continuous or discrete.

[Figure C-5 (4th: 473)]
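
The CLT can be illustrated by simulation with a distribution that is far from normal. The sketch below uses a Bernoulli population; p = 0.3, n = 200, the number of replications and the seed are all hypothetical choices. It checks how often the standardized sample mean falls within ±1.96, which happens about 95% of the time for a standard normal variable.

```python
import random

random.seed(1)  # arbitrary seed so the run is repeatable

p, n, reps = 0.3, 200, 2000   # hypothetical Bernoulli parameter and sizes
mu, sigma2 = p, p * (1 - p)   # Bernoulli mean and variance

def standardized_mean():
    """(ybar - mu) / sqrt(sigma^2 / n) for one simulated sample of size n."""
    ybar = sum(random.random() < p for _ in range(n)) / n
    return (ybar - mu) / (sigma2 / n) ** 0.5

z_values = [standardized_mean() for _ in range(reps)]
share_within = sum(abs(z) <= 1.96 for z in z_values) / reps
average_z = sum(z_values) / reps

print(share_within, average_z)
```

Because the Bernoulli is discrete, the share will not equal 0.95 exactly, but it should be close.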



Appendix C: Fundamentals of Mathematical Statistics (7th: 714)

That ends our discussion of probability theory. You will remember that we talked about
probability theory because we wanted to use a sample to make inferences about the population.
We can now make such inferences.

In our discussion immediately above, we assumed that we knew the underlying pop distribution.
We were able to show that the expected value of the sample mean is the population mean.

☼ Now we turn to the typical situation facing economists; namely,

we have a sample of data which comes from an unknown population distribution and we're
interested in the mean value of the relevant variable for the population.

It is important to realize that we do not know the population probability distribution or any of its
characteristics.

Ex. We don’t know anything about the population of the approximately 223 million households.

☞ We would like to make a "best guess" (or an inference) about the mean (or expected value)
of the unknown population.

We are asking,

"How do we use the sample of data to make inferences about the mean of the
unknown distribution?"

This is a problem of Statistical Inference.

☞ Statistical inference involves searching for estimators which allow us to gain insights into
the value of the unknown population parameter: e.g., in our example it’s the population mean.

☺ Estimator (7th: 715): an estimator of a population parameter θ is a random variable which
depends on (is a function of) the data and whose realizations provide approximations of θ; e.g., in
our example, the population parameter is the mean.


Wooldridge (7th: 715): “Given a random sample … drawn from a population distribution that
depends on an unknown parameter θ, an estimator of θ is a rule that assigns each possible
outcome of the sample a value of θ.”

The concept of an estimator is important!!

☞ Consider the population mean. As we will see, a variety of possible estimators for the
population mean exist.

We might ask "where should we start?"

Ex. Returning to the household consumption example: what would an estimator of the mean of
the population distribution be?

☺ There is an idea in the literature regarding estimators called the Analogy Principle (AP).

The AP states that you should look for the sample analog of the population characteristic you are
interested in. Thus, if you're interested in the population mean, you should use the sample mean
as an estimator.

So, in this case, the Analogy Principle implies considering the sample mean as an estimator of
the population mean.

The formula for the mean of a SRS of a random variable is 𝑦̅ = (1/n)∑_i yi

☺ Is it an estimator, according to the above description?

It is a formula (a function of the data) and it is random because we have seen that the sample
mean is a random variable. So, it is a possible estimator.

☞ Returning to the sample mean, we have already talked about its characteristics.

We have seen that its expected value = the population mean. That is a nice characteristic.

☞ So, should we rely on it as an estimator? The answer is not necessarily obvious because it is
not the only possible estimator of the population mean which has this characteristic.


If we think about the sample mean, we realize that it is a function of the sample which places
equal weight on each observation in the sample, with the weights summing to one.

In other words, we saw that we can represent it as

𝑦̅ = (1/n)(y1) + (1/n)(y2) + ... +(1/n)(yn).

𝑦̅ is a function which puts a weight of (1/n) on each random variable in the formula.

We may also note that the weights sum to one: there are n weights of (1/n).

✌ In light of these observations, imagine an estimator which has weights which sum to one but
which has different weights than those identified above.

Ex 1: consider an estimator which places a weight of one on the first observation and a weight of
zero on all other observations.

Zalt = (1)(y1) + (0)(y2) + ... + (0)(yn ).

What is its expected value? … E(Zalt) = (1)E(y1) + (0)E(y2) + ... + (0)E(yn ) = μY.

The population mean.

Ex. 2: Z = (1/2)(y1) + (1/2)(y2) + (0)(y3) + ... + (0)(yn). E(Z) = (1/2)μY + (1/2)μY = μY.

☞ So, we have several different estimators of the population mean of this random variable.

Which estimator should we choose? (7th: 716) In order to answer that question, we have to
identify certain characteristics of estimators we might deem desirable & then compare different
estimators with respect to those characteristics. We will do that now.

Analysis of desirable properties of estimators focuses on two different situations:

(1) Small Sample Theory: we consider properties of an estimator when we have a small (or fixed
size) sample, and

(2) Large Sample (Asymptotic) Theory: we consider properties of an estimator as the sample size
gets larger and larger [Gujarati & Porter 4th: 497-498]

Each of these approaches focuses on the probability distribution of the estimator. We will focus
primarily on small sample theory. (We will touch on large sample theory only a little bit.)

Small Sample Theory

Small sample theory focuses on two characteristics of estimators:

(1) bias & (2) efficiency.

Bias concerns the expected value of the estimator (7th: 716).

An estimator is unbiased if its expected value = the value of the pop characteristic we're
interested in (7th: 716, C.3).

“An estimator W of population parameter θ is an unbiased estimator if E(W) = θ for all
possible values of θ.”

Thus, if we are interested in the population mean then an unbiased estimator has an expected
value = the population mean.

The estimators we identified above were all unbiased; we saw that their expected values = the
population mean.

Indeed, any linear estimator of this kind with weights which sum to one will be unbiased.

☺ What would a biased estimator be? Let’s consider two examples.

(1) Any estimator whose weights do not sum to one.

Ex. 1: Z = (1/2)(y1) + (1)(y2) + (0)(y3) + ... + (0)(yn ) E(Z) = (3/2) μY

Ex. 2: Z = y1 + y2 + ... + yn E(Z) = (n)μY

(2) Consider the estimator Z = (1/(n+1))(y1) + (1/(n+1))(y2) + ... +(1/(n+1))(yn).

Its expected value is (n/(n+1))•μY, which differs from μY.
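
Since every observation has expected value μY, the expected value of a weighted sum of observations is μY times the sum of the weights. The sketch below applies this to the estimators discussed above; the function name and the values μY = 22 and n = 5 are assumptions chosen for illustration.

```python
def expected_value_of_estimator(weights, mu):
    """E(sum_i w_i * y_i) = mu * sum_i w_i when every E(y_i) = mu."""
    return mu * sum(weights)

mu, n = 22.0, 5  # hypothetical population mean and sample size

estimators = {
    "sample mean": [1 / n] * n,
    "first observation only": [1] + [0] * (n - 1),
    "first two, half each": [0.5, 0.5] + [0] * (n - 2),
    "weights 1/(n+1)": [1 / (n + 1)] * n,
    "plain sum": [1] * n,
}

expected = {name: expected_value_of_estimator(w, mu) for name, w in estimators.items()}

for name, value in expected.items():
    print(f"{name}: E = {value:.3f}, bias = {value - mu:.3f}")
```

The first three estimators have weights summing to one and are unbiased; the last two are biased.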


The fact that we may have many, many possible estimators which are unbiased is a reason we
consider efficiency.

Efficiency (7th: 719) concerns the variance of the estimator. It applies only to unbiased
estimators.

We say that one estimator is more efficient than another estimator if it has a smaller variance.

“Relative Efficiency” Defined (7th: 719) “If W1 and W2 are two unbiased estimators of parameter
θ, W1 is efficient relative to W2 when V(W1) ≤ V(W2), with strict inequality for at least one value
of θ.”

Why focus on the variance?

The key point to remember is that a given estimate is not likely to = the population mean. In fact,
the likelihood that it = the pop mean when the random variable is continuous = 0.

Thus, the larger the variance of an estimator the less sure we are that a given estimate is near the
actual population mean.

If we consider two estimators with different variances, the one with the smaller variance is more
likely to produce an estimate that is close to the actual population mean.

Graphically (7th: 719. Fig. C-2) ….

So, we should prefer it to the one with the larger variance.

With the definition of efficiency in mind, consider the two estimators described above.

(I) Z = (1/n)(y1) + (1/n)(y2) + ... +(1/n)(yn ) Have seen that V(Z) = (1/n)σY².

(II) Zalt = (1)(y1) + (0)(y2) + ... + (0)(yn ).

V(Zalt) = 1²∙V(y1) + 0²∙V(y2) + ... + 0²∙V(yn) = σY².


So, our original estimator, the sample mean, has a smaller variance (for n > 1) than the estimator
represented by Zalt. We can say that the original estimator is more efficient than the second
estimator.
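
With independent observations that share the variance σY², the variance of a weighted sum is σY² times the sum of the squared weights. The sketch below compares the unbiased estimators discussed so far; σY² = 76 and n = 5 are hypothetical values.

```python
def variance_of_estimator(weights, sigma2):
    """V(sum_i w_i * y_i) = sigma2 * sum_i w_i^2 for independent y_i with common variance."""
    return sigma2 * sum(w ** 2 for w in weights)

sigma2, n = 76.0, 5  # hypothetical population variance and sample size

var_sample_mean = variance_of_estimator([1 / n] * n, sigma2)           # sigma2 / n
var_first_only = variance_of_estimator([1] + [0] * (n - 1), sigma2)    # sigma2
var_first_two = variance_of_estimator([0.5, 0.5] + [0] * (n - 2), sigma2)  # sigma2 / 2

print(var_sample_mean, var_first_two, var_first_only)
```

Among these three unbiased estimators the sample mean has the smallest variance, so it is the most efficient of the three.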

The foregoing discussion of efficiency was in relative terms; we compared one estimator with
another.

It would be nice to talk about estimators in absolute terms; i.e., it would be nice to identify the
smallest possible variance among all estimators. If we could identify the smallest variance then when
we derived an estimator w/ that variance we could stop looking for "better" estimators.

If we are willing to limit ourselves to unbiased estimators which are linear combinations of the
observations, we can show that

the sample mean has the lowest variance.

We say that the sample mean is the Best Linear Unbiased Estimator (BLUE).

Of course, this result is limited to linear, unbiased estimators. We may get rid of these limitations
but doing so takes us beyond the scope of this course.
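
One way to see why the sample mean is BLUE is sketched below. It assumes, as in the discussion above, independent observations with common variance, and linear estimators whose weights w_i sum to one (so that they are unbiased). The Cauchy-Schwarz argument is my choice of proof, not necessarily the one used in the textbook.

```latex
\[
\operatorname{Var}\Big(\sum_{i=1}^{n} w_i y_i\Big) = \sigma_Y^2 \sum_{i=1}^{n} w_i^2 ,
\qquad \text{subject to } \sum_{i=1}^{n} w_i = 1 .
\]
By the Cauchy--Schwarz inequality,
\[
1 = \Big(\sum_{i=1}^{n} w_i \cdot 1\Big)^2
  \le \Big(\sum_{i=1}^{n} w_i^2\Big)\Big(\sum_{i=1}^{n} 1^2\Big)
  = n \sum_{i=1}^{n} w_i^2
\quad\Longrightarrow\quad
\sum_{i=1}^{n} w_i^2 \ge \frac{1}{n} ,
\]
with equality if and only if $w_i = 1/n$ for every $i$. Hence
\[
\operatorname{Var}\Big(\sum_{i=1}^{n} w_i y_i\Big) \ge \frac{\sigma_Y^2}{n} = \operatorname{Var}(\bar{y}) .
\]
```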

☺ To some extent, we have now answered the question we posed in the first day of class;
namely,

"may we use a sample of data to make inferences about population parameters?"

If we are interested in the population mean (or expected value), the answer is Yes. We know that
the sample mean is BLUE. Thus, we can make such an inference.

Large Sample Theory (7th: 721)

I noted earlier that the second type of analysis we undertake in econometrics is called Large
Sample Theory (or Asymptotic Analysis).

Asymptotic analysis considers characteristics of an estimator as the sample size gets larger and
larger, ultimately approaching infinity.


We won’t discuss Large Sample Theory much. The purpose of this discussion is to give you the
basic intuition underlying large sample theory. I will mention it from time to time throughout the
semester but you will not need it for problem sets or tests.

☼ We might first ask why we focus on large sample theory when we have the unbiasedness &
efficiency criteria.

We often consider large-sample theory in econometrics because the small sample properties of
an estimator are frequently impossible to analyze.

Even though the small sample characteristics of an estimator from a known population cannot be
specified, we can talk about & analyze the distribution of the estimator as the sample size → ∞.

Asymptotic analysis also focuses on two characteristics of an estimator:

(1) Consistency (7th: 721) this is closely related to (asymptotic) unbiasedness.

We say that an estimator is consistent if as n → ∞ its value converges (in probability) to the
population value of interest.

Consider the Sample Mean of Y: 𝑦̅ = (1/n)(y1) + (1/n)(y2) + ... +(1/n)(yn )

The sample mean is unbiased and its variance, (1/n)σY², shrinks to zero as n → ∞, so it is
consistent. Unbiasedness alone is not enough: Zalt = y1 is unbiased, but its variance stays at σY²
however large n gets. So, let’s consider a biased estimator.

Ex. Consider the estimator I mentioned above: ya = (1/(n+1))∑i yi

We saw that E(ya) = (n/(n+1))•μy.

Is it consistent? What happens to its value as n → ∞? It will depend on what happens to
(n/(n+1)). It approaches 1, so E(ya) approaches μy. Its variance, (n/(n+1)²)•σY², also goes to zero.

So, it is a consistent estimator of the population mean. It is not unbiased, however.
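
The behaviour of this biased but consistent estimator can be tabulated for growing n. This is a sketch; the values μy = 22 and σY² = 76 are borrowed from the earlier discrete example purely for illustration, and the variance formula assumes independent observations.

```python
mu, sigma2 = 22.0, 76.0  # hypothetical population mean and variance

def moments_of_ya(n):
    """Expected value and variance of ya = (1/(n+1)) * sum of n independent observations."""
    expected = n / (n + 1) * mu
    variance = n / (n + 1) ** 2 * sigma2
    return expected, variance

for n in (10, 100, 10_000):
    expected, variance = moments_of_ya(n)
    print(n, expected, expected - mu, variance)
```

Both the bias and the variance shrink toward zero as n grows, which is what drives consistency here.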


(2) Asymptotic Efficiency: this focuses on the variance of the estimator as sample size → ∞.

We won’t consider it other than to note that efficiency implies asymptotic efficiency.

Thus, the sample mean is asymptotically efficient.

Summary

Let's summarize what we just did.

We were interested in using a random sample of data on household consumption to draw
inferences about the unknown mean (or expected value) of the population distribution of
household consumption.

We talked about using "estimators" to make a guess about the unknown mean.

It is important to note that an estimator is a random variable because it is a function of random
variables. Thus, an estimator has its own probability distribution.

After observing that we can have many estimators for a given population parameter - in our case
we were looking for the population mean - we talked about standards we use in identifying
preferred estimators.

Those standards - for both large and small samples - focused on characteristics of the
distributions of the estimators. Specifically, the standards focused on whether the expected value
of the estimator was the pop parameter (& whether the distribution of the estimator collapsed to a
point as n increased) and on the variance of the estimator.

Throughout the discussion we focused on use of the formula for the sample mean as an estimator
for the population mean. We saw that it is the BLUE estimator of the population mean.

As a result, we can conclude that if we're interested generally - for any given situation - in
obtaining an estimate of the pop mean we should use the sample mean.

☼ The approach I just described is used generally in econometrics. We identify a population in
which we're interested and a characteristic of the population in which we're interested.

We then consider possible estimators of that characteristic. We search among the various
possible unbiased estimators for that with the lowest variance (or the lowest possible variance).


C-6: Hypothesis Testing (HT) (7th: 733)

The last major topic we consider in statistics is HT.

HT concerns testing whether statistics support our beliefs about the population parameter on
which we are focusing.

Ex: Suppose that we’re interested in the income of individuals working in Hanoi. We might have
some beliefs about the mean income of the population which we wish to test.

Suppose that we believe that the mean income of a person working in Hanoi is 7.5 million
dong/month.

We might think of using a random sample of working individuals to test whether mean income is
really 7.5m dong/month. We do so with HT.

☺ HT proceeds in two steps

(1) identify a null & an alternative hypothesis (7th: 733-734).

(2) use parameter estimates obtained from a sample to test the hypotheses.

Consider (1):

Identification of the two hypotheses involves stating formally our beliefs about the population
parameter we are considering.

The null hypothesis often assumes that a believed effect or relationship does not exist.

The alternative hypothesis identifies what you believe is true if the null hypothesis is not true.

Ex. Our population is working people in Hanoi. We’re focusing on their monthly income.
[Stats suggest a mean income of 7.6m dong/month]

Suppose that our null hypothesis is H0 : μY = 7.5m dong/month

It states that the population mean = 7.5 million dong. This is what I meant when I said that the
null assumes that beliefs are not true.

Alternative hypothesis HA: μY ≠ 7.5m dong/month

The foregoing alternative hypothesis is called a two-sided hypothesis: two-sided because under the
alternative hypothesis the mean can be less than or greater than the null hypothesis value.

Such an AH implies that we will reject the NH and accept the AH for sample means above or
below the null hypothesis value.

Can also have one-sided alternative hypotheses.

Ex: We'd get a one-sided alternative in the above example if we believe that Hanoi household consumption
is greater than the national average.

In this case, H0 : μY ≤ 7.5 m dong/month

HA : μY > 7.5 m dong/month

Consider (2):

We are interested in using our statistical techniques to test the two hypotheses.

Before we talk about Hypothesis Testing (HT) formally, let's get the intuition underlying HT.

Again, let's consider the population mean. Our NH about the mean monthly income is 7.5m.

Suppose that we propose testing the hypothesis by looking at the mean of a random sample of
data.

We then calculate the sample mean & compare it to the null hypothesis mean. We know that it is
unlikely that our sample mean will = the null hypothesis mean.

When will we believe that the null hypothesis is true? When the sample mean is close to the null
hypothesis.

The issue is "what do we mean by close?"

Alternatively, what value is so far away that we will decide that the null hypothesis is false?

This is a really good question because we know that the sample mean can take on many values
(it is a random variable with its own distribution).

We might fortuitously sample only very wealthy people or only unemployed people. In either
case, we will get a sample mean that is far from the actual population mean even though the
population mean is the null hypothesis mean.

Of course, the likelihood of sampling only millionaires or only unemployed people is not great.
Indeed, it is pretty unlikely.

We use this idea of "pretty unlikely" as a guide in deciding what sample means will provide
evidence against the null hypothesis.

We call the idea of “pretty unlikely” the level of significance of a test. The standard level of
significance chosen is 5% - chosen mostly out of convention. [At times people will use a 1%
level of significance.]

Level of Significance: if the likelihood of obtaining a sample mean at least as far from the NH
mean as the one we actually draw, assuming that the null hypothesis is true, is lower than the
level of significance, we reject the NH.

If we get a sample mean that we would observe only 4% of the time, if the NH is true, we
conclude that the NH is not true. Why? Because obtaining such a sample mean would be
unlikely.

Alternatively, if we get a sample mean that we would observe 15% of the time, if the NH is true,
we would not reject the NH.

So, we can think of the level of significance as defining what we feel is unlikely.

Level of significance is usually described with the Greek letter α. Thus, for a 5% level of
significance we set α = 0.05.

Note (Type I Error): even though we establish a level at which we will conclude that the NH is
not true, it is still possible that the NH is true. If the NH is true, there is a 5% chance that
we will reject it anyway.
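
The 5% figure can be checked with a small simulation. The sketch below is only illustrative: it assumes that income is normally distributed with a known standard deviation (sigma = 2.0m dong, a made-up value), sets the true mean equal to the NH mean of 7.5m, and counts how often a one-sided 5% test rejects. The function name and every parameter value are assumptions, not something taken from the course materials.

```python
import random
from statistics import NormalDist, fmean


def type_one_error_rate(mu0=7.5, sigma=2.0, n=50, alpha=0.05, trials=2000, seed=0):
    """Share of samples that reject a TRUE null in a one-sided (upper-tail) Z test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # critical Z for the chosen alpha
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population whose mean really is mu0.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z_sample = (fmean(sample) - mu0) / (sigma / n ** 0.5)
        if z_sample > z_crit:
            rejections += 1
    return rejections / trials


rate = type_one_error_rate()
print(f"Share of true nulls rejected: {rate:.3f}")
```

The printed share should be near 0.05: when the NH is true, roughly 5% of samples still lead us to reject it.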

With the level of significance defined, let's turn to actual testing of the hypotheses.


Testing Hypotheses About the Mean in a Normal Distribution (7th: 735)

With regard to significance testing, one may test

(a) using a Z [t] statistic or, (b) using what are called “prob values.”

We will consider each. I will give you an outline of each approach.

(a) Z [t] statistic.


(i) Identify a level of significance.
(ii) Calculate a critical value of the statistic: Zc [tc].
(iii) Calculate the sample statistic: Zs [ts]
(iv) Compare Zs [ts] and Zc [tc].

(b) Prob Value


(i) Identify a level of significance: α .
(ii) Calculate the sample statistic - Zs [ts] - and determine the
probability of observing that statistic: its prob value.
(iii) Compare the prob value and the level of significance.

Z vs. t statistic: whether you use a Z or a t statistic depends on the sample size. If the sample size
is greater than 30 [50 or 120] you may use the Z statistic. Otherwise, use the t statistic.

We will compare the t and Z distributions once we get to HT in the context of regression
analysis. For now, I will use the Z statistic.

Both methods of significance testing (which are effectively the same) use the Z statistic. Let’s
take a look at the statistic.

It involves determining whether the likelihood of obtaining the observed sample mean is less
than the level of significance, assuming that the null hypothesis is true.

I will undertake the discussion for a one-sided alternative hypothesis first: i.e.,

H0 : μY ≤ 7.5m & HA: μY > 7.5m.

If Y has a normal distribution, ȳ will have a general normal distribution with expected value
(mean) equal to μY and with variance σ²/n.
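
A quick simulation can make this result concrete. The sketch below draws many samples of size n from a normal population and compares the mean and variance of the resulting sample means with μY and σ²/n. The population values (mu = 7.5, sigma = 2.0), the sample size, and the function name are illustrative assumptions only.

```python
import random
from statistics import fmean, pvariance


def simulate_sample_means(mu=7.5, sigma=2.0, n=25, draws=5000, seed=1):
    """Return the sample means of `draws` independent samples of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(draws):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(fmean(sample))
    return means


means = simulate_sample_means()
print(f"mean of sample means:     {fmean(means):.3f}  (theory: 7.5)")
print(f"variance of sample means: {pvariance(means):.3f}  (theory: {2.0 ** 2 / 25:.3f})")
```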

We know that if we standardize the sample mean by subtracting its expected value and dividing
the difference by its standard deviation, the resulting statistic has a standard normal distribution;
i.e.,

\[ Z = \frac{\bar{y} - \mu_Y}{\sigma/\sqrt{n}} \]

has a standard normal distribution with mean zero & standard deviation = one.

Our sample Z statistic is Z where we place our NH mean, μ0, in the place of the population
mean. Thus, our sample Z statistic is

\[ Z = \frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} \]
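
As a small worked illustration of the formula above, the sketch below computes the sample Z statistic. The function name and the numbers (a sample mean of 7.9m, σ = 2.0m, n = 100) are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from math import sqrt


def z_statistic(sample_mean, mu0, sigma, n):
    """Standardize the sample mean: (sample mean - NH mean) / (sigma / sqrt(n))."""
    standard_deviation_of_mean = sigma / sqrt(n)
    return (sample_mean - mu0) / standard_deviation_of_mean


# Hypothetical sample: mean income 7.9m dong from n = 100 workers, sigma = 2.0m.
z_sample = z_statistic(sample_mean=7.9, mu0=7.5, sigma=2.0, n=100)
print(f"Zs = {z_sample:.2f}")
```

With these numbers the sample mean lies 0.4m above the NH mean and the standard deviation of the sample mean is 0.2m, so the sample mean is two standard deviations above the NH mean.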

☼ Now, consider the Z statistic.

The numerator equals the difference between the sample mean & the NH mean.

Note the italicized point. We compare our sample mean with the mean of our null hypothesis.

The difference represents how far off the sample mean is from what we hypothesize is true.

We divide that difference by the standard deviation of the sample mean.

The resulting statistic tells us how many standard deviations the sample mean is from the
NH mean.

Since we know, from our earlier discussion, that the Z statistic has a standard normal
distribution, we can use the standard normal tables to determine the likelihood of observing a
sample mean of that size or larger if the NH mean is true.

Alternatively, we can identify how unlikely it is that we would obtain our sample mean if the NH
is true.

Ex 1: suppose that we get a Zs = 2. Table G.1 (7th: 741) indicates that the likelihood of observing
a sample mean of that size or larger is (1 -0.9772) = 0.0228.

Thus, a 2.28% chance of observing a sample mean of that size or >.


Ex 2: Suppose that we get Zs = 1.35. This implies a likelihood of (1 - 0.9115) = 0.0885.

Note that the foregoing probabilities are called p-values. P-values answer the following
question:

“what is the largest significance level at which we could carry out the test and still fail to
reject the null hypothesis?”

[ Guj: defined as [4th: 506] "the lowest significance level at which a NH can be rejected."]

What is going on graphically? Have a standard normal distribution. The p-value gives us the
probability left in the tail of the standard normal distribution. Show graphically for the above
two examples.

╶╶╶╶╶╶╶╶╶╶╶╶╶╶╶╶┴╶╶╶╶╶╶╶╶╶╶╶╶╶╶╶
1.35 2.0

The two methods of significance testing differ in that one uses the prob value and the other uses
the sample Z statistic. Let’s consider each method.

Prob Value Approach:

This method involves comparing the prob value with the level of significance.

If the prob value is less than the level of significance we reject the NH because the probability of
observing the sample mean, assuming that the NH is true, is less than the level of significance.

If the p value is greater than the level of significance we do not reject the NH.
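
The prob value approach can be sketched in a few lines for the two examples above (Zs = 2 and Zs = 1.35). Instead of reading the standard normal table, the sketch uses the standard normal CDF from Python's standard library; the function name and the constant ALPHA are illustrative choices.

```python
from statistics import NormalDist

ALPHA = 0.05  # level of significance


def upper_tail_p_value(z_sample):
    """Probability of a Z at least as large as z_sample under the standard normal."""
    return 1.0 - NormalDist().cdf(z_sample)


for z_sample in (2.0, 1.35):
    p_value = upper_tail_p_value(z_sample)
    decision = "reject the NH" if p_value < ALPHA else "do not reject the NH"
    print(f"Zs = {z_sample}: p-value = {p_value:.4f} -> {decision}")
```

At the 5% level, the first example (p-value of about 0.0228) leads to rejection, while the second (about 0.0885) does not.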

Z statistic approach:

This approach, which is typically done, involves determining the Z associated with the level of
significance, call it the critical Z - Zc - and comparing it with the sample Z: ZS. Let's consider
this approach.


We first look for the value of Z whose prob value, for a one-sided test, equals the level of
significance. That is the critical Z.

So, for a 5% level of significance we look for the value of Z where there’s a 5% probability of
observing that value of the statistic or greater.

Graphically,



╶╶╶╶╶╶╶╶ └╶╶╶╶╶╶╶╶╶╶
1.645

Ex: a 5% level of significance implies that we want a Z value that leaves 5% of the probability
in the upper tail. If we look at the standard normal table we see that the Z associated with a 5%
tail probability is Zc = 1.645.

With the critical value of Z at hand we calculate our sample Z – Zs – and compare the two Z
values.

If Zs > Zc [=1.645] then we reject the NH because the probability of observing such a sample
mean, if the NH is true, is less than 5%.

If Zs ≤ Zc then we do not reject the NH.
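
The Z statistic approach for a one-sided (upper-tail) test can be sketched as follows. The critical value comes from the inverse standard normal CDF rather than from a printed table; the function name is an illustrative assumption.

```python
from statistics import NormalDist


def one_sided_z_test(z_sample, alpha=0.05):
    """Return (critical Z, reject?) for HA: mean > NH mean."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # leaves alpha in the upper tail
    return z_crit, z_sample > z_crit


for z_sample in (2.0, 1.35, -2.0):
    z_crit, reject = one_sided_z_test(z_sample)
    verdict = "reject the NH" if reject else "do not reject the NH"
    print(f"Zs = {z_sample:5.2f}, Zc = {z_crit:.3f} -> {verdict}")
```

Note that a large negative Zs never leads to rejection here, because only sample means above the NH mean count as evidence for this alternative hypothesis.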

Do it for Ex 1. We have seen that the critical Z for a 5% one-sided test is 1.645. ZS = 2.

Graphically, .... So, we should reject the NH.

Question: what if we get ZS = -2? We do not reject the NH. The alternative hypothesis is that
income in Hanoi is greater than the national average. Our results indicate that we
have a sample mean below the national average: that's how we get a negative Z. This is not
evidence which favors the alternative hypothesis.

Two-sided vs one-sided Tests (7th: 735)

Our analysis above focused on a one-sided test.

Let’s consider the analysis for a two-sided Alternative Hypothesis (AH).


With a one-sided test, we reject the NH & accept the alternative hypothesis if we get a sample
mean that's only on one side of the NH mean.

Ex: the Hanoi income example. AH was that monthly income in Hanoi is greater than the
national average of 7.5m dong.

We will reject the NH that there's no difference only if we get a sample mean above the NH mean.
If we get a sample mean less than the NH mean then we have no evidence that income is greater
in Hanoi.

In such a case, a 5% level of significance implies 5% left in the upper tail.

☼ Now consider the two-sided AH that income is different in Hanoi. We will reject the NH in
this case if we observe a sample mean that is far above or far below the NH mean.

If we establish a 5% level of significance we must distribute that 5% to means that are above &
are below the NH mean. Thus, we allocate 2.5% above & 2.5% below.

Graphically ….

Consider the implications of this for significance testing. Graphically, want probabilities in the 2
tails: 2.5% in each tail.

What does this imply with respect to our HT?

Prob Value method: we reject the NH if the probability of observing the sample statistic is less
than 2.5% (if it is either above or below the NH mean).

Z statistic method: we now need two critical Z values: one for a sample statistic that is above the
NH mean & one for a sample statistic which is below the NH mean.

╶╶╶╶╶╶╶╶╶╶╶╶╶╶╶╶┴╶╶╶╶╶╶╶╶╶╶╶╶╶╶╶
-Zc Zc


Ex: 2-sided, 5% level of significance. Things are different because we want 2.5% in each tail.
We get Zc = 1.96, so the two critical values are ±1.96.

If |Zs| > Zc (that is, Zs > 1.96 or Zs < -1.96) we reject the NH, and if |Zs| ≤ Zc we do not
reject the NH.
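
A two-sided version of the test can be sketched in the same way. The critical value leaves α/2 in each tail, and the decision compares the absolute value of the sample Z with it. The sketch also reports a two-sided p-value (twice the one-tail probability), which is compared with α itself; this is equivalent to comparing the one-tail probability with 2.5%. The function name is illustrative.

```python
from statistics import NormalDist


def two_sided_z_test(z_sample, alpha=0.05):
    """Return (critical Z, two-sided p-value, reject?) for HA: mean != NH mean."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
    p_value = 2 * (1 - std_normal.cdf(abs(z_sample)))
    return z_crit, p_value, abs(z_sample) > z_crit


for z_sample in (2.0, -2.0, 1.35):
    z_crit, p_value, reject = two_sided_z_test(z_sample)
    verdict = "reject the NH" if reject else "do not reject the NH"
    print(f"Zs = {z_sample:5.2f}, Zc = ±{z_crit:.2f}, p = {p_value:.4f} -> {verdict}")
```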

We will see a lot more of this when we get to regression analysis. For now, we will summarize
our results.

Summary

Okay, that's our review of basic statistical ideas. It would be useful to summarize what we have
done because our analysis in econometrics will often be similar. If you are not sure what we are
doing in econometrics you might return to the steps we followed in statistics to gain insights into
where you are.

(1) Identified an issue in which we were interested - household consumption in Vietnam [the
U.S.]. We immediately noted that consumption varied from household to household. Thus, we
were dealing with a random variable.

Over the course of the analysis, we narrowed our focus to the mean consumption of households
in Vietnam [the U.S.].

Because we realized that we could not obtain information on all households in Vietnam [the
U.S.] - a group which we defined as the "population" in which we were interested - we discussed
obtaining info on a subset of that population. We called the subset a "sample." Indeed, we
considered a specific type of sample - Simple Random Sample.

The question that arose then was "can we be sure that information obtained from a subset of the
population will provide insights into mean household consumption in Vietnam [the U.S.]?"

(2) In order to answer that question, we turned to probability theory. In that context, we
discussed probability distributions for two types of variables: (1) continuous & (2) discrete.

We used probability distributions to associate probabilities with the various values a random
variable could take. We could also derive from the distributions various characteristics of the
random variable, such as the expected value (or mean) of the distribution & its variance.


(3) Identification of probability distributions then allowed us to consider how the mean of a
sample of data might behave for a distribution which was known. We derived two useful results
with respect to the sample mean

(A) LLN which said .... & (B) the CLT, which said ....

(4) Having identified how a sample behaved we turned to Statistical Inference: in other words,
we considered whether we could use a sample of data to make inferences about the mean of an
unknown population.

We discussed using estimators to make "guesses" about the unknown population characteristic.

We noted that we analyze estimators in terms of their (1) bias (consistency) and (2) efficiency.

In light of the foregoing two standards, we saw that the sample mean is BLUE (the best linear
unbiased estimator). So, we proposed using it to make guesses about the unknown population mean.

(5) Finally, we discussed Hypothesis Testing.

When we discuss econometrics keep in mind this general approach because the issues are the
same.

INTRODUCTION TO REGRESSION ANALYSIS: The Population Regression

So, that's our review of statistical analysis involving one random variable (rv). I will now turn
to an analysis in which we are interested in the relationship between two rv's.

The analysis will set up our introduction to linear regression analysis.

Let's recall our discussion of joint distributions. We are looking at the distribution of two rv's.

If X and Y are our 2 rv’s we may represent their joint distribution as f(X,Y).

☺ Now, suppose that we believe that the number of printers sold depends on whether people
purchase a PC. In other words, printer demand derives from PC demand.

Let X = # of PCs sold & Y = # of Printers sold



We have already discussed how to interpret the joint probabilities in the pmf in Table A-3 (4th:
424) and we have discussed how to derive marginal probability distributions from the table.

Recall our notation: the joint probability distribution is f(x,y), &

the marginal probability distribution of x is fx(x).

B-2b: Conditional Distributions (7th: 690)

We will now turn to conditional probability distributions.

With the marginal distribution of X at hand, we can turn to conditional probabilities.

I will first define conditional probabilities and then interpret them. If we have two rv’s - X &
Y - the conditional probability distribution of Y given X is defined as

f(y|x) = f(x,y)/fx(x).

It is the joint distribution divided by the marginal distribution of X.

We may interpret the conditional probability distribution as giving us the probability
distribution of Y for a given X.

In other words, we fix X at some value & ask "what is the distribution of Y for that value of X?"

Ex: If we fix the number of PCs sold at 4, the conditional probability distribution gives us the
distribution of Y for that number of PCs sold.

We may contrast this with the marginal distribution. The marginal distribution of Printers sold
gives us the probability distribution of Printers sold regardless of the # of PCs sold.

Why would we be interested in the conditional probability distribution?

We might believe that the distribution of Printers sold differs across the # of PCs sold.

We may calculate conditional probability distributions from the joint distribution.


How do we do it? Take the row entitled fx(x) and divide each row in the table by that row,
element by element.

Ex. The following table identifies the conditional probability distributions:


                          x
            0        1        2        3        4
y = 0     0.3750   0.2500   0.0833   0.0833   0.0313
y = 1     0.2500   0.4167   0.2500   0.0833   0.0313
y = 2     0.1250   0.1667   0.4167   0.2083   0.1563
y = 3     0.1250   0.0833   0.2083   0.4167   0.3125
y = 4     0.1250   0.0833   0.0417   0.2083   0.4688

E(y|x)    1.3750   1.3333   1.8750   2.5833   3.1563

Show how obtained one row of the conditional probabilities [put up on the board]

We may note that each column is a pmf in & of itself; each col describes the distribution of Y for
a given value of X.

To see what I mean, consider the col of probabilities for x = 2.

What can we say about it? The 1st thing to note is that the probabilities down the column sum to
one.

In light of all probabilities being non-negative, we see that the probabilities in that column form
a pmf.

The column, thus, represents the distribution of # of Printers sold when 2 PCs are sold. Indeed,
we could graph that distribution itself: [do it]

[Graph: f(y | x = 2) on the vertical axis against y = 0, 1, 2, 3, 4 on the horizontal axis]


Since we have performed the same calculation for each column, we see that each column in the
Table is a pmf in & of itself. Thus, we could graph a distribution for each column.

✌ The foregoing is an important point to emphasize; a conditional distribution function is a
probability distribution in & of itself. It is like a distribution function for a single variable;
i.e., like a marginal distribution. The only difference is that it assumes that another variable
takes on a specific value.

Consider the information contained in conditional probabilities:

Analyze the above graph: we see that, when 2 PCs are sold, 2 printers sold is the most likely
outcome.

We might compare it with the distribution at x = 0 ….

The comparison suggests that taking into account the # of PCs sold provides insights into the #
of printers sold. Stated in another way, the # of PCs sold seems to be an important factor in
describing printer sales.

If we had focused simply on joint probabilities, we would have a hard time discerning this
difference.

If we had focused solely on marginal distributions we would have lost the different information
totally. Indeed, we can compare the conditional distributions with the marginal distribution.
[contrast the marginal with each distribution].

Continuous Variables: I will note that the foregoing analysis concerned a conditional probability
distribution for a discrete variable. We can calculate the same probability distributions for
continuous distributions.

Independence & Conditional Probabilities (7th: 690)

As a final matter, we will consider conditional probabilities when two rv's are statistically
independent.

Recall our formula for a conditional probability: f(y|x) = f(x,y)/fx(x)

When the variables are statistically independent, it turns out that f(y|x) = fy(y).


Thus, the conditional distribution of Y given X equals the marginal distribution of Y for all
possible values of X.

We may interpret this as saying that when rv's are statistically independent knowing the value of
the conditioning variable does not change the probability distribution of the rv we are
considering.

Alternatively, knowing the value of X has no impact on our assessment of the probability of
obtaining a specific value of Y.

Ex: knowing the # of PCs sold provides no insights into the # of printers sold.

Are # of PCs sold and # of printers sold independent in the table?

If they were, what would the table of conditional probabilities look like?

So, does (f(x,y) = fx(x)fy(y))?
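
One way to approach these questions is to note that, under independence, f(y|x) = fy(y) for every x, so every column of the conditional table would be identical. The sketch below applies that check to the conditional probabilities for PCs and printers tabulated earlier. The function name and the tolerance (which allows for the 4-decimal rounding in the table) are illustrative assumptions.

```python
# Conditional pmfs f(y|x) copied from the table of conditional probabilities:
# one list per x (# of PCs sold), entries ordered y = 0, 1, 2, 3, 4.
COND_PMF = {
    0: [0.3750, 0.2500, 0.1250, 0.1250, 0.1250],
    1: [0.2500, 0.4167, 0.1667, 0.0833, 0.0833],
    2: [0.0833, 0.2500, 0.4167, 0.2083, 0.0417],
    3: [0.0833, 0.0833, 0.2083, 0.4167, 0.2083],
    4: [0.0313, 0.0313, 0.1563, 0.3125, 0.4688],
}


def looks_independent(cond_pmf, tol=1e-3):
    """True if every conditional distribution matches the first one (within tol)."""
    columns = list(cond_pmf.values())
    reference = columns[0]
    return all(
        abs(p - q) <= tol
        for column in columns[1:]
        for p, q in zip(column, reference)
    )


print("PCs and printers independent?", looks_independent(COND_PMF))

# A made-up table in which Y does not depend on X, for contrast.
hypothetical = {0: [0.5, 0.5], 1: [0.5, 0.5]}
print("Hypothetical table independent?", looks_independent(hypothetical))
```

Because the columns differ markedly across x, the # of PCs sold and the # of printers sold are not independent in this table.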

B-4e: Conditional Expectation (7th: 700)

As I noted when we discussed probability distributions of one variable; we often are interested in
reporting what’s true of the “typical” member of the population.

The most popular measure of the “typical member” is the expected value of a variable.

As we did in the case of a single variable, we can report expectations for our conditional
distributions.

We call them conditional expectations: they report the expected value of a variable (# of
printers sold) conditional on the value of another variable (# of PCs sold).

We calculate them as we would any expected value. For discrete variables, we calculate

\[ E(Y \mid X = x) = \sum_{i=1}^{n} y_i \, f(y_i \mid x) \]

In the foregoing example, with x = 2, it would be

\[ E(Y \mid X = 2) = \sum_{i=1}^{n} y_i \, f(y_i \mid x = 2) \]


Indeed, we can calculate the expectation for x = 2. We get

(0)(0.0833) + (1)(0.2500) + (2)(0.4167) + (3)(0.2083) + (4)(0.0417) = 1.8751 ≈ 1.875.

The conditional expectation function is

X E(Y|X)
0 1.3750
1 1.3333
2 1.8750
3 2.5833
4 3.1563
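
The conditional expectation function can be reproduced directly from the table of conditional probabilities. The sketch below is a minimal illustration; the dictionary layout and function name are assumptions, and small differences in the fourth decimal place come from the rounding of the tabulated probabilities.

```python
# Conditional pmfs f(y|x) from the table of conditional probabilities:
# one list per x (# of PCs sold), entries ordered y = 0, 1, 2, 3, 4.
COND_PMF = {
    0: [0.3750, 0.2500, 0.1250, 0.1250, 0.1250],
    1: [0.2500, 0.4167, 0.1667, 0.0833, 0.0833],
    2: [0.0833, 0.2500, 0.4167, 0.2083, 0.0417],
    3: [0.0833, 0.0833, 0.2083, 0.4167, 0.2083],
    4: [0.0313, 0.0313, 0.1563, 0.3125, 0.4688],
}


def conditional_mean(pmf):
    """E(Y | X = x): sum of y * f(y|x) over y = 0, 1, 2, ..."""
    return sum(y * prob for y, prob in enumerate(pmf))


cef = {x: conditional_mean(pmf) for x, pmf in COND_PMF.items()}
for x, value in cef.items():
    print(f"E(Y | X = {x}) = {value:.4f}")
```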

The E(Y|X) column indicates that the mean # of printers sold varies with the # of PCs sold.

We will consider the conditional expectation function in greater depth in our discussion of
Chapters 2 and 3. Let’s turn to them now.

Note that E(Y|X) is called the Population Regression Function.

??? B-4f: Properties of Conditional Expectation (7th: 702)


REDACTIONS

Ex Math SAT scores and Annual Family Income (4th: 23)

Gujarati has an example in which we are interested in whether a high school student’s Math SAT
score is related to annual family income. We will consider ideas in terms of this example.

Annual Family Income is the independent variable (X) which affects a student’s Math SAT, the
dependent variable (Y).

Table 2-1 identifies the hypothetical population of families: there are 100 families whose income
falls into one of ten levels which range from $5,000 to $150,000. So, X is discrete with 10
possible values.

The top row of the table identifies those income levels.

For each income level, there are 10 families. The Math SAT score for the relevant student in
each family is identified in the body of the table.

You should note that the Table does not identify a joint probability mass function (“jpmf”).

A jpmf would identify the percentage of the population which has a given (income, Math SAT
score) combination, for all possible combinations of the two variables.

To construct the jpmf we need to identify the possible values of the two discrete variables in the
population. We have discussed the possible income values.

SAT scores: a review of the table indicates that the Math scores lie between 410 and 600.

So, the jpmf will identify the probability of observing a given combination of Math score which
lies in that range of values and a given income level.

The jpmf is as follows:


$5k $15k $25k $35k $45k $55k $65k $75k $90k $150k fy(y)

410 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01
420 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03
430 0.00 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02
440 0.01 0.00 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.04
450 0.01 0.02 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.07
460 0.02 0.00 0.01 0.02 0.00 0.01 0.00 0.00 0.00 0.00 0.06
470 0.01 0.00 0.01 0.00 0.02 0.00 0.01 0.00 0.01 0.00 0.06
480 0.00 0.02 0.01 0.00 0.00 0.02 0.02 0.01 0.01 0.00 0.09
490 0.01 0.00 0.01 0.02 0.00 0.00 0.00 0.01 0.00 0.00 0.05
500 0.01 0.01 0.00 0.00 0.01 0.00 0.01 0.02 0.01 0.00 0.07
510 0.00 0.02 0.01 0.01 0.02 0.02 0.01 0.00 0.00 0.01 0.10
520 0.00 0.01 0.01 0.02 0.00 0.00 0.02 0.00 0.01 0.02 0.09
530 0.00 0.00 0.01 0.00 0.02 0.01 0.01 0.01 0.01 0.00 0.07
540 0.00 0.00 0.00 0.01 0.00 0.02 0.00 0.02 0.01 0.01 0.07
550 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.00 0.01 0.01 0.05
560 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.02 0.01 0.02 0.06
570 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02
580 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.02
590 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01
600 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01

fx(x) 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10

Relate the probabilities to Table 2-1. E.g., how many families have a (410,$5k) combination?
How many have a (410, $15k) combination, etc.

Marginal Probabilities

We have already discussed marginal probabilities. I noted that we obtain them, in the discrete
case, by summing down columns or across rows.

You will remember that, we can think of marginal probabilities as identifying the probability
distribution of one of the rv's in the joint distribution regardless of the value of the other rv.

We can derive them from the joint pmf.

In the table above, the row entitled "fx(x)" was obtained by summing down each column.

It is the marginal distribution of income. You will note that it tells us the probability a household
had a certain level of income, regardless of the student's Math SAT score.
Ex. 0.10 implies a 10% probability a household has an annual income = $5,000.

Recall that for the joint distribution f(X,Y) we represent the marginal distribution of X as fx(X).

You may note that we could do the same for Math SAT scores: for a given SAT score we sum
across all possible incomes to get the probability of observing a given score.
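
The summing can be sketched in code using the jpmf above. To keep the listing short and the arithmetic exact, each row of the jpmf is stored as a string of ten digits, where each digit is the number of families (out of 100) with that (score, income) combination, i.e. the joint probability times 100. This encoding and the variable names are illustrative choices, not part of the original table.

```python
from fractions import Fraction

INCOMES = [5, 15, 25, 35, 45, 55, 65, 75, 90, 150]  # thousands of dollars

# One digit per income level: families (out of 100) with that (score, income) pair.
ROWS = {
    410: "1000000000", 420: "2100000000", 430: "0110000000", 440: "1011100000",
    450: "1211110000", 460: "2012010000", 470: "1010201010", 480: "0210022110",
    490: "1012000100", 500: "1100101210", 510: "0211221001", 520: "0112002012",
    530: "0010211110", 540: "0001020211", 550: "0000111011", 560: "0000001212",
    570: "0000000011", 580: "0000000110", 590: "0000000001", 600: "0000000001",
}

# Joint pmf f(x, y), keyed by (score, income).
joint = {
    (score, income): Fraction(int(digit), 100)
    for score, digits in ROWS.items()
    for income, digit in zip(INCOMES, digits)
}

# Marginal of income: sum down each column. Marginal of score: sum across each row.
fx = {income: sum(joint[(score, income)] for score in ROWS) for income in INCOMES}
fy = {score: sum(joint[(score, income)] for income in INCOMES) for score in ROWS}

print("fx:", {income: float(p) for income, p in fx.items()})
print("fy(410) =", float(fy[410]), " fy(510) =", float(fy[510]))
```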


Conditional Probabilities

With the marginal distribution of X at hand, we can turn to conditional probabilities.

I will first define conditional probabilities and then interpret them. If we have two random
variables - X & Y - the conditional probability of Y given X is (4th: 426)

f(Y|X) = f(X,Y)/fx(X).

It is the joint distribution divided by a marginal distribution. It will identify a probability
distribution for a given level of X.

Ex. If we fix X at $5,000, it will identify the probability of observing the different Math SAT
scores.

How do we calculate it? We take fx(5,000) = .10 and then divide each element in the X = $5,000
column by the 0.10. Let’s do it for some of the Math scores.

If you do the same for other income levels, you will note that we are effectively taking the fx(X)
row and dividing each row in the jpmf by fx(X). The result we get is

$5k $15k $25k $35k $45k $55k $65k $75k $90k $150k

410 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
420 0.20 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
430 0.00 0.10 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
440 0.10 0.00 0.10 0.10 0.10 0.00 0.00 0.00 0.00 0.00
450 0.10 0.20 0.10 0.10 0.10 0.10 0.00 0.00 0.00 0.00
460 0.20 0.00 0.10 0.20 0.00 0.10 0.00 0.00 0.00 0.00
470 0.10 0.00 0.10 0.00 0.20 0.00 0.10 0.00 0.10 0.00
480 0.00 0.20 0.10 0.00 0.00 0.20 0.20 0.10 0.10 0.00
490 0.10 0.00 0.10 0.20 0.00 0.00 0.00 0.10 0.00 0.00
500 0.10 0.10 0.00 0.00 0.10 0.00 0.10 0.20 0.10 0.00
510 0.00 0.20 0.10 0.10 0.20 0.20 0.10 0.00 0.00 0.10
520 0.00 0.10 0.10 0.20 0.00 0.00 0.20 0.00 0.10 0.20
530 0.00 0.00 0.10 0.00 0.20 0.10 0.10 0.10 0.10 0.00
540 0.00 0.00 0.00 0.10 0.00 0.20 0.00 0.20 0.10 0.10
550 0.00 0.00 0.00 0.00 0.10 0.10 0.10 0.00 0.10 0.10
560 0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.20 0.10 0.20
570 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.10
580 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.10 0.00
590 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.10
600 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.10

Each column in the table is a pmf in itself, except it’s for a given level of income. You may see
this by summing the probabilities down each column. You will see that they sum to one.

To see what I mean, consider the col of probabilities under $5,000 in income.

What can we say about it? The 1st thing to note is that the probabilities down the column sum to
one.

In light of all probabilities being non-negative, we see that the probabilities in that column form
a pmf.

Since we have performed the same calculation for each column, we see that each column in the
Table is a pmf in & of itself. Thus, we could graph a distribution for each column.

The foregoing is an important point to emphasize; a conditional distribution function is a
probability distribution in & of itself. It is like a distribution function for a single variable;
i.e., like a marginal distribution. The only difference is that it assumes that another variable
takes on a specific value.

We may contrast this with the marginal distributions. The marginal distribution of Math SAT
scores gives us the probability distribution of Math SAT scores regardless of family income.

Consider the information contained in conditional probabilities:

Why would we be interested in the conditional probability?

We might think that the distribution of Math SAT scores differs across family income levels; i.e.,
the income affects Math scores

Compare the $5,000 income and $150,000 income level columns and the scores at which the
probability mass is concentrated.

We see that, as we would expect, students from higher income families tend to have higher Math
SAT scores.

So, as was suggested above, taking into account the level of family income provides insights into
Math SAT scores.

If we had focused simply on joint probabilities, we would have a hard time discerning this
difference.

If we had focused solely on marginal distributions we would have lost the different information
totally. Indeed, we can compare the conditional distributions with the marginal distribution.

Conditional Expectations

As I noted when we discussed probability distributions of one variable; we often are interested in
reporting summary #'s which characterize a probability distribution. The most popular summary
# is the expected value of a variable.

As we did in the case of a single variable, we can report expectations for our conditional
distributions.

We call them conditional expectations: they report the expected value of a variable
(Math SAT score) conditional on the value of another variable (family income) [Gujarati 4th: 23].

We label them E(Y|X) [Gujarati 4th: 448]

We calculate them as we would any expected value. For discrete variables, we calculate

\[ E(Y \mid X) = \sum_{i=1}^{n} y_i \, f(y_i \mid X) \]

In the foregoing ex, with X=5,000, it would be

\[ E(Y \mid X = 5{,}000) = \sum_{i=1}^{n} y_i \, f(y_i \mid X = 5{,}000) \]

Indeed, we can calculate the expectation for X=5,000. We get

(0.1)(410) + (0.2)(420) + … + (0)(600) = 452.

The conditional expectations for each column are contained in the row at the bottom of Table 2-1.
They are

$5k   $15k  $25k  $35k  $45k  $55k  $65k  $75k  $90k  $150k
452   475   478   488   496   505   512   528   530   552
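
These conditional expectations can be recomputed from the jpmf given earlier. The sketch below first forms f(Y|X) = f(X,Y)/fx(X) for a chosen income level and then takes the probability-weighted sum of the scores. As a space-saving assumption, each jpmf row is encoded as ten digits, one per income level, giving the number of families out of 100 (the joint probability times 100); the function and variable names are illustrative.

```python
from fractions import Fraction

INCOMES = [5, 15, 25, 35, 45, 55, 65, 75, 90, 150]  # thousands of dollars

# One digit per income level: families (out of 100) with that (score, income) pair.
ROWS = {
    410: "1000000000", 420: "2100000000", 430: "0110000000", 440: "1011100000",
    450: "1211110000", 460: "2012010000", 470: "1010201010", 480: "0210022110",
    490: "1012000100", 500: "1100101210", 510: "0211221001", 520: "0112002012",
    530: "0010211110", 540: "0001020211", 550: "0000111011", 560: "0000001212",
    570: "0000000011", 580: "0000000110", 590: "0000000001", 600: "0000000001",
}

joint = {
    (score, income): Fraction(int(digit), 100)
    for score, digits in ROWS.items()
    for income, digit in zip(INCOMES, digits)
}


def conditional_pmf(income):
    """f(score | income) = f(income, score) / fx(income)."""
    fx_income = sum(joint[(score, income)] for score in ROWS)
    return {score: joint[(score, income)] / fx_income for score in ROWS}


def conditional_mean(income):
    """E(score | income): probability-weighted sum of the scores."""
    return sum(score * prob for score, prob in conditional_pmf(income).items())


cef = {income: conditional_mean(income) for income in INCOMES}
for income, mean_score in cef.items():
    print(f"E(Y | X = ${income}k) = {float(mean_score):.0f}")
```

Using exact fractions avoids rounding noise, so the results can be compared directly with the row above.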


[interpret CEFs] The row reveals that mean Math SAT scores vary substantially across levels of
income: they vary by 100 points between the lowest and highest family income levels.

So, knowing a family’s income level appears to be important in inferring the Math SAT score of
a high school student in that family.

Figure 2-1 (4th: 24) plots the CE’s on a graph with a line joining them. We see that they
generally slope upward.

This discussion about CEs ties nicely into our discussion of regression analysis, for, as we will
see, regression analysis is about a special type of conditional expectation.
