Notes
This course is intended to give you a working knowledge of some basic econometric tools.
To give you an idea of why we need and use econometrics let's consider some examples from
Principles of Microeconomics.
Don't worry if you don't remember or have not heard about the economic concepts I'm
discussing. These are simply examples.
It's a demand curve ("D"). We derive demand functions in microeconomic theory courses.
The shape of the curve conveys the idea that as the price of rice drops more people will demand
rice.
Does economic theory say that D curves will always be downward sloping? As it turns out, the
answer is “No.”
So, our theory says that D functions might be downward or upward sloping.
How can we determine what is true of the rice market we are studying? Look to the real world.
ECON335 – Statistics 10-17-21
The analysis does not end there. Our discussion of economic theory also indicated that the steepness
(or elasticity) of the demand curve is important. The steeper the curve the smaller the change in
the equilibrium quantity in the market when the S curve shifts or a tax is imposed on the market;
i.e., the more inelastic the Demand function the smaller the change in equilibrium quantity.
Policymakers often are interested in the elasticity of the D function. Our econometric analysis
also provides some insight into the elasticity of the D function.
Define Econometrics
“Econometrics is based upon the development of statistical methods for estimating economic
relationships, testing economic theories, and evaluating and implementing government and
business policy.”
------------------ skip-----------------------------------------------
“Econometrics uses economic theory, as embodied in an econometric model; facts, as
summarized by relevant data; and statistical theory, as refined in econometric techniques, to
measure and to test empirically certain relationships among economic variables, thereby
giving empirical content to economic reasoning.” Intriligator et al., p.1.
----------------------end skip---------------------------------------
(A) It is the theoretical economics which drives use of the econometrics; we have a theory and
wish to test it.
(C) We then estimate the model using data and test the results.
☺ Statistics: both definitions refer to statistical theory in defining econometrics. Statistical ideas
underlie econometric theory. Because of the importance of statistics I will review basic statistical
concepts in the 1st several classes in this course. The review will also serve as a nice review of
the statistics course you have taken. The statistics review will reveal the general approach used in
econometrics but in a simpler context. We will then turn to econometrics.
In order to get into statistics, we must remind ourselves of the summation operator and its
properties. Let’s do that now.
You will recall that the summation operator is designed to represent the summation of several
variables. Thus, if we wished to represent the idea that we are summing the numbers 1 to 100,
we could write out the whole summation, or we could write
$\sum_{i=1}^{100} i$
How do we interpret it? Go from i=1 to i=100 and sum the variable after the operator.
Ex Now, suppose that we obtain data on how much money people spend in a year on food and
that we represent the consumption of food with the variable yi.
If we wish to represent summing the random variable for all 100 households, we could write
$\sum_{i=1}^{100} y_i$
We get $\sum_{i=1}^{n} y_i$
Properties
$\sum_{i=1}^{n} c = c + c + \dots + c = n \cdot c$.
In other words, we sum n vars, each of which has the same value: c. The sum must = n∙c.
Ex: Suppose that the “variable” is Age and that we have 25 people (n=25) all of whom are 19
years old. Here, we are summing 25 variables each of which has value 19;
$\sum_{i=1}^{n} c \cdot y_i = c \cdot \sum_{i=1}^{n} y_i$
We have y1, y2, & y3. ⇒ 100∙y1 + 100∙y2 + 100∙y3 = 100(y1 + y2 + y3)
$\sum_{i=1}^{n} (y_i - a) = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a = \sum_{i=1}^{n} y_i - n \cdot a$.
This rule reflects a basic fact about the summation operator. When we have linear functions like
this one, we can represent the summation of the function as the summation of each element in the
function. Of course, the n∙a reflects property (1).
Ex:
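$\sum_{i=1}^{n} (x_i \cdot y_i)$
What it is: $x_1 y_1 + x_2 y_2 + \dots + x_n y_n$
What it is not: $\left(\sum_{i=1}^{n} x_i\right) \cdot \left(\sum_{i=1}^{n} y_i\right)$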
Note that the summation operator does not move through non-linear relations.
Another example
$\sum_{i=1}^{n} (Y_i - a)^2$
What it is: $(Y_1 - a)^2 + (Y_2 - a)^2 + \dots + (Y_n - a)^2$
What it is not: $\left[\sum_{i=1}^{n} (Y_i - a)\right]^2$
In this formula we do the summation and then square it, while in the formula we’re considering
we square and then sum the squares.
$\sum_{i=1}^{n} (x_i / y_i) = \sum_{i=1}^{n} x_i \cdot (1/y_i) \neq \sum_{i=1}^{n} x_i \cdot \sum_{i=1}^{n} (1/y_i)$
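The summation rules above can be checked numerically. The following is a minimal Python sketch; the data values and the constants c and a are made up purely for illustration and are not taken from the course.

```python
# Hypothetical data, chosen only for illustration.
x = [2.0, 4.0, 6.0]
y = [1.0, 2.0, 4.0]
n = len(y)
c, a = 100.0, 3.0

# Sum of the integers 1 to 100, written out as a summation.
sum_1_to_100 = sum(i for i in range(1, 101))

# Summing a constant n times gives n*c.
sum_const = sum(c for _ in range(n))

# A constant factor can be pulled out of the sum.
lhs_scale = sum(c * yi for yi in y)
rhs_scale = c * sum(y)

# The sum passes through a linear function: sum(y_i - a) = sum(y_i) - n*a.
lhs_linear = sum(yi - a for yi in y)
rhs_linear = sum(y) - n * a

# Non-linear cases: the operator does NOT pass through products or squares.
sum_of_products = sum(xi * yi for xi, yi in zip(x, y))
product_of_sums = sum(x) * sum(y)
sum_of_squares = sum((yi - a) ** 2 for yi in y)
square_of_sum = sum(yi - a for yi in y) ** 2

print(sum_1_to_100, sum_const, lhs_scale, rhs_scale)
print(lhs_linear, rhs_linear)
print(sum_of_products, product_of_sums, sum_of_squares, square_of_sum)
```

With these numbers the linear identities hold exactly, while the sum of products and the product of sums differ, as do the sum of squares and the square of the sum.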
STATISTICS REVIEW
I will start the course by first reviewing some basic concepts learned in statistics courses. The
review will set the stage for our econometric analysis & it will describe ideas which we will use
in our econometric analysis.
I will also approach the statistics review by referring often to one example. The review will
allow me to identify basic principles that underlie our statistical and econometric analysis.
Example: Suppose that we're interested in the consumption of households in Vietnam [the U.S.].
We will consider how we might approach analysis of that consumption.
If we consider household consumption, the first thing we will note is that it varies across
households. Some households spend a lot - e.g., Jeff Bezos's household, while other households
don't spend much - e.g., households in rural areas.
A variable represents "a quantity that takes on different values for different persons or things."
- On the other hand, the speed of light emitted from flashlights does not vary from
flashlight to flashlight. It is a constant: about 186,000 miles per second.
A random variable is a variable whose value cannot be predicted with certainty. It can take on
many values.
Ex. We cannot, however, predict with certainty a household’s (hh's) consumption. My income
differs from your income and each of your incomes likely differs from the others.
☞ There are several sources of randomness. The primary reason cited by economists for
randomness in data: we do not have complete information about the factors which affect a
variable.
(1) Even though it's possible to get all relevant data, we do not have all such data.
e.g., the household consumption of food example. It will depend on many factors: education, #
people in the hh, family background, who your family knows, your preferences, etc.
If we do not have all of the information affecting a variable then the variable will appear random.
e.g., 2 households with 4 people and one wage earner. One household has 30 million dong per
month in consumption of food and the other 5 million dong per month in food consumption.
If we add the fact that the first wage earner is a doctor who has been in practice for 20 years & the
second wage earner has an elementary school education and works as a construction laborer
during the day & as a janitor at night, then we might explain the difference in income levels.
(2) Not sure of all of the variables affecting what you're interested in.
A "population" is the whole group one is considering (7th: 714. 6th: 674). (Any well-defined
group of subjects.)
In the example we are considering (consumption of households (hh’s) in the country), it is all of
the hh's in the country in a given year.
Ex. the 2009 Vietnamese Census samples 15% of the country’s households
[ http://www.gso.gov.vn/default_en.aspx?tabid=515&idmid=5&ItemID=9813 ].
Aside: Each data point (or element) in the sample is called an "observation." An observation is
simply one of the households in the sample. Thus, if we have 15% of 22 million hh's, we have
3.3 million hh's (or observations) in the sample.
So, we're interested in some random variable: in our example, household consumption in
Vietnam [the U.S.] in some year (e.g., 2017).
Suppose that we would like to gain some insight into that variable.
We must now ask "what characteristic of household consumption are we interested in?"
Or, we might be interested in the range of consumption across hh's; say, the distribution of that
consumption. In statistics, we saw that the variance measures how spread out a random variable
is.
Aside: we call characteristics like the mean and variance population parameters.
☺ Here’s an important point: we, as a rule, do not know the values of these parameters.
✌ An important question that arises is "how can we gain insights into those characteristics?"
A problem with this approach, however, is that it would be costly to obtain information on all
households in a population.
Note that a key implication of using a sample to make inferences about a population is that the
population and its relevant characteristics remain unknown to the economist (researcher).
Ex. We do not know the mean consumption of hh's in Vietnam (the US) in a certain year.
Thus, we seek to use the sample to make statements (or inferences) about an unknown
parameter (characteristic).
Ex. use the sample mean to make a statement about the population mean.
The type of sample you choose is important. Return to the consumption ex. Suppose that your
sample was obtained by determining consumption levels of households in more wealthy parts of
Hanoi (in the U.S.: Cherry Creek, Beverly Hills or Beacon Hill).
We would not believe that the information obtained from those samples was "representative" of
households (hh’s) in the whole country.
Because of such possible biases, we must be careful in sampling. We want a sample that will be
representative of the population.
The type of sample we focus on primarily in this course and in statistics is called a simple
random sample (SRS).
It is a subset of the population with each member of the population having the same probability
of being included in the sample.
Ex: returning to the consumption example. If we're interested in the population "all
households in Vietnam in a year" and if there are 22 million hh's in VN, a simple
random sample would place a 1/22m probability on choosing a given household in the country.
Transition
So, suppose that we have information on consumption levels of 1 million hh's in the country
obtained from a SRS. We may legitimately ask whether the info obtained about those hh's will
allow us to gain insight into characteristics of the full population; i.e., can we use information
obtained from 1 million households to make statements about the whole 22 million households?
"can we make statements about the whole population when all we have is information on a
subset ( a SRS) of the population?"
In order to answer this question we have to link the sample to the population & determine if
we can use the sample to gain insights into the population.
SAMPLE → POPULATION
We link samples to populations by building a theory of probability. We then use that theory to
show how the sample allows us to make inferences about the population.
The probability theory derived in statistics is intended to describe random variables. It starts
from fundamental ideas such as a sample space, an outcome and an event.
The basic point of the analysis is to describe all of the possible values a random variable might
take and associate probabilities with each possible value.
Ex: consumption may be $5k, $100k or $1m. There will be a specific probability a household has
$5k in consumption and another for $100k and ....
The analysis builds to a point where we describe random variables and the probabilities
associated w/ them w/ probability distributions. That is where I will start my analysis.
Probability Distributions
Probability distributions associate probabilities with the possible values of a random variable.
The exact way in which we define and calculate a probability depends on the type of random
variable we are considering.
First, we must make two definitions:
Let Y be the set of discrete possible outcomes that a random variable can take.
Three Types of Random Variables: not all variables are the same. We must distinguish
between 3 types of random variables:
(1) Discrete Random Variables (7th: 685. 6th: 646) can take on only a fixed set of values.
Ex. Years of education completed. Can take on values from zero to ...? H.S. = 12; college = 16;
f(y) is called a probability mass function ("pmf") (Wooldridge calls it a pdf: 7th: 686. 6th: 647)
Note that f(y) is the probability of observing a given value of the random variable Y.
Ex: f(12) = 0.25 implies that the probability of observing someone who obtained a high school
diploma is 25%,
f(16) = 0.1 implies that 10% of all individuals have 16 years of education … & so on.
A pmf must possess the following properties (7th: 685. 6th: 647)
(1) $f(y) \geq 0$ for all $y \in Y$
(2) $\sum_{y \in Y} f(y) = 1$
☺ We might ask what a graph of a pmf might look like. Let's do it for the education example.
The graph will have spikes at the values the variable (education) can take (the mass points of the
variable), with the height of the spike representing the probability of observing the value of that
variable.
Example of a discrete distribution which is used for a discrete (i.e., non-categorical) variable.
We have seen that continuous variables lie w/in a specified range; e.g., between [0,1] or (-∞,∞)
We use a probability density function (pdf) to identify the probabilities for continuous
variables.
Because it is assigning probabilities for a continuous variable, a pdf is continuous. So, it will
look like, e.g.,
When dealing with continuous variables we must identify probabilities in a different way.
(2) the whole area under the pdf must equal one (in mathematical terms - you don’t have to
worry about this - $\int_{-\infty}^{\infty} f(y)\,dy = 1$).
Requirement (2) suggests that we obtain probabilities for continuous variables by calculating an
area under a probability density function (pdf) between two values of the random variable.
As a result, we obtain probabilities for continuous variables only for ranges of values of the
random variable; not for specific values (4th: 419). I won’t go over how we calculate the
probabilities exactly – that involves mathematics not needed for this course.
☞ It is necessary to note, however, that the probability is not the value of f(y) at a specific
point or the difference between f(y) at two different points. You do not obtain probabilities by
sticking values of y into f(y).
Exs, whether one is male or female, whether one lives in an urban area and whether one has a
college degree.
A dummy variable identifies whether one is in one of two categories: it takes on a value of zero
or one.
You see that these variables are discrete in the sense that they take on a limited # of values.
On the other hand, for a discrete variable like those described above, #'s make a difference.
Three doctor’s visits is different than 1 or 0 visits.
So, we treat categorical variables as if they were discrete, with the variable values having no
quantitative meaning.
This is a distribution for a random variable which takes on only two values: 0 or 1.
Is it + or 0 for all points in Y? Here Y can take 2 values: 0 & 1. The pmf = p for y =1 and = (1-p)
for y = 0.
Ex., drawing one person from the pop "everyone in Vietnam [the U.S.] over 25" and determining
whether they have a college degree.
Let y = 1 if "they have a degree" and y = 0 if "they do not have a degree."
What would p be? The probability that the person has a college degree.
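As a quick check of the Bernoulli pmf just described, here is a minimal Python sketch. The value p = 0.3 is an assumed, purely illustrative share of people with a college degree, and the helper name bernoulli_pmf is hypothetical.

```python
def bernoulli_pmf(y, p):
    """Return f(y) for a Bernoulli random variable with P(Y = 1) = p."""
    if y == 1:
        return p
    if y == 0:
        return 1 - p
    return 0.0  # any other value is not a mass point


p = 0.3  # assumed probability of having a degree; illustrative only
probs = [bernoulli_pmf(y, p) for y in (0, 1)]

# The two pmf requirements: no negative probabilities, and they sum to one.
nonnegative = all(f >= 0 for f in probs)
total = sum(probs)

print(probs, nonnegative, total)
```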
(There are many other discrete distributions. You will only have to know about the Bernoulli
distribution.)
There are a lot of diff continuous distributions. In fact, we will see several of them in this course.
For now I will focus only on the two distributions: (1) the General Normal probability density
function, and (2) the Standard Normal probability density function.
We will not look at the specific formula for the pdf. We will simply look at graphical examples
of it.
We represent the idea that the variable has a general normal pdf as follows: $y \sim N(\mu, \sigma^2)$
(1) This is the familiar bell-shaped distribution that many people talk about.
(2) The distribution is symmetric. By symmetric, we mean that if we draw a line down the center
the right half will look like the left half.
(3) The shape of the distribution depends on mu & sigma (called parameters).
μ determines where the peak (or center) of the distribution will be, while
σ determines how spread out the distribution is.
We can see how the size of mu and sigma affect the distribution in the following OH.
(4) The distribution with the dashed line has a μ = 0 & σ = 1. It is called the standard normal
distribution. (7th: 705. 6th: 666)
You can see that it is centered over 0. We may see how mu & sigma affect the distribution by
contrasting other gen normal distributions with the standard normal.
The distribution with the dashed and dotted line has μ = 2 and σ = 1.
So, we've simply changed the value of mu. We see that changing it causes the distribution to
shift to the right. The spread of the distribution has not changed.
We see that it has the same center (peak) as the 1st distribution but it is more spread out.
(1) a positive μ implies that the distribution shifts right, as compared w/ the standard normal
distribution. A smaller μ implies a shift to the left.
(2) a smaller σ implies that the spread of the distribution is not as great.
☞ Transformations
It is important to note that we can transform any variable with a general normal distribution to
one with a standard normal distribution: if $Y \sim N(\mu, \sigma^2)$, then $Z = (Y - \mu)/\sigma \sim N(0, 1)$.
This result follows from the fact that a linear transformation of a normally distributed r.v. has a
normal distribution. W/ the foregoing transformation, we get a standard normal variable.
The standard normal distribution plays a major role in our econometric analysis.
Standard Normal Tables: because the actual calculation of the probabilities is difficult (it
involves integration) econometrics and statistics texts contain tables which identify the
probability of falling between two points in a standard normal distribution.
Ex: suppose that we want to determine the probability that the value of a standard normal
variable falls between 0 and 0.5.
How do we calculate the probability? Note that the table gives us the probability that the random
variable lies between 0 and some positive point. ...
☺ We might ask why textbooks only report probabilities for the standard normal distribution.
We would need an infinite # of tables and, anyway, we can calculate the probabilities for general
normal distributions using the standard normal distribution. We just undertake the transformation
described above.
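In place of a printed table, the same probabilities can be computed in software. The sketch below uses NormalDist from Python's standard statistics module (Python 3.8+). The general normal parameters (μ = 2, σ = 4) and the interval [2, 4] are assumptions chosen only for illustration; the standardization used is the usual one, z = (y - μ)/σ.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# P(0 <= Z <= 0.5): area under the standard normal pdf between 0 and 0.5.
p_std = std_normal.cdf(0.5) - std_normal.cdf(0.0)

# Hypothetical general normal variable Y ~ N(mu, sigma^2).
mu, sigma = 2.0, 4.0
lower, upper = 2.0, 4.0

# Standardize the endpoints: z = (y - mu) / sigma.
z_lower = (lower - mu) / sigma
z_upper = (upper - mu) / sigma
p_via_z = std_normal.cdf(z_upper) - std_normal.cdf(z_lower)

# The same probability computed directly from the general normal distribution.
general = NormalDist(mu=mu, sigma=sigma)
p_direct = general.cdf(upper) - general.cdf(lower)

print(round(p_std, 4), round(p_via_z, 4), round(p_direct, 4))
```

Because the standardized endpoints here are 0 and 0.5, all three numbers agree (about 0.1915), which is why one standard normal table is enough.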
I will leave our discussion of the normal distribution here & go on.
___________________________ end skip ____________________________
So, our probability theory allowed us to describe random variables with probability distributions.
While it is nice to be able to describe a random variable in this way we are usually not interested
in the whole distribution but only certain characteristics of the distribution.
For our purposes, the characteristics in which we are interested fall under two broad categories:
The mean, median, and mode are measures of central tendency (MOCT). As I noted earlier, we often look to them when we
wish to describe the "typical" person or household or observation.
The most popular measure and the measure that will dominate our analysis of econometrics is
the mean of a random variable. We will focus on it.
The mean (or average) of a random variable is also called the expected value of a
distribution.
The expected value of a random variable is calculated differently for discrete and for continuous
random variables.
Discrete: if we let f(y) represent the probability mass function of a random variable y, then the
expected value of the random variable y is
$E(Y) = \sum_{y \in Y} y \cdot f(y)$
where y ∈ Y requires that y be a value that the random variable can take [i.e., a mass point].
In other words, we multiply each possible value the random variable can take (each mass point)
by its probability & we sum up the products.
I will note that E(Y) is a standard way in which the expected value of a random variable is
represented.
We might also represent it as $\mu_Y$.
Ex
Y          10     20     30     40
f(y)       0.20   0.50   0.20   0.10
y∙f(y)     2      10     6      4      E(Y) = 22
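The table calculation can be reproduced in a few lines. This is a minimal Python sketch using the values and probabilities from the table above; the function name expected_value is just an illustrative choice.

```python
values = [10, 20, 30, 40]
probs = [0.20, 0.50, 0.20, 0.10]


def expected_value(values, probs):
    """E(Y): multiply each mass point by its probability and sum the products."""
    return sum(y * f for y, f in zip(values, probs))


mean_y = expected_value(values, probs)
print(round(mean_y, 10))
```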
Ex: Bernoulli
If we put the specific functional forms for f(0) and f(1) into the formula, we get
$E(Y) = 0 \cdot (1-p) + 1 \cdot p = p$.
So, the expected value in this case is simply the probability of observing y = 1.
Continuous Case:
The expected value of continuous random variables is calculated in a similar fashion except that
we integrate rather than sum. We won’t worry about integration here.
While we won’t integrate, I will identify the expected value of continuous random variables.
I will make one observation with respect to the general normal distribution, however. It can be
shown that the expected value of a general normal variable = μ.
Our discussion of the expected value (or mean) of a random variable uses the expectation
operator. Generally, it applies to functions of y - g(Y). You’ll see what I mean by functions of y
through the examples we will consider.
Because b is a constant, we can pull it out of the summation and get E(b) = ∑b∙f(y) = b∑f(y) = b, because the probabilities sum to one.
E(bY) = bE(Y)
Makes sense: if we multiply a random variable by a constant - say, 500 - each possible value it
can take will be multiplied by 500. We can - as in any summation - factor out the constant.
We saw that the expected value of the Bernoulli was p. The above formula implies that the
expected value of the function g(y) = 5y will be 5p.
(2) Property E2: E(a + bY) = a + bE(Y) (7th: 693. 6th: 653)
Note that this example indicates that the expectation operator (like the summation operator)
passes through linear transformations.
Property E3: (7th: 693. 6th: 653) a1•X1 + … + an•Xn Its expected value is ….
Ex: $E(Y^2) \neq (E(Y))^2$
With respect to the left-hand side, $E(Y^2)$ is the square of each value of y multiplied
by the probability of observing that value of y, with all of the products summed.
It is generally true that it does not equal the expected value of the variable squared.
What's $E(Y^2)$?
$0^2 \cdot f(0) + 1^2 \cdot f(1) = 0^2 \cdot (1-p) + 1^2 \cdot p = p$
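The properties of the expectation operator can be verified for the Bernoulli case. The following Python sketch is only an illustration: p = 0.3 and the constants a = 2, b = 5 are assumed values, and expect is a hypothetical helper that computes E[g(Y)] for a discrete random variable.

```python
def expect(g, values, probs):
    """E[g(Y)]: sum of g(y) * f(y) over the mass points."""
    return sum(g(y) * f for y, f in zip(values, probs))


p = 0.3                   # assumed Bernoulli parameter, illustrative only
values = [0, 1]
probs = [1 - p, p]
a, b = 2.0, 5.0           # assumed constants

e_y = expect(lambda y: y, values, probs)             # equals p
e_by = expect(lambda y: b * y, values, probs)        # equals b * p
e_lin = expect(lambda y: a + b * y, values, probs)   # equals a + b * p
e_y2 = expect(lambda y: y ** 2, values, probs)       # equals p, not p squared

print(e_y, e_by, e_lin, e_y2, e_y ** 2)
```

The operator passes through the linear function a + bY, but E(Y²) = p differs from (E(Y))² = p² whenever p is strictly between 0 and 1.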
These measures give us a feel for how a variable is spread out. They do not describe a point in
the distribution (as do the mean & median).
We will focus on the variance of a distribution and its associated measure the standard deviation.
Variance & Standard Deviation
It is the expected value of the squared difference between the random variable & its mean: $V(Y) = E[(Y - \mu_Y)^2]$.
For a continuous random variable, we calculate it in a similar way except that we integrate (don’t
worry about the integration).
We can show that $E[(Y - \mu_Y)^2] = E(Y^2) - \mu_Y^2 = E(Y^2) - [E(Y)]^2$. (7th: 695: B.24)
Y               10     20     30     40
f(y)            0.20   0.50   0.20   0.10
y∙f(y)          2      10     6      4      E(Y) = 22
y - μy          -12    -2     8      18
(y-μy)²∙f(y)    28.8   2      12.8   32.4   V(Y) = 76
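Both variance formulas can be checked against the table. This is a minimal Python sketch using the table's values; the variable names are illustrative.

```python
values = [10, 20, 30, 40]
probs = [0.20, 0.50, 0.20, 0.10]

# Mean: E(Y) = sum of y * f(y).
mu_y = sum(y * f for y, f in zip(values, probs))

# Definition: V(Y) = E[(Y - mu)^2].
var_definition = sum((y - mu_y) ** 2 * f for y, f in zip(values, probs))

# Shortcut: V(Y) = E(Y^2) - [E(Y)]^2.
e_y2 = sum(y ** 2 * f for y, f in zip(values, probs))
var_shortcut = e_y2 - mu_y ** 2

print(round(mu_y, 10), round(var_definition, 10), round(var_shortcut, 10))
```

Here E(Y²) = 560, so the shortcut gives 560 - 22² = 76, matching the last row of the table.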
☞ While the variance does describe the spread of a distribution, it is not comparable with the
mean because it is in terms of squared values of the random variable.
Ex. Consumption. The mean is in terms of dong [dollars]. The variance is in terms of squared
dong [dollars].
Because we would like to be able to compare the dispersion & the mean, we must translate the
variance into the same units as the mean. How would we do it? Take the square root of the var.
We call the square root of the variance the standard deviation (7th: 696)
Ex: general normal distribution $Y \sim N(\mu, \sigma^2)$. Can show that the standard deviation of the
general normal distribution is the sigma parameter; i.e., it's σ. Thus, the variance is $\sigma^2$.
Do it with the other formula .... $E(Y^2) = p$ & $E(Y) = p$ ⇒ $E(Y^2) - [E(Y)]^2 = p - p^2 = p(1-p)$.
(2) If b is a constant, V(Y+b) = V(Y). The constant does not affect the variance.
The properties that we may be interested in may be summarized by considering the variance of
the following linear transformation of the rv Y:
Z = b + a∙Y
(1) The addition of a constant - b (e.g., 10) - to a random variable does not change its variance.
It simply shifts the distribution (& the mean). [Saw this in the handout with respect to the
general normal distribution.]
(2) The variance of the original random variable is multiplied by the square of a: V(Z) = a²∙V(Y).
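For Z = b + a∙Y, where b is the added constant and a is the multiplier, the two properties can be illustrated numerically. The sketch below reuses the discrete distribution from the earlier variance example (values 10, 20, 30, 40); the constants a = 3 and b = 10 are assumed for illustration.

```python
values = [10, 20, 30, 40]
probs = [0.20, 0.50, 0.20, 0.10]


def variance(vals, probs):
    """V(Y) = E[(Y - mu)^2] for a discrete random variable."""
    mu = sum(v * f for v, f in zip(vals, probs))
    return sum((v - mu) ** 2 * f for v, f in zip(vals, probs))


a, b = 3.0, 10.0  # assumed constants: a multiplies Y, b is added

var_y = variance(values, probs)
var_shifted = variance([y + b for y in values], probs)   # V(Y + b)
var_z = variance([b + a * y for y in values], probs)     # V(b + a*Y)

print(round(var_y, 10), round(var_shifted, 10), round(var_z, 10))
```

Adding b leaves the variance at 76, while multiplying by a = 3 scales it by 9, giving 684.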
Summary
The foregoing is our basic theoretical analysis of the distribution of a single random variable:
1st describe the random variables with probability distributions & then talk about certain
characteristics of the distributions.
We will recall that we undertook a review of probability distributions because we wanted to link
a sample of data with the population of interest.
In order to make that link, we must discuss some aspects of probability distributions of greater
than one random variable & expectations with respect thereto. So, let’s turn to them.
Joint Distributions: (7th: 688) Functions of More Than One Random Variable
We will first consider a function of two random variables. Suppose that Y and X are two random
variables which may be discrete or continuous. We will focus on the situation in which they are
discrete.
These random variables will have some probability distribution which describes the probabilities
that the two variables jointly take on specified values.
Ex. PCs sold and printers sold. Suppose that we are focusing on the population of people
who purchase computers and we’re interested in whether they also purchase a printer. Let Y
represent the number of printers sold in a day and X represent the # of PCs sold in a day.
This population includes 2 random variables: (1) # of PCs sold & (2) # of Printers sold.
We call the probability distribution for these two random variables a joint probability mass
function. It's a mass function because our variables are discrete.
The joint pmf is below. The joint pmf will describe the probability that we get any combination of (#
of PCs sold in day, # of Printers sold in that day).
                         PCs (x)
               0      1      2      3      4     fy(y)
           0   .03    .03    .02    .02    .01    0.11
Printers   1   .02    .05    .06    .02    .01    0.16
(y)        2   .01    .02    .10    .05    .05    0.23
           3   .01    .01    .05    .10    .10    0.27
           4   .01    .01    .01    .05    .15    0.23
fx(x)          .08    .12    .24    .24    .32    1.00
Each element in the table describes the (joint) probability that the two random variables take on
the values identified in the column & the row.
Examples: …
☺ You will note that a joint pmf should satisfy all requirements of a pmf. Thus, all
probabilities should sum to one & there should be no negative probabilities. [confirm for the
above]
We can obtain from the joint pmf what are called marginal probabilities and marginal pmf's.
A marginal pmf, in this context, is the pmf of one of the 2 random variables. The phrase marginal
pmf is really descriptive. It provides us with the probabilities we'd obtain if we summed across
rows or down the cols; i.e., the probabilities that would be contained in a margin.
Ex. Suppose that we're interested only in the number of PCs sold (X) and we're wondering
about the likelihood that a certain number is sold in a day. How would we calculate that
probability? For a given # of PCs sold, sum down all possible # of printers sold.
So, from this joint pmf we can obtain two marginal pmf's.
We represent generally the marginal probability functions of y and x as fy(y) and fx(x).
Ex: in the example they are the fY(∙) column and fX(∙) row.
You should note [and confirm] that the marginal pmf's should satisfy all requirements of a pmf.
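The confirmation can be done by hand or with a short script. The following Python sketch enters the joint pmf for printers (rows, y) and PCs (columns, x) from the table above and computes both marginal pmf's by summing across rows and down columns.

```python
# Joint pmf: rows are printers sold (y = 0..4), columns are PCs sold (x = 0..4).
joint = [
    [0.03, 0.03, 0.02, 0.02, 0.01],  # y = 0
    [0.02, 0.05, 0.06, 0.02, 0.01],  # y = 1
    [0.01, 0.02, 0.10, 0.05, 0.05],  # y = 2
    [0.01, 0.01, 0.05, 0.10, 0.10],  # y = 3
    [0.01, 0.01, 0.01, 0.05, 0.15],  # y = 4
]

# Marginal pmf of Y: sum across each row (over all values of x).
fy = [sum(row) for row in joint]

# Marginal pmf of X: sum down each column (over all values of y).
fx = [sum(col) for col in zip(*joint)]

# pmf requirements: no negative entries and total probability of one.
nonnegative = all(p >= 0 for row in joint for p in row)
total = sum(fy)

print([round(p, 2) for p in fy])
print([round(p, 2) for p in fx])
print(nonnegative, round(total, 10))
```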
Let's turn to another concept regarding two random variables; namely, ....
Independence has a specific definition: 2 variables are independent when their joint probability
distribution is the product of the marginal pd's of the two variables.
We may interpret independence as saying that the two variables are not related.
Alternatively, knowing the value of 1 variable provides no insight into the likelihood of
obtaining a certain value of the other var.
Ex: PCs sold and Printers sold. Independence implies that knowing the number of PCs sold in a
day provides no insight into the number of printers sold in that day.
              X
           0       1      fy(y)
Y    0    0.12    0.48    0.60
     1    0.08    0.32    0.40
fx(x)     0.20    0.80
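To see that this table satisfies the definition of independence, we can check that every joint probability equals the product of the corresponding marginal probabilities. The Python sketch below does this; the dictionary layout (keys are (x, y) pairs) and the function name is_independent are illustrative choices, not part of the course material.

```python
def is_independent(joint, tol=1e-9):
    """Check f(x, y) = fx(x) * fy(y) for every cell of a joint pmf keyed by (x, y)."""
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    fx = {x: sum(joint[(x, y)] for y in ys) for x in xs}  # marginal pmf of X
    fy = {y: sum(joint[(x, y)] for x in xs) for y in ys}  # marginal pmf of Y
    return all(abs(joint[(x, y)] - fx[x] * fy[y]) <= tol for x in xs for y in ys)


# Joint pmf from the 2x2 table above, keyed by (x, y).
joint = {(0, 0): 0.12, (1, 0): 0.48, (0, 1): 0.08, (1, 1): 0.32}

print(is_independent(joint))
```

For example, the (x = 1, y = 0) cell is 0.48, which equals 0.80 times 0.60.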
We can use the marginal pmf's to obtain the expected values of each random variable (r.v.)
Expected values and variances of functions of the two random variables (4th: 437)
Does the fact that we have a joint distribution affect our conclusions with respect to
expectations?
The foregoing is a nice property of the expectation operator. It holds generally: the operator
moves through linear transformations of random variables.
Finally, we may note that the operator does not move through non-linear transformations of
variables;
ex. Z = X∙Y: E(Z) ≠ E(X)E(Y) in general, with one exception I'll talk about later.
Variance of Z = X + Y
What about the variance of Z = X + Y? In order to determine the variance of the sum of the two variables we
must introduce the covariance.
We can show that $E[(X - \mu_X)(Y - \mu_Y)] = E(X \cdot Y) - E(X)E(Y)$. This makes calculations easier.
Ex:           X
           0      1
Y    0    0.4    0.2
     1    0.1    0.3
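The covariance for this table can be worked out with both formulas. A minimal Python sketch follows; the dictionary keyed by (x, y) is just one convenient way to store the joint pmf.

```python
# Joint pmf from the table above, keyed by (x, y).
joint = {(0, 0): 0.4, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.3}

e_x = sum(x * p for (x, _), p in joint.items())       # E(X)
e_y = sum(y * p for (_, y), p in joint.items())       # E(Y)
e_xy = sum(x * y * p for (x, y), p in joint.items())  # E(XY)

# Definition: cov(X, Y) = E[(X - mu_x)(Y - mu_y)].
cov_definition = sum((x - e_x) * (y - e_y) * p for (x, y), p in joint.items())

# Shortcut: cov(X, Y) = E(XY) - E(X)E(Y).
cov_shortcut = e_xy - e_x * e_y

print(round(e_x, 10), round(e_y, 10), round(e_xy, 10))
print(round(cov_definition, 10), round(cov_shortcut, 10))
```

Here E(X) = 0.5, E(Y) = 0.4 and E(XY) = 0.3, so both formulas give a covariance of 0.1.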
When discussing the expectation operator earlier, I noted that E(X∙Y) ≠ E(X)E(Y)
That is not true when variables are independent. They are equal.
What does this imply with respect to covariance? I have noted that
I will note that the opposite is not true: zero covariance does not imply independence.
which implies that when variables are independent the variance of their sum is the sum of their
variances.
= cov(bX, dY) = b∙d∙cov(X,Y)
because the covariance of a constant w/ a variable = 0 (the constant does not vary).
Ex: let Y1 = Experience (in years) & Y2 = income (in 1,000s); let b=12 & d=1000.
Note that cov(X,X) = var(X). Plug into the general formula at (V) & consider what other
formula it looks like.
☞ There is a problem with the covariance: its size depends on the units used for a variable.
Thus, if you measured income in dollars you'd get a different covariance than if you measured it
in terms of hundreds of dollars.
Because of this problem with units, we often focus on the correlation coefficient to obtain some
insight into the degree to which two variables are related.
You will note that we divide the covariance by the standard deviations of the 2 variables: $\mathrm{corr}(X,Y) = \mathrm{cov}(X,Y) / (\sigma_X \sigma_Y)$.
How does this get rid of the units problem? The standard deviations in the denominator will be
measured in the same units as the numerator; so, they'll cancel out.
☞ Ex: X & Y measured in dong. Get dong squared in the numerator & the
denominator.
If measured in 1000s, get 1000's squared in the numerator & the denominator.
☞ The correlation coefficient will lie between -1 and 1. It measures the extent to which
variables are related linearly.
☞ Ex. Consumption & income. Would expect increased income to produce greater
consumption. Thus, would expect them to be positively related. Expect a + covariance.
☞ Ex. Price of a product & demand. We would expect a higher price to result in lower D: a
negative covariance.
A correlation coefficient = 1 implies that the two variables are perfectly linearly related in a
positive manner.
A correlation coefficient = -1 implies that the 2 variables are perfectly linearly related in a negative
manner; i.e., they will lie on a straight line with a negative slope.
If the correlation coefficient = 0, then the 2 variables are not linearly related.
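The units problem, and how the correlation coefficient avoids it, can be illustrated numerically. The Python sketch below reuses the 2x2 joint pmf from the covariance example earlier in these notes and rescales X by a factor of 1000 (an assumed change of units, for illustration only). The helper name cov_and_corr is hypothetical.

```python
import math


def cov_and_corr(joint):
    """Return (covariance, correlation) for a joint pmf keyed by (x, y)."""
    e_x = sum(x * p for (x, _), p in joint.items())
    e_y = sum(y * p for (_, y), p in joint.items())
    cov = sum((x - e_x) * (y - e_y) * p for (x, y), p in joint.items())
    var_x = sum((x - e_x) ** 2 * p for (x, _), p in joint.items())
    var_y = sum((y - e_y) ** 2 * p for (_, y), p in joint.items())
    # Correlation: covariance divided by the product of the standard deviations.
    return cov, cov / math.sqrt(var_x * var_y)


joint = {(0, 0): 0.4, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.3}

# Same distribution with X measured in units 1000 times smaller.
rescaled = {(1000 * x, y): p for (x, y), p in joint.items()}

cov, corr = cov_and_corr(joint)
cov_big, corr_big = cov_and_corr(rescaled)

print(round(cov, 6), round(corr, 6))
print(round(cov_big, 6), round(corr_big, 6))
```

The covariance is multiplied by 1000 when the units change, while the correlation stays the same (about 0.41) and remains between -1 and 1.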
It is not simply the sum of the variances. Must account for the covariance.
The first two elements reflect the properties of the variance we have talked about.
☞ Next, let's extend our analysis to greater than two random variables.
Suppose we draw n random variables - Y1, Y2, ..., Yn - and that each Y is drawn from the
same probability distribution function (either continuous or discrete) with mean μY & standard
deviation σY which are known.
Suppose further that each variable is independent. Note that the independence assumption is very
important!
As a SRS of size n from a known population. Each random variable represents a draw from the
population.
It's the mean of the random variables drawn: the sample mean.
You should note that because 𝑦̅ is a function of random variables it is a rv. Thus, the sample
mean is a rv.
The foregoing implies that the sample mean has its own probability distribution.
☺ We might ask what are the mean and variance of the random variable called the sample mean.
What is the expected value of 𝑦̅? E(𝑦̅) = (1/n) [E(y1) + E(y2) + ... + E(yn)] (7th: 717)
If we note that each E(yi) = μY because they are drawn from the same probability
distribution and that we have n of them, we see that
E(𝑦̅) = (1/n)(nμY) = μY.
So, the expected value of the mean of the random sample is the population mean. That's nice.
Recalling that independence implies no covariances between the variables & the rules regarding
variances we just discussed, we see that
V(𝑦̅) = (1/n)²[V(y1) + V(y2) + ... + V(yn)].
Because there are n variances, each equal to σY², we see that the variance of the sample mean is
V(𝑦̅) = (1/n)²(nσY²) = (1/n)σY².
So, we have seen that the probability distribution associated with the sample mean
has an expected value equal to the mean of the underlying population distribution & a variance
equal to (1/n)σY².
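These two results can be verified exactly for a small discrete population. The sketch below assumes a hypothetical population (a fair six-sided die, with μY = 7/2 and σY² = 35/12) and enumerates every possible sample of size n; the function name `sample_mean_moments` is illustrative.

```python
from fractions import Fraction
from itertools import product

FACES = [1, 2, 3, 4, 5, 6]  # hypothetical population: one roll of a fair die

def sample_mean_moments(n):
    """Exact mean and variance of the sample mean over all 6**n equally likely samples."""
    total = Fraction(0)
    total_sq = Fraction(0)
    count = 0
    for sample in product(FACES, repeat=n):
        ybar = Fraction(sum(sample), n)
        total += ybar
        total_sq += ybar ** 2
        count += 1
    mean = total / count
    variance = total_sq / count - mean ** 2
    return mean, variance

for n in (1, 2, 3, 4):
    mean, variance = sample_mean_moments(n)
    # The variance printed for each n equals (35/12) / n.
    print(n, mean, variance)
```

For every n the expected value of the sample mean is 7/2, while its variance is (35/12)/n, which shrinks as n grows.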
Now, consider what happens as n increases: the variance (1/n)σY² shrinks toward zero. Thus, we
expect the distribution of the sample mean to collapse on the population mean as the sample size
approaches ∞.
This is a nice result. It says that the expected value of the mean of a SRS is the mean of the
underlying population & that as n increases our sample mean is more & more likely to be close to
that population mean.
"If [random variable] Y has any distribution with mean μY & variance σY², then the distribution
of
(𝑦̅ - μY)/(σY²/n)^0.5
approaches the standard normal distribution as sample size n increases. Therefore the distribution
of 𝑦̅ in large samples is approximately normal with mean μY & variance σY²/n."
You should note that the random variable can have any distribution. The distribution can be
continuous or discrete.
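A rough numerical check of this statement for a discrete distribution: the sketch below assumes each draw is a hypothetical 0/1 variable with probability p = 0.5, computes the exact distribution of the standardized sample mean, and measures how far its cdf is from the standard normal cdf. The function name `max_cdf_gap` is illustrative.

```python
import math
from statistics import NormalDist

def max_cdf_gap(n, p=0.5):
    """Largest gap, over the possible values of the sample mean, between the exact
    cdf of (ybar - mu) / (sigma / sqrt(n)) and the standard normal cdf."""
    mu = p
    sigma = math.sqrt(p * (1 - p))
    std_normal = NormalDist()
    cumulative = 0.0
    largest = 0.0
    for k in range(n + 1):
        # Probability that exactly k of the n draws equal one.
        cumulative += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        z = (k / n - mu) / (sigma / math.sqrt(n))
        largest = max(largest, abs(cumulative - std_normal.cdf(z)))
    return largest

for n in (25, 100, 400):
    print(n, round(max_cdf_gap(n), 4))
```

The gap gets smaller as n grows, which is the sense in which the distribution of the standardized sample mean approaches the standard normal.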
That ends our discussion of probability theory. You will remember that we talked about
probability theory because we wanted to use a sample to make inferences about the population.
We can now make such inferences.
In our discussion immediately above, we assumed that we knew the underlying pop distribution.
We were able to show that the expected value of the sample mean is the population mean.
Now suppose we have a sample of data which comes from an unknown population distribution and
we're interested in the mean value of the relevant variable for the population.
It is important to realize that we do not know the population probability distribution or any of its
characteristics.
Ex. We don’t know anything about the population of the approximately 223 million households.
☞ We would like to make a "best guess" (or an inference) about the mean (or expected value)
of the unknown population.
We are asking,
"How do we use the sample of data to make inferences about the mean of the
unknown distribution?"
☞ Statistical inference involves searching for estimators which allow us to gain insights into
the value of the unknown population parameter: e.g., in our example it’s the population mean.
Wooldridge (7th: 715): “Given a random sample … drawn from a population distribution that
depends on an unknown parameter θ, an estimator of θ is a rule that assigns each possible
outcome of the sample a value of θ.”
☞ Consider the population mean. As we will see, a variety of possible estimators for the
population mean exist.
Ex. Returning to the household consumption example: what would an estimator of the mean of
the population distribution be?
☺ There is an idea in the literature regarding estimators called the Analogy Principle (AP).
The AP states that you should look for the sample analog of the population characteristic you are
interested in. Thus, if you're interested in the population mean, you should use the sample mean
as an estimator.
So, in this case, the Analogy Principle implies considering the sample mean as an estimator of
the population mean.
The formula for the mean of a SRS of a random variable is 𝑦̅ = (1/n) ∑i yi.
It is a formula (a function of the data) and it is random because we have seen that the sample
mean is a random variable. So, it is a possible estimator.
☞ Returning to the sample mean, we have already talked about its characteristics.
We have seen that its expected value = the population mean. That is a nice characteristic.
☞ So, should we rely on it as an estimator? The answer is not necessarily obvious because it is
not the only possible estimator of the population mean which has this characteristic.
If we think about the sample mean, we realize that it is a function of the sample which places
equal weight on each observation in the sample, with the weights summing to one.
𝑦̅ is a function which puts a weight of (1/n) on each random variable in the formula.
We may also note that the weights sum to one: there are n weights of (1/n).
✌ In light of these observations, imagine an estimator which has weights which sum to one but
which has different weights than those identified above.
Ex 1: consider an estimator which places a weight of one on the first observation and a weight of
zero on all other observations.
What is its expected value? … E(Zalt) = (1)E(y1) + (0)E(y2) + ... + (0)E(yn ) = μY.
☞ So, we have two different estimators of the population mean of this random variable.
Which estimator should we choose? (7th: 716) In order to answer that question, we have to
identify certain characteristics of estimators we might deem desirable & then compare different
estimators with respect to those characteristics. We will do that now.
(1) Small Sample Theory: we consider properties of an estimator when we have a small (or fixed
size) sample, and
(2) Large Sample (Asymptotic) Theory: we consider properties of an estimator as the sample size
gets larger and larger [Gujarati & Porter 4th: 497-498].
Each of these approaches focuses on the probability distribution of the estimator. We will focus
primarily on small sample theory. (We will touch on large sample theory only a little bit.)
An estimator is unbiased if its expected value = the value of the pop characteristic we're
interested in (7th: 716, C.3).
Thus, if we are interested in the population mean then an unbiased estimator has an expected
value = the population mean.
The estimators we identified above were all unbiased; we saw that their expected values = the
population mean.
Indeed, any weighted sum of the observations with weights which sum to one will be unbiased.
The fact that we may have many, many possible estimators which are unbiased is a reason we
consider efficiency.
Efficiency (7th: 719) concerns the variance of the estimator. It applies only to unbiased
estimators.
We say that one estimator is more efficient than another estimator if it has a smaller variance.
“Relative Efficiency” Defined (7th: 719) “If W1 and W2 are two unbiased estimators of parameter
θ, W1 is efficient relative to W2 when V(W1) ≤ V(W2), with strict inequality for at least one value
of θ.”
The key point to remember is that a given estimate is not likely to = the population mean. In fact,
the likelihood that it = the pop mean when the random variable is continuous = 0.
Thus, the larger the variance of an estimator the less sure we are that a given estimate is near the
actual population mean.
If we consider two estimators with different variances, the one with the smaller variance is more
likely to produce an estimate that is close to the actual population mean.
With the definition of efficiency in mind, consider the two estimators described above.
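(I) Z = (1/n)(y1) + (1/n)(y2) + ... + (1/n)(yn). We have seen that V(Z) = (1/n)σY².
(II) Zalt = (1)(y1) + (0)(y2) + ... + (0)(yn). Because only y1 receives any weight, V(Zalt) = V(y1) = σY².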
So, for n > 1 the sample mean, our original estimator, has a smaller variance than the estimator
represented by Zalt. We can say that the original estimator is more efficient than the second
estimator.
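The comparison of the two estimators can be sketched in a few lines. For independent draws with common mean μY and variance σY², a weighted sum ∑ wi yi has expected value μY ∑ wi and variance σY² ∑ wi². The numbers below (n = 10, μY = 7.5, σY² = 4) and the function name `linear_estimator_moments` are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def linear_estimator_moments(weights, mu, sigma2):
    """Expected value and variance of sum(w_i * y_i) when the y_i are independent
    draws with common mean mu and common variance sigma2."""
    expected = sum(weights) * mu
    variance = sum(w * w for w in weights) * sigma2
    return expected, variance

n = 10
mu = Fraction(15, 2)   # hypothetical population mean (7.5)
sigma2 = Fraction(4)   # hypothetical population variance

sample_mean_weights = [Fraction(1, n)] * n                 # estimator Z
z_alt_weights = [Fraction(1)] + [Fraction(0)] * (n - 1)    # estimator Zalt

z_moments = linear_estimator_moments(sample_mean_weights, mu, sigma2)
z_alt_moments = linear_estimator_moments(z_alt_weights, mu, sigma2)

print("Z:   ", z_moments)      # expected value 15/2, variance 2/5
print("Zalt:", z_alt_moments)  # expected value 15/2, variance 4
```

Both estimators are unbiased, but the variance of Z is σY²/n while the variance of Zalt is σY².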
The foregoing discussion of efficiency was in relative terms; we compared one estimator with
another.
It would be nice to talk about estimators in absolute terms; i.e., it would be nice to identify the
smallest possible variance among all estimators. If we could id the smallest variance then when
we derived an estimator w/ that variance we could stop looking for "better" estimators.
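If we are willing to limit ourselves to estimators which are linear combinations of the
observations, we can show that the sample mean has the smallest variance among all linear,
unbiased estimators of the population mean.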
We say that the sample mean is the Best Linear Unbiased Estimator (BLUE).
Of course, this result is limited to linear, unbiased estimators. We may get rid of these limitations
but doing so takes us beyond the scope of this course.
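The BLUE property can be illustrated, though not proved, numerically. A linear estimator with weights summing to one has variance σY² ∑ wi², so the sketch below compares ∑ wi² for equal weights against many randomly chosen weight vectors that also sum to one. The function names and the choice of n = 8 are assumptions made for the illustration, and only positive weights are tried.

```python
import random

def variance_factor(weights):
    """V(sum w_i * y_i) divided by sigma^2, for independent draws with common variance."""
    return sum(w * w for w in weights)

def random_weights(n, rng):
    """Positive weights rescaled to sum to one, so the estimator stays unbiased."""
    raw = [rng.uniform(0.1, 1.0) for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

n = 8
rng = random.Random(0)

equal_weights = [1.0 / n] * n
best = variance_factor(equal_weights)  # equals 1/n
alternatives = [variance_factor(random_weights(n, rng)) for _ in range(1000)]

print("equal weights:", best)
print("smallest among random alternatives:", min(alternatives))
```

None of the alternative weightings produces a variance factor below 1/n, which is what the BLUE result says must happen.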
☺ To some extent, we have now answered the question we posed in the first day of class;
namely,
If we are interested in the population mean (or expected value), the answer is Yes. We know that
the sample mean is BLUE. Thus, we can make such an inference.
I noted earlier that the second type of analysis we undertake in econometrics is called Large
Sample Theory (or Asymptotic Analysis).
Asymptotic analysis considers characteristics of an estimator as the sample size gets larger and
larger, ultimately approaching infinity.
We won’t discuss Large Sample Theory much. The purpose of this discussion is to give you the
basic intuition underlying large sample theory. I will mention it from time to time throughout the
semester but you will not need it for problem sets or tests.
☼ We might first ask why we focus on large sample theory when we have the unbiasedness &
efficiency criteria.
We often consider large-sample theory in econometrics because the small sample properties of
an estimator can be effectively impossible to analyze.
Even though the small sample characteristics of an estimator from a known population cannot be
specified, we can talk about & analyze the distribution of the estimator as the sample size → ∞.
We say that an estimator is consistent if, as n → ∞, its distribution collapses on the population
value of interest.
A basic conclusion is that an unbiased estimator whose variance shrinks to zero as n → ∞ is
consistent; the sample mean is an example. Unbiasedness alone is not enough: Zalt is unbiased for
every n, but its variance stays at σY² however large n gets. So, let’s consider a biased estimator.
(2) Asymptotic Efficiency: this focuses on the variance of the estimator as sample size → ∞.
We won’t consider it other than to note that efficiency implies asymptotic efficiency.
Summary
We talked about using "estimators" to make a guess about the unknown mean.
After observing that we can have many estimators for a given population parameter - in our case
we were looking for the population mean - we talked about standards we use in identifying
preferred estimators.
Those standards - for both large and small samples - focused on characteristics of the
distributions of the estimators. Specifically, the standards focused on whether the expected value
of the estimator was the pop parameter (& whether the variance of the estimator collapsed to a
point as n increased) and on the variance of the estimator.
Throughout the discussion we focused on use of the formula for the sample mean as an estimator
for the population mean. We saw that it is the BLUE estimator of the population mean.
As a result, we can conclude that if we're interested generally - for any given situation - in
obtaining an estimate of the pop mean we should use the sample mean.
We then consider possible estimators of that characteristic. We search among the various
possible unbiased estimators for that with the lowest variance (or the lowest possible variance).
Hypothesis testing (HT) concerns testing whether statistics support our beliefs about the
population parameter on which we are focusing.
Ex: Suppose that we’re interested in the income of individuals working in Hanoi. We might have
some beliefs about the mean income of the population which we wish to test.
Suppose that we believe that the mean income of a person working in Hanoi is 7.5 million
dong/month.
We might think of using a random sample of working individuals to test whether mean income is
really 7.5m dong/month. We do so with HT.
(2) use parameter estimates obtained from a sample to test the hypotheses.
Consider (1):
Identification of the two hypotheses involves stating formally our beliefs about the population
parameter we are considering.
The null hypothesis often assumes that a believed effect or relationship does not exist.
The alternative hypothesis identifies what you believe is true if the null hypothesis is not true.
Ex. Our population is working people in Hanoi. We’re focusing on their monthly income.
[Stats suggest a mean income of 7.6m dong/month]
It states that the population mean = 7.5 million dong. This is what I meant when I said that the
null assumes that beliefs are not true.
The foregoing alternate hypothesis is called a two-sided hypothesis: two-sided because in the
alternate hypothesis the mean can be less than or greater than.
Such an AH implies that we will reject the NH and accept the AH for sample means above or
below the null hypothesis value.
Ex: We'd get a one-sided alternative hypothesis in the above example if we believe that monthly
income in Hanoi is greater than the national average.
Consider (2):
We are interested in using our statistical techniques to test the two hypotheses.
Before we talk about Hypothesis Testing (HT) formally, let's get the intuition underlying HT.
Again, let's consider the population mean. Our NH about the mean monthly income is 7.5m.
Suppose that we propose testing the hypothesis by looking at the mean of a random sample of
data.
We then calculate the sample mean & compare it to the null hypothesis mean. We know that it is
unlikely that our sample mean will = the null hypothesis mean.
When will we believe that the null hypothesis is true? When the sample mean is close to the null
hypothesis.
Alternatively, what value is so far away that we will decide that the null hypothesis is false?
This is a really good question because we know that the sample mean can take on many values
(it is a random variable with its own distribution).
We might fortuitously sample only very wealthy people or only unemployed people. In either
case, we will get a sample mean that is far from the actual population mean even though the
population mean is the null hypothesis mean.
Of course, the likelihood of sampling only millionaires or only unemployed people is not great.
Indeed, it is pretty unlikely.
We use this idea of "pretty unlikely" as a guide in deciding what sample means will provide
evidence against the null hypothesis.
We call the idea of “pretty unlikely” the level of significance of a test. The standard level of
significance chosen is 5% - chosen mostly out of convention. [At times people will use a 1%
level of significance.]
Level of Significance: if the likelihood of obtaining the sample mean we actually draw,
assuming that the null hypothesis is true, is lower than the level of significance we reject the NH.
If we get a sample mean that we would observe only 4% of the time, if the NH is true, we
conclude that the NH is not true. Why? Because obtaining such a sample mean would be
unlikely.
Alternatively, if we get a sample mean that we would observe 15% of the time, if the NH is true,
we would accept the NH.
So, we can think of the level of significance as defining what we feel is unlikely.
Level of significance is usually described with the Greek letter α. Thus, for a 5% level of
significance we set α = 0.05.
Note (Type I Error): even though we establish a level at which we will conclude that the NH is not
true, it is still possible that the NH is true. When the NH is true, there’s a 5% chance that we
reject it anyway.
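The meaning of the 5% can be sketched with a small simulation. It assumes a hypothetical normal population in which the NH is true (mean 7.5, standard deviation 2), draws many samples of size 25, and applies the one-sided Z test developed in the rest of this section. All numbers, the seed, and the function name `rejection_rate` are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

def rejection_rate(mu0, sigma, n, alpha, reps, seed):
    """Share of simulated samples, drawn with the NH true, in which a one-sided
    Z test at level alpha rejects the NH."""
    rng = random.Random(seed)
    z_critical = NormalDist().inv_cdf(1.0 - alpha)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        ybar = sum(sample) / n
        z = (ybar - mu0) / (sigma / math.sqrt(n))
        if z > z_critical:
            rejections += 1
    return rejections / reps

rate = rejection_rate(mu0=7.5, sigma=2.0, n=25, alpha=0.05, reps=10_000, seed=335)
print("share of true null hypotheses rejected:", rate)
```

The share of rejections should come out close to 0.05: that is the Type I error rate implied by a 5% level of significance.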
With the level of significance defined, let's turn to actual testing of the hypotheses.
(a) using a Z [t] statistic or, (b) using what are called “prob values.”
Z vs. t statistic: whether you use a Z or a t statistic depends on the sample size. If the sample size
is greater than 30 [50 or 120] you may use the Z statistic. Otherwise, use the t statistic.
We will compare the t and Z distributions once we get to HT in the context of regression
analysis. For now, I will use the Z statistic.
Both methods of significance testing (which are effectively the same) use the Z statistic. Let’s
take a look at the statistic.
It involves determining whether the likelihood of obtaining the observed sample mean is less
than the level of significance, assuming that the null hypothesis is true.
I will undertake the discussion for a one-sided alternative hypothesis first: i.e.,
If Y has a normal distribution, 𝑦̅ will have a normal distribution with expected value
(mean) equal to μY and with variance σ²/n.
We know that if we standardize the sample mean by subtracting its expected value and dividing
the difference by its standard deviation, the resulting statistic has a standard normal distribution;
i.e.,
Z = (𝑦̅ - μY)/(σ/n^0.5)
has a standard normal distribution with mean zero & standard deviation = one.
Our sample Z statistic is Z where we place our NH mean, μ0, in the place of the population
mean. Thus, our sample Z statistic is
Z = (𝑦̅ - μ0)/(σ/n^0.5)
The numerator equals the difference between the sample mean & the NH mean.
Note the italicized point. We compare our sample mean with the mean of our null hypothesis.
The difference represents how far off the sample mean is from what we hypothesize is true.
The resulting statistic tells us how many standard deviations the sample mean is from the
NH mean.
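As a minimal sketch of the calculation, the function below computes the sample Z statistic. The inputs (a sample mean of 7.9m dong from 100 people, with a known population standard deviation of 2m dong) are hypothetical numbers picked so that the result matches a Z of 2; the function name `z_statistic` is also an assumption.

```python
import math

def z_statistic(sample_mean, null_mean, sigma, n):
    """Number of standard deviations of the sample mean (sigma / sqrt(n))
    separating the sample mean from the NH mean."""
    return (sample_mean - null_mean) / (sigma / math.sqrt(n))

# Hypothetical Hanoi income sample, in millions of dong per month.
z_s = z_statistic(sample_mean=7.9, null_mean=7.5, sigma=2.0, n=100)
print(round(z_s, 4))
```

Here the sample mean sits about two standard deviations of the sample mean above the NH mean.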
Since we know, from our earlier discussion, that the Z statistic has a standard normal
distribution, we can use the standard normal tables to determine the likelihood of observing a
sample mean of that size or greater if the NH mean is true.
Alternatively, we can identify how unlikely it is that we would obtain our sample mean if the NH
is true.
Ex 1: suppose that we get a Zs = 2. Table G.1 (7th: 741) indicates that the likelihood of observing
a sample mean of that size or larger is (1 -0.9772) = 0.0228.
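The table lookup can be reproduced with Python's standard library, which is a convenient cross-check when a printed table is not at hand. The function name `upper_tail_probability` is illustrative; 1.35 is the other Z value marked on the graph in these notes.

```python
from statistics import NormalDist

def upper_tail_probability(z):
    """Probability that a standard normal variable is at least as large as z."""
    return 1.0 - NormalDist().cdf(z)

for z in (1.35, 2.0):
    print(z, round(upper_tail_probability(z), 4))
```

For Z = 2 this gives 0.0228, the same value obtained from the table.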
Note that the foregoing probabilities are called p-values. P values answer the following
question:
“what is the largest significance level at which we could carry out the test and still fail to
reject the null hypothesis?”
[ Guj: defined as [4th: 506] "the lowest significance level at which a NH can be rejected."]
What is going on graphically? Have a standard normal distribution. The p-value gives us the
probability left in the tail of the standard normal distribution. Show graphically for the above
two examples.
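[Graph: standard normal distribution with Z = 1.35 and Z = 2.0 marked on the horizontal axis; the
p-value is the area in the tail to the right of each]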
The two methods of significance testing differ in that one uses the prob value and the other uses
the sample Z statistic. Let’s consider each method.
This method involves comparing the prob value with the level of significance.
If the prob value is less than the level of significance we reject the NH because the probability of
observing the sample mean, assuming that the NH is true, is less than the level of significance.
If the p value is greater than the level of significance we accept the NH.
Z statistic approach:
This approach, which is typically done, involves determining the Z associated with the level of
significance, call it the critical Z - Zc - and comparing it with the sample Z: ZS. Let's consider
this approach.
We first look for the value of Z whose prob value, for a one-sided test, equals the level of
significance. That is the critical Z.
So, for a 5% level of significance we look for the value of Z where there’s a 5% probability of
observing that value of the statistic or greater.
Graphically,
[Graph: standard normal distribution with 5% of the probability in the tail to the right of Zc = 1.645]
Ex: a 5% level of significance implies that we want a Z value that leaves 5% of the probability in
the tail. If we look at the standard normal table we would see that the Z associated with a 5%
probability is Zc = 1.645.
With the critical value of Z at hand we calculate our sample Z – Zs – and compare the two Z
values.
If Zs > Zc [=1.645] then we reject the NH because the probability of observing such a sample
mean, if the NH is true, is less than 5%.
Do it for Ex 1. We have seen that the critical Z for a 5% one-sided test is 1.645. ZS = 2 > 1.645, so we reject the NH.
Question: what if we get ZS = -2? We do not reject the NH. The alternate hypothesis is that
consumption in Fort Collins is greater than the national average. Our results indicate that we
have a sample mean below the national average: that's how we get a negative Z. This is not
evidence which favors the alternative hypothesis.
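The one-sided decision rule can be sketched as follows. The critical value comes from the inverse of the standard normal cdf; the function name `reject_one_sided` is an assumption made for the illustration.

```python
from statistics import NormalDist

def reject_one_sided(z_sample, alpha=0.05):
    """Reject the NH in favor of 'mean greater than the NH mean' when the
    sample Z exceeds the critical Z for the chosen level of significance."""
    z_critical = NormalDist().inv_cdf(1.0 - alpha)
    return z_sample > z_critical

print("critical Z:", round(NormalDist().inv_cdf(0.95), 3))
print("ZS = 2  ->", reject_one_sided(2.0))
print("ZS = -2 ->", reject_one_sided(-2.0))
```

A sample Z of 2 leads to rejection, while a sample Z of -2 does not, because a sample mean below the NH mean is not evidence for the one-sided alternative.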
With a one-sided test, we reject the NH & accept the alternative hypothesis if we get a sample
mean that's only on one side of the NH mean.
Ex: the Hanoi income example. AH was that monthly income in Hanoi is greater than the
national average of 7.5m dong.
We will reject the NH that there's no difference only if we get a sample mean above the NH mean.
If we get a sample mean less than the NH mean then we have no evidence that income is greater
here.
☼ Now consider the two-sided AH that income is different in Hanoi. We will reject the NH in
this case if we observe a sample mean that is far above or far below the NH mean.
If we establish a 5% level of significance we must distribute that 5% to means that are above &
are below the NH mean. Thus, we allocate 2.5% above & 2.5% below.
Graphically ….
Consider the implications of this for significance testing. Graphically, want probabilities in the 2
tails: 2.5% in each tail.
Prob Value method: we reject the NH if the probability of observing the sample statistic is less
than 2.5% (if it is either above or below the NH mean).
Z statistic method: we now need two critical Z values: one for a sample statistic that is above the
NH mean & one for a sample statistic which is below the NH mean.
[Graph: standard normal distribution with critical values -Zc and Zc marked and 2.5% of the
probability in each tail]
Ex: 2-sided, 5% level of significance. Things are different because we want 2.5% in each tail.
Get Zc = ±1.96.
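A sketch of the two-sided version follows. It reports the critical Z, a two-sided probability value (twice the one-tail probability, which is compared with the full 5% and is equivalent to comparing the one-tail probability with 2.5%), and the decision. The function name `two_sided_test` and the sample Z values are hypothetical.

```python
from statistics import NormalDist

def two_sided_test(z_sample, alpha=0.05):
    """Return (critical Z, two-sided p-value, reject?) for a two-sided Z test."""
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1.0 - alpha / 2.0)
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z_sample)))
    return z_critical, p_value, abs(z_sample) > z_critical

for z in (2.0, 1.8, -2.5):
    z_c, p, reject = two_sided_test(z)
    print(z, round(z_c, 2), round(p, 4), reject)
```

Note that a sample Z of 1.8 would be rejected by the one-sided test (1.8 > 1.645) but not by the two-sided test (1.8 < 1.96).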
We will see a lot more of this when we get to regression analysis. For now, we will summarize
our results.
Summary
Okay, that's our review of basic statistical ideas. It would be useful to summarize what we have
done because our analysis in econometrics will often be similar. If you are not sure what we are
doing in econometrics you might return to the steps we followed in statistics to gain insights into
where you are.
(1) Identified an issue in which we were interested - household consumption in Vietnam [the
U.S.]. We immediately noted that consumption varied from household to household. Thus, we
were dealing with a random variable.
Over the course of the analysis, we narrowed our focus to the mean consumption of households
in Vietnam [the U.S.].
Because we realized that we could not obtain information on all households in Vietnam [the
U.S.] - a group which we defined as the "population" in which we were interested - we discussed
obtaining info on a subset of that population. We called the subset a "sample." Indeed, we
considered a specific type of sample - Simple Random Sample.
The question that arose then was "can we be sure that information obtained from a subset of the
population will provide insights into mean household consumption in the U.S.?"
(2) In order to answer that question, we turned to probability theory. In that context, we
discussed probability distributions for two types of variables: (1) continuous & (2) discrete.
We used probability distributions to associate probabilities with the various values a random
variable could take. We could also derive from the distributions various characteristics of the
random variable, such as the expected value (or mean) of the distribution & its variance.
(3) Identification of probability distributions then allowed us to consider how the mean of a
sample of data might behave for a distribution which was known. We derived two useful results
with respect to the sample mean
(A) LLN which said .... & (B) the CLT, which said ....
(4) Having identified how a sample behaved we turned to Statistical Inference: in other words,
we considered whether we could use a sample of data to make inferences about the mean of an
unknown population.
We discussed using estimators to make "guesses" about the unknown population characteristic.
We noted that we analyze estimators in terms of their (1) bias (consistency) and (2) efficiency.
In light of the foregoing two standards, we saw that the sample mean is a BLUE estimator.
So, we proposed using it to make guesses about the unknown population mean.
When we discuss econometrics keep in mind this general approach because the issues are the
same.
So, that's our review of statistical analysis involving one random variable (rv). I will now turn to
an analysis in which we are interested in the relationship between two rv's.
Let's recall our discussion of joint distributions. We are looking at the distribution of two rv's.
If X and Y are our 2 rv’s we may represent their joint distribution as f(X,Y).
☺ Now, suppose that we believe that the number of printers sold depends on whether people
purchase a PC. In other words, printer demand derives from PC demand.
We have already discussed how to interpret the joint probabilities in the pmf in Table A-3 (4th:
424) and we have discussed how to derive marginal probability distributions from the table.
I will first define conditional probabilities and then interpret them. If we have two rv’s - X & Y -
the conditional distribution of Y given X is
f(y|x) = f(x,y)/fx(x).
In other words, we fix X at some value & ask "what is the distribution of Y for that value of X?"
Ex: If we fix the number of PCs sold at 4, the conditional probability distribution gives us the
distribution of Y for that number of PCs sold.
We may contrast this with the marginal distribution. The marginal distribution of Printers sold
gives us the probability distribution of Printers sold regardless of the # of PCs sold.
We might believe that the distribution of Printers sold differs across the # of PCs sold.
How do we do it? Take the row entitled fx(x) and divide each row in the table by that row.
Show how obtained one row of the conditional probabilities [put up on the board]
We may note that each column is a pmf in & of itself; each col describes the distribution of Y for
a given value of X.
What can we say about it? The 1st thing to note is that the probabilities down the column sum to
one.
In light of all probabilities being non-negative, we see that the probabilities in that column form
a pmf.
The column, thus, reps the distribution of # of Printers sold when 2 PCs are sold. Indeed, we
could graph that distribution itself: [do it]
[Graph: f(Y) on the vertical axis plotted against y|x=2 on the horizontal axis]
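The division described above can be sketched on a small table. The joint pmf below is hypothetical (it is not Table A-3): rows are the number of printers sold (Y), columns are the number of PCs sold (X), and the entries are counts out of 20 so that exact fractions can be used.

```python
from fractions import Fraction

x_values = [0, 1, 2]  # PCs sold
y_values = [0, 1, 2]  # printers sold

# Hypothetical joint distribution, as counts out of 20.
counts = [
    [3, 1, 0],  # y = 0
    [2, 4, 2],  # y = 1
    [0, 3, 5],  # y = 2
]
total = sum(sum(row) for row in counts)
joint = [[Fraction(c, total) for c in row] for row in counts]

# Marginal distribution of X: sum each column over all values of Y.
f_x = [sum(joint[i][j] for i in range(len(y_values))) for j in range(len(x_values))]

# Conditional pmf f(y|x): divide each row of the joint table, entry by entry, by f_x.
conditional = [
    [joint[i][j] / f_x[j] for j in range(len(x_values))]
    for i in range(len(y_values))
]

for j, x in enumerate(x_values):
    column = [conditional[i][j] for i in range(len(y_values))]
    print("x =", x, [str(p) for p in column], "sum =", sum(column))
```

Each column of the resulting table sums to one, so each column is a pmf for Y at the given value of X.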
Since we have performed the same calculation for each column, we see that each column in the
Table is a pmf in & of itself. Thus, we could graph a distribution for each column.
Analyze the above graph: we see that 2 PCs sold is most likely.
The comparison suggests that taking into account the # of PCs sold provides insights into the #
of printers sold. Stated in another way, the # of PCs sold seems to be an important factor in
describing printer sales.
If we had focused simply on joint probabilities, we would have a hard time discerning this
difference.
If we had focused solely on marginal distributions we would have lost the different information
totally. Indeed, we can compare the conditional distributions with the marginal distribution.
[contrast the marginal with each distribution].
Continuous Variables: I will note that the foregoing analysis concerned a conditional probability
distribution for a discrete variable. We can calculate the same probability distributions for
continuous distributions.
As a final matter, we will consider conditional probabilities when two rv's are statistically
independent.
When the variables are statistically independent, it turns out that f(y|x) = fy(y).
Thus, the conditional distribution of Y given X equals the marginal distribution of Y for all
possible values of X.
We may interpret this as saying that when rv's are statistically independent knowing the value of
the conditioning variable does not change the probability distribution of the rv we are
considering.
Alternatively, knowing the value of X has no impact on our assessment of the probability of
obtaining a specific value of Y.
Ex: knowing the # of PCs sold provides no insights into the # of printers sold.
If they were, what would the table of conditional probabilities look like?
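One way to see what the table of conditional probabilities looks like under independence is to build a joint pmf as the product of two marginals and then condition on X. The marginal distributions below are hypothetical, and the function name `conditional_y_given_x` is illustrative.

```python
from fractions import Fraction

# Hypothetical marginal distributions.
f_x = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
f_y = {0: Fraction(1, 3), 1: Fraction(2, 3)}

# Under statistical independence the joint pmf is the product of the marginals.
joint = {(x, y): f_x[x] * f_y[y] for x in f_x for y in f_y}

def conditional_y_given_x(x):
    """f(y|x) = f(x, y) / fx(x) for every possible y."""
    return {y: joint[(x, y)] / f_x[x] for y in f_y}

for x in f_x:
    print("x =", x, {y: str(p) for y, p in conditional_y_given_x(x).items()})
```

Every conditional distribution is identical to the marginal distribution of Y, so every column of the conditional table would be the same.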
As I noted when we discussed probability distributions of one variable, we often are interested in
reporting what’s true of the “typical” member of the population.
The most popular measure of the “typical” member is the expected value of a variable.
As we did in the case of a single variable, we can report expectations for our conditional
distributions.
We call them conditional expectations: they report the expected value of a variable (consumption
rate) conditional on the value of another variable (income).
We calculate them as we would any expected value. For discrete variables, we calculate
E(Y|X = x) = ∑i yi∙f(yi|x), summing over all possible values yi of Y. For X = 2,
E(Y|X = 2) = ∑i yi∙f(yi|x = 2)
X E(Y|X)
0 1.3750
1 1.3333
2 1.8750
3 2.5833
4 3.1563
The table indicates that the mean number of printers sold varies across the # of PCs sold.
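The conditional expectation calculation can be sketched as follows. The joint distribution is hypothetical (counts out of 20, not the table from the text), so the resulting numbers differ from those reported above; the function name `conditional_expectation` is illustrative.

```python
from fractions import Fraction

x_values = [0, 1, 2]  # PCs sold
y_values = [0, 1, 2]  # printers sold

# Hypothetical joint distribution as counts out of 20; rows are y, columns are x.
counts = [
    [3, 1, 0],  # y = 0
    [2, 4, 2],  # y = 1
    [0, 3, 5],  # y = 2
]

def conditional_expectation(j):
    """E(Y | X = x_values[j]) = sum over y of y * f(x, y) / fx(x)."""
    column_total = sum(counts[i][j] for i in range(len(y_values)))
    return sum(
        y_values[i] * Fraction(counts[i][j], column_total)
        for i in range(len(y_values))
    )

for j, x in enumerate(x_values):
    print("E(Y | X =", x, ") =", conditional_expectation(j))
```

As in the table above, the conditional mean of Y changes with the value of X on which we condition.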
We will consider the conditional expectation function in greater depth in our discussion of
Chapters 2 and 3. Let’s turn to them now.
REDACTIONS
Gujarati has an example in which we are interested in whether a high school student’s Math SAT
score is related to annual family income. We will consider ideas in terms of this example.
Annual Family Income is the independent variable (X2) which affects a student’s Math SAT, the
dependent variable (Y).
Table 2-1 identifies the hypothetical population of families: there are 100 families whose income
falls into one of ten levels which range from $5,000 to $150,000. So, X is discrete with 10
possible values.
For each income level, there are 10 families. The Math SAT score for the relevant student in
each family is identified in the body of the table.
You should note that the Table does not identify a joint probability mass function (“jpmf”).
A jpmf would identify the percentage of the population which has a given (income, Math SAT
score) combination, for all possible combinations of the two variables.
To construct the jpmf we need to identify the possible values of the two discrete variables in the
population. We have discussed the possible income values.
SAT scores: a review of the table indicates that the Math scores lie between 410 and 600.
So, the jpmf will identify the probability of observing a given combination of Math score which
lies in that range of values and a given income level.
410 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01
420 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03
430 0.00 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02
440 0.01 0.00 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.04
450 0.01 0.02 0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.07
460 0.02 0.00 0.01 0.02 0.00 0.01 0.00 0.00 0.00 0.00 0.06
470 0.01 0.00 0.01 0.00 0.02 0.00 0.01 0.00 0.01 0.00 0.06
480 0.00 0.02 0.01 0.00 0.00 0.02 0.02 0.01 0.01 0.00 0.09
490 0.01 0.00 0.01 0.02 0.00 0.00 0.00 0.01 0.00 0.00 0.05
500 0.01 0.01 0.00 0.00 0.01 0.00 0.01 0.02 0.01 0.00 0.07
510 0.00 0.02 0.01 0.01 0.02 0.02 0.01 0.00 0.00 0.01 0.10
520 0.00 0.01 0.01 0.02 0.00 0.00 0.02 0.00 0.01 0.02 0.09
530 0.00 0.00 0.01 0.00 0.02 0.01 0.01 0.01 0.01 0.00 0.07
540 0.00 0.00 0.00 0.01 0.00 0.02 0.00 0.02 0.01 0.01 0.07
550 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.00 0.01 0.01 0.05
560 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.02 0.01 0.02 0.06
570 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02
580 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.02
590 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01
600 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01
fx(x) 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10
Relate the probabilities to Table 2-1. E.g., how many families have a (410, $5k) combination?
How many have a (410, $15k) combination? Because the population contains 100 families, each
0.01 of probability corresponds to one family.
Marginal Probabilities
We have already discussed marginal probabilities. I noted that we obtain them, in the discrete
case, by summing down rows or across columns.
You will remember that we can think of marginal probabilities as identifying the probability
distribution of one of the rv's in the joint distribution regardless of the value of the other rv.
In this OH, the bottom row, fx(x), was obtained by summing down each column.
It is the marginal distribution of income. You will note that it tells us the probability a household
had a certain level of income, regardless of the student’s Math SAT score.
Ex. 0.10 implies a 10% probability a household has an annual income = $5,000.
Recall that for the joint distribution f(X,Y) we represent the marginal distribution of X as fx(X).
You may note that we could do the same for Math SAT scores: for a given SAT score we sum
across all possible incomes to get the probability of observing a given score.
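The two marginal distributions can be checked with a short sketch. It assumes the jpmf above is stored as family counts (each probability times the 100 families in the population); the names COUNTS, INCOMES_K, fx and fy are illustrative and do not come from Gujarati.

```python
# Families per (Math SAT score, income level) cell: each count is the
# jpmf entry times the 100 families in the population (0.01 -> 1 family).
INCOMES_K = [5, 15, 25, 35, 45, 55, 65, 75, 90, 150]  # income in $ thousands
COUNTS = {
    410: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    420: [2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    430: [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    440: [1, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    450: [1, 2, 1, 1, 1, 1, 0, 0, 0, 0],
    460: [2, 0, 1, 2, 0, 1, 0, 0, 0, 0],
    470: [1, 0, 1, 0, 2, 0, 1, 0, 1, 0],
    480: [0, 2, 1, 0, 0, 2, 2, 1, 1, 0],
    490: [1, 0, 1, 2, 0, 0, 0, 1, 0, 0],
    500: [1, 1, 0, 0, 1, 0, 1, 2, 1, 0],
    510: [0, 2, 1, 1, 2, 2, 1, 0, 0, 1],
    520: [0, 1, 1, 2, 0, 0, 2, 0, 1, 2],
    530: [0, 0, 1, 0, 2, 1, 1, 1, 1, 0],
    540: [0, 0, 0, 1, 0, 2, 0, 2, 1, 1],
    550: [0, 0, 0, 0, 1, 1, 1, 0, 1, 1],
    560: [0, 0, 0, 0, 0, 0, 1, 2, 1, 2],
    570: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    580: [0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
    590: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    600: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
}
N_FAMILIES = 100

# Marginal distribution of income: sum each column over all scores.
fx = [sum(row[j] for row in COUNTS.values()) / N_FAMILIES
      for j in range(len(INCOMES_K))]

# Marginal distribution of Math SAT score: sum each row over all incomes.
fy = {score: sum(row) / N_FAMILIES for score, row in COUNTS.items()}

print("fx:", fx)
print("fy(510):", fy[510])
```

Storing counts rather than decimals is only a convenience here: it keeps the sums exact before the single division by 100.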
Conditional Probabilities
I will first define conditional probabilities and then interpret them. If we have two random
variables, X and Y, the conditional pmf of Y given X is
f(Y|X) = f(X,Y)/fx(X).
Ex. If we fix X at $5,000, it identifies the probability of observing each of the different Math SAT
scores among families with that income.
How do we calculate it? We take fx(5,000) = 0.10 and then divide each element in the X = $5,000
column by 0.10. Let’s do it for some of the Math scores.
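The division for the $5,000 column can be sketched as follows. This is a minimal illustration, assuming the joint probabilities are copied from the jpmf above; the names joint_5k, fx_5k and cond_5k are made up for the example.

```python
# Joint probabilities from the $5,000 income column of the jpmf,
# keyed by Math SAT score (scores with probability 0.00 are omitted).
joint_5k = {410: 0.01, 420: 0.02, 440: 0.01, 450: 0.01,
            460: 0.02, 470: 0.01, 490: 0.01, 500: 0.01}

fx_5k = 0.10  # marginal probability that income is $5,000

# f(y | X = 5,000) = f(5,000, y) / fx(5,000); rounded to hide float noise.
cond_5k = {score: round(p / fx_5k, 2) for score, p in joint_5k.items()}

for score, p in cond_5k.items():
    print(score, p)
```

For instance, a score of 420 has joint probability 0.02, so its conditional probability given a $5,000 income is 0.02/0.10 = 0.20.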
If you do the same for other income levels, you will note that we are effectively dividing each
column of the jpmf by the corresponding entry in the fx(X) row. The result we get is
Math SAT $5k $15k $25k $35k $45k $55k $65k $75k $90k $150k
410 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
420 0.20 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
430 0.00 0.10 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
440 0.10 0.00 0.10 0.10 0.10 0.00 0.00 0.00 0.00 0.00
450 0.10 0.20 0.10 0.10 0.10 0.10 0.00 0.00 0.00 0.00
460 0.20 0.00 0.10 0.20 0.00 0.10 0.00 0.00 0.00 0.00
470 0.10 0.00 0.10 0.00 0.20 0.00 0.10 0.00 0.10 0.00
480 0.00 0.20 0.10 0.00 0.00 0.20 0.20 0.10 0.10 0.00
490 0.10 0.00 0.10 0.20 0.00 0.00 0.00 0.10 0.00 0.00
500 0.10 0.10 0.00 0.00 0.10 0.00 0.10 0.20 0.10 0.00
510 0.00 0.20 0.10 0.10 0.20 0.20 0.10 0.00 0.00 0.10
520 0.00 0.10 0.10 0.20 0.00 0.00 0.20 0.00 0.10 0.20
530 0.00 0.00 0.10 0.00 0.20 0.10 0.10 0.10 0.10 0.00
540 0.00 0.00 0.00 0.10 0.00 0.20 0.00 0.20 0.10 0.10
550 0.00 0.00 0.00 0.00 0.10 0.10 0.10 0.00 0.10 0.10
560 0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.20 0.10 0.20
570 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.10
580 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.10 0.00
590 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.10
600 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.10
Each column in the table is a pmf in itself, except it’s for a given level of income. You may see
this by summing the probabilities down each column. You will see that they sum to one.
To see what I mean, consider the column of probabilities under $5,000 in income.
What can we say about it? The 1st thing to note is that the probabilities down the column sum to
one.
In light of all probabilities being non-negative, we see that the probabilities in that column form
a pmf.
Since we have performed the same calculation for each column, we see that each column in the
Table is a pmf in & of itself. Thus, we could graph a distribution for each column.
We might think that the distribution of Math SAT scores differs across family income levels; i.e.,
that income affects Math scores.
Compare the $5,000 and $150,000 income columns and the scores at which the probability mass
is concentrated in each.
We see that, as we would expect, students from higher income families tend to have higher Math
SAT scores.
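One way to make the comparison concrete is to look at the range of scores that carry positive probability in each column. The sketch below assumes the two conditional pmfs are copied from the table above; the helper name support is hypothetical.

```python
# Conditional pmfs of Math SAT score for the lowest and highest income
# levels, copied from the $5k and $150k columns (zero entries omitted).
cond_5k = {410: 0.10, 420: 0.20, 440: 0.10, 450: 0.10,
           460: 0.20, 470: 0.10, 490: 0.10, 500: 0.10}
cond_150k = {510: 0.10, 520: 0.20, 540: 0.10, 550: 0.10,
             560: 0.20, 570: 0.10, 590: 0.10, 600: 0.10}


def support(pmf):
    """Return the lowest and highest scores that carry positive probability."""
    scores = [s for s, p in pmf.items() if p > 0]
    return min(scores), max(scores)


print("$5k  :", support(cond_5k))
print("$150k:", support(cond_150k))
```

In this hypothetical population the two ranges do not overlap: every score with positive probability in the $150,000 column exceeds every such score in the $5,000 column.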
So, as was suggested above, taking into account the level of family income provides insights into
Math SAT scores.
If we had focused simply on joint probabilities, we would have had a hard time discerning this
difference. If we had focused solely on marginal distributions, we would have lost this
information entirely. Indeed, we can compare the conditional distributions with the marginal distribution.
Conditional Expectations
As I noted when we discussed probability distributions of one variable, we often are interested in
reporting summary #'s which characterize a probability distribution. The most popular summary
# is the expected value of a variable.
As we did in the case of a single variable, we can report expectations for our conditional
distributions.
We call them conditional expectations: they report the expected value of a variable
(here, the Math SAT score) conditional on the value of another variable (family income)
[Gujarati 4th: 23].
We calculate them as we would any expected value. For discrete variables, we calculate
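\[ E(Y \mid X) = \sum_{i=1}^{n} y_i \, f(y_i \mid X), \]
where n is the number of possible values of Y. For example, for X = $5,000,
\[ E(Y \mid X = 5{,}000) = \sum_{i=1}^{n} y_i \, f(y_i \mid X = 5{,}000). \]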
The conditional expectations for each column are contained in the row at the bottom of Table 2-1.
They are
Income $5k $15k $25k $35k $45k $55k $65k $75k $90k $150k
E(Y|X) 452 475 478 488 496 505 512 528 530 552
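The reported conditional expectations can be reproduced from the jpmf. The sketch below assumes the jpmf is stored as family counts (probability times the 100 families); dividing a column's score-weighted count by the column total is the same as applying the conditional pmf. The names COUNTS, INCOMES_K and cond_exp are illustrative only.

```python
# Families per (Math SAT score, income level) cell: jpmf entry times 100.
INCOMES_K = [5, 15, 25, 35, 45, 55, 65, 75, 90, 150]  # income in $ thousands
COUNTS = {
    410: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    420: [2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    430: [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    440: [1, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    450: [1, 2, 1, 1, 1, 1, 0, 0, 0, 0],
    460: [2, 0, 1, 2, 0, 1, 0, 0, 0, 0],
    470: [1, 0, 1, 0, 2, 0, 1, 0, 1, 0],
    480: [0, 2, 1, 0, 0, 2, 2, 1, 1, 0],
    490: [1, 0, 1, 2, 0, 0, 0, 1, 0, 0],
    500: [1, 1, 0, 0, 1, 0, 1, 2, 1, 0],
    510: [0, 2, 1, 1, 2, 2, 1, 0, 0, 1],
    520: [0, 1, 1, 2, 0, 0, 2, 0, 1, 2],
    530: [0, 0, 1, 0, 2, 1, 1, 1, 1, 0],
    540: [0, 0, 0, 1, 0, 2, 0, 2, 1, 1],
    550: [0, 0, 0, 0, 1, 1, 1, 0, 1, 1],
    560: [0, 0, 0, 0, 0, 0, 1, 2, 1, 2],
    570: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    580: [0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
    590: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    600: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
}

cond_exp = []
for j in range(len(INCOMES_K)):
    # Counts of families at income level j, keyed by score.
    column = {score: row[j] for score, row in COUNTS.items()}
    families = sum(column.values())  # equals fx(x) times 100
    # E(Y | X = x): score-weighted counts divided by the column total.
    cond_exp.append(sum(s * c for s, c in column.items()) / families)

for income, mean_score in zip(INCOMES_K, cond_exp):
    print(f"${income}k: {mean_score:.0f}")
```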
The row reveals that mean Math SAT scores vary substantially across levels of
income: they differ by 100 points between the lowest and highest family income levels.
So, knowing a family’s income level appears to be important in inferring the Math SAT score of
a high school student in that family.
Figure 2-1 (4th: 24) plots the CEs on a graph with a line joining them. We see that the line
slopes upward.
This discussion about CEs ties nicely into our discussion of regression analysis, for, as we will
see, regression analysis is about a special type of conditional expectation.