Applied Econometrics Module
Contents

Preface
1 Introduction
1.1 Why Study Econometrics?
1.2 The main objective of this module
1.3 Learning Outcomes
1.4 Prerequisites
1.5 Resources
2 Introduction to Econometrics
2.1 What is Econometrics?
2.2 What Is Regression Analysis?
2.3 Single-Equation Linear Models
2.4 The Stochastic Error Term
2.4.1 The Significance of the Stochastic Disturbance Term
2.5 Few Points on Notations
2.6 The Estimated Regression Equation
2.7 Structures of Economic Data
2.7.1 Cross-Sectional Data
2.7.2 Time Series Data
2.7.3 Pooled Cross Sections
2.7.4 Panel or Longitudinal Data
2.8 Introduction to Stata
(Footnote: The data used in this chapter are from Gujarati, Damodar N. (2012), Econometrics by Example, Palgrave Macmillan. This dataset is posted on the course webpage.)
Preface
The material in this module is designed to cover a single-semester course in applied econometrics for MBA students at the graduate (Masters) level at Addis Ababa University and in most MBA programs elsewhere. The notes are designed to equip students with the basic tools of applied econometrics needed to undertake quantitative research in business and economics, and to read and understand academic journal articles based on quantitative research. In addition, the lecture notes are meant to serve students as tools for conducting their own research in different branches of business and economics. The basic philosophy behind the preparation of the module is that quantitative courses are tools for understanding the literature and conducting rigorous research in business and economics. To this effect, we have tried our best to discuss the business and economic applications of the topics covered in this course. Students are advised to practice the techniques discussed in the material using datasets available online and suitable software. The software used in the material is Stata.
The module is organized into eight chapters. Chapter 1 motivates the course by introducing students to the use of econometrics for applied research in business and economics, referring to some prominent examples in the discipline. The chapter also outlines the prerequisites of the course and what students can expect to gain from it. Chapter 2 deals with the structure of econometrics and introduces one of the basic concepts in econometrics, regression analysis: what it is, how it works, and what researchers hope to gain from it. In Chapter 3 the module introduces one of the most basic and commonly used estimation techniques, the ordinary least squares (OLS) method, while Chapter 4 introduces some basic concepts of the classical linear regression model and the Gauss-Markov theorem. Chapter 5 deals with hypothesis testing and statistical inference: the distributional assumptions about the estimates, the t-test, p-values, and the F-test are discussed there. Chapter 6 presents violations of the statistical assumptions and what to do when the assumptions are violated; the chapter deals with three such violations: multicollinearity, serial correlation, and heteroskedasticity. Chapter 7 introduces the regression methods used when the dependent variable is categorical or limited; accordingly, it covers the linear probability model and the logit, probit, and tobit models. Finally, Chapter 8 briefly introduces time-series econometrics.
Chapter 1
Introduction
1.4 Prerequisites
Econometrics is an interdisciplinary field. It uses insights from economics and business in selecting the relevant variables and models, it uses computer-science methods to collect the data and to solve econometric models, and it uses statistics and mathematics to develop econometric methods that are appropriate for the data and the problem at hand. Accordingly, in this course it is assumed that students have some familiarity with basic concepts of differentiation (calculus) and basic statistical concepts (random variables, samples, populations, measures of central tendency, measures of dispersion, measures of skewness and kurtosis, methods of estimation, properties of estimators, hypothesis testing, and confidence intervals). Most econometrics texts contain these and further prerequisites in their appendices for easy reference.
1.5 Resources
This is one of the standard courses offered at most universities worldwide. As a result, lecture notes, sample exam questions with solutions, and similar material are relatively easy to find if one has access to the internet. The best way to use the internet is not to search for material on the whole course; instead, students are advised to follow the topics in the lecture notes closely and then look for supplementary material on topics for which they feel they need it.
Finally, it is important to note that, as usual, quantitative courses can be mastered only through repeated exercises. Accordingly, students are advised to try all the problems listed at the end of this module and also to practice with data (their own data, data that accompany the textbooks, or data freely available online from the World Bank, the IMF, and other research and teaching institutions). For additional resources, visit the course website at https://sites.google.com/site/sisayrsenbeta/home/econometrics.
Chapter 2
Introduction to Econometrics
In this chapter we discuss some basic issues in applied econometrics. The course assumes that you are familiar with basic concepts of statistics, such as descriptive and inferential statistics, and with a few concepts of probability. Econometrics plays a number of roles in forecasting and in analyzing real data and problems. At the core of these roles, however, is the desire to pin down the magnitudes of effects and test their significance. Economic theory often points to the direction of a causal relationship (if income rises we may expect consumption to rise), but theory rarely suggests an exact magnitude.
Q = \beta_0 + \beta_1 P + \beta_2 P_s + \beta_3 Y_d   (2.1)
The number 0.23 is called an estimated regression coefficient, and it is the ability to estimate these coefficients that makes econometrics valuable. The second use of econometrics is hypothesis testing: the evaluation of alternative theories with quantitative evidence. Much of economics involves building theoretical models and testing them against evidence, and hypothesis testing is vital to that scientific approach. For example, you could test the hypothesis that the product in Equation 2.1 is what economists call a normal good (one for which the quantity demanded increases when disposable income increases). This can be done by applying various statistical tests to the estimated coefficient (0.23) of disposable income (Y_d) in the estimated equation.
At first glance, the evidence would seem to support this hypothesis, because the coefficient's sign is positive, but the "statistical significance" of that estimate would have to be investigated before such a conclusion could be justified. Even though
the estimated coefficient is positive, as expected, it may not be sufficiently different from zero to convince us that the true coefficient is indeed positive. The third and most difficult use of econometrics is to forecast or predict what is likely to happen next quarter, next year, or further into the future, based on what has happened in the past. For example, economists use econometric models to make forecasts of variables like sales, profits, Gross Domestic Product (GDP), and the inflation rate. The accuracy of such forecasts depends in large measure on the degree to which the past is a good guide to the future.
Business leaders and politicians tend to be especially interested in this use of econometrics because they need to make decisions about the future, and the penalty for being wrong (bankruptcy for the entrepreneur and political defeat for the candidate) is high. To the extent that econometrics can shed light on the impact of their policies, business and government leaders will be better equipped to make decisions. For example, if the president of a company that sold the product modeled in Equation 2.1 wanted to decide whether to increase prices, forecasts of sales with and without the price increase could be calculated and compared to help make such a decision.
The following steps are followed in empirical econometric analysis:
1. specifying the models or relationships to be studied
2. collecting the data needed to quantify the models
3. quantifying the models with the data
The specifications used in step 1 and the techniques used in step 3 differ widely between and within disciplines. Choosing the best specification for a given model is a theory-based skill that is often referred to as the "art" of econometrics. There are many alternative approaches to quantifying the same equation, and each approach may produce somewhat different results. The choice of approach is left to the individual econometrician (the researcher using econometrics), but each researcher should be able to justify that choice.
The term regression was introduced into statistics by Francis Galton, who found that the average height of sons of a group of tall fathers was less than their fathers' height and the average height of sons of a group of short fathers was greater than their fathers' height, thus "regressing" tall and short sons alike toward the average height of all men. In the words of Galton, this was "regression to mediocrity."
Econometricians use regression analysis to make quantitative estimates of economic relationships that previously have been purely theoretical in nature. After all, anybody can claim that the quantity demanded of a normal good will increase if the price of that good decreases (holding everything else constant), but not many people can put specific numbers into an equation and estimate by how many units the quantity demanded will increase for each Birr that the price decreases. To predict the direction of the change, you need a knowledge of economic theory and of the general characteristics of the product in question.
To predict the amount of the change, though, you need a sample of data, and you need a way to estimate the relationship. The most frequently used method to estimate such a relationship in econometrics is regression analysis.
Regression analysis is a statistical technique that attempts to "explain" movements in one variable, the dependent variable, as a function of movements in a set of other variables, called the independent (or explanatory) variables, through the quantification of one or more equations. For example, consider again the demand equation:
Q = \beta_0 + \beta_1 P + \beta_2 P_s + \beta_3 Y_d   (2.3)
It is important to note that regression analysis alone cannot confirm causality; it can only test the strength and direction of the quantitative relationships involved.
Even for a correctly specified model such as

Y = \beta_0 + \beta_1 X   (2.4)

there is still going to be some variation in Y that simply cannot be explained by the model. This variation probably comes from sources such as omitted influences, measurement error, incorrect functional form, or purely random and totally unpredictable occurrences. By random we mean something that has its value determined entirely by chance.
Econometricians admit the existence of such inherent unexplained variation ("error") by explicitly including a stochastic (or random) error term in their regression models. A stochastic error term is a term that is added to a regression equation to introduce all of the variation in Y that cannot be explained by the included Xs. It is, in effect, a symbol of the econometrician's ignorance or inability to model all the movements of the dependent variable.
E(Y|X) = \beta_0 + \beta_1 X   (2.6)

which states that the expected value of Y given X, denoted E(Y|X), is a linear function of the independent variable (or variables, if there are more than one).
Unfortunately, the value of Y observed in the real world is unlikely to be exactly equal to the deterministic expected value E(Y|X). After all, not all 13-year-old girls are 175 cm tall. As a result, the stochastic element must be added to the equation:

Y = E(Y|X) + \epsilon = \beta_0 + \beta_1 X + \epsilon   (2.7)
To get a better feeling for these components of the stochastic error term, let’s
think about a consumption function (aggregate consumption as a function of ag-
gregate disposable income). First, consumption in a particular year may have been
less than it would have been because of uncertainty over the future course of the
economy. Since this uncertainty is hard to measure, there might be no variable
measuring consumer uncertainty in the equation. In such a case, the impact of the
omitted variable (consumer uncertainty) would likely end up in the stochastic error
term.
Second, the observed amount of consumption may have been different from the
actual level of consumption in a particular year due to an error (such as a sampling
error) in the measurement of consumption in the National Income Accounts. Third,
the underlying consumption function may be nonlinear, but a linear consumption
function might be estimated.
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i,   i = 1, 2, ..., N   (2.8)
where Y_i is the ith observation of the dependent variable, X_i is the ith observation of the independent variable, \epsilon_i is the ith observation of the stochastic error term, \beta_0 and \beta_1 are the regression coefficients, and N is the number of observations. That is, the regression model is assumed to hold for each observation. The coefficients do not change from observation to observation, but the values of Y, X, and \epsilon do. A second notational addition allows for more than one independent variable. Since more than one independent variable is likely to have an effect on the dependent variable, our notation should allow these additional explanatory Xs to be added. If we define:
X_{1i} = the ith observation of the first independent variable,
X_{2i} = the ith observation of the second independent variable,
X_{3i} = the ith observation of the third independent variable,
then all three variables can be expressed as determinants of Y.
The resulting equation from the process outlined above is called a multivariate (more than one independent variable) linear regression model:

Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \epsilon_i
The meaning of the regression coefficient \beta_1 in this equation is the impact of a one-unit increase in X_1 on the dependent variable Y, holding constant X_2 and X_3. Similarly, \beta_2 gives the impact of a one-unit increase in X_2 on Y, holding X_1 and X_3 constant. These multivariate regression coefficients (which are parallel in nature to partial derivatives in calculus) serve to isolate the impact on Y of a change in one variable from the impact on Y of changes in the other variables. This is possible because multivariate regression takes the movements of X_2 and X_3 into account when it estimates the coefficient of X_1. The result is quite similar to what we would obtain if we were capable of conducting controlled laboratory experiments in which only one variable at a time was changed.
In the real world, though, it is very difficult to run controlled economic experiments, because many economic factors change simultaneously, often in opposite
directions. Thus the ability of regression analysis to measure the impact of one variable on the dependent variable, holding constant the influence of the other variables in the equation, is a tremendous advantage. Note that if a variable is not included in an equation, then its impact is not held constant in the estimation of the regression coefficients.
An example of multivariate regression: suppose we want to understand how wages are determined in a particular field, perhaps because we think that there might be discrimination in that field. The wage of a worker would be the dependent variable (WAGE), but what would be good independent variables? What variables would influence a person's wage in a given field? Well, there are literally dozens of reasonable possibilities, but three of the most common are the work experience (EXP), education (EDU), and gender (GEND) of the worker, so let's use these. To create a regression equation with these variables, we would redefine the variables in the equation above to match our definitions:
Y = WAGE = the wage of the worker
X_1 = EXP = the years of work experience of the worker
X_2 = EDU = the years of education beyond high school of the worker
X_3 = GEND = the gender of the worker (1 = male and 0 = female)
The last variable, GEND, is unusual in that it can take on only two values, 0 and 1; this kind of variable is called a dummy variable, and it is extremely useful when we want to quantify a concept that is inherently qualitative (like gender). If we substitute these definitions into the equation above, we get:

WAGE_i = \beta_0 + \beta_1 EXP_i + \beta_2 EDU_i + \beta_3 GEND_i + \epsilon_i
where i goes from 1 to N and indicates the observation number. If the sample
consists of a series of years or months (called a time series), then the subscript i is
usually replaced with a t to denote time.
2.7 Structures of Economic Data
Economic data sets come in several different structures, including:
1. Cross-sectional data
2. Time series data
3. Pooled cross sections
4. Panel or longitudinal data
5. Experimental data
2.7.3 Pooled Cross Sections
As an example of a pooled cross section, suppose that two cross-sectional household surveys are taken, one in 1985 and one in 1990. In 1985, a random sample of households is surveyed for variables such as income, savings, family size, and so on. In 1990, a new random sample of households is taken using the same survey questions. To increase our sample size, we can form a pooled cross section by combining the two years.
Pooling cross sections from different years is often an effective way of analyzing the effects of a new government policy. The idea is to collect data from the years before and after a key policy change. As an example, consider the following data set on housing prices taken in 1993 and 1995, before and after a reduction in property taxes in 1994. Suppose we have data on 250 houses for 1993 and on 270 houses for 1995.
Observations 1 through 250 correspond to the houses sold in 1993, and observations 251 through 520 correspond to the 270 houses sold in 1995. Although the order in which we store the data turns out not to be crucial, keeping track of the year for each observation is usually very important. This is why we enter year as a separate variable.
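A minimal Stata sketch of how such a pooled cross section might be assembled is given below; the dataset and variable names here are hypothetical and depend on how the data were saved.

* combine the two (hypothetical) yearly cross sections into one pooled data set
use hprice1993, clear
append using hprice1995
* record the year of each observation with an indicator variable
gen after_reform = (year == 1995)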
A pooled cross section is analyzed much like a standard cross section, except that we often need to account for secular differences in the variables across time. In fact, in addition to increasing the sample size, the point of a pooled cross-sectional analysis is often to see how a key relationship has changed over time.
2.7.4 Panel or Longitudinal Data
Because panel data require replication of the same units over time, panel data sets, especially those on individuals, households, and firms, are more difficult to obtain than pooled cross sections. Not surprisingly, observing the same units over time leads to several advantages over cross-sectional data or even pooled cross-sectional data. The benefit that we will focus on in this text is that having multiple observations on the same units allows us to control for certain unobserved characteristics of individuals, firms, and so on. As we will see, the use of more than one observation can facilitate causal inference in situations where inferring causality would be very difficult if only a single cross section were available. A second advantage of panel data is that they often allow us to study the importance of lags in behavior or the result of decision making. This information can be significant because many economic policies can be expected to have an impact only after some time has passed.
Causality and the Notion of Ceteris Paribus in Econometric Analysis
In most tests of economic theory, and certainly for evaluating public policy, the economist's goal is to infer that one variable (such as education) has a causal effect on another variable (such as worker productivity). Simply finding an association between two or more variables might be suggestive, but unless causality can be established, it is rarely compelling.
The notion of ceteris paribus, which means "other (relevant) factors being equal," plays an important role in causal analysis. This idea has been implicit in some of our earlier discussion, particularly Examples 1.1 and 1.2, but thus far we have not explicitly mentioned it. You probably remember from introductory economics that most economic questions are ceteris paribus by nature. For example, in analyzing consumer demand, we are interested in knowing the effect of changing the price of a good on its quantity demanded, while holding all other factors (such as income, prices of other goods, and individual tastes) fixed. If other factors are not held fixed, then we cannot know the causal effect of a price change on quantity demanded.
Holding other factors fixed is critical for policy analysis as well. In the job training example (Example 1.2), we might be interested in the effect of another week of job training on wages, with all other components being equal (in particular, education and experience). If we succeed in holding all other relevant factors fixed and then find a link between job training and wages, we can conclude that job training has a causal effect on worker productivity. Although this may seem pretty simple, even at this early stage it should be clear that, except in very special cases, it will not be possible to literally hold all else equal. The key question in most empirical studies is: have enough other factors been held fixed to make a case for causality? Rarely is an econometric study evaluated without raising this issue.
In most serious applications, the number of factors that can affect the variable of interest (such as criminal activity or wages) is immense, and the isolation of any particular variable may seem like a hopeless effort. However, we will eventually see that, when carefully applied, econometric methods can simulate a ceteris paribus experiment.
At this point, we cannot yet explain how econometric methods can be used to estimate ceteris paribus effects, so we will consider some problems that can arise
in trying to infer causality in economics. We do not use any equations in this
discussion. For each example, the problem of inferring causality disappears if an
appropriate experiment can be carried out. Thus, it is useful to describe how such
an experiment might be structured, and to observe that, in most cases, obtaining
experimental data is impractical. It is also helpful to think about why the available
data fail to have the important features of an experimental data set.
We rely for now on your intuitive understanding of such terms as random, independence, and correlation, all of which should be familiar from an introductory
probability and statistics course. (These concepts are reviewed in Appendix B.) We
begin with an example that illustrates some of these important issues.
The next example is more representative of the difficulties that arise when inferring causality in applied economics.
Stated informally, the question is posed as follows: if a person is chosen from the population and given another year of education, by how much will his or her wage increase? As with the previous examples, this is a ceteris paribus question, which implies that all other factors are held fixed while another year of education is given to the person.
We can imagine a social planner designing an experiment to get at this issue, much as the agricultural researcher can design an experiment to estimate fertilizer effects. Assume, for the moment, that the social planner has the ability to assign any level of education to any person. How would this planner emulate the fertilizer experiment in Example 1.3? The planner would choose a group of people and randomly assign each person an amount of education; some people are given an eighth-grade education, some are given a high school education, some are given two years of college, and so on. Subsequently, the planner measures wages for this group of people (where we assume that each person then works in a job). The people here are like the plots in the fertilizer example, where education plays the role of fertilizer and the wage rate plays the role of soybean yield. As with Example 1.3, if levels of education are assigned independently of other characteristics that affect productivity (such as experience and innate ability), then an analysis that ignores these other factors will
yield useful results. Again, it will take some effort in a later chapter to justify this claim; for now, we state it without support.
The omitted factors of experience and ability in the wage example have analogs in the fertilizer example. Experience is generally easy to measure and therefore is similar to a variable such as rainfall. Ability, on the other hand, is nebulous and difficult to quantify; it is similar to land quality in the fertilizer example. As we will see throughout this text, accounting for other observed factors, such as experience, when estimating the ceteris paribus effect of another variable, such as education, is relatively straightforward. We will also find that accounting for inherently unobservable factors, such as ability, is much more problematic. It is fair to say that many of the advances in econometric methods have tried to deal with unobserved factors in econometric models.
One final parallel can be drawn between Examples 1.3 and 1.4. Suppose that in the fertilizer example, the fertilizer amounts were not entirely determined at random. Instead, the assistant who chose the fertilizer levels thought it would be better to put more fertilizer on the higher-quality plots of land. (Agricultural researchers should have a rough idea about which plots of land are of better quality, even though they may not be able to fully quantify the differences.) This situation is completely analogous to the level of schooling being related to unobserved ability in Example 1.4. Because better land leads to higher yields, and more fertilizer was used on the better plots, any observed relationship between yield and fertilizer might be spurious.
Difficulty in inferring causality can also arise when studying data at fairly high levels of aggregation, as the next example on city crime rates shows.
Example: The Effect of Law Enforcement on City Crime Levels
The issue of how best to prevent crime has been, and will probably continue to be, with us for some time. One especially important question in this regard is: Does the presence of more police officers on the street deter crime?
The ceteris paribus question is easy to state: If a city is randomly chosen and given, say, ten additional police officers, by how much would its crime rates fall? Another way to state the question is: If two cities are the same in all respects, except that city A has ten more police officers than city B, by how much would the two cities' crime rates differ?
It would be virtually impossible to find pairs of communities identical in all respects except for the size of their police force. Fortunately, econometric analysis does not require this. What we do need to know is whether the data we can collect on community crime levels and the size of the police force can be viewed as experimental. We can certainly imagine a true experiment involving a large collection of cities where we dictate how many police officers each city will use for the upcoming year.
Although policies can be used to affect the size of police forces, we clearly cannot tell each city how many police officers it can hire. If, as is likely, a city's decision on how many police officers to hire is correlated with other city factors that affect crime, then the data must be viewed as nonexperimental. In fact, one way to view this problem is to see that a city's choice of police force size and the amount of crime are simultaneously determined. We will explicitly address such problems in a later chapter.
The first three examples we have discussed have dealt with cross-sectional data at various levels of aggregation (for example, at the individual or city levels). The same hurdles arise when inferring causality in time series problems.
Even when economic theories are not most naturally described in terms of causality, they often have predictions that can be tested using econometric methods.
2.8 Introduction to Stata
B. Of the five preceding windows, the Command and Output windows are probably the most important for ongoing analyses.
1. The Review, Variables, and Properties windows are intended primarily to
keep track of information that you have already provided to the Stata system.
2. You can insert commands from the Review window and variable names from
the Variables window into the Command window in order to save yourself some
typing.
C. Some additional windows can appear as needed or in response to particular
Stata commands.
1. The Viewer window displays help files (in response to user requests for assistance) and the Stata log (a user-requested permanent record of the Stata session).
2. The Graph window displays graphs produced from Stata commands.
3. The Data Browser and Data Editor windows allow the user to inspect (and,
with the Data Editor, modify) the contents of the current open dataset.
D. There are two general methods that a user can employ to communicate with Stata during a session.
1. The Stata menu system (i.e., "point and click").
2. Entering commands through the Command window.
E. Generally, it is better to use commands rather than menus.
1. With commands, it is much easier to keep a record of your steps and (if necessary) reproduce the contents of your analysis.
2. Commands must be used in Stata do-files (see below).
II. Some Basic Features and Rules of Stata
A. The user interacts with Stata by issuing commands that refer to datasets, variables, and other objects (e.g., directories and files outside the Stata system on the computer or the internet).
B. The user refers to each dataset or variable by its name. There are some strict, but easy, rules for creating Stata names.
1. Stata names can be composed of letters, numbers, and the underscore symbol (that is, "_").
2. Stata names can be up to 32 characters long, and the first character must be a letter or the underscore.
3. Within a dataset, each variable must have a unique name.
C. Some advice regarding Stata names:
3. Use the Stata Command entry on the Stata Help menu to find out about specific Stata commands. This is useful when you know what you want to do, but do not remember the command syntax or the available options (e.g., how do I use Stata's ttest command?).
4. When you use the Search or Stata Command items on the Help menu, Stata
returns information in the Viewer window.
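Help on a specific command can also be requested by typing the help command directly in the Command window, for example:

help ttest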
III. The Stata Session
A. A basic Stata session has three parts.
1. Read a dataset into Stata.
2. Modify the contents of the dataset (if necessary).
3. Perform the statistical analysis (or other data analysis task).
B. After data are read into Stata, the other two steps can be carried out repeatedly and in different orders (i.e., the user may want to perform an analysis and then modify the data before performing another analysis, etc.).
C. Within a Stata session only one dataset can be active at any time.
1. In order to use a second dataset within a single Stata session, the first dataset must be removed, using the clear command.
2. If the first dataset has been modified during the course of the Stata session, Stata will ask whether you want to save the dataset before clearing it. If you do not save the dataset, any changes you have made to its contents will be lost.
3. Hint: If you have modified your dataset during the course of the Stata session, save it under a new name (i.e., issue the command save newname before issuing the clear command). That way, you will have both the original, unmodified dataset and the newly modified version that you just created.
4. After you have cleared the first dataset, you can read in another dataset and continue through the other two steps of the Stata session (i.e., data modifications and statistical analyses) with the new dataset.
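This workflow might look like the following in the Command window; the dataset names below are hypothetical.

save mydata_v2        // save the modified data under a new name
clear                 // remove the current dataset from memory
use otherdata         // read the next dataset into the session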
D. Contents of the Results window during the Stata session
1. Commands, responses to commands from Stata, additional information, and
the results from statistical analyses are all printed out in the Results window.
2. Stata fills the Results window one screen at a time. If there is more content than will fit into the window, Stata stops providing the information and asks if you want to proceed by displaying -more- at the bottom of the screen.
a. Typing <enter> will make Stata show one more line of output.
b. Typing any other key (say, the space bar) will make Stata continue to produce the output.
c. Many Stata commands produce more than one screen of output, so the -more- condition will occur frequently over the course of an interactive Stata session.
IV. Reading Data into the Stata Session
A. Data can be read into Stata from a previously created and saved Stata dataset, or it can be read in "raw" form from a text file.
1. If the data are contained in a text file, then the variables must be assigned Stata names and the user must indicate to Stata whether each variable has numeric or character values. This process is called "data definition."
2. If the data consist of a previously stored Stata dataset, they will be contained in an electronic file with a ".dta" extension (Stata added this file extension when the dataset was saved). In this case, the data definition has already been completed, and the user only needs to retrieve the dataset into the Stata system.
B. Although not absolutely necessary, it is almost always useful to change the working directory at the beginning of a Stata session.
1. The working directory is the location in which Stata looks for external data files. If you create any new files during the course of your Stata session (e.g., datasets, log files, saved graphs, etc.), they will be written to the working directory unless you explicitly specify otherwise. Stata usually sets the default working directory to c:\data.
2. The cd command is used to change the working directory. Thus, if your data files are stored in the subdirectory "datasets", within the "pls802" directory on a flash drive that is identified as "g:" on the computer, you would probably begin your Stata session with the following command:

cd "g:\pls802\datasets"

The path to the working directory must be enclosed in double quotes if there are any internal blanks within any of the directory names, so the quotes are not really necessary here. But they don't hurt, either.
C. The use command reads a previously stored Stata dataset into the current session.
1. If the Stata dataset is named mydata, then it will be stored in a file named mydata.dta. If this dataset is contained in the working directory, then the command to retrieve it into the current session would be: use mydata. Note that the file extension (.dta) is not used in this command (Stata distinguishes the dataset from the file in which the dataset is stored).
2. If the Stata dataset is not contained in the current working directory, then the use command must include the full path to the dataset. This might look like the following: use g:\pls828\datasets\mydata
D. If the data are stored in "raw" form, they should be contained within an ASCII text file (usually with file extension ".txt") and the information should be arranged within the file as follows:
1. There should be one line of data per observation, and each line should end
with a hard return.
2. The variable values are given in the same order for every observation (and
each observation must have the same number of variable values).
3. There is whitespace (i.e., at least one blank space) between each adjacent pair of variable values.
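As a sketch, a whitespace-delimited text file laid out this way could be read with the infile command, naming the variables (and giving a type for any string variable) as they are read; the variable and file names below are hypothetical.

* read four hypothetical variables per observation from a raw text file
infile id age str10 region income using survey.txt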
There are two broad types of analysis commands in Stata.
1. Many commands provide descriptive analyses. For these, the user issues
the command and Stata prints the results into the Results window, completing
the analysis. Examples of descriptive analysis commands include summarize and
correlate.
2. Other commands estimate the parameters of statistical models. For these, the model estimates are retained in memory until another model is estimated. The estimates can be recalled to the Results window very easily (perhaps using different options) and supplementary operations can be carried out on the model, using Stata's post-estimation commands. The most important model estimation command (for purposes of this course, anyway) is regress.
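The distinction might look like this in practice; the variable names below are hypothetical.

* descriptive commands print their results and are done
summarize wage educ exper
correlate wage educ

* estimation commands leave their results in memory for post-estimation work
regress wage educ exper
predict wage_hat        // fitted values from the model just estimated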
C. Analyzing data by subgroups
1. Use the by prefix in order to have an analysis carried out separately on subsets of the data defined by the values of another variable. For example: by region: summarize gnp policy. In the preceding Stata command, summary statistics on the variables gnp and policy would be calculated separately for subgroups of observations defined by the distinct values of the variable region. Of course, all three of these variables must exist in the current Stata dataset.
2. In order to use the by prefix, the dataset must be sorted by the values of the variable used to define the subgroups. There are three ways to do this. First, precede the analysis command (summarize in this example) with the sort command:
sort region
by region: summarize gnp policy
Second, use the sort option in the by prefix:
by region, sort: summarize gnp policy
Third, use bysort rather than by in the prefix:
bysort region: summarize gnp policy
All of these approaches would produce identical results.
D. Analyzing a single subset of the data
1. Use the if qualifier to restrict the analysis to a subset of the current dataset defined by a logical condition. For example: summarize gnp if region == "south". The preceding expression would calculate summary statistics only for those observations in which the value of the variable region is south.
2. The general form of this qualifier is the word if, followed by a logical condition. Stata will restrict the analysis specified by the command to those observations for which the expression evaluates to TRUE.
3. While there are ways to combine the use of the by prefix and the if qualifier in a single command, it is generally not a good idea to do so.
VII. Creating, Saving, and Viewing a Session Log
A. The contents of the Stata Results window provide a record of the Stata session.
But there are two potential drawbacks to the default operation of the Results
window:
1. The contents of the Results window are stored in memory. Most computers
have a limited amount of memory available. When Stata runs out of available
memory, it truncates the oldest elements from the current Results window. This
can be problematic in a lengthy Stata session.
2. The contents of the Results window are lost when the Stata session ends.
B. In order to overcome the preceding problems, it is a good idea to save the contents of the Stata session to a separate file called a "Stata log."
1. The command to begin creating a Stata log is: log using filename
In this command, filename is the name of a new file. Do not use a file extension, because Stata will add its own extension (".smcl") onto the file's name.
2. Once the log file is opened, everything that appears in the Results window will be written to the file (it will still be displayed in the Results window, as well).
3. To stop sending the contents of the Stata session to the log file, issue the following command: log close
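A typical logged session therefore has the following shape; the log-file and variable names below are hypothetical.

log using session1      // Stata creates session1.smcl in the working directory
summarize wage educ
regress wage educ
log close               // stop writing to the log file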
C. The contents of the Stata log can be examined in the Stata Viewer window. The easiest way to do this is to click the Log item from the File menu, select View from the submenu that appears, and then browse to the file containing the log (remember that this file will have the filename that you assigned it, with an extension of .smcl).
D. The contents of the Stata log can be saved to an ASCII text file.
1. The easiest way to do this is to click the Log item from the File menu, select Translate from the submenu that appears, and then browse to the file containing the log (remember that this file will have the filename that you assigned it, with an extension of .smcl) in the "Input File" box.
2. Next, type in a new file name in the "Output File" box. When you click "Translate," the .smcl file will be translated to a text file with the extension .log. This is a regular ASCII text file that can be opened in any word processor (e.g., MS Word) or text editor (e.g., Notepad in Windows).
3. Note that the contents of a translated Stata log should be viewed in a fixed-width font, such as Courier. A relatively small size (e.g., 9 points) is best in order to avoid unnecessary line wrapping.
VIII. Using Do-files to Submit (and Save) Commands
A. Up until now, it has been assumed that the user is working with Stata in "interactive" mode; that is, typing in one command at a time in the Command window. While it is certainly possible to do this in "serious" analysis contexts, there are several reasons for not doing so.
1. In a long Stata session, it is difficult to keep track of earlier commands and steps in the course of the analysis.
2. The commands, themselves, are not saved. This is problematic because most analyses will have to be run several times. Even if some changes will be made in
Chapter 3
Ordinary Least Squares
3.1 Introduction
The bread and butter of regression analysis is the estimation of the coefficients of econometric models using a technique called Ordinary Least Squares (OLS). The first two sections of this chapter summarize the reasoning behind and the mechanics of OLS. Regression users rely on computers to do the actual OLS calculations, so the emphasis here is on understanding what OLS attempts to do and how it goes about doing it.
How can you tell a good equation from a bad one once it has been estimated? There are a number of useful criteria, including the extent to which the estimated equation fits the actual data. A focus on fit is not without perils, however, so we share an example of the misuse of this criterion.
6. Define the elasticity of y with respect to x and explain its computation in the simple linear regression model when y and x are not transformed in any way, and when y and/or x have been transformed to model a nonlinear relationship.
7. Explain the meaning of the statement "If regression model assumptions SR1-SR5 hold, then the least squares estimator b_2 is unbiased." In particular, what exactly does "unbiased" mean? Why is b_2 biased if an important variable has been omitted from the model?
8. Explain the meaning of the phrase "sampling variability."
9. Explain how the factors \sigma^2, \sum (x_i - \bar{x})^2, and N affect the precision with which we can estimate the unknown parameter b_2.
10. State and explain the Gauss-Markov theorem.
11. Use the least squares estimator to estimate nonlinear relationships and interpret the results.
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i   (3.1)

\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i   (3.2)
The purpose of the estimation technique is to obtain numerical values for the coefficients of an otherwise completely theoretical regression equation.
The most widely used method of obtaining these estimates is Ordinary Least Squares (OLS), which has become so standard that its estimates are presented as a point of reference even when results from other estimation techniques are used.
Ordinary Least Squares (OLS) is a regression estimation technique that calculates the \hat{\beta}s so as to minimize the sum of the squared residuals, thus:
OLS minimizes \sum_{i=1}^{N} e_i^2   (i = 1, ..., N)   (3.3)

Since these residuals (the e_i s) are the differences between the actual Ys and the estimated Ys produced by the regression (the \hat{Y}_i s in Equation 3.2), Equation 3.3 is equivalent to saying that OLS minimizes

\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2   (3.4)
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
OLS selects those estimates of \beta_0 and \beta_1 that minimize the squared residuals, summed over all the sample data points. For an equation with just one independent variable, these coefficients are:

\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}   (3.5)

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}   (3.6)
Note that for each different data set, we will get different estimates of \beta_0 and \beta_1, depending on the sample.
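In Stata, these calculations are carried out by the regress command; a minimal sketch with hypothetical variables y and x is:

* OLS with a single regressor
regress y x
display _b[x]          // the estimated slope coefficient (beta_1 hat)
display _b[_cons]      // the estimated intercept (beta_0 hat)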
For Ordinary Least Squares, the total sum of squares has two components, variation that can be explained by the regression and variation that cannot:

\sum_{i=1}^{N} (Y_i - \bar{Y})^2 = \sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{N} e_i^2   (3.9)

Total Sum of Squares (TSS) = Explained Sum of Squares (ESS) + Residual Sum of Squares (RSS). This is usually called the decomposition of variance.
If the bread and butter of regression analysis is OLS estimation, then the heart and soul of econometrics is figuring out how good these OLS estimates are. No
one estimated model represents the truth any more than another, but evaluating the quality of the fit of the equation is one ingredient in a choice between different formulations of a regression model. Be careful, however! The quality of the fit is a minor ingredient in this choice, and many beginning researchers allow themselves to be overly influenced by it. The simplest commonly used measure of fit is R^2, or the coefficient of determination. R^2 is the ratio of the explained sum of squares to the total sum of squares:
R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2}   (3.10)
The higher R^2 is, the closer the estimated regression equation fits the sample data. Measures of this type are called "goodness of fit" measures. R^2 measures the percentage of the variation of Y around \bar{Y} that is explained by the regression equation. Since OLS selects the coefficient estimates that minimize RSS, OLS provides the largest possible R^2, given a linear model. Since TSS, RSS, and ESS are all nonnegative (being squared deviations), and since ESS \le TSS, R^2 must lie in the interval 0 \le R^2 \le 1.
A value of R^2 close to one shows an excellent overall fit, whereas a value near zero shows a failure of the estimated regression equation to explain the values of Y_i better than could be explained by the sample mean \bar{Y}.
Figure 3.3: A set of data for X and Y that can be “explained” quite well with a
regression line
\bar{R}^2 (the adjusted R^2) measures the percentage of the variation of Y around its mean that is explained by the regression equation, adjusted for degrees of freedom. \bar{R}^2 will increase, decrease, or stay the same when a variable is added to an equation, depending on whether the improvement in fit caused by the addition of the new variable outweighs the loss of the degree of freedom. An increase in \bar{R}^2 indicates that the marginal benefit of adding a variable exceeds the cost, while a decrease in \bar{R}^2 indicates that the marginal cost exceeds the benefit. The highest possible \bar{R}^2 is 1.00, the same as for R^2. The lowest possible \bar{R}^2, however, is not 0.00; if R^2 is extremely low, \bar{R}^2 can be slightly negative. \bar{R}^2 can be used to compare the fits of equations with the same dependent variable and different numbers of independent variables.
2
Because of this property, most researchers automatically use R instead of R2
when evaluating the …t of their estimated regression equations. Note, however, that
2
R is not as useful when comparing the …ts of two equations that have di¤erent
dependent variables or dependent variables that are measured di¤erently. Finally,
always remember that the quality of …t of an estimated equation is only one measure
of the overall quality of that regression. As mentioned previously, the degree to
which the estimated coe¢ cients conform to economic theory and the researcher’s
previous expectations about those coe¢ cients are just as important as the …t itself.
For instance, an estimated equation with a good …t but with an implausible sign for
an estimated coe¢ cient might give implausible predictions and thus not be a very
useful equation. Other factors, such as theoretical relevance and usefulness, also
come into play.
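After a regression, Stata reports both measures of fit, and they can also be recovered from the stored results; a brief sketch with hypothetical variables:

regress y x1 x2
display e(r2)          // R-squared
display e(r2_a)        // adjusted R-squared (R-bar squared)
* R-squared reproduced from the sums of squares, ESS/(ESS + RSS)
display e(mss)/(e(mss) + e(rss))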
Although there are no hard and fast rules for conducting econometric research, most investigators commonly follow a standard method for applied regression analysis. The relative emphasis and effort expended on each step will vary, but normally all the steps are necessary for successful research. Note that we do not discuss the selection of the dependent variable; this choice is determined by the purpose of the research. Once a dependent variable is chosen, however, it is logical to follow these six steps in applied regression analysis:
1. Review the literature and develop the theoretical model.
2. Specify the model: select the independent variables and the functional form.
3. Hypothesize the expected signs of the coefficients.
4. Collect the data; inspect and clean the data.
5. Estimate and evaluate the equation.
6. Document the results.
The first step in any applied research is to get a good theoretical grasp of the topic to be studied. That's right: the best data analysts don't start with data, but with theory! This is because many econometric decisions, ranging from which variables to include to which functional form to employ, are determined by the underlying theoretical model. It's virtually impossible to build a good econometric model without a solid understanding of the topic you're studying.
For most topics, this means that it's smart to review the scholarly literature before doing anything else. If a professor has investigated the theory behind your topic, you want to know about it. If other researchers have estimated equations for
your dependent variable, you might want to apply one of their models to your data set. On the other hand, if you disagree with the approach of previous authors, you might want to head off in a new direction. In either case, you shouldn't have to "reinvent the wheel." You should start your investigation where earlier researchers left off. Any academic paper on an empirical topic should begin with a summary of the extent and quality of previous research.
The most convenient approaches to reviewing the literature are to obtain several recent issues of the Journal of Economic Literature or a business-oriented publication of abstracts, or to run an Internet search or an EconLit search on your topic. Using these resources, find and read several recent articles on your topic. Pay attention to the bibliographies of these articles. If an older article is cited by a number of current authors, or if its title hits your topic on the head, trace back through the literature and find this article as well.
In some cases, a topic will be so new or so obscure that you won't be able to find any articles on it. What then? We recommend two possible strategies. First, try to transfer theory from a similar topic to yours. For example, if you're trying to build a model of the demand for a new product, read articles that analyze the demand for similar, existing products. Second, if all else fails, contact someone who works in the field you're investigating. For example, if you're building a model of housing in an unfamiliar city, call a real estate agent who works there.
2. Specify the model: Select the independent variables and the functional form.
The most important step in applied regression analysis is the specification of the theoretical regression model. After selecting the dependent variable, the specification of a model involves choosing the following components:
1. the independent variables and how they should be measured,
2. the functional (mathematical) form of the variables, and
3. the properties of the stochastic error term.
A regression equation is specified when each of these elements has been treated appropriately.
Each of the elements of specification is determined primarily on the basis of economic theory. A mistake in any of the three elements results in a specification error. Of all the kinds of mistakes that can be made in applied regression analysis, specification error is usually the most disastrous to the validity of the estimated equation. Thus, the more attention paid to economic theory at the beginning of a project, the more satisfying the regression results are likely to be. The emphasis in this text is on estimating behavioral equations, those that describe the behavior of economic entities. We focus on selecting independent variables based on the economic theory concerning that behavior. An explanatory variable is chosen because economic theory suggests that it is an important determinant of the dependent variable.
Once the variables have been selected, it's important to hypothesize the expected signs of the slope coefficients before you collect any data. In many cases, the basic theory is general knowledge, so you don't need to discuss the reasons for the expected sign. However, if any doubt surrounds the choice of an expected sign, then you should document the opposing theories and your reasons for hypothesizing a positive or a negative slope coefficient.
Obtaining an original data set and properly preparing it for regression is a surprisingly difficult task. This step entails more than a mechanical recording of data, because the type and size of the sample also must be chosen.
A general rule regarding sample size is "the more observations the better," as long as the observations are from the same general population. Ordinarily, researchers take all the roughly comparable observations that are readily available. In regression analysis, all the variables must have the same number of observations. They also should have the same frequency (monthly, quarterly, annual, etc.) and time period. Often, the frequency selected is determined by the availability of data.
The reason there should be as many observations as possible concerns the statistical concept of degrees of freedom, first mentioned in a previous section. Consider fitting a straight line to two points on an X, Y coordinate system. If there are only two points in a data set, a straight line can be fitted to those points mathematically without error, because two points completely determine a straight line. Both points lie on the line, so there is no estimation of the coefficients involved. The two points determine the two parameters, the intercept and the slope, precisely. Estimation takes place only when a straight line is fitted to three or more points that were generated by some process that is not exact. The excess of the number of observations (three) over the number of coefficients to be estimated (in this case two, the intercept and slope) is the degrees of freedom. All that is necessary for estimation is a single degree of freedom, but the more degrees of freedom there are, the better.
This is because when the number of degrees of freedom is large, every positive error is likely to be balanced by a negative error. When degrees of freedom are low, the random element is likely to fail to provide such offsetting observations. For example, the more a coin is flipped, the more likely it is that the observed proportion of heads will reflect the true probability of 0.5.
Another area of concern has to do with the units of measurement of the variables. Does it matter if a variable is measured in dollars or thousands of dollars? Does it matter if the measured variable differs consistently from the true variable by
Believe it or not, it can take months to complete steps 1-4 for a regression equation, but a computer program like Stata or EViews can estimate that equation in less than a second! Typically, estimation is done using OLS, but if another estimation technique is used, the reasons for that alternative technique should be carefully explained and evaluated.
You might think that once your equation has been estimated, your work is finished, but that's hardly the case. Instead, you need to evaluate your results in a variety of ways. How well did the equation fit the data? Were the signs and magnitudes of the estimated coefficients what you expected? Most of the rest of this book is concerned with the evaluation of estimated econometric equations, and beginning researchers should be prepared to spend a considerable amount of time doing this evaluation.
Once this evaluation is complete, don't automatically go to step 6. Regression results are rarely what one expects, and additional model development often is required. For example, an evaluation of your results might indicate that your equation is missing an important variable. In such a case, you'd go back to step 1 to review the literature and add the appropriate variable to your equation. You'd then go through each of the steps in order until you had estimated your new specification in step 5. You'd move on to step 6 only if you were satisfied with your estimated equation. Don't be too quick to make such adjustments, however, because we don't want to adjust the theory merely to fit the data. A researcher has to walk a fine line between making appropriate changes and avoiding inappropriate ones, and making these choices is one of the artistic elements of applied econometrics.
Finally, it’s often worthwhile to estimate additional speci…cations of an equa-
tion in order to see how stable your observed results are. This approach is called
sensitivity analysis.
amount more per hour than women. The difference does not depend on the amount of education, and this explains why the wage-education profiles for women and men are parallel.
At this point, you may wonder why we do not also include in (3.13) a dummy variable, say male, which is one for males and zero for females. This would be redundant. In (3.13), the intercept for males is β0, and the intercept for females is β0 + δ0. Because there are just two groups, we only need two different intercepts. This means that, in addition to β0, we need to use only one dummy variable; we have chosen to include the dummy variable for females. Using two dummy variables would introduce perfect collinearity because female + male = 1, which means that male is a perfect linear function of female. Including dummy variables for both genders is the simplest example of the so-called dummy variable trap, which arises when too many dummy variables describe a given number of groups. We will discuss this problem in detail later.
In (3.13), we have chosen males to be the base group or benchmark group, that is, the group against which comparisons are made. This is why β0 is the intercept for males, and δ0 is the difference in intercepts between females and males. We could choose females as the base group by writing the model as

wage = α0 + γ0 male + β1 educ + ... + u

where the intercept for females is α0 and the intercept for males is α0 + γ0.
If educ, exper, and tenure are all relevant productivity characteristics, the null hypothesis of no difference between men and women is H0: δ0 = 0. The alternative that there is discrimination against women is H1: δ0 < 0.
How can we actually test for wage discrimination? The answer is simple: just estimate the model by OLS, exactly as before, and use the usual t statistic. Nothing changes about the mechanics of OLS or the statistical theory when some of the independent variables are defined as dummy variables. The only difference with what we have done up until now is in the interpretation of the coefficient on the dummy variable. We will come back to this question in the chapter on hypothesis testing.
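As a concrete illustration, the Stata sketch below estimates a wage equation of this form. The variable names (wage, female, educ, exper, tenure) are assumptions about how a typical wage data set might be coded, not a data set supplied with this module.

* A minimal sketch of estimating a wage equation with a female dummy.
* Load your own data set first, for example:
* use wagedata.dta, clear
regress wage female educ exper tenure
* The coefficient on female estimates delta_0, the female-male intercept
* difference holding educ, exper, and tenure constant; its t statistic and
* p-value are what the discrimination test above uses.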
Chapter 4
Classical Linear Regression Model
This chapter discusses the assumptions that underlie the classical OLS regression method introduced in the preceding chapter.
4.2 Introduction
The term classical refers to a set of fairly basic assumptions required to hold in order for OLS to be considered the "best" estimator available for regression models. When
one or more of these assumptions do not hold, other estimation techniques (such as
Generalized Least Squares) may be better than OLS. As a result, one of the most
important jobs in regression analysis is to decide whether the classical assumptions
hold for a particular equation. If so, the OLS estimation technique is the best
available. Otherwise, the pros and cons of alternative estimation techniques must
be weighed. These alternatives usually are adjustments to OLS that take account of
the particular assumption that has been violated. In a sense, most of the rest of the
study in econometrics deals in one way or another with the question of what to do
when one of the classical assumptions is not met. Since econometricians spend so
much time analyzing violations of them, it is crucial that they know and understand
these assumptions.
The regression model is linear, is correctly specified, and has an additive error term.
Observations of the error term are uncorrelated with each other (no serial correlation).
The error term is normally distributed (this assumption is optional but usually is invoked).
The assumption that the regression model is linear does not require the underlying theory to be linear. For example, an exponential function:

Yi = e^(β0) Xi^(β1) e^(εi)     (4.2)

where e is the base of the natural log, can be transformed by taking the natural log of both sides of the equation:

ln(Yi) = β0 + β1 ln(Xi) + εi     (4.3)

Let ln(Yi) = Yi' and ln(Xi) = Xi'; then the above equation can be written as

Yi' = β0 + β1 Xi' + εi     (4.4)
In Equation (4.4), the properties of the OLS estimator of the βs still hold because the equation is linear.
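In Stata, estimating such a double-log specification simply means generating the logged variables and running OLS on them. The sketch below uses placeholder variable names (y and x) rather than a specific data set from the module.

* A minimal sketch of estimating the double-log form (4.4) by OLS.
* y and x are illustrative names; load your own data first.
gen lny = ln(y)
gen lnx = ln(x)
regress lny lnx
* The slope coefficient on lnx estimates beta_1, which in a double-log
* model is interpreted as an elasticity.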
Two additional properties also must hold. First, we assume that the equation is correctly specified. If an equation has an omitted variable or an incorrect functional form, the odds are against that equation
working well. Second, we assume that a stochastic error term has been added to the equation. This error term must be an additive one and cannot be multiplied by or divided into any of the variables in the equation. As was pointed out in our previous discussions, econometricians add a stochastic (random) error term to regression equations to account for variation in the dependent variable that is not explained by the independent variables included in the model. The specific value of the error term for each observation is determined purely by chance. Probably the best way to picture this concept is to think of each observation of the error term as being drawn from a random variable distribution such as the one illustrated in the figure below.
Classical Assumption II says that the mean of this distribution is zero. That is, when the entire population of possible values for the stochastic error term is considered, the average value of that population is zero. For a small sample, it is not likely that the mean is exactly zero, but as the size of the sample approaches infinity, the mean of the sample approaches zero. What happens if the mean does not equal zero in a sample? As long as you have a constant term in the equation, the estimate of β0 will absorb the non-zero mean.
In essence, the constant term equals the fixed portion of Y that cannot be explained by the independent variables, and the error term equals the stochastic portion of the unexplained value of Y.
Observations of stochastic error terms are assumed to be drawn from a random variable distribution with a mean of zero. If Classical Assumption II is met, the expected value (the mean) of the error term is zero.
All explanatory variables are uncorrelated with the error term. It is assumed that the observed values of the explanatory variables are independent of the values of the error term. If an explanatory variable and the error term were instead correlated with each other, the OLS estimates would be likely to attribute to the X some of the variation in Y that actually came from the error term. If the error term and X were positively correlated, for example, then the estimated coefficient would probably be higher than it would otherwise have been (biased upward), because the OLS program would mistakenly attribute the variation in Y caused by ε to X instead. As a result,
it is important to ensure that the explanatory variables are uncorrelated with the
error term. Classical Assumption III is violated most frequently when a researcher
omits an important independent variable from an equation. As we discussed earlier, one of the major components of the stochastic error term is
omitted variables, so if a variable has been omitted, then the error term will change
when the omitted variable changes. If this omitted variable is correlated with an
included independent variable (as often happens in economics), then the error term
is correlated with that independent variable as well. We have violated Assumption
III! Because of this violation, OLS will attribute the impact of the omitted variable
to the included variable, to the extent that the two variables are correlated.
Observations of the error term are drawn independently from each other. If a systematic correlation exists between one observation of the error term and another, then OLS estimates will be less precise than estimates that account for the correlation. For example, if the fact that the error term from one observation is positive increases the probability that the error term from another observation also is positive, then the two observations of the error term are positively correlated. Such a correlation would violate Classical Assumption IV. In economic applications, this assumption is most important in time-series models.
In such a context, Assumption IV says that an increase in the error term in one time period (a random shock, for example) does not show up in or affect in any way the error term in another time period. In some cases, though, this assumption is unrealistic, since the effects of a random shock sometimes last for a number of time periods. If, over all the observations of the sample, εt+1 is correlated with εt, then the error term is said to be serially correlated (or autocorrelated), and Assumption IV is violated.
The variance (or dispersion) of the distribution from which the observations of the error term are drawn is constant. That is, the observations of the error term are assumed to be drawn continually from identical distributions (for example, the one pictured in the figure below).
The alternative would be for the variance of the distribution of the error term
to change for each observation or range of observations.
In the figure below, for example, the variance of the error term is shown to increase as the variable Z increases; such a pattern violates Classical Assumption V.
The actual values of the error term are not directly observable, but the lack of a constant variance for the distribution of the error term causes OLS to generate inaccurate estimates of the standard error of the coefficients. The violation of Assumption V is referred to as heteroskedasticity.
Perfect collinearity between two independent variables implies that they are really the same variable, or that one is a multiple of the other, and/or that a constant has been added to one of the variables. That is, the relative movements of one explanatory variable will be matched exactly by the relative movements of the other even though the absolute size of the movements might differ. Because every movement of one of the variables is matched exactly by a relative movement in the other, the OLS estimation procedure will be incapable of distinguishing one variable from the other. Many instances of perfect collinearity (or multicollinearity if more than two independent variables are involved) are the result of the researcher not accounting for identities (definitional equivalences) among the independent variables. This problem can be corrected easily by dropping one of the perfectly collinear variables from the equation. What is an example of perfect multicollinearity?
Suppose that you decide to build a model of the profits of tire stores in your city and you include annual sales of tires (in dollars) at each store and the annual sales tax paid by each store as independent variables. Since the tire stores are all in the same city, they all pay the same percentage sales tax, so the sales tax paid will be a constant percentage of their total sales (in dollars). If the sales tax rate is 7%, then the total taxes paid will be 7% of sales for each and every tire store. Thus sales tax will be a perfect linear function of sales, and you'll have perfect multicollinearity!
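You can see how OLS reacts to this situation in Stata. The sketch below builds an artificial data set in which salestax is exactly 7 percent of sales (all variable names and numbers are illustrative assumptions); Stata detects the perfect collinearity and omits one of the two variables.

* A minimal sketch of perfect multicollinearity with simulated data.
clear all
set seed 2024
set obs 100
gen sales    = 1000 + 500*rnormal()
gen salestax = 0.07*sales              // an exact linear function of sales
gen profit   = 50 + 0.10*sales + 20*rnormal()
regress profit sales salestax
* Stata notes that one regressor is omitted because of collinearity;
* the two effects cannot be separately estimated.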
Although we have already assumed that observations of the error term are drawn independently (Assumption IV) from a distribution that has a zero mean (Assumption II) and that has a constant variance (Assumption V), we have said little about the shape of that distribution. Classical Assumption VII adds that the error term is normally distributed.
Just as the error term follows a probability distribution, so too do the estimates of β. In fact, each different sample of data typically produces a different estimate of β. The probability distribution of these b values across different samples is called the sampling distribution of b. Recall that an estimator is a formula, such as the OLS formula

b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²     (i = 1, ..., N)     (4.5)
that tells you how to compute b, while an estimate is the value of b computed by the formula for a given sample. Since researchers usually have only one sample, beginning econometricians often assume that regression analysis can produce only one estimate of β for a given population.
In reality, however, each different sample from the same population will produce a different estimate of β. The collection of all the possible samples has a distribution, with a mean and a variance, and we need to discuss the properties of this sampling distribution of b, even though in most real applications we will encounter only a single draw from it. Be sure to remember that a sampling distribution refers to the distribution of different values of b across different samples, not within one.
These bs usually are assumed to be normally distributed, because the normality of the error term implies that the OLS estimates of β are normally distributed as well. For an estimation technique to be "good", the mean of the sampling distribution of the bs it produces should equal the true population β. This property has a special name in econometrics: unbiasedness. Although we do not know the true β in this case, it is likely that if we took enough samples - thousands perhaps - the mean of the bs would approach the true β. The moral of the story is that while a single sample provides a single estimate of β, that estimate comes from a sampling distribution with a mean and a variance. Other estimates from that sampling distribution will most likely be different.
When we discuss the properties of estimators in the next section, it will be important to remember that we are discussing the properties of a sampling distribution, not the properties of one sample.
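The idea of a sampling distribution is easy to see by simulation. The sketch below uses a purely hypothetical data-generating process (intercept 2, slope 1), draws many samples, estimates b1 in each, and summarizes the resulting estimates; their mean should be close to the true slope.

* A minimal simulation sketch of the sampling distribution of the OLS slope.
clear all
set seed 12345
program define olsdraw, rclass
    drop _all
    set obs 50
    gen x = rnormal()
    gen y = 2 + 1*x + rnormal()       // assumed true slope = 1
    regress y x
    return scalar b1 = _b[x]
end
simulate b1 = r(b1), reps(1000) nodots: olsdraw
summarize b1          // mean of the estimates is close to the true slope
histogram b1          // an approximately normal sampling distribution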
E(b) = β     (4.6)

An estimate drawn from an unbiased sampling distribution (one centered around the true value) is more likely to be near the true value (assuming identical variances) than one taken from a distribution not centered around the true value. If an estimator produces bs that are not centered around the true β, the estimator is referred to as a biased estimator. We cannot ensure that every estimate from an unbiased estimator is better than every estimate from a biased one, because a particular unbiased estimate could, by chance, be farther from the true value than a biased estimate might be; this could happen simply by chance or because the biased estimator had a smaller variance.
The variance of the distribution of the bs can be decreased by increasing the size of the sample. This also increases the degrees of freedom, since the number of degrees of freedom equals the sample size minus the number of coefficients or parameters estimated. As the number of observations increases, other things held constant, the variance of the sampling distribution tends to decrease. Although it is not true that a sample of 60 will always produce estimates closer to the true β than a sample of 6, it is quite
likely to do so; such larger samples should be sought. The figure below presents illustrative sampling distributions of bs for 6, 60, and 600 observations for OLS estimators of β when the true β equals 1. The larger samples do indeed produce sampling distributions that are more closely centered around β.
The powerful lesson illustrated by the figure is that if you want to maximize your chances of getting an estimate close to the true value, apply OLS to a large sample. There's no guarantee that you will get a more accurate estimate from a large sample, but your chances are better. Larger samples, all else equal, tend to result in more precise estimates. And if the estimator is unbiased, more precise estimates are more accurate estimates.
In econometrics, we must rely on general tendencies. The element of chance, a random occurrence, is always present in estimating regression coefficients, and some estimates may be far from the true value no matter how good the estimating technique. However, if the distribution is centered on the true value and has as small a variance as possible, the element of chance is less likely to induce a poor estimate. If the sampling distribution is centered around a value other than the true β (that is, if b is biased), then a lower variance implies that most of the sampling distribution of b is concentrated on the wrong value. However, if this value is not very different from the true value, which is usually not known in practice, then the greater precision will still be valuable. One method of deciding whether this decreased variance in the distribution of the bs is valuable enough to offset the bias is to compare different estimation techniques by using a measure called the Mean Square Error (MSE).
The Mean Square Error is equal to the variance plus the square of the bias. The lower the MSE, the better.
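In symbols (a standard decomposition rather than one written out in the original text), for an estimator b of β:

MSE(b) = E[(b − β)²] = VAR(b) + [E(b) − β]²

where the second term is the square of the bias.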
A final item of importance is that as the variance of the error term increases, so too does the variance of the distribution of b. The reason for the increased variance of b is that with the larger variance of εi, the more extreme values of εi are observed
with more frequency, and the error term becomes more important in determining the values of Yi. Since the standard error of the estimated coefficient, SE(b), is the square root of the estimated variance of the bs, it is similarly affected by the size of the sample and the other factors we have mentioned. For example, an increase in sample size will cause SE(b) to fall; the larger the sample, the more precise our coefficient estimates will be.
E(bk) = βk     (k = 0, 1, 2, ..., K)     (4.7)

Best means that each bk has the smallest variance possible (in this case, out of all the linear unbiased estimators of βk). An unbiased estimator with the smallest variance is called efficient, and that estimator is said to have the property of efficiency. Since the variance typically falls as the sample size increases, larger samples almost always produce more accurate coefficient estimates than do smaller ones.
The Gauss-Markov Theorem requires that just the first six of the seven classical assumptions be met. What happens if we add in the seventh assumption, that the error term is normally distributed? In this case, the result of the Gauss-Markov Theorem is strengthened because the OLS estimator can be shown to be the best (minimum variance) unbiased estimator out of all the possible estimators, not just out of the linear estimators. In other words, if all seven assumptions are met, OLS is "BUE."
Given all seven classical assumptions, the OLS coefficient estimators can be shown to have the four properties discussed below.
1. They are unbiased. That is, E(b) = β. This means that the OLS estimates of the coefficients are centered around the true population values of the parameters being estimated.
2. They are of minimum variance. The distribution of the coefficient estimates around the true parameter values is as tightly or narrowly distributed as is possible for an unbiased distribution. No other unbiased estimator has a lower variance for each estimated coefficient than OLS.
3. They are consistent. As the sample size approaches infinity, the estimates converge to the true population parameters. Put differently, as the sample size gets larger, the variance gets smaller, and each estimate approaches the true value of the coefficient being estimated.
4. They are normally distributed. The bs are distributed N(β, VAR(b)). Thus various statistical tests based on the normal distribution may indeed be applied to these estimates, as will be done in the next chapter.
Chapter 5
Hypothesis Testing and Statistical Inference
5.1 Introduction
Explain the terms null hypothesis, alternative hypothesis, and rejection region, giving an example and a sketch of the rejection region.
Explain the term p-value and how to use a p-value to determine the outcome of a hypothesis test; provide a sketch showing a p-value.
Explain the difference between one-tail and two-tail tests. Explain, intuitively, how to choose the rejection region for a one-tail test.
Explain how to choose what goes in the null hypothesis, and what goes in the alternative hypothesis.
5.3 Introduction
Many hypotheses about the world around us can be phrased as yes/no questions. Do the mean monthly earnings of recent Ethiopian college graduates equal ETB 10,000.00 per month? Are mean earnings the same for male and female college graduates? Both these questions embody specific hypotheses about the population distribution of earnings. The statistical challenge is to answer these questions based on a sample of evidence. In this chapter we describe hypothesis tests concerning the population mean (Does the population mean of monthly earnings equal ETB 10,000.00?) and hypothesis tests involving two populations (Are mean earnings the same for men and women?).
To test yourself, take a moment and think about what the null and alternative hypotheses will be if you expect a negative coefficient.

H0: β ≥ 0     (5.3)
HA: β < 0

The above hypotheses are for a one-sided test because the alternative hypothesis has values on only one side of the null hypothesis. Another approach is to use a two-sided test (or a two-tailed test) in which the alternative hypothesis has values on both sides of the null hypothesis.

H0: β = 0     (5.4)
HA: β ≠ 0
Note that the null hypothesis and the alternative hypothesis are jointly exhaustive. Note also that economists always put what they expect in the alternative hypothesis. This allows us to make rather strong statements when we reject a null hypothesis. However, we can never say that we accept the null hypothesis; we must always say that we cannot reject the null hypothesis. As put by one econometrician: just as a court pronounces a verdict as not guilty rather than innocent, so the conclusion of a statistical test is do not reject rather than accept.
We will refer to these errors as Type I and Type II Errors, respectively. Suppose we have the following null and alternative hypotheses:

H0: β ≤ 0     (5.5)
HA: β > 0

Even if the true parameter β is not positive, the particular estimate obtained by a researcher may be sufficiently positive to lead to the rejection of the null hypothesis that β ≤ 0. This is a Type I Error; we have rejected the truth! Alternatively, it's possible to obtain an estimate of β that is close enough to zero (or negative) to be considered "not significantly positive". Such a result may lead the researcher to "accept" the hypothesis that β ≤ 0 when in truth β > 0. This is a Type II Error; we have failed to reject a false null hypothesis!
Suppose we are dealing with the evaluation of the impact of a given intervention; what do these errors mean? A Type I error occurs when an evaluation concludes that a program has had an impact, when in reality it had no impact. A Type II error occurs when an evaluation concludes that the program has had no impact, when in fact it has had an impact. We can generalize the discussion of the Type I and Type II errors as follows:
found in tables in annexes of almost every statistics or econometrics text. A decision rule should be formulated before regression estimates are obtained. The range of possible values of b is divided into two regions, an "acceptance" region and a rejection region, where the terms are expressed relative to the null hypothesis. To define these regions, we must determine a critical value (or, for a two-tailed test, two critical values) of b. Thus, a critical value is a value that divides the "acceptance" region from the rejection region when testing a null hypothesis. Graphs of these "acceptance" and rejection regions are presented in the figures below. To use a decision rule, we need to select a critical value.
Let's suppose that the critical value is 1.8. If the observed b is greater than 1.8, we can reject the null hypothesis that β is zero or negative. To see this, take a look at the figure below. Any b above 1.8 can be seen to fall into the rejection region, whereas any b below 1.8 can be seen to fall into the "acceptance" region. The rejection region measures the probability of a Type I Error if the null hypothesis is true. Some students react to this news by suggesting that we make the rejection region as small as possible. Unfortunately, decreasing the chance of a Type I Error means increasing the chance of a Type II Error (not rejecting a false null hypothesis). If you make the rejection region so small that you almost never reject a true null hypothesis, then you are going to be unable to reject almost every null hypothesis, whether it is true or not! As a result, the probability of a Type II Error will rise.
Given that, how do you choose between Type I and Type II Errors?
The answer is easiest if you know that the cost (to society or the decision maker)
of making one kind of error is dramatically larger than the cost of making the other.
If you worked for the authority regulating and approving drugs in a country, for
example, you would want to be very sure that you had not released a product that
we can calculate t-values for each of the estimated coefficients in the equation.
Note that t-tests are usually done only on the slope coefficients; for these, the relevant form of the t-statistic for the kth coefficient is

tk = (bk − βH0) / SE(bk)     (k = 1, 2, ..., K)     (5.7)
How do you decide what border is implied by the null hypothesis? Some null hypotheses specify a particular value. For these, βH0 is simply that value; if H0: β = S, then βH0 = S. Other null hypotheses involve ranges, but we are concerned only with the value in the null hypothesis that is closest to the border between the "acceptance" region and the rejection region. This border value then becomes βH0. For example, if H0: β ≥ 0 and HA: β < 0, then the value in the null hypothesis closest to the border is zero, and βH0 = 0. Since most regression hypotheses test whether a particular regression coefficient is significantly different from zero, βH0 is typically zero. Zero is particularly meaningful because if the true β equals zero, then the variable does not belong in the equation. Before we drop the variable from the equation and effectively force the coefficient to be zero, however, we need to be careful and test the null hypothesis that β = 0. Thus, the most-used form of the t-statistic becomes

tk = (bk − 0) / SE(bk)     (k = 1, 2, ..., K)     (5.8)

which simplifies to

tk = bk / SE(bk)     (k = 1, 2, ..., K)     (5.9)

or the estimated coefficient divided by the estimate of its standard error. This is the t-statistic formula used by most computer programs.
For an example of this calculation, let's consider the following estimated equation, which is reported in a typical format for presenting estimation results.
depending on whether the test is one-sided or two-sided, on the level of Type I Error you specify, and on the degrees of freedom, N − K − 1. The level of Type I Error in a hypothesis test is also called the level of significance of that test, and we will discuss it in more detail later in this chapter.
The t-table was created to save time during research; it consists of critical t-values given specific areas underneath curves such as those in the figure for a one-sided test for Type I Errors. A critical t-value is thus a function of the probability of Type I Error that the researcher wants to specify. Once you have obtained a calculated t-value tk and a critical t-value tc, you reject the null hypothesis if the calculated t-value is greater in absolute value than the critical t-value and if the calculated t-value has the sign implied by HA. Thus, the rule to apply when testing a single regression coefficient is that you should: reject H0 if |tk| > tc and if tk also has the sign implied by HA; do not reject H0 otherwise.
This decision rule works for calculated t-values and critical t-values for one-sided hypotheses around zero:

H0: β ≤ 0    HA: β > 0
H0: β ≥ 0    HA: β < 0

for two-sided hypotheses around zero:

H0: β = 0    HA: β ≠ 0

for one-sided hypotheses based on hypothesized values other than zero:

H0: β ≤ S    HA: β > S
H0: β ≥ S    HA: β < S

and also for two-sided hypotheses based on hypothesized values other than zero:

H0: β = S    HA: β ≠ S
The decision rule is the same: reject the null hypothesis if the appropriately calculated t-value, tk, is greater in absolute value than the critical t-value, tc, as long as the sign of tk is the same as the sign of the coefficient implied in HA. Otherwise, do not reject H0. Always use Equation (5.7):

tk = (bk − βH0) / SE(bk)     (k = 1, 2, ..., K)
If you know that a Type II Error will be extremely costly, for example, then it makes sense to consider using a 10-percent level of significance when you determine your critical value. Such judgments are difficult, however, so beginning researchers are encouraged to adopt a 5-percent level of significance as standard. If we can reject a null hypothesis at the 5-percent level of significance, we can summarize our results by saying that the coefficient is "statistically significant" at the 5-percent level. Since the 5-percent level is arbitrary, we shouldn't jump to conclusions about the value of a variable simply because its coefficient misses being significant by a small amount; if a different level of significance had been chosen, the result might have been different. Some researchers produce tables of regression results, typically without hypothesized signs for their coefficients, and then mark "significant" coefficients with asterisks. The asterisks indicate when the t-score is larger in absolute value than the two-sided 10-percent critical value (which merits one asterisk), the two-sided 5-percent critical value (**), or the two-sided 1-percent critical value (***). Such a use of the t-value should be regarded as a descriptive rather than a hypothesis-testing use of statistics.
Now and then researchers will use the phrase "degree of confidence" or "level of confidence" when they test hypotheses. What do they mean? The level of confidence is nothing more than 100 percent minus the level of significance. Thus a t-test for which we use a 5-percent level of significance can also be said to have a 95-percent level of confidence. Since the two terms have identical meanings, we will use level of significance throughout this module. Another reason we prefer the term level of significance to level of confidence is to avoid any possible confusion with the related concept of confidence intervals.
Some researchers avoid choosing a level of significance by simply stating the lowest level of significance possible for each estimated regression coefficient. The resulting significance levels are called p-values.
5.4.4 p-Value
There's an alternative to the t-test based on a measure called the p-value, or marginal significance level. A p-value for a t-score is the probability of observing a t-score that size or larger (in absolute value) if the null hypothesis were true. Graphically, it's two times the area under the curve of the t-distribution between the absolute value of the actual t-score and infinity. A p-value is a probability, so it runs from 0 to 1. It tells us the lowest level of significance at which we could reject the null hypothesis (assuming that the estimate is in the expected direction). A small p-value casts doubt on the null hypothesis, so to reject a null hypothesis, we need a low p-value. How do we calculate a p-value? Standard regression software packages calculate p-values automatically and print them out for every estimated coefficient. You are thus able to read p-values off your regression output just as you would your b.
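In Stata, for example, the two-sided p-value appears in the P>|t| column of the regression output. The sketch below (placeholder variable names y and x) shows how to reproduce it from the stored results and how to obtain a one-sided p-value.

* A minimal sketch of reading and converting p-values after a regression.
regress y x
* The column labelled P>|t| is the two-sided p-value for each coefficient.
display 2*ttail(e(df_r), abs(_b[x]/_se[x]))   // two-sided p-value by hand
display   ttail(e(df_r), _b[x]/_se[x])        // one-sided p-value for HA: beta > 0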
H0: β1 ≤ 0
HA: β1 > 0

As you can see from the regression output, the p-value for the coefficient on income is .025. This is a two-sided p-value and we are running a one-sided test, so we need to divide .025 by 2, getting .0125. Since .0125 is lower than our chosen level of significance of .05, and since the sign of b1 is positive and agrees with that in HA, we can reject H0. Not surprisingly, this is the same result we would get if we ran a conventional t-test. p-values have a number of advantages. They're easy to use, and they allow readers of research to choose their own levels of significance instead of being forced to use the level chosen by the original researcher.
In addition, p-values convey information to the reader about the relative strength with which we can reject a null hypothesis. Because of these benefits, many researchers use p-values on a consistent basis.
Beginning researchers benefit from learning the standard t-test procedure, particularly since it is more likely to force them to remember to hypothesize the sign of the coefficient and to use a one-sided test when a particular sign can be hypothesized. In addition, if you know how to use the standard t-test approach, it's easy to switch to the p-value approach, but the reverse is not necessarily true. However,
we acknowledge that practicing econometricians today spend far more energy estimating models and coefficients than they spend testing hypotheses. This is because most researchers are more confident in their theories (say, that demand curves slope downward) than they are in the quality of their data or their regression methods. In such situations, where the statistical tools are being used more for descriptive purposes than for hypothesis-testing purposes, it's clear that the use of p-values saves time and conveys more information than does the standard t-test procedure.
The four steps to use when working with the t-test, illustrated in the Stata sketch below, are:
1. Set up the null and alternative hypotheses.
2. Choose a level of significance and therefore a critical t-value.
3. Run the regression and obtain an estimated t-value (or t-score).
4. Apply the decision rule by comparing the calculated t-value with the critical t-value in order to reject or not reject the null hypothesis.
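The sketch below walks through these four steps in Stata for a one-sided test on a single slope coefficient. The data, variable names, and hypothesized sign are illustrative assumptions, not a worked example taken from the module's data set.

* A minimal sketch of the four-step t-test for H0: beta <= 0 vs HA: beta > 0.
* Step 1: hypotheses as stated above (we expect a positive coefficient on x).
* Step 2: choose a 5-percent significance level and find the critical t-value.
* Step 3: run the regression and compute the t-score.
* Step 4: compare the t-score with the critical value and check its sign.
regress y x
scalar tscore = _b[x]/_se[x]
scalar tcrit  = invttail(e(df_r), 0.05)     // one-sided 5% critical value
display "t-score = " tscore "   critical value = " tcrit
display "Reject H0? " (tscore > tcrit)      // 1 = reject, 0 = do not reject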
independent variable or the standard error of the independent variable would make more sense.
The t-Test Is Not Intended for Tests of the Entire Population
The t-test helps make inferences about the true value of a parameter from an estimate calculated from a sample of the population (the group from which the sample is being drawn). If a coefficient is calculated from the entire population, then an unbiased estimate already measures the population value and a significant t-test adds nothing to this knowledge. One might forget this property and attach too much importance to t-scores that have been obtained from samples that approximate the population in size. There is a third way to test a hypothesis: it is based on the concept of a confidence interval.
What exactly does this mean? If the Classical Assumptions hold true, the confidence interval formula produces ranges that contain the true value of β 90 percent of the time. In this case, there is a 90 percent chance that the true value of βI is between 0.365 and 2.211. If it is not in that range, it's due to an unlucky sample. How can we use a confidence interval for a two-tailed hypothesis test? If the null hypothesis is βI = 0, we can reject it at the 10-percent level because 0 is not in the confidence interval. If the null hypothesis is that βI = 1.0, we cannot reject it because 1.0 is in the interval. In general, if your null-hypothesis border value is in the confidence interval, you cannot reject the null hypothesis. Thus, confidence intervals can be used for two-sided tests, but they are more complicated. So why bother with them? It turns out that confidence intervals are very useful in telling us how precise a coefficient estimate is. And for many people using econometrics in the real world, this may be more important than hypothesis testing.
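Stata reports a 95-percent confidence interval for every coefficient by default, and the level() option changes it. The sketch below (placeholder variable names) shows how to obtain the 90-percent interval used in the discussion above.

* A minimal sketch of confidence intervals after OLS.
regress y x              // output includes 95% confidence intervals
regress y x, level(90)   // same results displayed with 90% intervals
* If the null-hypothesis border value lies outside the reported interval,
* the corresponding two-sided test rejects H0 at the 10-percent level.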
If the fits of the constrained equation and the unconstrained equation are not substantially different, the null hypothesis should not be rejected. If the fit of the unconstrained equation is substantially better than that of the constrained equation, then we reject the null hypothesis. The fit of the constrained equation is never superior to the fit of the unconstrained equation, as we'll explain next.
The fits of the equations are compared with the general F-statistic:

F = [(RSSM − RSS) / M] / [RSS / (N − K − 1)]

where RSSM is the residual sum of squares of the constrained equation, RSS is the residual sum of squares of the unconstrained equation, M is the number of constraints, and N − K − 1 is the degrees of freedom of the unconstrained equation. The decision rule is:

Reject H0 if F > Fc
Do not reject H0 if F ≤ Fc
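In Stata, this comparison is carried out automatically by the test command after a regression, which reports the F-statistic and its p-value for the stated constraints. The sketch below (placeholder variable names x1-x3) tests the joint null hypothesis that two slope coefficients are zero.

* A minimal sketch of an F-test of joint significance after OLS.
regress y x1 x2 x3
test x2 x3            // H0: beta_2 = 0 and beta_3 = 0 (M = 2 constraints)
* Stata reports F(2, N-K-1) and Prob > F; reject H0 if Prob > F is below
* the chosen level of significance.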
Chapter 6
Violation of Classical Assumptions
6.1 Introduction
In this chapter we deal with violations of the Classical Assumptions and remedies for those violations: multicollinearity, serial correlation, and heteroskedasticity. For each of these three problems, we will attempt to answer the following questions:
What is the nature of the problem?
What are the consequences of the problem?
How is the problem diagnosed?
What remedies for the problem are available?
The word collinearity describes a linear correlation between two independent variables, and multicollinearity indicates that more than two independent variables are involved. In common usage, multicollinearity is used to apply to both cases.
Explain what is meant by a serially correlated time series, and how we measure serial correlation.
Explain how and why plots of least squares residuals can reveal heteroskedasticity.
Specify a variance function and use it to test for heteroskedasticity with (a) a Breusch-Pagan test and (b) a White test.
Describe and compare the properties of the least squares and generalized least
squares estimators when heteroskedasticity exists.
6.3 Multicollinearity
Strictly speaking, perfect multicollinearity is the violation of Classical Assumption VI, that no independent variable is a perfect linear function of one or more other independent variables.
Perfect multicollinearity is rare, but severe imperfect multicollinearity, although not violating Classical Assumption VI, still causes substantial problems. Recall that the coefficient βk can be thought of as the impact on the dependent variable of a one-unit increase in the independent variable Xk, holding constant the other independent variables in the equation. If two explanatory variables are significantly related, then the OLS computer program will find it difficult to distinguish the effects of one variable from the effects of the other. In essence, the more highly correlated two (or more) independent variables are, the more difficult it becomes to accurately estimate the coefficients of the true model. If two variables move identically, then there is no hope of distinguishing between their impacts, but if the variables are only roughly correlated, then we still might be able to estimate the two effects accurately enough for most purposes.
Perfect multicollinearity violates Classical Assumption VI, which specifies that no explanatory variable is a perfect linear function of any other explanatory variable. The word perfect in this context implies that the variation in one explanatory variable can be completely explained by movements in another explanatory variable. Such a perfect linear function between two independent variables would be:

x1i = α0 + α1 x2i
where the αs are constants and the xs are independent variables in:

yi = β0 + β1 x1i + β2 x2i + εi

Notice that there is no error term in the first equation. This implies that x1 can be exactly calculated given x2 and the equation. Typical equations for such perfect linear relationships would be:

x1i = 5 x2i
x1i = 2 + 3 x2i
Perfect multicollinearity ruins our ability to estimate the coefficients because the two variables cannot be distinguished. You cannot "hold all the other independent variables in the equation constant" if every time one variable changes, another changes in an identical manner. With perfect multicollinearity, an independent variable can be completely explained by the movements of one or more other independent variables. Perfect multicollinearity can usually be avoided by careful screening of the independent variables before a regression is run.
A special case related to perfect multicollinearity occurs when a variable that is definitionally related to the dependent variable is included as an independent variable in a regression equation. Such a dominant variable is by definition so highly correlated with the dependent variable that it completely masks the effects of all other independent variables in the equation. In a sense, this is a case of perfect collinearity between the dependent variable and an independent variable. For example, if you include a variable measuring the amount of raw materials used by the shoe industry in a production function for that industry, the raw materials variable would have an extremely high t-score, but otherwise important variables like labor and capital would have quite insignificant t-scores. Why?
In essence, if you knew how much leather was used by a shoe factory, you could predict the number of pairs of shoes produced without knowing anything about labor or capital. The relationship is definitional, and the dominant variable should be dropped from the equation to get reasonable estimates of the coefficients of the other variables. Since perfect multicollinearity is fairly easy to avoid, econometricians rarely talk about it. Instead, when we use the word multicollinearity, we really are talking about severe imperfect multicollinearity.
tk = (bk − βH0) / SE(bk)     (k = 1, 2, ..., K)

Figure 6.1: Severe multicollinearity increases the variances of the estimated coefficients.
The overall fit of the equation and the estimation of the coefficients of non-multicollinear variables will be largely unaffected. Even though the individual t-scores are often quite low in a multicollinear equation, the overall fit of the equation, as measured by R², will not fall much, if at all, in the face of significant multicollinearity. Given this, one of the first indications of severe multicollinearity is the combination of a high R² with no statistically significant individual regression coefficients.
Similarly, if an explanatory variable in an equation is not multicollinear with the other variables, then the estimation of its coefficient and standard error usually will not be affected. Because the overall fit is largely unchanged, it's possible for the F-test of overall significance to reject the null hypothesis even though none of the t-tests on individual coefficients can do so. Such a result is a clear indication of severe imperfect multicollinearity.
Finally, since multicollinearity has little effect on the overall fit of the equation, it also will have little effect on the use of that equation for prediction or forecasting, as long as the independent variables maintain the same pattern of multicollinearity in the forecast period that they demonstrated in the sample.
One measure of the severity of multicollinearity that is easy to use and that is gaining in popularity is the variance inflation factor. The variance inflation factor (VIF) is a method of detecting the severity of multicollinearity by looking at the extent to which a given explanatory variable can be explained by all the other explanatory variables in the equation. There is a VIF for each explanatory variable in an equation.
The VIF is an index of how much multicollinearity has increased the variance of an estimated coefficient. A high VIF indicates that multicollinearity has increased the estimated variance of the estimated coefficient by quite a bit, yielding a decreased t-score.
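In practice, Stata computes the VIFs directly after a regression with the estat vif postestimation command, so the two-step procedure described below rarely has to be carried out by hand. A minimal sketch with placeholder variable names:

* A minimal sketch of checking VIFs after OLS.
regress y x1 x2 x3
estat vif
* A common rule of thumb treats VIFs above about 5 or 10 as a sign of
* severe multicollinearity, though any such cutoff is arbitrary.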
Suppose you want to use the VIF to attempt to detect multicollinearity in an original equation with K independent variables. Doing so requires calculating K different VIFs, one for each Xi. Calculating the VIF for a given Xi involves two steps:
1. Run an OLS regression that has Xi as a function of all the other explanatory variables in the equation. For i = 1, this equation would be:
6.4 Serial Correlation
The approach of this section to the problem of serial correlation will be similar to that used in the previous section. We'll attempt to answer the same four questions:
1. What is the nature of the problem?
2. What are the consequences of the problem?
3. How is the problem diagnosed?
4. What remedies for the problem are available?
The most commonly assumed kind of serial correlation is first-order serial correlation, in which the current value of the error term is a function of the previous value of the error term:

εt = ρ εt−1 + ut

where ρ (rho) is the first-order autocorrelation coefficient and ut is a classical (not serially correlated) error term.
Pure serial correlation is caused by the underlying distribution of the error term of the true specification of an equation (which cannot be changed by the researcher), while impure serial correlation is caused by a specification error that often can be corrected. How is it possible for a specification error to cause serial correlation?
Recall that the error term can be thought of as the effect of omitted variables, nonlinearities, measurement errors, and pure stochastic disturbances on the dependent variable. This means, for example, that if we omit a relevant variable or use the wrong functional form, then the portion of that omitted effect that cannot be represented by the included explanatory variables must be absorbed by the error term. The error term for an incorrectly specified equation thus includes a portion of the effect of any omitted variables and/or a portion of the effect of the difference between the proper functional form and the one chosen by the researcher.
This new error term might be serially correlated even if the true one is not. If this is the case, the serial correlation has been caused by the researcher's choice of a specification and not by the pure error term associated with the correct specification.
of the original model. If the lagged residuals are significant in explaining this time's residuals, then we can reject the null hypothesis of no serial correlation.
The place to start in correcting a serial correlation problem is to look carefully at the specification of the equation for possible errors that might be causing impure serial correlation.
Is the functional form correct?
Are you sure that there are no omitted variables?
Only after the specification of the equation has been reviewed carefully should the possibility of an adjustment for pure serial correlation be considered.
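A formal residual-based check of this kind is available as a Stata postestimation command once the data have been declared a time series. The sketch below (placeholder variable names, with timevar standing for the time variable) runs a Lagrange Multiplier test of the type just described (Breusch-Godfrey) along with the Durbin-Watson statistic.

* A minimal sketch of testing for first-order serial correlation.
tsset timevar                // declare the time variable first
regress y x1 x2
estat bgodfrey, lags(1)      // LM test: are lagged residuals significant?
estat dwatson                // Durbin-Watson statistic for the same question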
Generalized least squares (GLS) is a method of ridding an equation of pure first-order serial correlation and, in the process, restoring the minimum variance property to its estimation.
Newey-West standard errors are SE(b)s that take account of serial correlation without changing the bs themselves in any way.
The logic behind Newey-West standard errors is powerful. If serial correlation does not cause bias in the bs but does impact the standard errors, then it makes sense to adjust the estimated equation in a way that changes the SE(b)s but not the bs.
Thus Newey-West standard errors have been calculated specifically to avoid the consequences of pure first-order serial correlation. The Newey-West procedure yields an estimator of the standard errors that, while biased, is generally more accurate than uncorrected standard errors for large samples (greater than 100) in the face of serial correlation. As a result, Newey-West standard errors can be used for t-tests and other hypothesis tests in most samples without the errors of inference potentially caused by serial correlation. Typically, Newey-West SE(b)s are larger than OLS SE(b)s, thus producing lower t-scores and decreasing the probability that a given estimated coefficient will be significantly different from zero.
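Stata computes Newey-West standard errors with the newey command; the coefficients are identical to OLS, and only the standard errors change. A minimal sketch with placeholder variable names (the lag length chosen here is purely illustrative):

* A minimal sketch of Newey-West standard errors for a time-series regression.
tsset timevar
regress y x1 x2              // OLS coefficients and uncorrected SEs
newey   y x1 x2, lag(4)      // same coefficients, serial-correlation-robust SEs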
6.5 Heteroskedasticity
Heteroskedasticity is the violation of Classical Assumption V, which states that the observations of the error term are drawn from a distribution that has a constant variance. The assumption of constant variances for different observations of the error term (homoskedasticity) is not always realistic. For example, in a model explaining heights, it's likely that error term observations associated with the height of a basketball player would come from distributions with larger variances than those associated with the height of a mouse. Heteroskedasticity is important because OLS, when applied to heteroskedastic models, is no longer the minimum variance estimator (it still is unbiased, however). In general, heteroskedasticity is more likely to take place in cross-sectional models than in time-series models. This focus on
might have a large variance, and that the error term distribution for small observations might have a small variance.
In cross-sectional data sets, it's easy to get such a large range between the highest and lowest values of the variables. The difference between Oromia and Gambela (or Harari) in terms of the Birr value of the consumption of goods and services, for instance, is quite large (comparable in percentage terms to the difference between the heights of a basketball player and a mouse). Since cross-sectional models often include observations of widely different sizes in the same sample (cross-regional studies of Ethiopia usually include Oromia and Gambela as individual observations, for example), heteroskedasticity is hard to avoid if economic topics are going to be studied cross-sectionally.
The simplest way to visualize pure heteroskedasticity is to picture a world in which the observations of the error term could be grouped into just two different distributions, "wide" and "narrow." We'll call this simple version of the problem discrete heteroskedasticity. Here, both distributions would be centered around zero, but one would have a larger variance than the other, as indicated in the bottom half
of the figure above. Note the difference between the two halves of the figure. With homoskedasticity, all the error term observations come from the same distribution; with heteroskedasticity, they come from different distributions.
For an example of discrete heteroskedasticity, we need go no further than our discussion of the heights of basketball players and mice. We'd certainly expect the variance of ε to be larger for basketball players as a group than for mice, so the distribution of ε for the heights of basketball players might look like the "wide" distribution in the figure above, and the distribution of ε for mice would be much narrower than the "narrow" distribution in the figure above.
Heteroskedasticity takes on many more complex forms. In fact, the number of different models of heteroskedasticity is virtually limitless, and an analysis of even a small percentage of these alternatives would be a huge task. Instead, we'd like to address the general principles of heteroskedasticity by focusing on the most frequently specified model of pure heteroskedasticity, just as we focused on pure, positive, first-order serial correlation in the previous section. However, don't let this focus mislead you into concluding that econometricians are concerned only with one kind of heteroskedasticity.
In this model of heteroskedasticity, the variance of the error term is related to an exogenous variable Zi. For a typical regression equation:

Yi = β0 + β1 X1i + β2 X2i + εi

the variance of the otherwise classical error term might be equal to:

VAR(εi) = σ² Zi

where Z may or may not be one of the Xs in the equation. The variable Z is called a proportionality factor because the variance of the error term changes proportionally to Zi. The higher the value of Zi, the higher the variance of the distribution of the ith observation of the error term. There would be N different distributions, one for each observation, from which the observations of the error term could be drawn, depending on the number of different values that Z takes. To see what homoskedastic and heteroskedastic distributions of the error term look like with respect to Z, compare the two figures below. Note that the heteroskedastic distribution gets wider as Z increases but that the homoskedastic distribution maintains the same width no matter what value Z takes.
What is an example of a proportionality factor Z? How is it possible for an
exogenous variable such as Z to change the whole distribution of an error term?
Think about a function that relates the consumption expenditures in a state to
its income. The expenditures of a small state like Rhode Island are not likely to
be as variable in absolute value as the expenditures of a large state like California
because a 10-percent change in spending for a large state involves a lot more money
than a 10-percent change for a small one. In such a case, the dependent variable would be consumption expenditures and a likely proportionality factor, Z, would be population. As population rose, so too would the variance of the error term of an equation built to explain expenditures. The error term distributions would look something like those in the figure above, where Z is population.
This example helps emphasize that heteroskedasticity is likely to occur in cross-sectional models because of the large variation in the size of the dependent variable involved. An exogenous disturbance that might seem huge to a small state could seem minuscule to a large one, for instance.
Heteroskedasticity can occur in a time-series model with a significant amount of change in the dependent variable. If you were modeling sales of DVD players from 1994 to 2015, it's quite possible that you would have a heteroskedastic error term. As the phenomenal growth of the industry took place, the variance of the error term probably increased as well. Such a possibility is unlikely in time series that have low rates of change, however.
Heteroskedasticity also can occur in any model, time series or cross-sectional, where the quality of data collection changes dramatically within the sample. As data collection techniques get better, the variance of the error term should fall because measurement errors are included in the error term. As measurement errors decrease in size, so should the variance of the error term (a topic known as "errors in the variables").
E(bk) = βk     for all k

Lack of bias does not guarantee "accurate" coefficient estimates, especially since heteroskedasticity increases the variance of the estimates, but the distribution of the estimates is still centered around the true β. Equations with impure heteroskedasticity caused by an omitted variable, of course, will have possible specification bias.
2. Heteroskedasticity typically causes OLS to no longer be the minimum-variance estimator (of all the linear unbiased estimators). Pure heteroskedasticity causes no
bias in the estimates of the OLS coefficients, but it does affect the minimum-variance property.
If the error term of an equation is heteroskedastic with respect to a proportionality factor Z:

VAR(εi) = σ² Zi

then the minimum-variance portion of the Gauss-Markov Theorem cannot be proven, because there are other linear unbiased estimators that have smaller variances. This is because the heteroskedastic error term causes the dependent variable to fluctuate, and the OLS estimation procedure attributes this fluctuation to the independent variables. Thus, OLS is more likely to misestimate the true β in the face of heteroskedasticity. The bs still are unbiased because overestimates are just as likely as underestimates.
3. Heteroskedasticity causes the OLS estimates of the SE(β̂)s to be biased, leading to unreliable hypothesis testing and confidence intervals. With heteroskedasticity, the OLS formula for the standard error produces biased estimates of the SE(β̂)s. Because the SE(β̂) is a prime component of the t-statistic, these biased SE(β̂)s cause biased t-scores and unreliable hypothesis testing in general. In essence, heteroskedasticity causes OLS to produce incorrect SE(β̂)s and t-scores! Not surprisingly, most econometricians therefore are very hesitant to put much faith in hypothesis tests conducted in the face of pure heteroskedasticity.
What sort of bias in the standard errors does heteroskedasticity tend to cause? Typically, heteroskedasticity causes the OLS estimates of the standard errors to be biased downward, making them too small. Sometimes, however, they are biased upward; it is hard to predict in any given case. Either way, it is a big problem: pure heteroskedasticity can make quite a mess of our results, hypothesis testing becomes unreliable, and confidence intervals become misleading.
Before testing formally for heteroskedasticity, ask three questions.
1. Are there any obvious specification errors? Are there any likely omitted variables? Have you specified a linear model when a double-log model is more appropriate? Don't test for heteroskedasticity until the specification is as good as possible. After all, if you find heteroskedasticity in an incorrectly specified model, there's a chance it will be impure.
2. Are there any early warning signs of heteroskedasticity? Just as certain kinds
of clouds can warn of potential storms, certain kinds of data can signal possible
heteroskedasticity. In particular, if the dependent variable's maximum value is
many, many times larger than its minimum, beware of heteroskedasticity.
3. Does a graph of the residuals show any evidence of heteroskedasticity? It
sometimes saves time to plot the residuals against a potential Z proportionality
factor or against the dependent variable. If you see a pattern in the residuals,
you’ve got a problem. See the Figures below for a few examples of heteroskedastic
patterns in the residuals.
Note that the figures above show “textbook” examples of heteroskedasticity. The
real world is nearly always a lot messier than textbook graphs. It’s not unusual to
look at a real-world residual plot and be unsure whether there’s a pattern or not.
As a result, even if there are no obvious specification errors, no early warning signs,
and no visible residual patterns, it’s a good idea to do a formal statistical test for
heteroskedasticity, so we’d better get started.
Step 3: Test the overall significance of the auxiliary regression (the regression of the squared residuals on the suspected proportionality factor or factors) with a chi-square test. The null and alternative hypotheses are:
H0: α1 = α2 = 0
HA: H0 is false
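To make the procedure concrete, here is a minimal Stata sketch of the residual-plot check and the N·R² chi-square test described above. The variable names y, x1, x2, and the suspected proportionality factor z are hypothetical placeholders for your own model:

* fit the original model (y, x1, x2, z are hypothetical variable names)
regress y x1 x2
* save the residuals and square them
predict ehat, residuals
generate ehat2 = ehat^2
* visual check: plot the residuals against a suspected proportionality factor
scatter ehat z
* auxiliary regression of the squared residuals on the suspected factor
regress ehat2 z
* N times the unadjusted R-squared of the auxiliary regression
display "N*R2 = " e(N)*e(r2)
* 5% critical chi-square value (df = number of slope coefficients in the auxiliary regression)
display "critical value = " invchi2tail(1, 0.05)
* Stata's built-in Breusch-Pagan/Cook-Weisberg test gives a similar check
quietly regress y x1 x2
estat hettest z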
Probably the most popular of all the heteroskedasticity tests is the White test, because it can find more types of heteroskedasticity than any other test. Let's see how it works.
The White test investigates the possibility of heteroskedasticity in an equation by seeing if the squared residuals can be explained by the equation's independent variables, their squares, and their cross products. To run the White test:
1. Obtain the residuals of the estimated regression equation.
2. Estimate an auxiliary regression, using the squared residuals as the dependent
variable, with each X from the original equation, the square of each X, and the
product of each X times every other X as the explanatory variables.
3. Test the overall significance of the auxiliary regression with a chi-square test. Once again the test statistic is N·R², the sample size (N) times the unadjusted R² from the auxiliary regression. This test statistic has a chi-square distribution with degrees of freedom equal to the number of slope coefficients in the auxiliary regression. The null hypothesis is that all the slope coefficients in the auxiliary
regression equal zero, and if N·R² is greater than or equal to
the critical chi-square value, then we reject the null hypothesis of homoskedasticity.
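In Stata the White test does not have to be set up by hand: after fitting the original equation with regress, a single post-estimation command runs the auxiliary regression with the squares and cross products and reports the N·R² statistic with its p value. A minimal sketch (y, x1, x2 again hypothetical variable names):

* fit the original model
regress y x1 x2
* White's general test: squared residuals on the regressors, their squares, and cross products
estat imtest, white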
Check out the explanatory variables in the White auxiliary regression. They
include every variable in the original model, their squares, and their cross products.
Including all the variables from the original model allows the White test to check
to see if any or all of them are Z proportionality factors. Including all the squared
terms and cross products allows us to test for more exotic and complex types of
heteroskedasticity. This is the White test’s greatest strength.
However, the White test contains more right-hand-side variables than the original regression, sometimes a lot more. This can be its greatest weakness. To see why, note that as the number of explanatory variables in an original regression rises, the number of right-hand variables in the White test auxiliary regression goes up much
faster. With three variables in the original model, the White regression could have
nine. With 12 explanatory variables in the original model, there could be 90 in the
White regression with all the squares and interactive terms included! And this is
where the weakness becomes a real problem.
If the number of right-hand variables in the auxiliary regression exceeds the
number of observations, you can’t run the White test regression because you would
have negative degrees of freedom in the auxiliary equation! Even if the degrees of
freedom in the auxiliary equation are positive but small, the White test might do
a poor job of detecting heteroskedasticity because the fewer the degrees of freedom
there are, the less powerful the statistical test is. In such a situation, you’d be
limited to the Breusch–Pagan test or an alternative.
The first thing to do if the Breusch–Pagan test or the White test indicates the possibility of heteroskedasticity is to examine the equation carefully for specification errors. Although you should never include an explanatory variable simply because a test indicates the possibility of heteroskedasticity, you ought to rigorously think through the specification of the equation. If this rethinking allows you to discover
a variable that should have been in the regression from the beginning, then that
variable should be added to the equation. Similarly, if you had the wrong functional
form to begin with, the discovery of heteroskedasticity might be the hint you need
to rethink the specification and switch to the functional form that best represents the underlying theory. However, if there are no obvious specification errors, the
heteroskedasticity is probably pure in nature, and one of the remedies described in
this section should be considered.
One remedy is to rethink the variables or functional form of your equation. In some cases, the only redefinition that's needed to rid an equation of heteroskedasticity is to switch from a linear functional form to a double-log
functional form. The double-log form has inherently less variation than the linear
form, so it’s less likely to encounter heteroskedasticity. In addition, there are many
research topics for which the double-log form is just as theoretically logical as the
linear form. In other situations, it might be necessary to completely rethink the
research project in terms of its underlying theory.
For example, consider a cross-sectional model of the total expenditures by the
governments of different cities. Logical explanatory variables to consider in such an
analysis are the aggregate income, the population, and the average wage in each city.
The larger the total income of a city’s residents and businesses, for example, the
larger the city government’s expenditures. In this case, it’s not very enlightening to
know that the larger cities have larger incomes and larger expenditures (in absolute
magnitude) than the smaller ones. Fitting a regression line to such data also gives
undue weight to the larger cities because they would otherwise give rise to large
squared residuals. That is, since OLS minimizes the summed squared residuals, and
since the residuals from the large cities are likely to be large due simply to the size
of the city, the regression estimation will be especially sensitive to the residuals from
the larger cities. This is often called “spurious correlation” due to size. In addition,
the residuals may indicate heteroskedasticity.
It makes sense to consider reformulating the model in a way that will discount
the scale factor (the size of the cities) and emphasize the underlying behavior. In
this case, per capita expenditures would be a logical dependent variable. This form
of the equation places Addis Ababa on the same scale as, say, Adama, Hawasa, Bahir
Dar, Mekele, and thus gives them the same weight in estimation. If an explanatory
variable happened not to be a function of the size of the city, however, it would not
need to be adjusted to per capita terms. If the equation included the average wage
of city workers, for example, that wage would not be divided through by population
in the transformed equation. Suppose your original equation is
EXPi = β0 + β1 INCi + β2 WAGEi + β3 POPi + εi
Even a transformed, per capita version of this equation could have heteroskedasticity; the error variances might be larger for the observations having the larger per capita values for expenditures than they are for smaller per capita values. Thus, it is legitimate to suspect and test for heteroskedasticity even in this transformed model.
Such heteroskedasticity in the transformed equation is unlikely, however, because
there will be little of the variation in size normally associated with heteroskedasticity.
The above transformed equation is very similar to the equation for Weighted
Least Squares (WLS).
Weighted Least Squares is a remedy for heteroskedasticity that consists of divid-
ing the entire equation (including the constant and the heteroskedastic error term)
by the proportionality factor Z and then re-estimating the equation with OLS. For
the example above, the WLS equation would be:
EXPi/POPi = β0(1/POPi) + β1(INCi/POPi) + β2(WAGEi/POPi) + β3 + ui
where the variables and the βs are identical to those in the original equation above, and ui = εi/POPi. Dividing through by Z means that u is a homoskedastic error term as long as Z is the correct proportionality factor. Identifying the correct proportionality factor is not a trivial problem, however, and other transformations and heteroskedasticity-corrected standard errors (HCSEs) are much easier to use than WLS, so the use of WLS is no longer widely recommended.
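In practice the two most common responses are heteroskedasticity-corrected (robust) standard errors and, less commonly nowadays, WLS. A hedged Stata sketch for the city-expenditure example, assuming a hypothetical dataset with variables expend, inc, wage, and pop corresponding to EXPi, INCi, WAGEi, and POPi:

* OLS with heteroskedasticity-corrected (robust) standard errors:
* the coefficients are unchanged, but the reported SEs and t-statistics are valid under heteroskedasticity
regress expend inc wage pop, vce(robust)

* Weighted Least Squares with POP as the proportionality factor:
* if the error standard deviation is proportional to pop, weight each observation by 1/pop^2
* (equivalent to dividing the whole equation through by pop and re-estimating by OLS)
regress expend inc wage pop [aweight = 1/(pop^2)]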
Chapter 7
Regression Models for Categorical and Limited Dependent Variables
This chapter shows how to model qualitative factors with more than two categories (such as region of the country), how to interpret the resulting model with an example, and how to test the equivalence of two regression equations using indicator variables. Based on the material in this chapter, you should be able to:
Explain why probit, or logit, is usually preferred to least squares when esti-
mating a model in which the dependent variable is binary.
Compare and contrast the multinomial logit model to the conditional logit
model.
7.2 Introduction
In all the regression models that we have considered so far, we have implicitly
assumed that the regressand, the dependent variable, or the response variable Y is
quantitative, whereas the explanatory variables are either quantitative, qualitative
(or dummy), or a mixture thereof. We have briefly discussed how dummy regressors (explanatory variables) are introduced in a regression model and what role they play in specific situations.
In this chapter we consider several models in which the regressand (the dependent
variable) itself is qualitative in nature. Although increasingly used in various areas
of social sciences and medical research, qualitative response regression models pose
interesting estimation and interpretation challenges. Suppose we want to study the
labor force participation (LFP) decision of adult males. Since an adult is either in
the labor force or not, LFP is a yes or no decision. Hence, the response variable, or
regressand, can take only two values, say, 1 if the person is in the labor force and 0 if
he or she is not. In other words, the regressand is a binary, or dichotomous, variable.
For the present purposes, the important thing to note is that the regressand is a
qualitative variable. One can think of several other examples where the regressand
is qualitative in nature. Thus, a family either owns a house or it does not, it has
disability insurance or it does not, both husband and wife are in the labor force or
only one spouse is. Similarly, a certain drug is effective in curing an illness or it is not.
A firm decides to declare a stock dividend or not, a parliamentarian decides to vote
for a tax cut or not. We do not have to restrict our response variable to yes/no or
dichotomous categories only. We can have a polychotomous (or multiple-category)
response variable.
7.3 The Logit Model
The simplest way to handle a binary dependent variable is the linear probability model (LPM), which applies OLS directly to the 0–1 outcome. The LPM has several well-known problems (a short Stata sketch follows this list):
1. The LPM assumes that the probability of the outcome moves linearly with
the value of the explanatory variable, no matter how small or large that value is.
2. The probability value must lie between 0 and 1, yet there is no guarantee that
the estimated probability values from the LPM will lie within these limits.
3. The usual assumption that the error term is normally distributed cannot
hold when the dependent variable takes only values of 0 and 1, since it follows the
binomial distribution.
4. The error term in the LPM is heteroscedastic, making the traditional significance tests suspect.
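To see problem 2 in practice, one can fit the same binary-outcome model by OLS (the LPM) and by logit and compare the fitted probabilities. A minimal Stata sketch, assuming a hypothetical dataset with a 0–1 home-ownership indicator own and an income variable income:

* linear probability model: OLS applied directly to the 0-1 dependent variable
regress own income
* fitted values from the LPM; nothing keeps them inside [0, 1]
predict p_lpm, xb
summarize p_lpm
* logit: fitted probabilities are guaranteed to lie strictly between 0 and 1
logit own income
predict p_logit, pr
summarize p_logit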
Let's use home ownership to explain the basic ideas underlying the logit model. In explaining home ownership in relation to income, the LPM was
Pi = β1 + β2 Xi
where X is income, Pi = E(Yi = 1 | Xi), and Yi = 1 means the family owns a house. But now
consider the following representation of home ownership:
Pi = 1 / [1 + e^(−(β1 + β2 Xi))]
The ratio Pi/(1 − Pi) is the odds ratio in favor of owning a house. Thus, if Pi = 0.8, the odds are 4 to 1 in favor of the family owning a house. Now if we take the natural log of this odds ratio, we obtain a very interesting result, namely,
Li = ln[Pi/(1 − Pi)] = Zi = β1 + β2 Xi
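The step from the logistic probability to the linear logit is pure algebra; spelling it out once in the same notation:

\[
P_i = \frac{1}{1+e^{-Z_i}}, \qquad Z_i = \beta_1 + \beta_2 X_i,
\qquad
1-P_i = \frac{e^{-Z_i}}{1+e^{-Z_i}},
\]
\[
\frac{P_i}{1-P_i} = e^{Z_i}
\quad\Longrightarrow\quad
L_i = \ln\!\left(\frac{P_i}{1-P_i}\right) = Z_i = \beta_1 + \beta_2 X_i .
\]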
To illustrate, consider a logit model of the decision to smoke, estimated on individual-level data with age, education (educ), family income, and the price of cigarettes (pcigs79) as regressors. The variables age and education are highly statistically significant (see their z values) and have the expected signs. As age increases, the value of the logit decreases, perhaps due to health concerns: as people age, they are less likely to smoke. Likewise, more educated people are less likely to smoke, perhaps because they are more aware of the ill effects of smoking. The price of cigarettes has the expected negative sign and is significant at about the 7% level. Ceteris paribus, the higher the price of cigarettes,
the lower is the probability of smoking. Income has no statistically visible impact
on smoking, perhaps because expenditure on cigarettes may be a small proportion
of family income.
The interpretation of the various coefficients is as follows: holding other variables constant, if, for example, education increases by one year, the average logit value goes down by about 0.09; that is, the log of the odds in favor of smoking goes down by about 0.09. Other coefficients are interpreted similarly. But the logit language is not everyday language. What we would like to know is the probability of smoking, given values of the explanatory variables. This can be computed from the logistic probability formula given above.
To illustrate, let's take smoker number 2 in the dataset, who has the following characteristics: age = 28, educ = 15, income = 12,500 and pcigs79 = 60.0. Inserting these values in the logistic probability formula, we obtain:
P = 1 / [1 + e^(−(−0.4935))] = 1 / (1 + e^0.4935) = 0.3782
That is, the probability that a person with the given characteristics is a smoker
is about 38%. Can we compute the marginal effect of an explanatory variable on the probability of smoking, holding all other variables constant? Suppose we want to find out ∂Pi/∂Agei, the effect of a unit change in age on the probability of smoking, holding other variables constant.
This was very straightforward in the LPM, but it is not that simple with logit or
probit models. This is because the change in probability of smoking if age changes
by a unit (say, a year) depends not only on the coefficient of the age variable but
also on the level of probability from which the change is measured. But the latter
depends on values of all the explanatory variables. Eviews and Stata can do this
job readily.
[Stata margins output: marginal effects dy/dx with delta-method standard errors, z statistics, p values, and 95% confidence intervals.]
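For the smoking example, the estimation and the marginal effects can be obtained in Stata roughly as follows. This is a sketch that assumes the Gujarati smoker dataset is in memory with the variable names used in the text (smoker, age, educ, income, pcigs79):

* logit model of the probability of smoking
logit smoker age educ income pcigs79

* average marginal effects of each regressor on Pr(smoker = 1), with delta-method SEs
margins, dydx(*)

* marginal effects evaluated at the means of the regressors instead
margins, dydx(*) atmeans

* predicted probability of smoking for each individual in the sample
predict p_smoke, pr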
We can also test the null hypothesis that all the coefficients are simultaneously zero with the likelihood ratio (LR) statistic, which is the equivalent of the F test in the linear regression model. Under the null hypothesis that none of the regressors are significant, the LR statistic follows the chi-square distribution with df equal
to the number of explanatory variables: four in our example. As the estimation
result for smoker shows, the value of the LR statistic is about 47.26 and the p
value (i.e. the exact significance level) is practically zero, thus refuting the null
hypothesis. Therefore we can say that the four variables included in the logit model
are important determinants of smoking habits.
Multinomial regression models (MRMs) involve a decision maker who has to choose among several alternatives. We use the term “choice” to represent the alternatives or options that face an individual; the context of the problem will make clear which term we have in mind. One important case is the nominal MRM for chooser-specific or individual-specific data. In this model the choices depend on the characteristics
of the chooser, such as age, income, education, religion, and similar factors. For
example, in educational choices, such as secondary education, a two-year college
education, a four-year college education and graduate school, age, family income,
religion, and parents’ education are some of the variables that will affect the choice.
These variables are specific to the chooser. These types of model are usually es-
timated by multinomial logit (MLM) or multinomial probit models (MPM). The
primary question these models answer is: How do the choosers’ characteristics af-
fect their choosing a particular alternative among a set of alternatives? Therefore
MLM is suitable when regressors vary across individuals.
Suppose there are three mutually exclusive and exhaustive choices, with probabilities π1, π2, and π3. These probabilities must sum to 1, because the sum of the probabilities of mutually exclusive and exhaustive events must be 1. We will call the π's the response probabilities. This means that in
our example if we determine any two probabilities, the third one is determined auto-
matically. In other words, we cannot estimate the three probabilities independently.
Now what are the factors or variables that determine the probability of choosing a
particular option?
In our school choice example we have information on the following variables:
X2 = hscath = 1 if Catholic school graduate, 0 otherwise
X3 = grades = average grade in math, English, and social studies on a 13 point
grading scale, with 1 for the highest grade and 13 for the lowest grade. Therefore,
higher grade-point denotes poor academic performance
X4 = faminc = gross family income in 1991 in thousands of dollars
X5 = famsiz =number of family members
X6 = parcoll = 1 if the most educated parent graduated from college or had an
advanced degree
X7 = 1 if female
X8 = 1 if black
We will use X1 to represent the intercept. Notice some of the variables are
qualitative or dummy (X2, X6, X7, X8) and some are quantitative (X3, X4, X5).
Also note that there will be some random factors that will also affect the choice, and
these random factors will be denoted by the error term in estimating the model.
Generalizing the bivariate logit model discussed in the preceding section, we can
write the multinomial logit model (MLM) as:
πj = e^(αj + βj Xi) / Σ_{k=1}^{3} e^(αk + βk Xi),   j = 1, 2, 3
Notice that we have put the subscript j on the intercept and the slope coefficient to remind us that the values of these coefficients can differ from choice to choice. In
other words, a high school graduate who does not want to go to college will attach
a different weight to each explanatory variable than a high school graduate who
wants to go to a 2-year college or a 4-year college. Likewise, a high school graduate
who wants to go to a 2-year college but not to a 4-year college will attach different
weights (or importance if you will) to the various explanatory variables. Also, keep
in mind that if we have more than one explanatory variable in the model, X will
then represent a vector of variables and β will be a vector of coefficients. So, if we decide to include the seven explanatory variables listed above, we will have seven slope coefficients, and these slope coefficients may differ from choice to choice. In other words, the three probabilities estimated from the formula above may have different coefficients for the regressors. In effect, we are estimating three regressions.
As we noted before, we cannot estimate all three probabilities independently. The common practice in MLM is to choose one category or choice as the base, reference, or comparison category and set its coefficient values to zero. So if we choose the first category (no college) and set α1 = 0 and β1 = 0, we obtain the following estimates of the probabilities for the three choices.
Multinomial logistic regression Number of obs = 1,000
LR chi2(14) = 377.82
Prob > chi2 = 0.0000
Log likelihood = -829.74657 Pseudo R2 = 0.1855
Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
1 (no college)
hscath -14.11493 698.6953 -0.02 0.984 -1383.532 1355.303
grades .6983612 .0574514 12.16 0.000 .5857585 .810964
faminc -.0148641 .0041227 -3.61 0.000 -.0229444 -.0067839
famsiz .0666033 .0720741 0.92 0.355 -.0746593 .2078659
parcoll -1.02433 .2774019 -3.69 0.000 -1.568028 -.4806322
female .0575788 .1964323 0.29 0.769 -.3274214 .442579
black -1.495237 .4170395 -3.59 0.000 -2.312619 -.6778546
_cons -5.008206 .5671367 -8.83 0.000 -6.119774 -3.896638
2 (2-year college)
hscath -15.10527 724.2084 -0.02 0.983 -1434.528 1404.317
grades .3988077 .0446722 8.93 0.000 .3112518 .4863635
faminc -.0050481 .0025969 -1.94 0.052 -.010138 .0000418
famsiz -.0305312 .0652636 -0.47 0.640 -.1584454 .097383
parcoll -.4978009 .2043127 -2.44 0.015 -.8982465 -.0973554
female .199134 .1705162 1.17 0.243 -.1350716 .5333397
black -.9392084 .3788355 -2.48 0.013 -1.681712 -.1967045
_cons -2.739292 .4401899 -6.22 0.000 -3.602048 -1.876536
3 (4-year college)   (base outcome)
A positive coefficient of a regressor suggests increased odds for choice 2 over choice 1, holding all other regressors constant. Likewise, a negative coefficient of a regressor implies that the odds in favor of no college are greater than those of a 2-year college. Thus, from Panel 1 of the table above we observe that if family income increases, the odds of going to a 2-year college increase compared to no college, holding all other variables constant.
Similarly, the negative coefficient of the grades variable implies that the odds in favor of no college are greater than those of a 2-year college, again holding all other variables constant (remember how the grades are coded in this example). A similar interpretation applies to the second panel of the results table above.
To be concrete, let us interpret the coefficient of grade point average. Holding other
variables constant, if the grade point average increases by one unit, the logarithmic
chance of preferring a 2-year college over no college goes down by about 0.2995. In other words, −0.2995 gives the change in ln(π2i/π1i) for a unit change in the grade average. Therefore, if we take the antilog of this change, we obtain π2i/π1i = e^(−0.2995) ≈ 0.7412. That is, the odds in favor of choosing a 2-year college over no college are only about 74%. This outcome might sound counterintuitive, but remember that a higher grade point on a 13-point scale means poorer academic performance. Incidentally, the odds are also known as relative risk ratios (RRR).
Once the parameters are estimated, one can compute the three probabilities,
which is the primary objective of MLM. Since we have 1,000 observations and 7
regressors, it would be tedious to estimate these probabilities for all the individuals.
However, with the appropriate command, Stata can compute such probabilities. The task can also be simplified by computing the three probabilities at the mean values of the regressors. To illustrate, for individual #10, a white male whose parents
did not have advanced degrees and who did not go to a Catholic school, had an
average grade of 6.44, family income of 42.5, and family size 6, his probabilities of
choosing option 1 (no college), or option 2 (a 2-year college) or option 3 (a 4-year
college) were, respectively, 0.2329, 0.2773 and 0.4897; these probabilities add to
0.9999 or almost 1 because of rounding errors.
Thus, for this individual the highest probability was about 0.49 (i.e. a 4-year
college). This individual did in fact choose to go to a 4-year college. Of course, the estimated probabilities do not always match the choices actually made; in several cases the actual choice was different from the choice with the highest estimated probability. That is why it is better to calculate the choice probabilities at the mean values of the variables. We leave it for the reader to compute these probabilities.
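A hedged sketch of how these computations might be done in Stata, using the regressor names listed earlier; the name of the three-category choice variable (psechoice here) and the dataset itself are assumptions:

* multinomial logit with the first category (no college) as the base outcome
mlogit psechoice hscath grades faminc famsiz parcoll female black, baseoutcome(1)

* predicted probabilities of each of the three choices for every individual
* (the choice of base outcome does not affect these probabilities)
predict p_nocoll p_2yr p_4yr, pr

* choice probabilities evaluated at the mean values of the regressors
margins, predict(outcome(1)) atmeans
margins, predict(outcome(2)) atmeans
margins, predict(outcome(3)) atmeans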
When the categories of the dependent variable have a natural ordering, we can use the ordered logit or ordered probit models, which were specifically developed to handle ordinal scale variables. In practice it does not make a great difference whether we use ordinal probit or ordinal logit models.
Several of the estimated coefficients are statistically indistinguishable from zero; prst, however, is significant at the 7% level. The regression coefficients given in the preceding table are ordered log-odds (i.e. logit) coefficients. What do they suggest? Take, for instance, the coefficient of the education variable, about 0.07. If we increase the level of education by a unit (say, a year), the ordered log-odds of being in a higher warmth category increase by about 0.07, holding all other regressors constant. This is true of warmth category 4 over warmth category 3, of category 3 over category 2, and of category 2 over category 1. The other regression coefficients in the preceding table are interpreted similarly. By convention, one of the categories is chosen as the reference category and its intercept value is fixed at zero.
In practice it is often useful to compute the odds ratios to interpret the various coefficients. This can be done easily by exponentiating (i.e. raising e to a given power) the estimated regression coefficients. To illustrate, take the coefficient of the education variable, 0.07. Exponentiating it, we obtain e^0.07 ≈ 1.0725. This means that if we increase education by a unit, the odds in favor of a higher warmth category over a lower category are greater than 1 (about 7 percent higher).
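For completeness, a minimal Stata sketch of an ordered logit of this kind; the variable names warm (the four-category response), ed (education), and prst are assumptions about the dataset discussed in the text:

* ordered logit for the 4-category warmth variable
ologit warm ed prst

* odds ratio implied by the education coefficient (about exp(0.07) = 1.07 in the text's example)
display exp(_b[ed])

* or ask Stata to report odds ratios directly instead of log-odds coefficients
ologit warm ed prst, or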
Chapter 8
Review Questions
Below are questions based on all the chapters covered in this course. Some of the questions may require additional reading. It is worth investing your time in these questions for two reasons: first, working through them is the best (indeed, the only) way to master the subject; second, your final exam will consist of three or four questions of the type you face in this assignment.
Question 1: State with reason whether the following statements are true, false,
or uncertain. Be precise.
a. The t test of significance discussed in this course requires that the sampling distributions of the estimators (the β̂s) follow the normal distribution.
b. Even though the disturbance term in the Classical Linear Regression Model
is not normally distributed, the OLS estimators are still unbiased.
c. If there is no intercept in the regression model, the estimated residuals (the ûi) will not sum to zero.
d. The p value and the size of a test statistic mean the same thing.
e. In a regression model that contains the intercept, the sum of the residuals is
always zero.
f. If a null hypothesis is not rejected, it is true.
g. The higher the value of σ², the larger is the variance of the β̂s.
h. The conditional and unconditional means of a random variable are the same
things.
Question 2: Consider the following regression output:
Ŷi = 0.2033 + 0.6560 Xi
SE = (0.0976) (0.1961)
R² = 0.397   RSS = 0.0544   ESS = 0.0358
where Y = labor force participation rate (LFPR) of women in 1972 and X = LFPR
of women in 1968. The regression results were obtained from a sample of 19 cities
in the United States.
Question 3: Consider the following regression results for a cross section of countries:
ŜPI_i = −17.8 + 33.2 Gini_i
SE = (4.9) (11.8)   R² = 0.16
where SPI is an index of sociopolitical instability, averaged over 1960–1985, and Gini is the Gini coefficient for 1975 or the closest available year within the range 1970–1980. The sample consists of 40 countries. The Gini coefficient is a measure of income inequality; it lies between 0 and 1. The closer it is to 0, the greater the income equality, and the closer it is to 1, the greater the income inequality.
a. How do you interpret this regression?
b. Suppose the Gini coe¢ cient increases from 0.25 to 0.55. By how much does
SPI go up? What does that mean in practice?
c. Is the estimated slope coefficient statistically significant at the 5% level? Show
the necessary calculations.
d. Based on the preceding regression, can you argue that countries with greater income inequality are politically unstable?
Question 4: In a study of turnover in the labor market, James F. Ragan, Jr., obtained the following results for the U.S. economy for the period 1950–I to 1979–IV.
ln Ŷt = 4.47 − 0.34 ln X2t + 1.22 ln X3t + 1.22 ln X4t + 0.80 ln X5t − 0.0055 X6t
t = (4.28) (−5.31) (3.64) (3.10) (1.10) (−3.09)
R² = 0.5370
Can you find out the sample size underlying these results? (Hint: Recall the relationship between R², F, and t values.)
Question 6: From the data for 46 states in the United States for 1992, Baltagi
obtained the following regression results:
log Ĉ = 4.30 − 1.34 log P + 0.17 log Y
SE = (0.91) (0.32) (0.20)   R² = 0.27
log(salary) = 4.32 + 0.280 log(sales) + 0.0174 roe + 0.000 ros
SE = (0.32) (0.035) (0.0041) (0.00054)
R² = 0.27
Question 11: From the annual data for the U.S. manufacturing sector for 1899–
1922, Dougherty obtained the following regression results:
log Ŷ = 2.81 − 0.53 log L + 0.047 t
SE = (1.38) (0.34) (0.021)   R² = 0.97   F = 189.8
where Y = index of real output, K = index of real capital input, L = index of real
labor input, t = time or trend.
Using the same data, he also obtained the following regression:
log(Y/L) = 0.11 + 0.11 log(K/L) + 0.047 t
SE = (0.03) (0.15) (0.006)   R² = 0.65   F = 19.5
to what kind of extra information might be required to solve the estimation problem
they present.”
d. “... any time series regression containing more than four independent variables
results in garbage.”
Question 13: From data for 54 standard metropolitan statistical areas (SMSA),
Demaris estimated the following logit model to explain high murder rate versus low
murder rate:
where O = the odds of a high murder rate, P =1980 population size in thousands,
C = population growth rate from 1970 to 1980, R = reading quotient, and the se
are the asymptotic standard errors.
a. How would you interpret the various coefficients?
b. Which of the coefficients are individually statistically significant?
c. What is the effect of a unit increase in the reading quotient on the odds of having a higher murder rate?
d. What is the effect of a percentage point increase in the population growth rate on the odds of having a higher murder rate?
Question 14: From the household budget survey of 1980 of the Dutch Central
Bureau of Statistics, J. S. Cramer obtained the following logit model based on a
sample of 2,820 households. The purpose of the logit model was to determine car
ownership as a function of (logarithm of) income. Car ownership was a binary
variable: Y = 1 if a household owns a car, zero otherwise.
L̂i = −2.77231 + 0.347582 ln Income
t = (−3.35) (4.05)    χ²(1 df) = 16.681 (p value = 0.0000)
where L̂ is the estimated logit and ln Income is the logarithm of income. The χ² statistic measures the goodness of fit of the model.
a. Interpret the estimated logit model.
b. From the estimated logit model, how would you obtain the expression for the
probability of car ownership?
c. What is the probability that a household with an income of $20,000 will own
a car? And at an income level of $25,000? What is the rate of change of probability
at the income level of $20,000?
d. Comment on the statistical significance of the estimated logit model.