Applied Econometrics Module
Contents

Preface
1 Introduction
1.1 Why Study Econometrics?
1.2 The main objective of this module
1.3 Learning Outcomes
1.4 Prerequisites
1.5 Resources
2 Introduction to Econometrics
2.1 What is Econometrics?
2.2 What Is Regression Analysis?
2.3 Single-Equation Linear Models
2.4 The Stochastic Error Term
2.4.1 The Significance of the Stochastic Disturbance Term
2.5 Few Points on Notations
2.6 The Estimated Regression Equation
2.7 Structures of Economic Data
2.7.1 Cross-Sectional Data
2.7.2 Time Series Data
2.7.3 Pooled Cross Sections
2.7.4 Panel or Longitudinal Data
2.8 Introduction to Stata
(Footnote: The data used in this chapter are from Gujarati, Damodar N. (2012), Econometrics by Example, Palgrave Macmillan. This dataset is posted on the course webpage.)
Preface
The material in this module is designed to cover a single-semester course in applied econometrics for MBA students at the graduate (Masters) level at Addis Ababa University and in most MBA programs elsewhere. The notes are designed to equip students with the basic tools of applied econometrics needed to undertake quantitative research in business and economics, and to read and understand academic journal articles based on quantitative research. In addition, the lecture notes are meant to serve students as tools for conducting their own research in different branches of business and economics. The basic philosophy behind the preparation of the module is that quantitative courses are tools for understanding the literature and conducting rigorous research in business and economics. To this effect, we have tried our best to discuss the business and economic applications of the topics covered in this course. Students are advised to practice the techniques discussed in the material using datasets available online and suitable software. The software used in the material is Stata.
The module is organized into eight chapters. Chapter 1 motivates the course by introducing students to the use of econometrics for applied research in business and economics, referring to some prominent examples in the discipline. The chapter also outlines the prerequisites of the course and what students can expect to gain from it. Chapter 2 deals with the structure of econometrics and introduces one of the basic concepts in econometrics, regression analysis: what it is, how it works, and what researchers hope to gain from it. In Chapter 3 the module introduces one of the most basic and commonly used estimation techniques, the ordinary least squares (OLS) method, while Chapter 4 introduces some basic concepts of the classical linear regression model and the Gauss-Markov theorem. Chapter 5 deals with hypothesis testing and statistical inference: the distributional assumptions about the estimates, the t-test, p-values, and the F-test are discussed there. Chapter 6 presents violations of the statistical assumptions and what to do when the assumptions are violated; the chapter deals with three such violations: multicollinearity, serial correlation, and heteroskedasticity. Chapter 7 introduces the regression methods used when the dependent variable is categorical or limited; accordingly, it covers the linear probability model and the logit, probit, and tobit models. Finally, Chapter 8 briefly introduces time-series econometrics.
Chapter 1
Introduction
1.4 Prerequisites
Econometrics is an interdisciplinary field. It uses insights from economics and business in selecting the relevant variables and models, it uses computer-science methods to collect the data and to solve econometric models, and it uses statistics and mathematics to develop econometric methods that are appropriate for the data and the problem at hand. Accordingly, in this course it is assumed that students have some familiarity with basic concepts of differentiation (calculus) and basic statistical concepts (random variables, samples, populations, measures of central tendency, measures of dispersion, measures of skewness and kurtosis, methods of estimation, properties of estimators, hypothesis testing, and confidence intervals). Most econometrics texts contain these and further prerequisites in their appendices for easy reference.
1.5 Resources
This is one of the standard courses offered at most universities worldwide. As a result, lecture notes, sample exam questions with solutions, and similar material are relatively easy to find if one has access to the internet. The best way to use the internet is not to search for material on the whole course; instead, students are advised to follow the topics in the lecture notes closely and then look for supplementary material on topics for which they feel they need it.
Finally, it is important to note that, as usual, quantitative courses can be mastered only through repeated exercises. Accordingly, students are advised to try all the problems listed at the end of this module and also to practice with data (their own data, data that accompany the textbooks, or data freely available online from the World Bank, the IMF, and other research and teaching institutions). For additional resources, visit the course website at https://sites.google.com/site/sisayrsenbeta/home/econometrics.
Chapter 2
Introduction to Econometrics
In this chapter we discuss some basic issues in applied econometrics. The course assumes that you are familiar with basic concepts of statistics, such as descriptive and inferential statistics, and with a few concepts of probability. Econometrics plays a number of roles in forecasting and in analyzing real data and problems. At the core of these roles, however, is the desire to pin down the magnitudes of effects and test their significance. Economic theory often points to the direction of a causal relationship (if income rises we may expect consumption to rise), but theory rarely suggests an exact magnitude.
Q = \beta_0 + \beta_1 P + \beta_2 P_s + \beta_3 Y_d   (2.1)
The number 0.23 is called an estimated regression coefficient, and it is the ability to estimate these coefficients that makes econometrics valuable. The second use of econometrics is hypothesis testing: the evaluation of alternative theories with quantitative evidence. Much of economics involves building theoretical models and testing them against evidence, and hypothesis testing is vital to that scientific approach. For example, you could test the hypothesis that the product in Equation 2.1 is what economists call a normal good (one for which the quantity demanded increases when disposable income increases). This can be done by applying various statistical tests to the estimated coefficient (0.23) of disposable income (Y_d) in the estimated equation.
At first glance, the evidence would seem to support this hypothesis, because the coefficient's sign is positive, but the "statistical significance" of that estimate would have to be investigated before such a conclusion could be justified. Even though
the estimated coefficient is positive, as expected, it may not be sufficiently different from zero to convince us that the true coefficient is indeed positive. The third and most difficult use of econometrics is to forecast or predict what is likely to happen next quarter, next year, or further into the future, based on what has happened in the past. For example, economists use econometric models to make forecasts of variables like sales, profits, Gross Domestic Product (GDP), and the inflation rate. The accuracy of such forecasts depends in large measure on the degree to which the past is a good guide to the future.
Business leaders and politicians tend to be especially interested in this use of econometrics because they need to make decisions about the future, and the penalty for being wrong (bankruptcy for the entrepreneur and political defeat for the candidate) is high. To the extent that econometrics can shed light on the impact of their policies, business and government leaders will be better equipped to make decisions. For example, if the president of a company that sold the product modeled in Equation 2.1 wanted to decide whether to increase prices, forecasts of sales with and without the price increase could be calculated and compared to help make such a decision.
The following steps are followed in empirical econometric analysis:
1. specifying the models or relationships to be studied
2. collecting the data needed to quantify the models
3. quantifying the models with the data
The specifications used in step 1 and the techniques used in step 3 differ widely between and within disciplines. Choosing the best specification for a given model is a theory-based skill that is often referred to as the "art" of econometrics. There are many alternative approaches to quantifying the same equation, and each approach may produce somewhat different results. The choice of approach is left to the individual econometrician (the researcher using econometrics), but each researcher should be able to justify that choice.
The term regression was introduced into statistics by Francis Galton, who found that the average height of sons of a group of tall fathers was less than their fathers' height and the average height of sons of a group of short fathers was greater than their fathers' height, thus "regressing" tall and short sons alike toward the average height of all men. In the words of Galton, this was "regression to mediocrity."
Econometricians use regression analysis to make quantitative estimates of economic relationships that previously have been purely theoretical in nature. After all, anybody can claim that the quantity demanded of a normal good will increase if the price of that good decreases (holding everything else constant), but not many people can put specific numbers into an equation and estimate by how many units the quantity demanded will increase for each Birr that the price decreases. To predict the direction of the change, you need a knowledge of economic theory and of the general characteristics of the product in question.
To predict the amount of the change, though, you need a sample of data, and you need a way to estimate the relationship. The most frequently used method to estimate such a relationship in econometrics is regression analysis.
Regression analysis is a statistical technique that attempts to "explain" movements in one variable, the dependent variable, as a function of movements in a set of other variables, called the independent (or explanatory) variables, through the quantification of one or more equations. For example, consider again the demand equation:
Q = \beta_0 + \beta_1 P + \beta_2 P_s + \beta_3 Y_d   (2.3)
It is important to note that regression analysis alone cannot confirm causality; it can only test the strength and direction of the quantitative relationships involved.
Even for a correctly specified model such as

Y = \beta_0 + \beta_1 X   (2.4)

there is still going to be some variation in Y that simply cannot be explained by the model. This variation probably comes from sources such as omitted influences, measurement error, incorrect functional form, or purely random and totally unpredictable occurrences. By random we mean something that has its value determined entirely by chance.
Econometricians admit the existence of such inherent unexplained variation ("error") by explicitly including a stochastic (or random) error term in their regression models. A stochastic error term is a term that is added to a regression equation to introduce all of the variation in Y that cannot be explained by the included Xs. It is, in effect, a symbol of the econometrician's ignorance or inability to model all the movements of the dependent variable.
E(Y|X) = \beta_0 + \beta_1 X   (2.6)

which states that the expected value of Y given X, denoted E(Y|X), is a linear function of the independent variable (or variables, if there are more than one).
Unfortunately, the value of Y observed in the real world is unlikely to be exactly equal to the deterministic expected value E(Y|X). After all, not all 13-year-old girls are 175 cm tall. As a result, the stochastic element must be added to the equation:

Y = E(Y|X) + \epsilon = \beta_0 + \beta_1 X + \epsilon   (2.7)
To get a better feeling for these components of the stochastic error term, let’s
think about a consumption function (aggregate consumption as a function of ag-
gregate disposable income). First, consumption in a particular year may have been
less than it would have been because of uncertainty over the future course of the
economy. Since this uncertainty is hard to measure, there might be no variable
measuring consumer uncertainty in the equation. In such a case, the impact of the
omitted variable (consumer uncertainty) would likely end up in the stochastic error
term.
Second, the observed amount of consumption may have been different from the
actual level of consumption in a particular year due to an error (such as a sampling
error) in the measurement of consumption in the National Income Accounts. Third,
the underlying consumption function may be nonlinear, but a linear consumption
function might be estimated.
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i,   i = 1, 2, ..., N   (2.8)
where Y_i is the ith observation of the dependent variable, X_i is the ith observation of the independent variable, \epsilon_i is the ith observation of the stochastic error term, \beta_0 and \beta_1 are the regression coefficients, and N is the number of observations. That is, the regression model is assumed to hold for each observation. The coefficients do not change from observation to observation, but the values of Y, X, and \epsilon do. A second notational addition allows for more than one independent variable. Since more than one independent variable is likely to have an effect on the dependent variable, our notation should allow these additional explanatory Xs to be added. If we define:
X_{1i} = the ith observation of the first independent variable,
X_{2i} = the ith observation of the second independent variable,
X_{3i} = the ith observation of the third independent variable,
then all three variables can be expressed as determinants of Y.
The resulting equation from the process outlined above is called a multivariate (more than one independent variable) linear regression model:

Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \epsilon_i
The meaning of the regression coefficient \beta_1 in this equation is the impact of a one-unit increase in X_1 on the dependent variable Y, holding constant X_2 and X_3. Similarly, \beta_2 gives the impact of a one-unit increase in X_2 on Y, holding X_1 and X_3 constant. These multivariate regression coefficients (which are parallel in nature to partial derivatives in calculus) serve to isolate the impact on Y of a change in one variable from the impact on Y of changes in the other variables. This is possible because multivariate regression takes the movements of X_2 and X_3 into account when it estimates the coefficient of X_1. The result is quite similar to what we would obtain if we were capable of conducting controlled laboratory experiments in which only one variable at a time was changed.
In the real world, though, it is very difficult to run controlled economic experiments, because many economic factors change simultaneously, often in opposite
directions. Thus the ability of regression analysis to measure the impact of one variable on the dependent variable, holding constant the influence of the other variables in the equation, is a tremendous advantage. Note that if a variable is not included in an equation, then its impact is not held constant in the estimation of the regression coefficients.
An example of multivariate regression: suppose we want to understand how wages are determined in a particular field, perhaps because we think that there might be discrimination in that field. The wage of a worker would be the dependent variable (WAGE), but what would be good independent variables? What variables would influence a person's wage in a given field? Well, there are literally dozens of reasonable possibilities, but three of the most common are the work experience (EXP), education (EDU), and gender (GEND) of the worker, so let's use these. To create a regression equation with these variables, we would redefine the variables in the equation above to match our definitions:
Y = WAGE = the wage of the worker
X_1 = EXP = the years of work experience of the worker
X_2 = EDU = the years of education beyond high school of the worker
X_3 = GEND = the gender of the worker (1 = male and 0 = female)
The last variable, GEND, is unusual in that it can take on only two values, 0 and 1; this kind of variable is called a dummy variable, and it is extremely useful when we want to quantify a concept that is inherently qualitative (like gender). If we substitute these definitions into the equation above, we get:

WAGE_i = \beta_0 + \beta_1 EXP_i + \beta_2 EDU_i + \beta_3 GEND_i + \epsilon_i
where i goes from 1 to N and indicates the observation number. If the sample
consists of a series of years or months (called a time series), then the subscript i is
usually replaced with a t to denote time.
2.7 Structures of Economic Data
Economic data sets come in several different structures, including:
1. Cross-sectional data
2. Time series data
3. Pooled cross sections
4. Panel or longitudinal data
5. Experimental data
2.7.3 Pooled Cross Sections
As an example of a pooled cross section, suppose that two cross-sectional household surveys are taken, one in 1985 and one in 1990. In 1985, a random sample of households is surveyed for variables such as income, savings, family size, and so on. In 1990, a new random sample of households is taken using the same survey questions. To increase our sample size, we can form a pooled cross section by combining the two years.
Pooling cross sections from different years is often an effective way of analyzing the effects of a new government policy. The idea is to collect data from the years before and after a key policy change. As an example, consider the following data set on housing prices taken in 1993 and 1995, before and after a reduction in property taxes in 1994. Suppose we have data on 250 houses for 1993 and on 270 houses for 1995.
Observations 1 through 250 correspond to the houses sold in 1993, and observations 251 through 520 correspond to the 270 houses sold in 1995. Although the order in which we store the data turns out not to be crucial, keeping track of the year for each observation is usually very important. This is why we enter year as a separate variable.
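A minimal Stata sketch of how such a pooled cross section might be assembled is given below; the dataset and variable names here are hypothetical and depend on how the data were saved.

* combine the two (hypothetical) yearly cross sections into one pooled data set
use hprice1993, clear
append using hprice1995
* record the year of each observation with an indicator variable
gen after_reform = (year == 1995)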
A pooled cross section is analyzed much like a standard cross section, except that we often need to account for secular differences in the variables across time. In fact, in addition to increasing the sample size, the point of a pooled cross-sectional analysis is often to see how a key relationship has changed over time.
2.7.4 Panel or Longitudinal Data
Because panel data require replication of the same units over time, panel data sets, especially those on individuals, households, and firms, are more difficult to obtain than pooled cross sections. Not surprisingly, observing the same units over time leads to several advantages over cross-sectional data or even pooled cross-sectional data. The benefit that we will focus on in this text is that having multiple observations on the same units allows us to control for certain unobserved characteristics of individuals, firms, and so on. As we will see, the use of more than one observation can facilitate causal inference in situations where inferring causality would be very difficult if only a single cross section were available. A second advantage of panel data is that they often allow us to study the importance of lags in behavior or the result of decision making. This information can be significant because many economic policies can be expected to have an impact only after some time has passed.
Causality and the Notion of Ceteris Paribus in Econometric Analysis
In most tests of economic theory, and certainly for evaluating public policy, the economist's goal is to infer that one variable (such as education) has a causal effect on another variable (such as worker productivity). Simply finding an association between two or more variables might be suggestive, but unless causality can be established, it is rarely compelling.
The notion of ceteris paribus, which means "other (relevant) factors being equal," plays an important role in causal analysis. This idea has been implicit in some of our earlier discussion, particularly Examples 1.1 and 1.2, but thus far we have not explicitly mentioned it. You probably remember from introductory economics that most economic questions are ceteris paribus by nature. For example, in analyzing consumer demand, we are interested in knowing the effect of changing the price of a good on its quantity demanded, while holding all other factors (such as income, prices of other goods, and individual tastes) fixed. If other factors are not held fixed, then we cannot know the causal effect of a price change on quantity demanded.
Holding other factors fixed is critical for policy analysis as well. In the job training example (Example 1.2), we might be interested in the effect of another week of job training on wages, with all other components being equal (in particular, education and experience). If we succeed in holding all other relevant factors fixed and then find a link between job training and wages, we can conclude that job training has a causal effect on worker productivity. Although this may seem pretty simple, even at this early stage it should be clear that, except in very special cases, it will not be possible to literally hold all else equal. The key question in most empirical studies is: have enough other factors been held fixed to make a case for causality? Rarely is an econometric study evaluated without raising this issue.
In most serious applications, the number of factors that can affect the variable of interest (such as criminal activity or wages) is immense, and the isolation of any particular variable may seem like a hopeless effort. However, we will eventually see that, when carefully applied, econometric methods can simulate a ceteris paribus experiment.
At this point, we cannot yet explain how econometric methods can be used to estimate ceteris paribus effects, so we will consider some problems that can arise
in trying to infer causality in economics. We do not use any equations in this
discussion. For each example, the problem of inferring causality disappears if an
appropriate experiment can be carried out. Thus, it is useful to describe how such
an experiment might be structured, and to observe that, in most cases, obtaining
experimental data is impractical. It is also helpful to think about why the available
data fail to have the important features of an experimental data set.
We rely for now on your intuitive understanding of such terms as random, independence, and correlation, all of which should be familiar from an introductory
probability and statistics course. (These concepts are reviewed in Appendix B.) We
begin with an example that illustrates some of these important issues.
The next example is more representative of the difficulties that arise when inferring causality in applied economics.
Stated informally, the question is posed as follows: if a person is chosen from the population and given another year of education, by how much will his or her wage increase? As with the previous examples, this is a ceteris paribus question, which implies that all other factors are held fixed while another year of education is given to the person.
We can imagine a social planner designing an experiment to get at this issue, much as the agricultural researcher can design an experiment to estimate fertilizer effects. Assume, for the moment, that the social planner has the ability to assign any level of education to any person. How would this planner emulate the fertilizer experiment in Example 1.3? The planner would choose a group of people and randomly assign each person an amount of education; some people are given an eighth-grade education, some are given a high school education, some are given two years of college, and so on. Subsequently, the planner measures wages for this group of people (where we assume that each person then works in a job). The people here are like the plots in the fertilizer example, where education plays the role of fertilizer and the wage rate plays the role of soybean yield. As with Example 1.3, if levels of education are assigned independently of other characteristics that affect productivity (such as experience and innate ability), then an analysis that ignores these other factors will
yield useful results. Again, it will take some effort in a later chapter to justify this claim; for now, we state it without support.
The omitted factors of experience and ability in the wage example have analogs in the fertilizer example. Experience is generally easy to measure and therefore is similar to a variable such as rainfall. Ability, on the other hand, is nebulous and difficult to quantify; it is similar to land quality in the fertilizer example. As we will see throughout this text, accounting for other observed factors, such as experience, when estimating the ceteris paribus effect of another variable, such as education, is relatively straightforward. We will also find that accounting for inherently unobservable factors, such as ability, is much more problematic. It is fair to say that many of the advances in econometric methods have tried to deal with unobserved factors in econometric models.
One final parallel can be drawn between Examples 1.3 and 1.4. Suppose that in the fertilizer example, the fertilizer amounts were not entirely determined at random. Instead, the assistant who chose the fertilizer levels thought it would be better to put more fertilizer on the higher-quality plots of land. (Agricultural researchers should have a rough idea about which plots of land are of better quality, even though they may not be able to fully quantify the differences.) This situation is completely analogous to the level of schooling being related to unobserved ability in Example 1.4. Because better land leads to higher yields, and more fertilizer was used on the better plots, any observed relationship between yield and fertilizer might be spurious.
Difficulty in inferring causality can also arise when studying data at fairly high levels of aggregation, as the next example on city crime rates shows.
Example: The Effect of Law Enforcement on City Crime Levels
The issue of how best to prevent crime has been, and will probably continue to be, with us for some time. One especially important question in this regard is: Does the presence of more police officers on the street deter crime?
The ceteris paribus question is easy to state: If a city is randomly chosen and given, say, ten additional police officers, by how much would its crime rates fall? Another way to state the question is: If two cities are the same in all respects, except that city A has ten more police officers than city B, by how much would the two cities' crime rates differ?
It would be virtually impossible to find pairs of communities identical in all respects except for the size of their police force. Fortunately, econometric analysis does not require this. What we do need to know is whether the data we can collect on community crime levels and the size of the police force can be viewed as experimental. We can certainly imagine a true experiment involving a large collection of cities where we dictate how many police officers each city will use for the upcoming year.
Although policies can be used to affect the size of police forces, we clearly cannot tell each city how many police officers it can hire. If, as is likely, a city's decision on how many police officers to hire is correlated with other city factors that affect crime, then the data must be viewed as nonexperimental. In fact, one way to view this problem is to see that a city's choice of police force size and the amount of crime are simultaneously determined. We will explicitly address such problems in a later chapter.
The first three examples we have discussed have dealt with cross-sectional data at various levels of aggregation (for example, at the individual or city levels). The same hurdles arise when inferring causality in time series problems.
Even when economic theories are not most naturally described in terms of causality, they often have predictions that can be tested using econometric methods.
2.8 Introduction to Stata
B. Of the five preceding windows, the Command and Output windows are probably the most important for ongoing analyses.
1. The Review, Variables, and Properties windows are intended primarily to
keep track of information that you have already provided to the Stata system.
2. You can insert commands from the Review window and variable names from
the Variables window into the Command window in order to save yourself some
typing.
C. Some additional windows can appear as needed or in response to particular
Stata commands.
1. The Viewer window displays help files (in response to user requests for assistance) and the Stata log (a user-requested permanent record of the Stata session).
2. The Graph window displays graphs produced from Stata commands.
3. The Data Browser and Data Editor windows allow the user to inspect (and,
with the Data Editor, modify) the contents of the current open dataset.
D. There are two general methods that a user can employ to communicate with Stata during a session.
1. The Stata menu system (i.e., "point and click").
2. Entering commands through the Command window.
E. Generally, it is better to use commands rather than menus.
1. With commands, it is much easier to keep a record of your steps and (if necessary) reproduce the contents of your analysis.
2. Commands must be used in Stata do-files (see below).
II. Some Basic Features and Rules of Stata
A. The user interacts with Stata by issuing commands that refer to datasets, variables, and other objects (e.g., directories and files outside the Stata system on the computer or the internet).
B. The user refers to each dataset or variable by its name. There are some strict, but easy, rules for creating Stata names.
1. Stata names can be composed of letters, numbers, and the underscore symbol (that is, "_").
2. Stata names can be up to 32 characters long, and the first character must be a letter or the underscore.
3. Within a dataset, each variable must have a unique name.
C. Some advice regarding Stata names:
3. Use the Stata Command entry on the Stata Help menu to find out about specific Stata commands. This is useful when you know what you want to do, but do not remember the command syntax or the available options (e.g., how do I use Stata's ttest command?).
4. When you use the Search or Stata Command items on the Help menu, Stata
returns information in the Viewer window.
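Help on a specific command can also be requested by typing the help command directly in the Command window, for example:

help ttest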
III. The Stata Session
A. A basic Stata session has three parts.
1. Read a dataset into Stata.
2. Modify the contents of the dataset (if necessary).
3. Perform the statistical analysis (or other data analysis task).
B. After data are read into Stata, the other two steps can be carried out repeatedly and in different orders (i.e., the user may want to perform an analysis and then modify the data before performing another analysis, etc.).
C. Within a Stata session only one dataset can be active at any time.
1. In order to use a second dataset within a single Stata session, the first dataset must be removed, using the clear command.
2. If the first dataset has been modified during the course of the Stata session, Stata will ask whether you want to save the dataset before clearing it. If you do not save the dataset, any changes you have made to its contents will be lost.
3. Hint: If you have modified your dataset during the course of the Stata session, save it under a new name (i.e., issue the command save newname before issuing the clear command). That way, you will have both the original, unmodified dataset and the newly modified version that you just created.
4. After you have cleared the first dataset, you can read in another dataset and continue through the other two steps of the Stata session (i.e., data modifications and statistical analyses) with the new dataset.
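This workflow might look like the following in the Command window; the dataset names below are hypothetical.

save mydata_v2        // save the modified data under a new name
clear                 // remove the current dataset from memory
use otherdata         // read the next dataset into the session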
D. Contents of the Results window during the Stata session
1. Commands, responses to commands from Stata, additional information, and
the results from statistical analyses are all printed out in the Results window.
2. Stata fills the Results window one screen at a time. If there is more content than will fit into the window, Stata stops providing the information and asks if you want to proceed by displaying -more- at the bottom of the screen.
a. Typing <enter> will make Stata show one more line of output.
b. Typing any other key (say, the space bar) will make Stata continue to produce the output.
c. Many Stata commands produce more than one screen of output, so the -more- condition will occur frequently over the course of an interactive Stata session.
IV. Reading Data into the Stata Session
A. Data can be read into Stata from a previously created and saved Stata dataset, or it can be read in "raw" form from a text file.
1. If the data are contained in a text file, then the variables must be assigned Stata names and the user must indicate to Stata whether each variable has numeric or character values. This process is called "data definition."
2. If the data consist of a previously stored Stata dataset, they will be contained in an electronic file with a ".dta" extension (Stata added this file extension when the dataset was saved). In this case, the data definition has already been completed, and the user only needs to retrieve the dataset into the Stata system.
B. Although not absolutely necessary, it is almost always useful to change the working directory at the beginning of a Stata session.
1. The working directory is the location in which Stata looks for external data files. If you create any new files during the course of your Stata session (e.g., datasets, log files, saved graphs, etc.), they will be written to the working directory unless you explicitly specify otherwise. Stata usually sets the default working directory to c:\data.
2. The cd command is used to change the working directory. Thus, if your data files are stored in the subdirectory "datasets", within the "pls802" directory on a flash drive that is identified as "g:" on the computer, you would probably begin your Stata session with the following command:

cd "g:\pls802\datasets"

The path to the working directory must be enclosed in double quotes if there are any internal blanks within any of the directory names, so the quotes are not really necessary here. But they don't hurt, either.
C. The use command reads a previously stored Stata dataset into the current session.
1. If the Stata dataset is named mydata, then it will be stored in a file named mydata.dta. If this dataset is contained in the working directory, then the command to retrieve it into the current session would be: use mydata. Note that the file extension (.dta) is not used in this command (Stata distinguishes the dataset from the file in which the dataset is stored).
2. If the Stata dataset is not contained in the current working directory, then the use command must include the full path to the dataset. This might look like the following: use g:\pls828\datasets\mydata
D. If the data are stored in "raw" form, they should be contained within an ASCII text file (usually with file extension ".txt") and the information should be arranged within the file as follows:
1. There should be one line of data per observation, and each line should end
with a hard return.
2. The variable values are given in the same order for every observation (and
each observation must have the same number of variable values).
3. There is whitespace (i.e., at least one blank space) between each adjacent pair of variable values.
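As a sketch, a whitespace-delimited text file laid out this way could be read with the infile command, naming the variables (and giving a type for any string variable) as they are read; the variable and file names below are hypothetical.

* read four hypothetical variables per observation from a raw text file
infile id age str10 region income using survey.txt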
There are two broad types of analysis commands in Stata.
1. Many commands provide descriptive analyses. For these, the user issues
the command and Stata prints the results into the Results window, completing
the analysis. Examples of descriptive analysis commands include summarize and
correlate.
2. Other commands estimate the parameters of statistical models. For these, the model estimates are retained in memory until another model is estimated. The estimates can be recalled to the Results window very easily (perhaps using different options) and supplementary operations can be carried out on the model, using Stata's post-estimation commands. The most important model estimation command (for purposes of this course, anyway) is regress.
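The distinction might look like this in practice; the variable names below are hypothetical.

* descriptive commands print their results and are done
summarize wage educ exper
correlate wage educ

* estimation commands leave their results in memory for post-estimation work
regress wage educ exper
predict wage_hat        // fitted values from the model just estimated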
C. Analyzing data by subgroups
1. Use the by prefix in order to have an analysis carried out separately on subsets of the data defined by the values of another variable. For example: by region: summarize gnp policy. In the preceding Stata command, summary statistics on the variables gnp and policy would be calculated separately for subgroups of observations defined by the distinct values of the variable region. Of course, all three of these variables must exist in the current Stata dataset.
2. In order to use the by prefix, the dataset must be sorted by the values of the variable used to define the subgroups. There are three ways to do this. First, precede the analysis command (summarize in this example) with the sort command:
sort region
by region: summarize gnp policy
Second, use the sort option in the by prefix:
by region, sort: summarize gnp policy
Third, use bysort rather than by in the prefix:
bysort region: summarize gnp policy
All of these approaches would produce identical results.
D. Analyzing a single subset of the data
1. Use the if qualifier to restrict the analysis to a subset of the current dataset defined by a logical condition. For example: summarize gnp if region == "south". The preceding expression would calculate summary statistics only for those observations in which the value of the variable region is south.
2. The general form of this qualifier is the word if, followed by a logical condition. Stata will restrict the analysis specified by the command to those observations for which the expression evaluates to TRUE.
3. While there are ways to combine the use of the by prefix and the if qualifier in a single command, it is generally not a good idea to do so.
VII. Creating, Saving, and Viewing a Session Log
A. The contents of the Stata Results window provide a record of the Stata session.
But there are two potential drawbacks to the default operation of the Results
window:
1. The contents of the Results window are stored in memory. Most computers
have a limited amount of memory available. When Stata runs out of available
memory, it truncates the oldest elements from the current Results window. This
can be problematic in a lengthy Stata session.
2. The contents of the Results window are lost when the Stata session ends.
B. In order to overcome the preceding problems, it is a good idea to save the contents of the Stata session to a separate file called a "Stata log."
1. The command to begin creating a Stata log is: log using filename
In this command, filename is the name of a new file. Do not use a file extension, because Stata will add its own extension (".smcl") onto the file's name.
2. Once the log file is opened, everything that appears in the Results window will be written to the file (it will still be displayed in the Results window, as well).
3. To stop sending the contents of the Stata session to the log file, issue the following command: log close
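A typical logged session therefore has the following shape; the log-file and variable names below are hypothetical.

log using session1      // Stata creates session1.smcl in the working directory
summarize wage educ
regress wage educ
log close               // stop writing to the log file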
C. The contents of the Stata log can be examined in the Stata Viewer window. The easiest way to do this is to click the Log item from the File menu, select View from the submenu that appears, and then browse to the file containing the log (remember that this file will have the filename that you assigned it, with an extension of .smcl).
D. The contents of the Stata log can be saved to an ASCII text file.
1. The easiest way to do this is to click the Log item from the File menu, select Translate from the submenu that appears, and then browse to the file containing the log (remember that this file will have the filename that you assigned it, with an extension of .smcl) in the "Input File" box.
2. Next, type in a new file name in the "Output File" box. When you click "Translate," the .smcl file will be translated to a text file with the extension .log. This is a regular ASCII text file that can be opened in any word processor (e.g., MS Word) or text editor (e.g., Notepad in Windows).
3. Note that the contents of a translated Stata log should be viewed in a fixed-width font, such as Courier. A relatively small size (e.g., 9 points) is best in order to avoid unnecessary line wrapping.
VIII. Using Do-files to Submit (and Save) Commands
A. Up until now, it has been assumed that the user is working with Stata in "interactive" mode; that is, typing in one command at a time in the Command window. While it is certainly possible to do this in "serious" analysis contexts, there are several reasons for not doing so.
1. In a long Stata session, it is difficult to keep track of earlier commands and steps in the course of the analysis.
2. The commands, themselves, are not saved. This is problematic because most analyses will have to be run several times. Even if some changes will be made in
Chapter 3
Ordinary Least Squares
3.1 Introduction
The bread and butter of regression analysis is the estimation of the coefficients of econometric models using a technique called Ordinary Least Squares (OLS). The first two sections of this chapter summarize the reasoning behind and the mechanics of OLS. Regression users rely on computers to do the actual OLS calculations, so the emphasis here is on understanding what OLS attempts to do and how it goes about doing it.
How can you tell a good equation from a bad one once it has been estimated? There are a number of useful criteria, including the extent to which the estimated equation fits the actual data. A focus on fit is not without perils, however, so we share an example of the misuse of this criterion.
6. Define the elasticity of y with respect to x and explain its computation in the simple linear regression model when y and x are not transformed in any way, and when y and/or x have been transformed to model a nonlinear relationship.
7. Explain the meaning of the statement "If regression model assumptions SR1-SR5 hold, then the least squares estimator b_2 is unbiased." In particular, what exactly does "unbiased" mean? Why is b_2 biased if an important variable has been omitted from the model?
8. Explain the meaning of the phrase "sampling variability."
9. Explain how the factors \sigma^2, \sum (x_i - \bar{x})^2, and N affect the precision with which we can estimate the unknown parameter b_2.
10. State and explain the Gauss-Markov theorem.
11. Use the least squares estimator to estimate nonlinear relationships and interpret the results.
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i   (3.1)

\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i   (3.2)
The purpose of the estimation technique is to obtain numerical values for the coefficients of an otherwise completely theoretical regression equation.
The most widely used method of obtaining these estimates is Ordinary Least Squares (OLS), which has become so standard that its estimates are presented as a point of reference even when results from other estimation techniques are used.
Ordinary Least Squares (OLS) is a regression estimation technique that calculates the \hat{\beta}s so as to minimize the sum of the squared residuals, thus:
OLS minimizes \sum_{i=1}^{N} e_i^2   (i = 1, ..., N)   (3.3)

Since these residuals (the e_i s) are the differences between the actual Ys and the estimated Ys produced by the regression (the \hat{Y}_i s in Equation 3.2), Equation 3.3 is equivalent to saying that OLS minimizes

\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2   (3.4)
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
OLS selects those estimates of \beta_0 and \beta_1 that minimize the squared residuals, summed over all the sample data points. For an equation with just one independent variable, these coefficients are:

\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}   (3.5)

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}   (3.6)
Note that for each different data set, we will get different estimates of \beta_0 and \beta_1, depending on the sample.
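In Stata, these calculations are carried out by the regress command; a minimal sketch with hypothetical variables y and x is:

* OLS with a single regressor
regress y x
display _b[x]          // the estimated slope coefficient (beta_1 hat)
display _b[_cons]      // the estimated intercept (beta_0 hat)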
For Ordinary Least Squares, the total sum of squares has two components, variation that can be explained by the regression and variation that cannot:

\sum_{i=1}^{N} (Y_i - \bar{Y})^2 = \sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{N} e_i^2   (3.9)

Total Sum of Squares (TSS) = Explained Sum of Squares (ESS) + Residual Sum of Squares (RSS). This is usually called the decomposition of variance.
If the bread and butter of regression analysis is OLS estimation, then the heart and soul of econometrics is figuring out how good these OLS estimates are. No
one estimated model represents the truth any more than another, but evaluating the quality of the fit of the equation is one ingredient in a choice between different formulations of a regression model. Be careful, however! The quality of the fit is a minor ingredient in this choice, and many beginning researchers allow themselves to be overly influenced by it. The simplest commonly used measure of fit is R^2, or the coefficient of determination. R^2 is the ratio of the explained sum of squares to the total sum of squares:
R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2}   (3.10)
The higher R^2 is, the closer the estimated regression equation fits the sample data. Measures of this type are called "goodness of fit" measures. R^2 measures the percentage of the variation of Y around \bar{Y} that is explained by the regression equation. Since OLS selects the coefficient estimates that minimize RSS, OLS provides the largest possible R^2, given a linear model. Since TSS, RSS, and ESS are all nonnegative (being squared deviations), and since ESS \le TSS, R^2 must lie in the interval 0 \le R^2 \le 1.
A value of R^2 close to one shows an excellent overall fit, whereas a value near zero shows a failure of the estimated regression equation to explain the values of Y_i better than could be explained by the sample mean \bar{Y}.
Figure 3.3: A set of data for X and Y that can be “explained” quite well with a
regression line
\bar{R}^2 (the adjusted R^2) measures the percentage of the variation of Y around its mean that is explained by the regression equation, adjusted for degrees of freedom. \bar{R}^2 will increase, decrease, or stay the same when a variable is added to an equation, depending on whether the improvement in fit caused by the addition of the new variable outweighs the loss of the degree of freedom. An increase in \bar{R}^2 indicates that the marginal benefit of adding a variable exceeds the cost, while a decrease in \bar{R}^2 indicates that the marginal cost exceeds the benefit. The highest possible \bar{R}^2 is 1.00, the same as for R^2. The lowest possible \bar{R}^2, however, is not 0.00; if R^2 is extremely low, \bar{R}^2 can be slightly negative. \bar{R}^2 can be used to compare the fits of equations with the same dependent variable and different numbers of independent variables.
2
Because of this property, most researchers automatically use R instead of R2
when evaluating the …t of their estimated regression equations. Note, however, that
2
R is not as useful when comparing the …ts of two equations that have di¤erent
dependent variables or dependent variables that are measured di¤erently. Finally,
always remember that the quality of …t of an estimated equation is only one measure
of the overall quality of that regression. As mentioned previously, the degree to
which the estimated coe¢ cients conform to economic theory and the researcher’s
previous expectations about those coe¢ cients are just as important as the …t itself.
For instance, an estimated equation with a good …t but with an implausible sign for
an estimated coe¢ cient might give implausible predictions and thus not be a very
useful equation. Other factors, such as theoretical relevance and usefulness, also
come into play.
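After a regression, Stata reports both measures of fit, and they can also be recovered from the stored results; a brief sketch with hypothetical variables:

regress y x1 x2
display e(r2)          // R-squared
display e(r2_a)        // adjusted R-squared (R-bar squared)
* R-squared reproduced from the sums of squares, ESS/(ESS + RSS)
display e(mss)/(e(mss) + e(rss))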
Although there are no hard and fast rules for conducting econometric research, most investigators commonly follow a standard method for applied regression analysis. The relative emphasis and effort expended on each step will vary, but normally all the steps are necessary for successful research. Note that we do not discuss the selection of the dependent variable; this choice is determined by the purpose of the research. Once a dependent variable is chosen, however, it is logical to follow these six steps in applied regression analysis:
1. Review the literature and develop the theoretical model.
2. Specify the model: select the independent variables and the functional form.
3. Hypothesize the expected signs of the coefficients.
4. Collect the data; inspect and clean the data.
5. Estimate and evaluate the equation.
6. Document the results.
The first step in any applied research is to get a good theoretical grasp of the topic to be studied. That's right: the best data analysts don't start with data, but with theory! This is because many econometric decisions, ranging from which variables to include to which functional form to employ, are determined by the underlying theoretical model. It's virtually impossible to build a good econometric model without a solid understanding of the topic you're studying.
For most topics, this means that it's smart to review the scholarly literature before doing anything else. If a professor has investigated the theory behind your topic, you want to know about it. If other researchers have estimated equations for
your dependent variable, you might want to apply one of their models to your data set. On the other hand, if you disagree with the approach of previous authors, you might want to head off in a new direction. In either case, you shouldn't have to "reinvent the wheel." You should start your investigation where earlier researchers left off. Any academic paper on an empirical topic should begin with a summary of the extent and quality of previous research.
The most convenient approaches to reviewing the literature are to obtain several recent issues of the Journal of Economic Literature or a business-oriented publication of abstracts, or to run an Internet search or an EconLit search on your topic. Using these resources, find and read several recent articles on your topic. Pay attention to the bibliographies of these articles. If an older article is cited by a number of current authors, or if its title hits your topic on the head, trace back through the literature and find this article as well.
In some cases, a topic will be so new or so obscure that you won't be able to find any articles on it. What then? We recommend two possible strategies. First, try to transfer theory from a similar topic to yours. For example, if you're trying to build a model of the demand for a new product, read articles that analyze the demand for similar, existing products. Second, if all else fails, contact someone who works in the field you're investigating. For example, if you're building a model of housing in an unfamiliar city, call a real estate agent who works there.
2. Specify the model: Select the independent variables and the functional form.
The most important step in applied regression analysis is the specification of the theoretical regression model. After selecting the dependent variable, the specification of a model involves choosing the following components:
1. the independent variables and how they should be measured,
2. the functional (mathematical) form of the variables, and
3. the properties of the stochastic error term.
A regression equation is specified when each of these elements has been treated appropriately.
Each of the elements of specification is determined primarily on the basis of economic theory. A mistake in any of the three elements results in a specification error. Of all the kinds of mistakes that can be made in applied regression analysis, specification error is usually the most disastrous to the validity of the estimated equation. Thus, the more attention paid to economic theory at the beginning of a project, the more satisfying the regression results are likely to be. The emphasis in this text is on estimating behavioral equations, those that describe the behavior of economic entities. We focus on selecting independent variables based on the economic theory concerning that behavior. An explanatory variable is chosen because economic theory suggests that it is an important determinant of the dependent variable.
Once the variables have been selected, it's important to hypothesize the expected signs of the slope coefficients before you collect any data. In many cases, the basic theory is general knowledge, so you don't need to discuss the reasons for the expected sign. However, if any doubt surrounds the choice of an expected sign, then you should document the opposing theories and your reasons for hypothesizing a positive or a negative slope coefficient.
Obtaining an original data set and properly preparing it for regression is a surprisingly difficult task. This step entails more than a mechanical recording of data, because the type and size of the sample also must be chosen.
A general rule regarding sample size is "the more observations the better," as long as the observations are from the same general population. Ordinarily, researchers take all the roughly comparable observations that are readily available. In regression analysis, all the variables must have the same number of observations. They also should have the same frequency (monthly, quarterly, annual, etc.) and time period. Often, the frequency selected is determined by the availability of data.
The reason there should be as many observations as possible concerns the statistical concept of degrees of freedom, first mentioned in a previous section. Consider fitting a straight line to two points on an X, Y coordinate system. If there are only two points in a data set, a straight line can be fitted to those points mathematically without error, because two points completely determine a straight line. Both points lie on the line, so there is no estimation of the coefficients involved. The two points determine the two parameters, the intercept and the slope, precisely. Estimation takes place only when a straight line is fitted to three or more points that were generated by some process that is not exact. The excess of the number of observations (three) over the number of coefficients to be estimated (in this case two, the intercept and slope) is the degrees of freedom. All that is necessary for estimation is a single degree of freedom, but the more degrees of freedom there are, the better.
This is because when the number of degrees of freedom is large, every positive error is likely to be balanced by a negative error. When degrees of freedom are low, the random element is likely to fail to provide such offsetting observations. For example, the more a coin is flipped, the more likely it is that the observed proportion of heads will reflect the true probability of 0.5.
Another area of concern has to do with the units of measurement of the variables. Does it matter if a variable is measured in dollars or thousands of dollars? Does it matter if the measured variable differs consistently from the true variable by
Believe it or not, it can take months to complete steps 1-4 for a regression equation, but a computer program like Stata or EViews can estimate that equation in less than a second! Typically, estimation is done using OLS, but if another estimation technique is used, the reasons for that alternative technique should be carefully explained and evaluated.
You might think that once your equation has been estimated, your work is finished, but that's hardly the case. Instead, you need to evaluate your results in a variety of ways. How well did the equation fit the data? Were the signs and magnitudes of the estimated coefficients what you expected? Most of the rest of this book is concerned with the evaluation of estimated econometric equations, and beginning researchers should be prepared to spend a considerable amount of time doing this evaluation.
Once this evaluation is complete, don't automatically go to step 6. Regression results are rarely what one expects, and additional model development often is required. For example, an evaluation of your results might indicate that your equation is missing an important variable. In such a case, you'd go back to step 1 to review the literature and add the appropriate variable to your equation. You'd then go through each of the steps in order until you had estimated your new specification in step 5. You'd move on to step 6 only if you were satisfied with your estimated equation. Don't be too quick to make such adjustments, however, because we don't want to adjust the theory merely to fit the data. A researcher has to walk a fine line between making appropriate changes and avoiding inappropriate ones, and making these choices is one of the artistic elements of applied econometrics.
Finally, it’s often worthwhile to estimate additional speci…cations of an equa-
tion in order to see how stable your observed results are. This approach is called
sensitivity analysis.
amount more per hour than women. The difference does not depend on the amount of education, and this explains why the wage-education profiles for women and men are parallel.
At this point, you may wonder why we do not also include in (3.13) a dummy variable, say male, which is one for males and zero for females. This would be redundant. In (3.13), the intercept for males is β0, and the intercept for females is β0 + δ0. Because there are just two groups, we only need two different intercepts. This means that, in addition to β0, we need to use only one dummy variable; we have chosen to include the dummy variable for females. Using two dummy variables would introduce perfect collinearity because female + male = 1, which means that male is a perfect linear function of female. Including dummy variables for both genders is the simplest example of the so-called dummy variable trap, which arises when too many dummy variables describe a given number of groups. We will discuss this problem in detail later.
In (3.13), we have chosen males to be the base group or benchmark group, that is, the group against which comparisons are made. This is why β0 is the intercept for males, and δ0 is the difference in intercepts between females and males. We could choose females as the base group by writing the model as

wage = α0 + γ0 male + β1 educ + ... + u

where the intercept for females is α0 and the intercept for males is α0 + γ0.
If educ, exper, and tenure are all relevant productivity characteristics, the null hypothesis of no difference between men and women is H0: δ0 = 0. The alternative that there is discrimination against women is H1: δ0 < 0.
How can we actually test for wage discrimination? The answer is simple: just estimate the model by OLS, exactly as before, and use the usual t statistic. Nothing changes about the mechanics of OLS or the statistical theory when some of the independent variables are defined as dummy variables. The only difference with what we have done up until now is in the interpretation of the coefficient on the dummy variable. We will come back to this question in the chapter on hypothesis testing.
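As a concrete illustration, the Stata sketch below estimates a wage equation of this form. The variable names (wage, female, educ, exper, tenure) are assumptions about how a typical wage data set might be coded, not a data set supplied with this module.

* A minimal sketch of estimating a wage equation with a female dummy.
* Load your own data set first, for example:
* use wagedata.dta, clear
regress wage female educ exper tenure
* The coefficient on female estimates delta_0, the female-male intercept
* difference holding educ, exper, and tenure constant; its t statistic and
* p-value are what the discrimination test above uses.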
Chapter 4
Classical Linear Regression Model
This chapter discusses the assumptions that underlie the classical OLS regression method introduced in the preceding chapter.
4.2 Introduction
The term classical refers to a set of fairly basic assumptions required to hold in order for OLS to be considered the "best" estimator available for regression models. When
one or more of these assumptions do not hold, other estimation techniques (such as
Generalized Least Squares) may be better than OLS. As a result, one of the most
important jobs in regression analysis is to decide whether the classical assumptions
hold for a particular equation. If so, the OLS estimation technique is the best
available. Otherwise, the pros and cons of alternative estimation techniques must
be weighed. These alternatives usually are adjustments to OLS that take account of
the particular assumption that has been violated. In a sense, most of the rest of the
study in econometrics deals in one way or another with the question of what to do
when one of the classical assumptions is not met. Since econometricians spend so
much time analyzing violations of them, it is crucial that they know and understand
these assumptions.
The regression model is linear, is correctly specified, and has an additive error term.
Observations of the error term are uncorrelated with each other (no serial correlation).
The error term is normally distributed (this assumption is optional but usually is invoked).
The assumption that the regression model is linear does not require the underlying theory to be linear. For example, an exponential function:

Yi = e^(β0) Xi^(β1) e^(εi)     (4.2)

where e is the base of the natural log, can be transformed by taking the natural log of both sides of the equation:

ln(Yi) = β0 + β1 ln(Xi) + εi     (4.3)

Let ln(Yi) = Yi' and ln(Xi) = Xi'; then the above equation can be written as

Yi' = β0 + β1 Xi' + εi     (4.4)
In Equation (4.4), the properties of the OLS estimator of the βs still hold because the equation is linear.
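In Stata, estimating such a double-log specification simply means generating the logged variables and running OLS on them. The sketch below uses placeholder variable names (y and x) rather than a specific data set from the module.

* A minimal sketch of estimating the double-log form (4.4) by OLS.
* y and x are illustrative names; load your own data first.
gen lny = ln(y)
gen lnx = ln(x)
regress lny lnx
* The slope coefficient on lnx estimates beta_1, which in a double-log
* model is interpreted as an elasticity.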
Two additional properties also must hold. First, we assume that the equation is correctly specified. If an equation has an omitted variable or an incorrect functional form, the odds are against that equation
working well. Second, we assume that a stochastic error term has been added to the equation. This error term must be an additive one and cannot be multiplied by or divided into any of the variables in the equation. As was pointed out in our previous discussions, econometricians add a stochastic (random) error term to regression equations to account for variation in the dependent variable that is not explained by the independent variables included in the model. The specific value of the error term for each observation is determined purely by chance. Probably the best way to picture this concept is to think of each observation of the error term as being drawn from a random variable distribution such as the one illustrated in the figure below.
Classical Assumption II says that the mean of this distribution is zero. That is, when the entire population of possible values for the stochastic error term is considered, the average value of that population is zero. For a small sample, it is not likely that the mean is exactly zero, but as the size of the sample approaches infinity, the mean of the sample approaches zero. What happens if the mean does not equal zero in a sample? As long as you have a constant term in the equation, the estimate of β0 will absorb the non-zero mean.
In essence, the constant term equals the fixed portion of Y that cannot be explained by the independent variables, and the error term equals the stochastic portion of the unexplained value of Y.
Observations of stochastic error terms are assumed to be drawn from a random variable distribution with a mean of zero. If Classical Assumption II is met, the expected value (the mean) of the error term is zero.
All explanatory variables are uncorrelated with the error term. It is assumed that the observed values of the explanatory variables are independent of the values of the error term. If an explanatory variable and the error term were instead correlated with each other, the OLS estimates would be likely to attribute to the X some of the variation in Y that actually came from the error term. If the error term and X were positively correlated, for example, then the estimated coefficient would probably be higher than it would otherwise have been (biased upward), because the OLS program would mistakenly attribute the variation in Y caused by ε to X instead. As a result,
it is important to ensure that the explanatory variables are uncorrelated with the
error term. Classical Assumption III is violated most frequently when a researcher
omits an important independent variable from an equation. As we discussed earlier, one of the major components of the stochastic error term is
omitted variables, so if a variable has been omitted, then the error term will change
when the omitted variable changes. If this omitted variable is correlated with an
included independent variable (as often happens in economics), then the error term
is correlated with that independent variable as well. We have violated Assumption
III! Because of this violation, OLS will attribute the impact of the omitted variable
to the included variable, to the extent that the two variables are correlated.
Observations of the error term are drawn independently from each other. If a systematic correlation exists between one observation of the error term and another, then OLS estimates will be less precise than estimates that account for the correlation. For example, if the fact that the error term from one observation is positive increases the probability that the error term from another observation also is positive, then the two observations of the error term are positively correlated. Such a correlation would violate Classical Assumption IV. In economic applications, this assumption is most important in time-series models.
In such a context, Assumption IV says that an increase in the error term in one time period (a random shock, for example) does not show up in or affect in any way the error term in another time period. In some cases, though, this assumption is unrealistic, since the effects of a random shock sometimes last for a number of time periods. If, over all the observations of the sample, εt+1 is correlated with εt, then the error term is said to be serially correlated (or autocorrelated), and Assumption IV is violated.
The variance (or dispersion) of the distribution from which the observations of the error term are drawn is constant. That is, the observations of the error term are assumed to be drawn continually from identical distributions (for example, the one pictured in the figure below).
The alternative would be for the variance of the distribution of the error term
to change for each observation or range of observations.
In the figure below, for example, the variance of the error term is shown to increase as the variable Z increases; such a pattern violates Classical Assumption V.
The actual values of the error term are not directly observable, but the lack of a constant variance for the distribution of the error term causes OLS to generate inaccurate estimates of the standard error of the coefficients. The violation of Assumption V is referred to as heteroskedasticity.
Perfect collinearity between two independent variables implies that they are really the same variable, or that one is a multiple of the other, and/or that a constant has been added to one of the variables. That is, the relative movements of one explanatory variable will be matched exactly by the relative movements of the other even though the absolute size of the movements might differ. Because every movement of one of the variables is matched exactly by a relative movement in the other, the OLS estimation procedure will be incapable of distinguishing one variable from the other. Many instances of perfect collinearity (or multicollinearity if more than two independent variables are involved) are the result of the researcher not accounting for identities (definitional equivalences) among the independent variables. This problem can be corrected easily by dropping one of the perfectly collinear variables from the equation. What is an example of perfect multicollinearity?
Suppose that you decide to build a model of the profits of tire stores in your city and you include annual sales of tires (in dollars) at each store and the annual sales tax paid by each store as independent variables. Since the tire stores are all in the same city, they all pay the same percentage sales tax, so the sales tax paid will be a constant percentage of their total sales (in dollars). If the sales tax rate is 7%, then the total taxes paid will be 7% of sales for each and every tire store. Thus sales tax will be a perfect linear function of sales, and you'll have perfect multicollinearity!
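You can see how OLS reacts to this situation in Stata. The sketch below builds an artificial data set in which salestax is exactly 7 percent of sales (all variable names and numbers are illustrative assumptions); Stata detects the perfect collinearity and omits one of the two variables.

* A minimal sketch of perfect multicollinearity with simulated data.
clear all
set seed 2024
set obs 100
gen sales    = 1000 + 500*rnormal()
gen salestax = 0.07*sales              // an exact linear function of sales
gen profit   = 50 + 0.10*sales + 20*rnormal()
regress profit sales salestax
* Stata notes that one regressor is omitted because of collinearity;
* the two effects cannot be separately estimated.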
Although we have already assumed that observations of the error term are drawn independently (Assumption IV) from a distribution that has a zero mean (Assumption II) and that has a constant variance (Assumption V), we have said little about the shape of that distribution. Classical Assumption VII adds that the error term is normally distributed.
Just as the error term follows a probability distribution, so too do the estimates of β. In fact, each different sample of data typically produces a different estimate of β. The probability distribution of these b values across different samples is called the sampling distribution of b. Recall that an estimator is a formula, such as the OLS formula

b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²     (i = 1, ..., N)     (4.5)
that tells you how to compute b, while an estimate is the value of b computed by the formula for a given sample. Since researchers usually have only one sample, beginning econometricians often assume that regression analysis can produce only one estimate of β for a given population.
In reality, however, each different sample from the same population will produce a different estimate of β. The collection of all the possible samples has a distribution, with a mean and a variance, and we need to discuss the properties of this sampling distribution of b, even though in most real applications we will encounter only a single draw from it. Be sure to remember that a sampling distribution refers to the distribution of different values of b across different samples, not within one.
These bs usually are assumed to be normally distributed, because the normality of the error term implies that the OLS estimates of β are normally distributed as well. For an estimation technique to be "good", the mean of the sampling distribution of the bs it produces should equal the true population β. This property has a special name in econometrics: unbiasedness. Although we do not know the true β in this case, it is likely that if we took enough samples - thousands perhaps - the mean of the bs would approach the true β. The moral of the story is that while a single sample provides a single estimate of β, that estimate comes from a sampling distribution with a mean and a variance. Other estimates from that sampling distribution will most likely be different.
When we discuss the properties of estimators in the next section, it will be important to remember that we are discussing the properties of a sampling distribution, not the properties of one sample.
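The idea of a sampling distribution is easy to see by simulation. The sketch below uses a purely hypothetical data-generating process (intercept 2, slope 1), draws many samples, estimates b1 in each, and summarizes the resulting estimates; their mean should be close to the true slope.

* A minimal simulation sketch of the sampling distribution of the OLS slope.
clear all
set seed 12345
program define olsdraw, rclass
    drop _all
    set obs 50
    gen x = rnormal()
    gen y = 2 + 1*x + rnormal()       // assumed true slope = 1
    regress y x
    return scalar b1 = _b[x]
end
simulate b1 = r(b1), reps(1000) nodots: olsdraw
summarize b1          // mean of the estimates is close to the true slope
histogram b1          // an approximately normal sampling distribution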
E(b) = β     (4.6)

An estimate drawn from an unbiased sampling distribution (one centered around the true value) is more likely to be near the true value (assuming identical variances) than one taken from a distribution not centered around the true value. If an estimator produces bs that are not centered around the true β, the estimator is referred to as a biased estimator. We cannot ensure that every estimate from an unbiased estimator is better than every estimate from a biased one, because a particular unbiased estimate could, by chance, be farther from the true value than a biased estimate might be; this could happen simply by chance or because the biased estimator had a smaller variance.
The variance of the distribution of the bs can be decreased by increasing the size of the sample. This also increases the degrees of freedom, since the number of degrees of freedom equals the sample size minus the number of coefficients or parameters estimated. As the number of observations increases, other things held constant, the variance of the sampling distribution tends to decrease. Although it is not true that a sample of 60 will always produce estimates closer to the true β than a sample of 6, it is quite
likely to do so; such larger samples should be sought. The figure below presents illustrative sampling distributions of bs for 6, 60, and 600 observations for OLS estimators of β when the true β equals 1. The larger samples do indeed produce sampling distributions that are more closely centered around β.
The powerful lesson illustrated by the figure is that if you want to maximize your chances of getting an estimate close to the true value, apply OLS to a large sample. There's no guarantee that you will get a more accurate estimate from a large sample, but your chances are better. Larger samples, all else equal, tend to result in more precise estimates. And if the estimator is unbiased, more precise estimates are more accurate estimates.
In econometrics, we must rely on general tendencies. The element of chance, a random occurrence, is always present in estimating regression coefficients, and some estimates may be far from the true value no matter how good the estimating technique. However, if the distribution is centered on the true value and has as small a variance as possible, the element of chance is less likely to induce a poor estimate. If the sampling distribution is centered around a value other than the true β (that is, if b is biased), then a lower variance implies that most of the sampling distribution of b is concentrated on the wrong value. However, if this value is not very different from the true value, which is usually not known in practice, then the greater precision will still be valuable. One method of deciding whether this decreased variance in the distribution of the bs is valuable enough to offset the bias is to compare different estimation techniques by using a measure called the Mean Square Error (MSE).
The Mean Square Error is equal to the variance plus the square of the bias. The lower the MSE, the better.
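In symbols (a standard decomposition rather than one written out in the original text), for an estimator b of β:

MSE(b) = E[(b − β)²] = VAR(b) + [E(b) − β]²

where the second term is the square of the bias.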
A final item of importance is that as the variance of the error term increases, so too does the variance of the distribution of b. The reason for the increased variance of b is that with the larger variance of εi, the more extreme values of εi are observed
with more frequency, and the error term becomes more important in determining the values of Yi. Since the standard error of the estimated coefficient, SE(b), is the square root of the estimated variance of the bs, it is similarly affected by the size of the sample and the other factors we have mentioned. For example, an increase in sample size will cause SE(b) to fall; the larger the sample, the more precise our coefficient estimates will be.
E(bk) = βk     (k = 0, 1, 2, ..., K)     (4.7)

Best means that each bk has the smallest variance possible (in this case, out of all the linear unbiased estimators of βk). An unbiased estimator with the smallest variance is called efficient, and that estimator is said to have the property of efficiency. Since the variance typically falls as the sample size increases, larger samples almost always produce more accurate coefficient estimates than do smaller ones.
The Gauss-Markov Theorem requires that just the first six of the seven classical assumptions be met. What happens if we add in the seventh assumption, that the error term is normally distributed? In this case, the result of the Gauss-Markov Theorem is strengthened because the OLS estimator can be shown to be the best (minimum variance) unbiased estimator out of all the possible estimators, not just out of the linear estimators. In other words, if all seven assumptions are met, OLS is "BUE."
Given all seven classical assumptions, the OLS coefficient estimators can be shown to have the four properties discussed below.
1. They are unbiased. That is, E(b) = β. This means that the OLS estimates of the coefficients are centered around the true population values of the parameters being estimated.
2. They are of minimum variance. The distribution of the coefficient estimates around the true parameter values is as tightly or narrowly distributed as is possible for an unbiased distribution. No other unbiased estimator has a lower variance for each estimated coefficient than OLS.
3. They are consistent. As the sample size approaches infinity, the estimates converge to the true population parameters. Put differently, as the sample size gets larger, the variance gets smaller, and each estimate approaches the true value of the coefficient being estimated.
4. They are normally distributed. The bs are distributed N(β, VAR(b)). Thus various statistical tests based on the normal distribution may indeed be applied to these estimates, as will be done in the next chapter.
Chapter 5
Hypothesis Testing and Statistical Inference
5.1 Introduction
Explain the terms null hypothesis, alternative hypothesis, and rejection region, giving an example and a sketch of the rejection region.
Explain the term p-value and how to use a p-value to determine the outcome of a hypothesis test; provide a sketch showing a p-value.
Explain the difference between one-tail and two-tail tests. Explain, intuitively, how to choose the rejection region for a one-tail test.
Explain how to choose what goes in the null hypothesis, and what goes in the alternative hypothesis.
5.3 Introduction
Many hypotheses about the world around us can be phrased as yes/no questions. Do the mean monthly earnings of recent Ethiopian college graduates equal ETB 10,000.00 per month? Are mean earnings the same for male and female college graduates? Both these questions embody specific hypotheses about the population distribution of earnings. The statistical challenge is to answer these questions based on a sample of evidence. In this chapter we describe hypothesis tests concerning the population mean (Does the population mean of monthly earnings equal ETB 10,000.00?) and hypothesis tests involving two populations (Are mean earnings the same for men and women?).
To test yourself, take a moment and think about what the null and alternative hypotheses will be if you expect a negative coefficient.

H0: β ≥ 0     (5.3)
HA: β < 0

The above hypotheses are for a one-sided test because the alternative hypothesis has values on only one side of the null hypothesis. Another approach is to use a two-sided test (or a two-tailed test) in which the alternative hypothesis has values on both sides of the null hypothesis.

H0: β = 0     (5.4)
HA: β ≠ 0
Note that the null hypothesis and the alternative hypothesis are jointly exhaustive. Note also that economists always put what they expect in the alternative hypothesis. This allows us to make rather strong statements when we reject a null hypothesis. However, we can never say that we accept the null hypothesis; we must always say that we cannot reject the null hypothesis. As put by one econometrician: just as a court pronounces a verdict as not guilty rather than innocent, so the conclusion of a statistical test is do not reject rather than accept.
We will refer to these errors as Type I and Type II Errors, respectively. Suppose we have the following null and alternative hypotheses:

H0: β ≤ 0     (5.5)
HA: β > 0

Even if the true parameter β is not positive, the particular estimate obtained by a researcher may be sufficiently positive to lead to the rejection of the null hypothesis that β ≤ 0. This is a Type I Error; we have rejected the truth! Alternatively, it's possible to obtain an estimate of β that is close enough to zero (or negative) to be considered "not significantly positive". Such a result may lead the researcher to "accept" the hypothesis that β ≤ 0 when in truth β > 0. This is a Type II Error; we have failed to reject a false null hypothesis!
Suppose we are dealing with the evaluation of the impact of a given intervention; what do these errors mean? A Type I error occurs when an evaluation concludes that a program has had an impact, when in reality it had no impact. A Type II error occurs when an evaluation concludes that the program has had no impact, when in fact it has had an impact. We can generalize the discussion of the Type I and Type II errors as follows:
found in tables in annexes of almost every statistics or econometrics text. A decision rule should be formulated before regression estimates are obtained. The range of possible values of b is divided into two regions, an "acceptance" region and a rejection region, where the terms are expressed relative to the null hypothesis. To define these regions, we must determine a critical value (or, for a two-tailed test, two critical values) of b. Thus, a critical value is a value that divides the "acceptance" region from the rejection region when testing a null hypothesis. Graphs of these "acceptance" and rejection regions are presented in the figures below. To use a decision rule, we need to select a critical value.
Let's suppose that the critical value is 1.8. If the observed b is greater than 1.8, we can reject the null hypothesis that β is zero or negative. To see this, take a look at the figure below. Any b above 1.8 can be seen to fall into the rejection region, whereas any b below 1.8 can be seen to fall into the "acceptance" region. The rejection region measures the probability of a Type I Error if the null hypothesis is true. Some students react to this news by suggesting that we make the rejection region as small as possible. Unfortunately, decreasing the chance of a Type I Error means increasing the chance of a Type II Error (not rejecting a false null hypothesis). If you make the rejection region so small that you almost never reject a true null hypothesis, then you are going to be unable to reject almost every null hypothesis, whether it is true or not! As a result, the probability of a Type II Error will rise.
Given that, how do you choose between Type I and Type II Errors?
The answer is easiest if you know that the cost (to society or the decision maker)
of making one kind of error is dramatically larger than the cost of making the other.
If you worked for the authority regulating and approving drugs in a country, for
example, you would want to be very sure that you had not released a product that
we can calculate t-values for each of the estimated coefficients in the equation.
Note that t-tests are usually done only on the slope coefficients; for these, the relevant form of the t-statistic for the kth coefficient is

tk = (bk − βH0) / SE(bk)     (k = 1, 2, ..., K)     (5.7)
How do you decide what border is implied by the null hypothesis? Some null hypotheses specify a particular value. For these, βH0 is simply that value; if H0: β = S, then βH0 = S. Other null hypotheses involve ranges, but we are concerned only with the value in the null hypothesis that is closest to the border between the "acceptance" region and the rejection region. This border value then becomes βH0. For example, if H0: β ≥ 0 and HA: β < 0, then the value in the null hypothesis closest to the border is zero, and βH0 = 0. Since most regression hypotheses test whether a particular regression coefficient is significantly different from zero, βH0 is typically zero. Zero is particularly meaningful because if the true β equals zero, then the variable does not belong in the equation. Before we drop the variable from the equation and effectively force the coefficient to be zero, however, we need to be careful and test the null hypothesis that β = 0. Thus, the most-used form of the t-statistic becomes

tk = (bk − 0) / SE(bk)     (k = 1, 2, ..., K)     (5.8)

which simplifies to

tk = bk / SE(bk)     (k = 1, 2, ..., K)     (5.9)

or the estimated coefficient divided by the estimate of its standard error. This is the t-statistic formula used by most computer programs.
For an example of this calculation, let's consider the following estimated equation, which is reported in a typical format for presenting estimation results.
depending on whether the test is one-sided or two-sided, on the level of Type I Error you specify, and on the degrees of freedom, N − K − 1. The level of Type I Error in a hypothesis test is also called the level of significance of that test, and we will discuss it in more detail later in this chapter.
The t-table was created to save time during research; it consists of critical t-values given specific areas underneath curves such as those in the figure for a one-sided test for Type I Errors. A critical t-value is thus a function of the probability of Type I Error that the researcher wants to specify. Once you have obtained a calculated t-value tk and a critical t-value tc, you reject the null hypothesis if the calculated t-value is greater in absolute value than the critical t-value and if the calculated t-value has the sign implied by HA. Thus, the rule to apply when testing a single regression coefficient is that you should: reject H0 if |tk| > tc and if tk also has the sign implied by HA; do not reject H0 otherwise.
This decision rule works for calculated t-values and critical t-values for one-sided hypotheses around zero:

H0: β ≤ 0    HA: β > 0
H0: β ≥ 0    HA: β < 0

for two-sided hypotheses around zero:

H0: β = 0    HA: β ≠ 0

for one-sided hypotheses based on hypothesized values other than zero:

H0: β ≤ S    HA: β > S
H0: β ≥ S    HA: β < S

and also for two-sided hypotheses based on hypothesized values other than zero:

H0: β = S    HA: β ≠ S
The decision rule is the same: reject the null hypothesis if the appropriately calculated t-value, tk, is greater in absolute value than the critical t-value, tc, as long as the sign of tk is the same as the sign of the coefficient implied in HA. Otherwise, do not reject H0. Always use Equation (5.7):

tk = (bk − βH0) / SE(bk)     (k = 1, 2, ..., K)
If you know that a Type II Error will be extremely costly, for example, then it makes sense to consider using a 10-percent level of significance when you determine your critical value. Such judgments are difficult, however, so beginning researchers are encouraged to adopt a 5-percent level of significance as standard. If we can reject a null hypothesis at the 5-percent level of significance, we can summarize our results by saying that the coefficient is "statistically significant" at the 5-percent level. Since the 5-percent level is arbitrary, we shouldn't jump to conclusions about the value of a variable simply because its coefficient misses being significant by a small amount; if a different level of significance had been chosen, the result might have been different. Some researchers produce tables of regression results, typically without hypothesized signs for their coefficients, and then mark "significant" coefficients with asterisks. The asterisks indicate when the t-score is larger in absolute value than the two-sided 10-percent critical value (which merits one asterisk), the two-sided 5-percent critical value (**), or the two-sided 1-percent critical value (***). Such a use of the t-value should be regarded as a descriptive rather than a hypothesis-testing use of statistics.
Now and then researchers will use the phrase "degree of confidence" or "level of confidence" when they test hypotheses. What do they mean? The level of confidence is nothing more than 100 percent minus the level of significance. Thus a t-test for which we use a 5-percent level of significance can also be said to have a 95-percent level of confidence. Since the two terms have identical meanings, we will use level of significance throughout this module. Another reason we prefer the term level of significance to level of confidence is to avoid any possible confusion with the related concept of confidence intervals.
Some researchers avoid choosing a level of significance by simply stating the lowest level of significance possible for each estimated regression coefficient. The resulting significance levels are called p-values.
5.4.4 p-Value
There's an alternative to the t-test based on a measure called the p-value, or marginal significance level. A p-value for a t-score is the probability of observing a t-score that size or larger (in absolute value) if the null hypothesis were true. Graphically, it's two times the area under the curve of the t-distribution between the absolute value of the actual t-score and infinity. A p-value is a probability, so it runs from 0 to 1. It tells us the lowest level of significance at which we could reject the null hypothesis (assuming that the estimate is in the expected direction). A small p-value casts doubt on the null hypothesis, so to reject a null hypothesis, we need a low p-value. How do we calculate a p-value? Standard regression software packages calculate p-values automatically and print them out for every estimated coefficient. You are thus able to read p-values off your regression output just as you would your b.
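In Stata, for example, the two-sided p-value appears in the P>|t| column of the regression output. The sketch below (placeholder variable names y and x) shows how to reproduce it from the stored results and how to obtain a one-sided p-value.

* A minimal sketch of reading and converting p-values after a regression.
regress y x
* The column labelled P>|t| is the two-sided p-value for each coefficient.
display 2*ttail(e(df_r), abs(_b[x]/_se[x]))   // two-sided p-value by hand
display   ttail(e(df_r), _b[x]/_se[x])        // one-sided p-value for HA: beta > 0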
H0: β1 ≤ 0
HA: β1 > 0

As you can see from the regression output, the p-value for the coefficient on income is .025. This is a two-sided p-value and we are running a one-sided test, so we need to divide .025 by 2, getting .0125. Since .0125 is lower than our chosen level of significance of .05, and since the sign of b1 is positive and agrees with that in HA, we can reject H0. Not surprisingly, this is the same result we would get if we ran a conventional t-test. p-values have a number of advantages. They're easy to use, and they allow readers of research to choose their own levels of significance instead of being forced to use the level chosen by the original researcher.
In addition, p-values convey information to the reader about the relative strength with which we can reject a null hypothesis. Because of these benefits, many researchers use p-values on a consistent basis.
Beginning researchers benefit from learning the standard t-test procedure, particularly since it is more likely to force them to remember to hypothesize the sign of the coefficient and to use a one-sided test when a particular sign can be hypothesized. In addition, if you know how to use the standard t-test approach, it's easy to switch to the p-value approach, but the reverse is not necessarily true. However,
we acknowledge that practicing econometricians today spend far more energy estimating models and coefficients than they spend testing hypotheses. This is because most researchers are more confident in their theories (say, that demand curves slope downward) than they are in the quality of their data or their regression methods. In such situations, where the statistical tools are being used more for descriptive purposes than for hypothesis-testing purposes, it's clear that the use of p-values saves time and conveys more information than does the standard t-test procedure.
The four steps to use when working with the t-test, illustrated in the Stata sketch below, are:
1. Set up the null and alternative hypotheses.
2. Choose a level of significance and therefore a critical t-value.
3. Run the regression and obtain an estimated t-value (or t-score).
4. Apply the decision rule by comparing the calculated t-value with the critical t-value in order to reject or not reject the null hypothesis.
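The sketch below walks through these four steps in Stata for a one-sided test on a single slope coefficient. The data, variable names, and hypothesized sign are illustrative assumptions, not a worked example taken from the module's data set.

* A minimal sketch of the four-step t-test for H0: beta <= 0 vs HA: beta > 0.
* Step 1: hypotheses as stated above (we expect a positive coefficient on x).
* Step 2: choose a 5-percent significance level and find the critical t-value.
* Step 3: run the regression and compute the t-score.
* Step 4: compare the t-score with the critical value and check its sign.
regress y x
scalar tscore = _b[x]/_se[x]
scalar tcrit  = invttail(e(df_r), 0.05)     // one-sided 5% critical value
display "t-score = " tscore "   critical value = " tcrit
display "Reject H0? " (tscore > tcrit)      // 1 = reject, 0 = do not reject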
independent variable or the standard error of the independent variable would make more sense.
The t-Test Is Not Intended for Tests of the Entire Population
The t-test helps make inferences about the true value of a parameter from an estimate calculated from a sample of the population (the group from which the sample is being drawn). If a coefficient is calculated from the entire population, then an unbiased estimate already measures the population value and a significant t-test adds nothing to this knowledge. One might forget this property and attach too much importance to t-scores that have been obtained from samples that approximate the population in size. There is a third way to test a hypothesis: it is based on the concept of a confidence interval.
What exactly does this mean? If the Classical Assumptions hold true, the confidence interval formula produces ranges that contain the true value of β 90 percent of the time. In this case, there is a 90 percent chance that the true value of βI is between 0.365 and 2.211. If it is not in that range, it's due to an unlucky sample. How can we use a confidence interval for a two-tailed hypothesis test? If the null hypothesis is βI = 0, we can reject it at the 10-percent level because 0 is not in the confidence interval. If the null hypothesis is that βI = 1.0, we cannot reject it because 1.0 is in the interval. In general, if your null-hypothesis border value is in the confidence interval, you cannot reject the null hypothesis. Thus, confidence intervals can be used for two-sided tests, but they are more complicated. So why bother with them? It turns out that confidence intervals are very useful in telling us how precise a coefficient estimate is. And for many people using econometrics in the real world, this may be more important than hypothesis testing.
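Stata reports a 95-percent confidence interval for every coefficient by default, and the level() option changes it. The sketch below (placeholder variable names) shows how to obtain the 90-percent interval used in the discussion above.

* A minimal sketch of confidence intervals after OLS.
regress y x              // output includes 95% confidence intervals
regress y x, level(90)   // same results displayed with 90% intervals
* If the null-hypothesis border value lies outside the reported interval,
* the corresponding two-sided test rejects H0 at the 10-percent level.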
If the fits of the constrained equation and the unconstrained equation are not substantially different, the null hypothesis should not be rejected. If the fit of the unconstrained equation is substantially better than that of the constrained equation, then we reject the null hypothesis. The fit of the constrained equation is never superior to the fit of the unconstrained equation, as we'll explain next.
The fits of the equations are compared with the general F-statistic:

F = [(RSSM − RSS) / M] / [RSS / (N − K − 1)]

where RSSM is the residual sum of squares of the constrained equation, RSS is the residual sum of squares of the unconstrained equation, M is the number of constraints, and N − K − 1 is the degrees of freedom of the unconstrained equation. The decision rule is:

Reject H0 if F > Fc
Do not reject H0 if F ≤ Fc
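In Stata, this comparison is carried out automatically by the test command after a regression, which reports the F-statistic and its p-value for the stated constraints. The sketch below (placeholder variable names x1-x3) tests the joint null hypothesis that two slope coefficients are zero.

* A minimal sketch of an F-test of joint significance after OLS.
regress y x1 x2 x3
test x2 x3            // H0: beta_2 = 0 and beta_3 = 0 (M = 2 constraints)
* Stata reports F(2, N-K-1) and Prob > F; reject H0 if Prob > F is below
* the chosen level of significance.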
Chapter 6
Violation of Classical Assumptions
6.1 Introduction
In this chapter we deal with violations of the Classical Assumptions and remedies for those violations: multicollinearity, serial correlation, and heteroskedasticity. For each of these three problems, we will attempt to answer the following questions:
What is the nature of the problem?
What are the consequences of the problem?
How is the problem diagnosed?
What remedies for the problem are available?
The word collinearity describes a linear correlation between two independent variables, and multicollinearity indicates that more than two independent variables are involved. In common usage, multicollinearity is used to apply to both cases.
Explain what is meant by a serially correlated time series, and how we measure serial correlation.
Explain how and why plots of least squares residuals can reveal heteroskedasticity.
Specify a variance function and use it to test for heteroskedasticity with (a) a Breusch-Pagan test and (b) a White test.
Describe and compare the properties of the least squares and generalized least
squares estimators when heteroskedasticity exists.
6.3 Multicollinearity
Strictly speaking, perfect multicollinearity is the violation of Classical Assumption VI, that no independent variable is a perfect linear function of one or more other independent variables.
Perfect multicollinearity is rare, but severe imperfect multicollinearity, although not violating Classical Assumption VI, still causes substantial problems. Recall that the coefficient βk can be thought of as the impact on the dependent variable of a one-unit increase in the independent variable Xk, holding constant the other independent variables in the equation. If two explanatory variables are significantly related, then the OLS computer program will find it difficult to distinguish the effects of one variable from the effects of the other. In essence, the more highly correlated two (or more) independent variables are, the more difficult it becomes to accurately estimate the coefficients of the true model. If two variables move identically, then there is no hope of distinguishing between their impacts, but if the variables are only roughly correlated, then we still might be able to estimate the two effects accurately enough for most purposes.
Perfect multicollinearity violates Classical Assumption VI, which specifies that no explanatory variable is a perfect linear function of any other explanatory variable. The word perfect in this context implies that the variation in one explanatory variable can be completely explained by movements in another explanatory variable. Such a perfect linear function between two independent variables would be:

x1i = α0 + α1 x2i
where the αs are constants and the xs are independent variables in:

yi = β0 + β1 x1i + β2 x2i + εi

Notice that there is no error term in the first equation. This implies that x1 can be exactly calculated given x2 and the equation. Typical equations for such perfect linear relationships would be:

x1i = 5 x2i
x1i = 2 + 3 x2i
Perfect multicollinearity ruins our ability to estimate the coefficients because the two variables cannot be distinguished. You cannot "hold all the other independent variables in the equation constant" if every time one variable changes, another changes in an identical manner. With perfect multicollinearity, an independent variable can be completely explained by the movements of one or more other independent variables. Perfect multicollinearity can usually be avoided by careful screening of the independent variables before a regression is run.
A special case related to perfect multicollinearity occurs when a variable that is definitionally related to the dependent variable is included as an independent variable in a regression equation. Such a dominant variable is by definition so highly correlated with the dependent variable that it completely masks the effects of all other independent variables in the equation. In a sense, this is a case of perfect collinearity between the dependent variable and an independent variable. For example, if you include a variable measuring the amount of raw materials used by the shoe industry in a production function for that industry, the raw materials variable would have an extremely high t-score, but otherwise important variables like labor and capital would have quite insignificant t-scores. Why?
In essence, if you knew how much leather was used by a shoe factory, you could predict the number of pairs of shoes produced without knowing anything about labor or capital. The relationship is definitional, and the dominant variable should be dropped from the equation to get reasonable estimates of the coefficients of the other variables. Since perfect multicollinearity is fairly easy to avoid, econometricians rarely talk about it. Instead, when we use the word multicollinearity, we really are talking about severe imperfect multicollinearity.
tk = (bk − βH0) / SE(bk)     (k = 1, 2, ..., K)

Figure 6.1: Severe multicollinearity increases the variances of the estimated coefficients.
The overall fit of the equation and the estimation of the coefficients of non-multicollinear variables will be largely unaffected. Even though the individual t-scores are often quite low in a multicollinear equation, the overall fit of the equation, as measured by R², will not fall much, if at all, in the face of significant multicollinearity. Given this, one of the first indications of severe multicollinearity is the combination of a high R² with no statistically significant individual regression coefficients.
Similarly, if an explanatory variable in an equation is not multicollinear with the other variables, then the estimation of its coefficient and standard error usually will not be affected. Because the overall fit is largely unchanged, it's possible for the F-test of overall significance to reject the null hypothesis even though none of the t-tests on individual coefficients can do so. Such a result is a clear indication of severe imperfect multicollinearity.
Finally, since multicollinearity has little effect on the overall fit of the equation, it also will have little effect on the use of that equation for prediction or forecasting, as long as the independent variables maintain the same pattern of multicollinearity in the forecast period that they demonstrated in the sample.
One measure of the severity of multicollinearity that is easy to use and that is gaining in popularity is the variance inflation factor. The variance inflation factor (VIF) is a method of detecting the severity of multicollinearity by looking at the extent to which a given explanatory variable can be explained by all the other explanatory variables in the equation. There is a VIF for each explanatory variable in an equation.
The VIF is an index of how much multicollinearity has increased the variance of an estimated coefficient. A high VIF indicates that multicollinearity has increased the estimated variance of the estimated coefficient by quite a bit, yielding a decreased t-score.
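In practice, Stata computes the VIFs directly after a regression with the estat vif postestimation command, so the two-step procedure described below rarely has to be carried out by hand. A minimal sketch with placeholder variable names:

* A minimal sketch of checking VIFs after OLS.
regress y x1 x2 x3
estat vif
* A common rule of thumb treats VIFs above about 5 or 10 as a sign of
* severe multicollinearity, though any such cutoff is arbitrary.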
Suppose you want to use the VIF to attempt to detect multicollinearity in an original equation with K independent variables. Doing so requires calculating K different VIFs, one for each Xi. Calculating the VIF for a given Xi involves two steps:
1. Run an OLS regression that has Xi as a function of all the other explanatory variables in the equation. For i = 1, this equation would be:
6.4 Serial Correlation
The approach of this section to the problem of serial correlation will be similar to that used in the previous section. We'll attempt to answer the same four questions:
1. What is the nature of the problem?
2. What are the consequences of the problem?
3. How is the problem diagnosed?
4. What remedies for the problem are available?
The most commonly assumed kind of serial correlation is first-order serial correlation, in which the current value of the error term is a function of the previous value of the error term:

εt = ρ εt−1 + ut

where ρ (rho) is the first-order autocorrelation coefficient and ut is a classical (not serially correlated) error term.
Pure serial correlation is caused by the underlying distribution of the error term of the true specification of an equation (which cannot be changed by the researcher), while impure serial correlation is caused by a specification error that often can be corrected. How is it possible for a specification error to cause serial correlation?
Recall that the error term can be thought of as the effect of omitted variables, nonlinearities, measurement errors, and pure stochastic disturbances on the dependent variable. This means, for example, that if we omit a relevant variable or use the wrong functional form, then the portion of that omitted effect that cannot be represented by the included explanatory variables must be absorbed by the error term. The error term for an incorrectly specified equation thus includes a portion of the effect of any omitted variables and/or a portion of the effect of the difference between the proper functional form and the one chosen by the researcher.
This new error term might be serially correlated even if the true one is not. If this is the case, the serial correlation has been caused by the researcher's choice of a specification and not by the pure error term associated with the correct specification.
of the original model. If the lagged residuals are significant in explaining this time's residuals, then we can reject the null hypothesis of no serial correlation.
The place to start in correcting a serial correlation problem is to look carefully at the specification of the equation for possible errors that might be causing impure serial correlation.
Is the functional form correct?
Are you sure that there are no omitted variables?
Only after the specification of the equation has been reviewed carefully should the possibility of an adjustment for pure serial correlation be considered.
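A formal residual-based check of this kind is available as a Stata postestimation command once the data have been declared a time series. The sketch below (placeholder variable names, with timevar standing for the time variable) runs a Lagrange Multiplier test of the type just described (Breusch-Godfrey) along with the Durbin-Watson statistic.

* A minimal sketch of testing for first-order serial correlation.
tsset timevar                // declare the time variable first
regress y x1 x2
estat bgodfrey, lags(1)      // LM test: are lagged residuals significant?
estat dwatson                // Durbin-Watson statistic for the same question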
Generalized least squares (GLS) is a method of ridding an equation of pure first-order serial correlation and, in the process, restoring the minimum variance property to its estimation.
Newey-West standard errors are SE(b)s that take account of serial correlation without changing the bs themselves in any way.
The logic behind Newey-West standard errors is powerful. If serial correlation does not cause bias in the bs but does impact the standard errors, then it makes sense to adjust the estimated equation in a way that changes the SE(b)s but not the bs.
Thus Newey-West standard errors have been calculated specifically to avoid the consequences of pure first-order serial correlation. The Newey-West procedure yields an estimator of the standard errors that, while biased, is generally more accurate than uncorrected standard errors for large samples (greater than 100) in the face of serial correlation. As a result, Newey-West standard errors can be used for t-tests and other hypothesis tests in most samples without the errors of inference potentially caused by serial correlation. Typically, Newey-West SE(b)s are larger than OLS SE(b)s, thus producing lower t-scores and decreasing the probability that a given estimated coefficient will be significantly different from zero.
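Stata computes Newey-West standard errors with the newey command; the coefficients are identical to OLS, and only the standard errors change. A minimal sketch with placeholder variable names (the lag length chosen here is purely illustrative):

* A minimal sketch of Newey-West standard errors for a time-series regression.
tsset timevar
regress y x1 x2              // OLS coefficients and uncorrected SEs
newey   y x1 x2, lag(4)      // same coefficients, serial-correlation-robust SEs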
6.5 Heteroskedasticity
Heteroskedasticity is the violation of Classical Assumption V, which states that the observations of the error term are drawn from a distribution that has a constant variance. The assumption of constant variances for different observations of the error term (homoskedasticity) is not always realistic. For example, in a model explaining heights, it's likely that error term observations associated with the height of a basketball player would come from distributions with larger variances than those associated with the height of a mouse. Heteroskedasticity is important because OLS, when applied to heteroskedastic models, is no longer the minimum variance estimator (it still is unbiased, however). In general, heteroskedasticity is more likely to take place in cross-sectional models than in time-series models. This focus on
might have a large variance, and that the error term distribution for small observations might have a small variance.
In cross-sectional data sets, it's easy to get such a large range between the highest and lowest values of the variables. The difference between Oromia and Gambela (or Harari) in terms of the Birr value of the consumption of goods and services, for instance, is quite large (comparable in percentage terms to the difference between the heights of a basketball player and a mouse). Since cross-sectional models often include observations of widely different sizes in the same sample (cross-regional studies of Ethiopia usually include Oromia and Gambela as individual observations, for example), heteroskedasticity is hard to avoid if economic topics are going to be studied cross-sectionally.
The simplest way to visualize pure heteroskedasticity is to picture a world in which the observations of the error term could be grouped into just two different distributions, "wide" and "narrow." We'll call this simple version of the problem discrete heteroskedasticity. Here, both distributions would be centered around zero, but one would have a larger variance than the other, as indicated in the bottom half
of the figure above. Note the difference between the two halves of the figure. With homoskedasticity, all the error term observations come from the same distribution; with heteroskedasticity, they come from different distributions.
For an example of discrete heteroskedasticity, we need go no further than our discussion of the heights of basketball players and mice. We'd certainly expect the variance of ε to be larger for basketball players as a group than for mice, so the distribution of ε for the heights of basketball players might look like the "wide" distribution in the figure above, and the distribution of ε for mice would be much narrower than the "narrow" distribution in the figure above.
Heteroskedasticity takes on many more complex forms. In fact, the number of different models of heteroskedasticity is virtually limitless, and an analysis of even a small percentage of these alternatives would be a huge task. Instead, we'd like to address the general principles of heteroskedasticity by focusing on the most frequently specified model of pure heteroskedasticity, just as we focused on pure, positive, first-order serial correlation in the previous section. However, don't let this focus mislead you into concluding that econometricians are concerned only with one kind of heteroskedasticity.
In this model of heteroskedasticity, the variance of the error term is related to an exogenous variable Zi. For a typical regression equation:

Yi = β0 + β1 X1i + β2 X2i + εi

the variance of the otherwise classical error term might be equal to:

VAR(εi) = σ² Zi

where Z may or may not be one of the Xs in the equation. The variable Z is called a proportionality factor because the variance of the error term changes proportionally to Zi. The higher the value of Zi, the higher the variance of the distribution of the ith observation of the error term. There would be N different distributions, one for each observation, from which the observations of the error term could be drawn, depending on the number of different values that Z takes. To see what homoskedastic and heteroskedastic distributions of the error term look like with respect to Z, compare the two figures below. Note that the heteroskedastic distribution gets wider as Z increases but that the homoskedastic distribution maintains the same width no matter what value Z takes.
What is an example of a proportionality factor Z? How is it possible for an
exogenous variable such as Z to change the whole distribution of an error term?
Think about a function that relates the consumption expenditures in a state to
its income. The expenditures of a small state like Rhode Island are not likely to
be as variable in absolute value as the expenditures of a large state like California
because a 10-percent change in spending for a large state involves a lot more money
than a 10-percent change for a small one. In such a case, the dependent variable would be consumption expenditures and a likely proportionality factor, Z, would be population. As population rose, so too would the variance of the error term of an equation built to explain expenditures. The error term distributions would look something like those in the figure above, where Z is population.
This example helps emphasize that heteroskedasticity is likely to occur in cross-sectional models because of the large variation in the size of the dependent variable involved. An exogenous disturbance that might seem huge to a small state could seem minuscule to a large one, for instance.
Heteroskedasticity can occur in a time-series model with a significant amount of change in the dependent variable. If you were modeling sales of DVD players from 1994 to 2015, it's quite possible that you would have a heteroskedastic error term. As the phenomenal growth of the industry took place, the variance of the error term probably increased as well. Such a possibility is unlikely in time series that have low rates of change, however.
Heteroskedasticity also can occur in any model, time series or cross-sectional, where the quality of data collection changes dramatically within the sample. As data collection techniques get better, the variance of the error term should fall because measurement errors are included in the error term. As measurement errors decrease in size, so should the variance of the error term (a topic known as "errors in the variables").
E(bk) = βk     for all k

Lack of bias does not guarantee "accurate" coefficient estimates, especially since heteroskedasticity increases the variance of the estimates, but the distribution of the estimates is still centered around the true β. Equations with impure heteroskedasticity caused by an omitted variable, of course, will have possible specification bias.
2. Heteroskedasticity typically causes OLS to no longer be the minimum-variance estimator (of all the linear unbiased estimators). Pure heteroskedasticity causes no
bias in the estimates of the OLS coefficients, but it does affect the minimum-variance property.
If the error term of an equation is heteroskedastic with respect to a proportionality factor Z:

VAR(εi) = σ² Zi

then the minimum-variance portion of the Gauss-Markov Theorem cannot be proven, because there are other linear unbiased estimators that have smaller variances. This is because the heteroskedastic error term causes the dependent variable to fluctuate, and the OLS estimation procedure attributes this fluctuation to the independent variables. Thus, OLS is more likely to misestimate the true β in the face of heteroskedasticity. The bs still are unbiased because overestimates are just as likely as underestimates.
3. Heteroskedasticity causes the OLS estimates of the SE(β̂)s to be biased, leading to unreliable hypothesis testing and confidence intervals. With heteroskedasticity, the OLS formula for the standard error produces biased estimates of the SE(β̂)s. Because the SE(β̂) is a prime component of the t-statistic, these biased SE(β̂)s cause biased t-scores and unreliable hypothesis testing in general. In essence, heteroskedasticity causes OLS to produce incorrect SE(β̂)s and t-scores! Not surprisingly, most econometricians therefore are very hesitant to put much faith in hypothesis tests conducted in the face of pure heteroskedasticity.
What sort of bias in the standard errors does heteroskedasticity tend to cause? Typically, heteroskedasticity causes the OLS estimates of the standard errors to be biased downward, making them too small. Sometimes, however, they are biased upward; it is hard to predict in any given case. Either way, it is a big problem: pure heteroskedasticity can make quite a mess of our results, hypothesis testing becomes unreliable, and confidence intervals become misleading.
Before testing formally for heteroskedasticity, ask three questions.
1. Are there any obvious specification errors? Are there any likely omitted variables? Have you specified a linear model when a double-log model is more appropriate? Don't test for heteroskedasticity until the specification is as good as possible. After all, if you find heteroskedasticity in an incorrectly specified model, there's a chance it will be impure.
2. Are there any early warning signs of heteroskedasticity? Just as certain kinds
of clouds can warn of potential storms, certain kinds of data can signal possible
heteroskedasticity. In particular, if the dependent variable's maximum value is
many, many times larger than its minimum, beware of heteroskedasticity.
3. Does a graph of the residuals show any evidence of heteroskedasticity? It
sometimes saves time to plot the residuals against a potential Z proportionality
factor or against the dependent variable. If you see a pattern in the residuals,
you’ve got a problem. See the Figures below for a few examples of heteroskedastic
patterns in the residuals.
Note that the figures above show “textbook” examples of heteroskedasticity. The
real world is nearly always a lot messier than textbook graphs. It’s not unusual to
look at a real-world residual plot and be unsure whether there’s a pattern or not.
As a result, even if there are no obvious specification errors, no early warning signs,
and no visible residual patterns, it’s a good idea to do a formal statistical test for
heteroskedasticity, so we’d better get started.
Step 3: Test the overall significance of the auxiliary regression (the regression of the squared residuals on the suspected proportionality factor or factors) with a chi-square test. The null and alternative hypotheses are:
H0: α1 = α2 = 0
HA: H0 is false
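To make the procedure concrete, here is a minimal Stata sketch of the residual-plot check and the N·R² chi-square test described above. The variable names y, x1, x2, and the suspected proportionality factor z are hypothetical placeholders for your own model:

* fit the original model (y, x1, x2, z are hypothetical variable names)
regress y x1 x2
* save the residuals and square them
predict ehat, residuals
generate ehat2 = ehat^2
* visual check: plot the residuals against a suspected proportionality factor
scatter ehat z
* auxiliary regression of the squared residuals on the suspected factor
regress ehat2 z
* N times the unadjusted R-squared of the auxiliary regression
display "N*R2 = " e(N)*e(r2)
* 5% critical chi-square value (df = number of slope coefficients in the auxiliary regression)
display "critical value = " invchi2tail(1, 0.05)
* Stata's built-in Breusch-Pagan/Cook-Weisberg test gives a similar check
quietly regress y x1 x2
estat hettest z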
Probably the most popular of all the heteroskedasticity tests is the White test, because it can find more types of heteroskedasticity than any other test. Let's see how it works.
The White test investigates the possibility of heteroskedasticity in an equation by seeing if the squared residuals can be explained by the equation's independent variables, their squares, and their cross products. To run the White test:
1. Obtain the residuals of the estimated regression equation.
2. Estimate an auxiliary regression, using the squared residuals as the dependent
variable, with each X from the original equation, the square of each X, and the
product of each X times every other X as the explanatory variables.
3. Test the overall significance of the auxiliary regression with a chi-square test. Once again the test statistic is N·R², the sample size (N) times the unadjusted R² from the auxiliary regression. This test statistic has a chi-square distribution with degrees of freedom equal to the number of slope coefficients in the auxiliary regression. The null hypothesis is that all the slope coefficients in the auxiliary
regression equal zero, and if N·R² is greater than or equal to
the critical chi-square value, then we reject the null hypothesis of homoskedasticity.
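In Stata the White test does not have to be set up by hand: after fitting the original equation with regress, a single post-estimation command runs the auxiliary regression with the squares and cross products and reports the N·R² statistic with its p value. A minimal sketch (y, x1, x2 again hypothetical variable names):

* fit the original model
regress y x1 x2
* White's general test: squared residuals on the regressors, their squares, and cross products
estat imtest, white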
Check out the explanatory variables in the White auxiliary regression. They
include every variable in the original model, their squares, and their cross products.
Including all the variables from the original model allows the White test to check
to see if any or all of them are Z proportionality factors. Including all the squared
terms and cross products allows us to test for more exotic and complex types of
heteroskedasticity. This is the White test’s greatest strength.
However, the White test contains more right-hand-side variables than the original regression, sometimes a lot more. This can be its greatest weakness. To see why, note that as the number of explanatory variables in an original regression rises, the number of right-hand variables in the White test auxiliary regression goes up much
faster. With three variables in the original model, the White regression could have
nine. With 12 explanatory variables in the original model, there could be 90 in the
White regression with all the squares and interactive terms included! And this is
where the weakness becomes a real problem.
If the number of right-hand variables in the auxiliary regression exceeds the
number of observations, you can’t run the White test regression because you would
have negative degrees of freedom in the auxiliary equation! Even if the degrees of
freedom in the auxiliary equation are positive but small, the White test might do
a poor job of detecting heteroskedasticity because the fewer the degrees of freedom
there are, the less powerful the statistical test is. In such a situation, you’d be
limited to the Breusch–Pagan test or an alternative.
The first thing to do if the Breusch–Pagan test or the White test indicates the possibility of heteroskedasticity is to examine the equation carefully for specification errors. Although you should never include an explanatory variable simply because a test indicates the possibility of heteroskedasticity, you ought to rigorously think through the specification of the equation. If this rethinking allows you to discover
a variable that should have been in the regression from the beginning, then that
variable should be added to the equation. Similarly, if you had the wrong functional
form to begin with, the discovery of heteroskedasticity might be the hint you need
to rethink the specification and switch to the functional form that best represents the underlying theory. However, if there are no obvious specification errors, the
heteroskedasticity is probably pure in nature, and one of the remedies described in
this section should be considered.
One remedy is to rethink the variables or functional form of your equation. In some cases, the only redefinition that's needed to rid an equation of heteroskedasticity is to switch from a linear functional form to a double-log
functional form. The double-log form has inherently less variation than the linear
form, so it’s less likely to encounter heteroskedasticity. In addition, there are many
research topics for which the double-log form is just as theoretically logical as the
linear form. In other situations, it might be necessary to completely rethink the
research project in terms of its underlying theory.
For example, consider a cross-sectional model of the total expenditures by the
governments of different cities. Logical explanatory variables to consider in such an
analysis are the aggregate income, the population, and the average wage in each city.
The larger the total income of a city’s residents and businesses, for example, the
larger the city government’s expenditures. In this case, it’s not very enlightening to
know that the larger cities have larger incomes and larger expenditures (in absolute
magnitude) than the smaller ones. Fitting a regression line to such data also gives
undue weight to the larger cities because they would otherwise give rise to large
squared residuals. That is, since OLS minimizes the summed squared residuals, and
since the residuals from the large cities are likely to be large due simply to the size
of the city, the regression estimation will be especially sensitive to the residuals from
the larger cities. This is often called “spurious correlation” due to size. In addition,
the residuals may indicate heteroskedasticity.
It makes sense to consider reformulating the model in a way that will discount
the scale factor (the size of the cities) and emphasize the underlying behavior. In
this case, per capita expenditures would be a logical dependent variable. This form
of the equation places Addis Ababa on the same scale as, say, Adama, Hawasa, Bahir
Dar, Mekele, and thus gives them the same weight in estimation. If an explanatory
variable happened not to be a function of the size of the city, however, it would not
need to be adjusted to per capita terms. If the equation included the average wage
of city workers, for example, that wage would not be divided through by population
in the transformed equation. Suppose your original equation is
EXPi = β0 + β1 INCi + β2 WAGEi + β3 POPi + εi
Even a transformed, per capita version of this equation could have heteroskedasticity; the error variances might be larger for the observations having the larger per capita values for expenditures than they are for smaller per capita values. Thus, it is legitimate to suspect and test for heteroskedasticity even in this transformed model.
Such heteroskedasticity in the transformed equation is unlikely, however, because
there will be little of the variation in size normally associated with heteroskedasticity.
The above transformed equation is very similar to the equation for Weighted
Least Squares (WLS).
Weighted Least Squares is a remedy for heteroskedasticity that consists of divid-
ing the entire equation (including the constant and the heteroskedastic error term)
by the proportionality factor Z and then re-estimating the equation with OLS. For
the example above, the WLS equation would be:
EXPi/POPi = β0(1/POPi) + β1(INCi/POPi) + β2(WAGEi/POPi) + β3 + ui
where the variables and the βs are identical to those in the original equation above, and ui = εi/POPi. Dividing through by Z means that u is a homoskedastic error term as long as Z is the correct proportionality factor. Identifying the correct proportionality factor is not a trivial problem, however, and other transformations and heteroskedasticity-corrected standard errors (HCSEs) are much easier to use than WLS, so the use of WLS is no longer widely recommended.
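In practice the two most common responses are heteroskedasticity-corrected (robust) standard errors and, less commonly nowadays, WLS. A hedged Stata sketch for the city-expenditure example, assuming a hypothetical dataset with variables expend, inc, wage, and pop corresponding to EXPi, INCi, WAGEi, and POPi:

* OLS with heteroskedasticity-corrected (robust) standard errors:
* the coefficients are unchanged, but the reported SEs and t-statistics are valid under heteroskedasticity
regress expend inc wage pop, vce(robust)

* Weighted Least Squares with POP as the proportionality factor:
* if the error standard deviation is proportional to pop, weight each observation by 1/pop^2
* (equivalent to dividing the whole equation through by pop and re-estimating by OLS)
regress expend inc wage pop [aweight = 1/(pop^2)]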
Chapter 7
Regression Models for Categorical and Limited Dependent Variables
This chapter shows how to model qualitative factors with more than two categories (such as region of the country), how to interpret the resulting model with an example, and how to test the equivalence of two regression equations using indicator variables. Based on the material in this chapter, you should be able to:
Explain why probit, or logit, is usually preferred to least squares when esti-
mating a model in which the dependent variable is binary.
Compare and contrast the multinomial logit model to the conditional logit
model.
7.2 Introduction
In all the regression models that we have considered so far, we have implicitly
assumed that the regressand, the dependent variable, or the response variable Y is
quantitative, whereas the explanatory variables are either quantitative, qualitative
(or dummy), or a mixture thereof. We have briefly discussed how dummy regressors (explanatory variables) are introduced in a regression model and what role they play in specific situations.
In this chapter we consider several models in which the regressand (the dependent
variable) itself is qualitative in nature. Although increasingly used in various areas
of social sciences and medical research, qualitative response regression models pose
interesting estimation and interpretation challenges. Suppose we want to study the
labor force participation (LFP) decision of adult males. Since an adult is either in
the labor force or not, LFP is a yes or no decision. Hence, the response variable, or
regressand, can take only two values, say, 1 if the person is in the labor force and 0 if
he or she is not. In other words, the regressand is a binary, or dichotomous, variable.
For the present purposes, the important thing to note is that the regressand is a
qualitative variable. One can think of several other examples where the regressand
is qualitative in nature. Thus, a family either owns a house or it does not, it has
disability insurance or it does not, both husband and wife are in the labor force or
only one spouse is. Similarly, a certain drug is effective in curing an illness or it is not.
A firm decides to declare a stock dividend or not, a parliamentarian decides to vote
for a tax cut or not. We do not have to restrict our response variable to yes/no or
dichotomous categories only. We can have a polychotomous (or multiple-category)
response variable.
7.3 The Logit Model
The simplest way to handle a binary dependent variable is the linear probability model (LPM), which applies OLS directly to the 0–1 outcome. The LPM has several well-known problems (a short Stata sketch follows this list):
1. The LPM assumes that the probability of the outcome moves linearly with
the value of the explanatory variable, no matter how small or large that value is.
2. The probability value must lie between 0 and 1, yet there is no guarantee that
the estimated probability values from the LPM will lie within these limits.
3. The usual assumption that the error term is normally distributed cannot
hold when the dependent variable takes only values of 0 and 1, since it follows the
binomial distribution.
4. The error term in the LPM is heteroscedastic, making the traditional significance tests suspect.
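To see problem 2 in practice, one can fit the same binary-outcome model by OLS (the LPM) and by logit and compare the fitted probabilities. A minimal Stata sketch, assuming a hypothetical dataset with a 0–1 home-ownership indicator own and an income variable income:

* linear probability model: OLS applied directly to the 0-1 dependent variable
regress own income
* fitted values from the LPM; nothing keeps them inside [0, 1]
predict p_lpm, xb
summarize p_lpm
* logit: fitted probabilities are guaranteed to lie strictly between 0 and 1
logit own income
predict p_logit, pr
summarize p_logit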
Let's use home ownership to explain the basic ideas underlying the logit model. In explaining home ownership in relation to income, the LPM was
Pi = β1 + β2 Xi
where X is income, Pi = E(Yi = 1 | Xi), and Yi = 1 means the family owns a house. But now
consider the following representation of home ownership:
Pi = 1 / [1 + e^(−(β1 + β2 Xi))]
The ratio Pi/(1 − Pi) is the odds ratio in favor of owning a house. Thus, if Pi = 0.8, the odds are 4 to 1 in favor of the family owning a house. Now if we take the natural log of this odds ratio, we obtain a very interesting result, namely,
Li = ln[Pi/(1 − Pi)] = Zi = β1 + β2 Xi
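The step from the logistic probability to the linear logit is pure algebra; spelling it out once in the same notation:

\[
P_i = \frac{1}{1+e^{-Z_i}}, \qquad Z_i = \beta_1 + \beta_2 X_i,
\qquad
1-P_i = \frac{e^{-Z_i}}{1+e^{-Z_i}},
\]
\[
\frac{P_i}{1-P_i} = e^{Z_i}
\quad\Longrightarrow\quad
L_i = \ln\!\left(\frac{P_i}{1-P_i}\right) = Z_i = \beta_1 + \beta_2 X_i .
\]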
To illustrate, consider a logit model of the decision to smoke, estimated on individual-level data with age, education (educ), family income, and the price of cigarettes (pcigs79) as regressors. The variables age and education are highly statistically significant (see their z values) and have the expected signs. As age increases, the value of the logit decreases, perhaps due to health concerns: as people age, they are less likely to smoke. Likewise, more educated people are less likely to smoke, perhaps because they are more aware of the ill effects of smoking. The price of cigarettes has the expected negative sign and is significant at about the 7% level. Ceteris paribus, the higher the price of cigarettes,
the lower is the probability of smoking. Income has no statistically visible impact
on smoking, perhaps because expenditure on cigarettes may be a small proportion
of family income.
The interpretation of the various coefficients is as follows: holding other variables constant, if, for example, education increases by one year, the average logit value goes down by about 0.09; that is, the log of the odds in favor of smoking goes down by about 0.09. Other coefficients are interpreted similarly. But the logit language is not everyday language. What we would like to know is the probability of smoking, given values of the explanatory variables. This can be computed from the logistic probability formula given above.
To illustrate, let's take smoker number 2 in the dataset, who has the following characteristics: age = 28, educ = 15, income = 12,500 and pcigs79 = 60.0. Inserting these values in the logistic probability formula, we obtain:
P = 1 / [1 + e^(−(−0.4935))] = 1 / (1 + e^0.4935) = 0.3782
That is, the probability that a person with the given characteristics is a smoker
is about 38%. Can we compute the marginal effect of an explanatory variable on the probability of smoking, holding all other variables constant? Suppose we want to find out ∂Pi/∂Agei, the effect of a unit change in age on the probability of smoking, holding other variables constant.
This was very straightforward in the LPM, but it is not that simple with logit or
probit models. This is because the change in probability of smoking if age changes
by a unit (say, a year) depends not only on the coefficient of the age variable but
also on the level of probability from which the change is measured. But the latter
depends on values of all the explanatory variables. Eviews and Stata can do this
job readily.
[Stata margins output: marginal effects dy/dx with delta-method standard errors, z statistics, p values, and 95% confidence intervals.]
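For the smoking example, the estimation and the marginal effects can be obtained in Stata roughly as follows. This is a sketch that assumes the Gujarati smoker dataset is in memory with the variable names used in the text (smoker, age, educ, income, pcigs79):

* logit model of the probability of smoking
logit smoker age educ income pcigs79

* average marginal effects of each regressor on Pr(smoker = 1), with delta-method SEs
margins, dydx(*)

* marginal effects evaluated at the means of the regressors instead
margins, dydx(*) atmeans

* predicted probability of smoking for each individual in the sample
predict p_smoke, pr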
We can also test the null hypothesis that all the coefficients are simultaneously zero with the likelihood ratio (LR) statistic, which is the equivalent of the F test in the linear regression model. Under the null hypothesis that none of the regressors are significant, the LR statistic follows the chi-square distribution with df equal
to the number of explanatory variables: four in our example. As the estimation
result for smoker shows, the value of the LR statistic is about 47.26 and the p
value (i.e. the exact significance level) is practically zero, thus refuting the null
hypothesis. Therefore we can say that the four variables included in the logit model
are important determinants of smoking habits.
Multinomial regression models (MRMs) involve a decision maker who has to choose among several alternatives. We use the term “choice” to represent the alternatives or options that face an individual; the context of the problem will make clear which term we have in mind. One important case is the nominal MRM for chooser-specific or individual-specific data. In this model the choices depend on the characteristics
of the chooser, such as age, income, education, religion, and similar factors. For
example, in educational choices, such as secondary education, a two-year college
education, a four-year college education and graduate school, age, family income,
religion, and parents’ education are some of the variables that will affect the choice.
These variables are specific to the chooser. These types of model are usually es-
timated by multinomial logit (MLM) or multinomial probit models (MPM). The
primary question these models answer is: How do the choosers’ characteristics af-
fect their choosing a particular alternative among a set of alternatives? Therefore
MLM is suitable when regressors vary across individuals.
Suppose there are three mutually exclusive and exhaustive choices, with probabilities π1, π2, and π3. These probabilities must sum to 1, because the sum of the probabilities of mutually exclusive and exhaustive events must be 1. We will call the π's the response probabilities. This means that in
our example if we determine any two probabilities, the third one is determined auto-
matically. In other words, we cannot estimate the three probabilities independently.
Now what are the factors or variables that determine the probability of choosing a
particular option?
In our school choice example we have information on the following variables:
X2 = hscath = 1 if Catholic school graduate, 0 otherwise
X3 = grades = average grade in math, English, and social studies on a 13 point
grading scale, with 1 for the highest grade and 13 for the lowest grade. Therefore,
higher grade-point denotes poor academic performance
X4 = faminc = gross family income in 1991 in thousands of dollars
X5 = famsiz =number of family members
X6 = parcoll = 1 if the most educated parent graduated from college or had an
advanced degree
X7 = 1 if female
X8 = 1 if black
We will use X1 to represent the intercept. Notice some of the variables are
qualitative or dummy (X2, X6, X7, X8) and some are quantitative (X3, X4, X5).
Also note that there will be some random factors that will also affect the choice, and
these random factors will be denoted by the error term in estimating the model.
Generalizing the bivariate logit model discussed in the preceding section, we can
write the multinomial logit model (MLM) as:
πj = e^(αj + βj Xi) / Σ_{k=1}^{3} e^(αk + βk Xi),   j = 1, 2, 3
Notice that we have put the subscript j on the intercept and the slope coefficient to remind us that the values of these coefficients can differ from choice to choice. In
other words, a high school graduate who does not want to go to college will attach
a different weight to each explanatory variable than a high school graduate who
wants to go to a 2-year college or a 4-year college. Likewise, a high school graduate
who wants to go to a 2-year college but not to a 4-year college will attach different
weights (or importance if you will) to the various explanatory variables. Also, keep
in mind that if we have more than one explanatory variable in the model, X will
then represent a vector of variables and β will be a vector of coefficients. So, if we decide to include the seven explanatory variables listed above, we will have seven slope coefficients, and these slope coefficients may differ from choice to choice. In other words, the three probabilities estimated from the formula above may have different coefficients for the regressors. In effect, we are estimating three regressions.
As we noted before, we cannot estimate all three probabilities independently. The common practice in MLM is to choose one category or choice as the base, reference, or comparison category and set its coefficient values to zero. So if we choose the first category (no college) and set α1 = 0 and β1 = 0, we obtain the following estimates of the probabilities for the three choices.
Multinomial logistic regression Number of obs = 1,000
LR chi2(14) = 377.82
Prob > chi2 = 0.0000
Log likelihood = -829.74657 Pseudo R2 = 0.1855
Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
1 (no college)
hscath -14.11493 698.6953 -0.02 0.984 -1383.532 1355.303
grades .6983612 .0574514 12.16 0.000 .5857585 .810964
faminc -.0148641 .0041227 -3.61 0.000 -.0229444 -.0067839
famsiz .0666033 .0720741 0.92 0.355 -.0746593 .2078659
parcoll -1.02433 .2774019 -3.69 0.000 -1.568028 -.4806322
female .0575788 .1964323 0.29 0.769 -.3274214 .442579
black -1.495237 .4170395 -3.59 0.000 -2.312619 -.6778546
_cons -5.008206 .5671367 -8.83 0.000 -6.119774 -3.896638
2 (2-year college)
hscath -15.10527 724.2084 -0.02 0.983 -1434.528 1404.317
grades .3988077 .0446722 8.93 0.000 .3112518 .4863635
faminc -.0050481 .0025969 -1.94 0.052 -.010138 .0000418
famsiz -.0305312 .0652636 -0.47 0.640 -.1584454 .097383
parcoll -.4978009 .2043127 -2.44 0.015 -.8982465 -.0973554
female .199134 .1705162 1.17 0.243 -.1350716 .5333397
black -.9392084 .3788355 -2.48 0.013 -1.681712 -.1967045
_cons -2.739292 .4401899 -6.22 0.000 -3.602048 -1.876536
3 (4-year college)   (base outcome)
A positive coefficient of a regressor suggests increased odds for choice 2 over choice 1, holding all other regressors constant. Likewise, a negative coefficient of a regressor implies that the odds in favor of no college are greater than those of a 2-year college. Thus, from Panel 1 of the table above we observe that if family income increases, the odds of going to a 2-year college increase compared to no college, holding all other variables constant.
Similarly, the negative coefficient of the grades variable implies that the odds in favor of no college are greater than those of a 2-year college, again holding all other variables constant (remember how the grades are coded in this example). A similar interpretation applies to the second panel of the results table above.
To be concrete, let us interpret the coefficient of grade point average. Holding other
variables constant, if the grade point average increases by one unit, the logarithmic
chance of preferring a 2-year college over no college goes down by about 0.2995. In other words, −0.2995 gives the change in ln(π2i/π1i) for a unit change in the grade average. Therefore, if we take the antilog of this change, we obtain π2i/π1i = e^(−0.2995) ≈ 0.7412. That is, the odds in favor of choosing a 2-year college over no college are only about 74%. This outcome might sound counterintuitive, but remember that a higher grade point on a 13-point scale means poorer academic performance. Incidentally, the odds are also known as relative risk ratios (RRR).
Once the parameters are estimated, one can compute the three probabilities,
which is the primary objective of MLM. Since we have 1,000 observations and 7
regressors, it would be tedious to estimate these probabilities for all the individuals.
However, with the appropriate command, Stata can compute such probabilities. The task can also be simplified by computing the three probabilities at the mean values of the regressors. To illustrate, for individual #10, a white male whose parents
did not have advanced degrees and who did not go to a Catholic school, had an
average grade of 6.44, family income of 42.5, and family size 6, his probabilities of
choosing option 1 (no college), or option 2 (a 2-year college) or option 3 (a 4-year
college) were, respectively, 0.2329, 0.2773 and 0.4897; these probabilities add to
0.9999 or almost 1 because of rounding errors.
Thus, for this individual the highest probability was about 0.49 (i.e. a 4-year
college). This individual did in fact choose to go to a 4-year college. Of course, the estimated probabilities do not always match the choices actually made; in several cases the actual choice was different from the choice with the highest estimated probability. That is why it is better to calculate the choice probabilities at the mean values of the variables. We leave it for the reader to compute these probabilities.
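A hedged sketch of how these computations might be done in Stata, using the regressor names listed earlier; the name of the three-category choice variable (psechoice here) and the dataset itself are assumptions:

* multinomial logit with the first category (no college) as the base outcome
mlogit psechoice hscath grades faminc famsiz parcoll female black, baseoutcome(1)

* predicted probabilities of each of the three choices for every individual
* (the choice of base outcome does not affect these probabilities)
predict p_nocoll p_2yr p_4yr, pr

* choice probabilities evaluated at the mean values of the regressors
margins, predict(outcome(1)) atmeans
margins, predict(outcome(2)) atmeans
margins, predict(outcome(3)) atmeans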
When the categories of the dependent variable have a natural ordering, we can use the ordered logit or ordered probit models, which were specifically developed to handle ordinal scale variables. In practice it does not make a great difference whether we use ordinal probit or ordinal logit models.
Several of the estimated coefficients are statistically indistinguishable from zero; prst, however, is significant at the 7% level. The regression coefficients given in the preceding table are ordered log-odds (i.e. logit) coefficients. What do they suggest? Take, for instance, the coefficient of the education variable, about 0.07. If we increase the level of education by a unit (say, a year), the ordered log-odds of being in a higher warmth category increase by about 0.07, holding all other regressors constant. This is true of warmth category 4 over warmth category 3, of category 3 over category 2, and of category 2 over category 1. The other regression coefficients in the preceding table are interpreted similarly. By convention, one of the categories is chosen as the reference category and its intercept value is fixed at zero.
In practice it is often useful to compute the odds ratios to interpret the various coefficients. This can be done easily by exponentiating (i.e. raising e to a given power) the estimated regression coefficients. To illustrate, take the coefficient of the education variable, 0.07. Exponentiating it, we obtain e^0.07 ≈ 1.0725. This means that if we increase education by a unit, the odds in favor of a higher warmth category over a lower category are greater than 1 (about 7 percent higher).
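For completeness, a minimal Stata sketch of an ordered logit of this kind; the variable names warm (the four-category response), ed (education), and prst are assumptions about the dataset discussed in the text:

* ordered logit for the 4-category warmth variable
ologit warm ed prst

* odds ratio implied by the education coefficient (about exp(0.07) = 1.07 in the text's example)
display exp(_b[ed])

* or ask Stata to report odds ratios directly instead of log-odds coefficients
ologit warm ed prst, or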
Chapter 8
Review Questions
Below are questions based on all the chapters covered in this course. Some of the questions may require additional reading. It is worth investing your time in these questions for two reasons: first, working through them is the best (indeed, the only) way to master the subject; second, your final exam will consist of three or four questions of the type you face in this assignment.
Question 1: State with reason whether the following statements are true, false,
or uncertain. Be precise.
a. The t test of significance discussed in this course requires that the sampling distributions of the estimators (the β̂s) follow the normal distribution.
b. Even though the disturbance term in the Classical Linear Regression Model
is not normally distributed, the OLS estimators are still unbiased.
c. If there is no intercept in the regression model, the estimated residuals (the ûi) will not sum to zero.
d. The p value and the size of a test statistic mean the same thing.
e. In a regression model that contains the intercept, the sum of the residuals is
always zero.
f. If a null hypothesis is not rejected, it is true.
g. The higher the value of σ², the larger is the variance of the β̂s.
h. The conditional and unconditional means of a random variable are the same
things.
Question 2: Consider the following regression output:
Ŷi = 0.2033 + 0.6560 Xi
SE = (0.0976) (0.1961)
R² = 0.397   RSS = 0.0544   ESS = 0.0358
where Y = labor force participation rate (LFPR) of women in 1972 and X = LFPR
of women in 1968. The regression results were obtained from a sample of 19 cities
in the United States.
Question 3: Consider the following regression results for a cross section of countries:
ŜPI_i = −17.8 + 33.2 Gini_i
SE = (4.9) (11.8)   R² = 0.16
where SPI is an index of sociopolitical instability, averaged over 1960–1985, and Gini is the Gini coefficient for 1975 or the closest available year within the range 1970–1980. The sample consists of 40 countries. The Gini coefficient is a measure of income inequality; it lies between 0 and 1. The closer it is to 0, the greater the income equality, and the closer it is to 1, the greater the income inequality.
a. How do you interpret this regression?
b. Suppose the Gini coe¢ cient increases from 0.25 to 0.55. By how much does
SPI go up? What does that mean in practice?
c. Is the estimated slope coefficient statistically significant at the 5% level? Show
the necessary calculations.
d. Based on the preceding regression, can you argue that countries with greater income inequality are politically unstable?
Question 4: In a study of turnover in the labor market, James F. Ragan, Jr., obtained the following results for the U.S. economy for the period 1950–I to 1979–IV.
ln Ŷt = 4.47 − 0.34 ln X2t + 1.22 ln X3t + 1.22 ln X4t + 0.80 ln X5t − 0.0055 X6t
t = (4.28) (−5.31) (3.64) (3.10) (1.10) (−3.09)
R² = 0.5370
Can you find out the sample size underlying these results? (Hint: Recall the relationship between R², F, and t values.)
Question 6: From the data for 46 states in the United States for 1992, Baltagi
obtained the following regression results:
log Ĉ = 4.30 − 1.34 log P + 0.17 log Y
SE = (0.91) (0.32) (0.20)   R² = 0.27
log(salary) = 4.32 + 0.280 log(sales) + 0.0174 roe + 0.000 ros
SE = (0.32) (0.035) (0.0041) (0.00054)
R² = 0.27
Question 11: From the annual data for the U.S. manufacturing sector for 1899–
1922, Dougherty obtained the following regression results:
log Ŷ = 2.81 − 0.53 log L + 0.047 t
SE = (1.38) (0.34) (0.021)   R² = 0.97   F = 189.8
where Y = index of real output, K = index of real capital input, L = index of real
labor input, t = time or trend.
Using the same data, he also obtained the following regression:
log(Y/L) = 0.11 + 0.11 log(K/L) + 0.047 t
SE = (0.03) (0.15) (0.006)   R² = 0.65   F = 19.5
to what kind of extra information might be required to solve the estimation problem
they present.”
d. “... any time series regression containing more than four independent variables
results in garbage.”
Question 13: From data for 54 standard metropolitan statistical areas (SMSA),
Demaris estimated the following logit model to explain high murder rate versus low
murder rate:
where O = the odds of a high murder rate, P =1980 population size in thousands,
C = population growth rate from 1970 to 1980, R = reading quotient, and the se
are the asymptotic standard errors.
a. How would you interpret the various coefficients?
b. Which of the coefficients are individually statistically significant?
c. What is the effect of a unit increase in the reading quotient on the odds of having a higher murder rate?
d. What is the effect of a percentage point increase in the population growth rate on the odds of having a higher murder rate?
Question 14: From the household budget survey of 1980 of the Dutch Central
Bureau of Statistics, J. S. Cramer obtained the following logit model based on a
sample of 2,820 households. The purpose of the logit model was to determine car
ownership as a function of (logarithm of) income. Car ownership was a binary
variable: Y = 1 if a household owns a car, zero otherwise.
L̂i = −2.77231 + 0.347582 ln Income
t = (−3.35) (4.05)    χ²(1 df) = 16.681 (p value = 0.0000)
where L̂ is the estimated logit and ln Income is the logarithm of income. The χ² statistic measures the goodness of fit of the model.
a. Interpret the estimated logit model.
b. From the estimated logit model, how would you obtain the expression for the
probability of car ownership?
c. What is the probability that a household with an income of $20,000 will own
a car? And at an income level of $25,000? What is the rate of change of probability
at the income level of $20,000?
d. Comment on the statistical significance of the estimated logit model.