Chapter 2. Simple Linear Regression
Learning Objectives
Keywords: Simple linear regression model, regression parameters, regression line, residuals,
principle of least squares, least squares estimates, least squares line, fitted values, predicted
values, coefficient of determination, least squares estimators, distributions of least squares
estimators, hypotheses on regression parameters, confidence intervals for regression.
2.1. Introduction
Linear regression is probably the most widely used, and most useful, statistical technique for solving
economic problems. Linear regression models are extremely powerful and can empirically untangle
very complicated relationships between variables. In general, the technique is useful, among other
applications, in helping to explain observations of a dependent variable, usually denoted Y, using
observed values of one or more independent variables, usually denoted X1, X2, .... A key feature of
all regression models is the inclusion of the error term, which captures sources of variation that are
not captured by the other variables.
This Topic presents simple linear regression models; that is, regression models with just one
independent variable, in which the relationship between the dependent variable and the
independent variable is linear (a straight line). Although these models are of a simple nature,
they are important for various reasons. Firstly, they are very common. This is partly due to the
fact that non-linear relationships can often be approximated by straight lines over limited ranges.
Secondly, in cases where a scatterplot of the data displays a non-linear relationship between the
dependent variable and the independent variable, it is sometimes possible to transform the data
into a new pair of variables with a linear relationship. That is, we can transform a simple non-
linear regression model into a simple linear regression model and analyze the data using linear
models. Lastly, the simplicity of these models makes them useful in providing an overview of
the general methodology. In Topic 3, we shall extend the results for simple linear regression
models to the case of more than one explanatory variable.
A formal definition of the simple linear regression model is given in Section 2.2. In Section 2.3,
we discuss how to fit the model, and how to estimate the variation away from the line. Section
2.4 presents inference on simple linear regression models.
2.2. The simple linear regression model

In most of the examples and exercises in Topic 1, there was only one explanatory variable, and
the relationship between this variable and the dependent variable was a straight line with some
random fluctuation around the line.

Example 2.1. The following data were obtained from 15 farmers in order to analyze the relationship
between farm productivity (Y), measured in qt/ha, and fertilizer use (X), measured in kg/ha.
Farm productivity (qt/ha)    Fertilizer use (kg/ha)
28    10
19     8
30    10
50    15
35    12
40    16
22     9
32    10
34    13
44    16
60    20
75    22
45    14
38    11
40    10
Figure 2.1. Scatterplot of farm productivity (qt/ha) against fertilizer used (kg/ha)
The relationship between the two variables could be described by a straight line, together with
some random factors affecting farm productivity. Thus, we can use the following linear
specification as a model for analyzing the data:

Yi = β0 + β1 Xi + εi,   i = 1, ..., 15.
In general, suppose we have a dependent variable Y and an independent variable X. Then the
simple linear regression model for Y on X is given by

Yi = β0 + β1 Xi + εi,   i = 1, ..., n,   (2.1)

where β0 and β1 are unknown parameters, and the εi's are independent random variables with
zero mean and constant variance for all i.
The parameters β0 and β1 are called regression parameters (or regression coefficients), and the
line h(Xi) = β0 + β1 Xi is called the regression line or the linear predictor. Note that a general h(·)
is called a regression curve. The regression parameters β0 and β1 are unknown, non-random
parameters. They are the intercept and the slope, respectively, of the straight line relating Y to X.
The name simple linear regression model refers to the fact that the mean value of the dependent
variable,

E(Yi) = β0 + β1 Xi,

is a linear function of the regression parameters β0 and β1.
The terms εi in (2.1) are called random errors or random terms. The random error εi is the term
which accounts for the variation of the ith dependent variable Yi away from the linear predictor
β0 + β1 Xi at the point Xi. That is,

εi = Yi − (β0 + β1 Xi),   i = 1, ..., n.   (2.2)

The εi's are independent random variables with the same variance and zero mean. Hence, the
dependent variables Yi are independent with means β0 + β1 Xi and constant variance equal to
the variance of εi.
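To build intuition for the role of the random errors, it can help to simulate data from model (2.1). The following Python sketch is an illustration only; the parameter values and x-values are assumptions chosen purely for this example, not quantities from the module.

```python
# A minimal sketch, assuming made-up parameter values: simulate data from
# model (2.1), Y_i = beta0 + beta1 * X_i + eps_i, with independent,
# zero-mean, constant-variance normal errors.
import numpy as np

rng = np.random.default_rng(seed=1)

beta0, beta1, sigma = 5.0, 3.0, 4.0            # assumed "true" intercept, slope, error sd
x = np.array([8, 9, 10, 11, 12, 14, 16, 20, 22], dtype=float)

eps = rng.normal(loc=0.0, scale=sigma, size=x.size)   # random errors eps_i
y = beta0 + beta1 * x + eps                            # dependent variables Y_i

print(np.round(y, 2))
```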
For the above example, an interpretation of the regression parameters β0 and β1 is as follows:

β0: the expected farm productivity for a hypothetical farmer who applies no fertilizer.

β1: the expected change in farm productivity when fertilizer application is increased by
one kg. Observe that the slope of the fitted line turns out to be positive, implying that farm
productivity increases with increasing fertilizer application.
Regression, like most statistical techniques, has a set of underlying assumptions that are expected
to hold if we are to have confidence in the estimated model. Some of the assumptions are
required just to compute the estimates, even if the only goal is to describe a set of data. Other
assumptions are required if we want to make inferences about a population from sample information.

The error term in regression also provides the basis for estimating the standard error of the model
and for making inferences, including tests on the regression coefficients. To use regression
analysis properly, there are a number of conditions on the error term in the model that we must be
able to reasonably assume are true. If these assumptions are not reasonable for our model, the
results may be biased or may no longer have minimum variance.

The following are the main assumptions about the error term in the regression model.
1. The mean of the probability distribution of the error term is zero (E(εi) = 0).

This holds by construction for the ordinary least squares (OLS) estimator, but it also reflects the
notion that we do not expect the error terms to be predominantly positive or negative (over- or
under-estimating the regression line); rather, they are centered around the regression line.

Assumptions about the error term in regression are very important for statistical inference, that is,
for making statements about a larger population from a sample.
2. The variance of the error term is constant for all values of the independent variable (Var(εi) = σ²).

This implies that we assume a constant variance for Y across all levels of the independent
variable. This property is called homoscedasticity, and it enables us to pool information from all
the data to make a single estimate of the variance. Data that do not show constant error variance
are called heteroscedastic, and this must be corrected by a data transformation or other methods.
3. The error terms are normally distributed.

This assumption follows from statistical theory on the sampling distribution of the regression
coefficients and is a reasonable assumption as our sample size gets larger and larger. It enables
us to make inferences from a sample to a population, much as we did for the mean.
4. The error terms are independent of each other and of the independent variable in the
model (Cov(εi, εj) = 0 for i ≠ j, and Cov(Xi, εi) = 0).

This means that the error terms are uncorrelated with each other and with the independent variable
in the model. Correlated error terms often occur in time series data, where the problem is known as
autocorrelation, while correlation of the error terms with the independent variable is called
endogeneity. If there is correlation among the error terms, or between the error terms and the
independent variable, it usually implies that our model is mis-specified. Another way to view this
problem is that there is still a pattern left in the data to be explained, for example by including a
lagged variable in a time series, or a nonlinear term in the case of correlation with an independent
variable.
2.3. Fitting the model

Having decided that a straight line might describe the relationship in the data well, the obvious
question is now: which line fits the data best?

In Figure 2.2 a line is added to the scatterplot of the data on farm productivity and fertilizer used.
Many lines could be fitted to this data set. However, there is only one line that best fits the data,
in the sense of having the smallest possible sum of squared residuals.
Figure 2.2. Line of best fit to the data on farm productivity (qt/ha) versus fertilizer use (kg/ha)
The most common criterion for estimating the best fitting line to data is the principle of least
squares. This criterion is described in Subsection 2.3.1. Subsection 2.3.2 concerns a measure of
the strength of the straight-line relationship. When we estimate the regression line, we effectively
estimate the two regression parameters β0 and β1. That leaves one remaining parameter in the
model: the common variance σ² of the dependent variables. We discuss how to estimate σ² in
Subsection 2.3.3.
2.3.1. The principle of least squares

The principle of least squares is based on the residuals. For any line, the residuals are the
deviations of the dependent variables Yi away from the line. Note that residuals always refer to a
given line or curve. The residuals are usually denoted by εi, like the random errors in (2.2). The
reason for this notation is that, if the line is the true regression line of the model, then the
residuals are exactly the random errors εi in (2.2). For a given line h̃(X) = β̃0 + β̃1 X, the
observed value of ε̃i is the difference between the ith observation yi and the linear predictor
β̃0 + β̃1 Xi at the point Xi. That is,

ε̃i = yi − (β̃0 + β̃1 xi),   i = 1, ..., n.   (2.3)
The observed values of ε̃i are called observed residuals (or just residuals). The residuals are the
vertical deviations of the observed values from the line.

Note that the better the line fits the data, the smaller the residuals will be. Thus, we can use the
'sizes' of the residuals as a measure of how well a proposed line fits the data. If we simply used
the sum of the residuals, large positive and large negative values would cancel out; this problem
can be avoided by using the sum of the squared residuals instead. If this measure, the sum of
squared residuals, is small, the line explains the variation in the data well; if it is large, the line
explains the variation in the data poorly. The principle of least squares is to estimate the
regression line by the line which minimizes the sum of squared residuals or, equivalently, to
estimate the regression parameters β0 and β1 by the values which minimize the sum of squared
residuals.
The sum of squared residuals or, as it is usually called, the residual sum of squares, is denoted
by RSS (or RSS(β0, β1), to emphasize that it is a function of β0 and β1), and is given by

RSS = RSS(β0, β1) = Σᵢ (yi − β0 − β1 xi)²   (2.4)

with the summation running from i = 1 to n.
In order to minimize RSS with respect to β0 and β1, we differentiate (2.4) and obtain

∂RSS(β0, β1)/∂β0 = −2 Σᵢ (yi − β0 − β1 xi)

∂RSS(β0, β1)/∂β1 = −2 Σᵢ xi (yi − β0 − β1 xi)

Setting the derivatives equal to zero and re-arranging the terms yields the normal equations

Σᵢ yi = n β0 + β1 Σᵢ xi

Σᵢ xi yi = β0 Σᵢ xi + β1 Σᵢ xi²
Solving the equations for β0 and β1 provides the least squares estimates β̂0 (read 'beta-naught-
hat') and β̂1 ('beta-one-hat') of β0 and β1, respectively. They are given by

β̂1 = Σᵢ (xi − x̄)(yi − ȳ) / Σᵢ (xi − x̄)²

β̂0 = ȳ − β̂1 x̄

where ȳ = (Σᵢ yi)/n and x̄ = (Σᵢ xi)/n denote the sample means of the dependent and explanatory
variable, respectively.
The estimated regression line is called the least squares line or the fitted regression line and is
given by:

ŷ = β̂0 + β̂1 x   (2.5)
The values ŷi = β̂0 + β̂1 xi are called the fitted values or the predicted values. The fitted value ŷi
is an estimate of the expected value of the dependent variable for a given value xi of the
explanatory variable. The residuals corresponding to the fitted regression line are called the fitted
residuals, or simply the residuals. They are given by

ε̂i = yi − ŷi = yi − (β̂0 + β̂1 xi),   i = 1, ..., n.   (2.6)

The fitted residuals can be thought of as observations of the random errors εi in the simple linear
regression model (2.1).
It is convenient to use the following shorthand notation for the sums involved in the expressions
for the parameter estimates (all summations are for i = 1, ..., n):

sxx = Σ (xi − x̄)² = Σ xi² − (Σ xi)² / n

syy = Σ (yi − ȳ)² = Σ yi² − (Σ yi)² / n

sxy = syx = Σ (xi − x̄)(yi − ȳ) = Σ xi yi − (Σ xi)(Σ yi) / n
The sums sxx and syy are called corrected sums of squares, and the sums sxy and syx are called
corrected sums of cross products. The corresponding sums involving the random variables Yi
rather than the observations yi are denoted by upper-case letters: Syy, Sxy and Syx. In this
notation, the least squares estimates of the slope and intercept of the regression line are given by

β̂1 = sxy / sxx   (2.7)

and

β̂0 = ȳ − β̂1 x̄   (2.8)

respectively.
Note that the estimate β̂1 is undefined if sxx = 0 (division by zero). But this is not a problem in
practice: sxx = 0 only if the explanatory variable takes a single value, in which case there can be
no best-fitting line. Note also that the least squares line passes through the centroid of the data,
the point (x̄, ȳ).
For the data on farm productivity and fertilizer, the least squares estimates of the regression
parameters are

β̂1 = 3.271 and β̂0 = −3.278.

So, the fitted least squares line has equation

ŷ = −3.278 + 3.271x.
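These estimates can be checked directly from the data table in Example 2.1. The following Python sketch is an illustration only: it applies formulas (2.7) and (2.8) to the data listed above.

```python
# A minimal sketch: least squares estimates (2.7) and (2.8) for the
# farm productivity / fertilizer data in Example 2.1.
y = [28, 19, 30, 50, 35, 40, 22, 32, 34, 44, 60, 75, 45, 38, 40]   # qt/ha
x = [10,  8, 10, 15, 12, 16,  9, 10, 13, 16, 20, 22, 14, 11, 10]   # kg/ha
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)                          # corrected sum of squares
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))    # corrected sum of cross products

beta1_hat = s_xy / s_xx                 # slope estimate, formula (2.7)
beta0_hat = y_bar - beta1_hat * x_bar   # intercept estimate, formula (2.8)

print(round(beta1_hat, 3), round(beta0_hat, 3))   # approximately 3.271 and -3.278
```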
The least squares line is shown in Figure 2.3. The line appears to fit the data reasonably well.
Figure 2.3. Farm productivity and fertilizer data with the least squares line
The least squares principle is the traditional and most common method for estimating the
regression parameters. But other estimation criteria exist: for example, estimating the parameters
by the values that minimize the sum of absolute values of the residuals, or by the values that
minimize the sum of orthogonal distances between the observed values and the fitted line. The
principle of least squares has various advantages over the other methods. For example, it can be
shown that, if the dependent variables are normally distributed (which is often the case), the least
squares estimates of the regression parameters are exactly the maximum likelihood estimates of
the parameters.
2.3.2. The coefficient of determination

In the previous subsection we used the principle of least squares to fit the 'best' straight line to the
data. But how well does the least squares line explain the variation in the data? In this subsection
we describe a measure for roughly assessing how well a fitted line describes the variation in the data:
the coefficient of determination.

The coefficient of determination compares the amount of variation in the data away from the
fitted line with the total amount of variation in the data. The argument is as follows: if we did not
have the linear model, we would have to use the 'naïve' model ŷ = ȳ instead. The variation away
from the naïve model is syy = Σᵢ (yi − ȳ)²: the total amount of variation in the data. However, if
we use the least squares line (2.5) as the model, the variation away from the model is only

RSS(β̂0, β̂1) = Σᵢ (yi − β̂0 − β̂1 xi)² = syy − s²xy / sxx
A measure of the strength of the linear relationship between Y and X is the coefficient of
determination R²: it is the proportional reduction in variation obtained by using the least squares
line instead of the naïve model. That is, the reduction in variation away from the model,
syy − RSS, as a proportion of the total variation syy:

R² = (syy − RSS) / syy = (s²xy / sxx) / syy = s²xy / (sxx syy)
The larger the value of R², the greater the reduction from syy to RSS relative to syy, and the
better the fitted line describes the variation in the data. In terms of the corrected sums, the
observed value of the coefficient of determination is

r² = s²xy / (sxx syy)
Note that the square root of r² is, up to sign, exactly the estimate from Module 1 of the Pearson
correlation coefficient, ρ, between x and Y when x is regarded as a random variable:

r = sxy / ((n − 1) sx sy)

where sx and sy denote the sample standard deviations of x and y.
The value of R² will always lie between 0 and 1 (or, in percentages, between 0% and 100%). It is
equal to 1 if β̂1 ≠ 0 and RSS = 0, that is, if all the data points lie precisely on the fitted straight
line (i.e. when there is a 'perfect' linear relationship between Y and x). If the coefficient of
determination is close to 1, it is an indication that the data points lie close to the least squares
line. The value of R² is zero if RSS = syy, that is, if the fitted straight-line model offers no more
information about the value of Y than the naïve model does.
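As a quick check for the farm productivity example, R² can be computed directly from the data table in Example 2.1. The Python sketch below is an illustration only; the value it prints is computed from the data rather than quoted from the module.

```python
# A minimal sketch: R^2 = s_xy^2 / (s_xx * s_yy) for the data in Example 2.1.
y = [28, 19, 30, 50, 35, 40, 22, 32, 34, 44, 60, 75, 45, 38, 40]
x = [10,  8, 10, 15, 12, 16,  9, 10, 13, 16, 20, 22, 14, 11, 10]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r_squared = s_xy ** 2 / (s_xx * s_yy)
print(round(r_squared, 3))   # a high value, roughly 0.87 for these data
```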
It is tempting to use R² as a measure of whether a model is good or not. This is not appropriate.
Try to think of why for a moment before reading on.

The coefficient of determination is only a measure of how well a straight-line model describes
the variation in the data compared to the naïve model, not compared to other models in general.
Even if R² is close to 1 (i.e. a straight line explains a large proportion of the variation), it could
easily be that a non-linear model explains the variation in the data much better than the linear
one. Methods for assessing the appropriateness of the assumption of a straight-line relationship
between Y and x will be discussed in Topic 4.
For the data on farm productivity and fertilizer application, the coefficient of determination turns
out to be very high, so the straight-line model seems to describe the variation in the data very well.
2.3.3. Estimating the common variance

In Subsection 2.3.1, we found that the principle of least squares provides estimates of the
regression parameters in a simple linear regression model. But, in order to fit the model, we also
need an estimate of the common variance σ². Such an estimate is required for making
statistical inferences about the true straight-line relationship between x and Y. Since σ² is the
common variance of the random errors εi, i = 1, ..., n, it would be natural to estimate it by the
sample variance of the fitted residuals (2.6). That is, an estimate would be
Σᵢ (yi − β̂0 − β̂1 xi)² / (n − 1) = RSS / (n − 1)

where RSS = RSS(β̂0, β̂1). However, it can be shown that this is a biased estimate of σ²; that is,
the corresponding estimator does not have the 'correct' mean value: E[RSS/(n − 1)] ≠ σ². An
unbiased estimate of the common variance σ² is given by

s² = RSS(β̂0, β̂1) / (n − 2) = (syy − s²xy / sxx) / (n − 2)   (2.9)
The denominator in (2.9) is the residual degrees of freedom (df), that is
df = number of observations - number of estimated parameters.
In particular, for simple linear regression models, we have n observations and we have estimated
the two regression parameters β0 and β1, so the residual df is n-2.
The relevant summary statistics for the data on farm productivity and fertilizer application are

n = 15,   sxx = 234.93,   syy = 2899.73,   sxy = 768.53,

so the residual degrees of freedom are n − 2 = 13 and the variance estimate is

s² = (2899.73 − 768.53² / 234.93) / 13 ≈ 29.7.
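A short Python sketch (illustration only) of formula (2.9), computing RSS and the unbiased variance estimate s² = RSS/(n − 2) from the example data:

```python
# A minimal sketch: RSS and the unbiased variance estimate (2.9)
# for the farm productivity / fertilizer data.
y = [28, 19, 30, 50, 35, 40, 22, 32, 34, 44, 60, 75, 45, 38, 40]
x = [10,  8, 10, 15, 12, 16,  9, 10, 13, 16, 20, 22, 14, 11, 10]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

rss = s_yy - s_xy ** 2 / s_xx      # residual sum of squares for the fitted line
s_squared = rss / (n - 2)          # unbiased estimate of sigma^2, with df = n - 2
print(round(rss, 2), round(s_squared, 2))
```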
2.4. Inference on simple linear regression models

In Section 2.3 we produced an estimate of the straight line that describes the data-variation best.
However, since the estimated line is based on the particular sample of data (xi, yi), i = 1, ..., n,
that we have observed, we would almost certainly get a different line if we took a new sample of
data and estimated the line on the basis of the new sample. For example, if we measured farm
productivity and fertilizer application for a different group of farmers in Haramaya district than
the one in Example 2.1, we would invariably get different measurements, and therefore a different
least squares line. In other words, the least squares line is an observation of a random line which
varies from one experiment to the next. Likewise, the least squares estimates β̂0 and β̂1 of the
intercept and slope, respectively, of the least squares line are observations of random variables.
These random variables are called the least squares estimators. An estimate is non-random and is
an observation of an estimator, which is a random variable. The least squares estimators are
given by:
β̂1 = Sxy / sxx = Σᵢ (xi − x̄)(Yi − Ȳ) / Σᵢ (xi − x̄)²   (2.10)

β̂0 = Ȳ − β̂1 x̄   (2.11)

where Ȳ = (Σᵢ Yi) / n, and all summations run from i = 1 to n. By a similar argument, we find
that an unbiased estimator for the common variance σ² is given by

S² = Σᵢ (Yi − Ŷi)² / (n − 2) = (Syy − S²xy / sxx) / (n − 2)   (2.12)
where Ŷi = β̂0 + β̂1 xi, with β̂0 and β̂1 being the least squares estimators. Note that the
randomness in the estimators is due to the dependent variables only, since the explanatory
variables are non-random. In particular, it can be seen from (2.10) and (2.11) that β̂0 and β̂1 are
linear combinations of the dependent variables.
It can be shown that the least squares estimators are unbiased, that is, they have the 'correct'
mean values:

E(β̂0) = β0 and E(β̂1) = β1   (2.13)

Also, the estimator S² is an unbiased estimator of the common variance σ², that is,

E(S²) = σ²   (2.14)
The variances of the estimators β̂0 and β̂1 can be found from standard results on variances (we
shall not derive them here). The variances are given by

Var(β̂0) = σ² (1/n + x̄² / sxx)   (2.15)

Var(β̂1) = σ² / sxx   (2.16)
Note that both variances decrease when the sample size n increases. The variances also decrease
if sxx = Σᵢ (xi − x̄)² is increased, that is, if the x-values are widely dispersed. In some studies, it
is possible to design the experiment such that the value of sxx is large, and hence the variances of
the estimators are small. Small variances are desirable, as they improve the precision of
conclusions drawn from the analysis.
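The effect of the spread of the x-values on the precision of the estimators can be illustrated numerically with formulas (2.15) and (2.16). In the Python sketch below, σ² and the two sets of x-values are assumptions chosen purely for the comparison; the wider design gives smaller variances.

```python
# A minimal sketch of (2.15) and (2.16): the estimator variances shrink
# when the x-values are more spread out. sigma2 and the x-values are assumed.
import numpy as np

sigma2 = 25.0                                               # assumed error variance sigma^2
designs = {
    "narrow": np.array([10, 11, 12, 13, 14], dtype=float),
    "wide":   np.array([2, 7, 12, 17, 22], dtype=float),    # same mean, larger spread
}

for name, x in designs.items():
    n = len(x)
    s_xx = np.sum((x - x.mean()) ** 2)
    var_beta1 = sigma2 / s_xx                               # formula (2.16)
    var_beta0 = sigma2 * (1 / n + x.mean() ** 2 / s_xx)     # formula (2.15)
    print(name, round(float(var_beta1), 3), round(float(var_beta0), 3))
```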
In order to make inferences about the model, such as testing hypotheses and producing
confidence intervals for the regression parameters, we need to make some assumption on the
distribution of the random variables Yi. The most common assumption, and the one we shall
make here, is that the dependent variables are normally distributed. Topic 4 is concerned with
various methods for checking the assumptions of regression models. In this section, we shall
simply assume the following about the dependent variables: the Yi's are independent, normally
distributed random variables with equal variances and mean values depending linearly on xi.
2.4.1. Inference on the regression parameters
To test hypotheses and construct confidence intervals for the regression parameters β0 and β1, we
need the distributions of the parameter estimators β̂0 and β̂1. Recall from (2.10) and (2.11) that
the least squares estimators β̂0 and β̂1 are linear combinations of the dependent variables Yi.
Standard theory on the normal distribution says that a linear combination of independent normal
random variables is normally distributed. Thus, since the Yi's are independent normal random
variables, the estimators β̂0 and β̂1 are both normally distributed. In (2.13)-(2.16), we found the
mean values and variances of the estimators. Putting everything together, we get

β̂0 ~ N(β0, σ² (1/n + x̄² / sxx))

β̂1 ~ N(β1, σ² / sxx)
It can be shown that the distribution of the estimator S² of the common variance σ² is given by

S² ~ σ² χ²(n − 2) / (n − 2)

where χ²(n − 2) denotes a chi-square distribution with n − 2 degrees of freedom. Moreover, it can
be shown that the estimator S² is independent of the estimators β̂0 and β̂1. (But the estimators β̂0
and β̂1 are not, in general, independent of each other.)
We can use these distributional results to test hypotheses on the regression parameters. Since
both β̂0 and β̂1 have normal distributions with variances depending on the unknown quantity σ²,
we can apply standard results for normal random variables with unknown variances. Thus, in
order to test βi equal to some value βi*, i = 0, 1, that is, to test hypotheses of the form
H0: βi = βi*, we use the test statistic

tβ̂i(y) = (β̂i − βi*) / se(β̂i),   i = 0, 1   (2.17)
where se(β̂i) denotes the estimated standard error of the estimator β̂i. That is,

se(β̂0) = √(s² (1/n + x̄² / sxx))

and

se(β̂1) = √(s² / sxx).

It can be shown that both test statistics tβ̂0(y) and tβ̂1(y) have t-distributions with n − 2 degrees
of freedom.
The test statistics in (2.17) can be used for testing the parameter βi (i = 0, 1) equal to any value
βi*. However, for the slope parameter β1, one value is particularly important: if the hypothesis
that β1 equals zero cannot be rejected, the simple linear regression model simplifies to

Yi = β0 + εi,   i = 1, ..., n.

That is, the value of yi does not depend on the value of xi. In other words, the dependent
variable and the independent variable are unrelated!
It is common, for instance in computer output, to present the estimates and standard errors of the
least squares estimators in a table like the following:

Parameter    Estimate    Standard error    t-statistic         p-value
β0           β̂0          se(β̂0)            β̂0 / se(β̂0)        p-value for H0: β0 = 0
β1           β̂1          se(β̂1)            β̂1 / se(β̂1)        p-value for H0: β1 = 0

The column 't-statistic' contains the t-test statistic (2.17) for testing the hypotheses H0: β0 = 0
and H0: β1 = 0, respectively. If you wish to test a parameter equal to a different value, it is easy
to produce the appropriate test statistic (2.17) from the table. The column 'p-value' contains the
p-value corresponding to the t-statistic in the same row.
For the data on farm productivity and fertilizer application, such a table can be computed from
the summary statistics given earlier. Not surprisingly, the hypothesis that the slope parameter β1
equals zero is clearly rejected: fertilizer use is strongly related to farm productivity. If, for some
reason, we wished to test whether the slope parameter was equal to 1.58, say, the test statistic
would be t = (β̂1 − 1.58) / se(β̂1). Since n = 15 in this example, this test statistic has a t(13)-
distribution under the null hypothesis, and the relevant critical value at the 5% significance level
is t0.975(13) = 2.16. The test statistic exceeds this critical value in absolute value, so on the basis
of these data we reject the hypothesis that the slope parameter is 1.58 at the 5% significance level.
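The standard errors, t-statistics and p-values behind such a table can be computed from the data in Example 2.1. The Python sketch below is an illustration only: it relies on scipy for the t-distribution, and the printed values are computed from the data table rather than quoted from the module. It also evaluates the test of H0: β1 = 1.58 from the worked example.

```python
# A minimal sketch of the t-statistics (2.17) and their two-sided p-values
# for the farm productivity / fertilizer data (computed, not quoted, values).
import numpy as np
from scipy import stats

y = np.array([28, 19, 30, 50, 35, 40, 22, 32, 34, 44, 60, 75, 45, 38, 40], dtype=float)
x = np.array([10,  8, 10, 15, 12, 16,  9, 10, 13, 16, 20, 22, 14, 11, 10], dtype=float)
n = len(x)

s_xx = np.sum((x - x.mean()) ** 2)
s_yy = np.sum((y - y.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))

beta1_hat = s_xy / s_xx
beta0_hat = y.mean() - beta1_hat * x.mean()
s2 = (s_yy - s_xy ** 2 / s_xx) / (n - 2)                  # estimate of sigma^2

se_beta0 = np.sqrt(s2 * (1 / n + x.mean() ** 2 / s_xx))   # estimated standard errors
se_beta1 = np.sqrt(s2 / s_xx)

# t-statistics and two-sided p-values for H0: beta_0 = 0 and H0: beta_1 = 0
for est, se in ((beta0_hat, se_beta0), (beta1_hat, se_beta1)):
    t_stat = est / se
    p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    print(round(est, 3), round(se, 3), round(t_stat, 2), round(p_val, 4))

# Test of H0: beta_1 = 1.58, as in the worked example
t_158 = (beta1_hat - 1.58) / se_beta1
print(round(t_158, 2), 2 * stats.t.sf(abs(t_158), df=n - 2))
```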
A second practical use of the table is to provide confidence intervals for the regression
parameters. The 1 − α confidence intervals for β0 and β1 are given by, respectively,

β̂0 ± t1−α/2(n − 2) se(β̂0)

and

β̂1 ± t1−α/2(n − 2) se(β̂1).

In order to construct the confidence intervals, all that is needed is the table and t1−α/2(n − 2): the
(1 − α/2)-quantile of the t-distribution with n − 2 degrees of freedom.
For the data on farm productivity and fertilizer application, the 95% confidence intervals for the
regression parameters can be obtained from the table for these data and the 0.975-quantile of a
t(13)-distribution, t0.975(13) = 2.16. The confidence intervals for β0 and β1 are, respectively,

β0: (−13.17, 6.62)

and

β1: (2.50, 4.04).
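A Python sketch (illustration only) of the confidence-interval formulas above, applied to the example data; because the quantities are recomputed from the raw data table, the results may differ slightly from the rounded values quoted in the text.

```python
# A minimal sketch: 95% confidence intervals beta_hat_i +/- t_{0.975}(n-2) * se(beta_hat_i)
# for the farm productivity / fertilizer data.
import numpy as np
from scipy import stats

y = np.array([28, 19, 30, 50, 35, 40, 22, 32, 34, 44, 60, 75, 45, 38, 40], dtype=float)
x = np.array([10,  8, 10, 15, 12, 16,  9, 10, 13, 16, 20, 22, 14, 11, 10], dtype=float)
n, alpha = len(x), 0.05

s_xx = np.sum((x - x.mean()) ** 2)
s_yy = np.sum((y - y.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))

beta1_hat = s_xy / s_xx
beta0_hat = y.mean() - beta1_hat * x.mean()
s2 = (s_yy - s_xy ** 2 / s_xx) / (n - 2)
se_beta0 = np.sqrt(s2 * (1 / n + x.mean() ** 2 / s_xx))
se_beta1 = np.sqrt(s2 / s_xx)

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)            # t_0.975(13), about 2.16
ci_beta0 = (beta0_hat - t_crit * se_beta0, beta0_hat + t_crit * se_beta0)
ci_beta1 = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
print(tuple(round(v, 2) for v in ci_beta0))
print(tuple(round(v, 2) for v in ci_beta1))   # close to (2.50, 4.04) for the slope
```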
Learning activity 2.1. Consider data on two continuous variables which you are familiar with
and test whether the explanatory variable significantly affects the dependent variable.
In this topic, the simple linear regression model has been discussed. We have described a
method, based on the principle of least squares, for fitting simple linear regression models to
data. The principle of least squares says to estimate the regression line by the line which
minimizes the sum of the squared deviations of the observed data away from the line. The
intercept and slope of the fitted line are estimates of the regression parameters β0 and β1,
respectively. Further, an unbiased estimate of the common variance has been given. Under the
assumption of normality of the dependent variable, we have tested hypotheses and constructed
confidence intervals for the regression parameters.