Lecture 6: Linear Regression
The relationship between two variables may be one of functional dependence of one on the other.
That is, the magnitude of one of the variables (the dependent variable) is assumed to be
determined by (that is, is a function of) the magnitude of the second variable (the independent
variable), whereas the reverse is not true. For example, in the relationship between blood
pressure and age in humans, blood pressure may be considered the dependent variable and age
the independent variable; we may reasonably assume that although the magnitude of a person's
blood pressure might be a function of age, age is not determined by blood pressure. This is not to
say that age is the only biological determinant of blood pressure, but we do consider it to be one
determining factor. The term dependent does not necessarily imply a cause-and-effect
relationship between the two variables. Such a dependence relationship is called a regression.
The term simple regression refers to the simplest kind of regression, one in which only two
variables are considered. Multiple regression involves more than two variables.
Data amenable to simple regression analysis consist of pairs of data measured on a ratio or
interval scale. These data are composed of measurements of a dependent variable (Y) that is a
random effect and an independent variable (X) that is either a fixed effect or a random effect.
It is convenient and informative to graph simple regression data using the ordinate (Y axis) for
the dependent variable and the abscissa (X axis) for the independent variable. Such a graph is
shown in the figure below for n = 13 data pairs, where the data appear as a scatter of 13 points, each
point representing a pair of X and Y values. One pair of X and Y data may be designated as (X1,
Y1), another as (X2, Y2), another as (X3, Y3), and so on, resulting in what is called a scatter plot
of all n of the (Xi, Yi) data.
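As an illustration, here is a minimal Python sketch of such a scatter plot using matplotlib; the 13 (X, Y) pairs are invented placeholders, not the data from the lecture's figure:

import matplotlib.pyplot as plt

# Hypothetical (Xi, Yi) pairs; n = 13 as in the figure described above.
X = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
Y = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]

plt.scatter(X, Y)                       # one point per (Xi, Yi) pair
plt.xlabel("X (independent variable)")  # abscissa
plt.ylabel("Y (dependent variable)")    # ordinate
plt.title("Scatter plot of n = 13 data pairs")
plt.show()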
The simple linear regression relationship in the population is described by the equation for a straight line:

Y = α + βX

where α and β are population parameters (hence constants), estimated by the sample statistics a and b; hence the sample regression equation is Ŷ = a + bX.
The only way to determine the population parameters α and β would be to possess all the data for
the entire population. Since this is nearly always impossible, we have to estimate these
parameters from a sample of n data, where n is the number of pairs of X and Y values. The
calculations required to arrive at such estimates, as well as to execute the testing of a variety of
important hypotheses, involve the computation of sums of squared deviations from the mean.
Recall that the "sum of squares" of the Xi values is defined as Σ(Xi − X̄)², which is more easily obtained on a calculator by the "machine formula"

Σ(Xi − X̄)² = ΣXi² − (ΣXi)²/n

Here it will be convenient to define xi = Xi − X̄, so that this sum of squares can be abbreviated as Σxi² or, more simply, as Σx².
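As a quick check, this sketch (continuing the Python session above, with the same hypothetical X values) shows that the definitional and machine formulas agree:

n = len(X)
x_bar = sum(X) / n

# Definitional formula: sum of squared deviations from the mean.
ss_x_def = sum((Xi - x_bar) ** 2 for Xi in X)

# Machine formula: sum(X^2) - (sum(X))^2 / n.
ss_x_machine = sum(Xi ** 2 for Xi in X) - sum(X) ** 2 / n

assert abs(ss_x_def - ss_x_machine) < 1e-9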
Another quantity needed for regression analysis is referred to as the sum of the cross products of deviations from the mean:

Σxy = Σ(Xi − X̄)(Yi − Ȳ)

where y denotes a deviation of a Y value from the mean of all Y's, just as x denotes a deviation of an X value from the mean of all X's. The sum of the cross products, analogously to the sum of squares, has a simple-to-use "machine formula":

Σxy = ΣXiYi − (ΣXi)(ΣYi)/n
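Analogously, a short continuation of the sketch verifies both forms of the sum of cross products:

y_bar = sum(Y) / n

# Definitional formula: sum of products of paired deviations.
sum_xy_def = sum((Xi - x_bar) * (Yi - y_bar) for Xi, Yi in zip(X, Y))

# Machine formula: sum(X*Y) - sum(X)*sum(Y)/n.
sum_xy_machine = sum(Xi * Yi for Xi, Yi in zip(X, Y)) - sum(X) * sum(Y) / n

assert abs(sum_xy_def - sum_xy_machine) < 1e-9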
(a) The Regression Coefficient: The parameter β is termed the regression coefficient, or the slope of the best-fit regression line. The best sample estimate of β is

b = Σxy / Σx²

Although the denominator in this calculation is always positive, the numerator may be positive, negative, or zero, and the value of b can theoretically range from −∞ to +∞, including zero.
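Continuing the sketch, the slope is the ratio of the two quantities just computed; numpy.polyfit serves here only as an independent check:

import numpy as np

b = sum_xy_def / ss_x_def  # b = sum(xy) / sum(x^2)

# Independent check against numpy's least-squares line (degree-1 polynomial).
b_check, a_check = np.polyfit(X, Y, 1)
assert abs(b - b_check) < 1e-9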
(b) The Y Intercept: An infinite number of lines possess any stated slope, all of them parallel. However, each such line can be defined uniquely by stating, in addition to β, any one point on the line, that is, any pair of coordinates (Xi, Yi). The point conventionally chosen is the point on the line where X = 0. The value of Y in the population at this point is the parameter α, which is called the Y intercept. It can be shown mathematically that the point (X̄, Ȳ) always lies on the best-fit regression line. Thus, substituting X̄ and Ȳ in the equation of a straight line, we find that

a = Ȳ − bX̄
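In the same sketch, the intercept follows from the two means, and the claim that (X̄, Ȳ) lies on the fitted line is easy to verify:

a = y_bar - b * x_bar  # a = Y_bar - b * X_bar

# The point (X_bar, Y_bar) lies exactly on the fitted line.
assert abs((a + b * x_bar) - y_bar) < 1e-9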
For any given slope, there exist an infinite number of possible regression lines, each with a
different Y intercept. Similarly, for any given Y intercept, there exist an infinite number of
possible regression lines, each with a different slope. Three of the infinite number are shown
here.
Predicting Values of Y: Knowing the parameter estimates a and b for the linear regression equation, we can predict the value of the dependent variable expected in the population at a stated value of Xi. For example, the wing length of a sparrow at 13.0 days of age would be predicted using the regression equation as

Ŷ = a + b(13.0)
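A one-line continuation of the sketch makes this prediction at X = 13.0 (the a and b here come from the hypothetical data above, not from the lecture's sparrow data):

X_new = 13.0
Y_hat = a + b * X_new  # predicted Y at X = 13.0
print(f"Predicted Y at X = {X_new}: {Y_hat:.3f}")

Such predictions are trustworthy only when the assumptions underlying regression analysis hold, which are as follows: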
1. For each value of X, the values of Y are to have come at random from the sampled
population and are to be independent of one another. That is, obtaining a particular Y
from the population is in no way dependent upon the obtaining of any other Y.
2. For any value of X in the population there exists a normal distribution of Y values.
3. There is homogeneity of variances in the population; that is, the variances of the
distributions of Y values must all be equal to each other.
4. In the population, the mean of the Y's at a given X lies on a straight line with the means of the Y's at all other X's. That is, the actual relationship between Y and X is linear.
5. The measurements of X were obtained without error. This, of course, is typically
impossible; so what we do in practice is assume that the errors in measuring X are
negligible, or at least small, compared with errors in measuring Y.
Violations of assumptions 2, 3, or 4 can sometimes be countered by transformation of the data. When assumption 3 is violated, the residual mean square is underestimated, which inflates the test statistic (F or t) and thus increases the probability of a Type I error.
Heteroscedastic data may sometimes be analyzed advantageously by a procedure known as
weighted regression, which will not be discussed here.
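In practice, assumptions 2, 3, and 4 are often screened by eye with a residual plot; here is a minimal sketch, continuing with the hypothetical fit above (the interpretation comments are standard diagnostics, not from the lecture):

import matplotlib.pyplot as plt

fitted = [a + b * Xi for Xi in X]
residuals = [Yi - f for Yi, f in zip(Y, fitted)]

# Residuals vs. fitted values: curvature suggests non-linearity (assumption 4);
# a funnel shape suggests heterogeneous variances (assumption 3).
plt.scatter(fitted, residuals)
plt.axhline(0)
plt.xlabel("Fitted Y")
plt.ylabel("Residual")
plt.show()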
Regression statistics are known to be robust with respect to at least some of these underlying
assumptions. So violations of them are not usually of concern unless they are severe.
The slope, b, of the regression line computed from the sample data expresses quantitatively the
straight-line dependence of Y on X in the sample. But what is really desired is information about
the functional relationship (if any) in the population from which the sample came. Indeed, the
finding of a dependence of Y on X in the sample (i.e., b ≠ 0) does not necessarily mean that there
is a dependence in the population (i.e., β ≠ 0). Consider the following figure, a scatter plot
representing a population of data points with no dependence of Y on X; the best-fit regression
line for this population would be parallel to the X axis (i.e., the slope, β, would be zero).
However, it is possible, by random sampling, to obtain a sample of data points having the five
values circled in the figure. By calculating b for this sample of five, we would estimate that β
was positive, even though it is, in fact, zero.
A hypothetical population of data points, having a regression coefficient, β, of zero. The
circled points are a possible sample of five.
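This sampling phenomenon is easy to reproduce in simulation; in the following sketch (seed and population values are arbitrary choices) β = 0 by construction, yet the fitted b is generally nonzero:

import numpy as np

rng = np.random.default_rng(1)

# Population with no dependence of Y on X (beta = 0): Y is pure noise around 5.0.
X_sample = rng.uniform(0, 10, size=5)
Y_sample = rng.normal(loc=5.0, scale=1.0, size=5)

# Slope estimated from this random sample of five points.
b_sample, _ = np.polyfit(X_sample, Y_sample, 1)
print(f"Sample b = {b_sample:.3f} even though population beta = 0")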
We are not likely to obtain five such points out of this population, but we desire to assess just
how likely it is; therefore, we can set up a null hypothesis, Ho: β = 0, and the alternate
hypothesis, HA: β ≠ 0, appropriate to that assessment. If we conclude that there is a reasonable
probability (i.e., a probability greater than the chosen level of significance, say 5%) that the
calculated b could have come from sampling a population with β = 0, then Ho is not rejected. If
the probability of obtaining the calculated b is small (say, 5% or less), then Ho is rejected, and
HA is assumed to be true.
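This two-tailed test of Ho: β = 0 can be carried out with scipy.stats.linregress, which reports the slope together with its p-value; a sketch reusing the hypothetical data from above:

from scipy import stats

# Two-sided test of Ho: beta = 0 against HA: beta != 0.
result = stats.linregress(X, Y)
print(f"b = {result.slope:.3f}, p = {result.pvalue:.4f}")
if result.pvalue <= 0.05:
    print("Reject Ho: beta = 0 at the 5% significance level")
else:
    print("Do not reject Ho")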
The Ho may be tested by an analysis of variance (ANOVA) procedure. First, the overall variability of the dependent variable is calculated by computing the sum of squares of deviations of the Yi values from Ȳ, a quantity termed the total sum of squares:

total SS = Σ(Yi − Ȳ)² = Σy²

Then we determine the amount of variability among the Yi values that is attributable to there being a linear regression; this is termed the linear regression sum of squares:

regression SS = (Σxy)² / Σx²
Summary of calculations (the standard regression ANOVA layout implied by the quantities above):

Source of variation    Sum of squares (SS)          DF       Mean square (MS)
Total                  Σy²                          n − 1
Linear regression      (Σxy)² / Σx²                 1        regression SS / 1
Residual               total SS − regression SS     n − 2    residual SS / (n − 2)

F = regression MS / residual MS, with 1 and n − 2 degrees of freedom.
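A sketch assembling these quantities into the F test, continuing with the same hypothetical data (scipy is assumed only for the F distribution's tail probability):

from scipy import stats

total_SS = sum((Yi - y_bar) ** 2 for Yi in Y)  # sum(y^2)
regression_SS = sum_xy_def ** 2 / ss_x_def     # (sum(xy))^2 / sum(x^2)
residual_SS = total_SS - regression_SS

regression_MS = regression_SS / 1              # regression DF = 1
residual_MS = residual_SS / (n - 2)            # residual DF = n - 2

F = regression_MS / residual_MS
p = stats.f.sf(F, 1, n - 2)                    # upper-tail F probability
print(f"F = {F:.3f}, p = {p:.4f}")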