
REGRESSION


- Regression refers to the change that occurs in a dependent variable as a result of a
change in an independent variable. It is particularly useful in business statistics, where
it is often necessary to consider how dependent variables respond whenever independent
variables change.
- Most business activities involve a dependent variable and one or more independent
variables. Knowledge of regression therefore enables a business statistician to predict or
estimate the value of a dependent variable when given a value of an independent variable.
For example, in the earlier example of annual incomes and annual expenditures, regression
techniques can be used to estimate the expenditure of a family whose annual income is
known, and vice versa.
- The general equation used in simple regression analysis is
y = a + bx
Where y = Dependent variable
a = Intercept on the y axis (constant)
b = Slope of the line
x = Independent variable
- A regression equation such as the one above is normally determined by a technique
known as "the method of least squares".

Regression equation of y on x i.e. y = a + bx

The following pair of equations, known as the normal equations, is used to determine the
equation of the above regression line from a set of data.
Σy = an + bΣx
Σxy = aΣx + bΣx²
Where Σy = sum of the y values
Σxy = sum of the products of x and y
Σx = sum of the x values
Σx² = sum of the squares of the x values
n = number of pairs of observations
a = the intercept on the y axis
b = the slope (gradient) of the line of y on x
NB: The above regression line is normally used in one way only i.e. it is used to estimate the y
values when the x values are given.
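The two normal equations form a 2 × 2 linear system in a and b, so they can be solved mechanically. The following is a minimal sketch in Python, not part of the original notes; the function name `fit_normal_equations` and the sample data are illustrative assumptions.

```python
def fit_normal_equations(xs, ys):
    """Solve the two normal equations for a and b in y = a + bx.

    Sum(y)  = a*n      + b*Sum(x)
    Sum(xy) = a*Sum(x) + b*Sum(x^2)
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Determinant of the 2x2 system; zero when every x value is the same.
    det = n * sum_x2 - sum_x * sum_x
    if det == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    # Cramer's rule applied to the two normal equations.
    a = (sum_y * sum_x2 - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b


# Hypothetical data lying exactly on the line y = 1 + 2x.
a, b = fit_normal_equations([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Swapping the two arguments fits the line of x on y instead, which is a different line unless the points are perfectly collinear.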
Regression line of x on y i.e. x = a + by
- The fact that a regression line can be used in one direction only leads to what is known
as the regression paradox.
- Regression lines are not ordinary mathematical line graphs that can be read in either
direction to estimate x from y as well as y from x.
- One therefore has to be careful when using regression lines: to estimate x from a given y,
a separate regression equation of x on y must be developed first.

The following example will illustrate how regression lines are used

Example

An investment company advertised the sale of pieces of land at different prices. The following
table shows the pieces of land, their acreage and their costs.

Piece of land   Acreage x (hectares)   Cost y (£000)   xy            x²
A               2.3                    230             529           5.29
B               1.7                    150             255           2.89
C               4.2                    450             1890          17.64
D               3.3                    310             1023          10.89
E               5.2                    550             2860          27.04
F               6.0                    590             3540          36.00
G               7.3                    740             5402          53.29
H               8.4                    850             7140          70.56
J               5.6                    530             2968          31.36
Total           Σx = 44.0              Σy = 4400       Σxy = 25607   Σx² = 254.96

Required
Determine the regression equations of
i. y on x, and hence estimate the cost of a piece of land of 4.5 hectares
ii. x on y, and hence estimate the expected acreage of a piece of land that costs £900,000
The normal equations for y on x are
Σy = an + bΣx
Σxy = aΣx + bΣx²

Substituting the appropriate values in the above equations gives

4400 = 9a + 44b ........ (i)
25607 = 44a + 254.96b ........ (ii)
Multiplying equation (i) by 44 and equation (ii) by 9 gives
193600 = 396a + 1936b ........ (iii)
230463 = 396a + 2294.64b ........ (iv)
Subtracting equation (iii) from equation (iv) gives
36863 = 358.64b
b = 102.78
Substituting for b in (i)
4400 = 9a + 44(102.78)
4400 - 4522.32 = 9a
-122.32 = 9a
a = -13.59
Therefore the equation of the regression line of y on x is
y = -13.59 + 102.78x
When the acreage is 4.5 hectares, the estimated cost is
y = -13.59 + (102.78 × 4.5)
= 448.92
i.e. £448,920
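The hand calculation can be cross-checked with a short script. The sketch below is an illustration rather than part of the original solution: the helper `least_squares` is a hypothetical name, and the x-on-y fit for part (ii) is not worked in the text, so its result is simply whatever the code computes from the tabulated data.

```python
acreage = [2.3, 1.7, 4.2, 3.3, 5.2, 6.0, 7.3, 8.4, 5.6]   # x, hectares
cost = [230, 150, 450, 310, 550, 590, 740, 850, 530]      # y, £000


def least_squares(u, v):
    """Return (intercept, slope) of the regression line of v on u."""
    n = len(u)
    su, sv = sum(u), sum(v)
    suv = sum(p * q for p, q in zip(u, v))
    suu = sum(p * p for p in u)
    slope = (n * suv - su * sv) / (n * suu - su ** 2)
    intercept = (sv - slope * su) / n
    return intercept, slope


# Part (i): y on x, then the estimated cost of a 4.5 hectare plot.
a, b = least_squares(acreage, cost)
cost_at_4_5 = a + b * 4.5

# Part (ii): x on y is fitted separately; the y-on-x line is not inverted.
c, d = least_squares(cost, acreage)
acreage_at_900 = c + d * 900

print(f"y on x: y = {a:.2f} + {b:.2f}x, cost at 4.5 ha = {cost_at_4_5:.2f}")
print(f"x on y: x = {c:.4f} + {d:.6f}y, acreage at 900 = {acreage_at_900:.2f}")
```

Because the script keeps b at full precision, its intercept differs slightly in the second decimal place from the hand calculation, which rounds b before substituting; the estimated cost agrees to the nearest £10.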
Note that where the regression equation is given by
y = a + bx
where a is the intercept on the y axis,
b is the slope of the line (the regression coefficient) and
n is the sample size,
then
Slope b = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)

Intercept a = (Σy - bΣx) / n

Example
The calculations for a sample of n = 10 deliveries are given below, where x is the delivery
distance in miles and y is the delivery time in minutes. The linear regression model is
y = a + bx
Table

Distance x (miles)   Time y (mins)   xy            x²            y²
3.5                  16              56.0          12.25         256
2.4                  13              31.2          5.76          169
4.9                  19              93.1          24.01         361
4.2                  18              75.6          17.64         324
3.0                  12              36.0          9.00          144
1.3                  11              14.3          1.69          121
1.0                  8               8.0           1.00          64
3.0                  14              42.0          9.00          196
1.5                  9               13.5          2.25          81
4.1                  16              65.6          16.81         256
Σx = 28.9            Σy = 136        Σxy = 435.3   Σx² = 99.41   Σy² = 1972

The slope b = (10 × 435.3 - 28.9 × 136) / (10 × 99.41 - 28.9²)

= 422.6 / 158.89

= 2.66

and the intercept a = (136 - 2.66 × 28.9) / 10

= 5.91
We now insert these values in the linear model giving
y = 5.91 + 2.66x
or
Delivery time (mins) = 5.91 + 2.66 (delivery distance in miles)
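As a check on the column totals and the fitted coefficients, the calculation can be repeated directly from the ten observations. This is a minimal sketch; the variable names are illustrative assumptions and the formulas used are the standard least squares expressions for the slope and intercept.

```python
distance = [3.5, 2.4, 4.9, 4.2, 3.0, 1.3, 1.0, 3.0, 1.5, 4.1]  # x, miles
minutes = [16, 13, 19, 18, 12, 11, 8, 14, 9, 16]               # y, minutes

n = len(distance)
sum_x = sum(distance)
sum_y = sum(minutes)
sum_xy = sum(x * y for x, y in zip(distance, minutes))
sum_x2 = sum(x * x for x in distance)

# Least squares slope and intercept for y = a + bx.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(f"Σx={sum_x:.1f} Σy={sum_y} Σxy={sum_xy:.1f} Σx²={sum_x2:.2f}")
print(f"time = {a:.2f} + {b:.2f} × distance")
```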

The slope of the regression line is the estimated number of minutes per mile needed for a
delivery. The intercept is the estimated time to prepare for the journey and to deliver the goods,
that is the time needed for each journey other than the actual traveling time.

PREDICTION WITHIN THE RANGE OF SAMPLE DATA

We can use the linear regression model to predict the mean of the dependent variable for any
given value of the independent variable within the range of the sample data.
For example, if the sample model is given by
Time (min) = 5.91 + 2.66 (distance in miles)
then for a distance of 4.0 miles the estimated mean time is
ŷ = 5.91 + 2.66 × 4.0 = 16.55, i.e. approximately 16.6 minutes
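One way to keep predictions inside the range of the sample data is to make the range check part of the prediction step. The sketch below assumes the delivery model above and takes 1.0 and 4.9 miles (the smallest and largest distances in the sample table) as the limits; the function name `predict_time` and the choice to raise an error are illustrative assumptions.

```python
def predict_time(distance_miles, a=5.91, b=2.66, x_min=1.0, x_max=4.9):
    """Estimated mean delivery time in minutes for a given distance.

    a and b are the fitted coefficients from the delivery example;
    x_min and x_max are the smallest and largest sampled distances.
    """
    if not x_min <= distance_miles <= x_max:
        raise ValueError(
            f"{distance_miles} miles is outside the sampled range "
            f"{x_min} to {x_max} miles"
        )
    return a + b * distance_miles


estimate = predict_time(4.0)
print(f"{estimate:.2f} minutes")
```

A softer alternative would be to return the estimate with a warning instead of raising an error; which is preferable depends on how the forecast is used.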

Multiple Linear Regression Models

There are situations in which more than one factor influences the dependent variable.

Example
The cost of production per week in a large department depends on several factors:
i. Total number of hours worked
ii. Raw material used during the week
iii. Total number of items produced during the week
iv. Number of hours spent on repair and maintenance
It is sensible to use all the identified factors to predict department costs.
A scatter diagram cannot show the relationship between several factors and total costs at once.
The linear model for multiple linear regression (the line of best fit) is of the type
y = α + b1x1 + b2x2 + ... + bnxn
We assume that errors or residuals are negligible.
In order to choose between models, we examine the values of the multiple correlation
coefficient r and the standard deviation of the residuals σ.
A model which describes the relationship between y and the x's well has a multiple correlation
coefficient r close to 1 and a small value of σ.
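The coefficients of a multiple linear regression are found by the same least squares principle, with one normal equation per coefficient. The following sketch builds and solves those equations in plain Python; the function name `fit_multiple` and the data are made up for illustration (the data are generated exactly from y = 1 + 2x1 + 3x2, so the fit should recover those values).

```python
def fit_multiple(rows, ys):
    """Least squares coefficients [a, b1, ..., bk] for y = a + b1*x1 + ... + bk*xk.

    rows is a list of [x1, ..., xk] observations. The normal equations
    (X'X) beta = X'y are built and solved by Gauss-Jordan elimination.
    """
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(design, ys))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Clear this column in every other row.
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 3]]
ys = [1, 3, 4, 6, 8, 12]
coefficients = fit_multiple(rows, ys)
print(coefficients)
```

In practice a statistics package would normally be used for this step, since it also reports the t values and the multiple correlation coefficient needed to compare models.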

Example
Odino Chemicals Limited is aware that its power costs are a semi-variable cost. Over the last
six months these costs have shown the following relationship with a standard measure of output.

Month   Output (standard units)   Total power costs (£000)
1       12                        6.2
2       18                        8.0
3       19                        8.6
4       20                        10.4
5       24                        10.2
6       30                        12.4

Required
i. Using the method of least squares, determine an appropriate linear relationship
between total power costs and output.
ii. If total power costs are related to both output and time (as measured by the number of
the month), the following least squares regression equation is obtained:
Power costs = 4.42 + (0.82) output + (0.10) month
where the regression coefficients (i.e. 0.82 and 0.10) have t values of 2.64 and 0.60
respectively and the coefficient of multiple correlation amounts to 0.976.
Compare the relative merits of this fitted relationship with the one you determined in (i).
Explain (without doing any further analysis) how you might use the data to forecast
total power costs in month seven.
Solution
i.
Output (x)   Power costs (y)   x²           y²            xy
12           6.2               144          38.44         74.40
18           8.0               324          64.00         144.00
19           8.6               361          73.96         163.40
20           10.4              400          108.16        208.00
24           10.2              576          104.04        244.80
30           12.4              900          153.76        372.00
Σx = 123     Σy = 55.8         Σx² = 2705   Σy² = 542.36  Σxy = 1,206.60

b = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)

= (6 × 1,206.60 - 123 × 55.8) / (6 × 2705 - 123²)

= 376.2 / 1101 = 0.342

a = (Σy - bΣx) / n

= (55.8 - 0.342 × 123) / 6

= 2.29
Power costs (£000) = 2.29 + 0.342 (output)
ii. For the linear regression calculated above, the coefficient of correlation r is

r = (nΣxy - ΣxΣy) / √[(nΣx² - (Σx)²)(nΣy² - (Σy)²)]

= 376.2 / √(1101 × 140.52)

= 0.96
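The slope, intercept and coefficient of correlation for the power cost data can all be obtained from the same five sums. The sketch below is an illustrative check using the standard product-moment formula for r; the variable names are assumptions.

```python
from math import sqrt

output = [12, 18, 19, 20, 24, 30]            # x, standard units
power = [6.2, 8.0, 8.6, 10.4, 10.2, 12.4]    # y, £000

n = len(output)
sum_x, sum_y = sum(output), sum(power)
sum_xy = sum(x * y for x, y in zip(output, power))
sum_x2 = sum(x * x for x in output)
sum_y2 = sum(y * y for y in power)

# Building blocks shared by the slope and the correlation coefficient.
sxy = n * sum_xy - sum_x * sum_y
sxx = n * sum_x2 - sum_x ** 2
syy = n * sum_y2 - sum_y ** 2

b = sxy / sxx                     # slope
a = (sum_y - b * sum_x) / n       # intercept
r = sxy / sqrt(sxx * syy)         # coefficient of correlation

print(f"power = {a:.2f} + {b:.3f} × output, r = {r:.2f}")
```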

This shows a strong correlation between power costs and output. The multiple correlation when
both output and time are considered together is 0.976. There has been very little increase in
r, which means that including the time variable does not improve the correlation
significantly.
The t value for the time variable is only 0.60, which is insignificant compared with the
t value of 2.64 for the output variable.
In fact, output and time are themselves highly correlated, so there is no need to include
both variables. Including time does improve the correlation coefficient, but only by a very
small amount.
To forecast, we can use linear regression analysis to find the linear relationship between
output and time, i.e.

Month Output
1 12
2 18
3 19
4 20
5 24
6 30
The values of b and a turn out to be 3.11 and 9.6, i.e. the relationship is of the form
Output = 9.6 + 3.11 × month
The forecast for the 7th month from this equation is
Output = 9.6 + 3.11 × 7
= 9.6 + 21.77
= 31.37 units
Using the equation Power costs = 2.29 + 0.34 × output
Power costs = 2.29 + 0.34 × 31.37
= 2.29 + 10.67
= 12.96, i.e. £12,960
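The two-stage forecast (month to output, then output to power costs) can be sketched as follows. The helper `fit_line` is a hypothetical name, and the script keeps the coefficients at full precision, so its final figure differs slightly from the hand calculation, which rounds b to 3.11 and 0.34 before substituting.

```python
months = [1, 2, 3, 4, 5, 6]
output = [12, 18, 19, 20, 24, 30]            # standard units
power = [6.2, 8.0, 8.6, 10.4, 10.2, 12.4]    # £000


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return intercept, slope


a_out, b_out = fit_line(months, output)      # stage 1: output on month
a_pow, b_pow = fit_line(output, power)       # stage 2: power costs on output

output_month_7 = a_out + b_out * 7
power_month_7 = a_pow + b_pow * output_month_7

print(f"output = {a_out:.2f} + {b_out:.2f} × month")
print(f"forecast output, month 7: {output_month_7:.2f} units")
print(f"forecast power costs, month 7: {power_month_7:.2f} (£000)")
```

Note that month 7 lies outside the six months observed, so this forecast is an extrapolation of the fitted trend.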

Non Linear Relationships


If the scatter diagram and the correlation coefficient do not indicate a linear relationship,
then the relationship may be non-linear.
Two such relationships are of particular interest:
i. the exponential model, y = ab^x
ii. the geometric model, y = ax^b
Both of these can be reduced to a linear model. Simple or multiple linear regression methods are
then used to determine the values of the coefficients.

i. Exponential model

y = ab^x

Take logs of both sides

log y = log a + log b^x
log y = log a + x log b
Let log y = Y, log a = A and log b = B

Thus we get Y = A + Bx. This is a linear regression model.
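The log transformation for the exponential model can be sketched as below: regress Y = log y on x, then convert A and B back to a and b. The function name `fit_exponential`, the use of base-10 logs and the data (generated exactly from y = 2 × 3^x) are illustrative assumptions; all y values must be positive for the logs to exist.

```python
from math import log10


def fit_exponential(xs, ys):
    """Fit y = a * b**x by least squares on Y = log10(y) against x."""
    Y = [log10(y) for y in ys]
    n = len(xs)
    sum_x, sum_Y = sum(xs), sum(Y)
    sum_xY = sum(x * v for x, v in zip(xs, Y))
    sum_x2 = sum(x * x for x in xs)

    # Linear fit Y = A + Bx.
    B = (n * sum_xY - sum_x * sum_Y) / (n * sum_x2 - sum_x ** 2)
    A = (sum_Y - B * sum_x) / n

    # Undo the transformation: a = 10**A, b = 10**B.
    return 10 ** A, 10 ** B


# Made-up data generated exactly from y = 2 * 3**x.
a, b = fit_exponential([0, 1, 2, 3, 4], [2, 6, 18, 54, 162])
print(a, b)
```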

ii. Geometric model

y = ax^b

Using the same technique as above

log y = log a + b log x
Y = A + bX
Where Y = log y
A = log a
X = log x
Using the linear regression technique (the method of least squares), it is possible to calculate
the values of a and b.
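For the geometric model both variables are logged before the linear fit. The following minimal sketch assumes base-10 logs and made-up data generated exactly from y = 5x²; the function name `fit_geometric` is hypothetical, and both x and y must be positive.

```python
from math import log10


def fit_geometric(xs, ys):
    """Fit y = a * x**b by least squares on Y = log10(y) against X = log10(x)."""
    X = [log10(x) for x in xs]
    Y = [log10(y) for y in ys]
    n = len(X)
    sum_X, sum_Y = sum(X), sum(Y)
    sum_XY = sum(u * v for u, v in zip(X, Y))
    sum_X2 = sum(u * u for u in X)

    # Linear fit Y = A + bX; the slope is the exponent b itself.
    b = (n * sum_XY - sum_X * sum_Y) / (n * sum_X2 - sum_X ** 2)
    A = (sum_Y - b * sum_X) / n

    # Only the intercept needs converting back: a = 10**A.
    return 10 ** A, b


# Made-up data generated exactly from y = 5 * x**2.
a, b = fit_geometric([1, 2, 3, 4, 5], [5, 20, 45, 80, 125])
print(a, b)
```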
