
DSC 402


Regression Analysis

DSC 402
BUSINESS STATISTICS-II

UNIT-1
REGRESSION
Introduction:
The concept 'Regression' is a continuation of 'Correlation' which we
learnt in Business Statistics–I.

While correlation establishes a relation between two variables,
regression is the method of estimating (predicting) the value of one
variable for a given value of the other variable. For example, if we
know that advertising expenditure and sales are correlated, we may need
to find the expected amount of sales for a given advertising
expenditure, or the expected expenditure needed to attain a given
amount of sales.

The literal meaning of the word 'regression' is stepping back or
returning to the average value. The term was first used by the British
biometrician Sir Francis Galton in the latter part of the 19th century,
in connection with studies he made on estimating the extent to which
the stature of the sons of tall parents reverts or regresses back to
the mean stature of the population. He studied the relationship
between the heights of about one thousand fathers and sons and
published the results in a paper, 'Regression towards Mediocrity in
Hereditary Stature'. The interesting features of his study were:
I. Tall fathers have tall sons and short fathers have short sons.
II. The average height of the sons of a group of tall fathers is less
than that of the fathers, and the average height of the sons of a
group of short fathers is more than that of the fathers.

In other words, Galton's studies revealed that the offspring of
abnormally tall or short parents tend to revert or step back to the
average height of the population.
He concluded that if the average height of a certain group of fathers
is 'a' cm above (below) the general average height, then the average
height of their sons will be (a × r) cm above (below) the general
average height, where r is the correlation coefficient between the
heights of the given group of fathers and their sons. In this case the
correlation is positive, and since |r| ≤ 1, we have (a × r) ≤ a. This
supports the result in (II) above.

But today, the word regression as used in Statistics has a much wider
perspective without any reference to biometry.

Regression analysis, in the general sense, means the estimation or
prediction of the unknown value of one variable from the known value
of the other variable.

It is one of the most important statistical tools and is used
extensively in almost all sciences: natural, social and physical. It is
especially used in business and economics to study the relationship
between two or more variables that are related causally, and for the
estimation of demand and supply curves, cost functions, production and
consumption functions, etc.

Prediction or estimation is one of the major problems in almost all
spheres of human activity. The estimation or prediction of future
production, consumption, prices, investments, sales, profits, income,
etc. is of paramount importance to a businessman or economist.
Population estimates and population projections are indispensable for
efficient planning of an economy. Pharmaceutical concerns are
interested in studying or estimating the effect of new drugs on patients.

Regression:
Regression analysis is a mathematical measure of the average
relationship between two or more variables in terms of the original
units of the data.

For instance, the yield of a crop depends on the rainfall, the cost or
price of a product depends on the production and advertising
expenditure, the demand for a particular product depends on its price,
expenditure of a person depends on his income, and so on.

Simple Regression:
Regression analysis confined to the study of only two variables at a
time is termed simple regression.
Quite often, however, the values of a particular phenomenon may be
affected by a multiplicity of factors.

Multiple regression:
The regression analysis for studying more than two variables at a time
is known as multiple regression. However, in this chapter we shall
confine ourselves to simple regression only.

In regression analysis there are two types of variables. The variable
whose value is influenced or is to be predicted is called the dependent
variable, and the variable which influences the values or is used for
prediction is called the independent variable.

In regression analysis the independent variable is also known as the
regressor, predictor or explanatory variable, while the dependent
variable is also known as the regressed or explained variable.

LINEAR AND NON-LINEAR REGRESSION


The given bivariate data are plotted on a graph. The points so obtained
on the scatter diagram will more or less concentrate round a curve,
called the 'curve of regression'. Often such a curve is not distinct and
is quite confusing and sometimes complicated too. The mathematical
equation of the regression curve, usually called the regression
equation, enables us to study the average change in the value of the
dependent variable for any given value of the independent variable.

If the regression curve is a straight line, we say that there is linear
regression between the variables under study. The equation of such a
curve is the equation of a straight line, that is, a first degree
equation in the variables x and y. In the case of linear regression,
the value of the dependent variable changes by a constant absolute
amount for a unit change in the value of the independent variable.
However, if the curve of regression is not a straight line, the
regression is termed curved or non-linear regression. The regression
equation will then be a functional relation between x and y involving
terms in x and y of degree higher than one, i.e., involving terms of
the type x², y², xy, etc. In this chapter we shall confine our
discussion to linear regression between two variables only.
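
The difference between a first degree and a higher degree equation can be checked numerically. The sketch below is only an illustration: the coefficients a, b and c are hypothetical values chosen for the example, not taken from any data in this chapter. It compares the change in y for each unit step in x under a linear and a quadratic equation.

```python
# Hypothetical coefficients, chosen only for illustration.
def linear(x, a=2.0, b=3.0):
    """First degree equation: y = a + b*x."""
    return a + b * x


def quadratic(x, a=2.0, b=3.0, c=0.5):
    """Second degree equation: y = a + b*x + c*x**2."""
    return a + b * x + c * x ** 2


def unit_changes(f, xs):
    """Change in f(x) when x increases by one unit, for each x in xs."""
    return [f(x + 1) - f(x) for x in xs]


xs = [0, 1, 2, 3, 4]
linear_steps = unit_changes(linear, xs)        # constant: always equal to b
quadratic_steps = unit_changes(quadratic, xs)  # grows as x grows

print("linear:   ", linear_steps)
print("quadratic:", quadratic_steps)
```

In the linear case every unit step changes y by the same amount b, which is the property described above; in the quadratic case the step depends on where x currently is.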

COMPARISON BETWEEN CORRELATION AND REGRESSION


Correlation and regression are the most common statistical tools. They
are calculated under different assumptions and furnish different types
of information, and it is not always clear which measure should be
used in a given problem situation. The following are the points of
difference between the two.

1. The coefficient of correlation is a measure of the degree of
co-variability between X and Y, whereas the objective of regression
analysis is to study the nature of the relationship between the
variables so that we may be able to predict the value of one on the
basis of the other.
2. Correlation is merely a tool for ascertaining the degree of
relationship between two variables and, therefore, we cannot say
that one variable is the cause and the other the effect. For example,
a high degree of correlation between price and demand for a certain
commodity may not suggest which is the cause and which is the
effect. However, in regression analysis, one variable is taken as
dependent and the other as independent, thus making it possible
to study the cause and effect relationship.
3. In correlation analysis, 'r' is a measure of the direction and degree
of linear relationship between two variables X and Y. rxy and ryx are
symmetric (i.e., rxy = ryx); it is immaterial which of X and Y is the
dependent variable and which is the independent variable. In
regression analysis the regression coefficients bxy and byx are not
symmetric, i.e., in general bxy ≠ byx, and hence it definitely makes a
difference which variable is dependent and which is independent.
4. There may be nonsense correlation between two variables which is
purely due to chance and has no practical relevance, such as
increase in income and increase in weight of a group of people.
However, there is nothing like nonsense regression.
5. The correlation coefficient is independent of change of scale and
origin. Regression coefficients are independent of change of origin
but not of scale.

There is something common to both regression and correlation analysis.
The coefficient of correlation (r) takes the same sign as the
regression coefficients bxy and byx. Also, if the value of b is
significant at a given level of significance, then r is also
significant at that level.
Note:
- A cause and effect relationship between two variables is called
  causation.
- Correlation in the absence of causation is called spurious
  correlation or nonsense correlation.
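
Points 3 and 5 can be checked numerically. The sketch below is an illustration only: it uses the small data set X = 8, 6, 4, 7, 5 and Y = 9, 8, 5, 6, 2 from the exercises, and the helper name `coefficients` is an assumption of this example. It computes r, byx and bxy from deviations about the means, then repeats the calculation after shifting the origin of X and after changing its scale.

```python
from math import sqrt


def coefficients(xs, ys):
    """Return (r, b_yx, b_xy) computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy), sxy / sxx, sxy / syy


X = [8, 6, 4, 7, 5]
Y = [9, 8, 5, 6, 2]

r, b_yx, b_xy = coefficients(X, Y)

# Swapping the roles of X and Y leaves r unchanged (r is symmetric).
r_swapped, _, _ = coefficients(Y, X)

# Change of origin: add 100 to every X value.
r_shift, b_yx_shift, b_xy_shift = coefficients([x + 100 for x in X], Y)

# Change of scale: multiply every X value by 10.
r_scale, b_yx_scale, b_xy_scale = coefficients([10 * x for x in X], Y)

print(f"original : r={r:.4f}  b_yx={b_yx:.2f}  b_xy={b_xy:.2f}")
print(f"origin   : r={r_shift:.4f}  b_yx={b_yx_shift:.2f}  b_xy={b_xy_shift:.2f}")
print(f"scale    : r={r_scale:.4f}  b_yx={b_yx_scale:.2f}  b_xy={b_xy_scale:.2f}")
```

The shift of origin leaves all three quantities unchanged, while the change of scale leaves r unchanged but divides byx by 10 and multiplies bxy by 10. The two regression coefficients also share the sign of r.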

Variables may have either a linear or a non-linear relationship. Two
variables are said to have a linear relationship when a change in the
independent variable (say X) by one unit leads to a constant change in
the dependent variable (Y). When two variables have a linear
relationship, the regression lines can be used to find the values of
the dependent variable.
When we plot two variables (say X and Y) on a scatter diagram and draw
two lines of best fit which pass through the plotted points, these lines
are called regression lines. In linear regression, these lines are
straight. The regression lines are based on two equations, called
regression equations, which give the best estimate of one variable when
the other is exactly known or given.

LINES OF REGRESSION
Line of regression is the line which gives the best estimate of one
variable for any given value of the other variable. In case of two
variables x and y, we shall have two lines of regression; one of y on x
and the other of x on y.
Line of regression of y on x is the line which gives the best estimate for
the value of y for any specified value of x. Similarly, line of regression of
x on y is the line which gives the best estimate for the value of x for any
specified value of y.

The term best fit is interpreted in accordance with the Principle of
Least Squares, which consists in minimising the sum of the squares of
the residuals or the errors of estimates, i.e., the deviations between
the given observed values of the variable and their corresponding
estimated values as given by the line of best fit. We may minimise the
sum of the squares of the errors parallel to the y-axis or parallel to
the x-axis. The former gives the equation of the line of regression of
y on x, and the latter gives the equation of the line of regression of
x on y.
We shall explain below the technique of deriving the equation of the
line of regression of y on x.
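
The Principle of Least Squares can be illustrated numerically. The sketch below is a minimal example under stated assumptions: the five data points are hypothetical, and the helper names `fit` and `sse_vertical` are chosen for this example. It fits the line of y on x with the usual closed-form least-squares solution, confirms that nearby lines have a larger sum of squared vertical errors, and shows that minimising horizontal errors (x on y) gives a different line.

```python
def fit(ps, qs):
    """Least-squares line q = a + b*p (errors measured along the q-axis)."""
    n = len(ps)
    mean_p, mean_q = sum(ps) / n, sum(qs) / n
    spq = sum((p - mean_p) * (q - mean_q) for p, q in zip(ps, qs))
    spp = sum((p - mean_p) ** 2 for p in ps)
    b = spq / spp
    return mean_q - b * mean_p, b


def sse_vertical(a, b, xs, ys):
    """Sum of squared errors parallel to the y-axis for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, used only to illustrate the principle.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

a_yx, b_yx = fit(X, Y)                 # line of y on x: y = a_yx + b_yx * x
best = sse_vertical(a_yx, b_yx, X, Y)

# Any small change to the intercept or slope increases the sum of squares.
nearby = [
    sse_vertical(a_yx + da, b_yx + db, X, Y)
    for da in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]

a_xy, b_xy = fit(Y, X)                 # line of x on y: x = a_xy + b_xy * y

print(f"y on x: y = {a_yx:.2f} + {b_yx:.2f} x, sum of squares = {best:.2f}")
print(f"smallest sum of squares among nearby lines = {min(nearby):.2f}")
print(f"x on y: x = {a_xy:.2f} + {b_xy:.2f} y")
```

Rewritten with y as the subject, the line of x on y in this example has slope 1/b_xy, which differs from b_yx, so the two lines of regression do not coincide.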

Derivation of the equation of the line of regression of Y on X:

For bivariate data on 'x' and 'y', the regression equation obtained with
the assumption that 'x' is dependent on 'y' is called the regression
equation of x on y.
The regression equation of x on y is:
(X - X̄) = bxy (Y - Ȳ)
The regression equation obtained with the assumption that 'y' is
dependent on 'x' is called the regression equation of y on x.
The regression equation of y on x is:
(Y - Ȳ) = byx (X - X̄)
Here, X̄ and Ȳ are the means of X and Y, and the constants 'bxy' and
'byx' are the regression coefficients. The regression equation of x on y
is used for the estimation of x values and the regression equation of
y on x is used for the estimation of y values.
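
The least-squares reasoning behind the equation of y on x can be sketched as follows. This is an outline under stated assumptions rather than a full proof: the line is written y = a + bx, N is the number of pairs, and S is a symbol introduced here for the sum of squared errors parallel to the y-axis.

```latex
\begin{align*}
S(a, b) &= \sum_{i=1}^{N} \left( y_i - a - b x_i \right)^2 \\[4pt]
\frac{\partial S}{\partial a} = -2 \sum \left( y_i - a - b x_i \right) = 0
  \;&\Longrightarrow\; \sum y = N a + b \sum x \\[4pt]
\frac{\partial S}{\partial b} = -2 \sum x_i \left( y_i - a - b x_i \right) = 0
  \;&\Longrightarrow\; \sum xy = a \sum x + b \sum x^2 \\[4pt]
b_{yx} &= \frac{N \sum xy - \sum x \sum y}{N \sum x^2 - \left( \sum x \right)^2},
  \qquad a = \bar{y} - b_{yx}\,\bar{x} \\[4pt]
y - \bar{y} &= b_{yx} \left( x - \bar{x} \right)
\end{align*}
```

The two implications are the normal equations; dividing the first by N gives a = ȳ - b x̄, which shows that the line passes through the point of means. Interchanging the roles of x and y in the same argument gives the equation of x on y with coefficient bxy.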

Example 1: Given the bivariate data:

X   2   6   4   3   2   2   8   4
Y   7   2   1   1   2   3   2   6
a) Fit the regression line of Y on X and hence predict Y if X = 20.
b) Fit the regression line of X on Y and hence predict X if Y = 5.
Solution: In this case both regression lines are needed. Let the
regression line of Y on X be Y = a + bX; it is necessary to find 'a'
and 'b'.
Given X, Y can be estimated and vice versa.
Computation of regression equations
   X       Y       X²      Y²      XY
   2       7        4      49      14
   6       2       36       4      12
   4       1       16       1       4
   3       1        9       1       3
   2       2        4       4       4
   2       3        4       9       6
   8       2       64       4      16
   4       6       16      36      24
ΣX = 31  ΣY = 24  ΣX² = 153  ΣY² = 108  ΣXY = 83
In order to find the values of a and b, two normal equations are to be
solved simultaneously.
ΣY = Na + b ΣX           ... (1)
ΣXY = a ΣX + b ΣX²       ... (2)
Substituting the values, we get
24 = 8a + 31b            ... (1)
83 = 31a + 153b          ... (2)
Multiplying Eq. (1) by 31 and Eq. (2) by 8, we get
744 = 248a + 961b        ... (3)
664 = 248a + 1224b       ... (4)
Subtracting (3) from (4):
248a + 1224b = 664
248a + 961b = 744
263b = -80
b = -0.3 (approx.)
Substituting b = -0.3 in Eq. (1), we get
24 = 8a + 31 (-0.3)
24 = 8a - 9.3
8a = 33.3
a = 4.1625
Thus, the regression of Y on X is
Y = 4.1625 - 0.3X (Ans)
Similarly, we can obtain the second regression equation with the help of
two simultaneous equations. The regression equation of X on Y is
X = a + bY
The two normal equations are
ΣX = Na + b ΣY           ... (1)
ΣXY = a ΣY + b ΣY²       ... (2)
Substituting the values in Eq. (1) and (2):
31 = 8a + 24b            ... (1)
83 = 24a + 108b          ... (2)
Multiplying Eq. (1) by 3 and subtracting Eq. (2) from the result, we get
93 = 24a + 72b
83 = 24a + 108b
10 = -36b
b = -0.28 (approx.)
Putting this in Eq. (1):
8a + (24) (-0.28) = 31
8a - 6.72 = 31
8a = 31 + 6.72
a = 4.715
Thus the regression of X on Y is
X = 4.715 - 0.28Y (Ans)
a) Now let us predict the value of Y if X = 20
Y on X regression equation is
Y = 4.1625 - 0.3 x
= 4.1625 - 0.3 (20)
= - 1.8375 (Ans)
b) Now let us predict X if Y = 5
X = 4.715 - 0.28 (5)
= 4.715 - 1.40
X = 3.315 (Ans)
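
The arithmetic of Example 1 can be cross-checked with a short script. This is a sketch, not part of the original solution: the function name `solve_normal_equations` is an assumption of this example, and it solves the two normal equations in closed form instead of by elimination. Because the hand calculation rounds b before solving for a, the unrounded results differ slightly from the figures above.

```python
def solve_normal_equations(n, sum_p, sum_q, sum_pp, sum_pq):
    """Solve  sum_q  = n*a     + b*sum_p
              sum_pq = a*sum_p + b*sum_pp
    for (a, b), where q is regressed on p (q = a + b*p)."""
    b = (n * sum_pq - sum_p * sum_q) / (n * sum_pp - sum_p ** 2)
    a = (sum_q - b * sum_p) / n
    return a, b


X = [2, 6, 4, 3, 2, 2, 8, 4]
Y = [7, 2, 1, 1, 2, 3, 2, 6]

n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xx = sum(x * x for x in X)
sum_yy = sum(y * y for y in Y)
sum_xy = sum(x * y for x, y in zip(X, Y))

# Regression of Y on X: Y = a_yx + b_yx * X
a_yx, b_yx = solve_normal_equations(n, sum_x, sum_y, sum_xx, sum_xy)

# Regression of X on Y: X = a_xy + b_xy * Y
a_xy, b_xy = solve_normal_equations(n, sum_y, sum_x, sum_yy, sum_xy)

y_at_20 = a_yx + b_yx * 20   # predicted Y when X = 20
x_at_5 = a_xy + b_xy * 5     # predicted X when Y = 5

print(f"Y on X: Y = {a_yx:.4f} + ({b_yx:.4f}) X,  Y(20) = {y_at_20:.4f}")
print(f"X on Y: X = {a_xy:.4f} + ({b_xy:.4f}) Y,  X(5)  = {x_at_5:.4f}")
```

Note that X = 20 lies well outside the observed range of X (2 to 8), which is why the predicted Y is negative; estimates of this kind are extrapolations and should be read with caution.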

Example 2: The following data relate to the ages of husbands and
wives.
Age of husband (yrs)   25  28  30  32  35  36  38  39  42  45
Age of wife (yrs)      20  26  29  30  25  18  26  35  35  46
Obtain the two regression equations and determine the most likely age
of the husband when the wife's age is 25 years. Also, determine the most
likely age of the wife when the husband's age is 30 years.
Solution: Let x denote the age of the husband and y the age of the
wife.
   x     y    u = x - 35   v = y - 30    u²     v²     uv
  25    20       -10          -10       100    100    100
  28    26        -7           -4        49     16     28
  30    29        -5           -1        25      1      5
  32    30        -3            0         9      0      0
  35    25         0           -5         0     25      0
  36    18         1          -12         1    144    -12
  38    26         3           -4         9     16    -12
  39    35         4            5        16     25     20
  42    35         7            5        49     25     35
  45    46        10           16       100    256    160
 350   290         0          -10       358    608    324   (totals)
x̄ = Σx / n = 350 / 10 = 35
ȳ = Σy / n = 290 / 10 = 29
bxy =
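
The remaining steps of Example 2 can be sketched in code. This is a hedged illustration, assuming the standard step-deviation formulas byx = (nΣuv - ΣuΣv) / (nΣu² - (Σu)²) and bxy = (nΣuv - ΣuΣv) / (nΣv² - (Σv)²) with the deviations u = x - 35 and v = y - 30 used in the table; the function names are chosen for this example.

```python
husband = [25, 28, 30, 32, 35, 36, 38, 39, 42, 45]   # x
wife = [20, 26, 29, 30, 25, 18, 26, 35, 35, 46]      # y
n = len(husband)

# Step deviations from the assumed means 35 and 30, as in the table.
u = [x - 35 for x in husband]
v = [y - 30 for y in wife]

sum_u, sum_v = sum(u), sum(v)
sum_uu = sum(a * a for a in u)
sum_vv = sum(b * b for b in v)
sum_uv = sum(a * b for a, b in zip(u, v))

# Regression coefficients are unaffected by the change of origin.
b_yx = (n * sum_uv - sum_u * sum_v) / (n * sum_uu - sum_u ** 2)
b_xy = (n * sum_uv - sum_u * sum_v) / (n * sum_vv - sum_v ** 2)

mean_x = sum(husband) / n
mean_y = sum(wife) / n


def husband_age_for(wife_age):
    """Regression of x on y: x - mean_x = b_xy * (y - mean_y)."""
    return mean_x + b_xy * (wife_age - mean_y)


def wife_age_for(husband_age):
    """Regression of y on x: y - mean_y = b_yx * (x - mean_x)."""
    return mean_y + b_yx * (husband_age - mean_x)


print(f"b_yx = {b_yx:.4f}, b_xy = {b_xy:.4f}")
print(f"husband's age when wife is 25: {husband_age_for(25):.2f}")
print(f"wife's age when husband is 30: {wife_age_for(30):.2f}")
```

The column totals computed by the script agree with the last row of the table, which is a quick check that the deviations were entered correctly.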

EXERCISE:
1. What is regression? Distinguish between regression and
correlation.
2. What are the properties and limitations of regression?
3.
4. Find the regression equation of "y" on "x" from the following data.
X 16 22 18 04 03 10 05 12
Y 87 88 89 68 78 80 75 83

5. Calculate the regression coefficients for the data given below:

X 8 6 4 7 5
Y 9 8 5 6 2
Ans: byx = 1.2, bxy = 0.4

3. Obtain the regression of Y on X for the following data:

Age (yrs) (X)         66   38   56   42   72   36   63   47   55   45
Blood pressure (Y)   145  124  147  125  160  118  149  128  150  124

Estimate the blood pressure of a man whose age is 50 years.

Ans. Y = 79.02 + 1.115X, 134.77

4. Calculate the coefficient of correlation and obtain the lines of
regression for the following data:
Advertisement expenditure (in lakhs) X   10  12  15  23  20
Sales (in crores) Y                      14  17  23  25  21

Ans: byx = 0.7119, bxy = 1.05, r = 0.865

5. Calculate the coefficient of correlation and obtain the lines of
regression for the following data:
Adv. expenditure (in lakhs)   60  62  65  70  73  75  71
Sales (in crs.)               10  11  13  15  16  19  14

Also estimate:
(1) The sales for adv. expenses of 90 lakhs
(2) The adv. expenses for a sales target of 25 crs.
Ans: X (25 crs) = 87.69 lakhs, Y (90 lakhs) = 25.22 crs.
6. Calculate the coefficient of correlation and obtain the lines of
regression for the following data:
X 1 2 3 4 5 6 7
Y 9 8 10 12 11 13 14
Ans: r = 0.929, Y = 0.93x + 7.29, X = 0.93y - 6.21; when X = 6.2, Y = 13.04
7. Find the two regression lines for the data:
X 120 90 80 100 110
Y 40 36 40 45 40
Ans: X on Y, X = 60.59 + 0.980y; Y on X, Y = 36.2 + 0.04x
8. Calculate the regression equation of X on Y from the following data:
X 40 38 35 42 30
Y 30 35 40 36 29
Ans. X = 27.878 + 0.268Y

9. Obtain the regression equations for the following:


X 15 27 27 30 34 38 46
Y 120 140 150 170 180 200 250
Ans. Y on X, Y = 40.62 + 4.2657x; X on Y, X = 0.2189y - 6.8383

10. Find the regression lines for the data :


X 1 2 3 4 5
Y 1 20 17 25 27
Ans. Y on X, Y = 0.9 + 5.7x; X on Y, X = 0.5808 + 0.1344y

11. Obtain the regression equation from the following


X 6 2 10 4 8
Y 9 11 5 8 7
Ans. Y on X, Y = 11.9 - 0.65x: X on Y, X=16.4 - 1.3y

12. Find two regression lines for the data:


X 15 20 16 14 10
Y 35 40 38 32 30
Ans. X on Y, X = -13.8225 + 0.8235y; Y on X, Y = 18.845 + 1.0770x

13. Obtain the equation of the lines of regression for data given below
X 1 2 3 4 5 6 7 8 9
Y 9 8 10 12 11 13 14 16 15
Ans. Y = 7.25 + 0.95x: X = -6.4 + 0.95y

14. The following results were worked out from scores in statistics and
mathematics in a certain examination:
                      Score in statistics   Score in mathematics
Mean                         39.5                   47.5
Standard deviation           10.8                   17.8
Karl Pearson’s correlation coefficient between x and y = +0.42
Find both regression lines. Use these regression lines to estimate the
value of Y for X = 50 and also estimate the value of X for Y = 30.
Ans:
15. A survey was conducted to study the relationship between expenditure on
accommodation (X) and expenditure on food and entertainment (Y) and the
following results were obtained:
                                          Mean        S.D.
Expenditure on accommodation              Rs. 173     63.15
Expenditure on food and entertainment     Rs. 47.8    22.98
Coefficient of correlation = +0.57
Write down the equation of regression of Y on X and estimate the
expenditure on food and entertainment, if the expenditure on
accommodation is Rs. 200.
Ans:
16. In a correlation study the following values are obtained:
X Y
Mean 65 67
Standard deviation 0.8 3.5
Coefficient of correlation 0.8
Find the two regression equations that are associated with the above values.
Ans:
17. Given the following data :
Advertising expenditure (Rs. Lakhs) Sales (Rs. Crores)
Mean 40 90
σ 10 20
Coefficient of correlation = +0.92
I. Calculate the two regression equations.
II. Estimate the likely sales for an advertising expenditure of Rs. 60 lakhs.
III. What should be the advertisement expenditure for attaining sales
target of Rs. 120 crores.
Ans:
18. The following data relate to marks obtained by 250 students in
accountancy and statistics in the B.Com examination of a university:
Subject        Arithmetic mean   Standard deviation
Accountancy          48                  4
Statistics           55                  5
Coefficient of correlation between marks in accountancy and statistics is +0.8.
Draw the two lines of regression and estimate the marks obtained by a
student in statistics who secured 50 marks in accountancy.
Ans:
19. To study the relationship between expenditure on accommodation, Rs. X
and expenditure on food and entertainment, Rs. Y, an enquiry into 50
families gave the following results:
∑ X =8500 , ∑ Y =9600, σx = 60, σy = 20, r = 0.6
Ans:
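
Exercises 14 to 19 supply only means, standard deviations and the coefficient of correlation. One way to build the regression lines from such summaries is sketched below, assuming the standard identities byx = r·σy/σx and bxy = r·σx/σy, which are not stated explicitly in the text above. The function name `lines_from_summary` is hypothetical, and the figures used are those of Exercise 17.

```python
def lines_from_summary(mean_x, mean_y, sd_x, sd_y, r):
    """Build both regression lines from means, standard deviations and r.

    Assumes b_yx = r * sd_y / sd_x and b_xy = r * sd_x / sd_y.
    Returns the two coefficients and two prediction functions.
    """
    b_yx = r * sd_y / sd_x
    b_xy = r * sd_x / sd_y

    def y_on_x(x):
        """y - mean_y = b_yx * (x - mean_x)."""
        return mean_y + b_yx * (x - mean_x)

    def x_on_y(y):
        """x - mean_x = b_xy * (y - mean_y)."""
        return mean_x + b_xy * (y - mean_y)

    return b_yx, b_xy, y_on_x, x_on_y


# Exercise 17: advertising expenditure X (Rs. lakhs), sales Y (Rs. crores).
b_yx, b_xy, predict_sales, predict_advertising = lines_from_summary(
    mean_x=40, mean_y=90, sd_x=10, sd_y=20, r=0.92
)

sales_at_60 = predict_sales(60)                 # likely sales for Rs. 60 lakhs
advertising_for_120 = predict_advertising(120)  # spend for Rs. 120 crores target

print(f"b_yx = {b_yx:.2f}, b_xy = {b_xy:.2f}")
print(f"sales for advertising of 60 lakhs: {sales_at_60:.1f} crores")
print(f"advertising for sales of 120 crores: {advertising_for_120:.1f} lakhs")
```

The same function can be reused for the other summary-statistics exercises by substituting their means, standard deviations and r. The product byx × bxy equals r², which gives a quick consistency check on any pair of coefficients.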
