result is known.
14
—Ovid
I never think of the future,
Regression Analysis
LEARNING OBJECTIVES
14.1 INTRODUCTION
In Chapter 13 we introduced the concept of statistical relationship between two variables
such as: level of sales and amount of advertising; yield of a crop and the amount of
fertilizer used; price of a product and its supply, and so on. The relationship between
such variables indicates the degree and direction of their association, but fails to answer
the following question:
• Is there any functional (or algebraic) relationship between the two variables? If yes,
can it be used to estimate the most likely value of one variable, given the value of
the other variable?
The statistical technique that expresses the relationship between two or more variables
in the form of an equation to estimate the value of a variable, based on the given value of
another variable, is called regression analysis. The variable whose value is estimated using
the algebraic equation is called dependent (or response) variable and the variable whose
value is used to estimate this value is called independent (regressor or predictor) variable. The
linear algebraic equation used for expressing a dependent variable in terms of independent
variable is called linear regression equation.
The term regression was used in 1877 by Sir Francis Galton while studying the
relationship between the heights of fathers and sons. He found that though ‘tall fathers have
tall sons’, if the average height of tall fathers is x above the general height, then the
average height of their sons is only 2x/3 above the general height. Such a fall in the average height
was described by Galton as ‘regression to mediocrity’. However, the theory of Galton is
not universally applicable and the term regression is now applied to other types of variables
in business and economics. In the literal sense, the term regression means
‘moving backward’.
482 BUSINESS S T A T I S T I C S
The basic differences between correlation and regression analysis are summarized as
follows:
1. Developing an algebraic equation between two variables from sample data and
predicting the value of one variable, given the value of the other variable, is referred
to as regression analysis, while measuring the strength (or degree) of the relationship
between two variables is referred to as correlation analysis. The sign of the correlation
coefficient indicates the nature (direct or inverse) of the relationship between two variables,
while the absolute value of the correlation coefficient indicates the extent of the relationship.
2. Correlation analysis determines an association between two variables x and y but not
that they have a cause-and-effect relationship. Regression analysis, in contrast to
correlation, models a presumed cause-and-effect relationship between x and y, that is, a
change in the value of independent variable x is taken to cause a corresponding change (effect)
in the value of dependent variable y if all other factors that affect y remain unchanged.
A fitted regression equation does not, by itself, prove causation.
3. In linear regression analysis one variable is considered as the dependent variable and
the other as the independent variable, while in correlation analysis both variables are
treated alike and neither is designated as dependent.
4. The coefficient of determination r² indicates the proportion of total variance in the dependent
variable that is explained or accounted for by the variation in the independent variable. Since the
value of r² is determined from a sample, its value is subject to sampling error. Even
if the value of r² is high, the assumption of a linear regression may be incorrect
because it may represent a portion of the relationship that actually is in the form of
a curve.
then such a regression model is called a multiple regression model. For example, sales
turnover of a product (a dependent variable) is associated with multiple independent
variables such as price of the product, expenditure on advertisement, quality of the
product, competitors, and so on. Now if we want to estimate possible sales turnover with
respect to only one of these independent variables, then it is an example of a simple
regression model, otherwise multiple regression model is applicable.
Figure 14.1: Straight Line Relationship
The intercept β0 and the slope β1 are unknown regression coefficients. Equation
(14-1) requires the values of β0 and β1 to be computed in order to predict average values of y for a
given value of x. Fig. 14.1 presents a scatter diagram where each pair of values
(xi, yi) represents a point in a two-dimensional coordinate system. Although the mean or
average value of y is a linear function of x, not all values of y fall exactly on the
straight line; rather, they fall around the line.
Since a few points do not fall on the regression line, the values of y are not
exactly equal to the values yielded by the equation E(y|x) = β0 + β1x, also called the line of
means. The deviations of observed y values from the regression line are responsible for
random error (also called residual variation or residual error) in the prediction of y values for
given values of x. In such a situation, it is likely that the variable x does not explain all the
variability of the variable y. For instance, sales volume is related to advertising, but if
other factors related to sales are ignored, then a regression equation to predict the sales
volume (y) by using annual budget of advertising (x) as a predictor will probably involve
some error. Thus for a fixed value of x, the actual value of y is determined by the mean
value function plus a random error term as follows:
$$y = \text{Mean value function} + \text{Deviation} = \beta_0 + \beta_1 x + e = E(y) + e \qquad (14\text{-}2)$$
where e is the observed random error. This equation is also called the simple probabilistic linear
regression model.
The error component e allows each individual value of y to deviate from the line of
means by a small amount. The random errors corresponding to different observations
(xi, yi) for i=1, 2, ..., n are assumed to follow a normal distribution with mean zero and
(unknown) constant standard deviation.
The term e in the expression (14-2) is called the random error because its value,
associated with each value of variable y, is assumed to vary unpredictably. The extent of
this error for a given value of x is measured by the error variance $\sigma_e^2$. The lower the value of
$\sigma_e^2$, the better the fit of the linear regression model to the sample data.
If the line passing through the pair of values of variables x and y is curvilinear, then
the relationship is called nonlinear. A nonlinear relationship implies a varying absolute
change in the dependent variable with respect to changes in the value of the independent
variable. A nonlinear relationship is not very useful for predictions.
In this chapter, we shall discuss methods of simple linear regression analysis involving
single independent variable, whereas those involving two or more independent variables
will be discussed in Chapter 15.
$$\left.\frac{\partial L}{\partial \beta_1}\right|_{b_0,\,b_1} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)\, x_i = 0$$
Remark: The sum of the residuals is zero for any least-squares regression line. Since
∑ yi = ∑ ŷi, it follows that ∑ ei = 0.
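The remark can be checked numerically. The following is a minimal sketch, assuming a small set of made-up (x, y) pairs that do not come from the text; the helper name `fit_line` is hypothetical. It solves the two normal equations for a and b and confirms that the residuals sum to zero up to rounding.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope and intercept obtained by solving the two normal equations
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Illustrative data only (an assumption, not taken from the chapter)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print("a =", a, "b =", b)
print("sum of residuals =", sum(residuals))
```

The same check applies to any data set, because the first normal equation forces the residuals to sum to zero.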
C H A P T E R 14 REGRESSION ANALYSIS 485
Figure 14.2: Graphical Illustration of Assumptions in Regression Analysis
Assumptions
1. The relationship between the dependent variable y and independent variable x exists
and is linear. The average relationship between x and y can be described by a simple
linear regression equation y = a + bx + e, where e is the deviation of a particular
value of y from its expected value for a given value of independent variable x.
2. For every value of the independent variable x, there is an expected (or mean) value of
the dependent variable y, and the values of y are normally distributed about it. The means of
these normal distributions fall on the line of regression.
3. The dependent variable y is a continuous random variable, whereas values of the
independent variable x are fixed values and are not random.
4. The sampling error associated with the expected value of the dependent variable y is
assumed to be an independent random variable distributed normally with mean
zero and constant standard deviation. The errors are not related with each other in
successive observations.
5. The standard deviation and variance of expected values of the dependent variable y
about the regression line are constant for all values of the independent variable x
within the range of the sample data.
6. The value of the dependent variable should not be estimated for a value of the independent
variable lying outside the range of values in the sample data, because the fitted relationship
may not hold there.
The two variables x and y which are correlated can be expressed in terms of each
other in the form of straight line equations called regression equations. Such lines should
be able to provide the best fit of sample data to the population data. The algebraic
expression of regression lines is written as:
• The regression equation of y on x
y = a + bx
is used for estimating the value of y for given values of x.
• Regression equation of x on y
x = c + dy
is used for estimating the value of x for given values of y.
Remarks
1. When variables x and y are correlated perfectly (either positive or negative) these lines
coincide, that is, we have only one line.
2. Higher the degree of correlation, nearer the two regression lines are to each other.
3. Lesser the degree of correlation, more the two regression lines are away from each other.
That is, when r = 0, the two lines are at right angle to each other.
4. Two linear regression lines intersect each other at the point of the average value of variables
x and y.
4. The correlation coefficient will have the same sign (either positive or negative) as that of the
two regression coefficients. For example, if byx = – 0.664 and bxy = – 0.234, then
r = – √(0.664 × 0.234) = – 0.394.
5. The arithmetic mean of regression coefficients bxy and byx is, in absolute value, more than or equal to
the absolute value of the correlation coefficient r, that is, |(byx + bxy)/2| ≥ |r|. For example, if byx = – 0.664 and bxy =
– 0.234, then the arithmetic mean of these two values is (– 0.664 – 0.234)/2 = – 0.449, and
its absolute value 0.449 is more than |r| = 0.394.
6. Regression coefficients are independent of origin but not of scale.
Since the regression lines pass through the point ($\bar{x}$, $\bar{y}$), the mean values of x and
y, the regression equations can be used to find the value of constants a and c as
follows:
$a = \bar{y} - b\bar{x}$ for regression equation of y on x
$c = \bar{x} - d\bar{y}$ for regression equation of x on y
The calculated values of a, b and c, d are substituted in the regression lines y = a + bx
and x = c + dy respectively to determine the exact relationship.
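The properties of the two regression lines listed in the remarks can be illustrated with a short sketch. The data below are made up for illustration and the function name `two_regression_lines` is hypothetical. The sketch computes both regression coefficients, recovers r as the signed square root of their product, and checks that each line passes through the point of means.

```python
from math import sqrt

def two_regression_lines(xs, ys):
    """Return (mean_x, mean_y, b_yx, b_xy) from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sxy / sxx, sxy / syy

# Illustrative data only (an assumption, not taken from the chapter)
xs = [2, 4, 6, 8, 10]
ys = [9, 7, 8, 4, 2]

mean_x, mean_y, b_yx, b_xy = two_regression_lines(xs, ys)
a = mean_y - b_yx * mean_x          # intercept of y on x
c = mean_x - b_xy * mean_y          # intercept of x on y

# r takes the common sign of the two regression coefficients
sign = 1 if b_yx >= 0 else -1
r = sign * sqrt(b_yx * b_xy)

print("y on x: y =", a, "+", b_yx, "x")
print("x on y: x =", c, "+", b_xy, "y")
print("r =", r)
```

With these numbers both coefficients are negative, so r is negative, and the absolute value of their arithmetic mean is at least |r|.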
Example 14.1: Use least squares regression line to estimate the increase in sales revenue
expected from an increase of 7.5 per cent in advertising expenditure.
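where
$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 373 - \frac{40 \times 56}{8} = 93$$
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 524 - \frac{(56)^2}{8} = 132$$
The intercept ‘a’ on the y-axis is calculated as:
$$a = \bar{y} - b\bar{x} = \frac{40}{8} - 0.704 \times \frac{56}{8} = 5 - 0.704 \times 7 = 0.072$$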
Substituting the values of a = 0.072 and b = 0.704 in the regression equation, we get
y = a + bx = 0.072 + 0.704 x
For x = 0.075, we have y = 0.072 + 0.704 (0.075) = 0.1248 or 12.48%.
Example 14.2: The owner of a small garment shop is hopeful that his sales are rising
significantly week by week. Treating the sales for the previous six weeks as a typical
example of this rising trend, he recorded them in Rs 1000’s and analysed the results
Week : 1 2 3 4 5 6
Sales : 2.69 2.62 2.80 2.70 2.75 2.81
Fit a linear regression equation to suggest to him the weekly rate at which his sales
are rising and use this equation to estimate expected sales for the 7th week.
Solution: Assume sales ( y) is dependent on weeks (x). Then the normal equations for
regression equation: y = a + bx are written as:
Σ y = na + bΣ x and Σxy = aΣ x + bΣx2
Calculations for sales during various weeks are shown in Table 14.2.
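As a cross-check on the hand calculation, the two normal equations can be solved directly from the six weekly figures given in the example. This is a minimal sketch using Cramer's rule; the variable names are illustrative and the printed figures are whatever the data yield, not values quoted from Table 14.2.

```python
weeks = [1, 2, 3, 4, 5, 6]
sales = [2.69, 2.62, 2.80, 2.70, 2.75, 2.81]   # in Rs 1000's

n = len(weeks)
sum_x, sum_y = sum(weeks), sum(sales)
sum_xy = sum(x * y for x, y in zip(weeks, sales))
sum_x2 = sum(x * x for x in weeks)

# Normal equations:
#   sum_y  = n * a     + b * sum_x
#   sum_xy = a * sum_x + b * sum_x2
# Solve the 2x2 system by Cramer's rule.
det = n * sum_x2 - sum_x * sum_x
a = (sum_y * sum_x2 - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

forecast_week7 = a + b * 7
print("weekly rate of increase b =", round(b, 4))
print("intercept a =", round(a, 4))
print("estimated sales for week 7 =", round(forecast_week7, 4))
```

The slope b is the weekly rate at which sales are rising, in Rs 1000's per week.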
(a) Deviations Taken from Actual Mean Values of x and y If deviations of actual values of
variables x and y are taken from their mean values $\bar{x}$ and $\bar{y}$, then the regression
equations can be written as:
• Regression equation of y on x
$$y - \bar{y} = b_{yx}(x - \bar{x})$$
where $b_{yx}$ = regression coefficient of y on x. The value of $b_{yx}$ can be calculated using the formula
$$b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
• Regression equation of x on y
$$x - \bar{x} = b_{xy}(y - \bar{y})$$
where $b_{xy}$ = regression coefficient of x on y. The value of $b_{xy}$ can be calculated using the formula
$$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2}$$
(b) Deviations Taken from Assumed Mean Values for x and y If the mean value of either x or y or
both is a fraction, then it is convenient to take deviations of actual values of
variables x and y from their assumed means A and B, that is, $d_x = x - A$ and $d_y = y - B$.
• Regression equation of y on x
$$y - \bar{y} = b_{yx}(x - \bar{x}), \quad\text{where}\quad b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2}$$
• Regression equation of x on y
$$x - \bar{x} = b_{xy}(y - \bar{y}), \quad\text{where}\quad b_{xy} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_y^2 - (\sum d_y)^2}$$
(c) Regression Coefficients in Terms of Correlation Coefficient If deviations are taken from
actual mean values, then the values of regression coefficients can be alternatively
calculated as follows:
$$b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\text{Covariance}(x, y)}{\sigma_x^2} = r \cdot \frac{\sigma_y}{\sigma_x}$$
$$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{\text{Covariance}(x, y)}{\sigma_y^2} = r \cdot \frac{\sigma_x}{\sigma_y}$$
Example 14.3: The following data relate to the scores obtained by 9 salesmen of a company
in an intelligence test and their weekly sales (in Rs 1000’s)
Salesmen : A B C D E F G H I
Test scores : 50 60 50 60 80 50 80 40 70
Weekly sales : 30 60 40 50 60 30 70 50 60
(a) Obtain the regression equation of sales on intelligence test scores of the salesmen.
(b) If the intelligence test score of a salesman is 65, what would be his expected weekly
sales? [HP Univ., MCom, 1996]
Solution: Assume weekly sales (y) as dependent variable and test scores (x) as independent
variable. Calculations for the following regression equation are shown in Table 14.3.
y – y = by x (x – x )
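(a) $\bar{x} = \frac{\sum x}{n} = \frac{540}{9} = 60$; $\bar{y} = \frac{\sum y}{n} = \frac{450}{9} = 50$
$$b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2} = \frac{9 \times 1200 - 0}{9 \times 1600 - 0} = \frac{1200}{1600} = 0.75$$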
Substituting values in the regression equation, we have
y – 50 = 0.75 (x – 60) or y = 5 + 0.75x
For test score x = 65 of salesman, we have
y = 5 + 0.75 (65) = 53.75
Hence we conclude that the weekly sales are expected to be Rs 53,750 (that is, 53.75 in Rs 1000’s) for a
test score of 65.
Example 14.4: A company is introducing a job evaluation scheme in which all jobs are
graded by points for skill, responsibility, and so on. Monthly pay scales (Rs in 1000’s) are
then drawn up according to the number of points allocated and other factors such as
experience and local conditions. To date the company has applied this scheme to 9 jobs:
Job : A B C D E F G H I
Points : 5 25 7 19 10 12 15 28 16
Pay (Rs) : 3.0 5.0 3.25 6.5 5.5 5.6 6.0 7.2 6.1
(a) Find the least squares regression line for linking pay scales to points.
(b) Estimate the monthly pay for a job graded by 20 points.
Solution: Assume monthly pay (y) as the dependent variable and job grade points (x) as
the independent variable. Calculations for the following regression equation are shown
in Table 14.4.
y – y = byx (x – x )
Table 14.4: Calculations for Regression Equation
(a) $\bar{x} = \frac{\sum x}{n} = \frac{137}{9} = 15.22$; $\bar{y} = \frac{\sum y}{n} = \frac{48.15}{9} = 5.35$
Since the mean values $\bar{x}$ and $\bar{y}$ are non-integer values, deviations are taken from
assumed means as shown in Table 14.4.
$$b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2} = \frac{9 \times 65.40 - 2 \times 3.15}{9 \times 484 - (2)^2} = \frac{582.3}{4352} = 0.133$$
Substituting values in the regression equation, we have
$y - \bar{y} = b_{yx}(x - \bar{x})$ or y – 5.35 = 0.133 (x – 15.22), that is, y = 3.326 + 0.133x
(b) For job grade point x = 20, the estimated average pay scale is given by
y = 3.326 + 0.133x = 3.326 + 0.133 (20) = 5.986
Hence, the likely monthly pay for a job with 20 grade points is Rs 5986.
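The assumed-mean shortcut gives the same slope whatever assumed means are chosen, which the following sketch checks on the nine jobs of this example. The assumed means A = 15 and B = 5 are an inference from the sums Σdx = 2 and Σdy = 3.15 used in the solution (Table 14.4 itself is not reproduced here), and the function name is hypothetical.

```python
points = [5, 25, 7, 19, 10, 12, 15, 28, 16]
pay = [3.0, 5.0, 3.25, 6.5, 5.5, 5.6, 6.0, 7.2, 6.1]   # Rs in 1000's

def slope_from_assumed_means(xs, ys, A, B):
    """b_yx computed from deviations dx = x - A and dy = y - B."""
    n = len(xs)
    dx = [x - A for x in xs]
    dy = [y - B for y in ys]
    num = n * sum(p * q for p, q in zip(dx, dy)) - sum(dx) * sum(dy)
    den = n * sum(p * p for p in dx) - sum(dx) ** 2
    return num / den

b_assumed = slope_from_assumed_means(points, pay, 15, 5)   # inferred A, B
b_raw = slope_from_assumed_means(points, pay, 0, 0)        # no shift at all

n = len(points)
mean_x, mean_y = sum(points) / n, sum(pay) / n
a = mean_y - b_assumed * mean_x
estimate_20 = a + b_assumed * 20

print("slope with assumed means:", round(b_assumed, 4))
print("slope from raw values   :", round(b_raw, 4))
print("estimated pay at 20 points:", round(estimate_20, 3))
```

Because the code carries the slope at full precision instead of rounding it to 0.133, the estimate at 20 points differs from the hand-computed 5.986 in the third decimal place.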
Example 14.5: The following data give the ages and blood pressure of 10 women.
Age : 56 42 36 47 49 42 60 72 63 55
Blood pressure : 147 125 118 128 145 140 155 160 149 150
(a) Find the correlation coefficient between age and blood pressure.
(b) Determine the least squares regression equation of blood pressure on age.
(c) Estimate the blood pressure of a woman whose age is 45 years.
[Ranchi Univ. MBA; South Gujarat Univ., MBA, 1997]
Solution: Assume blood pressure (y) as the dependent variable and age (x) as the
independent variable. Calculations for regression equation of blood pressure on age are
shown in Table 14.5.
$\bar{x} = \frac{\sum x}{n} = \frac{522}{10} = 52.2$; $\bar{y} = \frac{\sum y}{n} = \frac{1417}{10} = 141.7$
and
$$b_{yx} = \frac{n\sum d_x d_y - \sum d_x \sum d_y}{n\sum d_x^2 - (\sum d_x)^2} = \frac{10(1115) - 32(-33)}{10(1202) - (32)^2} = \frac{12206}{10996} = 1.11$$
Substituting these values in the above equation, we have
y – 141.7 = 1.11 (x – 52.2) or y = 83.758 + 1.11x
This is the required regression equation of y on x.
(c) For a woman whose age is 45, the estimated average blood pressure will be
y = 83.758 + 1.11(45) = 83.758 + 49.95 = 133.708
Hence, the likely blood pressure of a woman of 45 years is 134.
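Parts (a) to (c) can be reproduced in a few lines. This sketch assumes the ten ages are 56, 42, 36, 47, 49, 42, 60, 72, 63 and 55, which are the values consistent with Σx = 522 and the mean of 52.2 used in the solution. It computes the correlation coefficient directly from the deviations, so the value of r it prints is a computed figure and not one quoted from the text.

```python
from math import sqrt

ages = [56, 42, 36, 47, 49, 42, 60, 72, 63, 55]            # assumed reading of the data
bp = [147, 125, 118, 128, 145, 140, 155, 160, 149, 150]    # blood pressure

n = len(ages)
mean_x, mean_y = sum(ages) / n, sum(bp) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, bp))
sxx = sum((x - mean_x) ** 2 for x in ages)
syy = sum((y - mean_y) ** 2 for y in bp)

r = sxy / sqrt(sxx * syy)          # (a) correlation coefficient
b_yx = sxy / sxx                   # (b) slope of blood pressure on age
a = mean_y - b_yx * mean_x
estimate_45 = a + b_yx * 45        # (c) estimate for age 45

print("r =", round(r, 4))
print("y =", round(a, 3), "+", round(b_yx, 3), "x")
print("estimated blood pressure at age 45 =", round(estimate_45, 1))
```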
Example 14.6: The General Sales Manager of Kiran Enterprises—an enterprise dealing
in the sale of readymade men’s wear—is toying with the idea of increasing his sales to
Rs 80,000. On checking the records of sales during the last 10 years, it was found that
the annual sale proceeds and advertisement expenditure were highly correlated to the
extent of 0.8. It was further noted that the annual average sale has been Rs 45,000 and
annual average advertisement expenditure Rs 30,000, with a variance of Rs 1600 in sales and Rs
625 in advertisement expenditure respectively.
In view of the above, how much expenditure on advertisement would you suggest
the General Sales Manager of the enterprise to incur to meet his target of sales?
[Kurukshetra Univ., MBA, 1998]
Solution: Assume advertisement expenditure (y) as the dependent variable and sales (x)
as the independent variable. Then the regression equation advertisement expenditure
on sales is given by
$$(y - \bar{y}) = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$
Given r = 0.8, σx = 40, σy = 25, $\bar{x}$ = 45,000, $\bar{y}$ = 30,000. Substituting these values
in the above equation, we have
$$(y - 30{,}000) = 0.8 \times \frac{25}{40}\,(x - 45{,}000) = 0.5\,(x - 45{,}000)$$
y = 30,000 + 0.5x – 22,500 = 7500 + 0.5x
When a sales target is fixed at x = 80,000, the estimated amount likely to be spent on
advertisement would be
y = 7500 + 0.5×80,000 = 7500 + 40,000 = Rs 47,500
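When only summary figures are available, the regression line follows from r, the two means and the two standard deviations. The sketch below wraps that formula in a small helper (the name `regression_from_summary` is hypothetical) and applies it to the figures of this example, taking the variances as 1600 for sales and 625 for advertisement expenditure as the solution does.

```python
from math import sqrt

def regression_from_summary(r, mean_x, mean_y, var_x, var_y):
    """Return (intercept, slope) of the regression of y on x."""
    slope = r * sqrt(var_y) / sqrt(var_x)     # b_yx = r * sigma_y / sigma_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# x = sales, y = advertisement expenditure (as assumed in the solution)
intercept, slope = regression_from_summary(
    r=0.8, mean_x=45_000, mean_y=30_000, var_x=1600, var_y=625
)
advertising_needed = intercept + slope * 80_000

print("y =", intercept, "+", slope, "x")
print("advertisement expenditure for sales of 80,000:", advertising_needed)
```

The same helper gives the second regression line if the roles of x and y are swapped in the call.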
Example 14.7: You are given the following information about advertising expenditure
and sales:
$$x - 10 = 0.8 \times \frac{3}{12}\,(y - 90) \quad\text{or}\quad x = -8 + 0.2y$$
Regression equation of y on x is given by
$$(y - \bar{y}) = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$
$$y - 90 = 0.8 \times \frac{12}{3}\,(x - 10) \quad\text{or}\quad y = 58 + 3.2x$$
(b) Substituting x = 15 in the regression equation of y on x, the likely average sales
volume would be
y = 58 + 3.2 (15) = 58 + 48 = 106
Thus the likely sales for advertisement budget of Rs 15 lakh is Rs 106 lakh.
(c) Substituting y = 120 in the regression equation of x on y, the likely advertisement
budget to attain desired sales target of Rs 120 lakh would be
x = – 8 + 0.2 y = – 8 + 0.2 (120) = 16
Hence, the likely advertisement budget of Rs 16 lakh should be sufficient to attain
the sales target of Rs 120 lakh.
Example 14.8: In a partially destroyed laboratory record of an analysis of regression
data, the following results only are legible:
Variance of x = 9
Regression equations : 8x – 10y + 66 = 0 and 40x – 18y = 214
Find on the basis of the above information:
(a) The mean values of x and y,
(b) Coefficient of correlation between x and y, and
(c) Standard deviation of y. [Pune Univ., MBA, 1996; CA May 1999]
Solution: (a) Since two regression lines always intersect at the point ($\bar{x}$, $\bar{y}$) representing
the mean values of the variables involved, we solve the given regression equations to get the mean
values $\bar{x}$ and $\bar{y}$ as shown below:
8x – 10y = – 66
40x – 18y = 214
Multiplying the first equation by 5 and subtracting it from the second, we have
32y = 544 or y = 17, that is, $\bar{y}$ = 17
Substituting the value of y in the first equation, we get
8x – 10(17) = – 66 or x = 13, that is, $\bar{x}$ = 13
(b) To find correlation coefficient r between x and y, we need to determine the
regression coefficients bxy and byx.
Rewriting the given regression equations in such a way that the coefficient of the dependent
variable is less than one in at least one equation:
$$8x - 10y = -66 \quad\text{or}\quad 10y = 66 + 8x \quad\text{or}\quad y = \frac{66}{10} + \frac{8}{10}x$$
That is, byx = 8/10 = 0.80
$$40x - 18y = 214 \quad\text{or}\quad 40x = 214 + 18y \quad\text{or}\quad x = \frac{214}{40} + \frac{18}{40}y$$
That is, bxy = 18/40 = 0.45
Hence the coefficient of correlation r between x and y is given by
$$r = \sqrt{b_{xy} \times b_{yx}} = \sqrt{0.45 \times 0.80} = 0.60$$
(c) To determine the standard deviation of y, consider the formula:
$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} \quad\text{or}\quad \sigma_y = \frac{b_{yx}\,\sigma_x}{r} = \frac{0.80 \times 3}{0.6} = 4$$
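The three steps of this example (intersection of the lines, product of the regression coefficients, standard deviation of y) can be sketched as follows. The tuple representation of a line and the helper name `intersection` are illustrative choices, not notation from the text. The check on the product of the coefficients reflects the requirement that r² cannot exceed 1, which is what decides which equation is treated as y on x.

```python
from math import sqrt

# Each line is stored as (p, q, s), meaning p*x + q*y = s.
line_y_on_x = (8, -10, -66)     # treated as the regression of y on x
line_x_on_y = (40, -18, 214)    # treated as the regression of x on y
var_x = 9

def intersection(l1, l2):
    """Solve the two linear equations by Cramer's rule."""
    (p1, q1, s1), (p2, q2, s2) = l1, l2
    det = p1 * q2 - p2 * q1
    return (s1 * q2 - s2 * q1) / det, (p1 * s2 - p2 * s1) / det

mean_x, mean_y = intersection(line_y_on_x, line_x_on_y)

b_yx = -line_y_on_x[0] / line_y_on_x[1]   # slope after solving for y
b_xy = -line_x_on_y[1] / line_x_on_y[0]   # slope after solving for x
if b_yx * b_xy > 1:
    raise ValueError("r^2 would exceed 1: swap the roles of the two lines")

sign = 1 if b_yx > 0 else -1
r = sign * sqrt(b_yx * b_xy)
sd_y = b_yx * sqrt(var_x) / r             # from b_yx = r * sd_y / sd_x

print("means:", mean_x, mean_y)
print("r =", round(r, 4), " sd_y =", round(sd_y, 4))
```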
Example 14.9: There are two series of index numbers, P for price index and S for stock
of a commodity. The mean and standard deviation of P are 100 and 8 and of S are 103
and 4 respectively. The correlation coefficient between the two series is 0.4. With these
data, work out a linear equation to read off values of P for various values of S. Can the
same equation be used to read off values of S for various values of P?
Solution: The regression equation to read off values of P for various values of S is given by
$$P = a + bS \quad\text{or}\quad (P - \bar{P}) = r\,\frac{\sigma_p}{\sigma_s}\,(S - \bar{S})$$
Given $\bar{P}$ = 100, $\bar{S}$ = 103, σp = 8, σs = 4, r = 0.4. Substituting these values in the
above equation, we have
$$P - 100 = 0.4 \times \frac{8}{4}\,(S - 103) \quad\text{or}\quad P = 17.6 + 0.8S$$
This equation cannot be used to read off values of S for various values of P. Thus to read
off values of S for various values of P we use another regression equation of the form:
$$S = c + dP \quad\text{or}\quad S - \bar{S} = r\,\frac{\sigma_s}{\sigma_p}\,(P - \bar{P})$$
Substituting given values in this equation, we have
$$S - 103 = 0.4 \times \frac{4}{8}\,(P - 100) \quad\text{or}\quad S = 83 + 0.2P$$
Example 14.10: The two regression lines obtained in a correlation analysis of 60
observations are:
5x = 6y + 24 and 1000y = 768x – 3608
What is the correlation coefficient and what is its probable error? Show that the ratio
of the coefficient of variability of x to that of y is 5/24. What is the ratio of variances of x
and y?
Solution: Rewriting the regression equations
$$5x = 6y + 24 \quad\text{or}\quad x = \frac{6}{5}y + \frac{24}{5}$$
That is, bxy = 6/5
$$1000y = 768x - 3608 \quad\text{or}\quad y = \frac{768}{1000}x - \frac{3608}{1000}$$
That is, byx = 768/1000
We know that $b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{6}{5}$ and $b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = \frac{768}{1000}$, therefore
$$b_{xy}\, b_{yx} = r^2 = \frac{6}{5} \times \frac{768}{1000} = 0.9216$$
Hence $r = \sqrt{0.9216} = 0.96$.
Since both bxy and byx are positive, the correlation coefficient is positive and hence
r = 0.96.
$$\text{Probable error of } r = 0.6745\,\frac{1 - r^2}{\sqrt{n}} = 0.6745\,\frac{1 - (0.96)^2}{\sqrt{60}} = \frac{0.0528}{7.7459} = 0.0068$$
Solving the given regression equations for x and y, we get $\bar{x}$ = 6 and $\bar{y}$ = 1 because
regression lines pass through the point ($\bar{x}$, $\bar{y}$).
$$\text{Since } r\,\frac{\sigma_x}{\sigma_y} = \frac{6}{5}, \quad 0.96\,\frac{\sigma_x}{\sigma_y} = \frac{6}{5} \quad\text{or}\quad \frac{\sigma_x}{\sigma_y} = \frac{6}{5 \times 0.96} = \frac{5}{4}$$
$$\text{Also the ratio of the coefficient of variability} = \frac{\sigma_x / \bar{x}}{\sigma_y / \bar{y}} = \frac{\bar{y}}{\bar{x}} \cdot \frac{\sigma_x}{\sigma_y} = \frac{1}{6} \times \frac{5}{4} = \frac{5}{24}.$$
$$b_{xy} = \frac{n\sum f d_x d_y - \sum f d_x \sum f d_y}{n\sum f d_y^2 - (\sum f d_y)^2} \times \frac{h}{k}$$
$$b_{yx} = \frac{n\sum f d_x d_y - \sum f d_x \sum f d_y}{n\sum f d_x^2 - (\sum f d_x)^2} \times \frac{k}{h}$$
where h = width of the class interval of sample data on x variable
k = width of the class interval of sample data on y variable
Example 14.11: The following bivariate frequency distribution relates to sales turnover
(Rs in lakh) and money spent on advertising (Rs in 1000’s). Obtain the two regression
equations
Advertising Budget (y) across the columns, Sales (x) down the rows. Each cell shows the frequency f, followed in parentheses by the cell value of f·dx·dy.

Sales (x)   m.v.   dx   | y: 50–60    60–70     70–80    80–90   |  f       fdx         fdx²          fdxdy
                        | m.v.: 55     65        75       85     |
                        | dy:  –2      –1         0        1     |
20–50        35    –1   |   2 (4)     1 (1)     2 (0)    5 (–5)  |  10      –10          10             0
50–80        65     0   |   3 (0)     4 (0)     7 (0)    6 (0)   |  20        0           0             0
80–110       95     1   |   1 (–2)    5 (–5)    8 (0)    6 (6)   |  20       20          20            –1
110–140     125     2   |   2 (–8)    7 (–14)   9 (0)    2 (4)   |  20       40          80           –18
f                       |   8        17        26       19       |  n = 70  Σfdx = 50   Σfdx² = 110   Σfdxdy = –19
fdy                     | –16       –17         0       19       |  Σfdy = –14
fdy²                    |  32        17         0       19       |  Σfdy² = 68
fdxdy                   |  –6       –18         0        5       |  Σfdxdy = –19
Similarly, the regression equation for estimating the advertising budget (y) on sales
turnover of Rs 200 lakh is written as:
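$$y - \bar{y} = b_{yx}(x - \bar{x})$$
$$\text{where}\quad b_{yx} = \frac{n\sum f d_x d_y - (\sum f d_x)(\sum f d_y)}{n\sum f d_x^2 - (\sum f d_x)^2} \times \frac{k}{h}$$
The calculations for regression coefficients bxy and byx are shown in Table 14.6.
$$\bar{x} = A + \frac{\sum f d_x}{n} \times h = 65 + \frac{50}{70} \times 30 = 65 + 21.428 = 86.428$$
$$\bar{y} = B + \frac{\sum f d_y}{n} \times k = 75 - \frac{14}{70} \times 10 = 75 - 2 = 73$$
$$b_{xy} = \frac{n\sum f d_x d_y - (\sum f d_x)(\sum f d_y)}{n\sum f d_y^2 - (\sum f d_y)^2} \times \frac{h}{k} = \frac{70 \times (-19) - (50)(-14)}{70 \times 68 - (-14)^2} \times \frac{30}{10}$$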
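The grouped-data calculation can be sketched directly from the bivariate frequency table. The cell frequencies below are as read from the table of this example (rows are sales classes, columns are advertising classes), with step deviations dx = –1, 0, 1, 2 and dy = –2, –1, 0, 1, assumed means A = 65 and B = 75, and class widths h = 30 and k = 10. The variable names are illustrative, and the values printed for bxy and byx are computed by the code and not quoted from the text.

```python
# Rows: sales classes 20-50, 50-80, 80-110, 110-140
# Columns: advertising classes 50-60, 60-70, 70-80, 80-90
freq = [
    [2, 1, 2, 5],
    [3, 4, 7, 6],
    [1, 5, 8, 6],
    [2, 7, 9, 2],
]
dx = [-1, 0, 1, 2]      # step deviations of the sales mid-values
dy = [-2, -1, 0, 1]     # step deviations of the advertising mid-values
A, B = 65, 75           # assumed means
h, k = 30, 10           # class widths of x and y

rows, cols = range(len(dx)), range(len(dy))
row_tot = [sum(freq[i]) for i in rows]
col_tot = [sum(freq[i][j] for i in rows) for j in cols]
n = sum(row_tot)

s_fdx = sum(row_tot[i] * dx[i] for i in rows)
s_fdx2 = sum(row_tot[i] * dx[i] ** 2 for i in rows)
s_fdy = sum(col_tot[j] * dy[j] for j in cols)
s_fdy2 = sum(col_tot[j] * dy[j] ** 2 for j in cols)
s_fdxdy = sum(freq[i][j] * dx[i] * dy[j] for i in rows for j in cols)

mean_x = A + s_fdx / n * h
mean_y = B + s_fdy / n * k

num = n * s_fdxdy - s_fdx * s_fdy
b_xy = num / (n * s_fdy2 - s_fdy ** 2) * h / k
b_yx = num / (n * s_fdx2 - s_fdx ** 2) * k / h

print("n =", n, " means:", round(mean_x, 3), round(mean_y, 3))
print("b_xy =", round(b_xy, 4), " b_yx =", round(b_yx, 4))
```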
Self-Practice Problems 14A
14.1 The following calculations have been made for prices of twelve stocks (x) at the Calcutta Stock Exchange on a
certain day along with the volume of sales in thousands of shares (y). From these calculations find the regression
equation of price of stocks on the volume of sales of shares.
Σ x = 580, Σ y = 370, Σ xy = 11494, Σ x² = 41658, Σ y² = 17206.
[Rajasthan Univ., MCom, 1995]
14.2 A survey was conducted to study the relationship between expenditure (in Rs) on accommodation (x) and
expenditure on food and entertainment (y) and the following results were obtained:
                                          Mean     Standard Deviation
• Expenditure on accommodation            173.1    63.15
• Expenditure on food and entertainment   47.8     22.98
• Coefficient of correlation r = 0.57
Write down the regression equation and estimate the expenditure on food and entertainment if the
expenditure on accommodation is Rs 200.
[Bangalore Univ., BCom, 1998]
14.3 The following data give the experience of machine operators and their performance ratings given by the
number of good parts turned out per 100 pieces:
Operator : 1 2 3 4 5 6 7 8
Experience (x) : 16 12 18 4 3 10 5 12
Performance ratings (y) : 87 88 89 68 78 80 75 83
Calculate the regression lines of performance ratings on experience and estimate the probable performance
if an operator has 7 years experience.
[Jammu Univ., MCom; Lucknow Univ., MBA, 1996]
14.4 A study of prices of a certain commodity at Delhi and Mumbai yield the following data:
$\bar{x}$ = Σ x/n = 45/8 = 5.625; $\bar{y}$ = Σ y/n = 297/8 = 37.125
$$b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2} = \frac{8 \times 238 - (-3)(1)}{8 \times 57 - (-3)^2} = 4.266$$
where dx = x – 6, dy = y – 37
Regression equation of annual profit on R&D expenditure
$y - \bar{y} = b_{yx}(x - \bar{x})$
y – 37.125 = 4.26 (x – 5.625)
or y = 13.163 + 4.266x
For x = Rs 1,00,000 as R&D expenditure, we have from the above equation y = Rs 439.763 as annual profit.

14.7 Let sales revenue and advertising expenditure be denoted by x and y respectively
$$\bar{x} = A + \frac{\sum f d_x}{n} \times h = 150 + \frac{12}{66} \times 50 = 159.09$$
$$\bar{y} = B + \frac{\sum f d_y}{n} \times k = 30 - \frac{26}{66} \times 10 = 26.06$$
$$b_{xy} = \frac{n\sum f d_x d_y - (\sum f d_x)(\sum f d_y)}{n\sum f d_y^2 - (\sum f d_y)^2} \times \frac{h}{k} = \frac{66(-14) - 12(-26)}{66(100) - (-26)^2} \times \frac{50}{10} = -0.516$$
(a) Regression equation of x on y
$x - \bar{x} = b_{xy}(y - \bar{y})$
x – 159.09 = – 0.516 (y – 26.06)
or x = 172.536 – 0.516y
For y = 50, x = 146.736
(b) Regression equation of y on x
$$b_{yx} = \frac{n\sum f d_x d_y - (\sum f d_x)(\sum f d_y)}{n\sum f d_x^2 - (\sum f d_x)^2} \times \frac{k}{h} = \frac{66(-14) - 12(-26)}{66(70) - (12)^2} \times \frac{10}{50} = -0.027$$
$y - \bar{y} = b_{yx}(x - \bar{x})$
y – 26.06 = – 0.027 (x – 159.09)
y = 30.355 – 0.027x
For x = 300, y = 22.255
(c) $r = -\sqrt{b_{xy} \times b_{yx}} = -\sqrt{0.516 \times 0.027} = -0.1180$

14.8 Let test score and production rating be denoted by x and y respectively.
$\bar{x}$ = Σ x/n = 612/10 = 61.2; $\bar{y}$ = Σ y/n = 622/10 = 62.2
$$b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2} = \frac{10 \times 3213 - 2 \times 2}{10 \times 3554 - (2)^2} = 0.904$$
Regression equation of production rating (y) on test score (x) is given by
$y - \bar{y} = b_{yx}(x - \bar{x})$
y – 62.2 = 0.904 (x – 61.2)
y = 6.876 + 0.904x

14.9 Let production and capacity utilization be denoted by x and y, respectively.
(a) Regression equation of capacity utilization (y) on production (x)
$$y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$
$$y - 84.8 = 0.62 \times \frac{8.5}{10.5}\,(x - 35.6)$$
y = 66.9324 + 0.5019x
(b) Regression equation of production (x) on capacity utilization (y)
$$x - \bar{x} = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y})$$
$$x - 35.6 = 0.62 \times \frac{10.5}{8.5}\,(y - 84.8)$$
x = – 29.3483 + 0.7659y
When y = 70, x = – 29.3483 + 0.7659(70) = 24.2647
Hence the estimated production is 2,42,647 units when the capacity utilization is 70 per cent.

14.10 $\bar{x}$ = Σ x/n = 270/8 = 33.75; $\bar{y}$ = Σ y/n = 400/8 = 50
$$b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2} = \frac{8 \times 4800 - 6 \times 0}{8 \times 3592 - (6)^2} = 1.338$$
where dx = x – 33 and dy = y – 50
Regression equation of y on x
$y - \bar{y} = b_{yx}(x - \bar{x})$
y – 50 = 1.338 (x – 33.75)
y = 4.84 + 1.338x
For x = 10, y = 18.22

14.11 Let intelligence test score be denoted by x and weekly sales by y
$\bar{x}$ = 540/9 = 60; $\bar{y}$ = 450/9 = 50
$$b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2} = \frac{9 \times 1200}{9 \times 1600} = 0.75$$
Regression equation of y on x:
$y - \bar{y} = b_{yx}(x - \bar{x})$
y – 50 = 0.75 (x – 60)
y = 5 + 0.75x
For x = 65, y = 5 + 0.75 (65) = 53.75

14.12 (a) Solving two regression lines:
3x + 2y = 26 and 6x + y = 31
we get mean values as $\bar{x}$ = 4 and $\bar{y}$ = 7
(b) Rewriting regression lines as follows:
3x + 2y = 26 or y = 13 – (3/2)x, so byx = – 3/2
6x + y = 31 or x = 31/6 – (1/6)y, so bxy = – 1/6
Correlation coefficient,
$r = -\sqrt{b_{xy} \times b_{yx}} = -\sqrt{(3/2)(1/6)} = -0.5$
Given, Var(x) = 25, so σx = 5. Calculate σy using the formula:
$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} \quad\text{or}\quad -\frac{3}{2} = -0.5 \times \frac{\sigma_y}{5} \quad\text{or}\quad \sigma_y = 15$$

14.13 The regression equation of y on x is stated as:
$$y - \bar{y} = b_{yx}(x - \bar{x}) = r \cdot \frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$
Given, $\bar{x}$ = 53.20; $\bar{y}$ = 27.90, byx = – 1.5; bxy = – 0.2
Thus y – 27.90 = – 1.5(x – 53.20)
or y = 107.70 – 1.5x
For x = 60, we have y = 107.70 – 1.5(60) = 17.7
Also $r = -\sqrt{b_{yx} \times b_{xy}} = -\sqrt{1.5 \times 0.2} = -0.5477$

14.14 Let advertising expenditure and sales be denoted by x and y respectively.
$\bar{x}$ = Σ x/n = 217/8 = 27.125; $\bar{y}$ = Σ y/n = 58.1/8 = 7.26
$$b_{yx} = \frac{n\sum d_x d_y - (\sum d_x)(\sum d_y)}{n\sum d_x^2 - (\sum d_x)^2} = \frac{8(172.2) - (25)(2.1)}{8(1403) - (25)^2} = \frac{1325.1}{10599} = 0.125$$
Thus regression equation of y on x is:
$y - \bar{y} = b_{yx}(x - \bar{x})$
or y – 7.26 = 0.125(x – 27.125)
y = 3.869 + 0.125x
When x = 60, the estimated value of y = 3.869 + 0.125(60) = 11.369
$$S_{yx}^2 \ \text{or}\ \hat{\sigma}_e^2 = \frac{\sum e_i^2}{n-2} = \frac{\sum (y_i - \hat{y}_i)^2}{n-2} = \frac{SSE}{n-2}$$
The denominator, n – 2 represents the error or residual degrees of freedom and is determined
by subtracting from sample size n the number of parameters β0 and β1 that are estimated
by the sample parameters a and b in the least squares equation. The subscript ‘yx’ indicates
that the standard deviation is of dependent variable y, given (or conditional) upon
independent variable x.
The standard error of estimate Syx, also called the standard deviation of the error term e, measures
the variability of the observed values around the regression line, i.e. the amount
by which the estimated ŷ values are away from the sample y values (dot points). In other words, Syx
is based on the deviations of the sample observations of y-values from the least squares
line or the estimated regression line of ŷ values. The standard deviation of error about
the least squares line is defined as:
$$S_{yx}\ \text{or}\ \hat{\sigma}_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}} = \sqrt{\frac{SSE}{n-2}} \qquad (14\text{-}4)$$
Figure 14.3: Residuals
$$S_{yx} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}} = \sqrt{\frac{\sum y^2 - a\sum y - b\sum xy}{n-2}}$$
The variance S²yx measures how well the least squares line fits the sample y-values. A large
variance and standard error of estimate indicate a large amount of scatter or dispersion
of dot points around the line. The smaller the value of Syx, the closer the dot points (y-values)
fall around the regression line, the better the line fits the data, and the better it describes the
average relationship between the two variables. When all dot points fall on the line, the
value of Syx is zero, and the relationship between the two variables is perfect.
A smaller variance about the regression line is considered useful in predicting the
value of a dependent variable y. In actual practice, some variability is always left over
about the regression line. It is important to measure such variability due to the following
reasons:
(i) This value provides a way to determine the usefulness of the regression line in
predicting the value of the dependent variable.
(ii) This value can be used to construct interval estimates of the dependent variable.
(iii) Statistical inferences can be made about other components of the problem.
Figure 14.4 displays the distribution of conditional average values of y about a least
squares regression line for given values of independent variable x. Suppose the amount
of deviation in the values of y given any particular value of x follows a normal distribution.
Since the average value of y changes with the value of x, we have different normal distributions
of y-values for every value of x, each having the same standard deviation. When a relationship
between two variables x and y exists, the standard deviation (also called standard error of
estimate) is less than the standard deviation of all the y-values in the population computed
about their mean.
Based on the assumptions of regression analysis, we can describe sampling properties
of the sample estimates such as a, b, and Syx, as these vary from sample to sample. Such
knowledge is useful in making statistical inferences about the relationship between the
two variables x and y.
Figure 14.4: Regression Line Showing the Error Variance
The standard error of estimate can also be used to determine an approximate interval
estimate based on sample data (n < 30) for the value of the dependent variable y for a
given value of the independent variable x as follows:
Approximate interval estimate = ŷ ± tdf Syx
where value of t is obtained using t-distribution table based upon a chosen probability
level. The interval estimate is also called a prediction interval.
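The approximate interval estimate can be sketched in a few lines of Python. The function name and its arguments are illustrative only; the t-value is assumed to be read from a t-distribution table (here 2.228 for df = n − 2 = 10 at the 95 per cent level), and the inputs ŷ = 31.37 and Syx = 2.56 are the figures quoted in the answer to Problem 14.27.

```python
def approximate_prediction_interval(y_hat, t_value, s_yx):
    """Return (lower, upper) limits of y_hat ± t_value * s_yx.

    t_value is assumed to come from a t-table for the chosen
    probability level and df = n - 2; it is not computed here.
    """
    margin = t_value * s_yx
    return y_hat - margin, y_hat + margin


# Illustrative inputs taken from the answer to Problem 14.27.
lower, upper = approximate_prediction_interval(31.37, 2.228, 2.56)
print(f"Approximate 95 per cent prediction interval: {lower:.2f} to {upper:.2f}")
```

With these inputs the limits agree with the 25.67 to 37.07 quoted in that answer.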
Example 14.12: The following data relate to advertising expenditure (Rs in lakh) and
their corresponding sales (Rs in crore)
Advertising expenditure : 10 12 15 23 20
Sales : 14 17 23 25 21
(a) Find the equation of the least squares line fitting the data.
(b) Estimate the value of sales corresponding to advertising expenditure of Rs 30 lakh.
(c) Calculate the standard error of estimate of sales on advertising expenditure.
Solution: Let the advertising expenditure be denoted by x and sales by y.
(a) The calculations for the least squares line are shown in Table 14.7
residual is equal to the actual value minus fitted value. The residuals indicate how well
the least squares line fits the actual data values.
(b) The least squares equation obtained in part (a) may be used to estimate the sales
turnover corresponding to the advertising expenditure of Rs 30 lakh as:
ŷ = 8.608 + 0.712x = 8.608 + 0.712 (30) = Rs 29.968 crore
(c) Calculations for standard error of estimate Sy⋅x of sales (y) on advertising expenditure
(x) are shown in Table 14.9.
           x       y      y²      xy
          10      14     196     140
          12      17     289     204
          15      23     529     345
          23      25     625     575
          20      21     441     420
Total     80     100    2080    1684
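The calculations of Example 14.12 can be reproduced with the following minimal Python sketch. The variable names are illustrative, and the slope and intercept use the usual least squares formulas in terms of the column totals. Because the code keeps unrounded coefficients, its results differ slightly in the third decimal place from the rounded figures 8.608 and 29.968 used in the text.

```python
from math import sqrt

# Data of Example 14.12
x = [10, 12, 15, 23, 20]   # advertising expenditure (Rs in lakh)
y = [14, 17, 23, 25, 21]   # sales (Rs in crore)
n = len(x)

# Column totals, as in the calculation table
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# (a) Least squares slope b and intercept a
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# (b) Estimated sales for advertising expenditure of Rs 30 lakh
y_hat_30 = a + b * 30

# (c) Standard error of estimate, computational form
s_yx = sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

print(f"y-hat = {a:.3f} + {b:.3f} x")
print(f"Estimated sales at x = 30: {y_hat_30:.3f}")
print(f"Standard error of estimate: {s_yx:.3f}")
```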
where SST = total sum of square deviations (or total variance) of sampled response
variable y-values from the mean value of y.
\[ SST = S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n(\bar{y})^2 \]
SSE = sum of squares of error or unexplained variation in response variable
y-values from the least squares line due to sampling errors, i.e. it measures
the residual variation in the data that is not explained by predictor
variable x
\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} y_i^2 - a \sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i y_i \]
SSR = sum of squares of regression or explained variation in the sample values of
response variable y that is accounted for or explained by variation among
x-values
\[ SSR = SST - SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = a \sum_{i=1}^{n} y_i + b \sum_{i=1}^{n} x_i y_i - n(\bar{y})^2 \]
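The decomposition SST = SSR + SSE can be verified numerically. The sketch below is only an illustration: it reuses the advertising and sales data of Example 14.12, fits the least squares line, and computes the three sums of squares directly from their definitions. All variable names are hypothetical.

```python
# Data of Example 14.12: advertising expenditure (x) and sales (y)
x = [10, 12, 15, 23, 20]
y = [14, 17, 23, 25, 21]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # unexplained variation
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)            # explained variation

print(f"SST = {sst:.3f}, SSE = {sse:.3f}, SSR = {ssr:.3f}")
print("SST equals SSR + SSE:", abs(sst - (ssr + sse)) < 1e-9)
```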
The three variations associated with the regression analysis of a data set are shown in
Fig 14.5. Thus
\[ r^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} = 1 - \frac{S_{yx}^2}{S_y^2}; \qquad S_{y \cdot x} = S_y \sqrt{1 - r^2} \]
where
\[ \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} = \text{fraction of the total variation that is not explained (unaccounted for) by the regression line} \]
\[ S_{y \cdot x}^2 = \frac{\sum (y - \hat{y})^2}{n-2}, \ \text{variance of response variable y-values from the least squares line} \]
\[ S_y^2 = \frac{1}{n-2} \sum (y - \bar{y})^2, \ \text{total variance of response variable y-values} \]
Figure 14.5: Relationship Between Three Types of Variations
Since this formula of r² is not convenient to use, an easier formula for the
sample coefficient of determination is given by
\[ r^2 = \frac{a \sum y + b \sum xy - n(\bar{y})^2}{\sum y^2 - n(\bar{y})^2} \qquad \leftarrow \text{Short-cut method} \]
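As a rough numerical check of the short-cut method, the following Python sketch computes r² from the column totals and then confirms that Sy·x = Sy √(1 − r²) gives the same standard error of estimate as the direct formula. The data are those of Example 14.12 and are used here only for illustration; the variable names are assumptions, and Sy is computed with the divisor n − 2 as defined above.

```python
from math import sqrt

# Data of Example 14.12: advertising expenditure (x) and sales (y)
x = [10, 12, 15, 23, 20]
y = [14, 17, 23, 25, 21]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
y_bar = sum_y / n

# Least squares coefficients
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = y_bar - b * sum_x / n

# Short-cut method for the coefficient of determination
r2 = (a * sum_y + b * sum_xy - n * y_bar ** 2) / (sum_y2 - n * y_bar ** 2)

# Standard error of estimate by two routes
s_y = sqrt((sum_y2 - n * y_bar ** 2) / (n - 2))
s_yx_from_r2 = s_y * sqrt(1 - r2)
s_yx_direct = sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

print(f"r-squared = {r2:.4f}")
print(f"S_yx via r-squared = {s_yx_from_r2:.4f}, direct = {s_yx_direct:.4f}")
```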
For example, the coefficient of determination that indicates the extent of relationship
between sales revenue (y) and advertising expenditure (x) is calculated as follows from
Example 14.1:
                           Correlation                 Regression
• Measurement level        Interval or ratio scale     Interval or ratio scale
• Nature of variables      Both continuous, and        Both continuous, and
                           linearly related            linearly related
• x – y relationship       x and y are symmetric       y is dependent, x is independent;
                                                       regression of x on y differs from
                                                       y on x
• Correlation              rxy = ryx                   Correlation between x and y is the
                                                       same as the correlation between
                                                       y and x
• Coefficient of           Explains common             Proportion of variability of x
  determination            variance of x and y         explained by its least-squares
                                                       regression on y
Conceptual Questions 14A
1. (a) Explain the concept of regression and point out its usefulness in dealing with
       business problems. [Delhi Univ., MBA, 1993]
   (b) Distinguish between correlation and regression. Also point out the properties of
       regression coefficients.
2. Explain the concept of regression and point out its importance in business forecasting.
   [Delhi Univ., MBA, 1990, 1998]
3. Under what conditions can there be one regression line? Explain. [HP Univ., MBA, 1996]
4. Why should a residual analysis always be done as part of the development of a regression
   model?
5. What are the assumptions of simple linear regression analysis and how can they be evaluated?
6. What is the meaning of the standard error of estimate?
7. What is the interpretation of y-intercept and the slope in a regression model?
8. What are regression lines? With the help of an example illustrate how they help in business
   decision-making. [Delhi Univ., MBA, 1998]
9. Point out the role of regression analysis in business decision-making. What are the
   important properties of regression coefficients?
   [Osmania Univ., MBA; Delhi Univ., MBA, 1999]
10. (a) Distinguish between correlation and regression analysis.
        [Dipl in Mgt., AIMA, Osmania Univ., MBA, 1998]
    (b) The coefficient of correlation and coefficient of determination are available as
        measures of association in correlation analysis. Describe the different uses of
        these two measures of association.
11. What are regression coefficients? State some of the important properties of regression
    coefficients. [Dipl in Mgt., AIMA, Osmania Univ., MBA, 1989]
12. What is regression? How is this concept useful to business forecasting?
    [Jodhpur Univ., MBA, 1999]
13. What is the difference between a prediction interval and a confidence interval in
    regression analysis?
14. Explain what is required to establish evidence of a cause-and-effect relationship between
    y and x with regression analysis.
15. What technique is used initially to identify the kind of regression model that may be
    appropriate?
16. (a) What are regression lines? Why is it necessary to consider two lines of regression?
    (b) In case the two regression lines are identical, prove that the correlation coefficient
        is either + 1 or – 1. If two variables are independent, show that the two regression
        lines cut at right angles.
17. What are the purpose and meaning of the error terms in regression?
18. Give examples of business situations where you believe a straight line relationship exists
    between two variables. What would be the uses of a regression model in each of these
    situations?
19. ‘The regression lines give only the best estimate of the value of quantity in question. We
    may assess the degree of uncertainty in the estimate by calculating a quantity known as the
    standard error of estimate.’ Elucidate.
20. Explain the advantages of the least-squares procedure for fitting lines to data. Explain
    how the procedure works.
Formulae Used
\[ S_{y.x} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}}; \qquad S_{y.x} = S_y \sqrt{1 - r^2} \]
8. Interval estimate based on sample data: ŷ ± tdf Syx
True or False
1. A statistical relationship between two variables does not indicate a perfect relationship.
   (T/F)
2. A dependent variable in a regression equation is a continuous random variable. (T/F)
3. The residual value is required to estimate the amount of variation in the dependent variable
   with respect to the fitted regression line. (T/F)
4. Standard error of estimate is the conditional standard deviation of the dependent variable.
   (T/F)
5. Standard error of estimate is a measure of scatter of the observations about the regression
   line. (T/F)
6. If one of the regression coefficients is greater than one the other must also be greater
   than one. (T/F)
7. The signs of the regression coefficients are always same. (T/F)
8. Correlation coefficient is the geometric mean of regression coefficients. (T/F)
9. If the sign of two regression coefficients is negative, then sign of the correlation
   coefficient is positive. (T/F)
10. Correlation coefficient and regression coefficient are independent. (T/F)
11. The point of intersection of two regression lines represents average value of two
    variables. (T/F)
12. The two regression lines are at right angle when the correlation coefficient is zero. (T/F)
13. When value of correlation coefficient is one, the two regression lines coincide. (T/F)
14. The product of regression coefficients is always more than one. (T/F)
15. The regression coefficients are independent of the change of origin but not of scale. (T/F)
Multiple Choice
16. The line of ‘best fit’ to measure the variation of observed values of dependent variable in
    the sample data is
    (a) regression line          (b) correlation coefficient
    (c) standard error           (d) none of these
17. Two regression lines are perpendicular to each other when
    (a) r = 0                    (b) r = 1/3
    (c) r = – 1/2                (d) r = ± 1
18. The change in the dependent variable y corresponding to a unit change in the independent
    variable x is measured by
    (a) bxy                      (b) byx
    (c) r                        (d) none of these
19. The regression lines are coincident provided
    (a) r = 0                    (b) r = 1/3
    (c) r = – 1/2                (d) r = ± 1
20. If byx is greater than one, then bxy is
    (a) less than one            (b) more than one
    (c) equal to one             (d) none of these
21. If bxy is negative, then byx is
    (a) negative                 (b) positive
    (c) zero                     (d) none of these
22. If two regression lines are: y = a + bx and x = c + dy, then the correlation coefficient
    between x and y is
    (a) √(bc)    (b) √(ac)    (c) √(ad)    (d) √(bd)
23. If two regression lines are: y = a + bx and x = c + dy, then the ratio of standard
    deviations of x and y is
    (a) √(c/b)   (b) √(c/a)   (c) √(d/a)   (d) √(d/b)
24. If two regression lines are: y = a + bx and x = c + dy, then the ratio of a/c is equal to
    (a) b/d    (b) (1 − b)/(1 − d)    (c) (1 + b)/(1 + d)    (d) (b − 1)/(d − 1)
25. If two coefficients of regression are 0.8 and 0.2, then the value of coefficient of
    correlation is
    (a) 0.16    (b) – 0.16    (c) 0.40    (d) – 0.40
26. If two regression lines are: y = 4 + kx and x = 5 + 4y, then the range of k is
    (a) k ≤ 0                    (b) k ≥ 0
    (c) 0 ≤ k ≤ 1                (d) 0 ≤ 4k ≤ 1
27. If two regression lines are: x + 3y + 7 = 0 and 2x + 5y = 12, then x and y are respectively
    (a) 2, 1                     (b) 1, 2
    (c) 2, 3                     (d) 2, 4
28. The residual sum of square is
    (a) minimized                (b) increased
    (c) maximized                (d) decreased
29. The standard error of estimate Sy⋅x is the measure of
    (a) closeness                (b) variability
    (c) linearity                (d) none of these
30. The standard error of estimate is equal to
    (a) σy √(1 − r²)             (b) σy √(1 + r²)
    (c) σx √(1 − r²)             (d) σx √(1 + r²)
1. T 2. T 3. T 4. T 5. T 6. F 7. T 8. T 9. F
10. F 11. T 12. T 13. T 14. F 15. T 16. (a) 17. (a) 18. (b)
19. (d) 20. (a) 21. (a) 22. (d) 23. (d) 24. (b) 25. (c) 26. (d) 27. (b)
28. (a) 29. (b) 30. (a)
Review Self-Practice Problems
14.18 You are given below the following information about advertisement expenditure and sales:

                               Adv. Exp. (x)     Sales (y)
                               (Rs in crore)     (Rs in crore)
      Mean                          20              120
      Standard deviation             5               25
      Correlation coefficient             0.8

      (a) Calculate the two regression equations.
      (b) Find the likely sales when advertisement expenditure is Rs 25 crore.
      (c) What should be the advertisement budget if the company wants to attain sales target
          of Rs 150 crore?
      [Jammu Univ., MCom, 1997; Delhi Univ., MBA, 1999]

14.19 For 50 students of a class the regression equation of marks in Statistics (x) on the
      marks in Accountancy (y) is 3y – 5x + 180 = 0. The mean marks in Accountancy is 44 and
      the variance of marks in Statistics is 9/16th of the variance of marks in Accountancy.
      Find the mean marks in Statistics and the coefficient of correlation between marks in
      the two subjects.

14.20 The HRD manager of a company wants to find a measure which he can use to fix the monthly
      income of persons applying for a job in the production department. As an experimental
      project, he collected data on 7 persons from that department referring to years of
      service and their monthly income.

      Years of service        : 11   7   9   5   8   6  10
      Income (Rs in 1000’s)   : 10   8   6   5   9   7  11

      (a) Find the regression equation of income on years of service.
      (b) What initial start would you recommend for a person applying for the job after
          having served in a similar capacity in another company for 13 years?
      (c) Do you think other factors are to be considered (in addition to the years of
          service) in fixing the income with reference to the above problems? Explain.

14.21 The following table gives the age of cars of a certain make and their annual maintenance
      costs. Obtain the regression equation for costs related to age.

      Age of cars (in years)              : 12  14  16  18
      Maintenance costs (Rs in 100’s)     : 10  20  25  30
      [HP Univ., MBA, 1994]

14.22 An analyst in a certain company was studying the relationship between travel expenses in
      rupees (y) for 102 sales trips and the duration in days (x) of these trips. He has found
      that the relationship between y and x is linear. A summary of the data is given below:
      Σx = 510; Σy = 7140; Σx² = 4150; Σxy = 54,900, and Σy² = 7,40,200
      (a) Estimate the two regression equations from the above data.
      (b) A given trip takes seven days. How much money should a salesman be allowed so that
          he will not run short of money?

14.23 The quantity of a raw material purchased by ABC Ltd. at specified prices during the past
      12 months is given below.

      Month    Price per kg (in Rs)    Quantity (in kg)
      Jan               96                   250
      Feb              110                   200
      March            100                   250
      April             90                   280
      May               86                   300
      June              92                   300
      July             112                   220
      Aug              112                   220
      Sept             108                   200
      Oct              116                   210
      Nov               86                   300
      Dec               92                   250

      (a) Find the regression equations based on the above data.
      (b) Can you estimate the approximate quantity likely to be purchased if the price shoots
          up to Rs 124 per kg?
      (c) Hence or otherwise obtain the coefficient of correlation between the price prevailing
          and the quantity demanded.

14.24 With ten observations on price (x) and supply (y), the following data were obtained (in
      appropriate units):
      Σx = 130, Σy = 220, Σx² = 2288, Σy² = 5506, Σxy = 3467.
      Obtain the line of regression of y on x and estimate the supply when the price is 16
      units. Also find out the standard error of the estimate.

14.25 Data on the annual sales of a company in lakhs of rupees over the past 11 years is shown
      below. Determine a suitable straight line regression model y = β0 + β1x + ∈ for the
      data. Also calculate the standard error of regression of y for values of x.

      Year  : 1978  79  80  81  82  83  84  85  86  87  88
      Sales :    1   5   4   7  10   8   9  13  14  13  18

      From the regression line of y on x, predict the values of annual sales for the year 1989.

14.26 Find the equation of the least squares line fitting the following data:
      x: 1  2  3  4  5
      y: 2  6  5  3  4
      Calculate the standard error of estimate of y on x.

14.27 The following data relate to the number of weeks of experience in a job involving the
      wiring of an electric motor and the number of motors rejected during the past week for
      12 randomly selected workers.

      Workers    Experience (weeks)    No. of Rejects
         1               7                   26
         2               9                   20
         3               6                   28
         4              14                   16
         5               8                   23
         6              12                   18
         7              10                   24
         8               4                   26
         9               2                   38
        10              11                   22
        11               1                   32
        12               8                   25
      (a) Determine the linear regression equation for estimating the number of components
          rejected given the number of weeks of experience. Comment on the relationship
          between the two variables as indicated by the regression equation.
      (b) Use the regression equation to estimate the number of motors rejected for an
          employee with 3 weeks of experience in the job.
      (c) Determine the 95 per cent approximate prediction interval for estimating the number
          of motors rejected for an employee with 3 weeks of experience in the job, using only
          the standard error of estimate.

14.28 A financial analyst has gathered the following data about the relationship between income
      and investment in securities in respect of 8 randomly selected families:

      Income (Rs in 1000’s)             : 18  12  19  24  43  37  19  16
      Per cent invested in securities   : 36  25  33  15  28  19  20  22

      (a) Develop an estimating equation that best describes these data.
      (b) Find the coefficient of determination and interpret it.
      (c) Calculate the standard error of estimate for this relationship.
      (d) Find an approximate 90 per cent confidence interval for the percentage of income
          invested in securities by a family earning Rs 25,000 annually.
      [Delhi Univ., MFC, 1997]

14.29 A financial analyst obtained the following information relating to return on security A
      and that of market M for the past 8 years:

      Year      : 11  12  13  14  15  16  17   8
      Return A  : 10  15  18  14  16  16  18   4
      Market M  : 12  14  13  10   9  13  14   7

      (a) Develop an estimating equation that best describes these data.
      (b) Find the coefficient of determination and interpret it.
      (c) Determine the percentage of total variation in security return being explained by
          the return on the market portfolio.

14.30 The equation of a regression line is
      ŷ = 50.506 – 1.646x
      and the data are as follows:
      x:  5   7  11  12  19  25
      y: 47  38  32  24  22  10
      Solve for residuals and graph a residual plot. Do these data seem to violate any of the
      assumptions of regression?

14.31 Graph the following residuals and indicate which of the assumptions underlying regression
      appear to be in jeopardy on the basis of the graph:
      x      :   13   16   27   29   37   47   63
      y − ŷ  : – 11  – 5  – 2  – 1    6   10   12
14.15 x̄ = Σx/n = 21/8 = 2.625; ȳ = Σy/n = 4/8 = 0.50
      \[ b_{yx} = \frac{n \sum d_x d_y - (\sum d_x)(\sum d_y)}{n \sum d_x^2 - (\sum d_x)^2} = \frac{8 \times 30 - (-3)(-12)}{8 \times 45 - (-1)^2} = 0.568; \]
      dx = x – 3; dy = y – 3.
      Regression equation:
      y – ȳ = byx (x – x̄)
      or y – 0.5 = 0.568 (x – 2.625)
      y = – 0.991 + 0.568x
      (b) \[ b_{xy} = \frac{n \sum d_x d_y - (\sum d_x)(\sum d_y)}{n \sum d_y^2 - (\sum d_y)^2} = \frac{8 \times 30 - (-3)(-12)}{8 \times 84 - (-12)^2} = 0.386 \]
      Regression equation:
      x – x̄ = bxy (y – ȳ)
      or x – 2.625 = 0.386 (y – 5)
      x = 0.695 + 0.386y

14.16 Let x = rainfall, y = production. The expected yield corresponding to a rainfall of
      40 inches is given by regression equation of y on x.
      Given ȳ = 50, σy = 10, x̄ = 30, σx = 5, r = 0.8
      \[ y - \bar{y} = r \frac{\sigma_y}{\sigma_x} (x - \bar{x}); \qquad y - 50 = 0.8 \times \frac{10}{5} (x - 30) \]
      y = 2 + 1.6x
      For x = 40, y = 2 + 1.6 (40) = 66 quintals.

14.17 Let x = age of wife, y = age of husband.
      Given x̄ = 25, ȳ = 22, σx = 4, σy = 5, r = 0.8
      (a) Regression equation of x on y
      \[ x - \bar{x} = r \frac{\sigma_x}{\sigma_y} (y - \bar{y}); \qquad x - 25 = 0.8 \times \frac{4}{5} (y - 22) \]
      x = 10.92 + 0.64y
      When age of wife is y = 16; x = 10.92 + 0.64 (16) = 21.16, i.e. 21 approx. (husband’s age)
      (b) Left as an exercise
14.26 x̄ = Σx/n = 15/5 = 3, ȳ = Σy/n = 20/5 = 4
      The regression equation is:
      y − ȳ = byx (x – x̄)
      y – 4 = 0.7 (x – 3) or ŷ = 1.9 + 0.7x
      Standard error of estimate,
      \[ S_{yx} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}} = \sqrt{\frac{5.1}{3}} = 1.303 \]

14.27 (a) \[ b = \frac{\sum xy - n \bar{x} \bar{y}}{\sum x^2 - n(\bar{x})^2} = \frac{2048 - 12(7.67)(24.83)}{876 - 12(7.67)^2} = -1.40 \]
      a = ȳ – b x̄ = 24.83 – (– 1.40)(7.67) = 35.57
      Thus ŷ = a + bx = 35.57 – 1.40x
      Since b = – 1.40, it indicates an inverse (negative) relationship between weeks of
      experience (x) and the number of rejects (y) in the sample week
      (b) For x = 3, we have ŷ = 35.57 – 1.40(3) = 31.37 ≅ 31
      (c) \[ S_{yx} = \sqrt{\frac{\sum y^2 - a \sum y - b \sum xy}{n-2}} = \sqrt{\frac{7{,}798 - (35.57)(298) - (-1.40)(2048)}{12 - 2}} = 2.56 \]
      95 per cent approximate prediction interval
      ŷ ± tdf Sy⋅x = 31.37 ± 2.228 (2.56)
      = 25.67 to 37.07 or 26 to 37 rejects.

14.30 4.724; – 0.983; – 0.399; – 6.753; 2.768; 0.644
14.31 Error term non-independent.
Case Studies