Chapter 14

Regression Analysis

The cause is hidden, but the result is known.
                                        —Ovid
I never think of the future, it comes soon enough.
                                        —Albert Einstein

LEARNING OBJECTIVES

After studying this chapter, you should be able to


• use simple linear regression to build models from business data.
• understand how the method of least squares is used to predict values of a
  dependent (or response) variable based on the values of an independent (or
  explanatory) variable.
• measure the variability (residual) of the dependent variable about a straight
  line (also called the regression line) and examine whether the regression model fits
  the data.

14.1 INTRODUCTION
In Chapter 13 we introduced the concept of statistical relationship between two variables
such as: level of sales and amount of advertising; yield of a crop and the amount of
fertilizer used; price of a product and its supply, and so on. The relationship between
such variables indicates the degree and direction of their association, but fails to answer the
following question:
• Is there any functional (or algebraic) relationship between two variables? If yes,
can it be used to estimate the most likely value of one variable, given the value of
other variable?
The statistical technique that expresses the relationship between two or more variables
in the form of an equation to estimate the value of a variable, based on the given value of
another variable, is called regression analysis. The variable whose value is estimated using
the algebraic equation is called dependent (or response) variable and the variable whose
value is used to estimate this value is called independent (regressor or predictor) variable. The
linear algebraic equation used for expressing a dependent variable in terms of independent
variable is called linear regression equation.
The term regression was used in 1877 by Sir Francis Galton while studying the
relationship between the heights of fathers and sons. He found that though ‘tall fathers have
tall sons’, if the average height of tall fathers is x above the general height, the
average height of their sons is only 2x/3 above the general height. Such a fall in the average height
was described by Galton as ‘regression to mediocrity’. However, Galton’s finding is
not universally applicable, and the term regression is now applied to other types of variables
in business and economics. In the literary sense, the term regression means
‘moving backward’.

The basic differences between correlation and regression analysis are summarized as
follows:
1. Developing an algebraic equation between two variables from sample data and
predicting the value of one variable, given the value of the other variable is referred
to as regression analysis, while measuring the strength (or degree) of the relationship
between two variables is referred to as correlation analysis. The sign of the correlation
coefficient indicates the nature (direct or inverse) of relationship between two variables,
while the absolute value of correlation coefficient indicates the extent of relationship.
2. Correlation analysis determines an association between two variables x and y but not
that they have a cause-and-effect relationship. Regression analysis, in contrast to
correlation, determines the cause-and-effect relationship between x and y, that is, a
change in the value of independent variable x causes a corresponding change (effect)
in the value of dependent variable y if all other factors that affect y remain unchanged.
3. In linear regression analysis one variable is considered the dependent variable and the
other the independent variable, while in correlation analysis both variables are
considered to be independent.
4. The coefficient of determination r2 indicates the proportion of total variance in the dependent
variable that is explained or accounted for by the variation in the independent variable. Since
value of r2 is determined from a sample, its value is subject to sampling error. Even
if the value of r2 is high, the assumption of a linear regression may be incorrect
because it may represent a portion of the relationship that actually is in the form of
a curve.

14.2 ADVANTAGES OF REGRESSION ANALYSIS


The following are some important advantages of regression analysis:
1. Regression analysis helps in developing a regression equation by which the value of
a dependent variable can be estimated given a value of an independent variable.
2. Regression analysis helps to determine standard error of estimate to measure the
variability or spread of values of a dependent variable with respect to the regression
line. The smaller the variance and standard error of estimate, the closer the pairs of values (x, y) fall
about the regression line and the better the line fits the data; that is, a good estimate can
be made of the value of variable y. When all the points fall on the line, the standard
error of estimate equals zero.
3. When the sample size is large (df ≥ 29), interval estimation for predicting the
value of a dependent variable based on the standard error of estimate is considered
acceptable. The magnitude of r² remains the same regardless of which of the two
variables is treated as the dependent variable.

14.3 TYPES OF REGRESSION MODELS


The primary objective of regression analysis is the development of a regression model to
explain the association between two or more variables in the given population. A regression
model is the mathematical equation that provides prediction of value of dependent variable
based on the known values of one or more independent variables.
The particular form of regression model depends upon the nature of the problem
under study and the type of data available. However, each type of association or
relationship can be described by an equation relating a dependent variable to one or
more independent variables.

14.3.1 Simple and Multiple Regression Models


If a regression model characterizes the relationship between a dependent variable y and only one
independent variable x, then such a regression model is called a simple regression model.
But if more than one independent variable is associated with a dependent variable,

then such a regression model is called a multiple regression model. For example, sales
turnover of a product (a dependent variable) is associated with multiple independent
variables such as price of the product, expenditure on advertisement, quality of the
product, competitors, and so on. If we want to estimate the possible sales turnover with
respect to only one of these independent variables, then it is an example of a simple
regression model; otherwise a multiple regression model is applicable.

14.3.2 Linear and Nonlinear Regression Models


If the value of a dependent (response) variable y in a regression model tends to increase
in direct proportion to an increase in the values of independent (predictor) variable x,
then such a regression model is called a linear model. Thus, it can be assumed that the
mean value of the variable y for a given value of x is related by a straight-line relationship.
Such a relationship is called simple linear regression model expressed with respect to the
population parameters β0 and β1 as:
E(y|x) = β0 + β1x (14-1)
where β0 = y-intercept that represents mean (or average) value of the dependent
variable y when x = 0
β1 = slope of the regression line that represents the expected change in the
value of y (either positive or negative) for a unit change in the value of x.

Figure 14.1
Straight Line Relationship

The intercept β0 and the slope β1 are unknown regression coefficients. Equation
(14-1) requires the values of β0 and β1 in order to predict average values of y for a
given value of x. Figure 14.1 presents a scatter diagram where each pair of values
(xi, yi) represents a point in a two-dimensional coordinate system. Although the mean or
average value of y is a linear function of x, not all values of y fall exactly on the
straight line; rather, they fall around the line.
Since not all points fall on the regression line, the values of y are not exactly
equal to the values yielded by the equation E(y|x) = β0 + β1x, also called the line of
means. The deviations of observed y values from the regression line give rise to
random error (also called residual variation or residual error) in the prediction of y values for
given values of x. In such a situation, it is likely that the variable x does not explain all the
variability of the variable y. For instance, sales volume is related to advertising, but if
other factors related to sales are ignored, then a regression equation to predict the sales
volume (y) by using the annual advertising budget (x) as a predictor will probably involve
some error. Thus, for a fixed value of x, the actual value of y is determined by the mean
value function plus a random error term as follows:
y = Mean value function + Deviation
  = β0 + β1x + e = E(y) + e                (14-2)
where e is the observed random error. This equation is also called the simple probabilistic linear
regression model.
The error component e allows each individual value of y to deviate from the line of
means by a small amount. The random errors corresponding to different observations
(xi, yi), i = 1, 2, ..., n, are assumed to follow a normal distribution with mean zero and
(unknown) constant standard deviation.

The term e in expression (14-2) is called the random error because its value,
associated with each value of variable y, is assumed to vary unpredictably. The extent of
this error for a given value of x is measured by the error variance σ²e. The lower the value of
σ²e, the better the fit of the linear regression model to the sample data.
If the line passing through the pairs of values of variables x and y is curvilinear, then
the relationship is called nonlinear. A nonlinear relationship implies a varying absolute
change in the dependent variable with respect to changes in the value of the independent
variable. A nonlinear relationship is not very useful for predictions.
In this chapter, we shall discuss methods of simple linear regression analysis involving a
single independent variable; those involving two or more independent variables are
discussed in Chapter 15.

14.4 ESTIMATION : THE METHOD OF LEAST SQUARES


To estimate the values of regression coefficients β0 and β1, suppose a sample of n pairs of
observations (x1, y1), (x2, y2), . . ., (xn, yn) is drawn from the population under study. A
method that provides the best linear unbiased estimates of β0 and β1 is called the method
of least squares. The estimates of β0 and β1 should result in a straight line that is ‘best fit’ to
the data points. The straight line so drawn is referred to as ‘best fitted’ (least squares
or estimated) regression line because the sum of the squares of the vertical deviations (differences
between the actual values of y and the estimated values ŷ predicted from the fitted line) is as small
as possible.
Using equation (14-2), we may express given n observations in the sample data as:
yi = β0 + β1xi + ei or ei = yi – ( β0 + β1xi), for all i
Mathematically, we intend to minimize
L = Σ eᵢ² = Σ {yᵢ − (β0 + β1xᵢ)}²,   the sums running over i = 1, 2, ..., n
Let b0 and b1 be the least-squares estimators of β0 and β1 respectively. The least-
squares estimators b0 and b1 must satisfy
∂L/∂β0 = −2 Σ (yᵢ − b0 − b1xᵢ) = 0
∂L/∂β1 = −2 Σ (yᵢ − b0 − b1xᵢ) xᵢ = 0
(both derivatives being evaluated at β0 = b0, β1 = b1)

After simplifying these two equations, we get


Σ yᵢ = nb0 + b1 Σ xᵢ                                    (14-3)
Σ xᵢyᵢ = b0 Σ xᵢ + b1 Σ xᵢ²
Equations (14-3) are called the least-squares normal equations. The values of least squares
estimators b0 and b1 can be obtained by solving equations (14-3). Hence the fitted or
estimated regression line is given by:
ŷ = b0 + b1x
where ŷ (called y hat) is the value of y lying on the fitted regression line for a given
x value and ei = yi – ŷ i is called the residual that describes the error in fitting of the
regression line to the observation yi. The fitted value ŷ is also called the predicted value of
y because if actual value of y is not known, then it would be predicted for a given value of
x using the estimated regression line.

Remark: The sum of the residuals is zero for any least-squares regression line. Since
Σ yᵢ = Σ ŷᵢ, it follows that Σ eᵢ = 0.
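The following is a minimal computational sketch, not part of the text, of how the normal equations (14-3) can be solved for b0 and b1 in Python; the function name and the small data set are illustrative assumptions only.

```python
# Minimal sketch (assumed, illustrative): solving the normal equations (14-3).

def fit_least_squares(x, y):
    """Return (b0, b1) of the fitted line y_hat = b0 + b1*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Solving  Σy = n*b0 + b1*Σx  and  Σxy = b0*Σx + b1*Σx²  simultaneously
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1

# Illustrative data (assumed, not from the text)
x = [2, 4, 6, 8, 10]
y = [3, 7, 8, 12, 15]
b0, b1 = fit_least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(b0, 3), round(b1, 3))   # 0.3 and 1.45 for this data
print(round(sum(residuals), 10))    # residuals of a least-squares line sum to zero
```

Equivalently, b1 = Sxy/Sxx and b0 = ȳ − b1 x̄, the short-cut form used in Section 14.7.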

14.5 ASSUMPTIONS FOR A SIMPLE LINEAR REGRESSION MODEL


To make valid statistical inference using regression analysis, we make certain assumptions
about the bivariate population from which a sample of paired observations is drawn and
the manner in which observations are generated. These assumptions form the basis for
application of simple linear regression models. Figure 14.2 illustrates these assumptions.

Figure 14.2
Graphical Illustration of Assumptions
in Regression Analysis

Assumptions
1. The relationship between the dependent variable y and independent variable x exists
and is linear. The average relationship between x and y can be described by a simple
linear regression equation y = a + bx + e, where e is the deviation of a particular
value of y from its expected value for a given value of independent variable x.
2. For every value of the independent variable x, there is an expected (or mean) value of
the dependent variable y, and these values are normally distributed. The means of
these normally distributed values fall on the line of regression.
3. The dependent variable y is a continuous random variable, whereas values of the
independent variable x are fixed values and are not random.
4. The sampling error associated with the expected value of the dependent variable y is
assumed to be an independent random variable distributed normally with mean
zero and constant standard deviation. The errors are not related with each other in
successive observations.
5. The standard deviation and variance of expected values of the dependent variable y
about the regression line are constant for all values of the independent variable x
within the range of the sample data.
6. The value of the dependent variable cannot be estimated for a value of an independent
variable lying outside the range of values in the sample data.

14.6 PARAMETERS OF SIMPLE LINEAR REGRESSION MODEL


The fundamental aim of regression analysis is to determine a regression equation (line)
that makes sense and fits the representative data such that the error variance is as small
as possible. This implies that the regression equation should be adequate for
prediction. J. R. Stockton stated that
• The device used for estimating the values of one variable from the value of the other
consists of a line through the points, drawn in such a manner as to represent the average
relationship between the two variables. Such a line is called line of regression.

The two variables x and y which are correlated can be expressed in terms of each
other in the form of straight line equations called regression equations. Such lines should
be able to provide the best fit of sample data to the population data. The algebraic
expression of regression lines is written as:
• The regression equation of y on x
y = a + bx
is used for estimating the value of y for given values of x.
• Regression equation of x on y
x = c + dy
is used for estimating the value of x for given values of y.

Remarks
1. When variables x and y are correlated perfectly (either positive or negative) these lines
coincide, that is, we have only one line.
2. The higher the degree of correlation, the nearer the two regression lines are to each other.
3. The lower the degree of correlation, the farther apart the two regression lines are.
In particular, when r = 0, the two lines are at right angles to each other.
4. The two linear regression lines intersect at the point given by the average values of the variables
x and y.

14.6.1 Regression Coefficients


To estimate the values of the population parameters β0 and β1, under certain assumptions, the
fitted or estimated regression equation representing the straight line regression model is
written as:
ŷ = a + bx
where ŷ = estimated average (mean) value of the dependent variable y for a given value
of the independent variable x
a (or b0) = y-intercept that represents the average value of ŷ when x = 0
b = slope of the regression line that represents the expected change in the value
of y for a unit change in the value of x
To determine the value of ŷ for a given value of x, this equation requires the
determination of two unknown constants a (intercept) and b (also called regression
coefficient). Once these constants are calculated, the regression line can be used to
compute an estimated value of the dependent variable y for a given value of independent
variable x.
The particular values of a and b define a specific linear relationship between x and y
based on sample data. The coefficient ‘a’ represents the level of fitted line (i.e., the distance
of the line above or below the origin) when x equals zero, whereas coefficient ‘b’ represents
the slope of the line (a measure of the change in the estimated value of y for a one-unit
change in x).
The regression coefficient ‘b’ is also denoted as:
• byx (regression coefficient of y on x) in the regression line, y = a + bx
• bxy (regression coefficient of x on y) in the regression line, x = c + dy

Properties of regression coefficients


1. The correlation coefficient is the geometric mean of the two regression coefficients, that is,
r = √(byx × bxy).
2. If one regression coefficient is greater than one, then the other regression coefficient must be
less than one, because the value of the correlation coefficient r cannot exceed one. However,
both regression coefficients may be less than one.
3. Both regression coefficients must have the same sign (either positive or negative). This
property rules out the case of opposite sign of two regression coefficients.

4. The correlation coefficient has the same sign (either positive or negative) as the two
regression coefficients. For example, if byx = −0.664 and bxy = −0.234, then
r = −√(0.664 × 0.234) = −0.394.
5. The arithmetic mean of the regression coefficients bxy and byx is greater than or equal to the
correlation coefficient r in magnitude, that is, |byx + bxy|/2 ≥ |r|. For example, if byx = −0.664 and
bxy = −0.234, then the arithmetic mean of these two values is (−0.664 − 0.234)/2 = −0.449, and
this value exceeds r = −0.394 in magnitude.
6. Regression coefficients are independent of origin but not of scale.
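As a quick numerical illustration — a sketch assumed here, not part of the text — the properties above can be checked with the coefficients quoted in properties 4 and 5 (byx = −0.664, bxy = −0.234):

```python
import math

b_yx, b_xy = -0.664, -0.234          # coefficients quoted in properties 4 and 5

# Properties 1 and 4: r is the geometric mean of the coefficients and shares their sign.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
print(round(r, 3))                   # -0.394

# Property 5: the arithmetic mean of the coefficients is at least |r| in magnitude.
am = (b_yx + b_xy) / 2
print(round(am, 3), abs(am) >= abs(r))   # -0.449 True
```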

14.7 METHODS TO DETERMINE REGRESSION COEFFICIENTS


Following are the methods to determine the parameters of a fitted regression equation.

14.7.1 Least Squares Normal Equations


Let ŷ = a + bx be the least squares line of y on x, where ŷ is the estimated average value
of dependent variable y. The line that minimizes the sum of squares of the deviations of
the observed values of y from those predicted is the best fitting line. Thus the sum of
residuals for any least-square line is minimum, where

L = Σ ( y − ˆy)2 = Σ {y – (a + bx)}2; a, b = constants


Differentiating L with respect to a and b and equating to zero, we have
∂L/∂a = −2 Σ {y − (a + bx)} = 0
∂L/∂b = −2 Σ {y − (a + bx)} x = 0
Solving these two equations, we get the same set of equations as equations (14-3)
Σ y = na + bΣ x                                        (14-4)
Σ xy = aΣ x + bΣ x²
where n is the total number of pairs of values of x and y in a sample data. The equations
(14-4) are called normal equations with respect to the regression line of y on x. After
solving these equations for a and b, the values of a and b are substituted in the regression
equation, y = a + bx.
Similarly if we have a least squares line x̂ = c + dy of x on y, where x̂ is the estimated
mean value of dependent variable x, then the normal equations will be
Σ x = nc + dΣ y
Σ xy = cΣ y + dΣ y²
These equations are solved in the same manner as described above for constants c and d.
The values of these constants are then substituted in the regression equation x = c + dy.

Alternative method to calculate value of constants


Instead of using the algebraic method to calculate the values of a and b, we may directly use
the results of solving these normal equations.
The gradient ‘b’ (regression coefficient of y on x) and ‘d’ (regression coefficient of x
on y) are calculated as:
b = Sxy / Sxx,   where  Sxy = Σ (x − x̄)(y − ȳ) = Σ xy − (Σ x)(Σ y)/n
                        Sxx = Σ (x − x̄)² = Σ x² − (Σ x)²/n
d = Sxy / Syy,   where  Syy = Σ (y − ȳ)² = Σ y² − (Σ y)²/n

Since the regression line passes through the point ( x , y ), the mean values of x and
y and the regression equations can be used to find the value of constants a and c as
follows:
a = y − bx for regression equation of y on x
c = x – d y for regression equation of x on y
The calculated values of a, b and c, d are substituted in the regression line y = a + bx
and x = c + dy respectively to determine the exact relationship.
Example 14.1: Use least squares regression line to estimate the increase in sales revenue
expected from an increase of 7.5 per cent in advertising expenditure.

Firm    Annual Percentage Increase      Annual Percentage Increase
        in Advertising Expenditure      in Sales Revenue
A       1                               1
B       3                               2
C       4                               2
D       6                               4
E       8                               6
F       9                               8
G       11                              8
H       14                              9

Solution: Assume sales revenue (y) is dependent on advertising expenditure (x).
Calculations for the regression line using the following normal equations are shown in Table 14.1:
Σ y = na + bΣ x   and   Σ xy = aΣ x + bΣ x²

Table 14.1: Calculation for Normal Equations

Sales Revenue, y    Advertising Expenditure, x    x²      xy
1                   1                             1       1
2                   3                             9       6
2                   4                             16      8
4                   6                             36      24
6                   8                             64      48
8                   9                             81      72
8                   11                            121     88
9                   14                            196     126
40                  56                            524     373

Approach 1 (Normal Equations):


Σ y = na + bΣ x    or   40 = 8a + 56b
Σ xy = aΣ x + bΣ x²   or   373 = 56a + 524b
Solving these equations, we get
a = 0.072 and b = 0.704
Substituting these values in the regression equation
y = a + bx = 0.072 + 0.704x
For x = 7.5% or 0.075 increase in advertising expenditure, the estimated increase in
sales revenue will be
y = 0.072 + 0.704 (0.075) = 0.1248 or 12.48%
Approach 2 (Short-cut method):
b = Sxy / Sxx = 93/132 = 0.704
where   Sxy = Σ xy − (Σ x)(Σ y)/n = 373 − (40 × 56)/8 = 93
        Sxx = Σ x² − (Σ x)²/n = 524 − (56)²/8 = 132
The intercept ‘a’ on the y-axis is calculated as:
a = ȳ − b x̄ = 40/8 − 0.704 × 56/8 = 5 − 0.704 × 7 = 0.072
Substituting the values of a = 0.072 and b = 0.704 in the regression equation, we get
y = a + bx = 0.072 + 0.704 x
For x = 0.075, we have y = 0.072 + 0.704 (0.075) = 0.1248 or 12.48%.
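The short-cut calculation of Example 14.1 can be reproduced with a few lines of Python; this sketch is an illustration assumed here, not part of the text, and small differences in the last decimal place arise because the text rounds b to 0.704 before computing a.

```python
# Data of Example 14.1: percentage increases in advertising (x) and sales revenue (y)
x = [1, 3, 4, 6, 8, 9, 11, 14]
y = [1, 2, 2, 4, 6, 8, 8, 9]
n = len(x)

Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n   # 373 - 280 = 93
Sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n                   # 524 - 392 = 132
b = Sxy / Sxx                      # ≈ 0.7045 (rounded to 0.704 in the text)
a = sum(y) / n - b * sum(x) / n    # ≈ 0.068  (0.072 in the text, via the rounded b)
print(round(Sxy, 1), round(Sxx, 1), round(b, 3), round(a, 3))
```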

Example 14.2: The owner of a small garment shop is hopeful that his sales are rising
significantly week by week. Treating the sales for the previous six weeks as a typical
example of this rising trend, he recorded them in Rs 1000’s and analysed the results
Week : 1 2 3 4 5 6
Sales : 2.69 2.62 2.80 2.70 2.75 2.81
Fit a linear regression equation to suggest to him the weekly rate at which his sales
are rising and use this equation to estimate expected sales for the 7th week.
Solution: Assume sales (y) is dependent on weeks (x). Then the normal equations for the
regression equation y = a + bx are written as:
Σ y = na + bΣ x   and   Σ xy = aΣ x + bΣ x²
Calculations for sales during various weeks are shown in Table 14.2.

Table 14.2: Calculations of Normal Equations

Week (x)      Sales (y)      x²      xy


1 2.69 1 2.69
2 2.62 4 5.24
3 2.80 9 8.40
4 2.70 16 10.80
5 2.75 25 13.75
6 2.81 36 16.86
21 16.37 91 57.74

The gradient ‘b’ is calculated as:
b = Sxy / Sxx = 0.445/17.5 = 0.025
where   Sxy = Σ xy − (Σ x)(Σ y)/n = 57.74 − (21 × 16.37)/6 = 0.445
        Sxx = Σ x² − (Σ x)²/n = 91 − (21)²/6 = 17.5
The intercept ‘a’ on the y-axis is calculated as:
a = ȳ − b x̄ = 16.37/6 − 0.025 × 21/6 = 2.728 − 0.025 × 3.5 = 2.64
Substituting the values a = 2.64 and b = 0.025 in the regression equation, we have
y = a + bx = 2.64 + 0.025x
For x = 7, we have y = 2.64 + 0.025 (7) = 2.815
Hence the expected sales during the 7th week are likely to be Rs 2.815 thousand (about Rs 2815).

14.7.2 Deviations Method


Calculations for the least squares normal equations become lengthy and tedious when the values
of x and y are large. The following two methods may therefore be used to reduce the
computational time.

(a) Deviations Taken from Actual Mean Values of x and y  If deviations of the actual values of
variables x and y are taken from their mean values x̄ and ȳ, then the regression
equations can be written as:
• Regression equation of y on x:  y − ȳ = byx (x − x̄), where byx is the regression coefficient
  of y on x, calculated as
  byx = Σ (x − x̄)(y − ȳ) / Σ (x − x̄)²
• Regression equation of x on y:  x − x̄ = bxy (y − ȳ), where bxy is the regression coefficient
  of x on y, calculated as
  bxy = Σ (x − x̄)(y − ȳ) / Σ (y − ȳ)²

(b) Deviations Taken from Assumed Mean Values of x and y  If the mean value of either x or y
(or both) is fractional, we may prefer to take deviations of the actual values of
variables x and y from assumed means.
• Regression equation of y on x:  y − ȳ = byx (x − x̄), where
  byx = [nΣ dx dy − (Σ dx)(Σ dy)] / [nΣ dx² − (Σ dx)²]
• Regression equation of x on y:  x − x̄ = bxy (y − ȳ), where
  bxy = [nΣ dx dy − (Σ dx)(Σ dy)] / [nΣ dy² − (Σ dy)²]
Here n = number of observations, dx = x − A (A being the assumed mean of x) and
dy = y − B (B being the assumed mean of y).

(c) Regression Coefficients in Terms of Correlation Coefficient  If deviations are taken from
actual mean values, then the values of the regression coefficients can alternatively be
calculated as follows:
byx = Σ (x − x̄)(y − ȳ) / Σ (x − x̄)² = Cov(x, y)/σx² = r (σy/σx)
bxy = Σ (x − x̄)(y − ȳ) / Σ (y − ȳ)² = Cov(x, y)/σy² = r (σx/σy)

Example 14.3: The following data relate to the scores obtained by 9 salesmen of a company
in an intelligence test and their weekly sales (in Rs 1000’s)

Salesmen : A B C D E F G H I
Test scores : 50 60 50 60 80 50 80 40 70
Weekly sales : 30 60 40 50 60 30 70 50 60

(a) Obtain the regression equation of sales on intelligence test scores of the salesmen.
(b) If the intelligence test score of a salesman is 65, what would be his expected weekly
sales?                                        [HP Univ., MCom, 1996]
Solution: Assume weekly sales (y) as dependent variable and test scores (x) as independent
variable. Calculations for the following regression equation are shown in Table 14.3.
y – y = by x (x – x )

Table 14.3: Calculation for Regression Equation

Test Score, x   dx = x − 60   dx²    Weekly Sales, y   dy = y − 50   dy²    dx dy
50              −10           100    30                −20           400    200
60              0             0      60                10            100    0
50              −10           100    40                −10           100    100
60              0             0      50                0             0      0
80              20            400    60                10            100    200
50              −10           100    30                −20           400    200
80              20            400    70                20            400    400
40              −20           400    50                0             0      0
70              10            100    60                10            100    100
540             0             1600   450               0             1600   1200

(a) x̄ = Σx/n = 540/9 = 60;   ȳ = Σy/n = 450/9 = 50
byx = Σ dx dy / Σ dx² = 1200/1600 = 0.75
(deviations are taken from the actual means, so Σ dx = Σ dy = 0)
Substituting values in the regression equation, we have
y – 50 = 0.75 (x – 60) or y = 5 + 0.75x
For a test score of x = 65, we have
y = 5 + 0.75 (65) = 53.75
Hence we conclude that the weekly sales are expected to be Rs 53.75 thousand for a
test score of 65.
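A short sketch — assumed here, not part of the text — verifying with the data of Example 14.3 that the deviation formula for byx agrees with the correlation form byx = r·σy/σx of Section 14.7.2(c):

```python
import math

# Data of Example 14.3: intelligence test scores (x) and weekly sales (y)
x = [50, 60, 50, 60, 80, 50, 80, 40, 70]
y = [30, 60, 40, 50, 60, 30, 70, 50, 60]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

Sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))   # 1200
Sxx = sum((xi - mx) ** 2 for xi in x)                      # 1600
Syy = sum((yi - my) ** 2 for yi in y)                      # 1600

b_yx = Sxy / Sxx                                 # deviation formula: 0.75
r = Sxy / math.sqrt(Sxx * Syy)                   # correlation coefficient
b_via_r = r * math.sqrt(Syy) / math.sqrt(Sxx)    # r * (sigma_y / sigma_x)
print(round(b_yx, 2), round(b_via_r, 2))         # both 0.75
```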
Example 14.4: A company is introducing a job evaluation scheme in which all jobs are
graded by points for skill, responsibility, and so on. Monthly pay scales (Rs in 1000’s) are
then drawn up according to the number of points allocated and other factors such as
experience and local conditions. To date the company has applied this scheme to 9 jobs:
Job : A B C D E F G H I
Points : 5 25 7 19 10 12 15 28 16
Pay (Rs) : 3.0 5.0 3.25 6.5 5.5 5.6 6.0 7.2 6.1

(a) Find the least squares regression line for linking pay scales to points.
(b) Estimate the monthly pay for a job graded by 20 points.
Solution: Assume monthly pay (y) as the dependent variable and job grade points (x) as
the independent variable. Calculations for the following regression equation are shown
in Table 14.4.
y – y = byx (x – x )
Table 14.4: Calculations for Regression Equation

Grade Points, x   dx = x − 15   dx²    Pay Scale, y   dy = y − 5   dy²     dx dy
5                 −10           100    3.0            −2.00        4.00    20.0
25                10            100    5.0 ← B        0            0       0
7                 −8            64     3.25           −1.75        3.06    14.0
19                4             16     6.5            1.50         2.25    6.0
10                −5            25     5.5            0.50         0.25    −2.5
12                −3            9      5.6            0.60         0.36    −1.8
15 ← A            0             0      6.0            1.00         1.00    0
28                13            169    7.2            2.20         4.84    28.6
16                1             1      6.1            1.10         1.21    1.1
137               2             484    48.15          3.15         16.97   65.40

(a) x̄ = Σx/n = 137/9 = 15.22;   ȳ = Σy/n = 48.15/9 = 5.35
Since the mean values x̄ and ȳ are non-integer values, deviations are taken from the
assumed means as shown in Table 14.4.
byx = [nΣ dx dy − (Σ dx)(Σ dy)] / [nΣ dx² − (Σ dx)²] = (9 × 65.40 − 2 × 3.15)/(9 × 484 − (2)²)
    = 582.3/4352 = 0.133
Substituting values in the regression equation, we have
y − ȳ = byx (x − x̄)   or   y − 5.35 = 0.133 (x − 15.22),   i.e.   y = 3.326 + 0.133x
(b) For a job graded at x = 20 points, the estimated average pay scale is
y = 3.326 + 0.133x = 3.326 + 0.133 (20) = 5.986
Hence, the likely monthly pay for a job with 20 grade points is Rs 5986.
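The assumed-mean calculation of Example 14.4 can be sketched in Python as below (an illustration assumed here, not part of the text); the intercept differs slightly from the text's 3.326 because the text rounds byx to 0.133.

```python
# Data of Example 14.4: job grade points (x) and monthly pay (y, Rs in 1000's)
x = [5, 25, 7, 19, 10, 12, 15, 28, 16]
y = [3.0, 5.0, 3.25, 6.5, 5.5, 5.6, 6.0, 7.2, 6.1]
n = len(x)
A, B = 15, 5                               # assumed means of x and y

dx = [xi - A for xi in x]
dy = [yi - B for yi in y]
num = n * sum(u * v for u, v in zip(dx, dy)) - sum(dx) * sum(dy)   # 582.3
den = n * sum(u * u for u in dx) - sum(dx) ** 2                    # 4352
b_yx = num / den                           # ≈ 0.1338

a = sum(y) / n - b_yx * sum(x) / n         # intercept y_bar - b_yx * x_bar ≈ 3.31
print(round(b_yx, 4), round(a, 3))
print(round(a + b_yx * 20, 3))             # pay for 20 grade points ≈ 5.99 (5.986 in the text)
```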
Example 14.5: The following data give the ages and blood pressure of 10 women.
Age :            56  42  36  47  49  42  60  72  63  55
Blood pressure : 147 125 118 128 145 140 155 160 149 150
(a) Find the correlation coefficient between age and blood pressure.
(b) Determine the least squares regression equation of blood pressure on age.
(c) Estimate the blood pressure of a woman whose age is 45 years.
[Ranchi Univ. MBA; South Gujarat Univ., MBA, 1997]
Solution: Assume blood pressure (y) as the dependent variable and age (x) as the
independent variable. Calculations for regression equation of blood pressure on age are
shown in Table 14.5.

Table 14.5: Calculations for Regression Equation

Age, x    dx = x − 49   dx²    Blood Pressure, y   dy = y − 145   dy²    dx dy
56        7             49     147                 2              4      14
42        −7            49     125                 −20            400    140
36        −13           169    118                 −27            729    351
47        −2            4      128                 −17            289    34
49 ← A    0             0      145 ← B             0              0      0
42        −7            49     140                 −5             25     35
60        11            121    155                 10             100    110
72        23            529    160                 15             225    345
63        14            196    149                 4              16     56
55        6             36     150                 5              25     30
522       32            1202   1417                −33            1813   1115

(a) The coefficient of correlation between age and blood pressure is given by
r = [nΣ dx dy − (Σ dx)(Σ dy)] / [√{nΣ dx² − (Σ dx)²} √{nΣ dy² − (Σ dy)²}]
  = [10 (1115) − (32)(−33)] / [√{10(1202) − (32)²} √{10(1813) − (−33)²}]
  = (11150 + 1056) / [√(12020 − 1024) √(18130 − 1089)] = 12206/13689 = 0.892
We may conclude that there is a high degree of positive correlation between age and
blood pressure.
(b) The regression equation of blood pressure on age is given by
y – y = byx (x – x )

x̄ = Σx/n = 522/10 = 52.2;   ȳ = Σy/n = 1417/10 = 141.7
and byx = [nΣ dx dy − (Σ dx)(Σ dy)] / [nΣ dx² − (Σ dx)²] = [10(1115) − 32(−33)] / [10(1202) − (32)²]
        = 12206/10996 = 1.11
Substituting these values in the above equation, we have
y − 141.7 = 1.11 (x − 52.2)   or   y = 83.758 + 1.11x
This is the required regression equation of y on x.
(c) For a woman whose age is 45, the estimated average blood pressure will be
y = 83.758 + 1.11(45) = 83.758 + 49.95 = 133.708
Hence, the likely blood pressure of a woman of 45 years is about 134.
Example 14.6: The General Sales Manager of Kiran Enterprises—an enterprise dealing
in the sale of readymade men’s wear—is toying with the idea of increasing his sales to
Rs 80,000. On checking the records of sales during the last 10 years, it was found that
the annual sale proceeds and advertisement expenditure were highly correlated to the
extent of 0.8. It was further noted that the annual average sales have been Rs 45,000 and the
annual average advertisement expenditure Rs 30,000, with variances of 1600 and 625 in
sales and advertisement expenditure respectively.
In view of the above, how much expenditure on advertisement would you suggest
the General Sales Manager of the enterprise to incur to meet his target of sales?
[Kurukshetra Univ., MBA, 1998]
Solution: Assume advertisement expenditure (y) as the dependent variable and sales (x)
as the independent variable. Then the regression equation of advertisement expenditure
on sales is given by
y − ȳ = r (σy/σx)(x − x̄)
Given r = 0.8, σx = 40, σy = 25, x̄ = 45,000, ȳ = 30,000. Substituting these values
in the above equation, we have
y − 30,000 = 0.8 (25/40)(x − 45,000) = 0.5 (x − 45,000)
y = 30,000 + 0.5x − 22,500 = 7500 + 0.5x
When the sales target is fixed at x = 80,000, the estimated amount likely to be spent on
advertisement would be
y = 7500 + 0.5 × 80,000 = 7500 + 40,000 = Rs 47,500
Example 14.7: You are given the following information about advertising expenditure
and sales:

                          Advertisement (x)    Sales (y)
                          (Rs in lakh)         (Rs in lakh)
Arithmetic mean           10                   90
Standard deviation        3                    12

Correlation coefficient r = 0.8


(a) Obtain the two regression equations.
(b) Find the likely sales when advertisement budget is Rs 15 lakh.
(c) What should be the advertisement budget if the company wants to attain sales target
of Rs 120 lakh? [Kumaon Univ., MBA, 2000, MBA, Delhi Univ., 2002]
Solution: (a) The regression equation of x on y is given by
x − x̄ = r (σx/σy)(y − ȳ)
Given x̄ = 10, r = 0.8, σx = 3, σy = 12, ȳ = 90. Substituting these values in the
above regression equation, we have
x − 10 = 0.8 (3/12)(y − 90)   or   x = −8 + 0.2y
The regression equation of y on x is given by
y − ȳ = r (σy/σx)(x − x̄)
y − 90 = 0.8 (12/3)(x − 10)   or   y = 58 + 3.2x
(b) Substituting x = 15 in the regression equation of y on x, the likely average sales
volume would be
y = 58 + 3.2 (15) = 58 + 48 = 106
Thus the likely sales for an advertisement budget of Rs 15 lakh is Rs 106 lakh.
(c) Substituting y = 120 in the regression equation of x on y, the likely advertisement
budget to attain the desired sales target of Rs 120 lakh would be
x = −8 + 0.2y = −8 + 0.2 (120) = 16
Hence, an advertisement budget of Rs 16 lakh should be sufficient to attain
the sales target of Rs 120 lakh.
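Examples 14.6 and 14.7 build the regression lines from summary statistics alone (means, standard deviations and r). A small helper in that spirit — assumed here, not part of the text — applied to the figures of Example 14.7:

```python
def regression_lines(mean_x, mean_y, sd_x, sd_y, r):
    """Return ((a, b_yx), (c, b_xy)) for the lines y = a + b_yx*x and x = c + b_xy*y."""
    b_yx = r * sd_y / sd_x
    b_xy = r * sd_x / sd_y
    return (mean_y - b_yx * mean_x, b_yx), (mean_x - b_xy * mean_y, b_xy)

# Figures of Example 14.7: advertising (x) and sales (y), Rs in lakh
(a, b), (c, d) = regression_lines(mean_x=10, mean_y=90, sd_x=3, sd_y=12, r=0.8)
print(round(a, 2), round(b, 2))    # 58.0 3.2  ->  y = 58 + 3.2x
print(round(c, 2), round(d, 2))    # -8.0 0.2  ->  x = -8 + 0.2y
print(round(a + b * 15, 2))        # 106.0: likely sales for a budget of Rs 15 lakh
print(round(c + d * 120, 2))       # 16.0: budget needed for a sales target of Rs 120 lakh
```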
Example 14.8: In a partially destroyed laboratory record of an analysis of regression
data, the following results only are legible:
Variance of x = 9
Regression equations : 8x – 10y + 66 = 0 and 40x – 18y = 214
Find on the basis of the above information:
(a) The mean values of x and y,
(b) Coefficient of correlation between x and y, and
(c) Standard deviation of y. [Pune Univ., MBA, 1996; CA May 1999]
Solution: (a) Since two regression lines always intersect at the point (x̄, ȳ) representing the
mean values of the variables involved, we solve the given regression equations to get the mean
values x̄ and ȳ as shown below:
8x – 10y = – 66
40x – 18y = 214
Multiplying the first equation by 5 and subtracting from the second, we have
32y = 544 or y = 17, i.e. y = 17
Substituting the value of y in the first equation, we get
8x – 10(17) = – 66 or x = 13, that is, x = 13
(b) To find correlation coefficient r between x and y, we need to determine the
regression coefficients bxy and byx.
Rewriting the given regression equations in such a way that the coefficient of the dependent
variable is less than one in at least one equation:
8x − 10y = −66   or   10y = 66 + 8x   or   y = 66/10 + (8/10)x
That is, byx = 8/10 = 0.80
40x − 18y = 214   or   40x = 214 + 18y   or   x = 214/40 + (18/40)y
That is, bxy = 18/40 = 0.45
Hence the coefficient of correlation r between x and y is given by
r = √(bxy × byx) = √(0.45 × 0.80) = 0.60
(c) To determine the standard deviation of y, consider the formula:
byx = r (σy/σx)   or   σy = byx σx / r = (0.80 × 3)/0.6 = 4
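A sketch — assumed here, not part of the text — reproducing the steps of Example 14.8 numerically: solving the two regression lines for the means, reading off byx and bxy, and recovering r and σy.

```python
import math

# Regression lines of Example 14.8:  8x - 10y = -66  and  40x - 18y = 214
a1, b1, c1 = 8, -10, -66
a2, b2, c2 = 40, -18, 214

# The lines intersect at (x_bar, y_bar); solve the 2x2 system by Cramer's rule.
det = a1 * b2 - a2 * b1
x_bar = (c1 * b2 - c2 * b1) / det
y_bar = (a1 * c2 - a2 * c1) / det
print(x_bar, y_bar)                    # 13.0 17.0

b_yx = 8 / 10                          # from 10y = 66 + 8x
b_xy = 18 / 40                         # from 40x = 214 + 18y
r = math.sqrt(b_yx * b_xy)             # both coefficients positive, so r = +0.6
sigma_y = b_yx * math.sqrt(9) / r      # using b_yx = r * sigma_y / sigma_x, Var(x) = 9
print(round(r, 2), round(sigma_y, 2))  # 0.6 4.0
```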
Example 14.9: There are two series of index numbers, P for price index and S for stock
of a commodity. The mean and standard deviation of P are 100 and 8 and of S are 103
and 4 respectively. The correlation coefficient between the two series is 0.4. With these

data, work out a linear equation to read off values of P for various values of S. Can the
same equation be used to read off values of S for various values of P?
Solution: The regression equation to read off values of P for various values of S is given by
P = a + bS   or   P − P̄ = r (σp/σs)(S − S̄)
Given P̄ = 100, S̄ = 103, σp = 8, σs = 4, r = 0.4. Substituting these values in the
above equation, we have
P − 100 = 0.4 (8/4)(S − 103)   or   P = 17.6 + 0.8S
This equation cannot be used to read off values of S for various values of P. To read
off values of S for various values of P we use another regression equation of the form:
S = c + dP   or   S − S̄ = r (σs/σp)(P − P̄)
Substituting the given values in this equation, we have
S − 103 = 0.4 (4/8)(P − 100)   or   S = 83 + 0.2P
8
Example 14.10: The two regression lines obtained in a correlation analysis of 60
observations are:
5x = 6y + 24 and 1000y = 768x − 3608
What is the correlation coefficient and what is its probable error? Show that the ratio
of the coefficient of variability of x to that of y is 5/24. What is the ratio of variances of x
and y?
Solution: Rewriting the regression equations:
5x = 6y + 24   or   x = (6/5)y + 24/5,   that is, bxy = 6/5
1000y = 768x − 3608   or   y = (768/1000)x − 3608/1000,   that is, byx = 768/1000
We know that bxy = r (σx/σy) = 6/5 and byx = r (σy/σx) = 768/1000; therefore
bxy byx = r² = (6/5) × (768/1000) = 0.9216
Hence r = √0.9216 = 0.96.
Since both bxy and byx are positive, the correlation coefficient is positive and hence
r = 0.96.
Probable error of r = 0.6745 (1 − r²)/√n = 0.6745 [1 − (0.96)²]/√60 = 0.0528/7.7459 = 0.0068
Solving the given regression equations for x and y, we get x̄ = 6 and ȳ = 1, because the
regression lines pass through the point (x̄, ȳ).
Since r (σx/σy) = 6/5, we have 0.96 (σx/σy) = 6/5, or σx/σy = 6/(5 × 0.96) = 5/4
Also the ratio of the coefficients of variability = (σx/x̄)/(σy/ȳ) = (ȳ/x̄)(σx/σy) = (1/6) × (5/4) = 5/24
The ratio of the variances is therefore σx²/σy² = (5/4)² = 25/16.

14.7.3 Regression Coefficients for Grouped Sample Data


The method of finding the regression coefficients bxy and byx is slightly different from the
method discussed earlier when the data set is grouped or classified into a frequency
distribution of either variable x or y or both. The values of bxy and byx are then
calculated using the formulae:

bxy = [nΣ f dx dy − (Σ f dx)(Σ f dy)] / [nΣ f dy² − (Σ f dy)²] × (h/k)
byx = [nΣ f dx dy − (Σ f dx)(Σ f dy)] / [nΣ f dx² − (Σ f dx)²] × (k/h)
where h = width of the class interval of sample data on x variable
k = width of the class interval of sample data on y variable
Example 14.11: The following bivariate frequency distribution relates to sales turnover
(Rs in lakh) and money spent on advertising (Rs in 1000’s). Obtain the two regression
equations

Sales Turnover        Advertising Budget (Rs in 1000’s)
(Rs in lakh)          50–60    60–70    70–80    80–90
20–50                 2        1        2        5
50–80                 3        4        7        6
80–110                1        5        8        6
110–140               2        7        9        2

Estimate (a) the sales turnover corresponding to advertising budget of Rs 1,50,000,


and (b) the advertising budget to achieve a sales turnover of Rs 200 lakh.
Solution: Let x and y represent sales turnover and advertising budget respectively. Then
the regression equation for estimating the sales turnover (x) on advertising budget (y) is
expressed as:
x − x̄ = bxy (y − ȳ)
where bxy = [nΣ f dx dy − (Σ f dx)(Σ f dy)] / [nΣ f dy² − (Σ f dy)²] × (h/k)

Table 14.6: Calculations for Regression Coefficients

                              Advertising Budget, y
                              50–60      60–70      70–80      80–90
                    m.v.      55         65         75         85
Sales               dy        −2         −1         0          1         f       f dx    f dx²    f dx dy
Turnover, x   m.v.  dx
20–50         35    −1        2 (4)      1 (1)      2 (0)      5 (−5)    10      −10     10       0
50–80         65    0         3 (0)      4 (0)      7 (0)      6 (0)     20      0       0        0
80–110        95    1         1 (−2)     5 (−5)     8 (0)      6 (6)     20      20      20       −1
110–140       125   2         2 (−8)     7 (−14)    9 (0)      2 (4)     20      40      80       −18
f                             8          17         26         19        n = 70  50      110      −19
f dy                          −16        −17        0          19        Σ f dy = −14
f dy²                         32         17         0          19        Σ f dy² = 68
f dx dy                       −6         −18        0          5         Σ f dx dy = −19

(Figures in parentheses are the cell values f·dx·dy.)

Similarly, the regression equation for estimating the advertising budget (y) on sales
turnover (x) is written as:
y − ȳ = byx (x − x̄)
where byx = [nΣ f dx dy − (Σ f dx)(Σ f dy)] / [nΣ f dx² − (Σ f dx)²] × (k/h)
The calculations for the regression coefficients bxy and byx are shown in Table 14.6.
x̄ = A + (Σ f dx / n) × h = 65 + (50/70) × 30 = 65 + 21.428 = 86.428
ȳ = B + (Σ f dy / n) × k = 75 − (14/70) × 10 = 75 − 2 = 73
bxy = [nΣ f dx dy − (Σ f dx)(Σ f dy)] / [nΣ f dy² − (Σ f dy)²] × (h/k)
    = [70 × (−19) − (50)(−14)] / [70 × 68 − (−14)²] × (30/10)
    = (−1330 + 700)/(4760 − 196) × (30/10) = −18,900/45,640 = −0.414
byx = [nΣ f dx dy − (Σ f dx)(Σ f dy)] / [nΣ f dx² − (Σ f dx)²] × (k/h)
    = [70 × (−19) − (50)(−14)] / [70 × 110 − (50)²] × (10/30)
    = (−1330 + 700)/(7700 − 2500) × (10/30) = −6300/1,56,000 = −0.040
Substituting these values in the two regression equations, we get
(a) The regression equation of sales turnover (x) on advertising budget (y):
x − x̄ = bxy (y − ȳ)
x − 86.428 = −0.414 (y − 73)   or   x = 116.65 − 0.414y
For y = 150, we have x = 116.65 − 0.414 × 150 = Rs 54.55 lakh
(b) The regression equation of advertising budget (y) on sales turnover (x):
y − ȳ = byx (x − x̄)
y − 73 = −0.040 (x − 86.428)   or   y = 76.457 − 0.04x
For x = 200, we have y = 76.457 − 0.04 (200) = Rs 68.457 thousand.
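The grouped-data calculation of Example 14.11 can be sketched in Python as follows (an illustration assumed here, not part of the text), using the step deviations dx = (midpoint − A)/h and dy = (midpoint − B)/k from Table 14.6.

```python
# Bivariate frequency table of Example 14.11: rows are sales-turnover classes (x),
# columns are advertising-budget classes (y).
freq = [
    [2, 1, 2, 5],    # 20-50
    [3, 4, 7, 6],    # 50-80
    [1, 5, 8, 6],    # 80-110
    [2, 7, 9, 2],    # 110-140
]
dx = [-1, 0, 1, 2]   # (x midpoint - A)/h with A = 65, h = 30
dy = [-2, -1, 0, 1]  # (y midpoint - B)/k with B = 75, k = 10
A, h, B, k = 65, 30, 75, 10

row_f = [sum(row) for row in freq]                              # marginal frequencies of x
col_f = [sum(freq[i][j] for i in range(4)) for j in range(4)]   # marginal frequencies of y
n = sum(row_f)

Sfdx   = sum(f * d for f, d in zip(row_f, dx))
Sfdx2  = sum(f * d * d for f, d in zip(row_f, dx))
Sfdy   = sum(f * d for f, d in zip(col_f, dy))
Sfdy2  = sum(f * d * d for f, d in zip(col_f, dy))
Sfdxdy = sum(freq[i][j] * dx[i] * dy[j] for i in range(4) for j in range(4))
print(n, Sfdx, Sfdx2, Sfdy, Sfdy2, Sfdxdy)       # 70 50 110 -14 68 -19, as in Table 14.6

x_bar = A + Sfdx / n * h                         # ≈ 86.43
y_bar = B + Sfdy / n * k                         # 73.0
b_xy = (n * Sfdxdy - Sfdx * Sfdy) / (n * Sfdy2 - Sfdy ** 2) * h / k   # ≈ -0.414
b_yx = (n * Sfdxdy - Sfdx * Sfdy) / (n * Sfdx2 - Sfdx ** 2) * k / h   # ≈ -0.040
print(round(x_bar, 2), round(y_bar, 2), round(b_xy, 3), round(b_yx, 3))
```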

Self-Practice Problems 14A

14.1 The following calculations have been made for the prices of twelve stocks (x) at the Calcutta Stock Exchange on a certain day, along with the volume of sales in thousands of shares (y). From these calculations find the regression equation of price of stocks on the volume of sales of shares.
    Σ x = 580, Σ y = 370, Σ xy = 11494, Σ x² = 41658, Σ y² = 17206
    [Rajasthan Univ., MCom, 1995]
14.2 A survey was conducted to study the relationship between expenditure (in Rs) on accommodation (x) and expenditure on food and entertainment (y), and the following results were obtained:

                                                 Mean     Standard Deviation
    • Expenditure on accommodation               173.1    63.15
    • Expenditure on food and entertainment       47.8    22.98
    • Coefficient of correlation r = 0.57

    Write down the regression equation and estimate the expenditure on food and entertainment if the expenditure on accommodation is Rs 200.
    [Bangalore Univ., BCom, 1998]
14.3 The following data give the experience of machine operators and their performance ratings given by the number of good parts turned out per 100 pieces:

    Operator                : 1   2   3   4   5   6   7   8
    Experience (x)          : 16  12  18  4   3   10  5   12
    Performance ratings (y) : 87  88  89  68  78  80  75  83

    Calculate the regression line of performance ratings on experience and estimate the probable performance if an operator has 7 years of experience.
    [Jammu Univ., MCom; Lucknow Univ., MBA, 1996]
14.4 A study of prices of a certain commodity at Delhi and Mumbai yields the following data:

                                                   Delhi    Mumbai
    • Average price per kilo (Rs)                  2.463    2.797
    • Standard deviation                           0.326    0.207
    • Correlation coefficient between prices at Delhi and Mumbai, r = 0.774

    Estimate from the above data the most likely price (a) at Delhi corresponding to the price of Rs 2.334 per kilo at Mumbai, and (b) at Mumbai corresponding to the price of Rs 3.052 per kilo at Delhi.
14.5 The following table gives the aptitude test scores and productivity indices of 10 workers selected at random:

    Aptitude scores (x)    : 60  62  65  70  72  48  53  73  65  82
    Productivity index (y) : 68  60  62  80  85  40  52  62  60  81

    Calculate the two regression equations and estimate (a) the productivity index of a worker whose test score is 92, (b) the test score of a worker whose productivity index is 75.
    [Delhi Univ., MBA, 2001]
14.6 A company wants to assess the impact of R&D expenditure (Rs in 1000's) on its annual profit (Rs in 1000's). The following table presents the information for the last eight years:

    Year    R&D expenditure    Annual profit
    1991    9                  45
    1992    7                  42
    1993    5                  41
    1994    10                 60
    1995    4                  30
    1996    5                  34
    1997    3                  25
    1998    2                  20

    Estimate the regression equation and predict the annual profit for the year 2002 for an allocated sum of Rs 1,00,000 as R&D expenditure.
    [Jodhpur Univ., MBA, 1998]
14.7 Obtain the two regression equations from the following bivariate frequency distribution:

    Sales Revenue        Advertising Expenditure (Rs in thousand)
    (Rs in lakh)         5–15    15–25    25–35    35–45
    75–125               3       4        4        8
    125–175              8       6        5        7
    175–225              2       2        3        4
    225–275              3       3        2        2

    Estimate (a) the sales corresponding to advertising expenditure of Rs 50,000, (b) the advertising expenditure for a sales revenue of Rs 300 lakh, (c) the coefficient of correlation.
    [Delhi Univ., MBA, 2002]
14.8 The personnel manager of an electronic manufacturing company devises a manual test for job applicants to predict their production rating in the assembly department. In order to do this he selects a random sample of 10 applicants. They are given the test and later assigned a production rating. The results are as follows:

    Worker            : A   B   C   D   E   F   G   H   I   J
    Test score        : 53  36  88  84  86  64  45  48  39  69
    Production rating : 45  43  89  79  84  66  49  48  43  76

    Fit a linear least squares regression equation of production rating on test score.
    [Delhi Univ., MBA, 200]
14.9 Find the regression equation showing the capacity utilization on production from the following data:

                                          Average    Standard Deviation
    • Production (in lakh units)          35.6       10.5
    • Capacity utilization (in per cent)  84.8       8.5
    • Correlation coefficient r = 0.62

    Estimate the production when the capacity utilization is 70 per cent.
    [Delhi Univ., MBA, 1997; Pune Univ., MBA, 1998]
14.10 Suppose that you are interested in using past expenditure on R&D by a firm to predict current expenditure on R&D. You got the following data by taking a random sample of firms, where x is the amount spent on R&D (in lakh of rupees) 5 years ago and y is the amount spent on R&D (in lakh of rupees) in the current year:

    x : 30  50  20  80   10  20  20  40
    y : 50  80  30  110  20  20  40  50

    (a) Find the regression equation of y on x.
    (b) If a firm is chosen randomly and x = 10, can you use the regression to predict the value of y? Discuss.
    [Madurai-Kamraj Univ., MBA, 2000]
14.11 The following data relate to the scores obtained by salesmen of a company in an intelligence test and their weekly sales (in Rs 1000's):

    Salesman     : A   B   C   D   E   F   G   H   I
    Test score   : 50  60  50  60  80  50  80  40  70
    Weekly sales : 30  60  40  50  60  30  70  50  60

    (a) Obtain the regression equation of sales on intelligence test scores of the salesmen.
    (b) If the intelligence test score of a salesman is 65, what would be his expected weekly sales?
    [HP Univ., MCom, 1996]
14.12 Two random variables have the regression equations:
    3x + 2y − 26 = 0 and 6x + y − 31 = 0
    (a) Find the mean values of x and y and the coefficient of correlation between x and y.
    (b) If the variance of x is 25, find the standard deviation of y from the data.
    [MD Univ., MCom, 1997; Kumaun Univ., MBA, 2001]
14.13 For a given set of bivariate data, the following results were obtained:
    x̄ = 53.2, ȳ = 27.9, regression coefficient of y on x = −1.5, and regression coefficient of x on y = −0.2
    Find the most probable value of y when x = 60.
14.14 In trying to evaluate the effectiveness of its advertising campaign, a firm compiled the following information:

    Year    Adv. expenditure (Rs 1000's)    Sales (Rs in lakh)
    1996    12                              5.0
    1997    15                              5.6
    1998    17                              5.8
    1999    23                              7.0
    2000    24                              7.2
    2001    38                              8.8
    2002    42                              9.2
    2003    48                              9.5

    Calculate the regression equation of sales on advertising expenditure. Estimate the probable sales when advertisement expenditure is Rs 60 thousand.
    [Bharathidasan Univ., MBA, 2003]

Hints and Answers

14.1 x̄ = Σx/n = 580/12 = 48.33; ȳ = Σy/n = 370/12 = 30.83
    bxy = [Σ xy − n x̄ ȳ] / [Σ y² − n (ȳ)²] = [11494 − 12 × 48.33 × 30.83] / [17206 − 12 (30.83)²] = −1.102
    Regression equation of x on y: x − x̄ = bxy (y − ȳ)
    x − 48.33 = −1.102 (y − 30.83) or x = 82.304 − 1.102y
14.2 Given x̄ = 173, ȳ = 47.8, σx = 63.15, σy = 22.98, and r = 0.57.
    Regression equation of food and entertainment (y) on accommodation (x):
    y − ȳ = r (σy/σx)(x − x̄)
    y − 47.8 = 0.57 (22.98/63.15)(x − 173) or y = 11.917 + 0.207x
    For x = 200, we have y = 11.917 + 0.207(200) = 53.317
14.3 Let experience and performance rating be represented by x and y respectively.
    x̄ = Σx/n = 80/8 = 10; ȳ = Σy/n = 648/8 = 81
    byx = Σ dx dy / Σ dx² = 247/218 = 1.133, where dx = x − x̄, dy = y − ȳ
    Regression equation of y on x: y − ȳ = byx (x − x̄)
    or y − 81 = 1.133 (x − 10) or y = 69.67 + 1.133x
    When x = 7, y = 69.67 + 1.133(7) = 77.60 ≅ 78
14.4 Let the price at Mumbai and Delhi be represented by x and y respectively.
    (a) Regression equation of y on x: y − ȳ = r (σy/σx)(x − x̄)
    y − 2.463 = 0.774 (0.326/0.207)(x − 2.797)
    For x = Rs 2.334, the price at Delhi would be y = Rs 1.899.
    (b) Regression equation of x on y: x − x̄ = r (σx/σy)(y − ȳ)
    or x − 2.797 = 0.774 (0.207/0.326)(y − 2.463)
    For y = Rs 3.052, the price at Mumbai would be x = Rs 3.086.
14.5 Let aptitude score and productivity index be represented by x and y respectively.
    x̄ = Σx/n = 650/10 = 65; ȳ = Σy/n = 650/10 = 65
    bxy = Σ dx dy / Σ dy² = 1044/1752 = 0.596, where dx = x − x̄, dy = y − ȳ
    (a) Regression equation of x on y: x − x̄ = bxy (y − ȳ)
    or x − 65 = 0.596 (y − 65) or x = 26.26 + 0.596y
    When y = 75, x = 26.26 + 0.596(75) = 70.96 ≅ 71
    (b) byx = Σ dx dy / Σ dx² = 1044/894 = 1.168
    y − ȳ = byx (x − x̄) or y − 65 = 1.168 (x − 65) or y = −10.92 + 1.168x
    When x = 92, y = −10.92 + 1.168(92) = 96.536 ≅ 97
14.6 Let R&D expenditure and annual profit be denoted by x and y respectively.
    x̄ = Σx/n = 45/8 = 5.625; ȳ = Σy/n = 297/8 = 37.125
    byx = [nΣ dx dy − (Σ dx)(Σ dy)] / [nΣ dx² − (Σ dx)²] = [8 × 238 − (−3)(1)] / [8 × 57 − (−3)²] = 4.266,
    where dx = x − 6, dy = y − 37
    Regression equation of annual profit on R&D expenditure: y − ȳ = byx (x − x̄)
    y − 37.125 = 4.266 (x − 5.625) or y = 13.163 + 4.266x
    For an R&D expenditure of Rs 1,00,000 (x = 100 in Rs 1000's), the equation gives y = 439.763 (Rs in 1000's) as annual profit.
14.7 Let sales revenue and advertising expenditure be denoted by x and y respectively.
    x̄ = A + (Σ f dx / n) × h = 150 + (12/66) × 50 = 159.09
    ȳ = B + (Σ f dy / n) × k = 30 − (26/66) × 10 = 26.06
    (a) bxy = [nΣ f dx dy − (Σ f dx)(Σ f dy)] / [nΣ f dy² − (Σ f dy)²] × h/k
        = [66(−14) − 12(−26)] / [66(100) − (−26)²] × 50/10 = −0.516
    Regression equation of x on y: x − x̄ = bxy (y − ȳ)
    x − 159.09 = −0.516 (y − 26.06) or x = 172.536 − 0.516y
    For y = 50, x = 146.736
    (b) byx = [nΣ f dx dy − (Σ f dx)(Σ f dy)] / [nΣ f dx² − (Σ f dx)²] × k/h
        = [66(−14) − 12(−26)] / [66(70) − (12)²] × 10/50 = −0.027
    y − ȳ = byx (x − x̄)
    y − 26.06 = −0.027 (x − 159.09) or y = 30.355 − 0.027x
    For x = 300, y = 22.255
    (c) r = −√(bxy × byx) = −√(0.516 × 0.027) = −0.118
14.8 Let test score and production rating be denoted by x and y respectively.
    x̄ = Σx/n = 612/10 = 61.2; ȳ = Σy/n = 622/10 = 62.2
    byx = [nΣ dx dy − (Σ dx)(Σ dy)] / [nΣ dx² − (Σ dx)²] = [10 × 3213 − 2 × 2] / [10 × 3554 − (2)²] = 0.904
    Regression equation of production rating (y) on test score (x): y − ȳ = byx (x − x̄)
    y − 62.2 = 0.904 (x − 61.2) or y = 6.876 + 0.904x
14.9 Let production and capacity utilization be denoted by x and y respectively.
    (a) Regression equation of capacity utilization (y) on production (x):
    y − ȳ = r (σy/σx)(x − x̄)
    y − 84.8 = 0.62 (8.5/10.5)(x − 35.6) or y = 66.9324 + 0.5019x
    (b) Regression equation of production (x) on capacity utilization (y):
    x − x̄ = r (σx/σy)(y − ȳ)
    x − 35.6 = 0.62 (10.5/8.5)(y − 84.8) or x = −29.3483 + 0.7659y
    When y = 70, x = −29.3483 + 0.7659(70) = 24.2647
    Hence the estimated production is about 24.26 lakh units when the capacity utilization is 70 per cent.
14.10 x̄ = Σx/n = 270/8 = 33.75; ȳ = Σy/n = 400/8 = 50
    byx = [nΣ dx dy − (Σ dx)(Σ dy)] / [nΣ dx² − (Σ dx)²] = [8 × 4800 − 6 × 0] / [8 × 3592 − (6)²] = 1.338,
    where dx = x − 33 and dy = y − 50
    Regression equation of y on x: y − ȳ = byx (x − x̄)
    y − 50 = 1.338 (x − 33.75) or y = 4.84 + 1.338x
    For x = 10, y = 18.22
14.11 Let intelligence test score be denoted by x and weekly sales by y.
    x̄ = 540/9 = 60; ȳ = 450/9 = 50
    byx = [nΣ dx dy − (Σ dx)(Σ dy)] / [nΣ dx² − (Σ dx)²] = (9 × 1200)/(9 × 1600) = 0.75
    Regression equation of y on x: y − ȳ = byx (x − x̄)
    y − 50 = 0.75 (x − 60) or y = 5 + 0.75x
    For x = 65, y = 5 + 0.75(65) = 53.75
14.12 (a) Solving the two regression lines
    3x + 2y = 26 and 6x + y = 31
    we get the mean values x̄ = 4 and ȳ = 7.
    (b) Rewriting the regression lines as follows:
    3x + 2y = 26 or y = 13 − (3/2)x, so byx = −3/2
    6x + y = 31 or x = 31/6 − (1/6)y, so bxy = −1/6
    Correlation coefficient, r = −√(bxy × byx) = −√((3/2)(1/6)) = −0.5
    Given Var(x) = 25, so σx = 5. Calculate σy using the formula
    byx = r (σy/σx) or −3/2 = −0.5 (σy/5) or σy = 15
14.13 The regression equation of y on x is stated as:
    y − ȳ = byx (x − x̄) = r (σy/σx)(x − x̄)
    Given x̄ = 53.20, ȳ = 27.90, byx = −1.5, bxy = −0.2
    Thus y − 27.90 = −1.5 (x − 53.20) or y = 107.70 − 1.5x
    For x = 60, we have y = 107.70 − 1.5(60) = 17.7
    Also r = −√(byx × bxy) = −√(1.5 × 0.2) = −0.5477
14.14 Let advertising expenditure and sales be denoted by x and y respectively.
    x̄ = Σx/n = 217/8 = 27.125; ȳ = Σy/n = 58.2/8 = 7.26
    byx = [nΣ dx dy − (Σ dx)(Σ dy)] / [nΣ dx² − (Σ dx)²] = [8(172.2) − (25)(2.1)] / [8(1403) − (25)²]
        = 1325.1/10599 = 0.125
    Thus the regression equation of y on x is: y − ȳ = byx (x − x̄)
    or y − 7.26 = 0.125(x − 27.125) or y = 3.869 + 0.125x
    When x = 60, the estimated value of y = 3.869 + 0.125(60) = 11.369

14.8 STANDARD ERROR OF ESTIMATE AND PREDICTION INTERVALS


The pattern of dot points on a scatter diagram is an indicator of the relationship between
two variables x and y. Wide scatter or variation of the dot points about the regression line
represents a poor relationship. But a very close scatter of dot points about the regression
line represents a close relationship between two variables. The variability in observed
values of dependent variable y about the regression line is measured in terms of residuals.
A residual is defined as the difference between an observed value of dependent variable
y and its estimated (or fitted) value ŷ determined by regression equation for a given
value of the independent variable x. The residual about the regression line is given by
Residual:  eᵢ = yᵢ − ŷᵢ
The residual values ei are plotted on a diagram with respect to the least squares
regression line ŷ = a + bx. These residual values represent error of estimation for
individual values of dependent variable and are used to estimate, the variance σ2 of the
error term. In other words, residuals are used to estimate the amount of variation in the
dependent variable with respect to least squares regression line. Here it should be noted
that the variations are not the variations (deviations) of observations from the mean value
in the sample data set, rather these variations are the vertical distances of every observation
(dot point) from the least squares line as shown in Fig. 14.3.
Since the sum of the residuals is zero, it is not possible to determine the total
amount of error by simply summing the residuals. This zero-sum characteristic of residuals can
be avoided by squaring the residuals and then summing them. That is,
Σ eᵢ² = Σ (yᵢ − ŷᵢ)²   ← error or residual sum of squares
This quantity is called the sum of squares of errors (SSE).
The estimate of the variance of the error term, σ²e or S²yx, is obtained as follows:
S²yx or σ̂²e = Σ eᵢ² / (n − 2) = Σ (yᵢ − ŷᵢ)² / (n − 2) = SSE/(n − 2)
The denominator, n – 2 represents the error or residual degrees of freedom and is determined
by subtracting from sample size n the number of parameters β0 and β1 that are estimated
by the sample parameters a and b in the least squares equation. The subscript ‘yx’ indicates
that the standard deviation is of dependent variable y, given (or conditional) upon
independent variable x.
The standard error of estimate Syx, also called the standard deviation of the error term, measures
the variability of the observed values around the regression line, i.e. the amount

by which the estimated ŷ values differ from the observed sample y values (dot points). In other words, Syx
is based on the deviations of the sample observations of y from the least squares
line, i.e. the estimated regression line of y values. The standard deviation of the errors about
the least squares line is defined as:

Syx or σe = √[ Σ (y − ŷ)² / (n − 2) ] = √[ SSE / (n − 2) ]               (14-4)

Figure 14.3
Residuals

To simplify the calculations of Syx, generally the following formula is used

Syx = √[ Σ (y − ŷ)² / (n − 2) ] = √[ (Σ y² − aΣ y − bΣ xy) / (n − 2) ]
The variance S²yx measures how well the least squares line fits the sample y-values. A large
variance and standard error of estimate indicate a large amount of scatter or dispersion
of dot points around the line. The smaller the value of Syx, the closer the dot points (y-values)
fall around the regression line, the better the line fits the data, and the better it describes the
average relationship between the two variables. When all dot points fall on the line, the
value of Syx is zero, and the relationship between the two variables is perfect.
A smaller variance about the regression line is considered useful in predicting the
value of a dependent variable y. In actual practice, some variability is always left over
about the regression line. It is important to measure such variability due to the following
reasons:
(i) This value provides a way to determine the usefulness of the regression line in
predicting the value of the dependent variable.
(ii) This value can be used to construct interval estimates of the dependent variable.
(iii) Statistical inferences can be made about other components of the problem.
Figure 14.4 displays the distribution of the conditional values of y about the least
squares regression line for given values of the independent variable x. Suppose the deviations
in the values of y, for any particular value of x, follow a normal distribution.
Since the average value of y changes with the value of x, we have a different normal distribution
of y-values for every value of x, each having the same standard deviation. When a relationship
between the two variables x and y exists, this conditional standard deviation (the standard error of
estimate) is less than the standard deviation of all the y-values in the population computed
about their mean.
Based on the assumptions of regression analysis, we can describe sampling properties
of the sample estimates such as a, b, and Syx, as these vary from sample to sample. Such
knowledge is useful in making statistical inferences about the relationship between the
two variables x and y.

Figure 14.4
Regression Line Showing the
Error Variance

The standard error of estimate can also be used to determine an approximate interval
estimate based on sample data (n < 30) for the value of the dependent variable y for a
given value of the independent variable x as follows:
        Approximate interval estimate = ŷ ± tdf Syx
where the value of tdf is obtained from the t-distribution table, with df = n − 2 degrees of freedom,
for a chosen probability (confidence) level. The interval estimate is also called a prediction interval.
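A minimal Python sketch of this approximate prediction interval is shown below; the numerical values of a, b, Syx, n and x0 are assumed (hypothetical) inputs, as if taken from a previously fitted line, and scipy.stats.t.ppf is used only to look up the t value.

```python
from scipy import stats

# Assumed (hypothetical) results of a previously fitted least squares line
a, b = 4.0, 1.5          # intercept and slope of y_hat = a + b*x
s_yx = 2.0               # standard error of estimate
n = 10                   # sample size used in the fit

x0 = 12                  # value of x at which y is to be predicted
y_hat = a + b * x0

# t value for a 95% interval with n - 2 degrees of freedom
t = stats.t.ppf(0.975, df=n - 2)

lower, upper = y_hat - t * s_yx, y_hat + t * s_yx
print(f"approximate 95% prediction interval: ({lower:.2f}, {upper:.2f})")
```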
Example 14.12: The following data relate to advertising expenditure (Rs in lakh) and
their corresponding sales (Rs in crore)
Advertising expenditure : 10 12 15 23 20
Sales : 14 17 23 25 21
(a) Find the equation of the least squares line fitting the data.
(b) Estimate the value of sales corresponding to advertising expenditure of Rs 30 lakh.
(c) Calculate the standard error of estimate of sales on advertising expenditure.
Solution: Let the advertising expenditure be denoted by x and sales by y.
(a) The calculations for the least squares line are shown in Table 14.7

Table 14.7: Calculations for Least Squares Line

Advt. Expenditure, x   dx = x − 16   dx²    Sales, y   dy = y − 20   dy²    dx dy
        10                 −6         36       14          −6         36      36
        12                 −4         16       17          −3          9      12
        15                 −1          1       23           3          9      −3
        23                  7         49       25           5         25      35
        20                  4         16       21           1          1       4
        80                  0        118      100           0         80      84

x̄ = Σx/n = 80/5 = 16;    ȳ = Σy/n = 100/5 = 20

        byx = [n Σdx dy − (Σdx)(Σdy)] / [n Σdx² − (Σdx)²] = (5 × 84 − 0)/(5 × 118 − 0) = 0.712

Regression equation of y on x:
        y − ȳ = byx (x − x̄)
        y − 20 = 0.712 (x − 16)
        y = 8.608 + 0.712x
where parameter a = 8.608 and b = 0.712.
Table 14.8 gives the fitted values and the residuals for the data in Table 14.7. The
fitted values are obtained by substituting the value of x into the regression equation
(equation for the least squares line). For example, 8.608 + 0.712(10) = 15.728. The
residual is equal to the actual value minus fitted value. The residuals indicate how well
the least squares line fits the actual data values.

Table 14.8: Fitted Values and Residuals for Sample Data

Value, x   Actual Value, y   Fitted Value, ŷ = 8.608 + 0.712x   Residual, y − ŷ
   10            14                      15.728                     −1.728
   12            17                      17.152                     −0.152
   15            23                      19.288                      3.712
   23            25                      24.984                      0.016
   20            21                      22.848                     −1.848

(As expected, the residuals sum to zero.)

(b) The least squares equation obtained in part (a) may be used to estimate the sales
turnover corresponding to the advertising expenditure of Rs 30 lakh as:
ŷ = 8.608 + 0.712x = 8.608 + 0.712 (30) = Rs 29.968 crore
(c) Calculations for standard error of estimate Sy⋅x of sales (y) on advertising expenditure
(x) are shown in Table 14.9.

Table 14.9: Calculations for Standard Error of Estimate

x        y        y²        xy
10 14 196 140
12 17 289 204
15 23 529 345
23 25 625 575
20 21 441 420
80 100 2080 1684

        Sy·x = √[ (Σy² − a Σy − b Σxy) / (n − 2) ]
             = √[ (2080 − 8.608 × 100 − 0.712 × 1684) / (5 − 2) ]
             = √[ (2080 − 860.8 − 1199.008) / 3 ] = √(20.192/3) = 2.594
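The hand calculations of Example 14.12 can be cross-checked with a short Python sketch (an illustrative verification, not part of the original solution). NumPy's polyfit keeps more decimal places than the rounded coefficients above, so the results agree only up to rounding.

```python
import numpy as np

adv = np.array([10, 12, 15, 23, 20], dtype=float)    # advertising expenditure (Rs lakh)
sales = np.array([14, 17, 23, 25, 21], dtype=float)  # sales (Rs crore)

b, a = np.polyfit(adv, sales, 1)          # slope and intercept of the least squares line
print(round(a, 3), round(b, 3))           # approx. 8.610 and 0.712

print(round(a + b * 30, 3))               # estimated sales for Rs 30 lakh, approx. 29.97

resid = sales - (a + b * adv)
s_yx = np.sqrt(np.sum(resid**2) / (len(adv) - 2))
print(round(s_yx, 3))                     # approx. 2.59 (the hand calculation gives 2.594)
```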

14.8.1 Coefficient of Determination: Partitioning of Total Variation


The objective of regression analysis is to develop a regression model that best fits the
sample data, so that the residual variance S²y·x is as small as possible. But the value of S²y·x
depends on the scale with which the sample y-values are measured. This drawback with
the calculation of S²y·x restricts its interpretation unless we consider the units in which
the y-values are measured. Thus, we need another measure of fit called coefficient of
determination that is not affected by the scale with which the sample y-values are measured.
It is the proportion of variability of the dependent variable, y accounted for or explained by the
independent variable, x, i.e. it measures how well (i.e. strength) the regression line fits the
data. The coefficient of determination is denoted by r2 and its value ranges from 0 to 1.
A particular r2 value should be interpreted as high or low depending upon the use and
context in which the regression model was developed. The coefficient of determination
is given by
        r² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST
           = 1 − (Residual variation of the y-values about the least squares line)/(Total variation of the y-values about their mean)

where  SST = total sum of squared deviations (total variation) of the sampled response
             variable y-values from their mean value ȳ
             = Syy = Σ (yi − ȳ)² = Σ yi² − n(ȳ)²

       SSE = sum of squares of error, or unexplained variation of the response variable
             y-values from the least squares line, i.e. the residual variation in the data
             that is not explained by the predictor variable x
             = Σ (yi − ŷi)² = Σ yi² − a Σ yi − b Σ xi yi

       SSR = sum of squares of regression, or explained variation: the variation in the
             sample values of the response variable y accounted for (explained) by the
             variation among the x-values
             = SST − SSE = Σ (ŷi − ȳ)² = a Σ yi + b Σ xi yi − n(ȳ)²

(all sums run over i = 1 to n)
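A small Python sketch (illustrative only, with hypothetical data) shows this partition in action: it computes SST, SSE and SSR directly and confirms that SST = SSR + SSE and that r² = SSR/SST = 1 − SSE/SST.

```python
import numpy as np

# Hypothetical sample data
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3, 4, 4, 6, 7, 9], dtype=float)

b, a = np.polyfit(x, y, 1)             # least squares slope and intercept
y_hat = a + b * x

sst = np.sum((y - y.mean())**2)        # total variation
sse = np.sum((y - y_hat)**2)           # unexplained (residual) variation
ssr = np.sum((y_hat - y.mean())**2)    # explained variation

assert np.isclose(sst, ssr + sse)      # SST = SSR + SSE
r2 = ssr / sst
assert np.isclose(r2, 1 - sse / sst)   # both forms of r-squared agree
print(round(r2, 4))
```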
The three variations associated with the regression analysis of a data set are shown in
Fig 14.5. Thus
        r² = 1 − Σ(y − ŷ)²/Σ(y − ȳ)² = 1 − S²y·x/S²y ;        Sy·x = Sy √(1 − r²)

where   Σ(y − ŷ)²/Σ(y − ȳ)² = fraction of the total variation that is left unexplained
        (the residual fraction)

        S²y·x = Σ(y − ŷ)²/(n − 2), variance of the response variable y-values about the
        least squares line

        S²y = Σ(y − ȳ)²/(n − 2), total variance of the response variable y-values
Figure 14.5
Relationship Between Three
Types of Variations

Since this form of the r² formula is not always convenient to use, an easier computational
formula for the sample coefficient of determination is

        r² = [a Σy + b Σxy − n(ȳ)²] / [Σy² − n(ȳ)²]        ← Short-cut method
For example, the coefficient of determination that indicates the extent of relationship
between sales revenue (y) and advertising expenditure (x) is calculated as follows from
Example 14.1:

        r² = [a Σy + b Σxy − n(ȳ)²] / [Σy² − n(ȳ)²]
           = [0.072 × 40 + 0.704 × 373 − 8(5)²] / [270 − 8(5)²]
           = (2.88 + 262.592 − 200)/(270 − 200) = 65.472/70 ≈ 0.9352
The value r2 = 0.9352 indicates that 93.52% of the variance in sales revenue is accounted
for or statistically explained by advertising expenditure.
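The same figure can be reproduced from the quoted summary values alone, as the short sketch below shows (the numbers are those quoted above for Example 14.1; small differences from 0.9352 arise only from the rounding of a and b).

```python
# Summary figures quoted above for Example 14.1
a, b = 0.072, 0.704
sum_y, sum_xy, sum_y2 = 40, 373, 270
n, y_bar = 8, 5

# Short-cut formula for the coefficient of determination
r2 = (a * sum_y + b * sum_xy - n * y_bar**2) / (sum_y2 - n * y_bar**2)
print(round(r2, 4))   # approx. 0.935
```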
A comparison between bivariate correlation and regression, summarized in Table 14.10,
provides further insight into the relationship between two variables x and y in a data set.

Table 14.10: Comparison between Linear Correlation and Regression

                                  Correlation                               Regression
• Measurement level               Interval or ratio scale                   Interval or ratio scale
• Nature of variables             Both continuous, and linearly related     Both continuous, and linearly related
• x–y relationship                x and y are symmetric                     y is dependent, x is independent; the
                                                                            regression of x on y differs from that
                                                                            of y on x
• Symmetry of coefficients        rxy = ryx : the correlation between       byx and bxy differ in general
                                  x and y is the same as the
                                  correlation between y and x
• Coefficient of determination    Explains the common variance of           Proportion of variability of x explained
                                  x and y                                   by its least-squares regression on y

Conceptual Questions 14A

1. (a) Explain the concept of regression and point out its usefulness in dealing with business problems. [Delhi Univ., MBA, 1993]
   (b) Distinguish between correlation and regression. Also point out the properties of regression coefficients.
2. Explain the concept of regression and point out its importance in business forecasting. [Delhi Univ., MBA, 1990, 1998]
3. Under what conditions can there be one regression line? Explain. [HP Univ., MBA, 1996]
4. Why should a residual analysis always be done as part of the development of a regression model?
5. What are the assumptions of simple linear regression analysis and how can they be evaluated?
6. What is the meaning of the standard error of estimate?
7. What is the interpretation of the y-intercept and the slope in a regression model?
8. What are regression lines? With the help of an example, illustrate how they help in business decision-making. [Delhi Univ., MBA, 1998]
9. Point out the role of regression analysis in business decision-making. What are the important properties of regression coefficients? [Osmania Univ., MBA; Delhi Univ., MBA, 1999]
10. (a) Distinguish between correlation and regression analysis. [Dipl in Mgt., AIMA; Osmania Univ., MBA, 1998]
    (b) The coefficient of correlation and the coefficient of determination are available as measures of association in correlation analysis. Describe the different uses of these two measures of association.
11. What are regression coefficients? State some of the important properties of regression coefficients. [Dipl in Mgt., AIMA; Osmania Univ., MBA, 1989]
12. What is regression? How is this concept useful in business forecasting? [Jodhpur Univ., MBA, 1999]
13. What is the difference between a prediction interval and a confidence interval in regression analysis?
14. Explain what is required to establish evidence of a cause-and-effect relationship between y and x with regression analysis.
15. What technique is used initially to identify the kind of regression model that may be appropriate?
16. (a) What are regression lines? Why is it necessary to consider two lines of regression?
    (b) In case the two regression lines are identical, prove that the correlation coefficient is either +1 or –1. If two variables are independent, show that the two regression lines cut at right angles.
17. What are the purpose and meaning of the error terms in regression?
18. Give examples of business situations where you believe a straight-line relationship exists between two variables. What would be the uses of a regression model in each of these situations?
19. 'The regression lines give only the best estimate of the value of the quantity in question. We may assess the degree of uncertainty in the estimate by calculating a quantity known as the standard error of estimate.' Elucidate.
20. Explain the advantages of the least-squares procedure for fitting lines to data. Explain how the procedure works.

Formulae Used

1. Simple linear regression model
        y = β0 + β1x + e
2. Simple linear regression equation based on sample data
        ŷ = a + bx
3. Regression coefficients in the sample regression equation
        b = [n Σxy − (Σx)(Σy)] / [n Σx² − (Σx)²] ;    a = ȳ − b x̄
4. Residual, representing the difference between an observed value of the dependent variable y and its fitted value
        e = y − ŷ
5. Standard error of estimate based on sample data
        • Deviations formula:       Sy·x = √[ Σ(y − ŷ)² / (n − 2) ]
        • Computational formula:    Sy·x = √[ (Σy² − a Σy − b Σxy) / (n − 2) ]
6. Coefficient of determination based on sample data
        • Sums of squares formula:  r² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)²
        • Computational formula:    r² = [a Σy + b Σxy − n(ȳ)²] / [Σy² − n(ȳ)²]
7. Regression sum of squares
        SSR = SST − SSE ;   and the related identity  Sy·x = Sy √(1 − r²)
8. Interval estimate (prediction interval) based on sample data:   ŷ ± tdf Sy·x

Chapter Concepts Quiz

True or False
1. A statistical relationship between two variables does not indicate a perfect relationship. (T/F)
2. A dependent variable in a regression equation is a continuous random variable. (T/F)
3. The residual value is required to estimate the amount of variation in the dependent variable with respect to the fitted regression line. (T/F)
4. Standard error of estimate is the conditional standard deviation of the dependent variable. (T/F)
5. Standard error of estimate is a measure of scatter of the observations about the regression line. (T/F)
6. If one of the regression coefficients is greater than one, the other must also be greater than one. (T/F)
7. The signs of the regression coefficients are always the same. (T/F)
8. Correlation coefficient is the geometric mean of the regression coefficients. (T/F)
9. If the sign of two regression coefficients is negative, then the sign of the correlation coefficient is positive. (T/F)
10. Correlation coefficient and regression coefficient are independent. (T/F)
11. The point of intersection of two regression lines represents the average value of the two variables. (T/F)
12. The two regression lines are at right angles when the correlation coefficient is zero. (T/F)
13. When the value of the correlation coefficient is one, the two regression lines coincide. (T/F)
14. The product of regression coefficients is always more than one. (T/F)
15. The regression coefficients are independent of the change of origin but not of scale. (T/F)

Multiple Choice
16. The line of 'best fit' to measure the variation of observed values of the dependent variable in the sample data is the
    (a) regression line   (b) correlation coefficient   (c) standard error   (d) none of these
17. Two regression lines are perpendicular to each other when
    (a) r = 0   (b) r = 1/3   (c) r = – 1/2   (d) r = ± 1
18. The change in the dependent variable y corresponding to a unit change in the independent variable x is measured by
    (a) bxy   (b) byx   (c) r   (d) none of these
19. The regression lines are coincident provided
    (a) r = 0   (b) r = 1/3   (c) r = – 1/2   (d) r = ± 1
20. If byx is greater than one, then bxy is
    (a) less than one   (b) more than one   (c) equal to one   (d) none of these
21. If bxy is negative, then byx is
    (a) negative   (b) positive   (c) zero   (d) none of these
22. If two regression lines are y = a + bx and x = c + dy, then the correlation coefficient between x and y is
    (a) bc   (b) ac   (c) ad   (d) bd
23. If two regression lines are y = a + bx and x = c + dy, then the ratio of the standard deviations of x and y is
    (a) c/b   (b) c/a   (c) d/a   (d) d/b
24. If two regression lines are y = a + bx and x = c + dy, then the ratio a/c is equal to
    (a) b/d   (b) (1 – b)/(1 – d)   (c) (1 + b)/(1 + d)   (d) (b – 1)/(d – 1)
25. If two coefficients of regression are 0.8 and 0.2, then the value of the coefficient of correlation is
    (a) 0.16   (b) – 0.16   (c) 0.40   (d) – 0.40
26. If two regression lines are y = 4 + kx and x = 5 + 4y, then the range of k is
    (a) k ≤ 0   (b) k ≥ 0   (c) 0 ≤ k ≤ 1   (d) 0 ≤ 4k ≤ 1
27. If two regression lines are x + 3y – 7 = 0 and 2x + 5y = 12, then x̄ and ȳ are respectively
    (a) 2, 1   (b) 1, 2   (c) 2, 3   (d) 2, 4
28. In the method of least squares, the residual sum of squares is
    (a) minimized   (b) increased   (c) maximized   (d) decreased
29. The standard error of estimate Sy⋅x is a measure of
    (a) closeness   (b) variability   (c) linearity   (d) none of these
30. The standard error of estimate is equal to
    (a) σy √(1 − r²)   (b) σy √(1 + r²)   (c) σx √(1 − r²)   (d) σx √(1 + r²)

Concepts Quiz Answers

1. T 2. T 3. T 4. T 5. T 6. F 7. T 8. T 9. F
10. F 11. T 12. T 13. T 14. F 15. T 16. (a) 17. (a) 18. (b)
19. (d) 20. (a) 21. (a) 22. (d) 23. (d) 24. (b) 25. (a) 26. (d) 27. (b)
28. (a) 29. (b) 30. (a)

Review Self-Practice Problems

14.15 Given the following bivariate data:
        x : – 1    5    3    2    1    1    7    3
        y : – 6    1    0    0    1    2    1    5
    (a) Fit a regression line of y on x and predict y if x = 10.
    (b) Fit a regression line of x on y and predict x if y = 2.5.
                                        [Osmania Univ., MBA, 1996]
14.16 Find the most likely production corresponding to a rainfall of 40 inches from the following data:
                                 Rainfall          Production
                               (in inches)        (in quintals)
        Average                     30                  50
        Standard deviation           5                  10
      Coefficient of correlation r = 0.8.
                                        [Bharthidarsan Univ., MCom, 1996]
14.17 The coefficient of correlation between the ages of husbands and wives in a community was found to be + 0.8, the average of husbands' age was 25 years and that of wives' age 22 years. Their standard deviations were 4 and 5 years respectively. Find, with the help of regression equations:
    (a) the expected age of husband when wife's age is 16 years, and
    (b) the expected age of wife when husband's age is 33 years.
                                        [Osmania Univ., MBA, 2000]

14.18 You are given below the following information about advertisement expenditure and sales:
                                 Adv. Exp. (x)      Sales (y)
                                 (Rs in crore)    (Rs in crore)
        Mean                          20               120
        Standard deviation             5                25
        Correlation coefficient       0.8
    (a) Calculate the two regression equations.
    (b) Find the likely sales when advertisement expenditure is Rs 25 crore.
    (c) What should be the advertisement budget if the company wants to attain a sales target of Rs 150 crore?
                        [Jammu Univ., MCom, 1997; Delhi Univ., MBA, 1999]
14.19 For 50 students of a class, the regression equation of marks in Statistics (x) on the marks in Accountancy (y) is 3y – 5x + 180 = 0. The mean marks in Accountancy is 44 and the variance of marks in Statistics is 9/16th of the variance of marks in Accountancy. Find the mean marks in Statistics and the coefficient of correlation between marks in the two subjects.
14.20 The HRD manager of a company wants to find a measure which he can use to fix the monthly income of persons applying for a job in the production department. As an experimental project, he collected data on 7 persons from that department referring to years of service and their monthly income.
        Years of service        : 11   7   9   5   8   6  10
        Income (Rs in 1000's)   : 10   8   6   5   9   7  11
    (a) Find the regression equation of income on years of service.
    (b) What initial start would you recommend for a person applying for the job after having served in a similar capacity in another company for 13 years?
    (c) Do you think other factors are to be considered (in addition to the years of service) in fixing the income with reference to the above problems? Explain.
14.21 The following table gives the age of cars of a certain make and their annual maintenance costs. Obtain the regression equation for costs related to age.
        Age of cars (in years)          :  2   4   6   8
        Maintenance costs (Rs in 100's) : 10  20  25  30
                                        [HP Univ., MBA, 1994]
14.22 An analyst in a certain company was studying the relationship between travel expenses in rupees (y) for 102 sales trips and the duration in days (x) of these trips. He has found that the relationship between y and x is linear. A summary of the data is given below:
        Σx = 510;  Σy = 7140;  Σx² = 4150;  Σxy = 54,900,  and  Σy² = 7,40,200
    (a) Estimate the two regression equations from the above data.
    (b) A given trip takes seven days. How much money should a salesman be allowed so that he will not run short of money?
14.23 The quantity of a raw material purchased by ABC Ltd. at specified prices during the past 12 months is given below.
        Month    Price per kg (in Rs)   Quantity (in kg)      Month    Price per kg (in Rs)   Quantity (in kg)
        Jan              96                   250             July            112                  220
        Feb             110                   200             Aug             112                  220
        March           100                   250             Sept            108                  200
        April            90                   280             Oct             116                  210
        May              86                   300             Nov              86                  300
        June             92                   300             Dec              92                  250
    (a) Find the regression equations based on the above data.
    (b) Can you estimate the approximate quantity likely to be purchased if the price shoots up to Rs 124 per kg?
    (c) Hence or otherwise obtain the coefficient of correlation between the price prevailing and the quantity demanded.
14.24 With ten observations on price (x) and supply (y), the following data were obtained (in appropriate units): Σx = 130, Σy = 220, Σx² = 2288, Σy² = 5506, Σxy = 3467. Obtain the line of regression of y on x and estimate the supply when the price is 16 units. Also find out the standard error of the estimate.
14.25 Data on the annual sales of a company in lakhs of rupees over the past 11 years are shown below. Determine a suitable straight-line regression model y = β0 + β1x + ∈ for the data. Also calculate the standard error of regression of y for values of x.
        Year  : 1978  79  80  81  82  83  84  85  86  87  88
        Sales :    1   5   4   7  10   8   9  13  14  13  18
      From the regression line of y on x, predict the value of annual sales for the year 1989.
14.26 Find the equation of the least squares line fitting the following data:
        x : 1  2  3  4  5
        y : 2  6  5  3  4
      Calculate the standard error of estimate of y on x.
14.27 The following data relate to the number of weeks of experience in a job involving the wiring of an electric motor and the number of motors rejected during the past week for 12 randomly selected workers.
        Worker    Experience (weeks)    No. of Rejects
           1              2                  26
           2              9                  20
           3              6                  28
           4             14                  16
           5              8                  23
           6             12                  18
           7             10                  24
           8              4                  26
           9              2                  38
          10             11                  22
          11              1                  32
          12              8                  25

    (a) Determine the linear regression equation for estimating the number of components rejected, given the number of weeks of experience. Comment on the relationship between the two variables as indicated by the regression equation.
    (b) Use the regression equation to estimate the number of motors rejected for an employee with 3 weeks of experience in the job.
    (c) Determine the 95 per cent approximate prediction interval for estimating the number of motors rejected for an employee with 3 weeks of experience in the job, using only the standard error of estimate.
14.28 A financial analyst has gathered the following data about the relationship between income and investment in securities in respect of 8 randomly selected families:
        Income (Rs in 1000's)           : 18  12  19  24  43  37  19  16
        Per cent invested in securities : 36  25  33  15  28  19  20  22
    (a) Develop an estimating equation that best describes these data.
    (b) Find the coefficient of determination and interpret it.
    (c) Calculate the standard error of estimate for this relationship.
    (d) Find an approximate 90 per cent confidence interval for the percentage of income invested in securities by a family earning Rs 25,000 annually.
                                        [Delhi Univ., MFC, 1997]
14.29 A financial analyst obtained the following information relating to the return on security A and that of the market M for the past 8 years:
        Year     :  1   2   3   4   5   6   7   8
        Return A : 10  15  18  14  16  16  18   4
        Market M : 12  14  13  10   9  13  14   7
    (a) Develop an estimating equation that best describes these data.
    (b) Find the coefficient of determination and interpret it.
    (c) Determine the percentage of the total variation in security return that is explained by the return on the market portfolio.
14.30 The equation of a regression line is
        ŷ = 50.506 – 1.646x
      and the data are as follows:
        x : 15  17  11  12  19  25
        y : 47  38  32  24  22  10
      Solve for the residuals and graph a residual plot. Do these data seem to violate any of the assumptions of regression?
14.31 Graph the following residuals and indicate which of the assumptions underlying regression appear to be in jeopardy on the basis of the graph:
        x     : 13  16  27  29  37  47  63
        y − ŷ : –11  –5  –2  –1   6  10  12

Hints and Answers

14.15 x = Σ x = 21/8 = 2.625; y = Σ y/n = 4/8 = 0.50 Given y = 50, σy = 10, x = 30, σx = 5, r = 0.8
n Σ dx dy − (Σ dx ) (Σ dy ) 8 × 30 − (− 3) ( − 12) σy
byx = = y – y =r (x − x) ;
n Σ dx2 2
− (Σ dx ) 2
8 × 45 − ( − 1) σx
= 0.568;
10
dx = x – 3; dy = y – 3. y – 50 = 0.8 (x – 30)
5
Regression equation:
y – y = byx (x – x ) y = 2 + 1.6x
or y – 0.5 = 0.568 (x – 2.625) For x = 40,y = 2 + 1.6 (40) = 66 quintals.
y = – 0.991 + 0.568x 14.17 Let x = age of wife y = age of husband.
n Σ dx dy − (Σ dx )(Σ dy ) 8 × 30 − (− 3) ( − 12) Given x = 25, y = 22, σx = 4, σy = 5, r = 0.8
(b) bxy = = (a) Regression equation of x on y
n Σ d2y 2
− (Σ dy ) 8 × 84 − (− 12) 2

= 0.386 σx
x−x = r ( y − y)
Regression equation: σy
x – x = bxy (y – y ) 4
x – 25 = 0.8 (y – 22)
or x – 2.625 = 0.386 (y – 5) 5
x = 0.695 + 0.386y x = 10.92 + 0.64 y
14.16 Let x = rainfall y = production by y. The expected yield When age of wife is y = 16; x = 10.92 + 0.64 (16) = 22
corresponding to a rainfall of 40 inches is given by approx.(husband’s age)
regression equation of y on x. (b) Left as an exercise

14.18 (a) Regression equation of x on y n (Σ xy) − (Σ x) (Σ y)


byx =
σ n Σ x 2 − (Σ x )2
x – x = bxy (y – y ) = r x ( y − y )
σy 102 × 54900 − 510 × 7140
= = 12
5
x – 20 = 0.8 (y – 120) 102 × 4150 − (510)2
25 Regression lines:
x = 0.8 + 0.16y
x − x = bxy ( y − y )
Regression equation of y on x
x – 5 = 0.08 (y – 70) or x = 0.08y – 0.6
σy
y – y =r (x − x ) and y − y = byx (x – x )
σx
y – 70 = 12 (x – 5) or y = 12x + 10
25
y – 120 = 0.8 (x – 20) When x = 7, y = 12×7 + 10 = 94
5
y = 40 + 4x 14.23 Let price be denoted by x and quantity by y
(b) When advertisement expenditure is of Rs 25 crore, x = Σx/n = 1200/12 = 100 ;
likely sales is y = Σy/n = 2980/12 = 248.33
y = 40 + 4x = 40 + 4 (25) = 140 crore.
(a) Regression coefficients:
(c) For y = 150, x = 0.8 + 0.16y = 0.8 + 0.16(150)
n Σ dx dy − (Σ dx)(Σ dy)
= 24.8 bxy = = – 0.26
14.19 Let x = marks in Statistics and y = marks in Accountancy, n Σ d2y − (Σ d y )2
Given: 3y – 5x + 180 = 0 n Σ dx dy − (Σ dx )(Σ dy )
bxy = = – 3.244
or x =(3/5) y + (180/5) n Σ dx2 − (Σ dx )2
For y = 44, x = (3/5) × 44 + (180/5) = 62.4
Regression coefficient of x on y, bxy = 3/5 Regression lines:
Coefficient of regression x – x = bxy (y – y )
σ 9
bxy = r x = (given) x – 100 = – 0.26 (y – 248.33)
σy 16
or x = – 0.26y + 164.56
3 9 3 3r and y – y = byx (x – x )
or = r or =
5 16 5 4 y – 248.33 = – 3.244 (x – 100)
Hence 3r = 2.4 or r = 0.8
y = – 3.244x + 572.73
14.20 Let x = years of service and y = income.
(b) For x = 124,
(a) Regression equation of y on x
y = – 3.244 × 124 + 572.73 = 170.474
n Σ xy − (Σ x)(Σ y) 7 × 469 − 56 × 56
byx = = = 0.75 14.24 (a) Regression line of y on x is given by
n Σ x 2 − (Σ x)2 7 × 476 − (56)2
Given y = Σy/n = 220/10 = 22 ;
y – y = byx (x – x )
y – 8 = 0.75 (x – 8) x = Σx/n = 130/10 = 13
y = 2 + 0.75x y – y = byx (x – x )
(b) When x = 13 years, the average income would be y – 22 = 1.015 (x – 13)
y = 2 + 0.75x = 2 + 0.75(13) = Rs 11,750 y = 8.805 + 1.015x
14.21 Let x = age of cars and y = maintenance costs. n Σ xy − Σ x Σ y 10 × 3467 − 130 × 220
byx = =
The regression equation of y on x 2
n Σ x − (Σ x) 2
10 × 2288 − (130)2
x = Σ x/n = 20/4 = 5; y = Σ y/n = 85/4 = 21.25 34670 − 28600 6070
= = = 1.015
n Σ xy − Σ x Σ y 7 × 490 − 20 × 85 22880 − 16900 5980
and byx = = = 3.25
n Σ x 2 − (Σ x)2 7 × 120 − (20)2 (b) When x = 16,
y − y = byx ( x – x ) y = 8.805 + 1.015 (16) = 25.045
y – 21.25 = 3.25 (x – 5)
(c) Syx = Sy 1 − r 2 = 8.16 1 − (0.9618)2 = 2.004
y = 5 + 3.25x
14.22 x = Σx/n = 510/102 = 5; y = Σy/n = 7140/102 = 70 14.25 Take years as x = – 5, – 4, – 3, – 2, – 1, 0, 1, 2, 3, 4, 5,
where 1983 = 0. The regression equation is
Regression coefficients:
ŷ = 9.27 + 1.44x
n (Σ xy) − (Σ x)(Σ y)
bxy = For x = 1989, ŷ = 9.27 + 1.44 (6) = 17.91
n Σ y2 − (Σ y)2
102 × 54900 − 510 × 7140 Σ ( y − ˆy)2 21.2379
= = 0.08 Syx = = = 1.4573
102 × 740200 − (7140)2 n−2 10

14.26 x = Σ x/n = 15/5 = 3, y = Σx/n = 20/5 = 4 relationship between weeks of experience (x) and the
number of rejects (y) in the sample week
The regression equation is:
(b) For x = 3, we have ŷ = 35.57 – 1.40(3) ≅ 31
y − y = byx (x – x )
Σy2 − a Σy − b Σxy
y – 4 = 0.7 (x – 3) or ŷ = 1.9 + 0.7x (c) Syx =
n−2
Standard error of estimate,
7,798 − (35.57) (298) − 1.40 (2048)
Σ( y − ˆy) 5.1 =
Sy x = = = 1.303 12 − 2
n−2 3
= 2.56
Σ xy − n x y 2048 − 12(7.67) (24.83) 95 per cent approximate prediction interval
14.27 (a) b = = = – 1.40
Σ x 2 − n ( x )2 876 − 12(7.67)2
ŷ ± tdf Sy⋅x = 31.37 ± 2.228 (2.56)
a = y – b x = 24.83 – (– 1.40)(7.67) = 35.57
= 25.67 to 37.07 or 26 to 37 rejects.
Thus ŷ = a + bx = 35.57 – 1.40x 14.28 4.724; – 0.983; – 0.399, – 6.753, 2.768, 0.644
Since b = – 1.40, it indicate an inverse (negative) 14.29 Error term non-independent.

Case Studies

Case 14.1: Made in China


The phrase 'made in China' has become an issue of concern in the last few years, as Indian companies try to protect their products from overseas competition. In these years a major trade imbalance in India has been caused by a flood of imported goods that enter the country and are sold at lower prices than comparable Indian-made goods. One prime concern is electronic goods, in which total imported items steadily increased from the 1990s to 2004. The Indian companies have been worried about complaints regarding product quality, worker layoffs, and high prices, and have spent millions in advertising to produce electronic goods that will satisfy consumer demands. Have these companies been successful in stopping the flood of imported goods purchased by Indian consumers? The given data represent the volume of imported goods sold in India for the years 1989–2004. To simplify the analysis, the year has been coded using the variable x = Year − 1989.

Questions for Discussion

1. Find the least-squares line for predicting the volume of import as a function of year for the years 1989–2000.
2. Is there a significant linear relationship between the volume of import and the year?
3. Predict the volume of import of goods using 95% prediction intervals for each of the years 2002, 2003 and 2004.
4. Do the predictions obtained in Question 3 provide accurate estimates of the actual values observed in these years? Explain.
5. Add the data for 1989–2004 to your database and recalculate the regression line. What effect have the new data points had on the slope? What is the effect on SSE?
6. Given the form of the scatter diagram for the years 1989–2004, does it appear that a straight line provides an accurate model for the data? What other type of model might be more appropriate?

Year        x = Year − 1989        Volume of Import (in Rs billion)
1989                0                          1.1
1990                1                          1.3
1991                2                          1.6
1992 3 1.6
1993 4 1.8
1994 5 1.4
1995 6 1.6
1996 7 1.5
1997 8 2.1
1998 9 2.0
1999 10 2.3
2000 11 2.4
2001 12 2.3
2002 13 2.2
2003 14 2.4
2004 15 2.4
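One possible starting point for Question 1 is sketched below (an illustration under the stated coding, not a prescribed solution): it fits the least squares line to the coded years x = 0, 1, ..., 11 (1989–2000) using NumPy's polyfit, and also reports SSE, which is useful for Questions 2 and 5.

```python
import numpy as np

# Coded years x = Year - 1989 and import volume (Rs billion) for 1989-2000
x = np.arange(12)                                   # 0, 1, ..., 11
volume = np.array([1.1, 1.3, 1.6, 1.6, 1.8, 1.4,
                   1.6, 1.5, 2.1, 2.0, 2.3, 2.4])

slope, intercept = np.polyfit(x, volume, 1)
print(f"least-squares line: y = {intercept:.3f} + {slope:.3f}x")

# Fitted values and the residual sum of squares, SSE
fitted = intercept + slope * x
sse = np.sum((volume - fitted)**2)
print(round(sse, 4))
```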
