Regression - and - Correlation 2 PDF
a) The average runs scored by seven leading test cricketers during the year 2010 are given
below:
Find the Spearman's rank correlation coefficient for the runs scored in first and second
innings and interpret your result.
r = 1 – 1.8571
r = – 0.8571
Line x on y
5x – 4y + 2 = 0
5x = 4y – 2
x = (4/5)y – 2/5
Line y on x
x – 5y + 3 = 0
5y = x + 3
y = 1/5 x + 3/5
Here byx= 1/5
The quantities sold by T&P Limited during the past seven months are as follows:
Product x 11 20 04 00 18 07 16
Product y 15 02 32 35 05 28 10
x y xy x2 y2
11 15 165 121 225
20 2 40 400 4
4 32 128 16 1,024
0 35 0 0 1,225
18 5 90 324 25
7 28 196 49 784
16 10 160 256 100
76 127 779 1,166 3,387
Since \(a = \bar{y} - b\bar{x}\), therefore
\[ a = \frac{127}{7} - (-1.7598)\left(\frac{76}{7}\right) \]
y = a + bx
y = 37.2498 – 1.7598x
ii)
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} = \frac{7(779) - 76 \times 127}{\sqrt{[7(1166) - (76)^2][7(3387) - (127)^2]}} \]
Results Interpretation
The value of r shows a high negative correlation between the quantities sold of the two
products. r² = 0.9749 signifies that 97.5% of the variation in sales of Product y is explained by
the variation in sales of Product x. The remaining 2.5% is due to other factors.
Price (Rs.) 33 55 50 42 48 61 53 33
Demand (1000 kg) 91 60 59 65 61 49 42 91
and
\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{8(23129) - 375(518)}{8(18301) - (375)^2} = \frac{185032 - 194250}{146408 - 140625} = \frac{-9218}{5783} = -1.59 \]
\[ \bar{y} = \frac{\sum y}{n} = \frac{518}{8} = 64.75, \qquad \bar{x} = \frac{\sum x}{n} = \frac{375}{8} = 46.875 \]
Regression line
y = 139.28 – 1.59x
b) Correlation coefficient
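\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} = \frac{8(23129) - 375(518)}{\sqrt{[8(18301) - (375)^2][8(35754) - (518)^2]}} \]
\[ = \frac{185032 - 194250}{\sqrt{[146408 - 140625][286032 - 268324]}} = \frac{-9218}{\sqrt{(5783)(17708)}} = \frac{-9218}{10119.55} = -0.91 \]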
c) There is a strong inverse correlation between the demand for a product and its price.
The coefficient of determination, 0.8281, shows that 82.81% of the variation in demand is
explained by price, i.e. demand decreases when price increases. The remaining 17.19%
of the variation is due to other, unexplained factors.
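The correlation in part (b) can be cross-checked with a short Python sketch. It recomputes r from the price and demand figures given in the question, using deviations from the means rather than the raw-sum formula (the two forms are algebraically equivalent). The variable names are illustrative only. Because r is not rounded before squaring here, r² comes out close to 0.83 rather than exactly 0.8281.

```python
from math import sqrt

# Price (Rs.) and demand (1000 kg) as listed in the question
price = [33, 55, 50, 42, 48, 61, 53, 33]
demand = [91, 60, 59, 65, 61, 49, 42, 91]

n = len(price)
mean_x = sum(price) / n
mean_y = sum(demand) / n

# Sums of products and squares of deviations from the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(price, demand))
sxx = sum((x - mean_x) ** 2 for x in price)
syy = sum((y - mean_y) ** 2 for y in demand)

r = sxy / sqrt(sxx * syy)   # Pearson correlation coefficient
r_squared = r * r           # coefficient of determination
print(round(r, 2), round(r_squared, 2))
```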
a) Following data shows the marks obtained by 11 students in Mathematics and Physics:
Mathematics 27 73 34 25 64 91 70 62 55 48 59
Physics 67 62 41 21 74 85 66 49 55 44 68
Find the Spearman's rank correlation coefficient for the above data and interpret your
result.
b) In order to determine the relationship between experience of its employees and their
respective output, a company has gathered the following data:
Experience in years 2 4 6 8 10 12 14 16 18 20
Output in % 30 35 44 43 46 50 45 48 39 34
\[ \rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(80)}{11(121 - 1)} = 1 - \frac{480}{1320} = 1 - 0.36 = 0.64 \]
The rank correlation coefficient of 0.64 shows that normally the students good in mathematics
are good in physics as well.
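The rank differences behind ∑d² = 80 are not shown above, so the following Python sketch reproduces them from the marks in the question. It assumes there are no tied marks (true for this data set) and ranks the highest mark as 1; the helper names `ranks_desc` and `spearman_rho` are illustrative.

```python
def ranks_desc(values):
    """Rank values from highest (rank 1) to lowest; assumes no ties."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Return (sum of squared rank differences, Spearman's rho)."""
    n = len(x)
    rx, ry = ranks_desc(x), ranks_desc(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return sum_d2, 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


maths = [27, 73, 34, 25, 64, 91, 70, 62, 55, 48, 59]
physics = [67, 62, 41, 21, 74, 85, 66, 49, 55, 44, 68]

sum_d2, rho = spearman_rho(maths, physics)
print(sum_d2, round(rho, 2))  # 80 0.64
```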
Or
\[ b = \frac{94}{330} = 0.285 \]
a = 38.265
Hence line y = 38.265 + 0.285x
ii) Apparently, more experience leads to better output. However, after 12 years of
experience, employees appear to become complacent and consequently show lower output.
(Rupees in million)
Year  Advertising expenditure  Annual profit
2001 90 45
2002 100 42
2003 95 44
2004 110 60
2005 130 30
2006 145 34
2007 150 35
2008 140 30
a) Construct the least square regression equation and predict the annual profit for the
year 2009 if the advertising expenditure is budgeted at Rs. 160 million.
b) Determine the coefficient of correlation and interpret your result.
\[ a = \bar{y} - b\bar{x}, \qquad \bar{y} = \frac{\sum y}{n} = \frac{320}{8} = 40, \qquad \bar{x} = \frac{\sum x}{n} = \frac{960}{8} = 120 \]
At x = 160 million
y = 72.4 – 0.27 (160) = 72.4 – 43.2 = 29.4 million
\[ r = \frac{-8720}{\sqrt{(32400)(5648)}} = \frac{-8720}{13527.57} = -0.64 \]
The coefficient of correlation suggests a moderate negative relationship: as advertising
expenditure increases, annual profit tends to decrease.
Machine 1 2 3 4 5 6 7 8 9
Age (in months) 5 10 15 20 30 30 30 50 50
Cost (Rs. In 000) 19 24 25 30 31 32 30 30 35
Age(x) Cost(y) xy x2 y2
5 19 95 25 361
10 24 240 100 576
15 25 375 225 625
20 30 600 400 900
∑y = na + b∑x
∑xy = a∑x + b∑x2
Multiplying (i) by 80 and (ii) by 3, and subtracting (i) from (ii):
b = 1570/6150 = 0.255
\[ r = \frac{4710}{\sqrt{[18450][1712]}} = \frac{4710}{5620} = 0.84 \]
Coefficient of correlation suggests that there is a relatively strong positive relationship between
age of a machine and its maintenance cost.
b) Students who finish the examinations more quickly than the rest are often thought to be
smarter. The following set of data shows the score of 12 students and the order in which
they finished their examination:
Order of finish 1 2 3 4 5 6 7 8 9 10 11 12
Exam score 90 78 76 60 92 86 74 60 60 78 68 64
Find the Spearman's rank correlation co-efficient for the above data.
\[ b_{yx} = \frac{768}{1000} = 0.768 \]
\[ \rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(151.5)}{12(12^2 - 1)} = 1 - \frac{909}{1716} = 1 - 0.53 = 0.47 \]
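The exam scores contain ties (78 twice, 60 three times), so ∑d² = 151.5 relies on giving tied scores the average of the ranks they occupy. The sketch below shows that step in Python under two assumptions that appear consistent with the worked figures: the highest score is ranked 1, and the simple 6∑d² formula is applied without a separate tie correction. The helper name `average_ranks` is illustrative.

```python
def average_ranks(values, descending=True):
    """Rank values, giving tied values the mean of the ranks they occupy."""
    order = sorted(values, reverse=descending)
    ranks = []
    for v in values:
        first = order.index(v) + 1      # first (1-based) position occupied by v
        count = order.count(v)          # how many entries share this value
        ranks.append(first + (count - 1) / 2)
    return ranks


finish_order = list(range(1, 13))       # order of finish is already a ranking
scores = [90, 78, 76, 60, 92, 86, 74, 60, 60, 78, 68, 64]

score_ranks = average_ranks(scores)
n = len(scores)
sum_d2 = sum((a - b) ** 2 for a, b in zip(finish_order, score_ranks))
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(rho, 2))  # 151.5 0.47
```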
X 13 16 14 11 17 9 13 17 18 12
Y 6.2 8.6 7.2 4.5 9.0 3.5 6.5 9.3 9.5 5.7
\[ \bar{x} = \frac{\sum x}{n} = \frac{140}{10} = 14, \qquad \bar{y} = \frac{\sum y}{n} = \frac{70}{10} = 7 \]
\[ y - \bar{y} = b_{yx}(x - \bar{x}) \]
y – 7 = 0.71 (x – 14)
y – 7 = 0.71x – 9.94, y = 0.71x – 9.94 + 7
y = 0.71x – 2.94
X 9 2 12 7 16 5 8
Y 12 18 11 16 9 16 14
b) Coefficient of correlation
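\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} = \frac{7(724) - 59(96)}{\sqrt{[7(623) - (59)^2][7(1378) - (96)^2]}} = \frac{5068 - 5664}{\sqrt{[4361 - 3481][9646 - 9216]}} \]
\[ = \frac{-596}{\sqrt{(880)(430)}} = -0.969 \]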
Explain the type of relationship you would expect between x and y in each of the above
cases.
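\[ y - \bar{y} = b_{yx}(x - \bar{x}) \]
where
\[ \bar{y} = \frac{\sum y}{n} = \frac{468}{6} = 78 \qquad \text{and} \qquad \bar{x} = \frac{\sum x}{n} = \frac{10.91}{6} = 1.82 \]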
y – 78 = 40.22 (x – 1.82)
y = 40.22x – 73.2 + 78 = 4.8 + 40.22x
To estimate the weight of a person, multiply the person's height by 40.22
and add 4.8.
To estimate the height of a person, multiply the person's weight by 0.024
and subtract 0.052.
i) r = 1 indicates perfect positive correlation between two variables, i.e. a change in one
variable is accompanied by a change in the other variable in a definite proportion and in
the same direction.
ii) r = –1 indicates perfect inverse correlation between two variables, i.e. a change in
one variable is accompanied by a change in the other variable in a definite proportion but
in the opposite direction.
iii) r = 0 indicates no linear association between two variables. The variables need not be
independent; they may still be related in a non-linear way.
iv) r = 0.90 indicates a very strong positive correlation between two variables: they
change in the same direction, but the correlation is not perfect.
v) r = 0.10 shows that the relationship between two variables is quite weak, although it is
positive.
vi) r = – 0.88 indicates a very strong inverse correlation between two variables, but the
correlation is not perfect.
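As a caution on the r = 0 case in item iii), a zero coefficient only rules out a straight-line relationship. The Python sketch below uses made-up data (not taken from any question here) in which y is completely determined by x through y = x², yet the product-moment coefficient is exactly zero.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation coefficient from raw sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


# Hypothetical illustration: y depends on x exactly, but not linearly
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]
print(pearson_r(x, y))  # 0.0
```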
X –3 –1 0 1 3
Y 12 7 6 4 1
b) Calculate the co-efficient of correlation 'r' for the above data set. Does it seem consistent
with the above scatter diagram?
0 6 0 0 36
1 4 4 1 16
3 1 3 9 1
Total 0 30 –36 20 246
The coefficient of correlation is negative, which is consistent with the downward-sloping pattern of the scatter diagram.
a) Regression line is y = a + bx
For a & b
∑y = na + b∑x
∑xy = a∑x + b∑x2
9.75 = 11a + 58b____(i)
47.32 = 58a + 326b___(ii)
b = – 44.98/222 = – 0.2
Putting this value in (i)
9.75 = 11a+ (58) (–0.2)
11a = 9.75 + 11.6 = 21.35
a = 21.35/11 =1.94
b) Because the slope is negative, the price of a car decreases as its age increases.
c) y = 1.94 – 0.2 (3) = 1.94 – 0.6 = 1.34 million
5 800 4000 25
6 780 4680 36
7 780 5460 49
8 660 5280 64
9 640 5760 81
10 600 6000 100
11 620 6820 121
12 620 7440 144
68 5500 45440 620
b = – 31.19
\[ a = \bar{y} - b\bar{x}, \qquad \bar{y} = \frac{\sum y}{n} = \frac{5500}{8} = 687.5, \qquad \bar{x} = \frac{\sum x}{n} = \frac{68}{8} = 8.5 \]
62 110
71 170
66 120
68 150
70 150
67 130
63 120
65 100
Compute the co-efficient of correlation and co-efficient of determination and interpret your
results.
Coefficient of correlation
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} = \frac{4860}{5765.55} = 0.84 \]
The coefficient of determination, r² = 0.7056, means that 70.56% of the variation in weight is
explained by the variation in height. The remaining 100 – 70.56 = 29.44% is due to other
factors, which might include diet, parents, atmosphere, etc.
a) The age and price data for a sample of 11 Nissan Sunny Cars are presented in the
following table:
b) If 8 members of a tennis club are classified A players, 6 are classified B players and 10
are classified C players, in how many different ways can 2 players from each group be
chosen to represent the club.
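i) Coefficient of correlation
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} = \frac{11(4732) - 58(975)}{\sqrt{[11(326) - (58)^2][11(96129) - (975)^2]}} \]
\[ = \frac{52052 - 56550}{\sqrt{(3586 - 3364)(1057419 - 950625)}} = \frac{-4498}{4869.11} = -0.92 \]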
ii) There is a strong negative correlation between the ages of cars and their prices, i.e. the
older the car, the lower its price.
Plot the data in a scatter diagram. Based on the scatter diagram, what observations can you
make?
Fertilizer applications 1 2 4 5 6 8 10
Tons of crop per acre 2 3 4 7 12 10 7
Find a suitable linear regression relationship to help the farmer in making the required prediction
and from your result predict the number of tons per acre of crop from 7 applications of fertilizer.
\[ a = \bar{y} - b\bar{x} = \frac{45}{7} - 0.81\left(\frac{36}{7}\right) = 6.43 - 4.17 = 2.26 \]
y = 2.26 + 0.81x
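The fitted line can be reproduced, and the requested prediction for 7 applications obtained, with the Python sketch below. It applies the usual least-squares formulas to the fertilizer data from the question; `fit_line` is an illustrative helper name. Since b is not rounded to 0.81 before computing a, the intercept comes out near 2.24 rather than 2.26, and the prediction is a little under 8 tons per acre.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(p * q for p, q in zip(x, y))
    sxx = sum(p * p for p in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n
    return a, b


applications = [1, 2, 4, 5, 6, 8, 10]   # fertilizer applications
tons = [2, 3, 4, 7, 12, 10, 7]          # tons of crop per acre

a, b = fit_line(applications, tons)
predicted = a + b * 7                   # estimate for 7 applications
print(round(a, 2), round(b, 2), round(predicted, 2))
```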
Plot the data in a scatter diagram. Based on the scatter diagram, what observations can you
make?
b) A computer, while calculating the correlation co-efficient between two variables X and Y
from 25 pairs of observations, obtained the following sums:
∑x = 125
∑x² = 650
∑y = 100
∑y² = 460
∑xy = 508
8 6 6 8
x y xy x2
2 20 40 4
4 36 144 16
6 38 228 36
8 38 304 64
10 52 520 100
12 54 648 144
Total 42 238 1884 364
Corrected sums
∑x =125 – 6 – 8 + 8 + 6 = 125
∑y =100 – 14 – 6 + 12 + 8 =100
∑x2 = 650 – 36 – 64 + 64 + 36 = 650
∑y2 =460 – 196 – 36 + 144 + 64 = 436
∑xy = 508 – 84 – 48 + 96 + 48 = 520
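\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} = \frac{25(520) - 125(100)}{\sqrt{[25(650) - (125)^2][25(436) - (100)^2]}} \]
\[ = \frac{13000 - 12500}{\sqrt{(16250 - 15625)(10900 - 10000)}} = \frac{500}{\sqrt{(625)(900)}} = \frac{500}{750} = 0.67 \]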
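The correction of the sums can be expressed as a small Python sketch: remove each wrongly copied pair's contribution from every total, add the correct pair's contribution, then apply the usual formula. The wrong pairs (6, 14), (8, 6) and correct pairs (8, 12), (6, 8) are inferred from the arithmetic in the "Corrected sums" lines, so treat them as an assumption about the original question.

```python
from math import sqrt

n = 25
# Sums as originally obtained by the computer
sums = {"x": 125, "y": 100, "x2": 650, "y2": 460, "xy": 508}

wrong_pairs = [(6, 14), (8, 6)]    # assumed: pairs as wrongly copied
right_pairs = [(8, 12), (6, 8)]    # assumed: pairs as they should have been


def adjust(totals, pairs, sign):
    """Add (sign=+1) or remove (sign=-1) the contribution of each (x, y) pair."""
    for x, y in pairs:
        totals["x"] += sign * x
        totals["y"] += sign * y
        totals["x2"] += sign * x * x
        totals["y2"] += sign * y * y
        totals["xy"] += sign * x * y


adjust(sums, wrong_pairs, -1)
adjust(sums, right_pairs, +1)

numerator = n * sums["xy"] - sums["x"] * sums["y"]
denominator = sqrt((n * sums["x2"] - sums["x"] ** 2) * (n * sums["y2"] - sums["y"] ** 2))
r = numerator / denominator
print(sums, round(r, 2))
```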
Required:
b) For the following two sets of bivariate data, the regression lines for each set are,
respectively:
Required:
Find the product moment coefficient of correlation in each case.
2 5 137 685 25
3 7 149 1043 49
4 5 129 645 25
Total 23 557 3225 135
The regression line is y = a + bx, where b is the variable cost per unit and a is the fixed cost.
b) The research director of a bank collected 24 observations of mortgage interest rates (x)
and number of house sale (y) at each interest rate. The director computed.
Correlation coefficient
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} = \frac{24(8690) - 276(768)}{\sqrt{[24(3300) - (276)^2][24(25000) - (768)^2]}} \]
a) Calculate the equation of the least squares regression line of y on x from the following
data:
X 1 3 3 4 5 5
Y 5 3 2 2 0 1
b) Five students were given following marks in a general knowledge competition by two
different judges:
13 = 6a + 21b__(i)
33 = 21 a + 85b_(ii)
66 = 42a + 170b
– 91 = – 42a – 147b
– 25 = 23b
b = – 25/23 = – 1.09
13 = 6a + 21 (–1.09) = 6a – 22.89
6a = 13 + 22.89 = 35.89
a = 5.98 say 6.00
Line of regression y = 6 – 1.09x
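As a cross-check on the elimination above, the two normal equations can also be solved directly by Cramer's rule. The sketch below does this in Python with exact fractions; `solve_normal_equations` is an illustrative helper, and the sums are those appearing in equations (i) and (ii). The exact values a = 412/69 ≈ 5.97 and b = –25/23 ≈ –1.09 agree with the rounded line y = 6 – 1.09x.

```python
from fractions import Fraction


def solve_normal_equations(n, sx, sy, sxy, sxx):
    """Solve  sy = n*a + sx*b  and  sxy = sx*a + sxx*b  by Cramer's rule."""
    det = n * sxx - sx * sx
    a = Fraction(sy * sxx - sx * sxy, det)
    b = Fraction(n * sxy - sx * sy, det)
    return a, b


# 13 = 6a + 21b  and  33 = 21a + 85b
a, b = solve_normal_equations(n=6, sx=21, sy=13, sxy=33, sxx=85)
print(a, b)  # 412/69 -25/23
```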
b)
Student   Marks A   Marks B   Rank A   Rank B   d      d²
Ali       70        54        3.5      3        0.5    0.25
Adil      92        43        1        4.5      –3.5   12.25
Asif      80        43        2        4.5      –2.5   6.25
Ahmad     65        67        5        1        4.0    16.00
Ayub      70        64        3.5      2        1.5    2.25
Total                                                  37.00
b) An equal number of families from eight different cities of various sizes were asked
how much money they spend on food, clothing and housing per year. The data on city
sizes and average family expenditures are given below:
Sx = 5.12, Sy = 5.6
\[ (y - \bar{y}) = \frac{r S_y}{S_x}(x - \bar{x}) \]
\[ (y - 52) = \frac{(0.68)(5.6)}{5.12}(x - 68) \]
y – 52 = 0.74(x – 68)
y – 52 = 0.74x – 50.32
y = 0.74x + 1.68
\[ (x - \bar{x}) = \frac{r S_x}{S_y}(y - \bar{y}) \]
\[ (x - 68) = \frac{(0.68)(5.12)}{5.6}(y - 52) \]
x – 68 = 0.62 (y – 52)
x = 0.62y – 32.24 + 68
x = 0.62y + 35.76
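Both regression lines follow from r, the two standard deviations and the two means, which the Python sketch below makes explicit using the figures from this solution. It also checks the standard identity that the product of the two regression coefficients equals r². Variable names are illustrative, and the intercepts differ slightly from the hand-worked ones because the slopes are not rounded to two decimals first.

```python
r, s_x, s_y = 0.68, 5.12, 5.6      # correlation and standard deviations
mean_x, mean_y = 68, 52            # means used in the worked solution

b_yx = r * s_y / s_x               # slope of the regression of y on x
b_xy = r * s_x / s_y               # slope of the regression of x on y

a_yx = mean_y - b_yx * mean_x      # intercept of y on x
a_xy = mean_x - b_xy * mean_y      # intercept of x on y

print(round(b_yx, 2), round(b_xy, 2))  # 0.74 0.62
print(round(a_yx, 2), round(a_xy, 2))
```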
x y xy x2 y2
30 65 1950 900 4225
50 77 3850 2500 5929
75 79 5925 5625 6241
100 80 8000 10000 6400
150 82 12300 22500 6724
200 90 18000 40000 8100
175 84 14700 30625 7056
120 81 9720 14400 6561
900 638 74445 126550 51236
i)
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} = \frac{8(74445) - 900(638)}{\sqrt{[8(126550) - (900)^2][8(51236) - (638)^2]}} \]
r² = 0.7921
The result shows that (0.7921)(100) = 79.21% of the variation in expenditure is explained by
city size (population). The remaining 100 – 79.21 = 20.79% is due to reasons other than
population, which might include inflation.
at x = 0; y = 8.25
and at x = 3; y = 12
Find the value of,
i) a (y intercept)
ii) byx (Regression Coefficient)
iii) What does byx represent
225 = 7a + 77b_____________(i)
2506 = 77a + 863b__________(ii)
A researcher compiled the following information to investigate the relationship between smoking
and lung cancer:
Sr. No.  Country  Per capita cigarette consumption (y)  Deaths per 100,000 from lung cancer (x)  xy  y²  x²
1 U.S.A 1300 20 26000 1690000 400
2 UK 1100 46 50600 1210000 2116
3 Finland 1100 35 38500 1210000 1225
4 Switzerland 510 25 12750 260100 625
5 Canada 500 15 7500 250000 225
6 Holland 490 24 11760 240100 576
7 Australia 480 18 8640 230400 324
8 Denmark 380 17 6460 144400 289
9 Sweden 300 11 3300 90000 121
10 Norway 250 9 2250 62500 81
11 Iceland 230 6 1380 52900 36
Total 6640 226 169140 5440400 6018
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} = \frac{(11)(169140) - (226)(6640)}{\sqrt{[(11)(6018) - (226)^2][(11)(5440400) - (6640)^2]}} \]
r = 0.7373, r² = 0.5437
The above information shows that 54.37% of the variation in deaths from lung cancer is explained
by variation in cigarette consumption, and that there is a positive relationship between
cigarette consumption and deaths from lung cancer.
x1 x2 y
1 2 1
8 8 4
3 1 1
5 7 3
6 4 2
10 6 4
x1 x2 y x1x2 x1y x2y x1² x2²
1 2 1 2 1 2 1 4
8 8 4 64 32 32 64 64
3 1 1 3 3 1 9 1
5 7 3 35 15 21 25 49
6 4 2 24 12 8 36 16
10 6 4 60 40 24 100 36
33 28 15 188 103 88 235 170
i) y = a + b1x1 + b2 x2
To solve for a, b1, b2 simultaneous equations are
∑y = na + b1∑x1 + b2∑x2
∑x1y = a∑x1 + b1∑x1² + b2∑x1x2
∑x2y = a∑x2 + b1∑x1x2 + b2∑x2²
15 = 6a + 33b1 + 28b2_______________(i)
103 = 33a + 235b1 + 188b2___________(ii)
88 = 28a + 188b1 + 170b2____________(iii)
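Eliminating a: multiplying (ii) by 6 and (i) by 33 and subtracting gives 123 = 321b1 + 204b2, i.e.
41 = 107b1 + 68b2___________________(iv)
Multiplying (iii) by 6 and (i) by 28 and subtracting gives 108 = 204b1 + 236b2, i.e.
54 = 102b1 + 118b2__________________(v)
Multiplying (iv) by 102 and (v) by 107, and subtracting (iv) from (v):
1596 = 5690b2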
b2 = 0.28
41 = 107b1 +68(0.28)
107b1 = 41 – 19.04 = 21.96 or b1 = 0.21
Putting values of b1 & b2 in (i)
15 = 6a+ 33 (0.21) + 28 (0.28)
15 = 6a+ 6.93+ 7.84
15 – 14.77 = 6a = 0.23, a = 0.04
y = 0.04 + 0.21x1 + 0.28x2
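The three normal equations can also be solved without intermediate rounding. The Python sketch below uses a small Gauss-Jordan elimination on exact fractions; `solve_linear_system` is an illustrative helper, not a library routine. The unrounded solution is roughly a = 0.064, b1 = 0.205, b2 = 0.280, so the hand-worked a = 0.04 reflects rounding b1 and b2 to two decimals before substituting.

```python
from fractions import Fraction


def solve_linear_system(matrix, rhs):
    """Gauss-Jordan elimination with exact fractions (assumes a unique solution)."""
    n = len(rhs)
    rows = [[Fraction(v) for v in row] + [Fraction(t)] for row, t in zip(matrix, rhs)]
    for col in range(n):
        # Bring a row with a non-zero entry in this column into pivot position
        pivot = next(i for i in range(col, n) if rows[i][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row so the pivot becomes 1
        rows[col] = [v / rows[col][col] for v in rows[col]]
        # Clear this column in every other row
        for i in range(n):
            if i != col:
                factor = rows[i][col]
                rows[i] = [vi - factor * vc for vi, vc in zip(rows[i], rows[col])]
    return [row[-1] for row in rows]


# Coefficients of a, b1, b2 in equations (i), (ii), (iii)
coefficients = [[6, 33, 28], [33, 235, 188], [28, 188, 170]]
totals = [15, 103, 88]

a, b1, b2 = solve_linear_system(coefficients, totals)
print(float(a), float(b1), float(b2))
```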
b) Find the equation of the least square regression of y on x for the following data.
X 1 2 4 6 7 8 10
Y 10 14 12 13 15 12 13
\[ r = \frac{79}{\sqrt{(74)(104)}} = \frac{79}{87.73} = 0.9 \]
b)
x y xy x2
1 10 10 1
2 14 28 4
4 12 48 16
6 13 78 36
7 15 105 49
8 12 96 64
10 13 130 100
38 89 495 270
∑y = na + b∑x
∑xy = a∑x + b∑x2
89 = 7a + 38b______________(i)
495 = 38a + 270b____________(ii)
b = 0.186
Putting it in (i)
89 = 7a + 38 (0.186)
89 = 7a + 7.07 or 7a = 89 – 7.07 = 81.93, a = 11.704
ii) What is the purpose of finding the correlation coefficient and what does its value
indicate in respect of the above data on advertising expenditure and sales revenue?
x y xy x2 y2
1 1 1 1 1
3 2 6 9 4
4 4 16 16 16
6 4 24 36 16
8 5 40 64 25
9 7 63 81 49
11 8 88 121 64
14 9 126 196 81
Total 56 40 364 524 256
In this particular question r = 0.98, so r² = (0.98)² = 0.9604, i.e. about 96% of the
variation in sales revenue is explained by advertising expenditure.
X 5 7 9 11 13 15
Y 1.7 2.4 2.8 3.4 3.7 4.4
ii) Estimate from this equation the profit per unit on an output of 10500 units.
y x xy x2
1.7 5 8.5 25
2.4 7 16.8 49
2.8 9 25.2 81
3.4 11 37.4 121
3.7 13 48.1 169
4.4 15 66.0 225
Total 18.4 60 202.0 670
i) y = a + bx
∑y = na + b∑x
\[ b = \frac{18}{70} = 0.26 \]
∑y = na + b∑x
∑xy = a∑x + b∑x2
20.04 = 7a, So a = 2.86
6.71 = 28b, b = 0.24
\[ (y - \bar{y}) = \frac{r S_y}{S_x}(x - \bar{x}) \]
\[ y - 3.5 = \frac{(0.91)(2.9)}{1.5}(x - 5) \]
y = 1.76 (x – 5) + 3.5
y = 1.76x – 8.8 + 3.5 = 1.76x – 5.3
∑x =30, ∑y =180
∑x2 = 200, ∑xy=1000
Regression of y upon x;
i) \( y - \bar{y} = b_{yx}(x - \bar{x}) \)
where
\[ \bar{y} = \frac{\sum y}{n} = \frac{180}{6} = 30 \qquad \text{and} \qquad \bar{x} = \frac{\sum x}{n} = \frac{30}{6} = 5 \]
y – 30 = 2 (x – 5)
y = 2x – 10 + 30 = 20 + 2x
y = 20 + 2(8)
y = 20 + 16 = 36
\[ \sum x = 7860, \qquad \bar{x} = \frac{\sum x}{n} = \frac{7860}{6} = 1310 \]
Height (x). 64 68 70 72 74
Weight (y). 160 170 180 190 195
a) Time series:
The arrangement of data according to the time of occurrence at regular intervals of time
like hours, days, months or years is called time series. Examples of time series are hourly
temperature recorded at a locality for a specific period, the production of fertilizer of
certain kind at Pak-Saudi Fertilizer Plant at Multan, the enrolment of students appearing
in C.A final examination over past few years etc.
A time series may be composed of four basic types of movement, which are called its
components. These are:
i) Secular trend, which is the long-term movement and persists for a long period,
normally not less than a decade.
ii) Seasonal variations, which are mainly due to changes in season and are short-term
movements, the fluctuations being repeated within a year or a shorter period.
iii) Cyclic variations, which tend to occur in a more or less regular pattern over a
certain number of years, fluctuating from a peak to a lowest point and then back
to a peak. An example is a business cycle with a duration of 3, 5, 7, etc. years.
iv) Irregular variations, also called random or accidental fluctuations, which are
unsystematic in nature, such as floods, strikes, earthquakes, wars and some political
events. The study of these variations is somewhat difficult.
The net effect of all four components is either additive or multiplicative:
y = T + S + C + I
or y = T × S × C × I
y = a + bx
Multiplying (i) by 348 and (ii) by 5, and subtracting (i) from (ii):
312550 = 1740a + 121400b
– 311460 = – 1740a – 121104b
1090 = 296b
b = 3.68
Putting this value in (i)
Coefficient of correlation r
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} = \frac{5(62510) - (895)(348)}{\sqrt{[5(24280) - (348)^2][5(161025) - (895)^2]}} \]
On the basis of these data, it wants to know how much sales it could achieve if it spends Rs. 73,000
on advertisement.
Advertisement cost (Rs. 1000)  Sales (Rs. 100,000)
x  y  xy  x²
33 45 1485 1089
40 51 2040 1600
43 56 2408 1849
47 52 2444 2209
50 61 3050 2500
52 60 3120 2704
56 51 2856 3136
63 63 3969 3969
66 65 4290 4356
450 504 25662 23412
The line is y = a + bx
y = 31 + (0.5) (73)
Company A B C D E F G
Sales 5.7 6.7 0.2 0.6 3.8 12.5 0.5
Profit 0.27 0.12 0.00 0.04 0.05 0.46 0.00
∑y = na + b∑x
∑xy = a∑x + b∑x²
0.94 = 7a + 30b__________________(i)
8.307 = 30a + 248.72b ____________(ii)
b = 0.036
X 2 3 4 5 6 7 8
y 2 8 11 9 19 14 14
b) i)
x y xy x² ŷ y–ŷ
2 2 4 4 5 -3
3 8 24 9 7 +1
4 11 44 16 9 +2
5 9 45 25 11 -2
6 19 114 36 13 +6
7 14 98 49 15 -1
8 14 112 64 17 -3
35 77 441 203 0
\[ \bar{x} = \frac{\sum x}{n} = \frac{35}{7} = 5, \qquad \bar{y} = \frac{\sum y}{n} = \frac{77}{7} = 11 \]
y – 11 = 2(x – 5), y – 11 = 2x – 10
y = 2x – 10 + 11, y = 1 + 2x