Module 12. Linear Corr & Reg Analysis
Regression Analysis
Regression analysis is a statistical technique for determining the probable form of the
relationship between variables. The usual objective of this method of analysis is to
predict or estimate the value of one variable corresponding to a given value of another variable.
Simple regression analysis is a form of linear regression that uses only one independent variable X to
predict the dependent variable Y. Objective: to find the possible relationship between two variables X and
Y, where X and Y are paired variables.
Regression Line
We assume that the value of X is known in advance and that the value assumed by Y depends in part on
the particular value of X under consideration. Y is called the dependent or response variable; the
variable X, whose value is used to help predict the behavior of Y, is called the independent or predictor
variable, or the regressor.
In the simple linear regression model
CBMEC 107 -Business Statistics Module 12. Regression and Correlation Analysis
y = α + βx
where α denotes the intercept and β the slope of the regression line.
Scatter Diagram
Scatter diagrams (also called scatter plots, scatter graphs, and correlation charts) are similar to line
graphs. A line graph uses a line on X-Y axes to plot a continuous function, while a scatter plot uses
dots to represent individual pieces of data.
In statistics, these plots are useful for seeing whether two variables are related to each other. For example, a
scatter chart can suggest a linear relationship (i.e., a straight line).
Amount (ml)         17  11  10   18  20  5    22  14  17  25  22  8   11  18  21
Time (in minutes)   56  50  120  70  80  120  30  45  55  60  64  56  76  48  92
Solution:
Draw the scatter diagram (Figure 3)
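A rough text rendering of such a scatter diagram can be produced without any plotting library. The sketch below is only an illustration: it assumes Amount goes on the horizontal axis and Time on the vertical axis (the module does not say which is which), and the helper name `text_scatter` and the grid size are hypothetical choices.

```python
# Data from the table above.
amount = [17, 11, 10, 18, 20, 5, 22, 14, 17, 25, 22, 8, 11, 18, 21]
time_min = [56, 50, 120, 70, 80, 120, 30, 45, 55, 60, 64, 56, 76, 48, 92]


def text_scatter(xs, ys, width=26, height=13):
    """Return text rows in which '*' marks each (x, y) point.

    Assumes xs and ys each contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(cells) for cells in grid]


rows = text_scatter(amount, time_min)
for line in rows:
    print("|" + line)
print("+" + "-" * len(rows[0]))  # horizontal axis (Amount, assumed)
```

Points that fall in the same grid cell are drawn as a single mark, so the picture is only a coarse guide to the pattern in the data.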
The parameters α and β are estimated by the method of least squares. From the many straight lines
that can be drawn through a scatter diagram, we choose the one that “best fits” the data. The estimated
regression line takes the form
y = a + bx
where a and b are the estimates of α and β.
If we let e_i denote the vertical distance from a point (x_i, y_i) to the estimated regression line, then each
data point satisfies the equation
y_i = a + bx_i + e_i
The term e_i is called the residual. Figure 4 illustrates this idea.
Figure 4. The least-squares procedure minimizes the sum of the squares of the residuals e_i, drawn as the
vertical distances e1, ..., e5 from the points (x1, y1), ..., (x5, y5) to the estimated regression line.
The residual for a data point that lies above the estimated regression line is positive; for a
point that lies below the regression line, the residual is negative. If the residuals are summed, the negative
and positive values cancel one another, and for the least-squares line the sum is always zero.
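The least-squares estimates of the intercept and the slope are
$$a = \bar{y} - b\bar{x}$$
$$b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$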
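The two formulas translate directly into a few lines of code. The following is a minimal sketch, not a definitive implementation: the function name `fit_line` is hypothetical, and treating Amount as x and Time as y for the data shown earlier is an assumption made only to have numbers to work with.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n  # a = y-bar minus b times x-bar
    return a, b


# Assumed roles: Amount (ml) as x, Time (minutes) as y.
amount = [17, 11, 10, 18, 20, 5, 22, 14, 17, 25, 22, 8, 11, 18, 21]
time_min = [56, 50, 120, 70, 80, 120, 30, 45, 55, 60, 64, 56, 76, 48, 92]

a, b = fit_line(amount, time_min)
residuals = [y - (a + b * x) for x, y in zip(amount, time_min)]

print("intercept a =", round(a, 2), " slope b =", round(b, 2))
print("sum of residuals =", round(sum(residuals), 9))
```

The printed sum of residuals is zero up to floating-point rounding, which is the cancellation property described above.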
A regression line may have a positive, negative, or zero slope, as illustrated in Figure 5.
Figure 5. Positive, negative, and zero slopes for a regression line.
Example 2
The relationship between energy consumption and household income was studied, yielding the
following data on household income X (in units of P1000/month) and energy consumption Y (in
kilowatt-hours/month).
Table 1
Household   Income X (P1000/month)   Energy consumption Y (kWh/month)
1           70                       200
2           85                       175
3           45                       100
4           56                       120
5           60                       80
6           100                      350
7           93                       255
8           81                       400
9           48                       70
10          115                      450
11          90                       320
12          57                       125
∑xy = 227,045
To estimate the simple linear regression line, we estimate the slope b and the intercept a. These
estimates are b = 5.24 and a = -172.58, so the estimated regression line is
y = -172.58 + 5.24x
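The arithmetic behind these estimates can be written out from the column totals of Table 1. Only ∑xy = 227,045 is stated in the module; the totals ∑x = 900, ∑y = 2645, and ∑x² = 72,974 are computed here from the twelve rows, and the intercept uses the slope rounded to 5.24.

```latex
\begin{aligned}
b &= \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}
   = \frac{12(227{,}045) - (900)(2645)}{12(72{,}974) - (900)^2}
   = \frac{344{,}040}{65{,}688} \approx 5.24 \\[4pt]
a &= \bar{y} - b\,\bar{x}
   = \frac{2645}{12} - 5.24\left(\frac{900}{12}\right)
   \approx 220.42 - 393.00 = -172.58
\end{aligned}
```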
[Figure: graph for Example 2 with Income on the horizontal axis.]
To predict the energy consumption when the household income is P200,000 per month, we
substitute the value 200 for x in the equation:
y = -172.58 + 5.24(200) = 875.42
The predicted energy consumption is about 875.42 kilowatt-hours per month.
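Example 2 can be checked end to end with a short script. This is a sketch under stated assumptions: the variable and function names are hypothetical, and the intercept is computed from the slope rounded to two decimals because that reproduces the hand computation.

```python
# Table 1 data: income X in P1000/month, energy consumption Y in kWh/month.
income = [70, 85, 45, 56, 60, 100, 93, 81, 48, 115, 90, 57]
energy = [200, 175, 100, 120, 80, 350, 255, 400, 70, 450, 320, 125]

n = len(income)
sum_x, sum_y = sum(income), sum(energy)
sum_xy = sum(x * y for x, y in zip(income, energy))
sum_x2 = sum(x * x for x in income)

# Least-squares slope, then rounded to two decimals as in the hand computation.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b_rounded = round(b, 2)

# Intercept from the rounded slope (assumption: mirrors the module's arithmetic).
a_rounded = round(sum_y / n - b_rounded * sum_x / n, 2)


def predict(x):
    """Predicted energy consumption (kWh/month) for income x (P1000/month)."""
    return a_rounded + b_rounded * x


prediction = predict(200)
print("y =", a_rounded, "+", b_rounded, "x")
print("prediction at x = 200:", round(prediction, 2))
```

An income of 200 lies outside the observed range of 45 to 115, so this prediction is an extrapolation and should be read with caution.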
Correlation
It is often desirable to observe and measure the association between two statistical series. For
example, one may want to know whether there is a relationship between changes in the cost of living
and changes in wages; between examination grades and the intelligence quotients of a group of students;
or between the academic material retained in memory and the time elapsed since it was learned.
The relationship between two sets of data may be established and measured by means of the correlation
method.
Coefficient of Correlation
The coefficient of correlation is used as a comparative measure of association. Its value
ranges from -1.00 to +1.00. A value of 1 or -1 indicates a perfect positive or negative
linear relationship, respectively. A value of 0 indicates no linear relationship; when this happens, we
say that X and Y are uncorrelated.
The coefficient of correlation may be positive or negative. It is positive when the compared variables are
directly related, i.e., as one variable increases the other also increases, or as one variable
decreases the other also decreases. It is negative when the compared variables are inversely
related, that is to say, an increase in the value of X is accompanied by a corresponding decrease in the
value of Y. Under these circumstances the line of regression slopes downward.
Table 2
The Pearson’s Product-Moment coefficient of correlation measures the linear relationship of two
variables, defined by
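$$r = \frac{\dfrac{\sum xy}{N} - \dfrac{\sum x}{N}\cdot\dfrac{\sum y}{N}}{\sigma_x \sigma_y}$$
where
$$\sigma_x = \sqrt{\frac{\sum x^2}{N} - \left(\frac{\sum x}{N}\right)^2}, \qquad \sigma_y = \sqrt{\frac{\sum y^2}{N} - \left(\frac{\sum y}{N}\right)^2}$$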
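The same coefficient is often written in a form that uses only the raw sums. The short derivation below shows that the two forms agree: the numerator and the denominator of the definition are each multiplied by N², and N·σ is rewritten under the square root.

```latex
\begin{aligned}
r &= \frac{\dfrac{\sum xy}{N} - \dfrac{\sum x}{N}\cdot\dfrac{\sum y}{N}}{\sigma_x\,\sigma_y}
   = \frac{N\sum xy - \sum x \sum y}{\left(N\sigma_x\right)\left(N\sigma_y\right)} \\[4pt]
N\sigma_x &= \sqrt{N\sum x^2 - \left(\sum x\right)^2}, \qquad
N\sigma_y = \sqrt{N\sum y^2 - \left(\sum y\right)^2} \\[4pt]
r &= \frac{N\sum xy - \sum x \sum y}
          {\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}}
\end{aligned}
```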
Example 3
Table 3
x     y     xy     x²      y²
36    21    756    1296    441
42    18    756    1764    324
37    15    555    1369    225
31    11    341    961     121
25    15    375    625     225
28    9     252    784     81
33    10    330    1089    100
28    20    560    784     400
42    16    672    1764    256
39    11    429    1521    121
38    21    798    1444    441
40    14    560    1600    196
Totals (N = 12): ∑x = 419, ∑y = 181, ∑xy = 6384, ∑x² = 15001, ∑y² = 2931
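With N = 12, ∑x = 419, ∑y = 181, ∑xy = 6384, ∑x² = 15001, and ∑y² = 2931, the standard deviation for x is
$$\sigma_x = \sqrt{\frac{15001}{12} - \left(\frac{419}{12}\right)^2} = \sqrt{1250.08 - 1219.17} = \sqrt{30.91} = 5.56$$
The standard deviation for y,
$$\sigma_y = \sqrt{\frac{2931}{12} - \left(\frac{181}{12}\right)^2} = \sqrt{244.25 - 227.51} = \sqrt{16.74} = 4.09$$
The coefficient of correlation is
$$r = \frac{\dfrac{6384}{12} - \dfrac{419}{12}\cdot\dfrac{181}{12}}{\sigma_x \sigma_y} = \frac{532.00 - 526.66}{(5.56)(4.09)} = \frac{5.34}{22.74} = 0.23$$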
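The computation in Example 3 can be reproduced at full precision with the sketch below. One point is an assumption and is flagged in the code: the computation uses N = 12 with ∑x = 419, ∑x² = 15001, ∑y = 181, and ∑y² = 2931, and the pair (40, 14) is the only twelfth pair consistent with those totals and the other eleven pairs, so it is included here on that basis.

```python
import math

# Eleven (x, y) pairs from Table 3, plus the pair (40, 14).
# ASSUMPTION: (40, 14) is inferred from the totals used in the worked
# computation (N = 12, sum x = 419, sum x^2 = 15001, sum y = 181, sum y^2 = 2931).
pairs = [
    (36, 21), (42, 18), (37, 15), (31, 11), (25, 15), (28, 9),
    (33, 10), (28, 20), (42, 16), (39, 11), (38, 21), (40, 14),
]

n = len(pairs)
sum_x = sum(x for x, _ in pairs)
sum_y = sum(y for _, y in pairs)
sum_xy = sum(x * y for x, y in pairs)
sum_x2 = sum(x * x for x, _ in pairs)
sum_y2 = sum(y * y for _, y in pairs)

# Numerator of r and the two population standard deviations.
numerator = sum_xy / n - (sum_x / n) * (sum_y / n)
sd_x = math.sqrt(sum_x2 / n - (sum_x / n) ** 2)
sd_y = math.sqrt(sum_y2 / n - (sum_y / n) ** 2)

r = numerator / (sd_x * sd_y)
print("sd_x =", round(sd_x, 2), " sd_y =", round(sd_y, 2), " r =", round(r, 2))
```

Carried at full precision the result is r ≈ 0.23, a weak positive linear relationship. Small differences from a hand computation come from rounding the intermediate values.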
The strength of the relationship between two ranked variables can be measured by Spearman’s rank-
order coefficient of correlation, defined by
$$r_s = 1 - \frac{6\sum d_i^2}{n\left(n^2 - 1\right)}$$
where d_i is the difference between the two ranks of the i-th item and n is the number of paired ranks.
Example 4
The tables below show the scores and the judge’s ranks of 8 contestants.
Contestant 1 2 3 4 5 6 7 8
Judge’s Rank 3 1 6 2 4 7 8 5
Score 26 40 52 25 20 60 37 48
Table 5
Ranks of the Data in Example 4
The scores are ranked from highest (rank 1) to lowest (rank 8).
Contestant          1   2   3   4   5   6   7   8
Judge’s Rank (xi)   3   1   6   2   4   7   8   5
Score Rank (yi)     6   4   2   7   8   1   5   3
Table 6
Differences and Squares of Differences for the Contestant Ranks
Contestant   xi   yi   di = xi - yi   di²
1            3    6    -3             9
2            1    4    -3             9
3            6    2    4              16
4            2    7    -5             25
5            4    8    -4             16
6            7    1    6              36
7            8    5    3              9
8            5    3    2              4
Total                                 ∑di² = 124
$$r_s = 1 - \frac{6\sum d_i^2}{n\left(n^2 - 1\right)} = 1 - \frac{6(124)}{8(64 - 1)} = 1 - \frac{744}{504} = 1 - 1.476 = -0.476$$
There is a moderate inverse relationship between the judge’s ranks and the contestants’ scores.
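The ranking and the rank-difference computation of Example 4 can be sketched in code. The helper name `rank_descending` is hypothetical, and the sketch assumes that the highest score receives rank 1 and that there are no tied scores, which holds for these eight contestants.

```python
# Data from Example 4.
judge_rank = [3, 1, 6, 2, 4, 7, 8, 5]
score = [26, 40, 52, 25, 20, 60, 37, 48]


def rank_descending(values):
    """Rank values so that the largest gets rank 1 (assumes no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


score_rank = rank_descending(score)

# Rank differences d_i = x_i - y_i and their squares.
d = [x - y for x, y in zip(judge_rank, score_rank)]
sum_d2 = sum(di * di for di in d)

n = len(d)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print("score ranks:", score_rank)
print("sum of d^2 =", sum_d2, " r_s =", round(r_s, 3))
```

A quick check on any rank table is that the differences d_i sum to zero, since both sets of ranks add up to n(n + 1)/2. For these data the sketch gives ∑d² = 124 and r_s ≈ -0.476.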
SUGGESTED ENRICHMENT ACTIVITY
Watch video clips on relevant topics on YouTube or elsewhere on the internet.
Directions: Answer the following problems on a separate answer sheet. Use black pen only.
PROBLEMS:
1. A student wants to determine how grades in College Algebra and in Statistics are related. From a
random sample of eight students, she obtained the following scores.
Calculate the coefficient of correlation using the Pearson product-moment coefficient of
correlation.
2. The values in X below are hours spent studying, and the values in Y are grades on a test.
X = {3.2, 3.0, 1.0, 2.5, 1.9, 1.6, 3.1, 3.5, 4.2, 3.0}
Y = { 90, 88, 57, 86, 79, 71, 84, 97, 90, 91}
SOLUTIONS TO PROBLEMS
REVIEW QUESTIONS
REGRESSION AND CORRELATION ANALYSIS
16. The estimate of β in the regression equation Y=α+βX+e by the method of least square is:
A. Biased
B. Unbiased
C. Consistent
D. Efficient
17. Which of the following is not one of the assumptions required for the t test for
determining whether the correlation is significant?
A. The data are interval or ratio level.
B. The variances are equal or σ1² = σ2²
C. The two variables are distributed as a bivariate normal distribution.
D. All three are required assumptions.
18. Which of the following is not true regarding the error term ε?
A. Individual values of the error term εi are statistically dependent on each other.
B. For a given value of x, there can exist many values of εi.
C. The distribution of possible εi values for any x value is normal.
D. The distribution of possible εi values have equal variances for all values of x.
19. An investigator reports that the arithmetic mean of two regression coefficients of a regression
line is 0.7 and the correlation coefficient is 0.75. The investigation results are:
A. Valid
B. Invalid
C. Inconclusive
D. None of these
20. Homogeneity of three or more population correlation coefficients can be tested by
A. t-test
B. Z-test
C. χ²-test
D. F-test
21. In multiple linear regression analysis, the square root of Mean Squared Error (MSE) is called the:
A. Multiple correlation coefficient
B. Standard error of estimate
C. Coefficient of determination
D. None of these
PROBLEMS
1. Calculate the test statistic t for a correlation hypothesis test when the sample correlation coefficient
is r = 0.889 and the sample size is n = 10.
A. 5.337
B. 5.491
C. 5.519
D. 5.664
2. Compute the slope of the regression equation based on these sample data.
X = {3.2, 3.0, 1.0, 2.5, 1.9, 1.6, 3.1, 3.5, 4.2, 3.0}
Y = { 90, 88, 57, 86, 79, 71, 84, 97, 90, 91}
A. 9.638
B. 10.144
C. 10.835
D. 11.169
3. Compute the y intercept of the regression equation based on these sample data.
X = {3.2, 3.0, 1.0, 2.5, 1.9, 1.6, 3.1, 3.5, 4.2, 3.0}
Y = { 90, 88, 57, 86, 79, 71, 84, 97, 90, 91}
A. 52.338
B. 54.045
C. 55.159
D. 56.779