Simple Correlation and Regression Analysis
[Figure panels: Positive relationship; Negative relationship (horizontal axis: X-axis)]
Simple Correlation And Regression Analysis Madhab Bhatta
[Figure panels: Positive curvilinear relationship; Negative curvilinear relationship (axes: X-axis, Y-axis)]
ii) Karl Pearson's correlation coefficient (r): The coefficient of correlation between
two variables, say X and Y, defined by Karl Pearson measures the strength of the linear
relationship between these two variables. It is denoted by r, its value always lies
between -1 and +1, and it is calculated by using the following relation.
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}}$$
Where
r = Correlation coefficient; its value always lies between -1 and +1.
n = Number of pairs of data.
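As a rough numerical check, the sketch below computes Pearson's r for the broker data of Problem 14 (clients X, sales Y). It assumes the usual raw-sum form r = (nΣXY − ΣXΣY) / (√(nΣX² − (ΣX)²) · √(nΣY² − (ΣY)²)); the function name pearson_r is only illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw sums (illustrative helper, not from the notes)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Broker data from Problem 14: new clients (X) and sales in $1,000 (Y).
clients = [27, 11, 40, 33, 15, 15, 25, 36, 28, 30]
sales = [52, 37, 64, 55, 29, 34, 58, 59, 44, 48]

r = pearson_r(clients, sales)
print(f"r = {r:.4f}")
```

For these data r is about 0.88, which indicates a strong positive linear relationship between new clients and sales.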
Test statistics:
When regression analysis is used to measure the relationship between one dependent
variable (y) and one independent variable (x), it is called simple regression
analysis. The simple linear regression model of y on x1 is
$$y = \beta_0 + \beta_1 x_1 + e \qquad (1)$$
Where
y = dependent variable.
x1 = independent variable.
$\beta_0$ = y-intercept for the population.
$\beta_1$ = slope for the population, i.e. the regression coefficient of the dependent variable (y) on
the independent variable (x1).
e = error term, the difference between the observed and estimated values of the dependent
variable (y).
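To obtain the best fit of the regression model of y on x, we need the values of $\beta_0$ and $\beta_1$, which
are unknown. By using the principle of least squares, we can get the two normal equations of
regression model (1), whose solution gives the estimates b0 and b1.
The two normal equations of regression line (1) are
$$\sum y = n b_0 + b_1 \sum x_1$$
$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2$$
Or
$$b_1 = \frac{n\sum x_1 y - \sum x_1 \sum y}{n\sum x_1^2 - (\sum x_1)^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}_1$$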
After finding the values of b0 and b1, we get the required fitted regression model of y on x as
$$\hat{y} = b_0 + b_1 x_1$$
Where
$\hat{y}$ = estimated value of the dependent variable (y) for a given value of the independent variable
(x1).
x1 = independent variable.
b0 = estimated value of $\beta_0$, i.e. the y-intercept.
b1 = estimated value of $\beta_1$, i.e. the regression coefficient of y on x1, or the slope of the regression line.
n = Number of pairs of data.
$\bar{x}_1$ = mean of the independent variable.
$\bar{y}$ = mean of the dependent variable.
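A minimal sketch of the least squares fit, assuming the usual closed-form solution of the two normal equations: b1 = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and b0 = ȳ − b1·x̄. The function name fit_line is hypothetical; the data are the shelf space and weekly sales figures from Problem 16.

```python
def fit_line(xs, ys):
    """Return (b0, b1) of the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b1 = (n * sum_xy - sum_x * sum_y) / denominator
    b0 = sum_y / n - b1 * sum_x / n  # b0 = y-bar - b1 * x-bar
    return b0, b1

# Problem 16: shelf space in feet (X) and weekly sales in hundreds of $ (Y).
shelf_space = [5, 5, 5, 10, 10, 10, 15, 15, 15, 20, 20, 20]
weekly_sales = [1.6, 2.2, 1.4, 1.9, 2.4, 2.6, 2.3, 2.7, 2.8, 2.6, 2.9, 3.1]

b0, b1 = fit_line(shelf_space, weekly_sales)
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```

With these data the fitted line is approximately ŷ = 1.45 + 0.074x, and the estimates satisfy the first normal equation: Σy = 28.5 = 12(1.45) + 0.074(150).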
a. The coefficient b0 (the estimated value of $\beta_0$) represents the average value of the
dependent variable (y) when the value of the independent variable (x1) is zero.
For example, if the fitted model is $\hat{y} = 15 - 3x_1$, then b0 = 15, which means the average
value of the dependent variable (y) is 15 when x1 = 0.
b. The regression coefficient b1 (the estimated value of $\beta_1$) measures the average rate of
increase or decrease in the value of the dependent variable (y) when the value of the
independent variable (x1) increases by one unit.
For example, in the above model, b1 = -3, which means the value of the dependent variable
(y) decreases by 3 on average when the value of the independent variable (x1) increases by 1.
Note: If instead b1 = 3, the value of the dependent variable (y) increases by 3 on average
when the value of the independent variable (x1) increases by 1.
The error term (residual) for an observation is
$$e = y - \hat{y}$$
Where
e = Error term.
y = Observed value of the dependent variable.
$\hat{y}$ = Estimated value of the dependent variable for a given value of the independent variable.
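To make the error term concrete, the short sketch below evaluates e = y − ŷ for a single observation. It uses the regression line ŷ = 22.2123 + 2.0218X quoted in Problem 7 and the observation for car 6 (x = 8 options, y = 41 days); the helper names predict and residual are illustrative only.

```python
def predict(b0, b1, x):
    """Estimated value y-hat = b0 + b1 * x."""
    return b0 + b1 * x

def residual(b0, b1, x, y):
    """Error term e = observed y minus estimated y-hat."""
    return y - predict(b0, b1, x)

# Line quoted in Problem 7; car 6 has x = 8 options and y = 41 days.
b0, b1 = 22.2123, 2.0218
y_hat_6 = predict(b0, b1, 8)
e_6 = residual(b0, b1, 8, 41)
print(f"y-hat = {y_hat_6:.4f}, e = {e_6:.4f}")
```

A positive residual such as this one means the observed delivery time is longer than the line predicts.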
5 Measures of Variation:
To examine the ability of the independent variable to predict the dependent variable (y) in the
regression model, several measures of variation need to be developed. In a regression analysis,
the total variation or total sum of squares (SST) is subdivided into explained variation or
regression sum of squares (SSR) and unexplained variation or error sum of squares (SSE).
These different measures of variation are shown in the following figure.
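[Figure: total variation (SST) about $\bar{y}$ split into explained variation (SSR) and unexplained variation (SSE) around the fitted line $\hat{y} = b_0 + b_1 x_1$; axes: X-axis, Y-axis]
Where,
$$SST = \sum (y - \bar{y})^2, \qquad SSR = \sum (\hat{y} - \bar{y})^2, \qquad SSE = \sum (y - \hat{y})^2$$
so that SST = SSR + SSE.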
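The split of total variation into explained and unexplained parts can be checked numerically. The sketch below assumes the standard definitions SST = Σ(y − ȳ)², SSR = Σ(ŷ − ȳ)² and SSE = Σ(y − ŷ)², and uses the Problem 16 data; the function name variation_measures is hypothetical.

```python
def variation_measures(xs, ys):
    """Return (SST, SSR, SSE) for the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope and intercept in deviation form (equivalent to the normal equations).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
    ssr = sum((f - y_bar) ** 2 for f in fitted)             # explained variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))     # unexplained variation
    return sst, ssr, sse

# Problem 16: shelf space in feet (X) and weekly sales in hundreds of $ (Y).
shelf_space = [5, 5, 5, 10, 10, 10, 15, 15, 15, 20, 20, 20]
weekly_sales = [1.6, 2.2, 1.4, 1.9, 2.4, 2.6, 2.3, 2.7, 2.8, 2.6, 2.9, 3.1]

sst, ssr, sse = variation_measures(shelf_space, weekly_sales)
r_squared = ssr / sst
print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}, r^2 = {r_squared:.3f}")
```

Here SST ≈ 3.0025, SSR ≈ 2.0535 and SSE ≈ 0.949, so roughly 68% of the variation in weekly sales is explained by shelf space.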
e. Approximate prediction interval: This interval gives the range within which the actual value of
the dependent variable (Y) lies, at a stated level of confidence, for a given value of the independent variable.
$$\hat{y} \pm t_{\alpha/2,\,n-2}\, S_e$$
Where
$\hat{y}$ = Estimated value of the dependent variable for a given value of the independent
variable.
$S_e$ = Standard error of the estimate.
$S_{b_0}$ = Standard error of the y-intercept (b0).
$S_{b_1}$ = Standard error of the regression coefficient (b1).
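A hedged sketch of an approximate prediction interval, assuming the common form ŷ ± t·Se with Se = √(SSE / (n − 2)) and n − 2 degrees of freedom. The inputs are the quantities quoted in Problem 7 (n = 15, Σ(y − ŷ)² = 153.421, ŷ = 22.2123 + 2.0218X), and the critical value 2.160 is the t-table entry for 13 degrees of freedom at the two-tailed 5% level. The function names are illustrative.

```python
import math

def standard_error_of_estimate(sse, n):
    """S_e = sqrt(SSE / (n - 2)) for simple regression (assumed form)."""
    return math.sqrt(sse / (n - 2))

def approx_prediction_interval(y_hat, se, t_crit):
    """Approximate interval y-hat +/- t * S_e; t_crit is read from a t-table."""
    margin = t_crit * se
    return y_hat - margin, y_hat + margin

# Quantities quoted in Problem 7.
b0, b1 = 22.2123, 2.0218
n, sse = 15, 153.421
t_crit = 2.160  # t-table value, 13 degrees of freedom, two-tailed 5%

se = standard_error_of_estimate(sse, n)
y_hat = b0 + b1 * 14  # a car with 14 options ordered
lower, upper = approx_prediction_interval(y_hat, se, t_crit)
print(f"S_e = {se:.4f}, y-hat = {y_hat:.4f}, interval = ({lower:.2f}, {upper:.2f})")
```

Passing the critical value in as a parameter keeps the sketch free of third-party libraries; with a statistics package available, the value could be computed instead of looked up.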
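Null hypothesis (H0): $\beta_1 = 0$ (This means there is no linear relationship between the dependent
and independent variables.)
Alternative hypothesis (H1): $\beta_1 \neq 0$ (This means there is a significant linear relationship
between the dependent and independent variables.) (Two-tailed)
Test Statistics:
$$t = \frac{b_1 - \beta_1}{S_{b_1}}$$
Or, under H0,
$$t = \frac{b_1}{S_{b_1}}$$
which follows the t-distribution with n - 2 degrees of freedom.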
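The test of the regression coefficient can be sketched as below. It assumes the usual standard error of the slope, S_b1 = Se / √Σ(x − x̄)², and the statistic t = (b1 − β1) / S_b1, and it plugs in the figures quoted in Problem 7 (b1 = 2.0218, Σ(y − ŷ)² = 153.421, Σ(x − x̄)² = 657.33, n = 15). The function name and the tabulated critical value 2.160 (13 degrees of freedom, two-tailed 5%) are supplied for illustration.

```python
import math

def slope_t_statistic(b1, se, ssx, beta1_null=0.0):
    """Return (t, S_b1) with S_b1 = S_e / sqrt(SSX) (assumed standard form)."""
    s_b1 = se / math.sqrt(ssx)
    return (b1 - beta1_null) / s_b1, s_b1

# Quantities quoted in Problem 7.
n = 15
b1 = 2.0218
sse = 153.421   # sum of squared residuals
ssx = 657.33    # sum of squared deviations of x from its mean
t_crit = 2.160  # t-table value, n - 2 = 13 degrees of freedom, two-tailed 5%

se = math.sqrt(sse / (n - 2))
t_stat, s_b1 = slope_t_statistic(b1, se, ssx)
reject_h0 = abs(t_stat) > t_crit
print(f"S_b1 = {s_b1:.4f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Since the computed t of about 15.1 far exceeds 2.160, H0 would be rejected for these figures, which points to a significant linear relationship.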
If the regression coefficient (b1) is negative then take the negative sign.
If the regression coefficient (b1) is positive then take the positive sign.
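Adjusted coefficient of determination ($r^2_{adj}$): The adjusted coefficient of determination
is calculated by using the following relation.
$$r^2_{adj} = 1 - (1 - r^2)\,\frac{n - 1}{n - k - 1}$$
where k is the number of independent variables (k = 1 in simple regression).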
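As a small worked example, the sketch below evaluates the adjusted coefficient of determination, assuming the standard relation r²_adj = 1 − (1 − r²)(n − 1) / (n − k − 1) with k independent variables. The inputs r = 0.9726 and n = 15 are the values quoted in Problem 7; the function name is hypothetical.

```python
def adjusted_r_squared(r_squared, n, k=1):
    """Adjusted r^2 = 1 - (1 - r^2)(n - 1)/(n - k - 1); k = 1 for simple regression."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

r = 0.9726  # correlation coefficient quoted in Problem 7
n = 15
r2 = r ** 2
r2_adj = adjusted_r_squared(r2, n)
print(f"r^2 = {r2:.4f}, adjusted r^2 = {r2_adj:.4f}")
```

The adjustment lowers r² slightly (from about 0.946 to about 0.942 here) because it penalises the degrees of freedom used by the model.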
i) Normality of errors: This assumption requires that the errors around the regression
line be normally distributed for each value of X (the independent variable). As long as the
distribution of the errors around the regression line for each value of the independent
variable is not extremely different from a normal distribution, inference about the
line of regression and the regression coefficients will not be seriously affected.
ii) Homoscedasticity: This assumption requires that the variation around the line of
regression be constant for all values of the independent variable (X). This means that the
errors vary by the same amount when X is a low value as when X is a high value. The
homoscedasticity assumption is important for using the least squares method to fit the
regression line. If there are serious departures from this assumption, either data
transformations or the weighted least squares method can be applied.
iii) Independence of errors: This assumption requires that the errors around the regression
line be independent for each value of explanatory variables. This is particularly
important when data are collected over a period of time. In such situation, errors for a
specific time period are often correlated with those of the previous time period.
11 Residual analysis:
The residual analysis is a graphical method to evaluate whether the regression model that has been
fitted to the data is an appropriate model. In addition, residual analysis enables the detection of
potential violations of the assumptions of the regression model.
The aptness of the fitted regression model is evaluated by plotting the residuals on the
vertical axis against the corresponding X values of the independent variable along the x-axis. If
the fitted model is appropriate for the data, there will be no apparent pattern in this plot.
However, if the fitted model is not appropriate, there will be a relationship between the X values
and the residuals (e).
By plotting a histogram, box-and-whisker plot, or stem-and-leaf display of the residuals,
we can assess the normality of the errors.
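Before drawing the residual plot, it can help to tabulate the residuals against X. The sketch below does this for the Problem 16 data using a plain-text listing rather than a graphics library; the function name residual_table is hypothetical and the layout is only one possible choice.

```python
def residual_table(xs, ys):
    """Fit y-hat = b0 + b1 * x by least squares and return a list of (x, e) pairs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    return [(x, y - (b0 + b1 * x)) for x, y in zip(xs, ys)]

# Problem 16: shelf space in feet (X) and weekly sales in hundreds of $ (Y).
shelf_space = [5, 5, 5, 10, 10, 10, 15, 15, 15, 20, 20, 20]
weekly_sales = [1.6, 2.2, 1.4, 1.9, 2.4, 2.6, 2.3, 2.7, 2.8, 2.6, 2.9, 3.1]

table = residual_table(shelf_space, weekly_sales)
for x, e in table:
    # One row per store: X value and signed residual.
    print(f"X = {x:>2}   e = {e:+.2f}")

total = sum(e for _, e in table)
print(f"sum of residuals = {total:.6f}")
```

Least squares residuals always sum to (numerically) zero, so the useful signal is the pattern: residuals that trend or fan out as X increases would suggest the linear model or the constant-variance assumption is questionable.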
5. Sales of major appliances vary with the new housing market: when new home sales are
good, so are the sales of dishwashers, washing machines, driers, and refrigerators. A
trade association compiled the following historical data (in thousands of units) on major
appliance sales and housing starts:
Housing starts (thousands): 2.0 2.5 3.2 3.6 3.3 4.0 4.2 4.6 4.8
Appliance sales (thousands): 5 5.5 6 7 7.2 7.7 8.4 9 9.7
a. Develop an equation for the relationship between appliance sales (in thousands) and
housing starts (in thousands)
b. Interpret the slope of the regression line.
c. Compute and interpret the standard error of estimate.
d. Compute the 90% prediction interval for the appliance sales when housing starts are 8.0 (thousands).
e. Compute the coefficient of determination and coefficient of correlation and interpret the
value.
6. A study by the department of transportation on the effect of bus ticket price upon the
number of passengers produced the following results
Ticket price (Rs.): 25 30 35 40 45 50 55 60
Passenger per 100 miles: 800 780 780 660 640 600 620 620
a. Develop the estimating equation that best describes these data.
b. Interpret the regression coefficient (slope of the regression line).
c. Predict the number of passengers per 100 miles if the ticket price were Rs. 50. Also
obtain the 95% approximate prediction interval for a ticket price of Rs. 50.
7. A statistician for an American automobile manufacturer would like to develop a model for
predicting delivery time (the days between the ordering of the car and the actual
delivery of the car) of custom-ordered new automobiles. A random sample of 15 cars is
selected, with the results summarized in the following table
Car:                         1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
No. of options ordered (x):  3  4  4  7  7  8  9  11 12 12 14 16 20 23 25
Delivery time in days (y):   25 32 26 38 34 41 39 46 44 51 58 53 64 66 70
a. Given a correlation coefficient, r = 0.9726 between the number of options ordered and
the delivery time in days, examine if this linear relationship is significant at the 5%
level of significance.
b. Given the linear regression line $\hat{y}$ = 22.2123 + 2.0218X, compute the residual for car 6.
c. Next, given that $\sum (y - \hat{y})^2$ = 153.421, compute the standard error of the estimate $S_e$
(or $S_{y.x}$) and interpret its meaning.
d. Given that $\sum (x - \bar{x})^2$ = 657.33, test the regression coefficient at the 5% level of
significance.
e. Compute the 95% prediction interval of the delivery time for a car with 14 options
ordered.
f. Compute the 95% confidence interval estimate of the slope (regression coefficient)
8. Fitting a straight line to a set of data yields the regression equation: $\hat{y}$ = 2 + 5X
a. Interpret the meaning of the y-intercept b0 and the slope of the regression line b1.
13. In a regression problem with a sample size of 17, the slope was found to be 3.71 and
the standard error of estimate 28.654. The quantity $\sum X^2 - n\bar{X}^2$ = 871.56, where X
is an independent variable.
a) Find the standard error of the regression coefficient (slope).
b) Construct a 95% confidence interval for the population slope and
interpret.
14. The managers of a brokerage firm are interested in finding out if the number of
new clients a broker brings into the firm affects the sales generated by the broker.
They sample 10 brokers and determine the number of new clients they enrolled in
the last year and their sales amounts in thousands of dollars. These data are
presented in the table that follows.
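Broker   Clients (X)   Sales (Y)
1        27            52
2        11            37
3        40            64
4        33            55
5        15            29
6        15            34
7        25            58
8        36            59
9        28            44
10       30            48
Calculation shows that:
n = 10, $\sum X$ = 260, $\sum Y$ = 480, $\sum X^2$ = 7594, $\sum Y^2$ = 24276, $\sum XY$ = 13377
SSX = $\sum (X - \bar{X})^2$ = 834
SST = $\sum (Y - \bar{Y})^2$ = 1236
SSE = $\sum (Y - \hat{Y})^2$ = 271.241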
a) Assuming a linear relationship, what is the least square prediction for the amount of
sales (in $ 1,000) for a person who brings 25 new clients into the firm?
b) Calculate the standard error of estimate and interpret the result.
c) Suppose the managers of the brokerage firm want to obtain a 99% prediction interval
for the sales made by a broker who has brought into the firm 18 new clients. What
would be the prediction interval for this problem?
15. Coca-Cola is studying the effect of its latest advertising campaign. People chosen at random
were called and asked how many cans of Coca-Cola they had bought in the past
week and how many Coca-Cola advertisements they had either read or seen in the
past week. The data collected from different people are as follows
People:             1  2  3  4  5  6  7  8  9  10 11 12
Number of ads (x):  3  7  6  6  10 12 12 13 12 13 14 15
Calculation shows that
, , , ,
Find the coefficient of correlation between the number of ads and cans purchased, and examine if
this linear relationship is significant at the 5% level of significance.
a. Find the linear regression line. Calculate the standard error of the estimate, $S_{y.x}$, and
interpret its meaning.
b. Test the regression coefficient at the 1% level of significance.
c. Compute the 90% prediction interval of the cans purchased for person 7.
16. The marketing manager of a large supermarket chain would like to determine the effect of
shelf space on the sales of pet food. A random sample of 12 equal-sized stores is selected
with the following results
Store:                             1   2   3   4   5   6   7   8   9   10  11  12
Weekly sales, Y (hundreds of $):   1.6 2.2 1.4 1.9 2.4 2.6 2.3 2.7 2.8 2.6 2.9 3.1
Shelf space, X (feet):             5   5   5   10  10  10  15  15  15  20  20  20
Calculation shows that:
$\sum X$ = 150, $\sum X^2$ = 2250, $\sum Y$ = 28.5, $\sum Y^2$ = 70.69, $\sum XY$ = 384
a. Assuming a linear relationship, use the least squares methods to find the best fitting
regression equation and hence compute the residual for store 6.
b. What percentage of the total variation in sales is explained by shelf space?
c. Set up 95% confidence interval estimate of the average weekly sales for all stores
that have 10 feet of shelf space for pet food.
23. An operations manager is interested in predicting costs C (in '000 Rs) based on the amount of
raw material input R (in '00 pounds) for a jeans manufacturer. If the slope is significantly greater
than 0.5 in the following sample data, then there is something wrong with the production process
and the assembly-line machine should be adjusted. At the 0.05 significance level, should the
machinery be adjusted? State explicit hypotheses and an appropriate conclusion.
C 10 7 5 6 7 6
R 25 20 16 17 19 18