
Simple Correlation and Regression Analysis

1. The document discusses simple correlation and regression analysis. It defines correlation as measuring the relationship between two variables, and distinguishes between positive and negative correlation. 2. Methods for measuring correlation are described, including scatter plots and Karl Pearson's correlation coefficient (r) which ranges from -1 to 1. Positive values indicate positive correlation and negative values indicate negative correlation. 3. Simple linear regression analysis is introduced as a method to predict a dependent variable (y) based on an independent variable (x). The regression line and coefficients are defined.

Uploaded by

Bibhush Maharjan
Copyright: © All Rights Reserved

Simple Correlation And Regression Analysis Madhab Bhatta

Simple Correlation And Regression Analysis
Simple correlation:
1 Introduction
Correlation is a statistical method used to determine whether a relationship exists between two
or more variables. When it measures the relationship between exactly two variables, it is called
simple correlation. When the values of both variables move in the same direction at the same
time (both increase or both decrease), the correlation is called positive. When the value of one
variable increases while the value of the other variable decreases, the correlation is called
negative.

2 Methods of measuring the correlation between two variables, say ‘X’ and ‘Y’
i) The Scatter diagram (Scatter Plot)
ii) Karl Pearson’s correlation coefficient (r)

i) Scatter diagram: This is the graphical method of measuring the relationship between two
variables. In a scatter plot, pairs of data are plotted on a graph, with the values of one
variable along the X-axis and the values of the other variable along the Y-axis. The set of
points so formed is called the scatter plot. If the plotted points show some trend, either
upward or downward, the two variables are said to be correlated; if the plotted points do not
show any trend, the two variables are said to be uncorrelated. Some types of relationship
obtained by scatter plot are shown in the following figure.
[Figure: four scatter plots, each with X on the horizontal axis and Y on the vertical axis: positive relationship, negative relationship, positive curvilinear relationship, negative curvilinear relationship]

ii) Karl Pearson’s correlation coefficient (r): The coefficient of correlation between two
variables, say ‘X’ and ‘Y’, was defined by Karl Pearson to measure the strength of the linear
relationship between them. It is denoted by ‘r’ and is calculated by using the following relation.

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}}$$

Where
r = Correlation coefficient; its value always lies between -1 and +1.
n = Number of pairs of data.


3 Interpretation of correlation coefficient (r)


If r = 0, there is no linear correlation (relation) between the two variables.
If r > 0, there is positive correlation between the two variables.
If r < 0, there is negative correlation between the two variables.
If r = -1, there is a perfect negative linear relationship between the two variables.
If r = +1, there is a perfect positive linear relationship between the two variables.
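
As a rough illustration of how r is computed and read, the sketch below evaluates Pearson's coefficient from raw sums, using the six (X, Y) pairs of Problem 4 in the problem set. The function name `pearson_r` is a hypothetical helper chosen for this example, not something these notes define.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from raw sums (hypothetical helper for illustration)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Data taken from Problem 4 of the problem set
X = [16, 6, 10, 5, 12, 14]
Y = [-4.4, 8, 2.1, 8.7, 0.1, -2.9]

r = pearson_r(X, Y)
# The sign of r gives the direction of the correlation
direction = "positive" if r > 0 else "negative" if r < 0 else "no"
print(f"r = {r:.4f} ({direction} correlation)")
```

For these data r is close to -1, which points to a strong negative linear relationship.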

4 Test of significance for correlation coefficient:


The test statistic for the significance of the correlation coefficient is obtained under the
assumption that, in the null hypothesis, the population correlation coefficient (ρ) is zero.
Thus, the null and alternative hypotheses are set as
Null hypothesis (H0): ρ = 0 (There is no correlation between the x and y variables in the
population.)
Alternative hypothesis (H1): ρ ≠ 0 (There is a significant correlation between the x and y
variables in the population.) (Two tailed)

Test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

It follows Student’s t-distribution with (n-2) degrees of freedom.

Decision: If the absolute value of the calculated test statistic (|tcal|) is less than the
tabulated value (ttab), the null hypothesis is accepted; otherwise the null hypothesis is rejected, i.e.
If |tcal| < tα, n-2, the null hypothesis is accepted. Otherwise the alternative hypothesis is accepted.
Where
tα, n-2 = tabulated value of ‘t’ at (n-2) degrees of freedom and ‘α’ level of significance, obtained
from the two-tailed t-table.
n = number of pairs of data.
α = level of significance
r = sample correlation coefficient
ρ = population correlation coefficient.
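
A minimal numeric sketch of this test, using the standard statistic t = r·sqrt(n-2)/sqrt(1-r²) and the values stated in Problem 7(a) of the problem set (r = 0.9726, n = 15). The critical value 2.160 is the usual two-tailed t-table entry for 13 degrees of freedom at the 5% level; it is hard-coded here as an assumption rather than computed, and `t_for_correlation` is a hypothetical helper name.

```python
import math

def t_for_correlation(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.9726, 15   # values given in Problem 7(a)
t_tab = 2.160       # assumed t-table value: two-tailed, alpha = 0.05, df = 13

t_cal = t_for_correlation(r, n)
# Two-tailed decision: compare the absolute value with the tabulated value
decision = "reject H0" if abs(t_cal) >= t_tab else "accept H0"
print(f"t_cal = {t_cal:.2f}, df = {n - 2}, decision: {decision}")
```

Taking the absolute value matters here: a strongly negative r gives a large negative t, which should also lead to rejection in a two-tailed test.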

Simple Regression Analysis:


1 Introduction
Regression analysis describes the relationship between two or more variables. It is used to
predict or estimate the value of a dependent (response or endogenous) variable based on the
known values of independent (explanatory or regressor) variables. The unknown variable which we
have to estimate or predict is called the dependent variable and is denoted by ‘y’. The variable
whose value is given in order to estimate the value of the dependent variable (y) is called the
independent variable and is denoted by ‘x’.


When regression analysis is used to describe the relationship between one dependent variable (y)
and one independent variable (x), it is called simple regression analysis.

2 Regression line (Regression Model)


A simple linear regression line between one dependent variable (y) and one independent
variable (x1) is written as

$$y = \beta_0 + \beta_1 x_1 + e \qquad (1)$$

Where
y = dependent variable
x1 = independent variable.
β0 = y-intercept for the population.
β1 = slope for the population, i.e. the regression coefficient of the dependent variable (y) on
the independent variable (x1).
e = error term, the difference between the observed and estimated values of the dependent
variable (y).

To obtain the best fit of the regression model of y on x1, we need the values of β0 and β1, which
are unknown. They are estimated by b0 and b1. By using the principle of least squares, we get
two normal equations of regression model (1).
The two normal equations of regression line (1) are

$$\sum y = n b_0 + b_1 \sum x_1$$
$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2$$

By solving these two normal equations we get the values of b0 and b1 as

$$b_1 = \frac{n\sum x_1 y - \sum x_1 \sum y}{n\sum x_1^2 - (\sum x_1)^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}_1$$

Or

$$b_1 = \frac{\sum (x_1 - \bar{x}_1)(y - \bar{y})}{\sum (x_1 - \bar{x}_1)^2}$$

After finding the values of b0 and b1, we get the required fitted regression model of y on x1 as

$$\hat{y} = b_0 + b_1 x_1$$

Where
ŷ = estimated value of the dependent variable (y) for some given value of the independent
variable (x1)
x1 = independent variable.
b0 = estimated value of β0, i.e. the y-intercept.
b1 = estimated value of β1, i.e. the regression coefficient of y on x1 or slope of the regression line.
n = Number of pairs of data.
x̄1 = mean of the independent variable
ȳ = mean of the dependent variable.
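
The estimates b0 and b1 can be computed directly from the data sums. The sketch below uses the standard closed-form least-squares solution, b1 = (nΣxy - ΣxΣy)/(nΣx² - (Σx)²) and b0 = ȳ - b1·x̄, on the data of Problem 4 in the problem set. The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the line y-hat = b0 + b1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n   # b0 = y-bar - b1 * x-bar
    return b0, b1

# Data taken from Problem 4 of the problem set
X = [16, 6, 10, 5, 12, 14]
Y = [-4.4, 8, 2.1, 8.7, 0.1, -2.9]

b0, b1 = fit_line(X, Y)
print(f"y-hat = {b0:.4f} + ({b1:.4f}) * x")
```

A quick sanity check on any fit of this kind is the first normal equation: Σy should equal n·b0 + b1·Σx up to rounding.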


3 Interpreting the regression coefficients:


Suppose we have the fitted simple regression model ŷ = 15 - 3x1.

a. The coefficient ‘b0’ (estimated value of β0) represents the average value of the
dependent variable (y) when the value of the independent variable (x1) is zero.
For example, in the above model, b0 = 15; this means the average value of the
dependent variable (y) is 15 when x1 = 0.
b. The regression coefficient ‘b1’ (estimated value of β1) measures the average rate of
increase or decrease in the dependent variable (y) when the independent variable (x1)
increases by one unit.
For example, in the above model, b1 = -3; this means the value of the dependent variable
(y) decreases by 3, on average, when the independent variable (x1) increases by 1.

Note: If in the above model b1 = 3, the value of the dependent variable (y) increases by 3,
on average, when the value of the independent variable (x1) increases by 1.

4 Error term (Residual):


The difference between the observed and estimated values of the dependent variable (y) is
called the error or residual, and it is denoted by ‘e’.

$$e = y - \hat{y}$$

Where
e = Error term
y = Observed value of the dependent variable.
ŷ = Estimated value of the dependent variable for a given value of the independent variable.
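
A small sketch of the residual calculation, e = y - ŷ, using the fitted line quoted in Problem 7 of the problem set (intercept 22.2123, slope 2.0218) and car 1 from that problem's table (3 options ordered, 25 days delivery). The function name `residual` is hypothetical.

```python
def residual(y_observed, x, b0, b1):
    """Residual e = observed y minus the estimate y-hat = b0 + b1 * x."""
    y_hat = b0 + b1 * x
    return y_observed - y_hat

b0, b1 = 22.2123, 2.0218      # fitted line given in Problem 7(b)
e = residual(25, 3, b0, b1)   # car 1: x = 3 options, y = 25 days
print(f"e = {e:.4f}")
```

A negative residual means the observed value lies below the fitted line, i.e. the model overestimates that observation.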

5 Measures of Variation:
To examine the ability of the independent variable to predict the dependent variable (y) in the
regression model, several measures of variation need to be developed. In a regression analysis,
the total variation or total sum of squares (SST) is subdivided into the explained variation or
regression sum of squares (SSR) and the unexplained variation or error sum of squares (SSE).
These measures of variation are shown in the following figure.


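[Figure: the fitted regression line ŷ = b0 + b1x1 drawn with X on the horizontal axis and Y on the vertical axis, with the deviations corresponding to SST, SSR and SSE marked]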

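From the figure, mathematically,
Total Sum of Squares (SST) = Regression Sum of Squares (SSR) + Error Sum of Squares (SSE), i.e.

$$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$$

Where,

$$SST = \sum (y - \bar{y})^2 = \sum y^2 - n\bar{y}^2$$
$$SSR = \sum (\hat{y} - \bar{y})^2$$
$$SSE = \sum (y - \hat{y})^2 = \sum y^2 - b_0 \sum y - b_1 \sum x_1 y$$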
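
The decomposition SST = SSR + SSE can be checked numerically. The sketch below fits the least-squares line to the data of Problem 4 in the problem set and computes the three sums of squares from their definitions (deviations of y from ȳ, of ŷ from ȳ, and of y from ŷ). The function name `variation` is hypothetical.

```python
def variation(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line fitted to xs and ys."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)                # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # unexplained variation
    return sst, ssr, sse

# Data taken from Problem 4 of the problem set
X = [16, 6, 10, 5, 12, 14]
Y = [-4.4, 8, 2.1, 8.7, 0.1, -2.9]

sst, ssr, sse = variation(X, Y)
print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
```

The identity holds only for the least-squares line; for any other line through the data the cross-product term does not vanish.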

6 Standard error of the estimate (Se or Sy.x)


The standard error of the estimate measures the average variation or scatter of the observed
data points around the regression line. It is used to measure the reliability of the regression
equation, is denoted by Se or Sy.x, and is calculated by using the following relation.

$$S_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}}$$

The notations have their usual meaning.


Interpreting the standard error of the estimate:


A regression line with a smaller standard error of the estimate is more reliable than a
regression line with a larger standard error of the estimate, i.e. the smaller the standard
error of the estimate, the more reliable the fitted regression line.
a. If Se = 0, there is no variation of the observed data around the regression line, i.e. all
the observed data lie on the regression line. So we expect the regression line to be perfect
for predicting the dependent variable.
b. If the value of Se is large, the fitted regression line is poor for predicting the dependent
variable, since there is greater variation of the observed data around the regression line.
c. If the value of Se is small, there is less variation of the observed data around the
regression line. So the regression line will be better for predicting the dependent
variable.
For example, if Se = 2.5, the average variation of the observed data around the regression line
is 2.5.
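
A minimal sketch of the calculation, assuming the standard relation Se = sqrt(SSE/(n-2)) and using the figures stated in Problem 14 of the problem set (SSE = 271.241, n = 10). The function name `standard_error_of_estimate` is hypothetical.

```python
import math

def standard_error_of_estimate(sse, n):
    """Se = sqrt(SSE / (n - 2)) for simple linear regression."""
    return math.sqrt(sse / (n - 2))

se = standard_error_of_estimate(271.241, 10)   # figures given in Problem 14
print(f"Se = {se:.4f}")
```

The result is in the units of the dependent variable, so here it would be read as the typical scatter of sales (in thousands of dollars) around the fitted line.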

7 Confidence Interval Estimate


a. Confidence interval for the y-intercept (β0)

$$b_0 \pm t_{\alpha/2,\,n-2}\, S_{b_0}$$

b. Confidence interval for the regression coefficient or slope (β1)

$$b_1 \pm t_{\alpha/2,\,n-2}\, S_{b_1}$$

c. Confidence interval estimate for the mean of the dependent variable (y)

$$\hat{y} \pm t_{\alpha/2,\,n-2}\, S_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x}_1)^2}{\sum (x_1 - \bar{x}_1)^2}}$$

d. Prediction interval for an individual response of the dependent variable (y)

$$\hat{y} \pm t_{\alpha/2,\,n-2}\, S_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x}_1)^2}{\sum (x_1 - \bar{x}_1)^2}}$$

e. Approximate prediction interval: This gives the interval within which the actual value of
the dependent variable (y) lies for a given value of the independent variable.

$$\hat{y} \pm t_{\alpha/2,\,n-2}\, S_e$$

Where
ŷ = Estimated value of the dependent variable for a given value of the independent
variable.
x0 = Given value of the independent variable (x1)
Sb0 = Standard error of the y-intercept (b0)
Sb1 = Standard error of the regression coefficient (b1)
Se = Standard error of the estimate
tα/2, n-2 = Tabulated value of ‘t’, obtained from Student’s t-table at (n-2) degrees of
freedom and ‘α/2’ level of significance.
n = Number of pairs of observations.
Other notations have their usual meanings.
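
The sketch below works through three of these intervals numerically, assuming the standard textbook forms: ŷ ± t·Se·sqrt(1/n + (x0 - x̄)²/SSX) for the mean response, the same with an extra 1 under the root for an individual response, and ŷ ± t·Se for the approximate interval. It uses the summary figures stated in Problem 14 of the problem set. The given value x0 = 25 and the hard-coded t-table value 2.306 (α/2 = 0.025, 8 degrees of freedom) are assumptions made for this example.

```python
import math

# Summary figures stated in Problem 14 of the problem set
n, sum_x, sum_y, sum_xy = 10, 260, 480, 13377
ssx = 834        # sum of squared deviations of X about its mean
sse = 271.241    # error sum of squares
t_tab = 2.306    # assumed t-table value: alpha/2 = 0.025, df = n - 2 = 8

x_bar, y_bar = sum_x / n, sum_y / n
b1 = (sum_xy - n * x_bar * y_bar) / ssx   # slope
b0 = y_bar - b1 * x_bar                   # intercept
se = math.sqrt(sse / (n - 2))             # standard error of the estimate

x0 = 25                                   # given value of X (assumed for illustration)
y_hat = b0 + b1 * x0
half_mean = t_tab * se * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ssx)
half_pred = t_tab * se * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ssx)
half_approx = t_tab * se

print(f"y-hat at x0 = {x0}: {y_hat:.3f}")
print(f"95% CI for mean response:  {y_hat - half_mean:.3f} to {y_hat + half_mean:.3f}")
print(f"95% prediction interval:   {y_hat - half_pred:.3f} to {y_hat + half_pred:.3f}")
print(f"95% approximate interval:  {y_hat - half_approx:.3f} to {y_hat + half_approx:.3f}")
```

The prediction interval is always wider than the confidence interval for the mean, because it also carries the scatter of an individual observation around the line.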

8 Test of significance for the regression coefficient (β1):

To determine whether a significant linear relationship exists between the dependent variable
(y) and the independent variable (x1), a hypothesis test concerning the population slope (β1,
i.e. the regression coefficient) is made by setting the null and alternative hypotheses as
stated below.

Null hypothesis (H0): β1 = 0 (This means there is no linear relationship between the dependent
and independent variables.)
Alternative hypothesis (H1): β1 ≠ 0 (This means there is a significant linear relationship
between the dependent and independent variables.) (Two tailed)
Test statistic:

$$t = \frac{b_1}{S_{b_1}}$$

This test statistic follows the t-distribution with (n-2) degrees of freedom.

Decision: If the absolute value of the calculated test statistic (|tcal|) is less than the
tabulated value (ttab), the null hypothesis is accepted; otherwise the null hypothesis is rejected, i.e.
If |tcal| < tα, n-2, the null hypothesis is accepted. Otherwise the alternative hypothesis is accepted.
Where
tα, n-2 = Tabulated value of ‘t’ at (n-2) degrees of freedom and ‘α’ level of significance,
obtained from the two-tailed t-table.
n = number of pairs of data.
α = level of significance
b1 = Regression coefficient of y on x1.
Sb1 = Standard error of the regression coefficient (b1)

$$S_{b_1} = \frac{S_e}{\sqrt{\sum (x_1 - \bar{x}_1)^2}}$$

Or

$$S_{b_1} = \frac{S_e}{\sqrt{\sum x_1^2 - n\bar{x}_1^2}}$$

Other notations have their usual meanings.
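
A rough numeric sketch of the slope test, assuming the standard relations Sb1 = Se/sqrt(SSX) and t = b1/Sb1, and using the summary figures stated in Problem 14 of the problem set. The t-table value 2.306 (two-tailed, α = 0.05, 8 degrees of freedom) is hard-coded as an assumption.

```python
import math

# Summary figures stated in Problem 14 of the problem set
n, sum_x, sum_y, sum_xy = 10, 260, 480, 13377
ssx = 834        # sum of squared deviations of X about its mean
sse = 271.241    # error sum of squares
t_tab = 2.306    # assumed t-table value: two-tailed, alpha = 0.05, df = 8

b1 = (sum_xy - sum_x * sum_y / n) / ssx   # sample slope
se = math.sqrt(sse / (n - 2))             # standard error of the estimate
s_b1 = se / math.sqrt(ssx)                # standard error of the slope
t_cal = b1 / s_b1                         # test statistic under H0: beta1 = 0

decision = "reject H0" if abs(t_cal) >= t_tab else "accept H0"
print(f"b1 = {b1:.4f}, Sb1 = {s_b1:.4f}, t_cal = {t_cal:.3f}, decision: {decision}")
```

In simple regression this test and the t-test for the correlation coefficient give the same calculated t, so they always lead to the same decision.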

9 Coefficient of determination (r²):

The coefficient of determination measures the strength or extent of the association that exists
between the dependent variable (y) and the independent variable (x1). It measures the proportion
of variation in the dependent variable (y) that is explained by the regression line. In other
words, the coefficient of determination measures the share of the total variation in the
dependent variable that is due to the variation in the independent variable, and it is denoted
by ‘r²’. The following relations are used to obtain the value of the coefficient of determination.

$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

Note: Since the coefficient of determination is the square of the correlation coefficient, the
correlation coefficient is the square root of the coefficient of determination and can be
obtained from the coefficient of determination by the following relation.

$$r = \pm\sqrt{r^2}$$

If the regression coefficient (b1) is negative then take the negative sign.
If the regression coefficient (b1) is positive then take the positive sign.
Adjusted coefficient of determination (r²adj): The adjusted coefficient of determination
is calculated by using the following relation (for one independent variable).

$$r^2_{adj} = 1 - (1 - r^2)\,\frac{n-1}{n-2}$$
Interpretation of the coefficient of determination (r²): A regression model with a higher
coefficient of determination is better and more reliable than a regression model with a smaller
coefficient of determination, i.e. a higher value of r² is better than a lower value of r².
For example, if r² = 0.91, 91% of the total variation in the dependent variable
(y) is due to the variation in the independent variable (x1), and the remaining 9% of the
variation in the dependent variable is due to other factors which are not accounted for by the
independent variable.
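
A short sketch of these quantities, assuming the standard relations r² = SSR/(SSR + SSE) and r²adj = 1 - (1 - r²)(n - 1)/(n - k - 1) with k = 1 independent variable. The figures SSR = 60, SSE = 40 and n = 20 are those stated in Problem 10 of the problem set; the function names are hypothetical, and the pairing of r² = 0.91 with a negative slope in the last line is purely illustrative.

```python
import math

def r_squared(ssr, sse):
    """Share of total variation explained by the regression line."""
    return ssr / (ssr + sse)

def adjusted_r_squared(r2, n, k=1):
    """Adjusted r-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def r_from_r_squared(r2, b1):
    """Correlation coefficient: square root of r2 carrying the sign of b1."""
    return math.copysign(math.sqrt(r2), b1)

r2 = r_squared(60, 40)               # figures given in Problem 10
r2_adj = adjusted_r_squared(r2, 20)
print(f"r2 = {r2:.4f}, adjusted r2 = {r2_adj:.4f}")

# Illustrative only: a negative slope gives a negative r
print(f"r = {r_from_r_squared(0.91, b1=-3):.4f}")
```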

10 Assumption on Regression Analysis:


The following three assumptions are necessary for regression analysis:
i) Normality of errors
ii) Homoscedasticity
iii) Independence of errors

i) Normality of errors: This assumption requires that the errors around the regression
line be normally distributed for each value of X (the independent variable). As long as the
distribution of the errors around the regression line for each value of the independent
variable is not extremely different from a normal distribution, inference about the
line of regression and the regression coefficients will not be seriously affected.
ii) Homoscedasticity: This assumption requires that the variation around the line of
regression be constant for all values of the independent variable (X). This means that the
errors vary by the same amount when X is a low value as when X is a high value. The
homoscedasticity assumption is important for using the least squares method to fit the
regression line. If there are serious departures from this assumption, either data
transformations or the weighted least squares method can be applied.
iii) Independence of errors: This assumption requires that the errors around the regression
line be independent for each value of the explanatory variable. This is particularly
important when data are collected over a period of time. In such situations, errors for a
specific time period are often correlated with those of the previous time period.

11 Residual analysis:
Residual analysis is a graphical method to evaluate whether the regression model that has been
fitted to the data is an appropriate model. In addition, residual analysis enables potential
violations of the assumptions of the regression model to be detected.
The aptness of the fitted regression model is evaluated by plotting the residuals on the
vertical axis against the corresponding X values of the independent variable along the x-axis. If
the fitted model is appropriate for the data then there will be no apparent pattern in this plot.
However, if the fitted model is not appropriate then there will be a relationship between the X
values and the residuals (e).
By plotting a histogram, box-and-whisker plot or stem-and-leaf display of the error terms,
we can assess the normality of the errors.
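
A residual plot is normally drawn with a plotting library. As a text-only sketch of the same idea, the code below fits the least-squares line to the data of Problem 16 in the problem set and lists the residuals grouped by X value, so that any pattern across X can be inspected by eye. The variable names are chosen for this example only.

```python
# Data taken from Problem 16 of the problem set
X = [5, 5, 5, 10, 10, 10, 15, 15, 15, 20, 20, 20]                 # shelf space (feet)
Y = [1.6, 2.2, 1.4, 1.9, 2.4, 2.6, 2.3, 2.7, 2.8, 2.6, 2.9, 3.1]  # weekly sales

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar

# Residual e = observed y minus fitted y-hat, one per observation
residuals = [y - (b0 + b1 * x) for x, y in zip(X, Y)]

# List the residuals for each distinct X value
for x_value in sorted(set(X)):
    group = [round(e, 3) for x, e in zip(X, residuals) if x == x_value]
    print(f"X = {x_value:>2}: residuals {group}")
```

Residuals from a least-squares line with an intercept always sum to zero, so the thing to look for is structure, for example residuals that are mostly negative at low and high X and positive in between.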

12 Problems on Simple Correlation and Regression Analysis:
1. A study of the car running cost and family income of 10 families gave the following
data
ΣX = 3150, ΣY = 315, ΣX² = 1128750, ΣY² = 12225, ΣXY = 116375 and n = 10
a. Calculate the correlation coefficient.
b. Calculate the regression equation relating the running cost of a car and the family
income.
2. An instructor is interested in finding out how the number of students absent on a given
day is related to the mean temperature that day. A random sample of 10 days was used
for the study. The following data indicate the number of students absent (ABS) and the
mean temperature (TEMP) for each day.
ABS: 8 7 5 4 2 3 5 6 8 9
TEMP: 10 20 25 30 40 45 50 55 59 60
a. Develop the estimating equation that best describes the data.
b. What is the logical explanation for the observed relationship?
c. Compute the residual when X= 25
d. Compute the standard error of the estimate and interpret the standard error of the
estimate.
3. Cost accountants often estimate overhead based on the level of production. At the
Standard Knitting Co., they have collected information on overhead expenses and units
produced at different plants, and want to estimate a regression equation to predict future
overhead.
Overhead: 191 170 272 155 280 173 234 116 153 178
Units: 40 42 53 35 56 39 48 30 37 40
a. Develop the regression equation for the cost accountants.
b. Predict overhead when 50 units are produced.
c. Calculate the standard error of estimate and interpret the value of the standard error
of the estimate.
d. Calculate the correlation coefficient and test it at the 95% confidence level.

4. Using the data given below
X: 16 6 10 5 12 14
Y: -4.4 8 2.1 8.7 0.1 -2.9
a. Develop the estimating equation that best describes the data.
b. Predict Y for X= 5, 6, 7
c. Interpret the meaning of the slope.
d. Compute the coefficient of determination and interpret its meaning.
e. Obtain the estimate of the correlation coefficient and interpret its meaning.
f. Obtain the standard error of the estimate and interpret its meaning.
g. Test for the significance of the slope.
h. Obtain the 95% confidence interval estimate of the slope.
i. Carry out the t-test for the correlation coefficient.
j. Obtain the confidence interval estimate for the mean of Y for x= 7.
k. Obtain the 95% prediction interval for individual Y for the value of x= 7.
l. Obtain the 95% approximate prediction interval of Y for the value of x=7.

5. Sales of major appliances vary with the new housing market: when new home sales are
good, so are the sales of dishwashers, washing machines, driers, and refrigerators. A
trade association compiled the following historical data (in thousands of units) on major
appliance sales and housing starts:
Housing starts (thousands): 2.0 2.5 3.2 3.6 3.3 4.0 4.2 4.6 4.8
Appliance sales (thousands): 5 5.5 6 7 7.2 7.7 8.4 9 9.7
a. Develop an equation for the relationship between appliance sales (in thousands) and
housing starts (in thousands)
b. Interpret the slope of the regression line.
c. Compute and interpret the standard error of estimate.
d. Compute the 90% prediction interval for the appliance sales when housing is 8.0
e. Compute the coefficient of determination and coefficient of correlation and interpret the
value.
6. A study by the department of transportation on the effect of bus ticket price upon the
number of passengers produced the following results
Ticket price (Rs.): 25 30 35 40 45 50 55 60
Passenger per 100 miles: 800 780 780 660 640 600 620 620
a. Develop the estimating equation that best describes these data.
b. Interpret the regression coefficient (slope of the regression line).
c. Predict the number of passengers per 100 miles if the ticket price were Rs. 50, and also
obtain the 95% approximate prediction interval for a ticket price of Rs. 50.
7. A statistician for an American automobile manufacturer would like to develop a model for
predicting delivery time (the days between the ordering of the car and the actual
delivery of the car) of custom-ordered new automobiles. A random sample of 15 cars is
selected, with the results summarized in the following table.
Car: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
No. of options ordered (x): 3 4 4 7 7 8 9 11 12 12 14 16 20 23 25
Delivery time in days (y): 25 32 26 38 34 41 39 46 44 51 58 53 64 66 70
a. Given a correlation coefficient, r = 0.9726, between the number of options ordered and
the delivery time in days, examine if this linear relationship is significant at the 5%
level of significance.
b. Given the linear regression line Ŷ = 22.2123 + 2.0218X, compute the residual for car 6.
c. Next, given that Σ(Y - Ŷ)² = 153.421, compute the standard error of the estimate Se
(Sy.x) and interpret its meaning.
d. Given that Σ(X - X̄)² = 657.33, test the regression coefficient at the 5% level of
significance.
e. Compute the 95% prediction interval of the delivery time for a car with 14 options
ordered.
f. Compute the 95% confidence interval estimate of the slope (regression coefficient).
8. Fitting a straight line to a set of data yields the regression equation: Ŷ = 2 + 5X
a. Interpret the meaning of the y-intercept b0 and the slope b1 of the regression line.
b. Predict the average value of Y for X = 3.

9. Suppose that you are testing the null hypothesis that there is no relationship between
two variables X and Y. From your sample of n = 18, you determine that b1 = 4.5 and
Sb1 = 1.5.
a. What is the value of the test statistics?
b. At α = 0.05 level of significance, test the regression coefficients.
10. Suppose that you are testing the null hypothesis that there is no relationship between
two variables X and Y. From your sample of n= 20, you determine that SSR= 60 and
SSE = 40. Calculate the coefficient of correlation and test its significance at α=10%.
11. Based on a sample of 20 observations, the least squares method was used to obtain the
following linear regression equation: Ŷ = 5 + 3X. In addition, Syx = 1.0, X̄ = 2 and
Σ(X - X̄)² = 20.
a. Set up a 95% confidence interval estimate of the population average response
for X= 2.
b. Set up a 95% prediction interval of the individual response for X= 2.
12. Given below are five years’ data on money supply (MI) and domestic credit (DM_CR)
for Nepal. Both variables are expressed in thousand rupees.
Year 1991 1992 1993 1994 1995
DM_CR 18.90 21.81 25.48 25.51 27.32
MI 10.05 10.80 11.05 10.80 12.20
a) Develop the estimation linear equation to predict DM_CR from MI.
b) How do you interpret the slope of the regression line?
c) Compute and interpret the standard error of estimate.
d) Compute an approximate 90% prediction interval for DM_CR when the money supply
is Rs.26.


13. In a regression problem with a sample size of 17, the slope was found to be 3.71 and
the standard error of estimate 28.654. The quantity ΣX² - nX̄² = 871.56, where X
is the independent variable.
a) Find the standard error of the regression coefficient (slope).
b) Construct a 95% confidence interval for the population slope and
interpret.
14. The managers of a brokerage firm are interested in finding out if the number of
new clients a broker brings into the firm affects the sales generated by the broker.
They sample 10 brokers and determine the number of new clients they enrolled in
the last year and their sales amounts in thousands of dollars. These data are
presented in the table that follows.
Broker   Clients (X)   Sales (Y)
1        27            52
2        11            37
3        40            64
4        33            55
5        15            29
6        15            34
7        25            58
8        36            59
9        28            44
10       30            48
Calculation shows that:
n = 10, ΣX = 260, ΣY = 480, ΣX² = 7594, ΣY² = 24276, ΣXY = 13377
SSX = Σ(X - X̄)² = 834, SST = Σ(Y - Ȳ)² = 1236, SSE = Σ(Y - Ŷ)² = 271.241
a) Assuming a linear relationship, what is the least square prediction for the amount of
sales (in $ 1,000) for a person who brings 25 new clients into the firm?
b) Calculate the standard error of estimate and interpret the result.
c) Suppose the managers of the brokerage firm want to obtain a 99% prediction interval
for the sales made by a broker who has brought into the firm 18 new clients. What
would be the prediction interval for this problem?

15. Coca-Cola is studying the effect of its latest advertising campaign. People chosen at
random were called and asked how many cans of Coca-Cola they had bought in the past
week and how many Coca-Cola advertisements they had either read or seen in the
past week. The data collected from different people are as follows
People: 1 2 3 4 5 6 7 8 9 10 11 12
Number of ads (x): 3 7 6 6 10 12 12 13 12 13 14 15
Calculation shows that
, , , ,
Find the coefficient of correlation between the number of ads and cans purchased, examine if
this linear relationship is significant at the 5% level of significance.
a. Find the linear regression line. Calculate the standard error of the estimate, S yx and
interpret its meaning.
b. Test the regression coefficient at the 1% level of significance.
c. Compute the 90% prediction interval of the can purchased for people 7.


16. The marketing manager of a large supermarket chain would like to determine the effect of
shelf space on the sales of pet food. A random sample of 12 equal-sized stores is selected
with the following results
Store: 1 2 3 4 5 6 7 8 9 10 11 12
Weekly sales, Y (hundreds of $): 1.6 2.2 1.4 1.9 2.4 2.6 2.3 2.7 2.8 2.6 2.9 3.1
Shelf space, X (feet): 5 5 5 10 10 10 15 15 15 20 20 20
Calculation shows that:
ΣX = 150, ΣX² = 2250, ΣY = 28.5, ΣY² = 70.69, ΣXY = 384
a. Assuming a linear relationship, use the least squares methods to find the best fitting
regression equation and hence compute the residual for store 6.
b. What percentage of the total variation in sales is explained by shelf space?
c. Set up 95% confidence interval estimate of the average weekly sales for all stores
that have 10 feet of shelf space for pet food.

23. An operation manager is interested in predicting costs C (in ‘000 Rs) based on the amount of
raw material input R (in ’00 pounds) for a jeans manufacturer. If the slope is significantly greater
than 0.5 in the following sample data, then there is something wrong with the production process
and the assembly-line machine should be adjusted. At the 0.05 significance level, should the
machinery be adjusted? State explicit hypotheses and an appropriate conclusion.

C 10 7 5 6 7 6
R 25 20 16 17 19 18
