
CORRELATION & REGRESSION

Prepared By
Nitin Varshney
Assistant Professor
Agricultural Statistics
CoA, NAU, Waghai.
 The study related to the characteristics of only one variable
such as height, weight, age, marks, wages etc. is known as
univariate analysis.
 The study related to the relationship between two variables
such as height & weight is known as bivariate analysis.

CORRELATION
 When we study two or more variables simultaneously, we
observe that movements in one variable are accompanied by
movements in the other variable.
 Example:
 Husband’s age and wife’s age move together
 Scores on IQ tests move with scores in university
examinations
 Relation b/w income & expenditure of a household.
 Relation b/w price & demand of a commodity.
Meaning of Correlation
 In a bivariate distribution (study of two variables), we are
interested to find out if there is any correlation or
covariation b/w the two variables.
 If a change in one variable is accompanied by a change in the
other variable, the variables are said to be correlated.

Types of Correlation
 Positive and negative
 Linear and Non-linear
 Multiple and Partial
Positive and negative correlation
 If the two variables deviate in the same direction i.e. if the
increase (or decrease) in one results in a corresponding
increase (or decrease) in the other, correlation is said to be
direct or positive.
Example: Correlation between
 Heights & weights of a group of persons
 Income & expenditure
 If the two variables deviate in opposite directions i.e. if the
increase (or decrease) in one results in a corresponding
decrease (or increase) in the other, correlation is said to be
inverse or negative.
Example: Correlation between
 Price & demand of a commodity
 Volume & pressure of a perfect gas
Linear and non-linear correlation
 If the ratio of change b/w the two variables is constant then
there will be linear correlation b/w them. Consider the
following example:

X 2 4 6 8 10 12 14 16
Y 3 6 9 12 15 18 21 24

 Here the ratio of change b/w the two variables is the same.
 If we plot these points on a graph we will get a straight
line.
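The constant-ratio claim can be checked numerically; a minimal Python sketch using the table above:

```python
# Check that Y/X is constant for the table above,
# so the points fall on a straight line through the origin
xs = [2, 4, 6, 8, 10, 12, 14, 16]
ys = [3, 6, 9, 12, 15, 18, 21, 24]

ratios = [y / x for x, y in zip(xs, ys)]
print(ratios)  # every ratio is 1.5, i.e. the points lie on the line Y = 1.5 X
```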
 If the amount of change in one variable does not bear a
constant ratio to the change in the other variable, there will be
curvilinear or non-linear correlation b/w them.

X 2 4 6 8 10 12 14 16 18 20 22 24
Y 2 6 8 12 18 24 36 44 54 67 75 89
Multiple and Partial Correlation
 There may be interrelationships between many variables,
with the value of one variable influenced by many other
variables, e.g. the yield of crop per acre (X1) may depend
upon quality of seed (X2), fertility of soil (X3), fertilizer
used (X4), irrigation facilities (X5), weather conditions (X6)
and so on.
 Whenever we are interested in studying the joint effect of
a group of variables upon a variable, then the correlation
is known as multiple correlation.
 The correlation b/w only two variables X1 and X2, while
eliminating the linear effect of the other variables, is known as
partial correlation.
SCATTER DIAGRAM
 It is the simplest way of diagrammatic representation of
bivariate data.
 For a bivariate distribution (xi, yi); i=1, 2, …, n, if the values
of the variables X and Y are plotted along the x-axis and y-
axis respectively in the x-y plane, the diagram of dots so
obtained is known as a scatter diagram.
 From the scatter diagram, we can form a fairly good idea
whether the variables are correlated or not, e.g.
 If the points are very dense (very close to each other):
there is good correlation between the variables.
 If the points are widely scattered: there is poor correlation
between the variables.

Karl Pearson’s coefficient of correlation
 Karl Pearson developed a formula called the correlation
coefficient as a measure of the intensity or degree of linear
relationship between two variables.
 The correlation coefficient between two random variables X
and Y, usually denoted by r(X, Y) or rXY, is a numerical
measure of the linear relationship between them. It is
defined as:

r(X, Y) = Cov(X, Y) / (σX σY)

 It provides a measure of the linear relationship between X and Y.
 If (xi, yi); i=1, 2, …, n is the bivariate distribution, then

Covariance = Cov(X, Y) = σXY = E[{X − E(X)}{Y − E(Y)}] = (1/n) Σ (xi − x̄)(yi − ȳ)

Variance = σX² = E{X − E(X)}² = (1/n) Σ (xi − x̄)²

Variance = σY² = E{Y − E(Y)}² = (1/n) Σ (yi − ȳ)²

Expanding these gives the computational forms:

σX² = (1/n) Σ (xi² + x̄² − 2 xi x̄) = (1/n) Σ xi² − x̄²

σY² = (1/n) Σ (yi² + ȳ² − 2 yi ȳ) = (1/n) Σ yi² − ȳ²

Cov(X, Y) = (1/n) Σ (xi yi − xi ȳ − x̄ yi + x̄ ȳ) = (1/n) Σ xi yi − x̄ ȳ
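These computational forms can be applied directly; a short Python sketch, reusing the perfectly linear table from the earlier example (so r should come out as 1):

```python
import math

# Pearson's r via the computational formulas:
#   Cov(X,Y) = (1/n) Σ xi·yi − x̄·ȳ,  σX² = (1/n) Σ xi² − x̄²
x = [2, 4, 6, 8, 10, 12, 14, 16]
y = [3, 6, 9, 12, 15, 18, 21, 24]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
cov = sum(a * b for a, b in zip(x, y)) / n - xbar * ybar
var_x = sum(a * a for a in x) / n - xbar ** 2
var_y = sum(b * b for b in y) / n - ybar ** 2
r = cov / math.sqrt(var_x * var_y)
print(round(r, 6))  # 1.0 — perfect positive correlation for perfectly linear data
```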
PROPERTIES OF CORRELATION COEFFICIENT
 Range of r is −1 to +1.
 r is independent of change of origin and scale.
 Two independent variables are uncorrelated (though
uncorrelated variables need not be independent).

Interpretation of correlation coefficient
 When r = 1, there is perfect positive correlation b/w the variables.
 When r = −1, there is perfect negative correlation b/w the variables.
 When r = 0, there is no linear relation b/w the variables.
 When the value of r lies b/w −1 and +1, it signifies that there is
some correlation b/w the variables.
 When the value of r is close to +1 or −1, it signifies high
positive or negative correlation b/w the variables.
 When the value of r is close to 0, it signifies weak
correlation b/w the variables.
RANK CORRELATION
 This method is useful to study qualitative attributes
like honesty, intelligence, colour, beauty, morality etc.
 This method is based on the ranks of the characters under study.
 A group of individuals is arranged in order of merit or
proficiency in any two characters A and B.
 Example: If we want to find the relation between
intelligence and beauty.
A: Intelligence B: Beauty
Ranks xi yi i=1, 2, 3, …, n
 The Pearsonian coefficient of correlation between the ranks
xi and yi is called the rank correlation coefficient between A and B
for that group of individuals.
SPEARMAN'S RANK CORRELATION COEFFICIENT
 This method was developed by Charles Edward Spearman.
 Spearman’s formula for the rank correlation coefficient is
given by

ρ = 1 − 6 Σ di² / [n(n² − 1)]

 di is the difference between ranks: di = xi − yi.
 Range of the rank correlation coefficient is −1 to +1.

Q. 2. In a marketing survey, the prices of tea and coffee in a town
were recorded based on quality. The data are given as follows;
find the relation b/w the price of tea and the price of coffee.

Price of Tea 88 90 95 70 60 75 50
Price of Coffee 120 134 150 115 110 140 100
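A worked sketch of Q.2 in Python, ranking prices from highest (rank 1) to lowest and applying Spearman's formula. The `ranks` helper is ours, not from the text, and assumes no tied values (true for this data):

```python
# Spearman's rho for Q.2: rank correlation between tea and coffee prices
tea = [88, 90, 95, 70, 60, 75, 50]
coffee = [120, 134, 150, 115, 110, 140, 100]

def ranks(values):
    # rank 1 = highest price; assumes no ties, as in this dataset
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

rx, ry = ranks(tea), ranks(coffee)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # Σ di²
n = len(tea)
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(round(rho, 3))  # 0.893 — strong positive rank correlation
```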
REGRESSION
 The term “regression” literally means “stepping back towards
the average”.
 It was introduced by Sir Francis Galton.
 Galton found that the offspring of abnormally tall or short
parents tend to regress or step back towards the average
population height.
 Regression analysis is a mathematical measure of the
average relationship between two or more variables in
terms of the original units of the data.
 In regression analysis there are two types of variables.

Dependent Variable (also called the regressed or explained
variable): the variable whose value is influenced or is to be
predicted.
Independent Variable (also called the regressor, explanatory,
or predictor variable): the variable which influences the values
or is used for prediction.
LINEAR REGRESSION
 If the variables in a bivariate distribution are related
(means variables are correlated), we will find that the
points in the scatter diagram will cluster round some
curve called the “curve of regression”.
 If the curve is a straight line, it is called the “line of
regression”.
 Then there is said to be linear regression between the
variables, otherwise curvilinear regression.

Linear Regression Equation
Let us suppose that in the bivariate distribution (xi, yi);
i=1, 2, 3, …, n; Y is the dependent variable and X is the
independent variable. Let the line of regression of Y on X
be
Y = a + bX (a, b are constants)
 There are two regression lines.
 If Y is the dependent variable and X is the independent variable, then
it is called the line of regression of Y on X:
Y = a + byx X
 If X is the dependent variable and Y is the independent variable, then
it is called the line of regression of X on Y:
X = a + bxy Y
 where byx is the regression coefficient (slope) of the regression line
of Y on X,
 and bxy is the regression coefficient (slope) of the regression line
of X on Y.
 The line of regression is the line which gives the best estimate
to the value of one variable for any specific value of the other
variable.
 Thus the line of regression is the line of best fit.
 It is obtained by the principle of least squares.
PRINCIPLE OF LEAST SQUARES
 Let the line of regression of Y on X be
Y= a+ byx X
 ei = yi − (a + byx xi) is called the error of estimate or residual
for yi.
 According to the principle of least squares, we have to
determine a and byx so that
E = Σ ei² = Σ (yi − a − byx xi)²

is minimum.
 Setting the partial derivatives ∂E/∂a and ∂E/∂byx equal to zero,
we get two normal equations for estimating a and byx:

Σ yi = n a + byx Σ xi      (i)

Σ xi yi = a Σ xi + byx Σ xi²      (ii)
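Equations (i) and (ii) form a 2×2 linear system that can be solved in closed form; a minimal Python sketch with made-up illustrative data:

```python
# Solve normal equations (i) and (ii) for a and byx (illustrative data)
x = [1, 2, 3, 4]
y = [2, 4, 5, 7]
n = len(x)
Sx, Sy = sum(x), sum(y)                     # Σxi, Σyi
Sxx = sum(v * v for v in x)                 # Σxi²
Sxy = sum(a * b for a, b in zip(x, y))      # Σxi·yi

# Eliminating a between (i) and (ii) gives byx; back-substituting into (i) gives a
byx = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
a = (Sy - byx * Sx) / n
print(f"Y = {a} + {byx} X")  # Y = 0.5 + 1.6 X
```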
 If we divide eqn. (i) by n then we get
ȳ = a + byx x̄
 Thus the line of regression of Y on X passes through the point
(x̄, ȳ).
 So the regression coefficient (slope) of the line of regression of Y
on X is given by

byx = Cov(x, y) / V(x) = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n]   and   a = ȳ − byx x̄
 Similarly, the regression coefficient (slope) of the line of regression of X
on Y is given by

bxy = Cov(x, y) / V(y) = [Σxy − (Σx)(Σy)/n] / [Σy² − (Σy)²/n]   and   a = x̄ − bxy ȳ
 Since byx is the slope of the regression of Y on X and since the
line of regression passes through the point (x̄, ȳ), its equation
is

Y − ȳ = byx (X − x̄) = [Cov(X, Y) / V(X)] (X − x̄) = r (σY/σX) (X − x̄)

 Similarly for the line of X on Y:

X − x̄ = bxy (Y − ȳ) = [Cov(X, Y) / V(Y)] (Y − ȳ) = r (σX/σY) (Y − ȳ)

 These follow because

r = Cov(X, Y) / √(V(X)·V(Y))  ⇒  Cov(X, Y) = r σX σY
byx = Cov(X, Y) / V(X)  ⇒  Cov(X, Y) = byx σX²

so r σX σY = byx σX²  ⇒  byx = r (σY/σX), and similarly bxy = r (σX/σY).
PROPERTIES OF REGRESSION COEFFICIENT
1. Fundamental Property: The correlation coefficient is the geometric
mean between the regression coefficients:
bxy · byx = [r (σX/σY)] · [r (σY/σX)] = r²  ⇒  r = ±√(bxy · byx)
2. Signature Property: The sign of the correlation coefficient is the same as
that of the regression coefficients. Thus if the regression coefficients
are positive then the correlation coefficient will be positive and
vice versa.
3. Magnitude Property: If one of the regression coefficients is
greater than unity, the other must be less than unity:
If |byx| > 1 then |bxy| < 1
4. Mean Property: The modulus value of the arithmetic mean of the
regression coefficients is not less than the modulus value of the
correlation coefficient r:
(1/2) |bxy + byx| ≥ |r|
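Properties 1–4 are easy to verify numerically; a sketch on a small made-up dataset:

```python
import math

# Verify the regression-coefficient properties on illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n
vx = sum((a - xbar) ** 2 for a in x) / n
vy = sum((b - ybar) ** 2 for b in y) / n
r = cov / math.sqrt(vx * vy)
byx, bxy = cov / vx, cov / vy

# 1. Fundamental: r² equals the product of the regression coefficients
assert math.isclose(r * r, byx * bxy)
# 2. Signature: r and both coefficients share the same sign
assert r > 0 and byx > 0 and bxy > 0
# 3. Magnitude: both coefficients cannot exceed 1 in absolute value
assert not (abs(byx) > 1 and abs(bxy) > 1)
# 4. Mean: |(bxy + byx)/2| >= |r|
assert abs(bxy + byx) / 2 >= abs(r)
```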
5. Regression coefficients are independent of the change of
origin but not of scale.
6. Angle between two lines of regression: If θ is the acute
angle between the two lines of regression, then

θ = tan⁻¹ [ |(1 − r²)/r| · σX σY / (σX² + σY²) ]

 If r = 0, tan θ = ∞ → θ = 90°. Thus if the two variables are uncorrelated,
the lines of regression are perpendicular to each other.
 If r = ±1, tan θ = 0 → θ = 0° or 180°. Thus if the two variables are
perfectly correlated, the lines of regression coincide.

Q.3. From a paddy field, 15 plants were selected randomly. The length of
panicle (cm) and number of grains per panicle were recorded. Fit the
regression line for the given dataset and compute the estimated number of
grains per panicle if the panicle length is 25.2 cm.

Length of Panicle (cm)    22.4 23.3 24.1 24.3 23.5 23.1 21 20.6 26.4 25.4 23.4 21.4 23.6 24.5 22.5
No. of grains per panicle 95 109 133 132 136 116 94 85 143 138 129 88 127 142 110
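A worked sketch of Q.3 in Python, using the computational formula for byx from the regression section; the predicted value (about 141 grains) is our own computation, not given in the text:

```python
# Regression of grains per panicle (Y) on panicle length (X) for Q.3
x = [22.4, 23.3, 24.1, 24.3, 23.5, 23.1, 21, 20.6,
     26.4, 25.4, 23.4, 21.4, 23.6, 24.5, 22.5]
y = [95, 109, 133, 132, 136, 116, 94, 85,
     143, 138, 129, 88, 127, 142, 110]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
# byx = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n]
byx = (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) / (
    sum(a * a for a in x) - sum(x) ** 2 / n
)
a = ybar - byx * xbar  # intercept: a = ȳ − byx·x̄
print(f"Fitted line: Y = {a:.2f} + {byx:.3f} X")
print(f"Estimated grains at 25.2 cm: {a + byx * 25.2:.1f}")  # ≈ 141 grains
```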
