
Correlation


PABITRA SUBEDI

Correlation: Correlation is the degree of relationship between two or more variables. Two variables
are said to be correlated when a change in the value of one variable is accompanied by a change in
the value of the other. Examples include temperature and pulse rate, or the dosage of a drug
administered and the resulting physical response.

Types of correlation

Positive, negative and zero correlation: If an increase (or decrease) in the value of one variable
is, on average, associated with an increase (or decrease) in the value of the other variable, a
positive relationship is said to exist. The relationship is negative if an increase (or decrease) in
the value of one variable is associated with a decrease (or increase) in the value of the other. If no
relationship exists between the variables at all, the value of the correlation coefficient is zero.

Linear and non-linear correlation: A correlation is said to be linear if a unit change in one variable
is associated with a constant change in the other variable. A correlation is said to be non-linear if
a unit change in one variable is not associated with a constant change in the other variable.

Simple, partial and multiple correlation: The distinction between simple, partial and multiple
correlation is based on the number of variables studied. If the relationship between two variables is
studied, the correlation is said to be simple. In partial and multiple correlation, more than two
variables are studied simultaneously. In partial correlation, the correlation between two variables is
studied while keeping the linear effect of the other variables fixed. If the correlation between one
variable and the joint effect of the other variables on that variable is studied, it is known as
multiple correlation.

Methods of studying correlation

Scatter diagram method: This is a graphical method of studying the correlation between two
variables. Let X and Y be two variables with the same number of values. Each pair (X, Y) is plotted
as a point, with X along the horizontal axis and Y along the vertical axis of a rectangular
coordinate system. The closeness and dispersion of the points give an idea of the magnitude and
direction of the correlation between the two variables.

The closer the dots lie to a straight line, the higher the value of the correlation coefficient. The
coefficient is positive if the dots run from the lower left to the upper right corner of the diagram,
and negative if they run from the upper left to the lower right corner. If the dots form a narrow
band, a high degree of correlation is said to exist; if they form a widely scattered band, the degree
of correlation is low. There is no correlation at all if the dots form a circle or another shape with
no clear direction.

Karl Pearson’s Correlation Coefficient: Karl Pearson’s correlation coefficient, denoted by r,
measures the intensity (magnitude or degree) of the relationship between two variables and is given
by the formula

$$ r = \frac{\operatorname{Cov}(X, Y)}{\sigma_x \, \sigma_y} $$

where Cov(X, Y) = covariance of X and Y

$$ \operatorname{Cov}(X, Y) = \frac{\sum (X - \bar{x})(Y - \bar{y})}{n} $$

σx= Standard deviation of random variable X

σy = Standard deviation of random variable Y

This formula can further be written as

$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - (\sum x)^2} \, \sqrt{n \sum y^2 - (\sum y)^2}} $$

Note: The value of r lies between -1 and +1.
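
The raw-sum form of the formula can be checked with a short script. The following is a minimal Python sketch, not part of the original notes; the function name `pearson_r` is an illustrative choice, and the data are the dose and response values from Problem 1 in the Problems section.

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's r computed from raw sums (hypothetical helper name)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Dose of drug and response to drug (Problem 1)
dose = [0.3, 0.4, 0.6, 0.8, 0.9]
response = [54, 59, 60, 65, 70]

r = pearson_r(dose, response)
print(round(r, 4))  # about 0.9633, a very high positive correlation
```

The value falls inside the limits stated in the note, as it must for any data set.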

Interpretation

Degree       Direction
             Positive          Negative
Perfect      +1                -1
Very high    +0.75 to +0.99    -0.75 to -0.99
Moderate     +0.50 to +0.74    -0.50 to -0.74
Low          +0.01 to +0.49    -0.01 to -0.49
Absent       0                 0

Probable Error P.E.(r): The probable error is a measure used to test the reliability of the calculated
correlation coefficient. If r is the correlation coefficient calculated from n pairs of observations,
then the probable error is denoted by P.E.(r) and is defined as

$$ \mathrm{P.E.}(r) = 0.6745 \, \frac{1 - r^2}{\sqrt{n}} $$

Interpretation

If |r| < P.E.(r), then r is not significant.

If |r| > 6 P.E.(r), then r is significant.

The probable error also helps to estimate the limits of the population correlation coefficient. The
limits of the population correlation coefficient are r ± P.E.(r).
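
As a rough illustration of these rules, here is a minimal Python sketch. The function names are hypothetical. Two points are assumptions of this sketch rather than statements from the notes: the comparison uses the magnitude of r, and the label "inconclusive" covers the range between P.E.(r) and 6 P.E.(r), which the notes do not address. The input r = 0.9633 with n = 5 is the value obtained for the dose and response data of Problem 1.

```python
from math import sqrt

def probable_error(r, n):
    """P.E.(r) = 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)

def interpret(r, n):
    """Apply the two rules above; the middle case is this sketch's own label."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe:
        return "significant"
    return "inconclusive"  # assumption: the notes give no rule for this range

r, n = 0.9633, 5
pe = probable_error(r, n)
verdict = interpret(r, n)
lower, upper = r - pe, r + pe  # limits of the population correlation

print(round(pe, 4), verdict)
print(round(lower, 4), round(upper, 4))
```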

Properties of correlation coefficient

1. The correlation coefficient is a pure number, independent of the units of measurement.
2. The correlation coefficient is independent of change of origin and scale:
   r_xy = r_uv
   where U = X - A and V = Y - B (change of origin). The same holds for U = (X - A)/h and
   V = (Y - B)/k with h > 0 and k > 0 (change of origin and scale).
3. The correlation coefficient lies between -1 and +1.
4. The correlation between two independent random variables is zero.
5. The correlation coefficient is the geometric mean of the two regression coefficients, i.e.

$$ r = \pm\sqrt{b_{xy} \, b_{yx}} $$
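
Property 2 can be checked numerically. The sketch below is illustrative only: it uses the age and exercise-hours data from Problem 4 of the Problems section, and the constants A, B, h and k are arbitrary values chosen for the demonstration (with h and k positive), not values from the notes.

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's r from raw sums (hypothetical helper name)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (sqrt(n * sx2 - sx ** 2) * sqrt(n * sy2 - sy ** 2))

age = [18, 26, 32, 38, 52, 59]
hours = [10, 5, 2, 3, 1.5, 1]

# Arbitrary origin (A, B) and positive scale (h, k), assumed for illustration
A, B, h, k = 38, 3, 2, 0.5
u = [(x - A) / h for x in age]
v = [(y - B) / k for y in hours]

r_xy = pearson_r(age, hours)
r_uv = pearson_r(u, v)
print(round(r_xy, 3), round(r_uv, 3))  # both about -0.832
```

Working with u and v instead of the raw values is mainly a convenience for hand calculation, since the sums involve smaller numbers.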

Problems

1. During a laboratory experiment, muscular contractions of a frog muscle were measured against
different doses of a given drug. The height of the curves was taken as the response to the drug. The
observations were as follows:

Serial no.          1     2     3     4     5
Dose of drug        0.3   0.4   0.6   0.8   0.9
Response to drug    54    59    60    65    70
Calculate the correlation coefficient.

2. For the following data, calculate the coefficient of correlation to determine the association, if
any, between the fluoride level of drinking water and the community fluorosis index.

Fluoride level (mg/lit)          0.8   1.3   1.5   1.9   2.3   2.3   2.4
Community fluorosis index (%)    0.1   0.4   0.8   0.6   0.7   1.1   0.8
(0.8259)

3. A researcher wants to find out whether there is a relationship between the heights of sons and the
heights of their fathers. Calculate the coefficient of correlation for the heights (in inches) given
below.

Father    63   65   66   67   67   68
Sons      66   68   65   67   69   70
(0.5976)

4. A researcher wished to determine whether a person’s age is related to the number of hours he/she
exercises per week. The data obtained are given below. Calculate the coefficient of correlation.

Age      18   26   32   38   52    59
Hours    10   5    2    3    1.5   1
(-0.832)

5. The following table gives information on the average saturated fat (in gm) consumed per day and
the cholesterol level (in milligrams per hundred milliliters) for eight men.

Fat consumption      55    68    50    34    43    58    77    36
Cholesterol level    180   215   195   165   170   204   235   150
Find the correlation coefficient. (0.954)

6. A study reported in a medical journal suggested that the peak heart rate an individual can reach
during intensive exercise decreases with age. A cardiologist wanted to do his own study. The next 9
patients were given a stress test on the treadmill at 6 miles per hour, and their ages and heart rates
were recorded as follows:

Age           30    30    40    20    20    45    30    45    50
Heart rate    190   180   180   200   195   170   185   175   165
Calculate the coefficient of correlation and interpret the value obtained. Also calculate the limits
of the population correlation coefficient.

Spearman’s Rank correlation coefficient (Rank correlation)

There are three cases of rank correlation:

• When the ranks are given.
• When the ranks are not given.
• When the ranks are repeated.

When the rank is given

Spearman’s rank correlation coefficient is denoted by R and is defined as

$$ R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$

where d = R1 - R2
R1 = rank in the 1st series
R2 = rank in the 2nd series
n = number of observations

When the rank is not given

When the ranks are not given, first rank each series of data in either ascending or descending order
(using the same order for both series), then apply the same formula.

When the rank is repeated

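When ranks are repeated (tied), each repeated value is assigned the average of the ranks it would
otherwise occupy, and the formula is modified to

$$ R = 1 - \frac{6\left(\sum d^2 + \frac{m_1^3 - m_1}{12} + \frac{m_2^3 - m_2}{12} + \dots\right)}{n(n^2 - 1)} $$

where m1 = number of times the first repeated value occurs,

m2 = number of times the second repeated value occurs, and so on.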
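
The three cases can be handled by one routine. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, values are ranked in descending order (largest value gets rank 1), and repeated values share the average rank. With no repetitions the correction terms vanish and the routine reduces to the basic formula. The data are the nurse (X) and physician (Y) scores from Problem 4 of the rank correlation problems, where 18 occurs twice in X.

```python
from collections import Counter

def average_ranks(values):
    """Descending ranks; repeated values share the mean of their positions."""
    ordered = sorted(values, reverse=True)
    counts = Counter(values)
    first_position = {}
    for position, v in enumerate(ordered, start=1):
        first_position.setdefault(v, position)
    # A value repeated m times occupies positions p .. p+m-1, mean p + (m-1)/2
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]

def spearman_R(x, y):
    """Spearman's R with the (m^3 - m)/12 correction for each repeated value."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = sum((m ** 3 - m) / 12
                     for series in (x, y)
                     for m in Counter(series).values() if m > 1)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))

X = [18, 13, 18, 15, 10, 12, 8, 4, 7, 3]
Y = [23, 20, 18, 16, 14, 11, 10, 7, 6, 4]

R = spearman_R(X, Y)
print(round(R, 4))  # about 0.9273
```

Here the sum of squared rank differences is 11.5 and the single correction term is (2^3 - 2)/12 = 0.5, so R = 1 - 6(12)/990.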

Problems

1. Calculate the Spearman’s rank correlation coefficient between smoking and cancer from the given data

Smoking 1 2 3 4 5 6 7 8 9 10
Cancer 2 1 4 3 6 7 5 9 8 10
(0.927)

2. Six items are ranked by three quality control experts in the following order

Ranked by 1st expert    1   2   3   4   5   6
Ranked by 2nd expert    1   2   3   4   6   5
Ranked by 3rd expert    6   5   4   3   2   1
Use the method of rank correlation to determine which pair of experts has the nearest approach.

3. The distribution of marks in Microbiology and Biostatistics for the students in the 1st term exam
is given below

Microbiology     25   28   32   36   40   38   39   42
Biostatistics    50   42   40   38   43   47   35   41
Calculate the rank correlation coefficient.

4. The following scores represent a nurse’s assessment (X) and a physician’s assessment (Y) of the
condition of 10 patients at the time of admission to a trauma centre:

X    18   13   18   15   10   12   8    4   7   3
Y    23   20   18   16   14   11   10   7   6   4
Calculate the rank correlation coefficient. (0.9273)

5. Ten students got the following percentage of marks in Anatomy and Physiology

Anatomy 78 45 36 78 62 90 65 75 39 41
Physiology 84 55 50 60 82 86 58 60 47 51
(0.89)

6. Find the rank correlation coefficient from the following data of the age of wife and husband

Age of husband 23 27 28 29 30 31 33 35 36 39
Age of wife 21 22 23 24 25 26 28 29 30 32

Test for the rank correlation coefficient

Case I: Small sample case i.e. n≤30

Hypothesis

Null hypothesis H0: ρ = 0, i.e. there is no correlation in the population.

Alternative hypothesis H1: ρ ≠ 0, i.e. there is correlation in the population.

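Test statistic

The test statistic is the rank correlation coefficient itself, i.e.

$$ R = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$

or, when ranks are repeated,

$$ R = 1 - \frac{6\left(\sum d^2 + \frac{m_1^3 - m_1}{12} + \frac{m_2^3 - m_2}{12} + \dots\right)}{n(n^2 - 1)} $$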

Critical value: The tabulated value is taken from the table for the respective d.f., i.e. for n.

Decision: Accept H0 if the calculated value is less than or equal to the tabulated value. Reject H0
otherwise.

Case II: Large sample case, i.e. n > 30

Use the Z test, with the test statistic

$$ Z = R\sqrt{n - 1} $$

which approximately follows the standard normal distribution.
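
The large-sample case can be sketched in a few lines of Python. The values R = 0.40 and n = 50 are made up for illustration and do not come from the notes; the 5% two-sided significance level is likewise an assumption of this sketch. The critical value is taken from the standard normal distribution using `statistics.NormalDist` from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

def large_sample_z(R, n):
    """Test statistic R * sqrt(n - 1) for the large-sample case (n > 30)."""
    return R * sqrt(n - 1)

R, n = 0.40, 50   # hypothetical rank correlation and sample size
alpha = 0.05      # assumed two-sided significance level

z = large_sample_z(R, n)
critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
reject_h0 = abs(z) > critical

print(round(z, 2), round(critical, 2), reject_h0)
```

With these made-up inputs the statistic is 0.40 × 7 = 2.8, which exceeds the critical value, so H0 would be rejected at the 5% level.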

Regression: The literal meaning of regression is stepping back towards the average. In regression
analysis there are two kinds of variable: dependent and independent. The variable whose value is to
be estimated or predicted is called the dependent variable. The variable used for prediction is called
the independent variable; it is also called the regressor, predictor or explanatory variable. The main
objective of regression analysis is to establish a functional relationship between the dependent and
independent variables and to use it to predict the values of the dependent variable.

Lines of regression: From bivariate data we can plot a scatter diagram as before. If the points
cluster around some curve, that curve is known as the curve of regression. If the curve is a straight
line, the regression is said to be linear; otherwise it is non-linear or curvilinear. A linear
relationship between two variables is described by a straight line through the points, known as the
line of regression. A line of regression gives the best estimate of one unknown variable for any given
value of a known variable. A line fitted by the method of least squares is the line of best fit. In
other words, the least squares method finds the line that comes closest to running through all of the
data points; this line passes through the centre of the data, i.e. the point of means. The least
squares method therefore derives the equation of the line that describes the relationship among the
data points with minimum error.

There are two lines of regression: one of Y on X and the other of X on Y. The line of regression of Y
on X is used to estimate the value of the dependent variable Y for any given value of the independent
variable X. Similarly, the line of regression of X on Y is used to estimate the value of the dependent
variable X for any given value of the independent variable Y.

The regression equation of Y on X is obtained by minimizing the sum of squares of the errors measured
parallel to the Y axis, while the regression equation of X on Y is obtained by minimizing the sum of
squares of the errors measured parallel to the X axis.

Assumptions of linear regression:

The following are the assumptions of the linear regression equation:

1. The regression model is linear in the parameters.
2. The error term Ɛ is a real random variable.
3. Ɛ follows a normal distribution with mean 0 and variance σ².
4. The random errors Ɛ are independent, i.e. E(Ɛi Ɛj) = 0 for i ≠ j.
5. X is uncorrelated with the error Ɛ, i.e. E(XƐ) = 0.
6. X is measured without error.

Regression equation Y on X

The line of regression of Y on X is given by

$$ y - \bar{y} = b_{yx}(x - \bar{x}) $$

where y is the dependent variable, x is the independent variable, and byx is the regression
coefficient of y on x:

$$ b_{yx} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} $$

If u = X - A and v = Y - B, then

$$ b_{yx} = \frac{n \sum uv - \sum u \sum v}{n \sum u^2 - (\sum u)^2} $$

The regression coefficient byx is also called the rate of change of Y per unit change in X.

Another method

The regression equation of Y on X is

Y = a + bX

which can be estimated by the method of least squares. Its normal equations are

∑y = na + b∑x

∑xy = a∑x + b∑x²

Solving these two equations gives the values of a and b.
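
Solving the two normal equations by hand amounts to solving a 2 × 2 linear system. The following minimal Python sketch does this with Cramer's rule; the function name is hypothetical and the data are the dose and response values from Exercise 2 of the regression exercises.

```python
def fit_y_on_x(x, y):
    """Solve the normal equations  sum(y) = n*a + b*sum(x)
    and  sum(xy) = a*sum(x) + b*sum(x^2)  for a and b (Cramer's rule)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_x2 = sum(p * p for p in x)
    det = n * sum_x2 - sum_x ** 2          # determinant of the 2x2 system
    a = (sum_y * sum_x2 - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b

dose = [0.3, 0.4, 0.6, 0.8, 0.9]
response = [54, 59, 60, 65, 70]

a, b = fit_y_on_x(dose, response)
predicted = a + b * 0.5   # estimated response for a dose of 0.5

print(round(a, 4), round(b, 4), round(predicted, 4))
```

The slope b obtained this way equals the regression coefficient byx from the formula given earlier, and the fitted line passes through the point of means (0.6, 61.6).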

Regression equation X on Y

The line of regression of X on Y is given by

$$ x - \bar{x} = b_{xy}(y - \bar{y}) $$

where X is the dependent variable, Y is the independent variable, and bxy is the regression
coefficient of X on Y:

$$ b_{xy} = \frac{n \sum xy - \sum x \sum y}{n \sum y^2 - (\sum y)^2} $$

If u = X - A and v = Y - B, then

$$ b_{xy} = \frac{n \sum uv - \sum u \sum v}{n \sum v^2 - (\sum v)^2} $$

The regression coefficient bxy is also called the rate of change of X per unit change in Y.

The regression equation of X on Y is

X = a’ + b’Y

which can be estimated by the method of least squares. Its normal equations are

∑x = na’ + b’∑y

∑xy = a’∑y + b’∑y²

Solving these two equations gives the values of a’ and b’.

Properties of regression coefficient

1. The correlation coefficient is the geometric mean of the two regression coefficients:

$$ r = \pm\sqrt{b_{xy} \, b_{yx}} $$

2. Both regression coefficients have the same sign, i.e. if one is positive the other is also positive,
and if one is negative the other is also negative.
3. The product of the two regression coefficients cannot exceed unity, i.e. bxy · byx ≤ 1.
4. Both lines of regression pass through the point of means (x̄, ȳ).
5. The arithmetic mean of the regression coefficients is greater than or equal to the correlation
coefficient (in absolute value).
6. The regression coefficients are independent of change of origin but not of scale.
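
Several of these properties can be verified on a small data set. The sketch below is illustrative only; it uses the father and son heights from Problem 3 of the correlation problems, and the variable names are arbitrary.

```python
from math import copysign, sqrt

father = [63, 65, 66, 67, 67, 68]   # X
son = [66, 68, 65, 67, 69, 70]      # Y
n = len(father)

sum_x, sum_y = sum(father), sum(son)
sum_xy = sum(x * y for x, y in zip(father, son))
sum_x2 = sum(x * x for x in father)
sum_y2 = sum(y * y for y in son)

cross = n * sum_xy - sum_x * sum_y
b_yx = cross / (n * sum_x2 - sum_x ** 2)   # regression coefficient of Y on X
b_xy = cross / (n * sum_y2 - sum_y ** 2)   # regression coefficient of X on Y

# Property 1: r is the geometric mean, taking the common sign of b_yx and b_xy
r = copysign(sqrt(b_yx * b_xy), b_yx)

print(round(b_yx, 4), round(b_xy, 4), round(r, 4))
print(b_yx * b_xy <= 1)                    # property 3
print(abs(b_yx + b_xy) / 2 >= abs(r))      # property 5
```

The value of r recovered this way, about 0.5976, agrees with the answer quoted for that problem.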

Exercise

1. The following table gives the normal weight of baby during the first six months of life

Age in months 0 2 3 5 6
Weight in lbs 5 7 8 10 12
Find the two lines of regression and also estimate the weight of baby at the age of 4 months
2. During a laboratory experiment, muscular contractions of a frog muscle were measured against
different doses of a given drug. The height of the curves was taken as the response to the drug. The
observations were as follows:

Doses of drug       0.3   0.4   0.6   0.8   0.9
Response of drug    54    59    60    65    70
Calculate the response of drug for a dose of 0.5

3. The following table gives the ages and blood pressures of 6 women

Age in years      56    42    72    36    63    47
Blood pressure    147   125   160   118   149   128
i. Compute the two lines of regression.
ii. Estimate the blood pressure of a woman whose age is 45 years.

4. A study reported in a medical journal suggested that the peak heart rate an individual can reach
during intensive exercise decreases with age. A cardiologist wanted to do his own study. The next 9
patients were given a stress test on the treadmill at 6 miles per hour, and their ages (X) and heart
rates (Y) were recorded as follows:

X    30    30    40    20    20    45    30    45    50
Y    190   180   180   200   195   170   185   175   165
Can we predict the peak heart rate of a man over 80 years old who is given a similar stress test? If
so, what peak heart rate do you predict?

Multiple Regression Analysis: In simple regression analysis, we studied the linear relationship
between only two variables, one independent and the other dependent. Based on that relationship, we
could predict the value of the dependent variable for a given value of the independent variable.
Multiple regression analysis measures the relationship between the dependent variable and two or more
independent variables. The procedure is similar to that of simple regression, with the difference that
other independent variables are added to the regression equation.

The multiple regression equation of the dependent variable X1 on the independent variables X2 and X3
is given by

$$ X_1 = a + b_1 X_2 + b_2 X_3 \qquad (1) $$

where a = intercept
b1 = the partial regression coefficient of X1 on X2 keeping X3 constant (also written as b12.3)
b2 = the partial regression coefficient of X1 on X3 keeping X2 constant (also written as b13.2)

The values of a, b1 and b2 are estimated by the method of least squares, which gives the normal
equations:

$$ \sum X_1 = na + b_1 \sum X_2 + b_2 \sum X_3 \qquad (2) $$
$$ \sum X_1 X_2 = a \sum X_2 + b_1 \sum X_2^2 + b_2 \sum X_2 X_3 \qquad (3) $$
$$ \sum X_1 X_3 = a \sum X_3 + b_1 \sum X_2 X_3 + b_2 \sum X_3^2 \qquad (4) $$

Solving equations (2), (3) and (4) gives the values of a, b1 and b2. Substituting these values in
equation (1) gives the fitted regression equation of X1 on X2 and X3.
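
The three normal equations form a 3 × 3 linear system, which can be solved by Cramer's rule. The following Python sketch is a minimal illustration with hypothetical helper names. The data are made up for the example (they are not from the notes) and were constructed so that X1 = 1 + 2·X2 + 3·X3 holds exactly, which makes the expected result easy to check.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_multiple(x1, x2, x3):
    """Solve normal equations (2), (3), (4) for a, b1, b2 by Cramer's rule."""
    n = len(x1)
    s = lambda u, v: sum(p * q for p, q in zip(u, v))  # sum of products
    coeffs = [[n,       sum(x2),   sum(x3)],
              [sum(x2), s(x2, x2), s(x2, x3)],
              [sum(x3), s(x2, x3), s(x3, x3)]]
    rhs = [sum(x1), s(x1, x2), s(x1, x3)]
    d = det3(coeffs)
    solution = []
    for col in range(3):
        # Replace one column of the coefficient matrix by the right-hand side
        replaced = [row[:col] + [rhs[i]] + row[col + 1:]
                    for i, row in enumerate(coeffs)]
        solution.append(det3(replaced) / d)
    return tuple(solution)

# Made-up data with X1 = 1 + 2*X2 + 3*X3 exactly
X2 = [1, 2, 3, 4, 5]
X3 = [2, 1, 4, 3, 6]
X1 = [9, 8, 19, 18, 29]

a, b1, b2 = fit_multiple(X1, X2, X3)
print(round(a, 4), round(b1, 4), round(b2, 4))
```

Cramer's rule is used here only because it mirrors a hand calculation; for more than two independent variables a general linear solver would be the more practical choice.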
