
Stats Main


Shri Ramdeobaba College of Engineering and Management, Nagpur
First Semester B.E.
Statistics

Department of Mathematics RCOEM

Scope of Statistics:


Modern civilization demands a systematic study of the characteristics of
variation, i.e., the difference of measured values from their normal or previous
values, in real-world problems such as

• changes in atmospheric parameters (e.g., temperature, humidity, rainfall, wind flow, etc.),

• growth patterns and behaviour of living beings, and

• fluctuations in production and consumption systems in industry, electricity, agriculture, etc.

The tools for studying these varying parameters systematically are mainly
discussed in the field of statistics. There would be no need for statistical
methods if there were no variation. However, not all variations are studied in
the field of statistics. Statistics deals only with variation that includes some
degree of randomness. This randomness is what relates statistical methods
to probability theory.
Statistics is a subject that helps in transforming given data into meaningful
information to support decision-making. More specifically, statistics is the
study of the populations from which data are obtained, of the variation in the
measured features of the data, and of methods of data reduction.
The meaning of statistics can be explained in a plural as well as a singular
sense.
In the plural sense, statistics means numerical facts and data presented
as meaningful information for a definite purpose. In the singular sense,
statistics refers to the scientific methods used for analysing, interpreting
and presenting data.
Probability is closely related to statistics. The subject matter of probability
theory and statistical techniques is often described as part of constructive
mathematics. Probability theory is developed on the basis of three different
aspects:

1. the formal logical context,

2. the intuitive background, and

3. the applications.

Statistical techniques, on the other hand, are developed to study scientific
activities and are applied to handling numerical data in all branches of study,
such as business, arts, commerce, social and behavioural science, computer
science, and any branch of science and technology.

Role of Statistics:
Statistics plays a key role in prediction, estimation, identification, classification,
etc., in many areas of arts, business, computer science, science and technology.
The following questions help to frame a given problem as a statistical
problem.

1. What is the population related to the data/problem that has to be described?

2. What are the sources of variation?

3. How should the data be reduced?

Limitations of Statistics:
There are a number of limitations to the study of statistical methods.
Some of them are:

• statistical methods use only numerical data, since the algorithms developed
for statistical methods are based on numerical computation,

• statistical methods are designed for collective matters and not for individual events,

• statistical methods perform decision-making tasks based on past observations, and

• only an appropriate combination of collected data and statistical methods
leads to the design of a good system.

Curve fitting:
In many branches of applied mathematics, it is required to express given
data, obtained from observations, in the form of a law connecting the two
variables involved. Such a law is known as an empirical law. Several equations
of different types can be obtained to express the given data approximately.
The problem is to find the equation of the curve of 'best fit', which is the
most suitable for predicting unknown values. The process of finding such an
equation of 'best fit' is known as curve fitting.
The most widely used method of fitting a unique curve to given data is the
method of least squares.

Principle of Least Squares:


The method of least squares is probably the most systematic procedure for fitting
a unique curve through the given data points. Let $y = f(x)$ be the equation
of the curve to be fitted to the given data (observed or experimental) points
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. At $x = x_1$, the observed (or experimental) value
of the ordinate is $y_1 = P_1M_1$, and the corresponding value on the fitting curve
is $N_1M_1$, i.e., $f(x_1)$. The difference between the observed and the expected
(theoretical) value is $P_1N_1 = P_1M_1 - N_1M_1 = e_1$; this difference is called the
error. Writing the error at every data point in the same way,
$$e_1 = y_1 - f(x_1)$$
$$e_2 = y_2 - f(x_2)$$
$$e_3 = y_3 - f(x_3)$$
$$\vdots$$
$$e_n = y_n - f(x_n).$$
Some of the errors $e_1, e_2, \ldots, e_n$ will be positive and others will be negative.
If the errors are simply added to find the total error, positive and negative
errors may cancel, and in some cases the sum of all the errors may be zero,
which gives a false impression of a perfect fit. To avoid this, all the errors
are made non-negative by squaring:

$$S = e_1^2 + e_2^2 + \cdots + e_n^2.$$

The curve of best fit is that for which the sum of the squares of the errors ($S$) is
minimum. This is known as the principle of least squares.
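
To see why the errors are squared before being added, here is a minimal Python sketch. It is an illustration only: the helper name `errors`, the two candidate curves and the three sample points are assumptions made for this example and do not come from the notes.

```python
def errors(f, points):
    """Return the errors e_i = y_i - f(x_i) for the given data points."""
    return [y - f(x) for x, y in points]

# Hypothetical data: three points lying exactly on the line y = 1 + x.
points = [(0, 1), (1, 2), (2, 3)]

def flat(x):
    return 2.0        # candidate curve f(x) = 2 (a poor fit)

def exact(x):
    return 1.0 + x    # candidate curve f(x) = 1 + x (a perfect fit)

for name, f in [("flat", flat), ("exact", exact)]:
    e = errors(f, points)
    total = sum(e)                  # positive and negative errors can cancel
    S = sum(ei ** 2 for ei in e)    # squared errors cannot cancel
    print(name, "sum of errors =", total, "S =", S)
```

Both candidates have a plain error sum of zero, but only the exact line has S = 0, so S is the quantity that separates a good fit from a poor one.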

Method of Least Squares:


Let
$$y = a + bx \tag{1}$$
be the straight line to be fitted to the given data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
Let $y_{t_1}$ be the theoretical ordinate for $x_1$. That is,
$$PM = y_1, \qquad NM = y_{t_1},$$
where
$$y_{t_1} = a + bx_1 \quad \text{and} \quad PN = e_1.$$
$$\therefore PN = PM - NM \implies e_1 = y_1 - y_{t_1}$$
$$\therefore e_1 = y_1 - (a + bx_1)$$
On squaring, we get
$$e_1^2 = (y_1 - a - bx_1)^2$$
$$S = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^{n} e_i^2$$
$$\therefore S = \sum_{i=1}^{n} (y_i - a - bx_i)^2$$
For $S$ to be minimum,
$$\frac{\partial S}{\partial a} = \sum_{i=1}^{n} 2(y_i - a - bx_i)(-1) = 0$$
or
$$\sum (y - a - bx) = 0 \tag{2}$$
(to generalise, we write $y_i$ as $y$ and $x_i$ as $x$), and
$$\frac{\partial S}{\partial b} = \sum_{i=1}^{n} 2(y_i - a - bx_i)(-x_i) = 0$$
or
$$\sum (xy - ax - bx^2) = 0 \tag{3}$$
On simplifying, equations (2) and (3) become
$$\sum y = na + b\sum x \tag{4}$$
$$\sum xy = a\sum x + b\sum x^2 \tag{5}$$
Equations (4) and (5) are known as the normal equations. On solving equations (4)
and (5), we get the values of the unknowns $a$ and $b$.
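
The two normal equations can be solved in closed form by elimination. The sketch below is one possible way to do this in Python; the function name `fit_line` and the sample points are assumptions for illustration, not part of the notes.

```python
def fit_line(xs, ys):
    """Solve the normal equations (4) and (5) for a and b in y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    det = n * sum_x2 - sum_x ** 2      # zero only if all x values coincide
    if det == 0:
        raise ValueError("all x values are equal; the line is not unique")
    b = (n * sum_xy - sum_x * sum_y) / det   # eliminate a from (4) and (5)
    a = (sum_y - b * sum_x) / n              # back-substitute into (4)
    return a, b

# Hypothetical check: points lying exactly on y = 1 + 2x are recovered.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

The guard on `det` reflects the fact that a unique line exists only when the data contain at least two distinct x values.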

Note:

1. Fitting of a straight line using the least squares principle.
To fit $y = a + bx$, put $\sum$ before every term of the equation:
$$\sum y = \sum a + \sum bx, \quad \text{i.e.,} \quad \sum y = na + b\sum x.$$
Multiply the equation $y = a + bx$ by $x$ and put $\sum$ before each term:
$$\sum xy = \sum ax + \sum bx^2, \quad \text{i.e.,} \quad \sum xy = a\sum x + b\sum x^2.$$

2. Fitting of a parabola $y = a + bx + cx^2$.
Similarly, the normal equations are
$$\sum y = na + b\sum x + c\sum x^2$$
$$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$$
$$\sum x^2y = a\sum x^2 + b\sum x^3 + c\sum x^4$$

3. Fitting of other curves (power and exponential curves)

(a) $y = ax^b$
(b) $y = ab^x$
(c) $y = ae^{bx}$

Such equations can be converted into a straight line by taking logarithms.
Consider $y = ax^b$. Taking logs on both sides,
$$\log y = \log(ax^b) = \log a + b\log x.$$
Let $Y = \log y$, $A = \log a$, $B = b$ and $X = \log x$. Then
$$Y = A + BX,$$
which is the equation of a straight line.
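
The three normal equations for the parabola in note 2 form a 3×3 linear system in a, b and c. The following Python sketch sets up and solves that system; the helper names `solve3` and `fit_parabola` and the test points are assumptions for illustration, and Gauss-Jordan elimination is just one convenient way to solve the system.

```python
def solve3(A, r):
    """Solve the 3x3 system A @ [a, b, c] = r by Gauss-Jordan elimination."""
    M = [row[:] + [ri] for row, ri in zip(A, r)]    # augmented matrix
    for col in range(3):
        # partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, 3), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(3):
            if i != col:
                factor = M[i][col] / M[col][col]
                M[i] = [mi - factor * mc for mi, mc in zip(M[i], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def fit_parabola(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2 via its normal equations."""
    n = len(xs)
    s = lambda p: sum(x ** p for x in xs)                     # sum of x^p
    t = lambda p: sum((x ** p) * y for x, y in zip(xs, ys))   # sum of x^p * y
    A = [[n,    s(1), s(2)],
         [s(1), s(2), s(3)],
         [s(2), s(3), s(4)]]
    return solve3(A, [t(0), t(1), t(2)])

# Hypothetical check: points lying exactly on y = 1 + 2x + 3x^2.
xs = [-2, -1, 0, 1, 2]
ys = [1 + 2 * x + 3 * x * x for x in xs]
a, b, c = fit_parabola(xs, ys)
print(round(a, 6), round(b, 6), round(c, 6))
```

The curves in note 3 need no new solver: after the log transform they reduce to the straight-line case.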

Example 1: Fit a straight line $y = a + bx$ to the following data by the
method of least squares.

x    0   1   3   6   8
y    1   3   2   5   4

Answer: We have $y = a + bx$.

x          y          xy          x^2
0          1          0           0
1          3          3           1
3          2          6           9
6          5          30          36
8          4          32          64
Σx = 18    Σy = 15    Σxy = 71    Σx^2 = 110

The normal equations are
$$\sum y = na + b\sum x \tag{6}$$
$$\sum xy = a\sum x + b\sum x^2 \tag{7}$$
From equation (6),
$$15 = 5a + 18b \tag{8}$$
and from equation (7),
$$71 = 18a + 110b \tag{9}$$
On solving equations (8) and (9), we get $b = 0.376$ and $a = 1.646$.
Putting these values of $a$ and $b$ in $y = a + bx$, we get

$$y = 1.646 + 0.376x.$$
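
The arithmetic of Example 1 can be checked with a few lines of Python. This is only a verification sketch; the variable names are chosen here for illustration.

```python
# Data of Example 1
xs = [0, 1, 3, 6, 8]
ys = [1, 3, 2, 5, 4]

n = len(xs)
sum_x = sum(xs)                                # 18
sum_y = sum(ys)                                # 15
sum_xy = sum(x * y for x, y in zip(xs, ys))    # 71
sum_x2 = sum(x * x for x in xs)                # 110

# Solve 15 = 5a + 18b and 71 = 18a + 110b by elimination.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# For a least-squares line with an intercept, the errors sum to zero
# (this is normal equation (2)), up to floating-point rounding.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(round(a, 3), round(b, 3))
```

The residual check is a quick way to catch arithmetic slips when the normal equations are solved by hand.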

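Example 2: Fit a curve $y = ax^2 + b$ to the following data.

x    12     16    20    22      24      26      30
y    6.44   7.5   6.9   10.76   10.76   11.76   14.0

Answer: We have $y = ax^2 + b$. Since the model is linear in $x^2$, the normal
equations are obtained by replacing $x$ with $x^2$ in the straight-line case:
$$\sum y = a\sum x^2 + nb$$
$$\sum x^2y = a\sum x^4 + b\sum x^2$$

x     y        x^2    x^4       x^2 y
12    6.44     144    20736     927.36
16    7.5      256    65536     1920
20    6.9      400    160000    2760
22    10.76    484    234256    5207.84
24    10.76    576    331776    6197.76
26    11.76    676    456976    7949.76
30    14.0     900    810000    12600
Σy = 68.12, Σx^2 = 3436, Σx^4 = 2079280, Σx^2 y = 37562.72

With $n = 7$, the normal equations become

68.12 = 3436a + 7b

37562.72 = 2079280a + 3436b

On solving these two equations, we get a = 0.0105 and b = 4.5746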
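
A fit of the form y = ax² + b is a straight-line fit in the substituted variable u = x². The Python sketch below carries out that substitution for the data of Example 2; the variable names are assumptions for illustration.

```python
# Data of Example 2
xs = [12, 16, 20, 22, 24, 26, 30]
ys = [6.44, 7.5, 6.9, 10.76, 10.76, 11.76, 14.0]

# Substitute u = x^2, so that the model y = a*x^2 + b becomes y = a*u + b.
us = [x * x for x in xs]

n = len(us)
sum_u = sum(us)                                 # sum of x^2
sum_y = sum(ys)
sum_uy = sum(u * y for u, y in zip(us, ys))     # sum of x^2 * y
sum_u2 = sum(u * u for u in us)                 # sum of x^4

# Normal equations: sum_y = a*sum_u + n*b and sum_uy = a*sum_u2 + b*sum_u
a = (n * sum_uy - sum_u * sum_y) / (n * sum_u2 - sum_u ** 2)
b = (sum_y - a * sum_u) / n
print(round(a, 4), round(b, 4))
```

Note that the sums needed are those of x², x⁴ and x²y, not of x and xy.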

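Example 3: Fit a curve $y = ax^b$ to the following data.

x    1      2      3      4     5     6
y    2.98   4.26   5.21   6.1   6.8   7.5

Answer: The given equation is $y = ax^b$. Taking logarithms (to base 10), we get

$$\log y = \log a + b\log x.$$

Let $Y = \log y$, $A = \log a$, $B = b$ and $X = \log x$, so that

$$Y = A + BX.$$
The normal equations are
$$\sum Y = nA + B\sum X \quad \text{and} \quad \sum XY = A\sum X + B\sum X^2.$$
From the data,
$$\sum X = \sum \log x = 2.8574, \qquad \sum Y = \sum \log y = 4.3133,$$
$$\sum XY = \sum (\log x \cdot \log y) = 2.2671, \qquad \sum X^2 = 1.7749.$$

On solving the normal equations for $A$ and $B$ and using $a = 10^A$, $b = B$, we get

$$a = 2.9785 \quad \text{and} \quad b = 0.5143$$

$$\implies y = 2.9785\,x^{0.5143}.$$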
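
The log transform of Example 3 can be reproduced as follows. This is a sketch that assumes base-10 logarithms; any base gives the same a and b provided the antilog uses the same base.

```python
import math

# Data of Example 3
xs = [1, 2, 3, 4, 5, 6]
ys = [2.98, 4.26, 5.21, 6.1, 6.8, 7.5]

# Transform y = a * x**b into the straight line Y = A + B*X.
X = [math.log10(x) for x in xs]
Y = [math.log10(y) for y in ys]

n = len(X)
sum_X = sum(X)
sum_Y = sum(Y)
sum_XY = sum(p * q for p, q in zip(X, Y))
sum_X2 = sum(p * p for p in X)

# Solve the two normal equations for A and B.
B = (n * sum_XY - sum_X * sum_Y) / (n * sum_X2 - sum_X ** 2)
A = (sum_Y - B * sum_X) / n

a = 10 ** A    # undo the logarithm: A = log10(a)
b = B
print(round(a, 3), round(b, 3))
```

Because the sums in the worked solution are rounded to four decimals, the last digit of a computed this way may differ slightly from the hand calculation.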

Multiple Regression:
A variable $z$ is to be estimated from variables $x$ and $y$ by means of a regression
equation. Consider the equation

$$z = a + bx + cy.$$

The sum of the squares of the deviations is given by
$$S = \sum (a + bx + cy - z)^2.$$

For $S$ to be minimum,
$$\frac{\partial S}{\partial a} = 0, \qquad \frac{\partial S}{\partial b} = 0, \qquad \frac{\partial S}{\partial c} = 0.$$
Therefore the normal equations are
$$\sum z = na + b\sum x + c\sum y$$
$$\sum xz = a\sum x + b\sum x^2 + c\sum xy$$
$$\sum yz = a\sum y + b\sum xy + c\sum y^2$$
By solving these three equations we get the values of $a$, $b$ and $c$; putting these
values in $z = a + bx + cy$ gives the equation of best fit.

Example: The table shows the weight (z) to the nearest pound, height (x) to
the nearest inch and age (y) to the nearest year of 12 boys.

Weight (z)    66  71  53  67  55  58  77  57  56  51  76  68
Height (x)    57  59  49  62  51  50  55  48  52  42  61  57
Age (y)        8  10   6  11   8   7  10   9  10   6  12   9

1. Find the least squares regression equation of z on x and y.

2. Estimate the weight of a boy who is 9 years old and 54 inches tall.

Answer: Here we have $z = a + bx + cy$ and $n = 12$. The normal equations
are given by
$$\sum z = 12a + b\sum x + c\sum y$$
$$\sum xz = a\sum x + b\sum x^2 + c\sum xy$$
$$\sum yz = a\sum y + b\sum xy + c\sum y^2$$
From the data we have $\sum x = 643$, $\sum y = 106$, $\sum z = 755$, $\sum x^2 = 34843$,
$\sum y^2 = 976$, $\sum xz = 40944$, $\sum yz = 6812$, $\sum xy = 5779$.
The normal equations become

755 = 12a + 643b + 106c
40944 = 643a + 34843b + 5779c
6812 = 106a + 5779b + 976c
On solving these equations simultaneously, we get
a = 1.7316, b = 0.9326 and c = 1.2693.
Therefore the required regression equation of z on x and y is given by
z = 1.7316 + 0.9326x + 1.2693y.
When x = 54 and y = 9, we have
z = 1.7316 + 0.9326(54) + 1.2693(9) = 63.52
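
The three normal equations of this example can also be built from the raw data and solved numerically. The Python sketch below is one way to do so; the helper names `solve_linear` and `s` are assumptions for illustration. It should give coefficients near a ≈ 1.73, b ≈ 0.93 and c ≈ 1.27.

```python
def solve_linear(A, r):
    """Solve A @ v = r by Gauss-Jordan elimination with partial pivoting."""
    n = len(r)
    M = [row[:] + [ri] for row, ri in zip(A, r)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(n):
            if i != col:
                f = M[i][col] / M[col][col]
                M[i] = [mi - f * mc for mi, mc in zip(M[i], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

z = [66, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68]   # weight
x = [57, 59, 49, 62, 51, 50, 55, 48, 52, 42, 61, 57]   # height
y = [8, 10, 6, 11, 8, 7, 10, 9, 10, 6, 12, 9]          # age

def s(u, v=None):
    """Sum of u, or of the products u*v when v is given."""
    return sum(u) if v is None else sum(p * q for p, q in zip(u, v))

n = len(z)
A = [[n,    s(x),    s(y)],
     [s(x), s(x, x), s(x, y)],
     [s(y), s(x, y), s(y, y)]]
rhs = [s(z), s(x, z), s(y, z)]

a, b, c = solve_linear(A, rhs)
estimate = a + b * 54 + c * 9    # boy who is 54 inches tall and 9 years old
print(round(a, 3), round(b, 3), round(c, 3), round(estimate, 2))
```

Carrying more decimal places in a, b and c changes the estimated weight only in the first or second decimal, so the estimate is about 63.5 pounds either way.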

Examples for practice:

1. Fit a straight line $y = a + bx$ to the following data.

x    1     2   3     4   6   8
y    2.4   3   3.6   4   5   6

2. Find the least squares fit of the form $y = a + bx^2$ to the following data.

x    -1   0   1   2
y     2   5   3   0

3. Fit a curve $y = ab^x$ to the following data.

x    2     3       4       5       6
y    144   172.3   207.4   248.8   298.5

4. Use the least squares method to fit a curve of the form $y = ae^{bx}$ to the data,

x    1       2       3       4       5       6
y    7.209   5.265   3.846   2.809   2.052   1.499

(Answer: $y = 9.86832\,e^{-0.3141x}$)

5. Fit a curve obeying the gas equation $pv^r = k$ to the following data,

v    50    100   150   200
p    135   48    26    17

(Answer: $pv^{1.4978} = 47350$)

6. Employ the method of least squares to fit a parabola $y = a + bx + cx^2$ to
the following data:
$(x, y)$: $(-1, 2)$; $(0, 0)$; $(0, 1)$; $(1, 2)$
(Answer: $y = 0.5 + 1.5x^2$)

7. Find the least squares polynomial approximation of degree two to the data,

x     0    1   2    3    4
y    -4   -1   4   11   20

(Answer: $y = -4 + 2x + x^2$)

8. Fit a second degree curve of regression of y on x to the following data.

x    -2   -1   0   1    2
y    15    1   1   3   19

(Answer: $y = -1.0571 + x + 4.4286x^2$)

9. Fit a second degree parabola to the following data,

x    1   2   3   4    5    6    7    8   9
y    2   6   7   8   10   11   11   10   9

(Answer: $y = -0.9282 + 3.523x - 0.2673x^2$)

10. Fit the curve $y = ax + \frac{b}{x}$ to the following data,

x    1      2      3      4       5       6       7       8
y    5.43   6.28   8.23   10.32   12.63   14.86   17.27   19.51

(Answer: $y = 2.3972x + \frac{3.0262}{x}$)

11. The table shows the corresponding values of three variables x, y, z. Find the
linear least squares regression of z on x and y, and hence estimate z when
x = 10 and y = 6.

x     3    5    6    8   12   14
y    16   10    7    4    3    2
z    90   72   54   42   30   12

(Answer: $z = 61.4 - 3.64x + 2.53y$, $z = 40.18$)

Correlation:
When changes in one variable are associated with or followed by changes in
another, the relationship is called correlation, and data connecting two such
variables are called a bivariate population.
If an increase (or decrease) in the values of one variable corresponds to an
increase (or decrease) in the other, the correlation is said to be positive. If
an increase (or decrease) in one corresponds to a decrease (or increase) in
the other, the correlation is said to be negative. If there is no relationship
indicated between the variables, they are said to be independent or uncorrelated.

Coefficient of Correlation:
The numerical measure of correlation is called the coefficient of correlation
and is defined by
$$r = \frac{\sum XY}{n\,\sigma_x \sigma_y}$$
where $X$ = deviation from the mean $\bar{x}$, i.e., $X = x - \bar{x}$,
$Y$ = deviation from the mean $\bar{y}$, i.e., $Y = y - \bar{y}$,
$\sigma_x$ = standard deviation of the x-series, $\sigma_y$ = standard deviation of the y-series
and $n$ = number of pairs of values of the two variables.

The value of the coefficient of correlation ($r$) always lies between −1 and 1.

The sign of $r$ determines the nature of the correlation: positive when $r$ is
positive and negative when $r$ is negative.
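
The definition translates directly into code. The sketch below follows the formula term by term, using population standard deviations (dividing by n); the function name `correlation` and the sample data are assumptions for illustration.

```python
import math

def correlation(xs, ys):
    """Coefficient of correlation r = sum(XY) / (n * sigma_x * sigma_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    X = [x - x_bar for x in xs]     # deviations from the mean of x
    Y = [y - y_bar for y in ys]     # deviations from the mean of y
    sum_XY = sum(p * q for p, q in zip(X, Y))
    sigma_x = math.sqrt(sum(p * p for p in X) / n)
    sigma_y = math.sqrt(sum(q * q for q in Y) / n)
    return sum_XY / (n * sigma_x * sigma_y)

# Hypothetical data: y rises with x in the first pair and falls in the second.
r_pos = correlation([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = correlation([1, 2, 3, 4], [8, 6, 4, 2])
print(round(r_pos, 6), round(r_neg, 6))
```

Perfectly linear data give r = +1 or r = −1, the two extremes of the allowed range.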

Lines of Regression:

It frequently happens that the dots of the scatter diagram tend to cluster
along a well-defined direction, which suggests a linear relationship
between the variables x and y. Such a line of best fit for the given distribution
of dots is called the line of regression. In fact there are two such lines,
one giving the best possible mean values of y for each specified value of x
and the other giving the best possible mean values of x for given values of y.
The first one is known as the line of regression of y on x and the second one
as the line of regression of x on y.

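Consider the line of regression of y on x, i.e., $y = a + bx$.
The first normal equation is
$$\sum y = na + b\sum x.$$
Dividing by $n$,
$$\frac{\sum y}{n} = a + b\,\frac{\sum x}{n}.$$
We know that
$$\bar{x} = \frac{\sum x}{n} = \text{mean of the x-series}, \qquad \bar{y} = \frac{\sum y}{n} = \text{mean of the y-series}$$
$$\implies \bar{y} = a + b\bar{x}$$
$\implies$ $(\bar{x}, \bar{y})$ satisfies the line of regression $y = a + bx$, i.e., $(\bar{x}, \bar{y})$ lies on the line of
regression. Subtracting $\bar{y} = a + b\bar{x}$ from $y = a + bx$,
$$y - \bar{y} = b(x - \bar{x}) \implies Y = bX.$$
The normal equation for this line is
$$\sum XY = b\sum X^2$$
$$\therefore b = \frac{\sum XY}{\sum X^2}, \quad \text{where } b = \text{coefficient of regression of y on x}.$$

Similarly, for the line of regression of x on y, i.e., $x = c + dy$,
$$d = \frac{\sum XY}{\sum Y^2}, \quad \text{where } d = \text{coefficient of regression of x on y}.$$

Also, the coefficient of correlation is given by
$$r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} = \frac{\sum XY}{n\,\sigma_x \sigma_y}$$
$$\implies \sum XY = r\,n\,\sigma_x \sigma_y.$$
Since $\sum X^2 = n\sigma_x^2$ and $\sum Y^2 = n\sigma_y^2$,
$$\implies b = r\,\frac{\sigma_y}{\sigma_x}, \quad \text{regression coefficient of y on x},$$
$$\implies d = r\,\frac{\sigma_x}{\sigma_y}, \quad \text{regression coefficient of x on y},$$
$$\implies bd = r^2, \quad \text{i.e.,} \quad r = \pm\sqrt{bd}.$$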
Properties of correlation and regression coefficients

1. $-1 \le r \le 1$.

2. The correlation coefficient and the two regression coefficients have the same
sign, i.e., if r is +ve, then both b and d are +ve, and if r is −ve, then
both b and d are −ve.

3. If one of the regression coefficients is greater than unity in magnitude, then
the other must be less than unity in magnitude, since $bd = r^2 \le 1$.

4. If r = +1, there is perfect positive correlation, i.e., all points lie on the
regression line and both regression lines coincide.

5. The arithmetic mean of the regression coefficients is greater than or equal to
the correlation coefficient in magnitude, i.e., $\frac{|b + d|}{2} \ge |r|$.
Example 1: The two regression equations of the variables x and y are
x = 19.13 − 0.87y and y = 11.64 − 0.50x. Find (i) the mean of the x's, (ii) the mean of
the y's and (iii) the correlation coefficient between x and y.
Solution: Since the point $(\bar{x}, \bar{y})$ lies on both regression
lines, we have
$$\bar{x} = 19.13 - 0.87\bar{y}$$
$$\bar{y} = 11.64 - 0.50\bar{x}$$
On solving these equations we get $\bar{x} = 15.93$ and $\bar{y} = 3.67$.
The regression coefficient of y on x is −0.50 and that of x on y is −0.87.
Since the coefficient of correlation is the geometric mean of the
two regression coefficients,
$$r = -\sqrt{(-0.50)(-0.87)} = -\sqrt{0.435} = -0.66$$
[Note: the −ve sign is taken since both the regression coefficients are −ve.]
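
The steps of this example can be sketched in Python as follows. The variable names are assumptions for illustration; the two lines are written as x = c + d·y and y = a + b·x, matching the notation of the derivation above.

```python
import math

# Regression line of x on y:  x = c + d*y
c, d = 19.13, -0.87
# Regression line of y on x:  y = a + b*x
a, b = 11.64, -0.50

# Both lines pass through (x_bar, y_bar). Substituting the second line
# into the first gives x_bar = c + d*(a + b*x_bar), which is solved here.
x_bar = (c + d * a) / (1 - d * b)
y_bar = a + b * x_bar

# r is the geometric mean of b and d, with their common sign.
r = math.copysign(math.sqrt(b * d), b)
print(round(x_bar, 2), round(y_bar, 2), round(r, 2))
```

Substituting the computed means back into both equations is a simple check that the simultaneous solution is right.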

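Example 2: Find the coefficient of correlation and obtain the equations
of the lines of regression for the data.

x    6    2   10   4   8
y    9   11    5   8   7

Estimate the value of y when x = 10.

Solution: Here n = 5,
$$\therefore \bar{x} = \frac{\sum x}{n} = \frac{30}{5} = 6, \qquad \bar{y} = \frac{\sum y}{n} = \frac{40}{5} = 8.$$
With $X = x - \bar{x}$ and $Y = y - \bar{y}$, we get $\sum XY = -26$, $\sum X^2 = 40$ and $\sum Y^2 = 20$.
The coefficient of correlation is given by
$$r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} = \frac{-26}{\sqrt{800}} = -0.919.$$

The regression coefficient of y on x is given by
$$b = \frac{\sum XY}{\sum X^2} = \frac{-26}{40} = -0.65.$$

The equation of the line of regression of y on x is given by

$$y - \bar{y} = b(x - \bar{x}) \implies y - 8 = -0.65(x - 6)$$

$$\therefore y = -0.65x + 11.9$$
When x = 10, the estimated value is y = −0.65(10) + 11.9 = 5.4.
The regression coefficient of x on y is
$$d = \frac{\sum XY}{\sum Y^2} = \frac{-26}{20} = -1.3,$$
and the equation of the line of regression of x on y is

$$x - \bar{x} = d(y - \bar{y}) \implies x - 6 = -1.3(y - 8)$$

$$\therefore x = -1.3y + 16.4$$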
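
The quantities in Example 2 can be recomputed from the raw data with the short Python sketch below. The variable names are assumptions for illustration.

```python
import math

# Data of Example 2
xs = [6, 2, 10, 4, 8]
ys = [9, 11, 5, 8, 7]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
X = [x - x_bar for x in xs]     # deviations of x from its mean
Y = [y - y_bar for y in ys]     # deviations of y from its mean

sum_XY = sum(p * q for p, q in zip(X, Y))
sum_X2 = sum(p * p for p in X)
sum_Y2 = sum(q * q for q in Y)

r = sum_XY / math.sqrt(sum_X2 * sum_Y2)    # coefficient of correlation
b = sum_XY / sum_X2                        # regression coefficient of y on x
d = sum_XY / sum_Y2                        # regression coefficient of x on y

# Line of y on x: y - y_bar = b*(x - x_bar); estimate y at x = 10.
y_at_10 = y_bar + b * (10 - x_bar)
print(round(r, 3), b, d, round(y_at_10, 2))
```

As expected from the derivation, r, b and d all carry the same sign and b·d equals r².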
Rank Correlation:

A group of n individuals may be arranged in order of merit with respect to
some characteristic. The same group would give different orders for different
characteristics. Considering the orders corresponding to two characteristics
A and B, the correlation between these n pairs of ranks is called the rank
correlation in the characteristics A and B for that group of individuals.
Let $x_i$, $y_i$ be the ranks of the $i$th individual in A and B respectively.
Assuming that no two individuals are bracketed equal in either case, each of
the variables takes the values 1, 2, 3, ..., n, and we have
$$\bar{x} = \bar{y} = \frac{1 + 2 + 3 + \cdots + n}{n} = \frac{n(n+1)}{2n} = \frac{n+1}{2}.$$
If $X_i$, $Y_i$ are the deviations of $x_i$, $y_i$ from their means, then
$$\sum X_i^2 = \sum (x_i - \bar{x})^2 = \sum x_i^2 + n\bar{x}^2 - 2\bar{x}\sum x_i = \sum_{k=1}^{n} k^2 + \frac{n(n+1)^2}{4} - 2\cdot\frac{n+1}{2}\sum_{k=1}^{n} k$$
$$= \frac{n(n+1)(2n+1)}{6} + \frac{n(n+1)^2}{4} - \frac{n(n+1)^2}{2} = \frac{1}{12}(n^3 - n).$$
Similarly,
$$\sum Y_i^2 = \frac{1}{12}(n^3 - n).$$
Now let $d_i = x_i - y_i$, so that $d_i = (x_i - \bar{x}) - (y_i - \bar{y}) = X_i - Y_i$ (since $\bar{x} = \bar{y}$).
$$\therefore \sum d_i^2 = \sum X_i^2 + \sum Y_i^2 - 2\sum X_iY_i$$
or
$$\sum X_iY_i = \frac{1}{2}\left(\sum X_i^2 + \sum Y_i^2 - \sum d_i^2\right) = \frac{1}{12}(n^3 - n) - \frac{1}{2}\sum d_i^2.$$
Hence the correlation coefficient between these variables is
$$r = \frac{\sum X_iY_i}{\sqrt{\sum X_i^2 \sum Y_i^2}} = \frac{\frac{1}{12}(n^3 - n) - \frac{1}{2}\sum d_i^2}{\frac{1}{12}(n^3 - n)} = 1 - \frac{6\sum d_i^2}{n^3 - n}.$$

This is called the rank correlation coefficient and is denoted by $\rho$.
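
The derivation says that, for untied ranks, the shortcut formula gives the same number as the ordinary correlation coefficient applied to the ranks. The Python sketch below checks this on a small set of hypothetical ranks; the function names and the data are assumptions for illustration.

```python
import math

def rank_correlation(rx, ry):
    """Rank correlation 1 - 6*sum(d^2)/(n^3 - n) for untied ranks 1..n."""
    n = len(rx)
    d2 = sum((p - q) ** 2 for p, q in zip(rx, ry))
    return 1 - 6 * d2 / (n ** 3 - n)

def product_moment(rx, ry):
    """Ordinary correlation coefficient sum(XY)/sqrt(sum(X^2) * sum(Y^2))."""
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    sxx = sum((p - mx) ** 2 for p in rx)
    syy = sum((q - my) ** 2 for q in ry)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ranks of five individuals in characteristics A and B.
rank_a = [1, 2, 3, 4, 5]
rank_b = [2, 1, 4, 3, 5]
print(rank_correlation(rank_a, rank_b), product_moment(rank_a, rank_b))
```

The agreement holds only when each set of ranks is a permutation of 1, 2, ..., n; tied ranks need the correction described in the next section.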

Rank Correlation for Equal Ranks:


Suppose more than one item has the same rank. Each of the tied items is then
assigned the average of the ranks they would otherwise occupy.
Example: Suppose two items are tied at rank 5, i.e., the 5th and 6th items
have the same value. Then the common rank assigned to the 5th and 6th
items is $\frac{5+6}{2} = 5.5$, which is the mean (average) of 5 and 6. The next rank
assigned will be 7.
If an item is repeated thrice at rank 2, then the common rank assigned to each
value will be $\frac{2+3+4}{3} = 3$, which is the arithmetic mean of 2, 3 and 4. The next
rank to be assigned will be 5. To find the rank correlation coefficient with
repeated ranks, a correction factor $\frac{m(m^2-1)}{12}$ is added to $\sum d^2$, where $m$ is
the number of times an item is repeated. This factor is added for
each repeated value in both the series; $n$ is the number of observations.
∴ The rank correlation coefficient for equal ranks is given by
$$r = 1 - \frac{6\left[\sum d^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3) + \cdots\right]}{n(n^2 - 1)}.$$
Example 1: Calculate the coefficient of rank correlation for the following
data.

x     2    4    5   6   8   11
y    18   12   10   8   7    5

Solution: Here n = 6.

x     y     R1   R2   d = R1 − R2   d^2
2     18    6    1     5            25
4     12    5    2     3             9
5     10    4    3     1             1
6      8    3    4    −1             1
8      7    2    5    −3             9
11     5    1    6    −5            25
                           Σd^2 =   70

The coefficient of rank correlation is given by
$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 70}{6(36 - 1)} = -1.$$
Example 2: Ten participants in a contest are ranked by two judges as
follows:

x    1   6   5   10   3   2   4    9   7   8
y    6   4   9    8   1   2   3   10   5   7

Solution: If $d_i = x_i - y_i$, then $d_i = -5, 2, -4, 2, 2, 0, 1, -1, 2, 1$, so that $\sum d_i^2 = 60$ and
$$\rho = 1 - \frac{6\sum d_i^2}{n^3 - n} = 1 - \frac{6 \times 60}{990} = 0.6364.$$
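Example 3: Obtain the rank correlation coefficient for the following data:

x    68   64   75   50   64   80   75   40   55   64
y    62   58   68   45   81   60   68   48   50   70

Answer:

x     y     R1 (for x)   R2 (for y)   d = R1 − R2   d^2
68    62    4            5            −1             1
64    58    6            7            −1             1
75    68    2.5          3.5          −1             1
50    45    9            10           −1             1
64    81    6            1             5            25
80    60    1            6            −5            25
75    68    2.5          3.5          −1             1
40    48    10           9             1             1
55    50    8            8             0             0
64    70    6            2             4            16
                                           Σd^2 =   72

In the x-series, 75 occurs twice ($m_1 = 2$) and 64 occurs three times ($m_2 = 3$); in the
y-series, 68 occurs twice ($m_3 = 2$). Hence
$$r = 1 - \frac{6\left[\sum d^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3)\right]}{n(n^2 - 1)}$$

$$r = 1 - \frac{6\left[72 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(3^3 - 3) + \frac{1}{12}(2^3 - 2)\right]}{10(100 - 1)}$$

$$\implies r = 0.545$$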
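
The average-rank rule and the correction factor can be combined in a short Python sketch. The function names below are assumptions for illustration; ranks are assigned in descending order of value, as in the table above.

```python
from collections import Counter

def average_ranks(values):
    """Rank values in descending order; tied values share their average rank."""
    ordered = sorted(values, reverse=True)
    positions = {}
    for pos, v in enumerate(ordered, start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """Sum of (m^3 - m)/12 over every value repeated m > 1 times."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def rank_correlation_with_ties(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    d2 = sum((p - q) ** 2 for p, q in zip(rx, ry))
    total = d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * total / (n * (n * n - 1))

# Data of Example 3
xs = [68, 64, 75, 50, 64, 80, 75, 40, 55, 64]
ys = [62, 58, 68, 45, 81, 60, 68, 48, 50, 70]
r = rank_correlation_with_ties(xs, ys)
print(average_ranks(xs), round(r, 3))
```

When no value is repeated, `tie_correction` returns 0 and the function reduces to the untied formula.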
Examples for practice

1. Find the correlation coefficient and regression lines for the data,

x 1 2 3 4 5
y 2 5 3 8 7

(Answer:- r = 0.8062, y = 1.3x + 1.1, x = 0.5y + 0.5)

2. Find the regression line of y on x for the data:

x 1 4 2 3 5
y 3 1 2 5 4

3. The following marks have been obtained by a class of students in statistics.

Paper-I     80   45   55   56   58   60   65   68   70   75   85
Paper-II    81   56   50   48   60   62   64   65   70   74   90

Compute the coefficient of correlation for the above data. Find the
lines of regression.
(Answer: r = 0.918, y − 65.45 = 0.981(x − 65.18), x − 65.18 = 0.859(y − 65.45))

4. The following results were obtained in Applied Mechanics
and Engineering Mathematics in an examination:

                      Applied Mechanics (x)   Engg. Maths (y)
Mean                  47.5                    39.5
Standard deviation    16.8                    10.8

Coefficient of correlation r = 0.95.
Find both the regression equations. Also estimate the value of y for
x = 30.
(Answer: y = 0.611x + 10.5, x = 1.478y − 1.143, y = 28.83)

5. The following results were obtained from records of age (x) and systolic
blood pressure (y) of a group of 10 men:

            x      y
Mean        53     142
Variance    130    165

and $\sum (x - \bar{x})(y - \bar{y}) = 1220$.

Find the appropriate regression equation and use it to estimate the
blood pressure of a man whose age is 45.
(Answer: y = 0.94x + 92.26, blood pressure = 134.56)

6. The regression equations are 7x − 16y + 9 = 0 and 5y − 4x − 3 = 0. Find $\bar{x}$,
$\bar{y}$ and r.
(Answer: $\bar{x} = -\frac{3}{29}$, $\bar{y} = \frac{15}{29}$, $r = \frac{\sqrt{35}}{8} \approx 0.74$)

7. If two regression coefficients are 0.8 and 0.2, what would be the value
of coefficient of correlation?
(Answer:- r=0.4)

8. Two random variables have the least square regression lines with equa-
tions 3x + 2y = 26 and 6x + y = 31. Find mean values and correlation
coefficients between x and y.
(Answer:- x̄ = 4, ȳ = 7, r = −0.5)

9. The table shows the respective heights x and y of a sample of 12 fathers
and their oldest sons.

x (inches)    65   63   67   64   68   62   70   66   68   67   69   71
y (inches)    68   66   68   65   69   66   68   65   71   67   68   70

Find the rank correlation coefficient between x and y.
(Answer: r = 0.7221)
