Stats Main
Nagpur
First Semester B.E.
Statistics
The tools for studying these varying parameters systematically belong mainly
to the field of statistics. There would be no need for statistical methods if
there were no variation. However, not all variation is studied in statistics:
statistics deals only with variation that includes some degree of randomness.
This randomness is what relates statistical methods to probability theory.
Statistics is a subject that helps transform given data into meaningful
information to support decision-making. More specifically, statistics is the
study of the populations from which data are obtained, of the variation in the
measured features of the data, and of methods of data reduction.
The meaning of statistics can be explained in a plural as well as a singular
sense.
In the plural sense, statistics means numerical facts and data presented
as meaningful information for a definite purpose. In the singular sense, on
the other hand, statistics refers to the scientific methods used for analysing,
interpreting and presenting data.
Probability is very close to statistics. The subject matter of probability
theory and statistical techniques are often described as part of constructive
mathematics. The probability theory is developed based on three different
aspects:
1
2. the intuitive background, and
3. the applications.
Statistical techniques, in contrast, are developed to study scientific
activities and are applied to numerical data in all branches of study, such as
business, arts, commerce, social and behavioural science, computer science, and
any branch of science and technology.
Role of Statistics:
Statistics plays a key role in prediction, estimation, identification,
classification, etc. in many areas of arts, business, computer science, science
and technology. The following questions help to classify a problem as a
statistical problem.
Limitations of Statistics:
There are a number of limitations to the study of statistical methods.
Some of the limitations are
• statistical methods use only numerical data, since the algorithms
developed for statistical methods are based on numerical computation,
Curve fitting:
In many branches of applied mathematics, it is required to express given
data, obtained from observations, in the form of a law connecting the two
variables involved. Such a law is known as an empirical law. Several equations
of different types can be obtained to express the given data approximately.
The problem is to find the equation of the curve of 'best fit', which is the
most suitable for predicting unknown values. The process of finding such an
equation of 'best fit' is known as curve fitting.
The best method of fitting a unique curve to given data is the method
of least squares.
The curve of best fit is that for which the sum of the squares of the errors
(s) is minimum. This is known as the principle of least squares.
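Let
\[ y = a + bx \tag{1} \]
be the straight line to be fitted to the given data points
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Let $y_{t_1}$ be the theoretical
ordinate for $x_1$. That is,
\[ PM = y_1, \qquad NM = y_{t_1}, \]
where
\[ y_{t_1} = a + bx_1 \quad\text{and}\quad PN = e_1. \]
\[ \therefore\; PN = PM - NM \;\Longrightarrow\; e_1 = y_1 - y_{t_1} \]
\[ \therefore\; e_1 = y_1 - (a + bx_1) \]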
On squaring, we get
\[ e_1^2 = (y_1 - a - bx_1)^2 \]
\[ s = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum e_i^2 \]
\[ \therefore\; s = \sum_{i=1}^{n} (y_i - a - bx_i)^2 \]
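For $s$ to be minimum,
\[ \frac{\partial s}{\partial a} = \sum_{i=1}^{n} 2(y_i - a - bx_i)(-1) = 0 \]
or
\[ \sum (y - a - bx) = 0 \tag{2} \]
(to generalise, we write $y_i$ as $y$ and $x_i$ as $x$), and
\[ \frac{\partial s}{\partial b} = \sum_{i=1}^{n} 2(y_i - a - bx_i)(-x_i) = 0 \]
or
\[ \sum (xy - ax - bx^2) = 0 \tag{3} \]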
On simplifying, equations (2) and (3) become
\[ \sum y = na + b \sum x \tag{4} \]
\[ \sum xy = a \sum x + b \sum x^2 \tag{5} \]
Equations (4) and (5) are known as the normal equations. On solving equations
(4) and (5), we get the values of the unknowns $a$ and $b$.
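As a minimal sketch of how the normal equations can be solved in practice, the following Python function forms the four sums and solves equations (4) and (5) for a and b. The function name fit_line and the five sample points are illustrative assumptions, not part of the notes.

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by solving the two normal equations.

    sum(y)  = n*a      + b*sum(x)
    sum(xy) = a*sum(x) + b*sum(x^2)
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    # Determinant of the 2x2 system; zero when every x is the same.
    det = n * sum_xx - sum_x * sum_x
    if det == 0:
        raise ValueError("all x values are equal; the line is not unique")

    b = (n * sum_xy - sum_x * sum_y) / det
    a = (sum_y - b * sum_x) / n
    return a, b


# Illustrative sample points (an assumption for this sketch).
xs = [0, 1, 3, 6, 8]
ys = [1, 3, 2, 5, 4]
a, b = fit_line(xs, ys)
print(f"y = {a:.3f} + {b:.3f}x")  # prints: y = 1.646 + 0.376x
```

Solving the 2x2 system in closed form keeps the sketch free of external libraries; for larger problems a linear algebra routine would be the more usual choice.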
Note:
1. Fitting of a straight line using the least squares principle.
To fit $y = a + bx$, put $\sum$ before every term of the equation:
\[ \sum y = \sum a + \sum bx, \quad\text{i.e.,}\quad \sum y = na + b \sum x. \]
Multiply the equation $y = a + bx$ by $x$ and put $\sum$ before each term:
\[ \sum xy = \sum ax + \sum bx^2 \]
\[ \sum xy = a \sum x + b \sum x^2 \]
(a) $y = ax^b$
(b) $y = ab^x$
(c) $y = a e^{bx}$
$\Longrightarrow Y = A + Bx$
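As a sketch of how each of the three forms reduces to a straight line, the substitutions below take logarithms of both sides. The choice of common logarithms for (a) and (b) and natural logarithms for (c) is an assumption made here for convenience; any base works if it is used consistently.

```latex
\begin{align*}
\text{(a)}\quad y = a x^{b}
  &\;\Longrightarrow\; \log y = \log a + b \log x,
  & Y &= \log y,\; X = \log x,\; A = \log a,\; B = b \\
\text{(b)}\quad y = a b^{x}
  &\;\Longrightarrow\; \log y = \log a + x \log b,
  & Y &= \log y,\; A = \log a,\; B = \log b \\
\text{(c)}\quad y = a e^{bx}
  &\;\Longrightarrow\; \ln y = \ln a + b x,
  & Y &= \ln y,\; A = \ln a,\; B = b
\end{align*}
```

In each case A and B are found from the normal equations of the straight line, and a and b are then recovered by inverting the substitution.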
Answer: We have $y = a + bx$.
   x     y    xy   x^2
   0     1     0     0
   1     3     3     1
   3     2     6     9
   6     5    30    36
   8     4    32    64
$\sum x = 18$, $\sum y = 15$, $\sum xy = 71$, $\sum x^2 = 110$
$y = 1.646 + 0.376x$.
   x       y       xy    x^2
  12    6.44    77.28    144
  16    7.5    120       256
  20    6.9    138       400
  22   10.76   236.72    484
  24   10.76   258.24    576
  26   11.76   305.76    676
  30   14.0    420       900
$\sum x = 150$, $\sum y = 68.12$, $\sum xy = 1556$, $\sum x^2 = 3436$
Normal equations are given by,
68.12 = 150a + 7b
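$\Longrightarrow Y = A + BX$
The normal equations are given by
\[ \sum Y = nA + B \sum X \quad\text{and}\quad \sum XY = A \sum X + B \sum X^2 \]
\[ \sum X = \sum \log x = 2.8574, \qquad \sum Y = \sum \log y = 4.3133, \]
\[ \sum XY = \sum (\log x \cdot \log y) = 2.2671, \qquad \sum X^2 = 1.7749 \]
$\Longrightarrow y = 2.9785\, x^{0.5143}$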
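The last step can be checked numerically. The sketch below solves the two normal equations for A and B from the sums quoted above and then converts back with a = 10^A and b = B. The number of data points is not stated in this part of the notes; n = 6 is an assumption, and the function name solve_normal_equations is illustrative.

```python
def solve_normal_equations(n, sum_x, sum_y, sum_xy, sum_xx):
    """Solve sum_y = n*A + B*sum_x and sum_xy = A*sum_x + B*sum_xx."""
    det = n * sum_xx - sum_x ** 2
    if det == 0:
        raise ValueError("normal equations are singular")
    B = (n * sum_xy - sum_x * sum_y) / det
    A = (sum_y - B * sum_x) / n
    return A, B


# Sums of X = log10(x) and Y = log10(y) as quoted in the text.
# n = 6 is an assumption; it is not given alongside the sums.
A, B = solve_normal_equations(6, 2.8574, 4.3133, 2.2671, 1.7749)

a = 10 ** A  # undo A = log10(a)
b = B        # the exponent is B itself
print(f"y = {a:.4f} x^{b:.4f}")
```

With these inputs the result agrees with the fitted curve above to about three decimal places; the small difference comes from the sums being rounded to four decimals.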
Multiple Regression:
A variable $z$ is to be estimated from variables $x$ and $y$ by means of a
regression equation. Consider the equation
\[ z = a + bx + cy. \]
The sum of the squares of the deviations is given by
\[ S = \sum (a + bx + cy - z)^2. \]
For $S$ to be minimum,
\[ \frac{\partial S}{\partial a} = 0, \quad \frac{\partial S}{\partial b} = 0, \quad \frac{\partial S}{\partial c} = 0. \]
Therefore the normal equations are
\[ \sum z = na + b \sum x + c \sum y \]
\[ \sum xz = a \sum x + b \sum x^2 + c \sum xy \]
\[ \sum yz = a \sum y + b \sum xy + c \sum y^2 \]
By solving these three equations we get the values of $a$, $b$ and $c$; putting
these values into the equation $z = a + bx + cy$ gives the equation of best fit.
Example: The table shows the weight (z) to the nearest pound, height (x) to
the nearest inch and age (y) to the nearest year of 12 boys.
Weight (z)   66  71  53  67  55  58  77  57  56  51  76  68
Height (x)   57  59  49  62  51  50  55  48  52  42  61  57
Age (y)       8  10   6  11   8   7  10   9  10   6  12   9
2. Estimate the weight of a boy who is 9 years old and 54 inches tall.
40944 = 643a + 34843b + 5779c
6812 = 106a + 5779b + 976c
On solving these equations simultaneously, we get
a = 1.731, b = 0.932 and c = 1.266.
Therefore the required regression equation of z on x and y is given by
z = 1.731 + 0.932x + 1.266y.
When x = 54 and y = 9, we have
z = 1.731 + 0.932(54) + 1.266(9) = 63.453
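The computation in this example can be reproduced with a short program. The sketch below builds the three normal equations of z = a + bx + cy from the tabulated data and solves them by Gaussian elimination; the helper names solve3 and fit_plane are hypothetical. Because the coefficients in the text are rounded, the program's values may differ from them in the last digit or two.

```python
def solve3(m, v):
    """Solve the 3x3 system m * u = v by Gaussian elimination with pivoting."""
    size = 3
    aug = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(size):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution.
    sol = [0.0] * size
    for r in reversed(range(size)):
        known = sum(aug[r][k] * sol[k] for k in range(r + 1, size))
        sol[r] = (aug[r][size] - known) / aug[r][r]
    return sol


def fit_plane(xs, ys, zs):
    """Fit z = a + b*x + c*y from the three normal equations."""
    n = len(zs)
    sx, sy, sz = sum(xs), sum(ys), sum(zs)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxz = sum(x * z for x, z in zip(xs, zs))
    syz = sum(y * z for y, z in zip(ys, zs))
    matrix = [[n, sx, sy], [sx, sxx, sxy], [sy, sxy, syy]]
    return solve3(matrix, [sz, sxz, syz])


weight = [66, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68]  # z
height = [57, 59, 49, 62, 51, 50, 55, 48, 52, 42, 61, 57]  # x
age = [8, 10, 6, 11, 8, 7, 10, 9, 10, 6, 12, 9]            # y

a, b, c = fit_plane(height, age, weight)
estimate = a + b * 54 + c * 9
print(f"z = {a:.3f} + {b:.3f}x + {c:.3f}y")
print(f"estimated weight at x = 54, y = 9: {estimate:.2f}")
```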
x 1 2 3 4 6 8
y 2.4 3 3.6 4 5 6
2. Find the least squares fit of the form $y = a + bx^2$ to the following data
x -1 0 1 2
y 2 5 3 0
x 2 3 4 5 6
y 144 172.3 207.4 248.8 298.5
4. Use the least squares method to fit a curve of the form $y = ae^{bx}$ to the data,
x 1 2 3 4 5 6
y 7.209 5.265 3.846 2.809 2.052 1.499
6. Employ the method of least squares to fit the parabola $y = a + bx + cx^2$
to the following
(x, y) : (−1, 2); (0, 0); (0, 1); (1, 2)
(Answer:- $y = 0.5 + 1.5x^2$)
x 0 1 2 3 4
y -4 -1 4 11 20
(Answer:- $y = -4 + 2x + x^2$)
x -2 -1 0 1 2
y 15 1 1 3 19
x 1 2 3 4 5 6 7 8 9
y 2 6 7 8 10 11 11 10 9
x 1 2 3 4 5 6 7 8
y 5.43 6.28 8.23 10.32 12.63 14.86 17.27 19.51
(Answer:- $y = 2.3972x + \frac{3.0262}{x}$)
x 3 5 6 8 12 14
y 16 10 7 4 3 2
z 90 72 54 42 30 12
(Answer:- z = 61.4 − 3.64x + 2.53y, z = 40.18)
Correlation:
When changes in one variable are associated with or followed by changes in
the other, the relationship is called correlation, and data connecting two
such variables is called a bivariate population.
If an increase (or decrease) in the values of one variable corresponds to an
increase (or decrease) in the other, the correlation is said to be positive. If
an increase (or decrease) in one corresponds to a decrease (or increase) in
the other, the correlation is said to be negative. If there is no relationship
indicated between the variables, they are said to be independent or
uncorrelated.
Coefficient of Correlation:
The numerical measure of correlation is called the coefficient of correlation
and is defined by
\[ r = \frac{\sum XY}{n \sigma_x \sigma_y} \]
where $X$ = deviation from the mean $\bar{x}$, i.e. $X = x - \bar{x}$,
$Y$ = deviation from the mean $\bar{y}$, i.e. $Y = y - \bar{y}$,
$\sigma_x$ = standard deviation of the x-series, $\sigma_y$ = standard deviation of the y-series,
and $n$ = number of values of the two variables.
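The definition translates directly into a few lines of Python. The following is a minimal sketch: the function name correlation and the sample data are assumptions made for illustration, and the standard deviations use the divisor n, as in the formula above.

```python
import math


def correlation(xs, ys):
    """Coefficient of correlation r = sum(XY) / (n * sigma_x * sigma_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Deviations from the means: X = x - mean_x, Y = y - mean_y.
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]

    sum_xy = sum(X * Y for X, Y in zip(dev_x, dev_y))
    sigma_x = math.sqrt(sum(X * X for X in dev_x) / n)
    sigma_y = math.sqrt(sum(Y * Y for Y in dev_y) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("r is undefined when one series is constant")
    return sum_xy / (n * sigma_x * sigma_y)


# Hypothetical sample data, chosen only to illustrate the formula.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(correlation(x, y), 4))  # prints: 0.7746
```

For this data the deviations give sum(XY) = 6, sum(X^2) = 10 and sum(Y^2) = 6, so r = 6 / sqrt(60), which is about 0.7746.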
Lines of Regression:
It frequently happens that the dots of the scatter diagram tend to cluster
along a well-defined direction, which suggests a linear relationship between
the variables x and y. Such a line of best fit for the given distribution of
dots is called the line of regression. In fact there are two such lines, one
giving the best possible mean values of y for each specified value of x and
the other giving the best possible mean values of x for given values of y.
The first is known as the line of regression of y on x and the second as the
line of regression of x on y.
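The normal equation is
\[ \sum y = na + b \sum x. \]
Dividing by $n$,
\[ \frac{\sum y}{n} = a + b \frac{\sum x}{n}. \]
We know that
\[ \bar{x} = \frac{\sum x}{n} = \text{mean of the x-series}, \qquad \bar{y} = \frac{\sum y}{n} = \text{mean of the y-series} \]
\[ \Longrightarrow \bar{y} = a + b\bar{x} \]
$\Longrightarrow (\bar{x}, \bar{y})$ satisfies the line of regression $y = a + bx$, i.e.,
$(\bar{x}, \bar{y})$ lies on the line of regression.
\[ \Longrightarrow y - \bar{y} = b(x - \bar{x}) \]
\[ \Longrightarrow Y = bX \]
The normal equation is given by
\[ \sum XY = b \sum X^2 \]
\[ \therefore\; b = \frac{\sum XY}{\sum X^2}, \]
where $b$ = coefficient of regression of y on x.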
1. −1 ≤ r ≤ 1.
2. The correlation coefficient and the two regression coefficients have the
same sign, i.e., if r is +ve, then both b and d are +ve, and if r is −ve, then
both b and d are −ve.
3. If one of the regression coefficients is greater than unity, then the other
must be less than unity.
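These properties can be checked numerically. The sketch below computes the regression coefficient b of y on x, the coefficient d of x on y, and r from the same deviations, and confirms that they share a sign. It also relies on the standard identity r^2 = b d, which is assumed here rather than derived in this part of the notes. The function name and sample data are illustrative.

```python
import math


def regression_coefficients(xs, ys):
    """Return (b, d, r): b for y on x, d for x on y, and the correlation r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)

    b = sum_xy / sum_xx  # y - mean_y = b * (x - mean_x)
    d = sum_xy / sum_yy  # x - mean_x = d * (y - mean_y)
    r = sum_xy / math.sqrt(sum_xx * sum_yy)
    return b, d, r


# Hypothetical sample data, chosen only to illustrate the properties.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b, d, r = regression_coefficients(x, y)

print(f"b = {b:.3f}, d = {d:.3f}, r = {r:.4f}")
print("same sign:", (b > 0) == (d > 0) == (r > 0))
print("r^2 equals b*d:", math.isclose(r * r, b * d))
```

Since r^2 = b d cannot exceed 1, both coefficients cannot be greater than unity at the same time, which is the content of property 3.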
Solution:- Here $n = 5$,
\[ \therefore\; \bar{x} = \frac{\sum x}{n} = \frac{30}{5} = 6 \]
\[ \therefore\; \bar{y} = \frac{\sum y}{n} = \frac{40}{5} = 8 \]
Coefficient of correlation is given by
\[ r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} = \frac{-26}{\sqrt{800}} = -0.919 \]
\[ y - \bar{y} = b(x - \bar{x}) \;\Longrightarrow\; y - 8 = -0.65(x - 6) \]
\[ \therefore\; y = -0.65x + 11.9 \]
The regression coefficient of x on y is,
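Similarly,
\[ \sum Y_i^2 = \frac{1}{12}(n^3 - n) \]
Now let $d_i = x_i - y_i$, so that $d_i = (x_i - \bar{x}) - (y_i - \bar{y}) = X_i - Y_i$.
\[ \therefore\; \sum d_i^2 = \sum X_i^2 + \sum Y_i^2 - 2 \sum X_i Y_i \]
or
\[ \sum X_i Y_i = \frac{1}{2}\left( \sum X_i^2 + \sum Y_i^2 - \sum d_i^2 \right) = \frac{1}{12}(n^3 - n) - \frac{1}{2} \sum d_i^2. \]
Hence the correlation coefficient between these variables is
\[ r = \frac{\sum X_i Y_i}{\sqrt{\sum X_i^2 \sum Y_i^2}} = \frac{\frac{1}{12}(n^3 - n) - \frac{1}{2} \sum d_i^2}{\frac{1}{12}(n^3 - n)} = 1 - \frac{6 \sum d_i^2}{n^3 - n} \]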
\[ r = 1 - \frac{6\left[ \sum d^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3) + \cdots \right]}{n(n^2 - 1)}. \]
Example 1: Calculate the coefficient of rank correlation for the following
data
x    2   4   5   6   8  11
y   18  12  10   8   7   5
Solution:- Here $n = 6$
   x    y   R1   R2   d = R1 − R2   d^2
   2   18    6    1        5         25
   4   12    5    2        3          9
   5   10    4    3        1          1
   6    8    3    4       −1          1
   8    7    2    5       −3          9
  11    5    1    6       −5         25
Coefficient of rank correlation is given by
\[ r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 70}{6(36 - 1)} = -1 \]
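The steps of Example 1 can be sketched in Python as follows. The ranking convention (rank 1 for the largest value) is taken from the table above; the function names are illustrative, and the ranking helper assumes there are no repeated values.

```python
def ranks_descending(values):
    """Rank 1 goes to the largest value. Assumes no repeated values."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def rank_correlation(xs, ys):
    """Rank correlation r = 1 - 6*sum(d^2) / (n*(n^2 - 1)), without ties."""
    n = len(xs)
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


x = [2, 4, 5, 6, 8, 11]
y = [18, 12, 10, 8, 7, 5]
print(rank_correlation(x, y))  # prints: -1.0
```

The value −1 reflects that the two rankings are exactly reversed: the largest x is paired with the smallest y, and so on down the table.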
Example:2 Ten participants in a contest are ranked by two judges as
follows:
x 1 6 5 10 3 2 4 9 7 8
y 6 4 9 8 1 2 3 10 5 7
Solution: If $d_i = x_i - y_i$, then $d_i = -5, 2, -4, 2, 2, 0, 1, -1, 2, 1$ and
$\sum d_i^2 = 60$.
\[ r = 1 - \frac{6 \sum d_i^2}{n^3 - n} = 1 - \frac{6 \times 60}{990} = 0.6364. \]
Example:3 Obtain the rank correlation coefficient for the following data:
x 68 64 75 50 64 80 75 40 55 64
y 62 58 68 45 81 60 68 48 50 70
Answer:
   x    y   R1 (for x)   R2 (for y)   d = R1 − R2   d^2
  68   62      4            5             −1          1
  64   58      6            7             −1          1
  75   68      2.5          3.5           −1          1
  50   45      9           10             −1          1
  64   81      6            1              5         25
  80   60      1            6             −5         25
  75   68      2.5          3.5           −1          1
  40   48     10            9              1          1
  55   50      8            8              0          0
  64   70      6            2              4         16
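\[ r = 1 - \frac{6\left[ \sum d^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3) \right]}{n(n^2 - 1)} \]
\[ r = 1 - \frac{6\left[ 72 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(3^3 - 3) + \frac{1}{12}(2^3 - 2) \right]}{10(100 - 1)} \]
$\Longrightarrow r = 0.545$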
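The tied-rank calculation of Example 3 can be sketched in Python as follows. Tied values share the average of the positions they occupy, and each group of m tied values adds (m^3 − m)/12 to the sum of squared differences, as in the formula above. The function names are illustrative assumptions.

```python
from collections import Counter


def average_ranks(values):
    """Rank 1 = largest value; tied values share the mean of their positions."""
    order = sorted(values, reverse=True)
    positions = {}
    for position, value in enumerate(order, start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """Sum of (m^3 - m)/12 over every group of m repeated values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def rank_correlation_with_ties(xs, ys):
    """Rank correlation with the tie correction added to sum(d^2)."""
    n = len(xs)
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
    corrected = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * corrected / (n * (n * n - 1))


x = [68, 64, 75, 50, 64, 80, 75, 40, 55, 64]
y = [62, 58, 68, 45, 81, 60, 68, 48, 50, 70]
print(round(rank_correlation_with_ties(x, y), 3))  # prints: 0.545
```

Here the x-series has one pair (75) and one triple (64) of repeated values and the y-series has one pair (68), which gives the three correction terms 0.5, 2 and 0.5 used in the solution.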
Examples for practice
1. Find the correlation coefficient and regression lines for the data,
x 1 2 3 4 5
y 2 5 3 8 7
x 1 4 2 3 5
y 3 1 2 5 4
Paper-I 80 45 55 56 58 60 65 68 70 75 85
Paper-II 81 56 50 48 60 62 64 65 70 74 90
Compute the coefficient of correlation for the above data. Find the
lines of regression.
(Answer:- $r = 0.918$, $y - 65.45 = 0.981(x - 65.18)$, $x - 65.18 = 0.859(y - 65.45)$)
Find both the regression equations. Also estimate the value of y for
x = 30.
(Answer:- y = 0.611x + 10.5, x = 1.478y − 1.143, y = 28.83)
5. The following results were obtained from records of age (x) and systolic
blood pressure (y) of a group of 10 men:
              x      y
Mean          53    142
Variation    130    165
and $\sum (x - \bar{x})(y - \bar{y}) = 1220$
7. If two regression coefficients are 0.8 and 0.2, what would be the value
of coefficient of correlation?
(Answer:- r=0.4)
8. Two random variables have the least squares regression lines with
equations 3x + 2y = 26 and 6x + y = 31. Find the mean values and the
correlation coefficient between x and y.
(Answer:- x̄ = 4, ȳ = 7, r = −0.5)
x(inches) 65 63 67 64 68 62 70 66 68 67 69 71
y(inches) 68 66 68 65 69 66 68 65 71 67 68 70