Statistical Computing Using R
P # 8. Students' ages in the regular daytime M.B.A. program and the evening program of a management institute are described by the following samples :
(Ages) Regular M.B.A. : 23 29 27 22 24 21 25 26 27 24
(Ages) Evening M.B.A. : 27 34 30 29 28 30 34 35 28 29
If homogeneity of the class is a positive factor in learning, use the coefficient of variation method to suggest which of the two groups will be easier to teach.
              Mean    SD         CV
Regular MBA   24.8    2.485514   10.02%
Evening MBA   30.4    2.875181   9.46%
The Evening MBA group is easier to teach, as its CV is smaller.
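The figures in the table can be reproduced in R:

```r
# Ages from P # 8
regular = c(23, 29, 27, 22, 24, 21, 25, 26, 27, 24)
evening = c(27, 34, 30, 29, 28, 30, 34, 35, 28, 29)

# Coefficient of variation (in %), using the sample s.d. that sd() returns
cv = function(x) 100 * sd(x) / mean(x)

c(mean(regular), sd(regular), cv(regular))  # 24.8, 2.485514, ~10.02
c(mean(evening), sd(evening), cv(evening))  # 30.4, 2.875181, ~9.46
```

With this data the regular-group CV works out to about 10.02%; the evening group, with the smaller CV, is the more homogeneous one.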
Unit – II : Probability Distributions
Bernoulli Distribution
Binomial Distribution
Poisson Distribution
Normal Distribution
Exponential Distribution
Discrete Probability Distributions
Bernoulli Distribution
Binomial Distribution
Poisson Distribution
Bernoulli Distribution
Jacob (James) Bernoulli
Born on 6th Jan. 1655; died on 16th Aug. 1705. A Swiss mathematician. Although he studied theology and philosophy at university level, he was more interested in mathematics and astronomy.
STATISTICAL MODELS
• DISCRETE DISTRIBUTIONS
# BERNOULLI
1. P(x) = p^x q^(1−x) , x = 0 , 1
2. Parameter : p
3. Range : x = 0 , 1
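Since the Bernoulli distribution is the binomial with size = 1, its p.m.f. can be evaluated with dbinom; p = 0.3 below is just an assumed illustrative value:

```r
p = 0.3                          # assumed success probability
dbinom(0:1, size = 1, prob = p)  # P(x) = p^x q^(1-x) for x = 0, 1: gives 0.7 0.3
```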
Hypergeometric probability (sampling without replacement):
P(X = x) = MCx × (N−M)C(n−x) / NCn
where N is the population size, M the number of items of the type being counted, and n the sample size.
Probability
P # 3. N = 6 + 8 = 14 , n = 4 , X : No. of Toyota cars
P(X = x) = 6Cx × 8C(4−x) / 14C4
(i) At least 2 Toyota cars = 2 or 3 or 4
R code :
> x = 2:4
> N = sum(choose(6, x) * choose(8, 4 - x))
> D = choose(14, 4)
> P = N / D
P = 0.5944056
(ii) P(Exactly 3 Maruti) = P(1 Toyota and 3 Maruti) = P(X = 1)
> P = choose(6, 1) * choose(8, 3) / choose(14, 4)
P = 0.3356643
P # 4. Five cards are drawn without replacement from a pack of 52 cards. Find the prob. that (i) no diamond card is drawn (ii) exactly two diamond cards are drawn (iii) at least two diamond cards are drawn.
P # 4. X : No. of diamond cards
N = 52 , M = 13 , n = 5 , N − M = 39
P(X = x) = MCx × (N−M)C(n−x) / NCn
         = 13Cx × 39C(5−x) / 52C5
P # 4.
(i) P(X = 0) = 13C0 × 39C5 / 52C5 = 0.2215336
(ii) P(X = 2) = 13C2 × 39C3 / 52C5 = 0.2742797
(iii) P(X ≥ 2) = P(X=2) + P(X=3) + P(X=4) + P(X=5)
R code :
> x = 2:5
> sum(choose(13, x) * choose(39, 5 - x)) / choose(52, 5)
0.3670468
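R's built-in hypergeometric functions give the same answers; in dhyper(x, m, n, k), m = 13 diamonds, n = 39 non-diamonds and k = 5 cards drawn:

```r
dhyper(0, 13, 39, 5)         # (i)   P(X = 0)  = 0.2215336
dhyper(2, 13, 39, 5)         # (ii)  P(X = 2)  = 0.2742797
sum(dhyper(2:5, 13, 39, 5))  # (iii) P(X >= 2) = 0.3670468
```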
P # 5. Four cards are drawn at random from a well-shuffled pack of 52 cards. Find the prob. that :
(i) two cards are red and the remaining black
(ii) all cards are of different suits
(iii) all are of the same suit
(iv) exactly one is a King
Solution : n = 52C4
(i) m = 26C2 × 26C2 , P = m / n = 0.3901561
(ii) m = 13C1 × 13C1 × 13C1 × 13C1 = 13^4 , P = m / n = 0.1054982
P # 5.
(iii) P(all are of the same suit) = (13C4 + 13C4 + 13C4 + 13C4) / 52C4 = 0.01056423
(iv) P(exactly one is a King) = 4C1 × 48C3 / 52C4 = 0.2555508
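All four parts of P # 5 can be verified with choose() in R:

```r
n = choose(52, 4)                  # total ways to draw 4 cards from 52
choose(26, 2)^2 / n                # (i)   two red, two black   = 0.3901561
13^4 / n                           # (ii)  all suits different  = 0.1054982
4 * choose(13, 4) / n              # (iii) all of the same suit = 0.01056423
choose(4, 1) * choose(48, 3) / n   # (iv)  exactly one King     = 0.2555508
```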
P # 6. Out of 20 persons in a company, five are graduates. If 3 persons are selected at random, what is the prob. that (i) they are all graduates (ii) there is no graduate (iii) at least two of them are graduates?
Solution :
N = 20 , n = 3 (15 non-graduates, 5 graduates)
(i) P(all are graduates) = 5C3 / 20C3 = 0.00877193
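Parts (ii) and (iii) follow the same hypergeometric counting; the two values below are computed here in R rather than quoted from the slide:

```r
n = choose(20, 3)
choose(5, 3) / n                           # (i)   all graduates       = 0.00877193
choose(15, 3) / n                          # (ii)  no graduate        ~= 0.3991228
sum(choose(5, 2:3) * choose(15, 1:0)) / n  # (iii) at least two grads ~= 0.1403509
```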
Exponential
Normal
EXPONENTIAL DISTRIBUTION
Definition : A continuous random variable X is said to follow the exponential distribution with parameter λ > 0 if its p.d.f. is given by
f(x) = λ e^(−λx) , x ≥ 0
     = 0 , otherwise
Mean = 1/λ , F(x) = P(X ≤ x) = 1 − e^(−λx)
Another form : f(x) = (1/λ) e^(−x/λ) , x ≥ 0
             = 0 , otherwise
Mean = λ , F(x) = P(X ≤ x) = 1 − e^(−x/λ)
Exponential Distribution PDF graph
EXPONENTIAL DISTRIBUTION Problems
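No worked problem survives at this point in the extract, so here is a small illustrative one (the rate λ = 0.5 is an assumed value): if the lifetime X of a component is exponential with λ = 0.5 per year, find P(X ≤ 3).

```r
lambda = 0.5                # assumed rate; mean lifetime = 1/lambda = 2 years
pexp(3, rate = lambda)      # P(X <= 3) = 1 - exp(-1.5) ~= 0.7768698
1 - pexp(3, rate = lambda)  # P(X > 3)  = exp(-1.5)     ~= 0.2231302
```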
PROPERTIES Contd..
6 # In the normal distribution, Q1 and Q3 are equidistant from the mean : Q1 = µ − 0.675 σ , Q3 = µ + 0.675 σ
6 # Quartile Deviation (Q.D.) = 0.675 σ and Mean Deviation (M.D.) = √(2/π) σ ≈ 0.8 σ
7 # Skewness = 0 and excess kurtosis = 0 , i.e. β2 = 3
8 # The first four central moments are
µ1 = 0 , µ2 = σ² , µ3 = 0 , µ4 = 3σ⁴
9 # The points of inflection of the curve are x = µ ± σ
Area property
Total area under the curve = 1 = 100%
Area within µ ± σ ≈ 68.27% , within µ ± 2σ ≈ 95.45% , within µ ± 3σ ≈ 99.73%
Standard Normal Variate (S.N.V.) : Z = (X − µ)/σ ~ N(0, 1)
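Standardizing with Z = (X − µ)/σ and using the standard normal c.d.f. gives the same probability as supplying the mean and s.d. to pnorm directly, e.g. for X ~ N(5, 2²):

```r
mu = 5; sigma = 2
pnorm(4, mu, sigma)      # P(X <= 4) directly
pnorm((4 - mu) / sigma)  # same probability via the S.N.V. Z ~ N(0, 1)
```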
IMPORTANCE OF THE NORMAL
DISTRIBUTION
6# The theory of errors of observations in physical measurements is based on the Normal distribution.
Problems of Normal distribution
P #1. X ~ N(µ = 5 , σ = 2)
Find (i) P(X > 4) (ii) P(X ≤ 3) (iii) P(10 < X < 15)
Solution :
(i) > 1 - pnorm(4, 5, 2)
0.6914625
(ii) > pnorm(3, 5, 2)
0.1586553
(iii) > pnorm(15, 5, 2) - pnorm(10, 5, 2)
0.006209379
Manual calculation for (i) :
P(X > 4) = P((X − µ)/σ > (4 − 5)/2)
         = P(Z > −0.5)
         = 0.1915 + 0.5
         = 0.6915
P # 2. Let X ~ N(µ = 100 , σ² = 64)
Find (i) P(X ≤ 110) (ii) P(|X − 95| < 5) (iii) K such that P(X ≥ K) = 0.9 , and K such that P(X < K) = 0.01
Solution : > mu = 100 ; sd = 8
(i) > pnorm(110, mu, sd)
0.8943502
(ii) P(|X − 95| < 5) = P(90 < X < 100)
> pnorm(100, mu, sd) - pnorm(90, mu, sd)
0.3943502
(iii) P(X ≥ K) = 0.9 : > K1 = qnorm(0.1, mu, sd)
K1 = 89.74759 ≈ 90
P(X < K) = 0.01 : > K2 = qnorm(0.01, mu, sd)
K2 = 81.38922 ≈ 81
P # 3. If the heights of 1000 soldiers in a regiment are normally distributed with a mean of 172 cm and s.d. of 5 cm, how many soldiers have heights greater than 180 cm?
Solution : X : heights of soldiers in cm , N = 1000
X ~ N(µ = 172 , σ = 5)
P(X > 180) = 1 − P(X ≤ 180)
> 1 - pnorm(180, 172, 5)
0.05479929
The expected no. of soldiers with height > 180 cm = N × P(X > 180) = 1000 × 0.05479929 = 54.79929 ≈ 55
P # 4. The income distribution of a group of 10000 persons was found to be normal with mean Rs. 7500 per month and s.d. Rs. 500 per month. What percentage of this group had income
(i) exceeding Rs. 6680
(ii) not more than Rs. 7000 ?
Solution : X : monthly income of the group in Rs.
X ~ N(µ = 7500 , σ = 500)
(i) P(X > 6680) = 1 − P(X ≤ 6680)
> 1 - pnorm(6680, 7500, 500)
0.9494974 = 94.94974 %
(ii) P(X ≤ 7000) : > pnorm(7000, 7500, 500)
0.1586553 = 15.86553 %
P # 5. The distribution of monthly incomes of a group of 3000 factory
workers is following normal distribution with the mean equal to
Rs. 10000 and s.d. Rs 2000.
Find (i) the percentage of workers having a monthly income of
more than Rs. 12000
(ii) the number of workers having a monthly income of less than
Rs. 9000
(iii)the highest monthly income among the lowest paid 100 workers
(iv) the least monthly income among the highest paid 100 workers
P # 5. X : monthly income of a factory worker in Rs.
X ~ N(µ = 10000 , σ = 2000)
(iii) Proportion = 100/3000 = 0.03333 = 3.333 %
Given P(X ≤ K) = 0.03333 : > K = qnorm(0.03333, 10000, 2000)
K = 6332.081
(iv) P(X > K) = 0.03333 : > K = qnorm(1 - 0.03333, 10000, 2000)
K = 13667.92
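All four parts of P # 5 in R; parts (i) and (ii) are not worked on the slide, so those two answers are computed here rather than quoted:

```r
N = 3000; mu = 10000; s = 2000
100 * (1 - pnorm(12000, mu, s))  # (i)   % earning above Rs. 12000        ~= 15.87 %
N * pnorm(9000, mu, s)           # (ii)  workers below Rs. 9000           ~= 926
qnorm(100 / N, mu, s)            # (iii) top income of lowest-paid 100    ~= 6332
qnorm(1 - 100 / N, mu, s)        # (iv)  least income of highest-paid 100 ~= 13668
```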
R code for probability distributions
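R's distribution functions follow a uniform naming scheme: a prefix d (density / p.m.f.), p (c.d.f.), q (quantile) or r (random sample) attached to the distribution name (binom, pois, norm, exp, ...). A few illustrative calls:

```r
dbinom(2, size = 10, prob = 0.5)  # P(X = 2) for Binomial(10, 0.5) = 45/1024
ppois(3, lambda = 2)              # P(X <= 3) for Poisson(2)
qnorm(0.975)                      # 97.5% point of N(0, 1) ~= 1.96
rexp(5, rate = 1)                 # five random draws from Exponential(1)
```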
Types of correlation : Positive , Negative , Zero
Positive correlation : If both the variables vary in the same direction, the correlation is said to be positive.
Negative correlation : If the variables vary in opposite directions, the correlation is said to be negative.
Zero correlation : If there is no linear relationship between the variables, the correlation is zero.
Regression
This process of minimizing ∑ ei² is known as Ordinary Least Squares (OLS).
• Assumptions of OLS :
• E(ei) = 0
• Var(ei) = E(ei²) = σ² = constant (homoscedasticity)
• If Var(ei) = E(ei²) is not constant, the errors are heteroscedastic (a violation of this assumption).
Multiple Regression
• In simple regression analysis only one independent variable is included, and we predict the value of the dependent variable through the appropriate regression line, e.g. sales (Y) on advertising expenditure (X).
• The simple linear regression equation is then :
• Y = β0 + β1 X
Multiple Regression
If we take sales as a function of advertising expenditure, then we can predict sales for a given advertising expenditure using the regression line of sales (Y) on advertising expenditure (X). If R² = 0.80, this means that (1 − 0.80) × 100% = 20% of the variation in sales could be due to the influence of other variables (factors) besides advertising expenditure.
Multiple Regression
For instance, per capita income in the concerned trading area could also have an influence on sales. The results of the simple regression model might then be improved by adding per capita income as an explanatory (independent) variable. This extension of the simple regression technique, i.e. the use of two or more independent variables, is known as multiple regression analysis. The multiple regression model is
Y = β0 + β1 X1 + β2 X2
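A minimal sketch of fitting such a two-predictor model by OLS in R; the built-in mtcars data stand in for the sales example, since the sales data are not given:

```r
# mpg plays the role of Y; wt and hp play the roles of X1 and X2
model = lm(mpg ~ wt + hp, data = mtcars)
coef(model)                # estimates of beta0, beta1, beta2
summary(model)$r.squared   # proportion of variation in Y explained
```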
Multiple Regression
Sometimes there is interrelation between many variables, and the value of one variable may be influenced by many others, e.g. the yield of crop per acre (Y) depends upon quality of seed (X1), fertility of soil (X2), fertilizer used (X3), irrigation facility (X4), weather conditions (X5) and so on. The study of the joint effect of a group of variables upon a variable not included in that group is multiple regression.
7 assumptions of OLS
1.The regression model is linear in the coefficients and the error
term
2. The error term has a population mean of zero
3. All independent variables are uncorrelated with the error term
4. Observations of the error term are uncorrelated with each other
5. The error term has a constant variance (no heteroscedasticity)
6. No independent variable is a perfect linear function of other
explanatory variables
7. The error term is normally distributed
Multicollinearity in Regression
Multicollinearity occurs when independent variables in
a regression model are correlated. This correlation is a
problem because independent variables should be
independent. If the degree of correlation between variables is
high enough, it can cause problems when you fit the model
and interpret the results.
Why is Multicollinearity a Potential
Problem?
• A key goal of regression analysis is to isolate the relationship
between each independent variable and the dependent
variable. The interpretation of a regression coefficient is that it
represents the mean change in the dependent variable for each
1 unit change in an independent variable when you hold all of
the other independent variables constant. That last portion is
crucial for our discussion about multicollinearity.
• The idea is that you can change the value of one independent
variable and not the others. However, when independent
variables are correlated, it indicates that changes in one
variable are associated with shifts in another variable.
Multicollinearity in Regression
• The stronger the correlation, the more difficult it is to change
one variable without changing another. It becomes difficult for
the model to estimate the relationship between each
independent variable and the dependent
variable independently because the independent variables tend
to change in unison.
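This "changing in unison" can be inspected directly with cor(); in the built-in mtcars data (also used for the VIF example below), displacement, horsepower and weight are strongly correlated with one another:

```r
round(cor(mtcars[, c("disp", "hp", "wt")]), 2)
# off-diagonal entries well above 0.6, i.e. the predictors move together
```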
How to fix the multicollinearity issue?
1. Variable Selection
• The most straightforward method is to remove some variables that are highly correlated with others and keep the more significant ones in the set.
2. Variable Transformation
• The second method is to transform some of the variables to make them less correlated while still retaining their information.
> IN = c(1389, 1040.7, 627.3, 581.4, 526.3, 316.7, 584.8, 557, 704.4, 956.9, 1080.6, 1332, 1600.9)
> TOURIST = c(7070, 6595, 5727, 5552, 5109, 4920, 4875, 4847, 5057, 5639, 5805, 6216, 6972)
> V = c(24111, 21838, 19611, 19183, 17670, 17647, 18122, 17277, 17845, 18501, 18373, 18992, 20593)
Multiple Regression Problem : finding VIF
> install.packages("car")    # select a CRAN mirror, e.g. India (http)
> library(car)
> data = mtcars[, c("disp", "hp", "wt", "drat")]
> model = lm(mpg ~ disp + hp + wt + drat, data = mtcars)
> VIF = vif(model)
> barplot(VIF, main = "VIF values", horiz = TRUE, col = "steelblue")
> abline(v = 5, lwd = 3, lty = 2)
For checking normality of residuals in regression in R :
> res = c(0.19, 2.12, 0.992, 0.123, -0.47, -1.34, 0.678)
> qqnorm(res)
> qqline(res)
ANOVA in Regression
For simple regression :
Ho : β1 = 0 , H1: β1 ≠ 0
ANOVA table
Source       d.f.    Sum of squares (S.S.)    Mean sum of squares (M.S.S.)    F cal.
Regression   k − 1   S.S.reg. = ∑(ŷ − ȳ)²     M.S.reg. = S.S.reg. / d.f.      M.S.reg. / M.S.res.
Residual     n − 2   S.S.res. = ∑(y − ŷ)²     M.S.res. = S.S.res. / d.f.
Total        n − 1   S.S.total = ∑(y − ȳ)²
(In the worked example : Total d.f. = n − 1 = 17 , S.S.total = 153.876)
Conclusion : Since F cal. (6.133) > F table (5.98), reject Ho and conclude that the Ht. of the son is influencing the Ht. of the father.
5 % points of F distribution
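In R the ANOVA table for a fitted regression is produced by anova(); since the father–son height data are not reproduced in the extract, this sketch uses the built-in mtcars data:

```r
model = lm(mpg ~ wt, data = mtcars)   # simple regression: k = 2
anova(model)   # Regression (wt) and Residuals rows: d.f., S.S., M.S., F value
# Reject Ho : beta1 = 0 when the F value exceeds the 5% table value
```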