
Deepu Final


MSL ASSIGNMENT 3

Sneha Bhaip Deepali Arawat


111613007 111613002

Abstract:
The number of accidents on the roads has increased. In this assignment, the
speeds of 200 drivers at a certain instant are given. The mean driving speed on
the highway is 80 mph. We need to find whether there is any correlation
between the number of accidents and the speed at which the driver drives.
Accordingly, in the later part of the assignment, regression analysis is done to
test this.

Methodology:
First, a power analysis of the data set is done to find the sample size required
for an appropriate study of the variable. Using the sample size estimation command in
Minitab with the parameter set to Mean (Normal), we get a required sample size of
99. In this procedure we specify the margin of error. A related quantity is the
type II (beta) error, which is the error of failing to reject the null hypothesis
when in reality it is false. If we increase the margin of error, we find that the
required sample size decreases.
The 200 data points we have are more than the required 99. Hence the data set is a
suitable one.
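
The effect of the margin of error on the sample size can be illustrated with a short sketch. It uses the textbook normal-approximation formula n = (z*sigma/E)^2 rather than Minitab's own procedure, so its output need not match the 99 reported above. The margins of error are made-up values (the report does not state the one used), and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= margin (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)


# sigma = 10 mph is taken from the report; the margins are illustrative only.
for margin in (1.0, 2.0, 3.0):
    print(margin, sample_size_for_mean(10.0, margin))
```

A larger margin of error gives a smaller required sample size, which is the behaviour described above.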

Distribution fitting using Chi squared test


Approach:
First, we plot a histogram with the histogram function in Excel to get a rough
idea of the shape of the distribution.

From this shape, the data look roughly normally distributed. Using
suitable filters in Excel, we find that the smallest value in the data is 53.22 and the
largest value is about 107.322.

Now we apply the chi-squared test to check whether the distribution is normal.
We divide the data into 6 bins, each of width (max - min)/6, that is,
(107.322 - 53.22)/6 = 9.017. Hence the bins are:

1. 53.22 – 62.23
2. 62.24 – 71.25
3. 71.26 – 80.27
4. 80.28 – 89.29
5. 89.30 – 98.30
6. 98.31 – 107.32
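
The bin limits can be generated directly from the minimum, the maximum and the number of bins. This is a minimal sketch, assuming equal-width bins as described above; the function name is hypothetical.

```python
def equal_width_edges(lowest, highest, bins):
    """Return the bins + 1 edges of equal-width bins between lowest and highest."""
    width = (highest - lowest) / bins
    return [lowest + i * width for i in range(bins + 1)]


# Minimum and maximum as read from the data in Excel.
edges = equal_width_edges(53.22, 107.322, 6)
print([round(edge, 2) for edge in edges])
```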

We find the expected cumulative frequencies from the normal CDF, using the Z
value of each upper bin limit. The data have a mean of 80 and a standard deviation of 10.
The table below compares the observed and expected cumulative frequencies to check
whether our observation that the data are normal is true.

BINS               Observed (O)   Expected (E)   (O-E)^2/E

Less than 62.23    9              7.68           0.2268
Less than 71.25    44             38.4           0.8166
Less than 80.27    119            108            1.12
Less than 89.29    171            164            0.2978
Less than 98.30    194            193            0.005181
Less than 107.32   200            199.34         0.002185

Total                                            2.47
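
The total in the table can be reproduced with a few lines of code. The sketch below simply repeats the arithmetic on the observed and expected values exactly as tabulated (cumulative frequencies) and compares the result with the 11.07 table value used in this report; it makes no claim beyond that.

```python
# Observed and expected cumulative frequencies, copied from the table.
observed = [9, 44, 119, 171, 194, 200]
expected = [7.68, 38.4, 108, 164, 193, 199.34]

# One (O - E)^2 / E term per row, then the total.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_sq = sum(terms)

CRITICAL_VALUE = 11.07  # table value quoted in the report (5 df, alpha = 0.05)

print(round(chi_sq, 2))
print("fail to reject H0" if chi_sq < CRITICAL_VALUE else "reject H0")
```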

The calculated chi-squared value is 2.47. We compare it with the chi-squared
table for 5 degrees of freedom (6 bins - 1).
The value of chi-squared from the table for 5 degrees of freedom at the 0.05
significance level is 11.07.
Since the calculated chi-squared value is less than the value from the table,
we fail to reject the null hypothesis that the data follow a normal distribution.
Thus, as per the chi-squared test, the data are consistent with a normal
distribution.

Hypothesis Testing:
Hypothesis tests for the data (centre, left-tail and right-tail tests) are as follows:
1. Centre test:
Here H0 (null hypothesis) is that the mean equals the actual mean of 80, and the
standard error is (standard deviation)/sqrt(200) = 10/14.14. Therefore, the value of
the standard error is 1/1.414 = 0.71.
The value of alpha is 0.05, hence the null hypothesis will hold true for
80 – 1.96*0.71 < mean < 80 + 1.96*0.71 (1.96 is the z value for 0.025 in each tail)
78.61 < mean < 81.39.
[Figure: Distribution plot, Normal, Mean=80, StDev=10. Density against X, with tail areas of 0.025 marked at X = 60.40 and X = 99.60.]

2. Left tail test


Here the alpha value is 0.05. The critical Z value for this alpha in a left-tail test is -1.65.
Hence the null hypothesis holds for mean > 80 – 1.65*0.71,
that is, for mean > 78.83.
[Figure: Distribution plot, Normal, Mean=80, StDev=10. Density against X, with a left-tail area of 0.05 marked at X = 63.55.]

3. Right tail test


Here the alpha value is 0.05. The critical Z value for this alpha in a right-tail test is
1.65.
Hence the null hypothesis holds for mean < 80 + 1.65*0.71,
that is, for mean < 81.17.
[Figure: Distribution plot, Normal, Mean=80, StDev=10. Density against X, with a right-tail area of 0.05 marked at X = 96.45.]
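
The three acceptance limits can be checked with a short sketch that uses the standard normal quantiles from Python's standard library. It assumes the figures quoted in this section (mean 80, standard deviation 10, n = 200, alpha = 0.05) and works with unrounded z values and standard error, so the results can differ from the hand calculation in the second decimal place. Variable names are illustrative.

```python
import math
from statistics import NormalDist

mu0, sigma, n, alpha = 80.0, 10.0, 200, 0.05

se = sigma / math.sqrt(n)              # standard error of the mean
std_normal = NormalDist()              # standard normal, mean 0 and sd 1
z_two = std_normal.inv_cdf(1 - alpha / 2)   # two-tailed critical value
z_one = std_normal.inv_cdf(1 - alpha)       # one-tailed critical value

centre_limits = (mu0 - z_two * se, mu0 + z_two * se)
left_limit = mu0 - z_one * se          # H0 holds for sample mean above this
right_limit = mu0 + z_one * se         # H0 holds for sample mean below this

print([round(v, 2) for v in centre_limits])
print(round(left_limit, 2), round(right_limit, 2))
```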

Regression Analysis:
One of the variables is the speed of the vehicle noted at the point of speed
measurement. The other variable is the susceptibility to accident. This
parameter is defined only for the first 40 entries, so the regression uses those
40 observations. Susceptibility is our target variable (y). Hence the linear
regression equation is

Y = m*x + c
where x is the speed.
The scatter plot for the same is shown below; here the susceptibility is given in
percentage and the speed in mph. The tentative equation is

Y = 0.2318*X + 9.5711
R squared measures how well the regression line fits the given
data. It varies from 0 to 1 and gives the degree of fit of the regression line. Its
value is 1 - SSE/SST, where SSE is the sum of squared errors (residuals) and SST is the
total sum of squares. In the second (ANOVA) table, SS is the sum of squares,
while MS is the mean square. The F value is the F
statistic and the P value is the associated probability. The lower the P
value, the stronger the evidence that the regression is significant.
In statistical modeling, regression analysis is used to estimate the relationships between
two or more variables:
The dependent variable (aka criterion variable) is the main factor we are trying to understand
and predict.
Independent variables (aka explanatory variables or predictors) are the factors that might
influence the dependent variable.
Regression analysis helps us understand how the dependent variable changes when one
of the independent variables varies, and allows us to determine mathematically which of those
variables really has an impact.
Technically, a regression analysis model is based on the sum of squares, which is a
mathematical way to measure the dispersion of data points. The goal of a model is to get the
smallest possible sum of squares and draw the line that comes closest to the data.
Statistics distinguishes between simple and multiple linear regression. Simple linear
regression models the relationship between a dependent variable and one independent
variable using a linear function. If we use two or more explanatory variables to predict the
dependent variable, we deal with multiple linear regression. If the dependent
variable is modelled as a non-linear function because the data relationships do not follow
a straight line, we use nonlinear regression instead.
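
The least-squares idea described above can be sketched for the simple (one-predictor) case. The closed-form slope and intercept below are the standard ones that minimise the sum of squared residuals; the function names are hypothetical and the data points are made up purely for illustration, they are not the assignment data.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination, 1 - SSE/SST."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst


# Made-up speeds (mph) and susceptibilities (%), for illustration only.
speeds = [60, 70, 80, 90, 100]
susceptibility = [22, 27, 27, 31, 33]
m, c = fit_line(speeds, susceptibility)
print(m, c, r_squared(speeds, susceptibility, m, c))
```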
For our data,
Independent variable is Speed (x). {most contributing}
Dependent variable is Susceptibility (y).
We now move to regression analysis using Excel. This is what we obtain through Data
Analysis:
1) Summary output
2) ANOVA
3) Residual output

Multiple R: It is the correlation coefficient, which measures the strength of the linear
relationship between two variables. Our Multiple R is 0.82635, which is closer to 1 than
to 0. So, there is a fairly strong linear relationship between our variables.
R Square: It is the coefficient of determination, which is used as an indicator of the
goodness of fit. Our R Square value is 0.68285.
Adjusted R Square: It is the R Square adjusted for the number of independent variables in the
model. We do not rely on it, as we have only one independent variable, so its value, 0.6745,
adds little information here.
Standard Error: It shows the precision of the regression analysis. The smaller the number,
the more certain we can be about our regression equation. Our standard error is 10.01,
which is high but not severe.
Observations: It is simply the number of observations in our model. Our Observations are
40.
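
The summary figures are tied together by standard identities: with one predictor, Multiple R is the square root of R Square, and Adjusted R Square is 1 - (1 - R^2)(n - 1)/(n - k - 1). The sketch below checks the values quoted above against these identities; variable names are illustrative.

```python
import math

r_square = 0.68285   # R Square quoted in the summary output
n = 40               # observations
k = 1                # independent variables

multiple_r = math.sqrt(r_square)
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)

print(round(multiple_r, 5), round(adjusted_r_square, 4))
```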

ANOVA
The second part of the output is the Analysis of Variance (ANOVA), which splits the sum
of squares into individual components that give information about the levels of variability
within our regression model:

df is the number of degrees of freedom associated with each source of variance. df for
regression is k=1, for residual is (n-k-1)=38, and for total is (n-1)=39, where n=40 and k is
the number of independent variables.

SS is the sum of squares. The smaller the Residual SS compared with the Total SS, the better
our model fits the data. SSR = 104.734 is the variation in group means around our overall
mean 80. SST = SSR + SSE = 330.2437 is the accumulated variation of all N observations;
it remains the same whichever model is fitted.
MS is the mean square. MSR = SSR/1 = 104.734 estimates the variance of the group means around
the overall mean; MSE = SSE/38 estimates the variation of the errors around the group means.
F is the F statistic, or F-test for the null hypothesis. It is used to test the overall significance of
the model. Our model gives F = 81.81, which is significant for us.

Significance F is the P-value of F.


The Significance F value gives an idea of how reliable (statistically significant) our results are.
If Significance F is less than 0.05 (5%), the model is acceptable. If it is greater than 0.05, we would
probably have to choose another independent variable. Our Significance F is very low, nearly 0.008.
So we reject our null hypothesis.
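
As a rough cross-check, the F statistic of a regression with k predictors can be recovered from R Square through the standard identity F = (R^2/k) / ((1 - R^2)/(n - k - 1)). The sketch below applies it to the R Square and degrees of freedom quoted in this report; variable names are illustrative, and small differences from the reported 81.81 come from rounding.

```python
r_square = 0.68285   # R Square from the summary output
n = 40               # observations
k = 1                # independent variables

df_regression = k
df_residual = n - k - 1

# F = explained variation per regression df / unexplained variation per residual df
f_statistic = (r_square / df_regression) / ((1 - r_square) / df_residual)

print(df_regression, df_residual, round(f_statistic, 2))
```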

Regression Analysis Output: Coefficients

It enables us to build a linear regression equation in Excel.

Here,
the R^2 value shows the percentage of the variation in the dependent variable that is
explained by the regression, and so how closely the data support our claim that
susceptibility is related to speed.
Therefore, R^2 = 0.68285, or about 68.3 %

The relation between speed (X) and susceptibility (Y) is


Y = 9.57115 + 0.23176*X

Hence, the regression analysis is done.
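
The fitted equation can be used to estimate susceptibility at a given speed. This is a minimal sketch that only evaluates the equation above; the function name is hypothetical, and the estimate is only as reliable as the fit itself.

```python
INTERCEPT = 9.57115   # coefficients from the regression output above
SLOPE = 0.23176


def predicted_susceptibility(speed_mph):
    """Susceptibility (%) predicted by Y = 9.57115 + 0.23176 * X."""
    return INTERCEPT + SLOPE * speed_mph


# Example: the highway mean speed of 80 mph.
print(round(predicted_susceptibility(80), 2))
```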

[Figure: Scatter plot of susceptibility (%) against speed (mph)]

Conclusion:
1. The chi-squared test does not reject the assumption that the speed data
follow a normal distribution.
2. The null hypothesis for the mean is tested with the centre, left-tail, and right-tail
tests.
3. Furthermore, linear regression is applied between the two variables of the data,
and the correlation coefficient R and the regression coefficients are found.
