Simple Linear Regression Analysis

IE 5318-004 APPLIED REGRESSION ANALYSIS
FALL 2017
SIMPLE LINEAR REGRESSION PROJECT
THE DEVASTATING EFFECT OF ALCOHOL CONSUMPTION LEADING TO CIRRHOSIS OF LIVER
A SIMPLE LINEAR REGRESSION ANALYSIS
SUBMITTED ON
25th October 2017
SUBMITTED TO
Dr. Aera Kim Leboulluec
PROJECT TEAM
JOSES JENISH SMART - 1001420367
PRANESH RAM DEVARAJ - 1001490436
SREE RADESH RAJENDRA BOOPATHY - 1001238423
MANASWINI KUMAR - 1001236676
TABLE OF CONTENTS
PROJECT GANTT CHART
I. PROJECT PROPOSAL
1. DESCRIPTION OF THE PROBLEM, VARIABLES AND THE DATA COLLECTION PROCESS
2. IS MODELING THIS DATASET MEANINGFUL?
3. DISCUSSION - SCATTER PLOTS OF THE RESPONSE VARIABLE VS. EACH PREDICTOR VARIABLE
4. SELECTED PREDICTOR VARIABLE FOR THE PROJECT - EXPLANATION
II. SIMPLE LINEAR REGRESSION MODEL
III. INFERENCES
A) INFERENCES ON PARAMETERS
B) INFERENCE ON THE TRUE LINE AND PREDICTION
IV. MODEL ASSUMPTIONS
V. FINAL DISCUSSION
REFERENCE
LIST OF TABLES
TABLE NUMBER  NAME OF TABLE
1  OBSERVED DATA INFORMATION
2  ANOVA TABLE THAT SHOWS THE PARAMETER ESTIMATES OF WINE CONSUMPTION (X) AND DEATH RATE (Y)
3  CONFIDENCE BAND LIMITS
4.1  CORRELATION ANALYSIS
4.2  MODIFIED LEVENE TEST
LIST OF FIGURES
FIGURE NUMBER  NAME OF THE FIGURE
1  SCATTER PLOTS FOR X VS Y
2  SCATTER PLOT BETWEEN WINE CONSUMPTION (X) AND DEATH RATE (Y)
3  CONFIDENCE BAND
4.1  WINE CONSUMPTION VS RESIDUAL
4.2  BOXPLOT
4.3  RESIDUAL VS NORMAL QUANTILES
PROJECT GANTT CHART
THE DEVASTATING EFFECT OF ALCOHOL CONSUMPTION LEADING TO CIRRHOSIS OF LIVER
START DATE | END DATE | WEEK | DESCRIPTION | DURATION (days) | Contribution | Project Meeting | Percentage Completed
9/16/17 | 9/20/17 | 1 | Data search, Finalizing & Approval | 4 | Full Team | 9/16/17 | 100%
9/21/17 | 9/26/17 | 1,2 | Data Pre Processing & Project Proposal | 5 | Full Team | 9/21/17 | 100%
9/27/17 | 10/2/17 | 2,3 | Simple Linear Regression Model | 5 | Manaswini, Radesh | 9/27/17 | 100%
10/3/17 | 10/8/17 | 3 | Inferences on the Parameters | 5 | Pranesh, Jenish | 10/3/17 | 100%
10/9/17 | 10/13/17 | 4 | Inferences on the True Line and Prediction | 4 | Pranesh, Jenish | 10/9/17 | 100%
10/14/17 | 10/19/17 | 5 | Model Assumptions | 5 | Manaswini, Radesh | 10/14/17 | 100%
10/20/17 | 10/22/17 | 6 | Final Discussion | 2 | Full Team | 10/20/17 | 100%
10/23/17 | 10/24/17 | 6 | Dr. LeBoulluec - Review & Final Draft Revision | 1 | Full Team | 10/23/17 | 100%
10/24/17 | 10/26/17 | 6 | Report - Submission | 1 | Full Team | 10/24/17 | 100%
SIMPLE LINEAR REGRESSION TIMELINE
[Timeline chart: the nine tasks listed above plotted across Week 1 to Week 6, keyed by itemized contribution: Full Team; Radesh, Manaswini; Jenish, Pranesh.]
** Each team member has contributed equally towards this project
I. PROJECT PROPOSAL
1. DESCRIPTION OF THE PROBLEM, VARIABLES AND THE DATA COLLECTION PROCESS
The Problem:
Cirrhosis of the liver is a liver disease that causes irreversible scarring of the liver and the loss of liver cells. Medical author Dennis Lee, MD, says that alcohol consumption is one of the main causes of cirrhosis, although there are many other causes [3]. The effects of the disease can be severe, including weakness, loss of appetite and jaundice.
Helmut Spaeth [1] and K. Brownlee [2], disturbed by the increasing trend of cirrhosis of the liver in various places across the USA, recorded population and alcohol consumption data for various states in the United States of America to find the main factor behind the conditions mentioned above.
The Variables:
The observed data contains 46 observations. We, a team of four data analysts pursuing the Applied Regression Analysis course IE 5318 under Dr. LeBoulluec, decided to use the observed data to find out whether there is any relation between the death rate from cirrhosis in the United States of America (Y), which is the response variable, and four predictor variables: the size of the urban population in percentage (X1), the number of births to women between 45 and 49 (actually, the reciprocal of that value, times 100) (X2), the consumption of wine per capita in liters (X3) and the consumption of hard liquor per capita in liters (X4).
Summary of variables
Death Rate (Y)
Population size (X1)
No of Births (X2)
Wine Consumption (X3)
Hard Liquor Consumption (X4)
Data Collection Process:
We searched through various websites, articles and papers to find a data set that would be both interesting and meaningful to work on.
Finally, we settled on one from [4] http://people.sc.fsu.edu/~jburkardt/datasets/regression/x20.txt, which turned out to be more exciting than we expected once we went through the description.
The raw data first had to be cleaned up to keep only the data relevant to our analysis. There was a total of 46 observations, on which the simple linear regression analysis was then carried out.
2. IS MODELING THIS DATASET MEANINGFUL?
We seek a model of the form:

Y = A1 * X1 + A2 * X2 + A3 * X3 + A4 * X4.
From the initial scatter plots, wine consumption was found to have a direct relationship with the death rate in the collected data. The R² values (given below) confirm this. So, modeling the dataset with wine consumption as the predictor variable and death rate as the response variable would be meaningful and could have an impact on society. As discussed above, other predictor variables, such as liquor consumption, also influence the death rate, but they are not as influential as wine consumption. This is shown in the following sections, where it can be seen that the death rate rises with the quantity of wine consumed.
3. DISCUSSION - SCATTER PLOTS OF THE RESPONSE VARIABLE vs. EACH PREDICTOR VARIABLE
Figure 1. Scatter plots for X vs Y
[Four panels: Population size (X1) vs. Death Rate (Y); No of Births (X2) vs. Death Rate (Y); Wine Consumption (X3) vs. Death Rate (Y); Liquor Consumption (X4) vs. Death Rate (Y)]
4. SELECTED PREDICTOR VARIABLE FOR THE PROJECT - EXPLANATION
R² values:
X1 vs. Y: 0.5611
X2 vs. Y: 0.6127
X3 vs. Y: 0.7134
X4 vs. Y: 0.4651
From the R² values we see that X3 vs. Y has the best fit and X4 vs. Y has the worst fit.
Also, in the scatter plots X3 vs. Y appears the most linear, with the observations spread evenly above and below the fitted line.
Therefore, we proceed with wine consumption (X3) as the predictor variable for the death rate.
II. SIMPLE LINEAR REGRESSION MODEL
The simple linear regression (SLR) model is appropriate when the quantitative relationship between a single predictor (regressor) variable X and a response variable Y is to be examined. In our case, the predictor variable is wine consumption and the response variable is death rate.
Figure 2. Scatter plot between Wine Consumption (X) and Death Rate (Y)
APPROPRIATE MODEL FORM:

Figure 2 shows a clear upward linear trend between our predictor variable, wine consumption (X), and our response variable, death rate (Y). Thus, it is reasonable to model the death rate (Y) in the following manner:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
This is the standard form of the regression line in the SLR model. For our data, $X_i$ is the wine consumption for the ith observation and $Y_i$ is the corresponding death rate. $\beta_0$ is the y-intercept of the regression line, $\beta_1$ is its slope, and $\varepsilon_i$ is the random error term. The fitted regression equation for this data is:
$\hat{Y} = 30.33467 + 2.86174X$
Here, $b_0$, the unbiased point estimator of the y-intercept $\beta_0$, equals 30.33467 and $b_1$, the unbiased point estimator of the slope $\beta_1$, equals 2.86174.
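The point estimates can be reproduced from the wine consumption and death rate columns of Table 1 in the Appendix. The following is a minimal Python sketch of the least squares formulas, not the SAS procedure used in this project; the function name `fit_slr` is our own illustrative choice.

```python
# Wine consumption (X3) and death rate (Y) for the 46 observations in Table 1.
wine = [5, 4, 3, 7, 11, 9, 6, 3, 12, 7, 14, 12, 10, 10, 14, 9, 7, 18, 6, 31,
        13, 20, 19, 10, 4, 16, 9, 6, 6, 21, 15, 17, 7, 13, 8, 28, 23, 22, 23,
        7, 16, 2, 6, 3, 8, 13]
death = [41.2, 31.7, 39.4, 57.5, 74.8, 59.8, 54.3, 47.9, 77.2, 56.6, 80.9,
         34.3, 53.1, 55.4, 57.8, 62.8, 67.3, 56.7, 37.6, 129.9, 70.3, 104.2,
         83.6, 66.0, 52.3, 86.9, 66.6, 40.1, 55.7, 58.1, 74.3, 98.1, 40.7,
         66.7, 48.0, 122.5, 92.1, 76.0, 97.5, 33.8, 90.5, 29.7, 28.0, 51.6,
         55.7, 55.5]


def fit_slr(x, y):
    """Return the least squares intercept b0 and slope b1 (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sum of cross products and corrected sum of squares of X.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_slr(wine, death)
print(f"b0 = {b0:.5f}, b1 = {b1:.5f}")
```

The printed estimates should agree with the reported 30.33467 and 2.86174 up to rounding.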
The SAS output for the same data confirms these values and is attached below:
Table 2. ANOVA table that shows the parameter estimates of wine consumption (X) and death rate (Y).
The Analysis of Variance results for the data are presented in the table above. The p-values of the t tests for the parameter estimates are both < 0.0001. We have chosen a confidence level of 95%, so the two-sided significance level is $\alpha = 0.05$. The ANOVA table also reports the sums of squares, which help us analyze the variation further. They are explained in detail below:
SSR (or) REGRESSION SUM OF SQUARES:
The regression (model) sum of squares measures the variation in the response that is explained by the regression model, that is, the sum of squared deviations of the fitted values from the mean response. In our case, the regression sum of squares is 17650.
SSE (or) ERROR SUM OF SQUARES:
The error sum of squares is the sum of squared deviations of the observed responses from their fitted values. It is the variation that the model does not explain, so it should be as small as possible for the model to fit the data well. It is often used when selecting predictor variables. The error sum of squares for our model is 7091.66, which equals the MSE times its degrees of freedom (161.17411 × 44).
SSTO (or) TOTAL SUM OF SQUARES:
The total sum of squares is the sum of squared deviations of the observed responses from their mean, and it equals SSR + SSE. The total sum of squares for our model is 24741.
MSR (or) MEAN SUM OF SQUARE OF REGRESSION:
The mean sum of squares of regression (of the model) is the ratio of the regression sum of squares to its degrees of freedom. In SLR the degrees of freedom for MSR equal 1, so MSR always equals SSR. The MSR value for our data is 17650.
MSE (or) MEAN SUM OF SQUARE ERROR:
Similarly, the mean sum of square error is the ratio of the error sum of squares to its degrees of freedom. With 46 observations the error degrees of freedom are n - 2 = 44, so our mean sum of square error is 161.17411.
F Value:
The F value is an important measure of how well the regression model fits our data. It is the ratio of MSR to MSE. The F value obtained from the ANOVA table is compared with the table value of F to test the hypothesis $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$. Since the table value for $\alpha = 0.05$, which is 4.05, is less than the F value obtained from the ANOVA table, we reject the null hypothesis, confirming that there is a significant linear relationship between wine consumption and death rate.
R² value:
The R² value is another important measure of how well the regression model fits the data. It is the ratio of SSR to SSTO, that is, the proportion of variation in the response that is explained by the regression model. Therefore, the higher the value of R², the better the regression model fits. Our R² value is 0.7134 or 71.34%, which is quite good. Therefore, we may conclude that the predictor variable selected here (wine consumption) is a good indicator of the response variable (death rate).
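The ANOVA quantities described above are linked by simple arithmetic, which can be checked with a short Python sketch. It assumes only the reported values SSR = 17650, MSE = 161.17411 and n = 46; the error sum of squares is recovered as MSE times the error degrees of freedom.

```python
# Values reported in the ANOVA table of this report.
n = 46
ssr = 17650.0        # regression sum of squares
mse = 161.17411      # mean square error

df_regression = 1            # one predictor in SLR
df_error = n - 2             # 44 degrees of freedom

sse = mse * df_error         # error sum of squares recovered from MSE
ssto = ssr + sse             # total sum of squares
msr = ssr / df_regression    # equals SSR in SLR
f_star = msr / mse           # F statistic for H0: beta1 = 0
r_squared = ssr / ssto       # proportion of variation explained

print(f"SSE  = {sse:.2f}")
print(f"SSTO = {ssto:.2f}")
print(f"F*   = {f_star:.2f}")
print(f"R^2  = {r_squared:.4f}")
```

Under these inputs the F statistic comes out far above the tabulated cutoff of 4.05, in line with the rejection of the null hypothesis.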
III. INFERENCES
A) Inferences on the parameters
Two Sided Confidence Interval for the Slope:
The confidence interval for the slope was calculated with significance level $\alpha = 0.05$ and n = 46.

$b_1$ was calculated using the formula
$b_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)/n}{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2/n} = 2.861$

$s\{b_1\}$ was calculated using the formula
$s\{b_1\} = \sqrt{\frac{MSE}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 0.273$

The manual calculation agrees with the results from SAS. A two-sided interval is used unless a one-sided one is specifically called for. The interval is computed with the formula below:
$b_1 \pm t(1 - \alpha/2;\ n - 2)\, s\{b_1\} = 2.86174 \pm t(0.975;\ 44) \times 0.27347$
$= 2.86174 \pm 2.0168 \times 0.27347$ [From the table, t(0.975; 44) = 2.0168]
$= 2.86174 \pm 0.5515$
$= (2.3102, 3.4132)$
From these results, we are 95% confident that the mean death rate increases by between 2.3102 and 3.4132 for each one-liter increase in per capita wine consumption.
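The slope interval can be re-computed from the reported summary values with the Python sketch below. It assumes the identity SSR = b1² · Σ(Xi − X̄)² to recover the corrected sum of squares of X, and it uses the tabulated critical value 2.0168 quoted in the text; a software quantile function may give a slightly different figure.

```python
import math

# Reported summary values.
ssr = 17650.0
mse = 161.174
b1 = 2.86174
t_crit = 2.0168   # tabulated t(0.975; 44) as quoted in the text

# Assumption: Sxx is recovered from SSR = b1^2 * Sxx.
sxx = ssr / b1 ** 2

# Estimated standard error of the slope.
s_b1 = math.sqrt(mse / sxx)

half_width = t_crit * s_b1
ci_slope = (b1 - half_width, b1 + half_width)

print(f"s{{b1}} = {s_b1:.5f}")
print(f"95% CI for the slope: ({ci_slope[0]:.4f}, {ci_slope[1]:.4f})")
```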
Two Sided Confidence Interval for the Y-Intercept:
The confidence interval for the Y-intercept was calculated with significance level $\alpha = 0.05$ and n = 46.
$b_0$ was calculated using the formula
$b_0 = \bar{Y} - b_1 \bar{X} = 63.49 - (2.861)(11.586) = 30.334$
$s\{b_0\}$ was calculated using the formula
$s\{b_0\} = \sqrt{MSE\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\right]} = 3.680$

$b_0 \pm t(1 - \alpha/2;\ n - 2)\, s\{b_0\} = 30.334 \pm t(0.975;\ 44) \times 3.680$
$= 30.334 \pm 2.0168 \times 3.680$ [From the table, t(0.975; 44) = 2.0168]
$= 30.334 \pm 7.422$
$= (22.912, 37.757)$
From these results, we are 95% confident that the Y-intercept of the regression line lies between 22.912 and 37.757.
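A similar Python sketch checks the intercept interval. It assumes Σ(Xi − X̄)² = 2156.299, the value this report obtains from SSR / b1² in the next subsection, together with the tabulated critical value 2.0168; small differences from the hand calculation are due to rounding.

```python
import math

# Reported summary values.
n = 46
mse = 161.174
x_bar = 11.586
sxx = 2156.299    # corrected sum of squares of X, taken as SSR / b1^2
b0 = 30.334
t_crit = 2.0168   # tabulated t(0.975; 44) as quoted in the text

# Estimated standard error of the intercept.
s_b0 = math.sqrt(mse * (1 / n + x_bar ** 2 / sxx))

half_width = t_crit * s_b0
ci_intercept = (b0 - half_width, b0 + half_width)

print(f"s{{b0}} = {s_b0:.3f}")
print(f"95% CI for the intercept: ({ci_intercept[0]:.3f}, {ci_intercept[1]:.3f})")
```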
B) Inferences on the True Line and Prediction

According to Forbes, the average amount of wine consumed in the United States annually is 10.25 liters per capita, so this value is chosen for analysis, i.e., $X_h = 10.25$ liters. This value is substituted into the fitted regression equation to get the predicted value:
$\hat{Y}_h = 30.334 + 2.861 X_h$ [where $X_h = 10.25$]
$\hat{Y}_h = 30.334 + 2.861(10.25)$
$\hat{Y}_h = 59.659$
Two Sided Confidence Interval of the Mean Response:
The confidence interval for the mean response is calculated with significance level $\alpha = 0.05$ and n = 46.
$s\{\hat{Y}_h\}$ was calculated using the formula
$s\{\hat{Y}_h\} = \sqrt{MSE\left[\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]}$
$\sum (X_i - \bar{X})^2 = \frac{SSR}{b_1^2} = \frac{17650}{(2.861)^2} = 2156.299$
$s\{\hat{Y}_h\} = \sqrt{161.174\left[\frac{1}{46} + \frac{(10.25 - 11.586)^2}{2156.299}\right]} = 1.907$
Where,
SSR or Regression Sum of Squares = 17650; MSE or Mean Sum of Square Error = 161.174; n = 46;
$X_h = 10.25$; $\bar{X} = 11.586$ and $b_1 = 2.861$.

$\hat{Y}_h \pm t(1 - \alpha/2;\ n - 2)\, s\{\hat{Y}_h\} = 59.659 \pm t(0.975;\ 44) \times 1.907$
$= 59.659 \pm 2.0168 \times 1.907$ [From the table, t(0.975; 44) = 2.0168]
$= 59.659 \pm 3.846$
$= (55.813, 63.505)$
From these results, we are 95% confident that the mean death rate at a wine consumption of 10.25 liters per capita lies between 55.813 and 63.505.
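The mean response calculation can be traced step by step in Python. This is a sketch using the rounded values quoted in this section (b0 = 30.334, b1 = 2.861, MSE = 161.174, X̄ = 11.586, tabulated t = 2.0168); the variable names are our own.

```python
import math

# Rounded values quoted in this section.
n = 46
b0, b1 = 30.334, 2.861
ssr = 17650.0
mse = 161.174
x_bar = 11.586
x_h = 10.25
t_crit = 2.0168   # tabulated t(0.975; 44) as quoted in the text

# Point estimate of the mean response at X_h.
y_hat_h = b0 + b1 * x_h

# Corrected sum of squares of X recovered from SSR = b1^2 * Sxx.
sxx = ssr / b1 ** 2

# Estimated standard error of the mean response.
s_y_hat = math.sqrt(mse * (1 / n + (x_h - x_bar) ** 2 / sxx))

half_width = t_crit * s_y_hat
ci_mean = (y_hat_h - half_width, y_hat_h + half_width)

print(f"Y_hat_h = {y_hat_h:.3f}, s{{Y_hat_h}} = {s_y_hat:.3f}")
print(f"95% CI for the mean response: ({ci_mean[0]:.3f}, {ci_mean[1]:.3f})")
```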
Prediction Interval for a New Observation:
The prediction interval is calculated using the following formula:

For a 95% prediction interval, $\alpha = 0.05$ and n = 46: $\hat{Y}_h \pm t(1 - \alpha/2;\ n - 2)\, s\{pred\}$
To find $s\{pred\}$:
$s\{pred\} = \sqrt{s^2\{\hat{Y}_h\} + MSE} = \sqrt{1.907^2 + 161.174} = 12.837$
$\hat{Y}_h \pm t(1 - \alpha/2;\ n - 2)\, s\{pred\} = 59.659 \pm t(0.975;\ 44) \times 12.837$
$= 59.659 \pm 2.0168 \times 12.837$ [From the table, t(0.975; 44) = 2.0168]
$= 59.659 \pm 25.889$
$= (33.77, 85.548)$
From these results, we are 95% confident that a new observed death rate at a wine consumption of 10.25 liters (for which the predicted value is $\hat{Y}_h = 59.659$) will lie between 33.77 and 85.548.
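The prediction interval differs from the mean response interval only in its standard error, which adds the MSE to the variance of the fitted value. A short Python sketch under the values quoted above (Ŷh = 59.659, s{Ŷh} = 1.907, MSE = 161.174, tabulated t = 2.0168):

```python
import math

# Values quoted in this section.
y_hat_h = 59.659
s_y_hat = 1.907
mse = 161.174
t_crit = 2.0168   # tabulated t(0.975; 44) as quoted in the text

# A new observation varies around the mean response, so MSE is added
# to the variance of the estimated mean response.
s_pred = math.sqrt(s_y_hat ** 2 + mse)

half_width = t_crit * s_pred
prediction_interval = (y_hat_h - half_width, y_hat_h + half_width)

print(f"s{{pred}} = {s_pred:.3f}")
print(f"95% PI: ({prediction_interval[0]:.2f}, {prediction_interval[1]:.2f})")
```

Because MSE dominates the sum under the square root, the prediction interval is much wider than the confidence interval for the mean response at the same wine consumption.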
Working-Hotelling Confidence Bands for the Regression Line:
The confidence band is calculated using the formula

$\hat{Y}_h \pm W\, s\{\hat{Y}_h\}$, where $W = \sqrt{2F(1 - \alpha;\ 2,\ n - 2)}$
$= 59.659 \pm \sqrt{2 \times 4.0906} \times 1.907$
$= 59.659 \pm 2.860 \times 1.907$
$= 59.659 \pm 5.454$
$= (54.205, 65.113)$
Table 3. Confidence Band Limits
S.no  Wine Consumption (X)  Death rate (Y)  Predicted Death Rate (Yh)  Standard Error s{Yh}  Upper Band  Lower Band
1 2 29.7 36.056 3.22 45.265844 26.846156
2 3 47.9 38.917 3.002 47.5033204 30.3306796
3 4 52.3 41.778 2.793 49.7665386 33.7894614
4 5 41.2 44.649 2.597 52.0769394 37.2210606
5 6 37.6 47.5 2.415 54.407383 40.592617
6 7 56.6 50.361 2.259 56.8221918 43.8998082
7 8 55.7 53.222 2.113 59.2656026 47.1783974
8 9 62.8 56.083 2.002 61.8091204 50.3568796
9 10 55.4 58.944 1.921 64.4384442 53.4495558
10 11 74.8 61.805 1.878 67.1764556 56.4335444
11 12 77.2 64.666 1.875 70.028875 59.303125
12 13 66.7 67.527 1.911 72.9928422 62.0611578
13 14 80.9 70.388 1.984 76.0626368 64.7133632
14 15 74.3 73.249 2.091 79.2296782 67.2683218
15 16 90.5 76.11 2.227 82.4796654 69.7403346
16 17 98.1 78.971 2.386 85.7954372 72.1465628
17 18 56.7 81.832 2.565 89.168413 74.495587
18 19 83.6 84.693 2.759 92.5842918 76.8017082
19 20 104.2 87.554 2.965 96.034493 79.073507
20 21 58.1 90.415 3.182 99.5161564 81.3138436
21 22 76 93.276 3.407 103.020701 83.5312986
22 23 92.1 96.137 3.638 106.542407 85.7315924
23 28 122.5 110.442 4.862 124.348292 96.5357076
24 31 129.9 119.025 5.628 135.122205 102.927794
From this result, we are 95% confident that the mean death rate at a wine consumption of 10.25 liters lies within the confidence band limits of 54.205 and 65.113.
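The band limits at Xh = 10.25 follow from the Working-Hotelling multiplier W. The Python sketch below simply re-traces the arithmetic with the F table value 4.0906 used in the calculation above (taken as given, not re-derived) and compares W with the t multiplier.

```python
import math

# Values quoted in this section.
y_hat_h = 59.659
s_y_hat = 1.907
f_table = 4.0906   # F table value used in the band calculation above
t_crit = 2.0168    # tabulated t multiplier used for the single-point interval

# Working-Hotelling multiplier.
w = math.sqrt(2 * f_table)

half_width = w * s_y_hat
band_at_xh = (y_hat_h - half_width, y_hat_h + half_width)

print(f"W = {w:.3f} (compare with t = {t_crit})")
print(f"Band at X_h = 10.25: ({band_at_xh[0]:.3f}, {band_at_xh[1]:.3f})")
```

Since W is larger than the t multiplier, the band at any single wine consumption is wider than the corresponding confidence interval for the mean response.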
Computation of Confidence Band Limits using Excel

A series of 24 distinct wine consumption values was chosen to construct the confidence band. The table above shows the predicted value, the standard error, and the upper and lower band limits, which were computed using MS Excel and verified by hand calculation. The values from the table are plotted in the graph shown below.
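The same rows can also be generated programmatically instead of in a spreadsheet. The Python sketch below is one possible way to do it, assuming the rounded inputs used in this section (b0 = 30.334, b1 = 2.861, MSE = 161.174, X̄ = 11.586, Σ(Xi − X̄)² = 2156.299 and the F table value 4.0906); the helper name `band_row` is hypothetical.

```python
import math

# Rounded inputs used in this section.
N = 46
B0, B1 = 30.334, 2.861
MSE = 161.174
X_BAR = 11.586
SXX = 2156.299
W = math.sqrt(2 * 4.0906)   # Working-Hotelling multiplier with the F value used above


def band_row(x):
    """Return (predicted, standard error, upper band, lower band) at wine consumption x."""
    predicted = B0 + B1 * x
    std_err = math.sqrt(MSE * (1 / N + (x - X_BAR) ** 2 / SXX))
    return predicted, std_err, predicted + W * std_err, predicted - W * std_err


# A few of the wine consumption values from Table 3.
for x in (2, 10, 12, 31):
    predicted, std_err, upper, lower = band_row(x)
    print(f"X = {x:2d}: Yh = {predicted:.3f}, s = {std_err:.3f}, "
          f"upper = {upper:.3f}, lower = {lower:.3f}")
```

The standard error is smallest near the mean wine consumption and grows toward either end, which gives the band its characteristic curved shape.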
Figure 3. Confidence Band
[Plot of Death Rate against Wine Consumption showing the observed data with its fitted linear trend line, the Upper Band and the Lower Band.]

The orange dots together form the upper band and the grey dots form the lower band; sandwiched between them is the regression line, shown as a dotted black line.
The confidence band (CB) is wider than the confidence interval (CI) because it covers the entire regression line simultaneously rather than the mean response at a single value of X. It can be interpreted as follows: we are 95% confident that the true regression line lies between the lower and the upper limits of the confidence band (CB) at every level of wine consumption.
IV. Model Assumptions:
The following assumptions of the fitted model are checked.

I. A linear model is reasonable
II. The residuals have constant variance
III. The residuals are normally distributed
IV. There are no outliers
Residual Analysis using Plots:

Residual analysis is done to verify the model. A residual is the difference between an observed response and the corresponding fitted value on the regression line. For the residual analysis, the residuals are plotted against the predictor variable.
Figure 4.1 Wine consumption vs Residual

The above graph plots the residuals against wine consumption (X3). The points are scattered randomly and show no curvature, so we conclude that the linear model is OK. The plot also shows no funnel shape, so we conclude that the residuals have constant variance.
Figure 4.2 Boxplot
In the box plot of the residuals, the median is greater than the mean. This suggests that normality is violated and that the distribution is slightly left skewed.
Figure 4.3 Residual vs Normal Quantiles
In the normal probability plot, the left end is long tailed and the right end is short tailed. This indicates a deviation from normality, so we conduct a normality test.
Normality test:
Hypothesis Statement:
H0: Normality is OK.
vs. H1: Normality is violated.
Table 4.1 Correlation analysis
From the above table, the correlation coefficient between the residuals and their expected values under normality is 0.97362.
Cutoff: $c(\alpha, n) = c(0.1, 46)$
From table B.6, the value for c(0.1, 46) = 0.979.
Since 0.97362 < 0.979, we reject H0.
Normality is violated.
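The correlation used in this test can be sketched in Python from the Table 1 data. The sketch assumes the usual textbook approximation for the expected value of the k-th smallest residual under normality, sqrt(MSE) · z[(k − 0.375) / (n + 0.25)]; the SAS procedure behind Table 4.1 may differ in detail, so the result should be close to, but not necessarily identical to, the reported 0.97362.

```python
import math
from statistics import NormalDist

# Wine consumption (X3) and death rate (Y) for the 46 observations in Table 1.
wine = [5, 4, 3, 7, 11, 9, 6, 3, 12, 7, 14, 12, 10, 10, 14, 9, 7, 18, 6, 31,
        13, 20, 19, 10, 4, 16, 9, 6, 6, 21, 15, 17, 7, 13, 8, 28, 23, 22, 23,
        7, 16, 2, 6, 3, 8, 13]
death = [41.2, 31.7, 39.4, 57.5, 74.8, 59.8, 54.3, 47.9, 77.2, 56.6, 80.9,
         34.3, 53.1, 55.4, 57.8, 62.8, 67.3, 56.7, 37.6, 129.9, 70.3, 104.2,
         83.6, 66.0, 52.3, 86.9, 66.6, 40.1, 55.7, 58.1, 74.3, 98.1, 40.7,
         66.7, 48.0, 122.5, 92.1, 76.0, 97.5, 33.8, 90.5, 29.7, 28.0, 51.6,
         55.7, 55.5]

n = len(wine)
x_bar = sum(wine) / n
y_bar = sum(death) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(wine, death))
      / sum((x - x_bar) ** 2 for x in wine))
b0 = y_bar - b1 * x_bar

# Ordered residuals and their expected values under normality.
residuals = sorted(y - (b0 + b1 * x) for x, y in zip(wine, death))
mse = sum(e ** 2 for e in residuals) / (n - 2)
expected = [math.sqrt(mse) * NormalDist().inv_cdf((k - 0.375) / (n + 0.25))
            for k in range(1, n + 1)]


def pearson(a, b):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    a_bar, b_bar = sum(a) / len(a), sum(b) / len(b)
    num = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    den = math.sqrt(sum((ai - a_bar) ** 2 for ai in a)
                    * sum((bi - b_bar) ** 2 for bi in b))
    return num / den


r = pearson(residuals, expected)
cutoff = 0.979   # c(0.1, 46) as quoted in the text
print(f"correlation = {r:.5f}, cutoff = {cutoff}")
print("Reject H0 (normality violated)" if r < cutoff else "Fail to reject H0")
```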
Test for Constant variance:

Modified-Levene Test:
For the Modified-Levene test, the observations are split into two groups. The first group consists of 25 observations and the second group of 21 observations. For this test we perform both an F test and a t test. The F test determines whether the two groups have equal variances, which decides which form of the t test to read, and the t test determines whether the error variance is constant.
F test:
H0: Equal variance
H1: Unequal variance
From the table below, the p-value is 0.0059, which is less than $\alpha = 0.05$.
Therefore, we reject H0 and conclude that the variances are unequal, so the Satterthwaite (unequal variance) form of the t test is used.
T test:
H0: Constant variance
H1: Non-constant variance
From the table below, the p-value for Satterthwaite is 0.0842, which is greater than $\alpha = 0.05$.
Therefore, we fail to reject H0. We conclude that the model has constant variance.
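The mechanics of the test can be illustrated with a small Python sketch. The two residual groups below are hypothetical toy numbers, not the project data, and the helper names are our own; the sketch computes the absolute deviations of each group's residuals from the group median and the unequal-variance (Satterthwaite-type) t statistic on those deviations.

```python
import math
from statistics import mean, median, variance


def abs_dev_from_median(residuals):
    """Absolute deviations of the residuals from their group median."""
    m = median(residuals)
    return [abs(e - m) for e in residuals]


def unequal_variance_t(d1, d2):
    """t statistic for a difference in means without pooling the variances."""
    std_err = math.sqrt(variance(d1) / len(d1) + variance(d2) / len(d2))
    return (mean(d1) - mean(d2)) / std_err


# Hypothetical residuals for the low-X and high-X groups (illustration only).
group1 = [-1.0, 0.0, 1.0, 2.0, -2.0]
group2 = [-4.0, 0.0, 4.0, -2.0, 2.0]

d1 = abs_dev_from_median(group1)
d2 = abs_dev_from_median(group2)
t_stat = unequal_variance_t(d1, d2)

print(f"mean |dev| group 1 = {mean(d1):.2f}, group 2 = {mean(d2):.2f}")
print(f"t statistic = {t_stat:.4f}")
```

A t statistic that is large in absolute value would indicate that the spread of the residuals differs between the two groups, that is, non-constant variance.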
Table 4.2 Modified Levene test
Overall discussion on the Model assumption:

From the normality test, we conclude that normality is violated.
From the Modified-Levene test, we conclude that the variance is constant.
Transformation:
The plot of the residuals against wine consumption shows no curvature or funnel shape, and the points are randomly scattered. Therefore, a transformation is not required.
V. FINAL DISCUSSION:
Through this project we as a team could once again appreciate the importance of the famous quote by Karl Pearson, "Statistics is the grammar of science" [9].
We conducted the simple linear regression (SLR) analysis successfully and found a relationship between the death rate from cirrhosis in the United States of America (Y), the response variable, and the consumption of wine per capita in liters (X3), the predictor variable. A positive upward trend was observed in the plot of these two variables, which again supports the relationship.
This analysis helped us identify the strength of the effect that per capita wine consumption, the predictor variable, has on the death rate from cirrhosis in the United States of America, the response variable. It also made it possible to forecast the impact of changes, and helped us understand how much the death rate from cirrhosis changes for even a minor change in per capita wine consumption.
We used SAS, one of the most powerful statistical analysis software packages, to generate the outputs from which we drew statistical inferences about the relationship between the death rate (Y) and per capita wine consumption (X3). From the SAS output we obtained the parameter estimates (b0, b1) and the other quantities in the ANOVA table, from which we formed the fitted model $\hat{Y}_i = 30.33467 + 2.86174 X_i$, which was found to be significant using the F* value. The confidence intervals, the prediction interval and the confidence bands for the mean response were computed by hand and interpreted with the help of the confidence band plot.
Residual plots were created to assess whether the observed errors, that is the residuals, are consistent with random, unpredictable error [6]. The test for normality indicated that normality was not satisfied. We then performed the Modified Levene test, which indicated that the variance is constant. The plot of the residuals against wine consumption showed no curvature or funnel shape and the points were randomly scattered, so a transformation was not required.
Further, the analysis could be extended by including the remaining independent variables in the model, namely the size of the urban population in percentage (X1), the number of births to women between 45 and 49 (the reciprocal of that value, times 100) (X2), and the consumption of hard liquor per capita in liters (X4). A multiple linear regression analysis could then be carried out, which could yield a better predictive model and a more meaningful account of the factors contributing to the death rate.
Additionally, this project helped us learn how to work in a team environment and to understand and develop good work ethics.
REFERENCE
[1] Helmut Spaeth, Mathematical Algorithms for Linear Regression, Academic Press, 1991, ISBN 0-12-656460-4.
[2] K. Brownlee, Statistical Theory and Methodology, Wiley, 1965, pages 464-465.
[3] http://www.medicinenet.com/cirrhosis/article.htm
[4] http://people.sc.fsu.edu/~jburkardt/datasets/regression/x20.txt
[5] Lecture Notes, Dr. Aera Kim Leboulluec
[6] Michael H. Kutner, C. J. Nachtsheim, J. Neter, W. Li, Applied Linear Statistical Models, Fifth Edition.
[7] http://blog.minitab.com/blog/adventures-in-statistics-2/why-you-need-to-check-your-residual-plots-for-regression-analysis
[8] http://www.statisticssolutions.com/what-is-linear-regression/
[9] https://www.brainyquote.com/quotes/keywords/statistics.html
APPENDIX
Table 1. Observed Data Information
S.No  Size of Urban Population  No of births  Wine consumption  Hard liquor consumption  Death rate
1 44 33.2 5 30 41.2
2 43 33.8 4 41 31.7
3 48 40.6 3 38 39.4
4 52 39.2 7 48 57.5
5 71 45.5 11 53 74.8
6 44 37.5 9 65 59.8
7 57 44.2 6 73 54.3
8 34 31.9 3 32 47.9
9 70 45.6 12 56 77.2
10 54 45.9 7 57 56.6
11 70 43.7 14 43 80.9
12 65 32.1 12 33 34.3
13 36 36.9 10 48 53.1
14 47 38.9 10 69 55.4
15 63 47.6 14 54 57.8
16 35 33 9 47 62.8
17 50 38.9 7 68 67.3
18 55 35.7 18 47 56.7
19 33 31.2 6 27 37.6
20 81 53.8 31 79 129.9
21 63 42.5 13 59 70.3
22 78 53.3 20 97 104.2
23 63 47 19 95 83.6
24 65 44.9 10 81 66
25 45 35.6 4 26 52.3
26 78 50.5 16 76 86.9
27 60 42.3 9 37 66.6
28 52 43.8 6 46 40.1
29 37 33.2 6 40 55.7
30 55 36 21 76 58.1
31 69 47.6 15 70 74.3
32 84 50 17 66 98.1
33 54 43.8 7 63 40.7
34 61 45 13 59 66.7
35 47 42.2 8 55 48
36 57 53 28 149 122.5
37 87 51.6 23 77 92.1
38 50 31.9 22 43 76
39 85 56.1 23 74 97.5
40 27 31.5 7 56 33.8
41 84 50 16 63 90.5
42 37 32.4 2 41 29.7
43 33 36.1 6 59 28
44 44 35.3 3 32 51.6
45 63 39.3 8 40 55.7
46 58 43.8 13 57 55.5