Simple Linear Regression Analysis
FALL 2017
SUBMITTED ON
25th October 2017
SUBMITTED TO
Dr. Aera Kim Leboulluec
PROJECT TEAM
JOSES JENISH SMART - 1001420367
PRANESH RAM DEVARAJ - 1001490436
SREE RADESH RAJENDRA BOOPATHY - 1001238423
MANASWINI KUMAR - 1001236676
TABLE OF CONTENTS
S.NO   CONTENT
I      PROJECT PROPOSAL
II     SIMPLE LINEAR REGRESSION MODEL
III    INFERENCES
       A  INFERENCES ON PARAMETERS
IV     MODEL ASSUMPTIONS
V      FINAL DISCUSSION
       REFERENCE
LIST OF TABLES
TABLE NUMBER   NAME OF TABLE
1              OBSERVED DATA INFORMATION
2              ANOVA TABLE THAT SHOWS THE PARAMETER ESTIMATES OF WINE CONSUMPTION (X) AND DEATH RATE (Y)
3              CONFIDENCE BAND LIMITS
4.1            CORRELATION ANALYSIS
4.2            MODIFIED LEVENE TEST
PROJECT GANTT CHART
THE DEVASTATING EFFECT OF ALCOHOL CONSUMPTION LEADING TO CIRRHOSIS OF LIVER
START DATE  END DATE  WEEK  DESCRIPTION                                    DURATION (days)  CONTRIBUTION       PROJECT MEETING  PERCENTAGE COMPLETED
9/16/17     9/20/17   1     Data search, Finalizing & Approval             4                Full Team          9/16/17          100%
9/21/17     9/26/17   1,2   Data Pre Processing & Project Proposal         5                Full Team          9/21/17          100%
9/27/17     10/2/17   2,3   Simple Linear Regression Model                 5                Manaswini, Radesh  9/27/17          100%
10/3/17     10/8/17   3     Inferences on the Parameters                   5                Pranesh, Jenish    10/3/17          100%
10/9/17     10/13/17  4     Inferences on the True Line and Prediction     4                Pranesh, Jenish    10/9/17          100%
10/14/17    10/19/17  5     Model Assumptions                              5                Manaswini, Radesh  10/14/17         100%
10/20/17    10/22/17  6     Final Discussion                               2                Full Team          10/20/17         100%
10/23/17    10/24/17  6     Dr. LeBoulluec - Review & Final Draft Revision 1                Full Team          10/23/17         100%
10/24/17    10/26/17  6     Report - Submission                            1                Full Team          10/24/17         100%
I. PROJECT PROPOSAL
We searched websites, articles, and papers for a data set that would be both interesting and meaningful
to work on.
The raw data first had to be cleaned so that only the variables relevant to our analysis remained. The
cleaned data set contains 46 observations, on which the simple linear regression analysis was carried
out.
2. IS MODELING THIS DATASET MEANINGFUL?
The initial scatter plot shows that Wine Consumption has a direct (positive) relation with the Death
Rate of the population in the collected data. The R² values (given below) confirm the same. Modeling
the dataset with Wine Consumption as the predictor variable and Death Rate as the response variable
is therefore meaningful and can have an impact on society. As discussed above, there are predictor
variables (regressors) other than wine consumption, such as liquor consumption, that influence the
response variable (death rate), but they are less influential than wine consumption. The plots in the
following section show this: the death rate rises with the quantity of wine consumed.
[Figure: scatter plots of Wine Consumption (X3) vs. Death Rate (Y) and Liquor Consumption (X4) vs. Death Rate (Y)]
4. SELECTED PREDICTOR VARIABLE FOR THE PROJECT EXPLANATION
R² values:
X1 vs. Y: 0.5611
X2 vs. Y: 0.6127
X3 vs. Y: 0.7134
X4 vs. Y: 0.4651
The R² values show that X3 vs. Y has the best fit and X4 vs. Y has the worst fit.
The scatter plot supports this: for X3 vs. Y the observations are split about equally above and below
the fitted line, and the pattern appears linear.
Therefore, we proceed with wine consumption (X3) as the predictor variable for the death rate.
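The comparison of R² values above can be sketched in a few lines of Python. This is only an illustration, not the SAS procedure used in the project: it assumes the column order X1, X2, X3, X4, Y and uses just the eleven observations (36 to 46) that are reproduced in the appendix, so its R² values will differ from the full-data values reported above. The function names `r_squared` and `rank_predictors` are hypothetical.

```python
# Observations 36-46 of Table 1 (the only rows reproduced in the appendix).
# Assumed column order: X1, X2, X3 (wine), X4 (liquor), Y (death rate).
ROWS = [
    (57, 53.0, 28, 149, 122.5),
    (87, 51.6, 23, 77, 92.1),
    (50, 31.9, 22, 43, 76.0),
    (85, 56.1, 23, 74, 97.5),
    (27, 31.5, 7, 56, 33.8),
    (84, 50.0, 16, 63, 90.5),
    (37, 32.4, 2, 41, 29.7),
    (33, 36.1, 6, 59, 28.0),
    (44, 35.3, 3, 32, 51.6),
    (63, 39.3, 8, 40, 55.7),
    (58, 43.8, 13, 57, 55.5),
]


def r_squared(x, y):
    """R^2 of the simple linear regression of y on x (squared correlation)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


def rank_predictors(rows, names=("X1", "X2", "X3", "X4")):
    """Return (name, R^2) pairs sorted from best to worst fit."""
    y = [row[-1] for row in rows]
    scores = {
        name: r_squared([row[j] for row in rows], y)
        for j, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, r2 in rank_predictors(ROWS):
        print(f"{name} vs. Y: R^2 = {r2:.4f}")
```

With all 46 observations in `ROWS`, the same ranking step would be expected to reproduce the ordering reported above.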
II. SIMPLE LINEAR REGRESSION MODEL
The simple linear regression (SLR) model is appropriate when the linear relationship between a single
predictor (regressor) variable X and a corresponding response variable Y is to be examined. In our
case, the predictor variable is wine consumption and the response variable is death rate.
Figure 2 Scatter plot between Wine Consumption (X) and Death Rate (Y)
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
The equation above is the standard form of the regression line in the SLR model. For our data, $X_i$
stands for the wine consumption in the i-th trial and $Y_i$ stands for the corresponding death rate in
the same trial. $\beta_0$ is the y-intercept of the regression line, $\beta_1$ is its slope, and
$\varepsilon_i$ is the random error term. The fitted regression equation for this data is:
$\hat{Y} = 30.33467 + 2.86174\,X$
Here $b_0$, the unbiased point estimator of the y-intercept $\beta_0$, equals 30.33467, and $b_1$, the
unbiased point estimator of the slope $\beta_1$, equals 2.86174.
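As a rough illustration of how least-squares estimates such as b0 and b1 are obtained, the sketch below applies the standard computational formulas in plain Python. It is not the SAS run used in the project, and the sample data are only the eleven (wine consumption, death rate) pairs reproduced in the appendix, so the printed estimates will not equal 30.33467 and 2.86174. The function name `fit_slr` is hypothetical.

```python
def fit_slr(x, y):
    """Least-squares intercept b0 and slope b1 for the model Y = b0 + b1*X."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    # b1 = [sum(XY) - sum(X)sum(Y)/n] / [sum(X^2) - (sum(X))^2/n]
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    # The fitted line passes through the point of means (X-bar, Y-bar).
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Wine consumption (X3) and death rate (Y) for observations 36-46 only.
WINE = [28, 23, 22, 23, 7, 16, 2, 6, 3, 8, 13]
DEATH = [122.5, 92.1, 76.0, 97.5, 33.8, 90.5, 29.7, 28.0, 51.6, 55.7, 55.5]

if __name__ == "__main__":
    b0, b1 = fit_slr(WINE, DEATH)
    print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```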
The SAS output for the same data confirms the estimates b0 = 30.33467 and b1 = 2.86174 and is attached below:
Table 2. ANOVA table that shows the parameter estimates of wine consumption (X) and death rate (Y).
The Analysis of Variance results for the data are presented in the table above. The p-values for the
intercept and for the wine consumption slope in the parameter estimates are both < 0.0001. We have
chosen a confidence level of 95%, so the two-sided significance level is α = 0.05. The ANOVA table
also reports the sums of squares, which help us analyze the variation in more detail. They are
explained below:
SSR (or) REGRESSION SUM OF SQUARES:
The regression (model) sum of squares, $SSR = \sum (\hat{Y}_i - \bar{Y})^2$, measures the variation in
the response that is explained by the regression model. In our case, the regression sum of squares is 17650.
SSE (or) ERROR SUM OF SQUARES:
The error sum of squares, $SSE = \sum (Y_i - \hat{Y}_i)^2$, measures the deviations of the observed
response values from the values predicted by the model. It is the variation that the model does not
explain, so it should be as small as possible for the model to fit the data well. This value is often
used when selecting predictor variables. The error sum of squares for our model is 7091.66.
SSTO (or) TOTAL SUM OF SQUARES:
The total sum of squares, $SSTO = \sum (Y_i - \bar{Y})^2$, is the sum of the squared deviations of the
observed response values from their mean. It equals SSR + SSE. The total sum of squares for our model
is 24741.
MSR (or) REGRESSION MEAN SQUARE:
The regression mean square is the ratio of the regression sum of squares to its degrees of freedom. In
SLR the regression degrees of freedom equal 1, so MSR is always equal to SSR. The MSR value for our
data is 17650.
MSE (or) ERROR MEAN SQUARE:
Similarly, the error mean square is the ratio of the error sum of squares to its degrees of freedom.
With n = 46 observations the error degrees of freedom are n - 2 = 44, so our error mean square is
161.17411.
F Value:
The F value indicates how well the regression model fits our data. It is the ratio of MSR to MSE,
here F* = 17650 / 161.17411 ≈ 109.51. This value is compared with the table value of F to test the
hypothesis H0: β1 = 0 vs. H1: β1 ≠ 0. Since the table value for α = 0.05, F(0.95; 1, 44) ≈ 4.06, is
less than the F value obtained from the ANOVA table, we reject the null hypothesis, which confirms
that there is a significant linear relationship between wine consumption and death rate.
R² value:
The R² value is another important measure of how well the regression model fits the data. It is the
ratio of SSR to SSTO, that is, the proportion of the variation in the response that is explained by
the regression model. The higher the value of R², the better the regression model fits. Our R² value
is 0.7134 or 71.34%, which is quite good. Therefore, we may conclude that the predictor variable
selected here (wine consumption) is a good indicator of the response variable (death rate).
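The ANOVA quantities described above are tied together by simple arithmetic, which the following sketch checks. It starts from three values reported in this section (SSR = 17650, MSE = 161.17411, n = 46) and derives the rest; the function name `anova_summary` is hypothetical, and SSR is assumed to be rounded in the report, so the last digits may differ slightly from the SAS output.

```python
def anova_summary(ssr, mse, n):
    """Derive the remaining SLR ANOVA quantities from SSR, MSE and n."""
    df_error = n - 2              # error degrees of freedom in SLR
    sse = mse * df_error          # SSE = MSE * (n - 2)
    ssto = ssr + sse              # SSTO = SSR + SSE
    msr = ssr / 1                 # regression df is 1, so MSR = SSR
    f_star = msr / mse            # F* = MSR / MSE
    r2 = ssr / ssto               # R^2 = SSR / SSTO
    return {"df_error": df_error, "SSE": sse, "SSTO": ssto,
            "MSR": msr, "F*": f_star, "R2": r2}


if __name__ == "__main__":
    for name, value in anova_summary(ssr=17650, mse=161.17411, n=46).items():
        print(f"{name:8s} = {value:.4f}")
```

Running it gives SSE ≈ 7091.66, SSTO ≈ 24741.66, F* ≈ 109.51 and R² ≈ 0.7134, which agrees with the values quoted in this section.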
III. INFERENCES
A) Inferences on the parameters
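Two-Sided Confidence Interval for the Slope:
The confidence interval for the slope was calculated with significance level α = 0.05 and n = 46.
b1 was calculated by using the formula
$b_1 = \dfrac{\sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)/n}{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2/n} = 2.861$
s{b1} was calculated by using the formula
$s\{b_1\} = \sqrt{\dfrac{MSE}{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2/n}} = 0.273$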
The manual calculation agrees with the results obtained from SAS. A two-sided interval is used unless
a one-sided test is specifically called for. The confidence interval is computed using the formula
below:
$b_1 \pm t(1 - \alpha/2;\ n - 2)\, s\{b_1\} = 2.86174 \pm t(0.975;\ 44)\,(0.27347)$
$= 2.86174 \pm 2.0168\,(0.27347)$   [From the table, t(0.975; 44) = 2.0168]
$= 2.86174 \pm 0.5515$
$= (2.3102,\ 3.4132)$
From these results, we are 95% confident that the mean death rate increases by between 2.3102 and
3.4132 for each additional liter of per capita wine consumption.
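The interval arithmetic above can be checked with a short sketch. It takes the estimate, its standard error and the table value exactly as quoted in this section (the t value is an input here, not computed), and it also shows the test statistic t* = b1 / s{b1}, whose square should be close to the F value from the ANOVA table in simple linear regression. The function name `two_sided_interval` is hypothetical.

```python
def two_sided_interval(estimate, std_error, critical_value):
    """Return (lower, upper) for estimate +/- critical_value * std_error."""
    margin = critical_value * std_error
    return estimate - margin, estimate + margin


B1 = 2.86174       # slope estimate from the report
S_B1 = 0.27347     # standard error of the slope from the report
T_TABLE = 2.0168   # table value quoted in the report for t(0.975; 44)

if __name__ == "__main__":
    lower, upper = two_sided_interval(B1, S_B1, T_TABLE)
    t_star = B1 / S_B1
    print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
    print(f"t* = {t_star:.3f}, t*^2 = {t_star ** 2:.2f}")
    # The interval excludes 0 exactly when |t*| exceeds the table value.
    print("Reject H0: beta1 = 0?", abs(t_star) > T_TABLE)
```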
Two-Sided Confidence Interval for the Y-Intercept:
The confidence interval for the y-intercept was calculated with significance level α = 0.05 and n = 46.
s{b0} was calculated using the formula
$s\{b_0\} = \sqrt{MSE\left[\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\right]} = 3.680$
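As a worked sketch of the step that follows from this standard error, the interval for the intercept takes the same form as the interval for the slope. The numbers below are not quoted from the SAS output; they are derived here under the assumption that the same table value t(0.975; 44) = 2.0168 is used, together with b0 = 30.33467 and s{b0} = 3.680.

```latex
b_0 \pm t(1 - \alpha/2;\ n - 2)\, s\{b_0\}
  = 30.33467 \pm 2.0168\,(3.680)
  = 30.33467 \pm 7.422
  = (22.913,\ 37.757)
```

Under these assumptions the interval does not contain zero, which is consistent with the small p-value reported for the intercept.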
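$s\{\hat{Y}_h\}$ was calculated using the formula
$s\{\hat{Y}_h\} = \sqrt{MSE\left[\dfrac{1}{n} + \dfrac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]}$
with
$\sum (X_i - \bar{X})^2 = \dfrac{SSR}{b_1^2} = \dfrac{17650}{(2.861)^2} = 2156.299$
so that
$s\{\hat{Y}_h\} = \sqrt{161.174\left[\dfrac{1}{46} + \dfrac{(10.25 - 11.586)^2}{2156.299}\right]} = 1.907$
Where,
SSR or Regression Sum of Squares = 17650; MSE or Error Mean Square = 161.174; n = 46;
$X_h = 10.25$; $\bar{X} = 11.586$ and $b_1 = 2.861$.
The prediction interval for a new observation at $X_h$ uses the prediction standard error
$s\{pred\} = \sqrt{MSE + s^2\{\hat{Y}_h\}} = \sqrt{161.174 + (1.907)^2} \approx 12.84$:
$\hat{Y}_h \pm t(1 - \alpha/2;\ n - 2)\, s\{pred\} = 59.659 \pm t(0.975;\ 44)\,(12.84)$
$= 59.659 \pm 25.889$
$= (33.77,\ 85.548)$
From these results, we are 95% confident that a new death rate observed at a wine consumption of
10.25 liters (predicted value $\hat{Y}_h = 59.659$) will lie between 33.77 and 85.548.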
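The difference between the standard error of the estimated mean response and the standard error used for predicting a new observation can be sketched as follows. The inputs are the summary values quoted above (MSE, n, X̄, Σ(Xi − X̄)² and the table value 2.0168); the function names are hypothetical, and small differences in the last digit relative to the hand calculation come from rounding.

```python
import math

MSE = 161.174
N = 46
X_BAR = 11.586
SXX = 2156.299      # sum of squared deviations of X, from SSR / b1^2
T_TABLE = 2.0168    # table value quoted in the report for t(0.975; 44)


def se_mean_response(x_h):
    """Standard error of the estimated mean response at x_h."""
    return math.sqrt(MSE * (1 / N + (x_h - X_BAR) ** 2 / SXX))


def se_prediction(x_h):
    """Standard error for predicting one new observation at x_h."""
    # A new observation adds its own error variance (MSE) to the
    # uncertainty of the estimated mean.
    return math.sqrt(MSE + se_mean_response(x_h) ** 2)


if __name__ == "__main__":
    x_h, y_hat = 10.25, 59.659
    s_mean, s_pred = se_mean_response(x_h), se_prediction(x_h)
    half_width = T_TABLE * s_pred
    print(f"s(mean) = {s_mean:.3f}, s(pred) = {s_pred:.3f}")
    print(f"95% prediction interval: "
          f"({y_hat - half_width:.3f}, {y_hat + half_width:.3f})")
```

The prediction standard error is dominated by the MSE term, which is why the prediction interval is much wider than an interval for the mean response at the same wine consumption.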
$\hat{Y}_h \pm \sqrt{2F(1 - \alpha;\ 2,\ n - 2)}\; s\{\hat{Y}_h\} = 59.659 \pm \sqrt{2F(1 - \alpha;\ 2,\ n - 2)}\,(1.907)$
$= 59.659 \pm 2.860\,(1.907)$
$= 59.659 \pm 5.454$
$= (54.205,\ 65.113)$
Table 3. Confidence band limits
S no.  Wine Consumption (X)  Death Rate (Y)  Predicted Death Rate (Yh)  Standard Error s{Yh}  Upper Band   Lower Band
1      2                     29.7            36.056                     3.22                  45.265844    26.846156
2      3                     47.9            38.917                     3.002                 47.5033204   30.3306796
3      4                     52.3            41.778                     2.793                 49.7665386   33.7894614
4      5                     41.2            44.649                     2.597                 52.0769394   37.2210606
5      6                     37.6            47.5                       2.415                 54.407383    40.592617
6      7                     56.6            50.361                     2.259                 56.8221918   43.8998082
7      8                     55.7            53.222                     2.113                 59.2656026   47.1783974
8      9                     62.8            56.083                     2.002                 61.8091204   50.3568796
9      10                    55.4            58.944                     1.921                 64.4384442   53.4495558
10     11                    74.8            61.805                     1.878                 67.1764556   56.4335444
11     12                    77.2            64.666                     1.875                 70.028875    59.303125
12     13                    66.7            67.527                     1.911                 72.9928422   62.0611578
13     14                    80.9            70.388                     1.984                 76.0626368   64.7133632
14     15                    74.3            73.249                     2.091                 79.2296782   67.2683218
15     16                    90.5            76.11                      2.227                 82.4796654   69.7403346
16     17                    98.1            78.971                     2.386                 85.7954372   72.1465628
17     18                    56.7            81.832                     2.565                 89.168413    74.495587
18     19                    83.6            84.693                     2.759                 92.5842918   76.8017082
19     20                    104.2           87.554                     2.965                 96.034493    79.073507
20     21                    58.1            90.415                     3.182                 99.5161564   81.3138436
21     22                    76              93.276                     3.407                 103.020701   83.5312986
22     23                    92.1            96.137                     3.638                 106.542407   85.7315924
23     28                    122.5           110.442                    4.862                 124.348292   96.5357076
24     31                    129.9           119.025                    5.628                 135.122205   102.927794
From this result, we are 95% confident that the mean death rate at a wine consumption of 10.25 liters
lies within the confidence band limits of 54.205 and 65.113.
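The way the band limits vary with wine consumption can be sketched by recomputing the standard error and the limits at a few X values. The sketch below uses the summary statistics quoted in this section and takes the band multiplier 2.860 exactly as it is used in the report rather than deriving it from an F table; the names `se_mean_response` and `band_limits` are hypothetical. Fitted values are computed from the regression equation, so they can differ from the tabulated ones in the third decimal place.

```python
import math

B0, B1 = 30.33467, 2.86174   # fitted regression coefficients
MSE, N = 161.174, 46
X_BAR, SXX = 11.586, 2156.299
W = 2.860                     # band multiplier as used in the report


def se_mean_response(x_h):
    """Standard error of the estimated mean response at x_h."""
    return math.sqrt(MSE * (1 / N + (x_h - X_BAR) ** 2 / SXX))


def band_limits(x_h):
    """Return (lower, upper) confidence band limits at x_h."""
    y_hat = B0 + B1 * x_h
    margin = W * se_mean_response(x_h)
    return y_hat - margin, y_hat + margin


if __name__ == "__main__":
    for x_h in (2, 12, 31):
        lower, upper = band_limits(x_h)
        print(f"X = {x_h:2d}: s = {se_mean_response(x_h):.3f}, "
              f"band = ({lower:.3f}, {upper:.3f})")
```

The standard error is smallest near the mean wine consumption (X̄ = 11.586) and grows toward both ends of the X range, which is why the band is narrowest in the middle and widest at X = 31.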
[Figure: Confidence Band. Death Rate (vertical axis) plotted against Wine Consumption (horizontal axis)]
IV. MODEL ASSUMPTIONS
In the box plot of the residuals above, the median is greater than the mean, which indicates that the
distribution is slightly left skewed and suggests that normality is violated.
In the normal probability plot above, the left tail is long and the right tail is short. This is a
departure from normality, so we conduct a formal normality test.
Normality test:
Hypothesis Statement:
H0: Normality is OK.
vs. H1: Normality is violated.
From the above figure, the correlation coefficient between the ordered residuals and their expected
values under normality is 0.97362.
Cutoff: c(α, n) = c(0.1, 46)
From Table B.6, the value for c(0.1, 46) = 0.979.
Since 0.97362 < 0.979, we reject H0.
Normality is violated.
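The statistic used in this test can be sketched in plain Python. The sketch assumes the usual construction of the correlation test for normality: the ordered residuals are correlated with their expected values under normality, taken here as the standard normal quantiles at (k − 0.375)/(n + 0.25). The scale factor √MSE is omitted because it does not change a correlation. The residuals below are made-up illustrative numbers, not the project's residuals, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def expected_normal_scores(n):
    """Approximate expected standard normal order statistics for sample size n."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def normality_correlation(residuals):
    """Correlation between ordered residuals and expected normal scores."""
    ordered = sorted(residuals)
    return pearson(ordered, expected_normal_scores(len(ordered)))


if __name__ == "__main__":
    # Hypothetical residuals with a long left tail (illustration only).
    residuals = [-31.0, -24.5, -9.0, -4.2, -1.5, 0.8, 2.1, 3.9, 5.5, 6.4, 7.0, 8.3]
    r = normality_correlation(residuals)
    print(f"correlation = {r:.4f}")
    # H0 (normality) is rejected when r falls below the tabulated cutoff c(alpha, n).
```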
F test:
H0: Equal variance
H1: Unequal variance
From the table below, the P value = 0.0059, which is less than α = 0.05.
Therefore, we reject H0 and conclude that the two group variances are unequal, so the Satterthwaite
(unequal variance) t test is used.
T test:
H0: Constant variance
H1: Non-constant variance
From the table below, the P value for the Satterthwaite test is 0.0842, which is greater than α = 0.05.
Therefore, we fail to reject H0. We conclude that the model has constant error variance.
Table 4.2 Modified Levene test
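A minimal sketch of the calculation behind a modified Levene test is given below, assuming the common construction: the residuals are split into two groups by the predictor (here at the median of X, which is an assumption), absolute deviations from each group's median residual are computed, and the two sets of deviations are compared with an unequal-variance (Satterthwaite) t statistic. The data are made up for illustration, the function names are hypothetical, and the p-value step is left to a t table or to statistical software.

```python
import math
from statistics import mean, median, variance


def absolute_deviations(group):
    """Absolute deviations of residuals from their group median."""
    m = median(group)
    return [abs(e - m) for e in group]


def satterthwaite_t(d1, d2):
    """Unequal-variance t statistic and its approximate degrees of freedom."""
    n1, n2 = len(d1), len(d2)
    v1, v2 = variance(d1) / n1, variance(d2) / n2
    t_stat = (mean(d1) - mean(d2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


def modified_levene(x, residuals):
    """Split residuals at the median of x and compare absolute deviations."""
    cut = median(x)
    low = [e for xi, e in zip(x, residuals) if xi <= cut]
    high = [e for xi, e in zip(x, residuals) if xi > cut]
    return satterthwaite_t(absolute_deviations(low), absolute_deviations(high))


if __name__ == "__main__":
    # Hypothetical data whose residual spread grows with x (illustration only).
    x = list(range(1, 13))
    residuals = [1, -2, 3, -1, 2, -3, 6, -8, 10, -5, 7, -9]
    t_stat, df = modified_levene(x, residuals)
    print(f"t = {t_stat:.3f}, df = {df:.2f}")
    # Constant variance is rejected when |t| exceeds t(1 - alpha/2; df).
```

For this made-up funnel-shaped example the statistic is large in magnitude, so constant variance would be rejected, unlike the result reported for the project data.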
Transformation:
The plot of the residuals against wine consumption shows no curvature or funnel shape, and the points
are randomly scattered. Therefore, a transformation is not required.
V. FINAL DISCUSSION
Through this project we, as a team, once again experienced the importance of the famous quote by Karl
Pearson, "Statistics is the grammar of science" [9].
We conducted the simple linear regression (SLR) analysis successfully and found a relationship between
the death rate from cirrhosis in the United States of America (Y), the response variable, and the
consumption of wine per capita in liters (X3), the predictor variable. The plot of the two variables
shows a positive, upward trend, which again supports the relationship between them.
This analysis helped us to identify the strength of the effect that per capita wine consumption has on
the death rate from cirrhosis. It also allowed us to forecast the impact of changes and to understand
how much the death rate changes even with a minor change in per capita wine consumption.
We used SAS, one of the most powerful statistical analysis software packages, to generate the outputs
that supported our statistical inferences about the relationship between Y and X3. From the ANOVA
table we obtained the parameter estimates (b0, b1) and other quantities, and formed the fitted model
$\hat{Y}_i = 30.33467 + 2.86174\,X_i$, which was found to be significant using the F* value obtained.
The confidence intervals, the prediction interval, and the confidence bands for the mean response were
computed by hand and interpreted using the resulting plot.
Residual plots were created to assess whether the observed errors (the residuals) are consistent with
random, unpredictable error [6]. The test for normality indicated that normality was not satisfied. We
then performed the Modified Levene's Test, which indicated that the error variance was constant. The
plot of the residuals against wine consumption showed no curvature or funnel shape, with the points
randomly scattered, so a transformation was not required.
Further, the analysis could be extended by including the remaining independent variables in the model:
the size of the urban population in percent (X1), the number of births to women between 45 and 49 (the
reciprocal of that value, times 100) (X2), and the consumption of hard liquor per capita in liters
(X4). A multiple linear regression analysis could then be carried out, which may yield a better
predictive model and identify further contributing factors.
Additionally, this project helped us learn how to work in a team environment and to understand and
develop good work ethics.
REFERENCE
[1] Helmut Spaeth, Mathematical Algorithms for Linear Regression, Academic Press, 1991, ISBN 0-12-656460-4.
[2] K. Brownlee, Statistical Theory and Methodology, Wiley, 1965, pages 464-465.
[3] http://www.medicinenet.com/cirrhosis/article.htm
[4] http://people.sc.fsu.edu/~jburkardt/datasets/regression/x20.txt
[6] Michael H. Kutner, C. J. Nachtsheim, J. Neter, W. Li, Applied Linear Statistical Models, Fifth Edition.
[7] http://blog.minitab.com/blog/adventures-in-statistics-2/why-you-need-to-check-your-residual-plots-for-regression-analysis
[8] http://www.statisticssolutions.com/what-is-linear-regression/
[9] https://www.brainyquote.com/quotes/keywords/statistics.html
APPENDIX
Table 1. Observed Data Information
Obs  X1  X2    X3  X4   Y
36   57  53    28  149  122.5
37   87  51.6  23  77   92.1
38   50  31.9  22  43   76
39   85  56.1  23  74   97.5
40   27  31.5  7   56   33.8
41   84  50    16  63   90.5
42   37  32.4  2   41   29.7
43   33  36.1  6   59   28
44   44  35.3  3   32   51.6
45   63  39.3  8   40   55.7
46   58  43.8  13  57   55.5