R Programming Codes Linear Regression
1. Preliminaries
Consider having a core set of preliminary commands that you always execute. These may vary depending
on your preferences. The following are mine.
Tip #1 – Always do your package installation at the console, NEVER within an R Markdown file.
Tip #2 – To execute any of the installations below, simply delete the leading “#”.
# install.packages("ggplot2")
# install.packages("mosaic")
# install.packages("gridExtra")
# install.packages("car")
Source:
Chatterjee S, Handcock MS, and Simonoff JS. A Casebook for a First Course in Statistics and Data Analysis.
New York: John Wiley, 1995, pp 145-152.
Setting:
Calls to the New York Auto Club are possibly related to the weather, with more calls occurring during bad
weather. This example illustrates descriptive analyses and simple linear regression to explore this hypothesis in a
data set containing information on calendar day, weather, and numbers of calls.
R Data Set:
ers.Rdata
In this illustration, the data set ers.Rdata is accessed from the PubHlth 640 website directly.
It is then saved to your current working directory.
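The download step itself is not shown in this handout. Here is a minimal sketch of one way to do it; the URL below is a placeholder (the handout does not give the actual address), and the .Rdata file is assumed to contain the data frame ersdata used in the rest of this illustration.
# Download ers.Rdata from the course website, then load it into the workspace.
# NOTE: the URL is a placeholder -- substitute the actual PubHlth 640 address.
download.file("http://EXAMPLE-PUBHLTH640-SITE/ers.Rdata", destfile="ers.Rdata", mode="wb")
load("ers.Rdata")   # assumed to create the data frame ersdata
str(ersdata)        # confirm n=28 observations and the variables calls and low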
We see that this data set has n=28 observations on several variables. For this illustration of simple
linear regression, we will consider just two variables: calls and low.
2. Preliminaries – Descriptives
# summary(DATAFRAME$VARIABLE)
summary(ersdata$low)
summary(ersdata$calls)
summary(ersdata)
Scatterplots
library(ggplot2)
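The code that produced the scatterplot is not shown in the handout. A minimal sketch of one way to draw it, with a lowess-style smooth overlaid; geom_smooth(method="loess") is ggplot2's local-regression smoother, and the titles and colours here are my choices, not the original's.
# Scatterplot of Y=calls versus X=low, with a loess (lowess-style) smooth overlaid
gg <- ggplot(ersdata, aes(x=low, y=calls))
gg <- gg + geom_point()
gg <- gg + geom_smooth(method="loess", se=FALSE, colour="red")
gg <- gg + ggtitle("Calls to NY Auto Club vs Low Temperature")
gg <- gg + xlab("Low Temperature") + ylab("Calls")
gg + theme_bw()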
The scatterplot above suggests, as we might expect, that lower temperatures are
associated with more calls to the NY Auto Club. We also see that the data are a bit messy.
Unfamiliar with LOWESS regression? LOWESS stands for “locally weighted scatterplot
smoothing”. It is a technique for drawing a smooth line through the scatterplot to get a sense of
the functional form that relates X to Y, which is not necessarily linear. The method works as
follows: at each observation (x,y), a line is fit locally using some “adjacent” points, and the
fitted values are joined into a smooth curve. It’s handy for seeing where in the data linearity
holds and where it no longer holds. Handy!
The lowess smoothed fit suggests that perhaps the relationship stops being linear as the
temperature increases above 20-25 degrees.
3. Assess Normality of Y
Recall. In normal theory regression, we assume that the outcome variable (in this case, Y=calls) can reasonably
be assumed to be normally distributed (more on violations of this later…). So a preliminary step is often to check this
assumption before doing any model fits. If gross violations are apparent then, possibly, Y will be replaced by
some transformation of Y that is better behaved.
Recall. It’s okay for the predictor X (in this case X=low) to be NOT normally distributed. In fact, it is regarded as
fixed (not random at all!)
Here is a lengthy bit of R code for you, so that you can pick and choose between the basic and the fancier options!
shapiro.test(ersdata$calls)
##
## Shapiro-Wilk normality test
##
## data: ersdata$calls
## W = 0.82902, p-value = 0.0003628
library(mosaic)
histogram(ersdata$calls, width=1000, main="Distribution of Calls w Overlay Normal", xlab="Calls", fit="normal")
The null hypothesis of normality of Y=calls is rejected (p-value = .00036). Tip- sometimes the cure
is worse than the original violation. For now, we’ll charge on.
library(ggplot2)
Histogram w Overlay Normal - w Aesthetics
# Tip- Might want to tweak binwidth=1000
# ggplot(DATAFRAME, aes(x=VARIABLENAME)) + stuff below
gg <- ggplot(ersdata, aes(x=calls))
gg <- gg + geom_histogram(binwidth=1000, colour="blue", aes(y=..density..))
gg <- gg + stat_function(fun=dnorm, color="red",
                         args=list(mean=mean(ersdata$calls), sd=sd(ersdata$calls)))
gg <- gg + ggtitle("Distribution of Calls w Overlay Normal")
gg <- gg + xlab("Calls") + ylab("Density")
plot_histogramcalls <- gg + theme_bw()
plot_histogramcalls
A bit fancier. The conclusion is the same. The null hypothesis of normality of Y=calls is rejected
(p-value = .00036). But for now, we’ll charge on.
4. Fit Model
Simple Linear Regression- Fit, Coefficients Table, ANOVA Table and R-squared
library(mosaic)
# FIT
# MODELNAME <- lm(YVARIABLE ~ XVARIABLE, data=DATAFRAME)
model_simple <- lm(calls ~ low, data=ersdata)
summary(model_simple)
##
## Call:
## lm(formula = calls ~ low, data = ersdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3112 -1468 -214 1144 3588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7475.85 704.63 10.610 0.000000000061
## low -145.15 27.79 -5.223 0.000018649091
##
## Residual standard error: 1917 on 26 degrees of freedom
## Multiple R-squared: 0.5121, Adjusted R-squared: 0.4933
## F-statistic: 27.28 on 1 and 26 DF, p-value: 0.00001865
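The subsection title also promises an ANOVA table, but the call for it does not appear in the handout. One way to obtain it (output not shown here) is the standard anova() function:
# ANOVA table for the simple linear regression fit
anova(model_simple)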
confint(model_simple)
##                  2.5 %     97.5 %
## (Intercept)  6027.4605 8924.23745
## low          -202.2744  -88.03352
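The note below refers to a separate computation of R-squared that is missing from the handout. It was likely mosaic's rsquared() helper, though that is an assumption on my part. A sketch:
# R-squared for the fitted model (mosaic helper);
# matches Multiple R-squared = 0.5121 in the summary output above
rsquared(model_simple)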
Note – We didn’t really need to do this. You could have seen that R-squared = 51% from the initial
summary output. There, you’ll see it as Multiple R-squared = .5121.
Three plots in ggplot2 are shown: a) plot of 95% CI of the mean; b) plot of 95% CI of the individual
predictions; and c) combined plot showing both the 95% CI of the mean and the 95% CI of the individual predictions.
library(ggplot2)
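The plotting code for these panels is not included in the handout. A minimal sketch of panels a) and b), assuming the standard ggplot2/predict() approach; the object name cidata and the aesthetics are my choices. Panel c) would simply combine the two sets of layers.
# a) 95% CI of the mean: geom_smooth(method="lm") shades the confidence band
gg <- ggplot(ersdata, aes(x=low, y=calls))
gg <- gg + geom_point()
gg <- gg + geom_smooth(method="lm", level=0.95)
gg <- gg + ggtitle("95% CI of the Mean")
gg + theme_bw()

# b) 95% CI of individual predictions, via predict(..., interval="prediction")
# (predicting on the fitting data triggers a harmless warning about future responses)
cidata <- cbind(ersdata, predict(model_simple, interval="prediction", level=0.95))
gg <- ggplot(cidata, aes(x=low, y=calls))
gg <- gg + geom_point()
gg <- gg + geom_smooth(method="lm", se=FALSE)
gg <- gg + geom_line(aes(y=lwr), linetype="dashed")
gg <- gg + geom_line(aes(y=upr), linetype="dashed")
gg <- gg + ggtitle("95% CI of Individual Predictions")
gg + theme_bw()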
Remarks
• The overlay of the straight line fit is reasonable, but substantial variability is seen, too.
• There is a lot we still don’t know, including but not limited to the following: case influence,
omitted variables, variance heterogeneity, incorrect functional form, etc.
library(mosaic)
library(ggplot2)
library(gridExtra)
A little hard to see what’s going on here. I think I’ll look at these plots one at a time.
Mosaic has 6 nice diagnostic plots. Here I obtain each of them. Note - the plotting requires ggplot2.
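The calls that produced the first plots in this sequence are missing from the handout. Following the pattern of the which=4 and which=5 calls below, they were presumably something like this; the comment labels are mine.
##### a) Y=residuals v X=fitted values (Good if: nice even band centered at Y=0)
mplot(model_simple, which=1)

##### b) Normal Q-Q plot of the residuals (Good if: points fall along the 45-degree line)
mplot(model_simple, which=2)

##### c) Y=sqrt of |standardized residuals| v X=fitted values (Good if: flat trend)
mplot(model_simple, which=3)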
Not bad!
Also not bad. Note that this plot shows the square roots of the absolute values of the standardized residuals.
##### d) Y=Cook's Distance v X=Observation number (Good if: all are below .5)
mplot(model_simple, which=4)
In simple linear regression, the rule of thumb is to notice a Cook’s distance > 1. Clearly we have no
problem here. The largest Cook’s distance is less than 0.15!
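To verify this claim numerically rather than by eye, one can compute the largest Cook's distance directly (a small addition of mine, not in the original handout):
# Largest Cook's distance in the fit; should be well below the rule-of-thumb cutoff of 1
max(cooks.distance(model_simple))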
##### e) Y=residuals v X=leverage (Good if: Nice even band centered at Y=0)
mplot(model_simple, which=5)
Looks okay
Hmmmm - I think I need to find a way to make the text in each of these 6 plots a lot SMALLER, so as
to make more room for the plot itself!
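One possible fix, assuming the mosaic version in use returns ggplot objects from mplot() (an assumption on my part): append a theme() call to shrink the text.
# Shrink all text in a diagnostic plot to leave more room for the panel itself
mplot(model_simple, which=5) + theme(text = element_text(size = 8))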
library(car)
library(mosaic)
library(ggplot2)
library(gridExtra)
# (reconstructed from the output below) compute the residuals, then test them for normality
residual <- resid(model_simple)
shapiro.test(residual)
##
## Shapiro-Wilk normality test
##
## data: residual
## W = 0.94073, p-value = 0.1154
The null hypothesis of normality of the residuals is NOT rejected (p-value = .1154), so the normality assumption looks okay here.
# Plots - -
p1<- histogram(~residuals(model_simple), density=TRUE)
p2 <- mplot(model_simple, which=2)
grid.arrange(p1, p2, ncol=2)
outlierTest(model_simple)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##    rstudent unadjusted p-value Bonferonni p
## 27 2.037655           0.052299           NA
# List observations with Cook's distance values > cutoff (Note: if all is well, you'll get no output)
cook <- cooks.distance(model_simple)
cutoff <- 4/(nrow(ersdata) - length(coef(model_simple)) - 1)  # a common rule-of-thumb cutoff; this line is my reconstruction, not shown in the handout
ersdata[cook > cutoff, ]
# Plots - -
par(mfrow = c(1, 2))  # Set plotting arrangement to 1 row x 2 columns
spreadLevelPlot(model_simple, ylab="Absolute Studentized Residual", xlab="Fitted Value", main="")
##
## Suggested power transformation: 0.9333337
The suggested power transformation is close to 1, which says that no transformation of Y is needed to stabilize the variance.
# TIP!!! Restore plotting arrangement to default setting of 1x1 single panel
par(mfrow = c(1, 1))