STAT 5700 Homework 1
To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.
> library(MASS)
(a) How many rows are in this data set? How many columns? What do the rows and columns
represent?
>nrow(Boston)
[1] 506
>ncol(Boston)
[1] 14
The rows represent the 506 suburban neighborhoods of Boston (the observations). The columns represent the 14 variables that were recorded for each suburb (the predictors).
(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your
findings.
> pairs(Boston[,1:14], pch = 19, lower.panel = NULL)
There is not much information that can be obtained from the pairwise scatterplot beyond the general shapes of the data, but several pairs do appear to be correlated. A few notable shapes: In the plot with "zn" on the y-axis and "black" on the x-axis, the points form a rough U-shape, suggesting that both high and low proportions of residential land zoned for lots over 25,000 sq ft are associated with high values of the black index by town. In the plot with "nox" on the y-axis and "dis" on the x-axis there is a curvilinear shape: high nitrogen oxide concentrations are associated with low weighted mean distances to the five Boston employment centres. In the plot with "ptratio" on the y-axis and "black" on the x-axis, high pupil-teacher ratios are associated with high values of the black index by town.
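To examine one of these relationships more closely, a single pair can be plotted on its own; for example, the curvilinear nox-dis relationship (a minimal sketch, not a call used above):
> plot(Boston$dis, Boston$nox, pch = 19,
+      xlab = "dis (weighted mean distance to employment centres)",
+      ylab = "nox (nitrogen oxides concentration)")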
(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
Sorting the correlations of every predictor with crim from greatest to least, the four predictors most strongly related to crim are rad (0.62550515), tax (0.58276431), lstat (0.45562148), and nox (0.42097171). All four have a positive relationship with per capita crime rate.
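These values can be reproduced by sorting the correlations of every column with crim (a sketch; the exact call used to build the sorted list is not shown):
> sort(cor(Boston)[, "crim"], decreasing = TRUE)   # crim itself appears first with correlation 1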
(d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
> summary(Boston$crim)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
Based on the summary of per capita crime rates, the median is 0.26 while the maximum is 88.98, so some suburbs do have particularly high crime rates.
>hist(Boston$crim)
> length(Boston$crim[Boston$crim > 30])
[1] 8
Eight suburbs have per capita crime rates above 30.
>hist(Boston$tax)
> length(Boston$tax[Boston$tax > 599])
[1] 137
There are 137 suburbs with full-value property tax rates (per $10,000) above 599, which is a notably high amount of tax.
>hist(Boston$ptratio)
A particularly large number of towns have pupil-teacher ratios above 20, at the high end of the range.
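The tall right-hand bars of the histogram can be quantified with a quick count (a sketch):
> table(Boston$ptratio > 20)   # number of towns with pupil-teacher ratio above 20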
(e) How many of the suburbs in this data set bound the Charles River?
>table(Boston$chas)
0 1
471 35
Based on the table above, there are 35 suburbs that bound the Charles River.
(f) What is the median pupil-teacher ratio among the towns in this data set?
>median(Boston$ptratio)
[1] 19.05
The median pupil-teacher ratio among the towns in this data set is 19.05.
(g) Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
> plot(as.factor(Boston$rad), Boston$medv)
>plot(as.factor(Boston$chas),Boston$medv)
>which.min(Boston$medv)
[1] 399
>Boston[which.min(Boston$medv),]
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.9 30.59 5
>summary(Boston$crim)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
Suburb 399 has the lowest median value of owner-occupied homes (medv = 5). Its per capita crime rate of 38.3518 is far above the overall median of 0.26 and near the top of the range (max 88.98). It also has the maximum accessibility to radial highways (rad = 24), a very high tax rate (666), a pupil-teacher ratio of 20.2 (above the median of 19.05), the oldest possible housing stock (age = 100), and a high lstat of 30.59. It does not bound the Charles River (chas = 0), even though the boxplots above suggest that median home values are higher for suburbs on the river and for suburbs far from radial highways. Overall, this suburb scores poorly on nearly every predictor associated with home value.
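One way to compare suburb 399 against the overall spread of every predictor at once (a sketch; not the calls used above):
> Boston[399, ]              # the suburb with the lowest medv
> sapply(Boston, quantile)   # min, quartiles, and max of every column for comparison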
(h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
>summary(Boston$rm)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.561 5.886 6.208 6.285 6.623 8.780
> summary(Boston$crim)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
>table(Boston$rm > 7)
FALSE TRUE
442 64
>table(Boston$rm >8)
FALSE TRUE
493 13
Among the 13 suburbs that average more than eight rooms per dwelling, crim has a median of 0.52014 and a maximum of only 3.47428, whereas crim over the full data set has a median of 0.25651 and a maximum of 88.97620. So these suburbs avoid the very high crime rates seen elsewhere in the data set.
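The rooms8 object used below is the subset of suburbs averaging more than eight rooms per dwelling; its definition does not appear above, so a sketch (the crim figures quoted here come from the analogous summary):
> rooms8 <- Boston[Boston$rm > 8, ]   # assumed definition of the subset used below
> summary(rooms8$crim)                # source of the median 0.52014 and max 3.47428 quoted above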
>table(rooms8$chas)
0 1
11 2
Most of the suburbs that average more than eight rooms per dwelling do not bound the Charles River (11 of 13).
> summary(rooms8$lstat)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.47 3.32 4.14 4.31 5.12 7.44
> summary(Boston$lstat)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.73 6.95 11.36 12.65 16.95 37.97
Suburbs that average more than eight rooms per dwelling have a much smaller proportion of lower-status residents (median lstat of 4.14 versus 11.36 overall), though the proportion is not zero.
(a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0,1) distribution. This represents a feature, X.
>set.seed(1)
>x <- rnorm(100, mean = 0, sd = sqrt(1))
(b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a
N(0,0.25) distribution i.e. a normal distribution with mean zero and variance 0.25.
>eps <- rnorm(100, mean = 0, sd = sqrt(0.25))
(c) Using x and eps, generate a vector y according to the model Y = -1 + 0.5X + eps.
> y <- -1 + 0.5*x + eps
(d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
> plot(x, y)
It looks like there is a positive correlation between x and y due to the right, upward direction of the
points on the scatterplot. I don’t see any extreme outlier points.
(e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do β0-hat and β1-hat compare to β0 and β1?
>model1 = lm(y~x)
>coef(model1)
(Intercept) x
-1.0188463 0.4994698
>summary(model1)$adj.r.squared
0.4619164
β0-hat (-1.019) and β1-hat (0.499) are close to the true β0 = -1 and β1 = 0.5; the eps term adds noise, so the estimates are not exact.
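One way to make this comparison concrete is to check whether the true coefficients fall inside the estimated confidence intervals (a sketch):
> confint(model1)   # check whether β0 = -1 and β1 = 0.5 lie inside the 95% intervals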
(f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression
line on the plot, in a different color. Use the legend() command to create an appropriate legend.
> plot(x, y, type = "p")
> abline(model1, col = "brown")
> abline(-1, 0.5, col = "steelblue3")
> legend("topleft", c("model1 Least Square", "Regression"),
+        col = c("brown", "steelblue3"), lty = c(1, 1))
(g) Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the
quadratic term improved the model fit? Explain your answer.
>model2 = lm(y~poly(x,2))
>summary(model2)$coefficients
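Since model1 is nested inside model2, the evidence for the quadratic term can also be judged with a nested-model F-test (a minimal sketch):
> anova(model1, model2)   # F-test of whether adding the x^2 term significantly improves the fit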
(h) Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data.
The least squares line is closer to the population regression line than in the previous scatterplot because of the reduction in noise.
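The reduced-noise data would be generated along these lines (a sketch; the error variance of 0.05 and the names eps2, y2, and model3 are assumptions, while the population model is the same as before):
> eps2 <- rnorm(100, mean = 0, sd = sqrt(0.05))   # assumed smaller error variance than 0.25
> y2 <- -1 + 0.5*x + eps2                         # same population regression line
> model3 <- lm(y2 ~ x)                            # refit with less noise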
(b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship
between the variables.
> cor(x1, x2)
[1] 0.8393832
The correlation between x1 and x2 is 0.8393832.
>plot(x1,x2)
(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are β0-hat, β1-hat, and β2-hat? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0: β1 = 0? How about the null hypothesis H0: β2 = 0?
>fit1 <-lm(y~x1+x2)
>summary(fit1)
Call:lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-2.82824 -0.62334 -0.08377 0.59433 2.00662
Coefficients:
Estimate Std. Error t value Pr(>|t|)
The estimates are β0-hat = -0.07172, β1-hat = 1.16595, and β2-hat = -0.88045, close in value to the true β0, β1, and β2. We can reject the null hypothesis for β1, but we cannot reject it for β2 because that p-value is greater than 0.05.
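The p-values behind these conclusions can be pulled directly from the fitted model (a sketch):
> summary(fit1)$coefficients[, "Pr(>|t|)"]   # p-values for the intercept, x1, and x2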
(d) Now fit a least squares regression to predict y using x1. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?
>fit2 <-lm(y~x1)
>summary(fit2)
Call: lm(formula = y ~ x1)
Residuals:
Min 1Q Median 3Q Max
-2.7984 -0.6518 -0.0807 0.6619 2.0831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03979 0.19821 -0.201 0.8413
x1 0.69900 0.31774 2.200 0.0302 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9341 on 98 degrees of freedom
Multiple R-squared: 0.04706, Adjusted R-squared: 0.03734
F-statistic: 4.84 on 1 and 98 DF, p-value: 0.03016
Yes, we can reject the null hypothesis because the p-value (0.0302) is less than 0.05.
(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?
>fit3<-lm(y~x2)
>summary(fit3)
Call: lm(formula = y ~ x2)
Residuals:
Min 1Q Median 3Q Max
-2.6730 -0.6539 -0.0579 0.7059 2.1127
Coefficients:
Estimate Std. Error t value Pr(>|t|)
We cannot reject the null hypothesis because the p-value is greater than 0.05.
(f) Do the results obtained in (c)-(e) contradict each other? Explain your answer.
In parts (c) and (d) we were able to reject the null hypothesis for β1, but in part (e) we were not able to reject it for x2. The results only appear to contradict each other: x1 and x2 are collinear, and collinearity inflates the standard errors of the coefficient estimates, so we may fail to reject a null hypothesis even when the predictor is related to the response.
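The degree of collinearity can be quantified with a variance inflation factor, computed here by hand (a minimal sketch; car::vif(fit1) would give the same figure if the car package is installed):
> r2 <- summary(lm(x1 ~ x2))$r.squared   # R^2 from regressing one predictor on the other
> 1 / (1 - r2)                           # VIF; values well above 1 signal inflated standard errors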
(g) Now suppose we obtained one additional observation which was unfortunately mismeasured.
> x1 = c(x1, 0.1)
> x2 = c(x2, 0.8)
> y = c(y, 6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation
have on each of the models? In each model, is this observation an outlier? A high-leverage point?
Both? Explain your answer.
>fit4<-lm(y~x1+x2)
>summary(fit4)
Call: lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-2.6997 -0.7402 -0.1126 0.7408 4.0725
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1992 0.2231 0.893 0.3741
x1 -0.7057 0.5473 -1.289 0.2003
x2 2.2485 0.8412 2.673 0.0088 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.068 on 98 degrees of freedom
Multiple R-squared: 0.07872, Adjusted R-squared: 0.05992
F-statistic: 4.187 on 2 and 98 DF, p-value: 0.018
>fit5<-lm(y~x1)
>summary(fit5)
Call: lm(formula = y ~ x1)
Residuals:
Min 1Q Median 3Q Max
-2.7664 -0.7511 -0.1102 0.6544 5.7767
Coefficients:
>fit6<-lm(y~x2)
> summary(fit6)
Call: lm(formula = y ~ x2)
Residuals:
Min 1Q Median 3Q Max
-2.8161 -0.6824 -0.1149 0.6313 4.8282
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02758 0.17964 0.154 0.8783
x2 1.43031 0.55395 2.582 0.0113 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.072 on 99 degrees of freedom
Multiple R-squared: 0.06309, Adjusted R-squared: 0.05363
F-statistic: 6.667 on 1 and 99 DF, p-value: 0.01129
> cor(x1, x2)
[1] 0.7544409
>plot(x1,x2)
>plot(fit4)
>plot(fit5)
>plot(fit6)
Based on the diagnostic plots, using Cook's distance as a reference, the new observation (point 101) appears to be both an outlier and a high-leverage point.
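The visual impression can be checked numerically with the standard leverage and studentized-residual diagnostics (a sketch, shown here for fit4):
> hatvalues(fit4)[101]   # leverage of the new point; the average leverage is only 3/101
> rstudent(fit4)[101]    # studentized residual; |values| above about 3 suggest an outlier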