STAT 5700 Homework 1
To begin, load in the Boston data set. The Boston data set is part of the MASS library in R.
> library(MASS)
(a) How many rows are in this data set? How many columns? What do the rows and columns
represent?
>nrow(Boston)
[1] 506
>ncol(Boston)
[1] 14
The rows represent the 506 suburban neighborhoods of Boston (the observations). The columns represent the 14 variables that were recorded for each suburb (the predictors).
(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your
findings.
> pairs(Boston[,1:14], pch = 19, lower.panel = NULL)
There is not much information that can be obtained from the pairwise scatterplot beyond the general shapes of the data, but several pairs do appear to be correlated. A few notable shapes: In the plot with "zn" on the y-axis and "black" on the x-axis, the points form a rough U-shape, suggesting that both high and low proportions of residential land zoned for lots over 25,000 sq ft are associated with high values of the black index by town. In the plot with "nox" on the y-axis and "dis" on the x-axis there is a curvilinear shape: high nitrogen oxide concentrations are associated with low weighted mean distances to the five Boston employment centres. In the plot with "ptratio" on the y-axis and "black" on the x-axis, high pupil-teacher ratios are associated with high values of the black index by town.
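To examine one of these relationships more closely, a single pair can be plotted on its own; for example, the curvilinear nox-dis relationship (a minimal sketch, not a call used above):
> plot(Boston$dis, Boston$nox, pch = 19,
+      xlab = "dis (weighted mean distance to employment centres)",
+      ylab = "nox (nitrogen oxides concentration)")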
(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
Sorting the correlations of every predictor with crim from greatest to least, the four predictors most strongly related to crim are rad (0.62550515), tax (0.58276431), lstat (0.45562148), and nox (0.42097171). All four have a positive relationship with per capita crime rate.
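These values can be reproduced by sorting the correlations of every column with crim (a sketch; the exact call used to build the sorted list is not shown):
> sort(cor(Boston)[, "crim"], decreasing = TRUE)   # crim itself appears first with correlation 1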
(d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
> summary(Boston$crim)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
Based on the summary of per capita crime rates, the median is 0.26 while the maximum is 88.98, so some suburbs do have particularly high crime rates.
>hist(Boston$crim)
> length(Boston$crim[Boston$crim > 30])
[1] 8
Eight suburbs have per capita crime rates above 30.
>hist(Boston$tax)
> length(Boston$tax[Boston$tax > 599])
[1] 137
There are 137 suburbs with full-value property tax rates (per $10,000) above 599, which is a notably high amount of tax.
>hist(Boston$ptratio)
A particularly large number of towns have pupil-teacher ratios above 20, at the high end of the range.
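The tall right-hand bars of the histogram can be quantified with a quick count (a sketch):
> table(Boston$ptratio > 20)   # number of towns with pupil-teacher ratio above 20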
(e) How many of the suburbs in this data set bound the Charles River?
>table(Boston$chas)
0 1
471 35
Based on the table above, there are 35 suburbs that bound the Charles River.
(f) What is the median pupil-teacher ratio among the towns in this data set?
>median(Boston$ptratio)
[1] 19.05
The median pupil-teacher ratio among the towns in this data set is 19.05.
(g) Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.
> plot(as.factor(Boston$rad), Boston$medv)
>plot(as.factor(Boston$chas),Boston$medv)
>which.min(Boston$medv)
[1] 399
>Boston[which.min(Boston$medv),]
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
399 38.3518 0 18.1 0 0.693 5.453 100 1.4896 24 666 20.2 396.9 30.59 5
>summary(Boston$crim)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
Suburb 399 has the lowest median value of owner-occupied homes (medv = 5). Its per capita crime rate of 38.3518 is far above the overall median of 0.26 and near the top of the range (max 88.98). It also has the maximum accessibility to radial highways (rad = 24), a very high tax rate (666), a pupil-teacher ratio of 20.2 (above the median of 19.05), the oldest possible housing stock (age = 100), and a high lstat of 30.59. It does not bound the Charles River (chas = 0), even though the boxplots above suggest that median home values are higher for suburbs on the river and for suburbs far from radial highways. Overall, this suburb scores poorly on nearly every predictor associated with home value.
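One way to compare suburb 399 against the overall spread of every predictor at once (a sketch; not the calls used above):
> Boston[399, ]              # the suburb with the lowest medv
> sapply(Boston, quantile)   # min, quartiles, and max of every column for comparison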
(h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
>summary(Boston$rm)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.561 5.886 6.208 6.285 6.623 8.780
> summary(Boston$crim)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
>table(Boston$rm > 7)
FALSE TRUE
442 64
>table(Boston$rm >8)
FALSE TRUE
493 13
Among the 13 suburbs that average more than eight rooms per dwelling, crim has a median of 0.52014 and a maximum of only 3.47428, whereas crim over the full data set has a median of 0.25651 and a maximum of 88.97620. So these suburbs avoid the very high crime rates seen elsewhere in the data set.
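The rooms8 object used below is the subset of suburbs averaging more than eight rooms per dwelling; its definition does not appear above, so a sketch (the crim figures quoted here come from the analogous summary):
> rooms8 <- Boston[Boston$rm > 8, ]   # assumed definition of the subset used below
> summary(rooms8$crim)                # source of the median 0.52014 and max 3.47428 quoted above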
>table(rooms8$chas)
0 1
11 2
Most of the suburbs that average more than eight rooms per dwelling do not bound the Charles River (11 of 13).
> summary(rooms8$lstat)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.47 3.32 4.14 4.31 5.12 7.44
> summary(Boston$lstat)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.73 6.95 11.36 12.65 16.95 37.97
Suburbs that average more than eight rooms per dwelling have a much smaller proportion of lower-status residents (median lstat of 4.14 versus 11.36 overall), though the proportion is not zero.
(a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0,1) distribution. This represents a feature, X.
>set.seed(1)
>x <- rnorm(100, mean = 0, sd = sqrt(1))
(b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a
N(0,0.25) distribution i.e. a normal distribution with mean zero and variance 0.25.
>eps <- rnorm(100, mean = 0, sd = sqrt(0.25))
(c) Using x and eps, generate a vector y according to the model Y = -1 + 0.5X + eps.
> y <- -1 + 0.5*x + eps
(d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
> plot(x, y)
It looks like there is a positive correlation between x and y due to the right, upward direction of the
points on the scatterplot. I don’t see any extreme outlier points.
(e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do β0-hat and β1-hat compare to β0 and β1?
>model1 = lm(y~x)
>coef(model1)
(Intercept) x
-1.0188463 0.4994698
>summary(model1)$adj.r.squared
0.4619164
β0-hat (-1.019) and β1-hat (0.499) are close to the true β0 = -1 and β1 = 0.5; the eps term adds noise, so the estimates are not exact.
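One way to make this comparison concrete is to check whether the true coefficients fall inside the estimated confidence intervals (a sketch):
> confint(model1)   # check whether β0 = -1 and β1 = 0.5 lie inside the 95% intervals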
(f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression
line on the plot, in a different color. Use the legend() command to create an appropriate legend.
> plot(x, y, type = "p")
> abline(model1, col = "brown")
> abline(-1, 0.5, col = "steelblue3")
> legend("topleft", c("model1 Least Square", "Regression"),
+        col = c("brown", "steelblue3"), lty = c(1, 1))
(g) Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the
quadratic term improved the model fit? Explain your answer.
>model2 = lm(y~poly(x,2))
>summary(model2)$coefficients
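Since model1 is nested inside model2, the evidence for the quadratic term can also be judged with a nested-model F-test (a minimal sketch):
> anova(model1, model2)   # F-test of whether adding the x^2 term significantly improves the fit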
(h) Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data.
The least squares line is closer to the population regression line than in the previous scatterplot because of the reduction in noise.
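The reduced-noise data would be generated along these lines (a sketch; the error variance of 0.05 and the names eps2, y2, and model3 are assumptions, while the population model is the same as before):
> eps2 <- rnorm(100, mean = 0, sd = sqrt(0.05))   # assumed smaller error variance than 0.25
> y2 <- -1 + 0.5*x + eps2                         # same population regression line
> model3 <- lm(y2 ~ x)                            # refit with less noise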
(b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship
between the variables.
> cor(x1, x2)
[1] 0.8393832
The correlation between x1 and x2 is 0.8393832.
>plot(x1,x2)
(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are β0-hat, β1-hat, and β2-hat? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0: β1 = 0? How about the null hypothesis H0: β2 = 0?
>fit1 <-lm(y~x1+x2)
>summary(fit1)
Call:lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-2.82824 -0.62334 -0.08377 0.59433 2.00662
Coefficients:
Estimate Std. Error t value Pr(>|t|)
The estimates are β0-hat = -0.07172, β1-hat = 1.16595, and β2-hat = -0.88045, close in value to the true β0, β1, and β2. We can reject the null hypothesis for β1, but we cannot reject it for β2 because that p-value is greater than 0.05.
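The p-values behind these conclusions can be pulled directly from the fitted model (a sketch):
> summary(fit1)$coefficients[, "Pr(>|t|)"]   # p-values for the intercept, x1, and x2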
(d) Now fit a least squares regression to predict y using x1. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?
>fit2 <-lm(y~x1)
>summary(fit2)
Call: lm(formula = y ~ x1)
Residuals:
Min 1Q Median 3Q Max
-2.7984 -0.6518 -0.0807 0.6619 2.0831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03979 0.19821 -0.201 0.8413
x1 0.69900 0.31774 2.200 0.0302 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9341 on 98 degrees of freedom
Multiple R-squared: 0.04706, Adjusted R-squared: 0.03734
F-statistic: 4.84 on 1 and 98 DF, p-value: 0.03016
Yes, we can reject the null hypothesis because the p-value (0.0302) is less than 0.05.
(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?
>fit3<-lm(y~x2)
>summary(fit3)
Call: lm(formula = y ~ x2)
Residuals:
Min 1Q Median 3Q Max
-2.6730 -0.6539 -0.0579 0.7059 2.1127
Coefficients:
Estimate Std. Error t value Pr(>|t|)
We cannot reject the null hypothesis because the p-value is greater than 0.05.
(f) Do the results obtained in (c)-(e) contradict each other? Explain your answer.
In parts (c) and (d) we were able to reject the null hypothesis for β1, but in part (e) we were not able to reject it for x2. The results only appear to contradict each other: x1 and x2 are collinear, and collinearity inflates the standard errors of the coefficient estimates, so we may fail to reject a null hypothesis even when the predictor is related to the response.
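The degree of collinearity can be quantified with a variance inflation factor, computed here by hand (a minimal sketch; car::vif(fit1) would give the same figure if the car package is installed):
> r2 <- summary(lm(x1 ~ x2))$r.squared   # R^2 from regressing one predictor on the other
> 1 / (1 - r2)                           # VIF; values well above 1 signal inflated standard errors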
(g) Now suppose we obtained one additional observation which was unfortunately mismeasured.
> x1 = c(x1, 0.1)
> x2 = c(x2, 0.8)
> y = c(y, 6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation
have on each of the models? In each model, is this observation an outlier? A high-leverage point?
Both? Explain your answer.
>fit4<-lm(y~x1+x2)
>summary(fit4)
Call: lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-2.6997 -0.7402 -0.1126 0.7408 4.0725
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1992 0.2231 0.893 0.3741
x1 -0.7057 0.5473 -1.289 0.2003
x2 2.2485 0.8412 2.673 0.0088 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.068 on 98 degrees of freedom
Multiple R-squared: 0.07872, Adjusted R-squared: 0.05992
F-statistic: 4.187 on 2 and 98 DF, p-value: 0.018
>fit5<-lm(y~x1)
>summary(fit5)
Call: lm(formula = y ~ x1)
Residuals:
Min 1Q Median 3Q Max
-2.7664 -0.7511 -0.1102 0.6544 5.7767
Coefficients:
>fit6<-lm(y~x2)
> summary(fit6)
Call: lm(formula = y ~ x2)
Residuals:
Min 1Q Median 3Q Max
-2.8161 -0.6824 -0.1149 0.6313 4.8282
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02758 0.17964 0.154 0.8783
x2 1.43031 0.55395 2.582 0.0113 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.072 on 99 degrees of freedom
Multiple R-squared: 0.06309, Adjusted R-squared: 0.05363
F-statistic: 6.667 on 1 and 99 DF, p-value: 0.01129
> cor(x1, x2)
[1] 0.7544409
>plot(x1,x2)
>plot(fit4)
>plot(fit5)
>plot(fit6)
Based on the diagnostic plots, using Cook's distance as a reference, the new observation (point 101) appears to be both an outlier and a high-leverage point.
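The visual impression can be checked numerically with the standard leverage and studentized-residual diagnostics (a sketch, shown here for fit4):
> hatvalues(fit4)[101]   # leverage of the new point; the average leverage is only 3/101
> rstudent(fit4)[101]    # studentized residual; |values| above about 3 suggest an outlier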