BZAN 535: Linear Regression
BZAN 535: Linear Regression
The file HW4PROB1.csv reports total weekly quantity sold and average price, as well as other
information, over a period of two years, for a product sold in Kroger.
First, exclude the data for weeks 1-21 and 82-102, to focus on the year when the
typical price appears to be $1.34.
(a) Examine the scatterplot of total_quantity against avg_price. Briefly discuss your
observations.
library(ggplot2)
library(dplyr)
prob1 = read.csv("HW4PROB1.csv")
max(prob1$week_no)
## [1] 102
##
## Call:
## lm(formula = total_quantity ~ avg_price, data = prob1_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.896 -16.772 -7.644 2.262 105.011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 335.64 22.49 14.92 <2e-16 ***
## avg_price -217.90 17.56 -12.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.74 on 58 degrees of freedom
## Multiple R-squared: 0.7265, Adjusted R-squared: 0.7217
## F-statistic: 154 on 1 and 58 DF, p-value: < 2.2e-16
RMSE = sqrt(sum(b1$residuals^2)/b1$df)
c("R-squared"=summary(b1)$r.squared, "RMSE"= RMSE)
## R-squared RMSE
## 0.7264598 26.7407362
Response: When we try to fit the data into a simple linear model, assuming they have
a linear relationship, where total_quantity = intercept + (slope)* avg_price. Intercept
is 335.64, means when avg_price is 0, total quantity is 335.64, which is practically not
possible. Slope is -217.90 means at any given point on the line the ratio of y & x will
be -217.90, as with the increase of x, y is decreasing, negative is there. R^2 depicts
how well the data is fitted into the SLM. the closer the value to 1, the more fitted it is.
Here our value is .72645, without further investigation it cannot be said as the best
fit for this model, but it is not a bad fit. Similarly, RMSE is the measure of the variance
of the sample points from the line. The smaller, the better. here we found it to be
26.74. Not a bad fit
(c) Report the residual diagnostic plots for the fitted model, including a plot of the
residuals vs. week_no. Comment on any problems or inadequacies evident in the
residual diagnostic plots (as far as assumptions for inference with your model in (b)
are concerned). Also, discuss potential solutions to these problems, if any.
par(mfrow= c(2,2)); plot(b1)
c1 = prob1_data %>% mutate(residual = b1$residuals)
head(c1)
##
## Call:
## lm(formula = log.total ~ avg_price, data = c1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85295 -0.15238 -0.01512 0.18389 0.71651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.5075 0.2946 25.48 <2e-16 ***
## avg_price -2.9013 0.2300 -12.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3503 on 58 degrees of freedom
## Multiple R-squared: 0.7329, Adjusted R-squared: 0.7283
## F-statistic: 159.1 on 1 and 58 DF, p-value: < 2.2e-16
par(mfrow= c(2,2)); plot(c1.log)
Response: From the Q-Q plot, the points on the right shows a bump, which means
linear model is not the good fit for these. In residual vs fitted plot, the red line do not
show any regular patterns. for the sample to be a good fit, this should have a straight
line along zero, showing the residuals are not varying. In scale-location plot, it is also
similar to, the previous one. shows, the varience between the SLM & actual value is
arbitrarily changing. Lastly the residuals vs leverage is showing there the values
outside the line who can influence the plot. From these plot, we can say, our
assumption on the linear relation between total quantity & avg price might not be
true. as the residual is not having 0 properties, & the variance is also not 0. Since we
can see a bump on the right end, we can consider the log of total quantity might be a
better fit
(d) Use bootstrap to obtain a 95% confidence interval of the slope (coefficient
corresponding to avg_price). Interpret the confidence interval you obtain.
set.seed(123)
lmod <- lm(total_quantity ~ avg_price, data = prob1_data) # fit model with
observed data
resids <- residuals(lmod) # rediduals from model on observed data
preds <- fitted(lmod) # predicted values from model on observed data
nb <- 4000 # number of bootsrap samples to draw
coefmat <- matrix(0,nb,2) # intitial empty matrix to store estimates
for(i in 1:nb) # repeat process
{
boot_y <- preds + sample(resids, rep=TRUE) # generate y* with given e*
bmod <- update(lmod, boot_y ~ .) # fit model with bootsrap data (x, y*)
coefmat[i,] <- coef(bmod) # store estimates from model on bootsrap data
}
colnames(coefmat) <- c("Intercept","Avg Price")
coefmat <- data.frame(coefmat)
cbind(t(apply(coefmat,2,function(x) quantile(x,c(0.025,0.975)))),
confint(lmod))
Response: From the bootstrap, for the sample to correspond to SLM, we are 95%
confident that the slopes of all data will be within -252.9235 & -185.2393, intercept
will be 294.1508 381.5539
(e) Using the model you fitted in (b), report a 95% confidence interval for the expected
amount of weekly unit sales at $1.59. What does this reveal? (Hint: Create and
examine the residuals vs avg_price plot and comment on the residuals at $1.59.)
prob1_data$avg_price= as.numeric(prob1_data$avg_price)
predict(b1, newdata= data.frame(avg_price= 1.59), interval= "confidence",
level=.95)
##
## Call:
## lm(formula = log(total_quantity) ~ log(avg_price), data = prob1_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82390 -0.17061 0.00956 0.18877 0.67901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.60525 0.07063 65.20 <2e-16 ***
## log(avg_price) -3.46442 0.25339 -13.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3298 on 58 degrees of freedom
## Multiple R-squared: 0.7632, Adjusted R-squared: 0.7591
## F-statistic: 186.9 on 1 and 58 DF, p-value: < 2.2e-16
par(mfrow=c(2,2)) ;plot(f1.lm)
Response: Intercept is 4.60525, means when avg_price is 0, total quantity is 4.60525,
which is practically not possible yet, better fit from the Q-Q plot. Slope is -3.46442
means at any given point on the line the ratio of y & x will be -3.46442, as with the
increase of x, y is decreasing. From the Q-Q plot, the points on are roughly fitting with
the line, which means linear model is a the good fit for these. In residual vs fitted
plot, the red line do not show any regular patterns. But showing tendency towards 0.
In scale-location plot, it is also similar to, the previous one. shows, the variance
between the SLM & actual value is arbitrarily changing. Lastly the residuals vs
leverage is showing there the values outside the line who can influence the plot.
From these plot, we can say, our assumption on the linear relation between log of
total quantity & avg price is a better fit than the previous one
(g) Report a 95% confidence interval for the expected amount of weekly unit sales at
$1.59 using this model. What does this reveal? How does this differ from your
results in (d)?
prob1_data$avg_price= as.numeric(prob1_data$avg_price)
## lower upper
## 17.27935 23.28585
Response: From this fit we are 95% confident that, expected amount of weekly sales
for 1.59 is between 17.27935 23.28585. Since neither in log scale or in linear scale
we have negative value in confidence interval it can be deemed as a better fit than
the one in e, Also, even though the residual is not 0, we now have residual close to 0,
residual observed is .3, we have less error in estimation & r2 is more than the
previous one as well.