Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
39 views

BZAN 535: Linear Regression

The document discusses fitting linear regression models to sales data from a product sold over two years. It explores fitting a simple linear regression of total quantity on average price, as well as a log-log model. Diagnostic plots show some issues with assumptions for the linear model, while diagnostic plots for the log-log model show better fit. The bootstrap confidence interval for the slope of the linear model is wide, indicating uncertainty in the estimate.

Uploaded by

Motasima
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
39 views

BZAN 535: Linear Regression

The document discusses fitting linear regression models to sales data from a product sold over two years. It explores fitting a simple linear regression of total quantity on average price, as well as a log-log model. Diagnostic plots show some issues with assumptions for the linear model, while diagnostic plots for the log-log model show better fit. The bootstrap confidence interval for the slope of the linear model is wide, indicating uncertainty in the estimate.

Uploaded by

Motasima
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 11

BZAN 535: Linear regression

The file HW4PROB1.csv reports total weekly quantity sold and average price, as well as other
information, over a period of two years, for a product sold in Kroger.
First, exclude the data for weeks 1-21 and 82-102, to focus on the year when the
typical price appears to be $1.34.
(a) Examine the scatterplot of total_quantity against avg_price. Briefly discuss your
observations.
library(ggplot2)
library(dplyr)
prob1 = read.csv("HW4PROB1.csv")
max(prob1$week_no)

## [1] 102

prob1_data = prob1 %>% filter(week_no %in% c(22:81))


head(prob1_data)

## week_no total_sales_value total_quantity total_retail_discount avg_price


## 1 22 46.80 35 -8.85 1.337
## 2 23 56.18 42 -10.60 1.338
## 3 24 44.15 33 -8.32 1.338
## 4 25 69.49 52 -13.19 1.336
## 5 26 54.85 41 -10.34 1.338
## 6 27 58.85 44 -11.11 1.338

ggplot(prob1_data, aes(x=avg_price, y= total_quantity)) + geom_point()


Response: With the increase of average price the total quantity decreases. We can
assume there is a linear relation between them. There is a cluster near 1.33. If we try
to fit the data in SLM it will have a negative slope & high positive intercept.
(b) Fit a simple linear regression of total_quantity on avg_price. Report summary() of
your fitted model. Report the R❑2, RMSE and slope, and interpret all three in the
context of the problem.
b1 <- lm(total_quantity~avg_price, data = prob1_data)
summary(b1)

##
## Call:
## lm(formula = total_quantity ~ avg_price, data = prob1_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.896 -16.772 -7.644 2.262 105.011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 335.64 22.49 14.92 <2e-16 ***
## avg_price -217.90 17.56 -12.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.74 on 58 degrees of freedom
## Multiple R-squared: 0.7265, Adjusted R-squared: 0.7217
## F-statistic: 154 on 1 and 58 DF, p-value: < 2.2e-16

RMSE = sqrt(sum(b1$residuals^2)/b1$df)
c("R-squared"=summary(b1)$r.squared, "RMSE"= RMSE)

## R-squared RMSE
## 0.7264598 26.7407362

Response: When we try to fit the data into a simple linear model, assuming they have
a linear relationship, where total_quantity = intercept + (slope)* avg_price. Intercept
is 335.64, means when avg_price is 0, total quantity is 335.64, which is practically not
possible. Slope is -217.90 means at any given point on the line the ratio of y & x will
be -217.90, as with the increase of x, y is decreasing, negative is there. R^2 depicts
how well the data is fitted into the SLM. the closer the value to 1, the more fitted it is.
Here our value is .72645, without further investigation it cannot be said as the best
fit for this model, but it is not a bad fit. Similarly, RMSE is the measure of the variance
of the sample points from the line. The smaller, the better. here we found it to be
26.74. Not a bad fit
(c) Report the residual diagnostic plots for the fitted model, including a plot of the
residuals vs. week_no. Comment on any problems or inadequacies evident in the
residual diagnostic plots (as far as assumptions for inference with your model in (b)
are concerned). Also, discuss potential solutions to these problems, if any.
par(mfrow= c(2,2)); plot(b1)
c1 = prob1_data %>% mutate(residual = b1$residuals)
head(c1)

## week_no total_sales_value total_quantity total_retail_discount avg_price


## 1 22 46.80 35 -8.85 1.337
## 2 23 56.18 42 -10.60 1.338
## 3 24 44.15 33 -8.32 1.338
## 4 25 69.49 52 -13.19 1.336
## 5 26 54.85 41 -10.34 1.338
## 6 27 58.85 44 -11.11 1.338
## residual
## 1 -9.31695390
## 2 -2.09905728
## 3 -11.09905728
## 4 7.46514948
## 5 -3.09905728
## 6 -0.09905728

ggplot(c1, aes(x= week_no, y= residual)) + geom_point() + geom_smooth(method


= "lm", se= TRUE)
c1 = c1 %>% mutate(log.total= log(total_quantity))
c1.log = lm(log.total ~ avg_price, data = c1)
summary(c1.log)

##
## Call:
## lm(formula = log.total ~ avg_price, data = c1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85295 -0.15238 -0.01512 0.18389 0.71651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.5075 0.2946 25.48 <2e-16 ***
## avg_price -2.9013 0.2300 -12.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3503 on 58 degrees of freedom
## Multiple R-squared: 0.7329, Adjusted R-squared: 0.7283
## F-statistic: 159.1 on 1 and 58 DF, p-value: < 2.2e-16
par(mfrow= c(2,2)); plot(c1.log)

Response: From the Q-Q plot, the points on the right shows a bump, which means
linear model is not the good fit for these. In residual vs fitted plot, the red line do not
show any regular patterns. for the sample to be a good fit, this should have a straight
line along zero, showing the residuals are not varying. In scale-location plot, it is also
similar to, the previous one. shows, the varience between the SLM & actual value is
arbitrarily changing. Lastly the residuals vs leverage is showing there the values
outside the line who can influence the plot. From these plot, we can say, our
assumption on the linear relation between total quantity & avg price might not be
true. as the residual is not having 0 properties, & the variance is also not 0. Since we
can see a bump on the right end, we can consider the log of total quantity might be a
better fit
(d) Use bootstrap to obtain a 95% confidence interval of the slope (coefficient
corresponding to avg_price). Interpret the confidence interval you obtain.
set.seed(123)
lmod <- lm(total_quantity ~ avg_price, data = prob1_data) # fit model with
observed data
resids <- residuals(lmod) # rediduals from model on observed data
preds <- fitted(lmod) # predicted values from model on observed data
nb <- 4000 # number of bootsrap samples to draw
coefmat <- matrix(0,nb,2) # intitial empty matrix to store estimates
for(i in 1:nb) # repeat process
{
boot_y <- preds + sample(resids, rep=TRUE) # generate y* with given e*
bmod <- update(lmod, boot_y ~ .) # fit model with bootsrap data (x, y*)
coefmat[i,] <- coef(bmod) # store estimates from model on bootsrap data
}
colnames(coefmat) <- c("Intercept","Avg Price")
coefmat <- data.frame(coefmat)
cbind(t(apply(coefmat,2,function(x) quantile(x,c(0.025,0.975)))),
confint(lmod))

## 2.5% 97.5% 2.5 % 97.5 %


## Intercept 294.4504 381.5771 290.6243 380.6651
## Avg.Price -252.6721 -184.3160 -253.0400 -182.7532

Response: From the bootstrap, for the sample to correspond to SLM, we are 95%
confident that the slopes of all data will be within -252.9235 & -185.2393, intercept
will be 294.1508 381.5539
(e) Using the model you fitted in (b), report a 95% confidence interval for the expected
amount of weekly unit sales at $1.59. What does this reveal? (Hint: Create and
examine the residuals vs avg_price plot and comment on the residuals at $1.59.)
prob1_data$avg_price= as.numeric(prob1_data$avg_price)
predict(b1, newdata= data.frame(avg_price= 1.59), interval= "confidence",
level=.95)

## fit lwr upr


## 1 -10.81089 -24.13423 2.512451

ggplot(c1, aes(x=avg_price, y= residual)) + geom_point()+ geom_smooth(method


= "lm", se= TRUE)
Response: From this fit we are 95% confident that, expected amount of weekly sales
for 1.59 is between -24.13423 & 2.512451. the lower limit is practically not possible.
& the negative value depicts we have under estimated. This is also evident from
residuals, since we find positive residal at 1.59
(f) Now fit a simple linear regression model of log(total_quantity) on
log(avg_price). This is often referred to as the price elasticity of demand model.
Examine the regression diagnostic plots for this model and discuss your
observations. Also, interpret the slope of the model.
f1.lm = lm(log(total_quantity)~log(avg_price), data = prob1_data)
summary(f1.lm)

##
## Call:
## lm(formula = log(total_quantity) ~ log(avg_price), data = prob1_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82390 -0.17061 0.00956 0.18877 0.67901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.60525 0.07063 65.20 <2e-16 ***
## log(avg_price) -3.46442 0.25339 -13.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3298 on 58 degrees of freedom
## Multiple R-squared: 0.7632, Adjusted R-squared: 0.7591
## F-statistic: 186.9 on 1 and 58 DF, p-value: < 2.2e-16

f1 = prob1_data %>% mutate(residual = f1.lm$residuals)

ggplot(f1, aes(x= week_no, y= residual)) + geom_point() + geom_smooth(method


= "lm", se= TRUE)

par(mfrow=c(2,2)) ;plot(f1.lm)
Response: Intercept is 4.60525, means when avg_price is 0, total quantity is 4.60525,
which is practically not possible yet, better fit from the Q-Q plot. Slope is -3.46442
means at any given point on the line the ratio of y & x will be -3.46442, as with the
increase of x, y is decreasing. From the Q-Q plot, the points on are roughly fitting with
the line, which means linear model is a the good fit for these. In residual vs fitted
plot, the red line do not show any regular patterns. But showing tendency towards 0.
In scale-location plot, it is also similar to, the previous one. shows, the variance
between the SLM & actual value is arbitrarily changing. Lastly the residuals vs
leverage is showing there the values outside the line who can influence the plot.
From these plot, we can say, our assumption on the linear relation between log of
total quantity & avg price is a better fit than the previous one
(g) Report a 95% confidence interval for the expected amount of weekly unit sales at
$1.59 using this model. What does this reveal? How does this differ from your
results in (d)?
prob1_data$avg_price= as.numeric(prob1_data$avg_price)

predict(f1.lm, newdata= data.frame(avg_price= 1.59), interval= "confidence",


level=.95)

## fit lwr upr


## 1 2.998679 2.849512 3.147846
conf.interval = c("lower"= exp(2.849512), "upper"= exp(3.147846))
conf.interval

## lower upper
## 17.27935 23.28585

ggplot(f1, aes(x=(avg_price), y= residual)) + geom_point() + geom_smooth()

Response: From this fit we are 95% confident that, expected amount of weekly sales
for 1.59 is between 17.27935 23.28585. Since neither in log scale or in linear scale
we have negative value in confidence interval it can be deemed as a better fit than
the one in e, Also, even though the residual is not 0, we now have residual close to 0,
residual observed is .3, we have less error in estimation & r2 is more than the
previous one as well.

You might also like