Decision Making Under Uncertainty

Question 1:

1. Prepare a brief report summarizing the home values (prices) in this area. Use both graphical and
numerical summaries. Your report should briefly describe what those summaries tell you, and
anything of particular note/interest.

Figure 1 Descriptive Statistics of Price

The mean house price is around 163K with a standard deviation of 67K. There are outliers with prices
close to 450K, which make the distribution moderately right skewed. From the report we can also infer
that the 95% confidence interval for the mean price is roughly 159K to 167K; this is a range for the
average price, not a range containing 95% of the houses.

2. Does the normal model provide a good description of the prices? Use a Normal Quantile plot to frame
your response.

Figure 2 Normal Quantile Plot of Price


No, the data do not look normal. The normal quantile plot shows that the distribution is right
skewed.

3. Irrespective of your response to Q2, assume that Price ~ N(164K, (68K)^2). Given this:
a. Calculate the following probabilities – P(Price > 92.8K), P(Price < 255.5K). Do these numbers agree
with what you see in the data?
P(Price > 92.8K) = 0.8524 with Mean = 164, Stdev= 68 as seen in the figure 3

Figure 3 Area for Price >92.8K

P(Price <255.5K) = 0.91 with Mean = 164, Stdev= 68 as seen in the figure 4

Figure 4 Area for Price <255.5K


Yes. After calculating the probabilities and comparing them with the observed distribution (shaded areas in
figure 3 and figure 4), they are consistent with the given data set.
b. Once again, assuming the above normal distribution, what percentage of houses should have a
value less than 232K? Does that agree with the data?
P(Price < 232K) = 0.84

Figure 5 Area for Price <232K

Yes, after plotting the distribution against the data it resembles the data as seen in the figure 5
c. Based on the theoretical model, what do you expect should be the price of a house that is
exactly on the third quartile (75th percentile)? How does that compare to the actual?
Under the normal model, the price at the 75th percentile is 209.87K.

Figure 6 Fitted Normal Over Price

The calculated 75th percentile is very close to the actual data, as seen in figure 6. There is a
slight difference because the 75th percentile is calculated assuming a normal distribution, whereas the
actual distribution is right skewed.
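The probabilities and the quartile in Q3 can be checked with a short Python sketch. It assumes only the stated model Price ~ N(164K, (68K)^2), with prices in thousands; the variable names are illustrative and the standard library is enough.

```python
from statistics import NormalDist

# Assumed model from Q3: Price ~ N(164K, (68K)^2), in thousands.
price = NormalDist(mu=164, sigma=68)

p_above_92_8 = 1 - price.cdf(92.8)    # P(Price > 92.8K)
p_below_255_5 = price.cdf(255.5)      # P(Price < 255.5K)
p_below_232 = price.cdf(232)          # P(Price < 232K)
third_quartile = price.inv_cdf(0.75)  # 75th percentile under the model

print(round(p_above_92_8, 4), round(p_below_255_5, 4),
      round(p_below_232, 4), round(third_quartile, 2))
```

These values describe the theoretical model only; agreement with the data still has to be judged against the observed proportions and the sample quartile.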

4. Create a histogram and boxplot for the Living Area variable. Is the distribution symmetric? Check the
skewness measure to see if it is consistent with your observation.

Figure 7 Histogram and Boxplot for Living Area


The skewness is positive (0.807), hence the distribution is right skewed.
It is moderately skewed, since the skewness lies between 0.5 and 1.
Yes, this is consistent with my observation: from the histogram in figure 7 we can say the distribution is
right skewed and not symmetric, and the skewness value of 0.807 and kurtosis value of 0.392 confirm that
it departs from a normal distribution.

5. Create a new column in the dataset by taking the logarithm of the Living Area variable. Is the normal
distribution a better fit for this variable or the original (Living Area) variable? Why do you think this is
the case?

Figure 8 Fitted Normal over Living Area

As the distribution of the Living Area variable is right skewed (figure 8), we took the logarithm of the
variable to reduce the skewness. The logarithm compresses large values more than small ones, which pulls
in the long right tail, so the normal distribution is a better fit for the LogOfLA (Log of Living Area)
variable, as shown in the plots below.

Figure 9 Log of Living Area

Figure 10 Fitted Normal over Log of Living Area

The skewness was calculated again and the value is closer to 0, which means that the log transformation
has reduced the skewness of the data. Visually, the data now resemble a normal distribution, as seen in
figure 10.
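The effect of the log transformation on skewness can be illustrated with a small sketch. The living areas below are hypothetical values, not the assignment data, and the skewness uses the plain moment formula, so it may differ slightly from the adjusted value a statistics package reports.

```python
import math

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical living areas in sq. ft. (not the assignment data).
living_area = [800, 1000, 1200, 1500, 1800, 2200, 3000, 4500]
log_living_area = [math.log(v) for v in living_area]

skew_raw = skewness(living_area)
skew_log = skewness(log_living_area)
print(round(skew_raw, 2), round(skew_log, 2))
```

For right-skewed data like this, the skewness of the logged values is expected to be closer to 0 than that of the raw values.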

6. Create the 90%, 95%, and 99% confidence intervals for the average home price and explain what these
mean. How do the margins of error for these three confidence intervals compare? Does that make
sense? Before creating the confidence intervals, be sure to check the conditions necessary to create
confidence intervals (and briefly describe this in your submission).
The conditions necessary to create a confidence interval for a mean are as below:
1. The observations are a random sample from the population and are independent of each other.
2. Either the population is approximately normally distributed, or the sample is large enough for the
sampling distribution of the mean to be approximately normal despite the skewness in the data.
The assumption of homogeneity of variance (equal variances) applies only when two populations are
compared, so it is not needed here.
The interval itself is built from three components: the estimate of the population parameter, the
confidence level, and the margin of error.

Figure 11 Distribution of Price

Figure 12 90%, 95% and 99% Confidence Intervals

The 90%, 95%, and 99% confidence intervals for the average home price are created as above. Each interval
is a range of plausible values for the mean price; for example, we are 95% confident that the true mean
lies within the 95% interval. We can see that
1. The estimates of the Mean and Std Dev do not change and are the same for all three confidence intervals.
2. The lower limit decreases and the upper limit increases, for both the Mean and the Std Dev, as the
confidence level increases.
3. Comparison of the margins of error: with the sample size N constant, the margin of error increases as
the level of confidence increases, because the larger the expected proportion of intervals that will
contain the parameter, the wider each interval must be.
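How the margin of error grows with the confidence level can be sketched as follows. The mean and standard deviation are the values quoted in this report (in thousands); the sample size n = 1000 is a hypothetical placeholder, and the large-sample z interval is used as an approximation to the t interval a statistics package would report.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, level):
    """Large-sample z interval for a mean: mean +/- z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin, margin

# Mean and SD in thousands, from the report; n = 1000 is hypothetical.
sample_mean, sample_sd, n = 163.8, 67.0, 1000

margins = []
for level in (0.90, 0.95, 0.99):
    low, high, margin = mean_confidence_interval(sample_mean, sample_sd, n, level)
    margins.append(margin)
    print(f"{level:.0%} CI: ({low:.1f}, {high:.1f}), margin of error {margin:.2f}")
```

With n fixed, only the critical value z changes between the three intervals, which is why the margins are ordered 90% < 95% < 99%.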

7. Your friend has asked you to provide an estimate for the 95th percentile of home prices in this market.
Which (if any) of the above confidence intervals can you use to give an answer? Describe briefly.
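None of the above confidence intervals can be used. A confidence interval for the mean describes the
uncertainty in the average price, not the spread of individual home prices, so a 95% confidence level has
no connection to the 95th percentile. The 95th percentile should be estimated from the data directly (the
sample 95th percentile) or, under the normal model of Q3, as Mean + 1.645 × Stdev = 164K + 1.645 × 68K,
which is about 276K.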

8. The sample data given to you all come from home sales within the past 12 months. Suppose you had
sample data of the same size each year going back several years and calculated the average sale price
for each year. What kind of distribution do you expect to see for these averages and why? (Include the
parameters of the distribution in your response, if the house prices don’t change i.e. go up or down,
over time. Clearly this is not a great assumption but make it anyway.)
By the central limit theorem, the yearly averages should be approximately normally distributed. The mean
of this distribution is the same as the mean of the population, and the standard deviation is
sigma/sqrt(n), where sigma is the SD of the population and n is the sample size.
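A small simulation illustrates this behaviour of sample averages. The population here is a hypothetical right-skewed (exponential) one chosen purely for illustration; the sample size, number of simulated years, and seed are likewise assumptions.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed population: exponential with mean 1 and SD 1.
population_mean, population_sd = 1.0, 1.0
n = 100          # observations in each yearly sample
n_years = 2000   # number of simulated yearly samples

yearly_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(n_years)
]

mean_of_means = mean(yearly_means)
sd_of_means = stdev(yearly_means)
theoretical_se = population_sd / sqrt(n)
print(round(mean_of_means, 3), round(sd_of_means, 3), theoretical_se)
```

Even though individual observations are skewed, the averages cluster around the population mean with a spread close to sigma/sqrt(n).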

9. The architecture changed significantly in this geographical area about 30 years ago. So, any houses
aged more than 30 years are considered “old” houses. What proportion of the houses in the sample is
old? Provide the 95% and 99% confidence intervals for the proportion of old houses in this area and
interpret them. Once again, make sure that the necessary conditions are satisfied before creating
confidence intervals.

Figure 13 Distribution of Age of Houses

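Proportion of houses that are older than 30 years (X > 30), using a normal approximation for Age with the
mean and standard deviation from figure 13:

P(X > 30) = ?
Z = (X - Mean)/Stdev = (30 - 28.061127)/34.900899 = 0.0556; P(Z > 0.0556) = 0.478

Under this approximation about 48% of the houses are older than 30 years. Because Age cannot be negative
while its standard deviation exceeds its mean, the normal model is only a rough fit; counting the houses
with Age > 30 directly in the sample gives a more reliable proportion p. The confidence interval is then
p ± z × sqrt(p(1 - p)/n), which is valid when both n × p and n × (1 - p) are at least 10.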
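The question also asks for 95% and 99% confidence intervals for the proportion. A minimal sketch of the usual normal-approximation interval is given below; the counts (300 old houses out of 1000) are hypothetical and should be replaced by the counts from the sample.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, level):
    """Normal-approximation (Wald) interval for a proportion."""
    # Success/failure condition: at least 10 observations in each group.
    if min(successes, n - successes) < 10:
        raise ValueError("normal approximation is not appropriate")
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical counts: 300 of 1000 sampled houses are older than 30 years.
old_houses, n_houses = 300, 1000
ci_95 = proportion_confidence_interval(old_houses, n_houses, 0.95)
ci_99 = proportion_confidence_interval(old_houses, n_houses, 0.99)
print(ci_95, ci_99)
```

The 99% interval is wider than the 95% interval for the same data, mirroring the behaviour of the intervals for the mean.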

10. Your friend claims that the average house price in this area is above $150K. Do you agree? He also
claims that the average living area is more than 1800 Sq. ft. Do you agree with this? (Use a 5%
significance level for both). Briefly explain what the p-values in these cases mean?

Figure 14 Distributions of Living Area and Price Respectively
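From the above observations, the average house price in the sample is 163.8K, which is more than 150K, and
the average living area is 1807.3 Sq. ft., which is more than 1800 Sq. ft. For price, the 95% confidence
interval for the mean (roughly 159K to 167K) lies entirely above 150K, so the claim is supported at the 5%
level. For living area, the sample mean exceeds 1800 by only 7.3 Sq. ft., so the claim is supported only
if the one-sided p-value is below 0.05. In each case the p-value is the probability of observing a sample
mean at least this large if the true mean were equal to the claimed value.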
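A one-sided test makes the comparison with the claimed values formal. The sketch below uses a large-sample z test; the price mean and standard deviation come from this report, while the sample size and the living-area standard deviation are hypothetical placeholders, so the printed p-values are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(sample_mean, claimed_mean, sd, n):
    """P-value for H0: mu <= claimed_mean vs Ha: mu > claimed_mean (z test)."""
    z = (sample_mean - claimed_mean) / (sd / sqrt(n))
    return 1 - NormalDist().cdf(z)

# Price in thousands: mean and SD from the report; n = 1000 is hypothetical.
p_price = one_sided_p_value(163.8, 150, 67, n=1000)

# Living area: mean from the report; SD = 600 and n = 1000 are hypothetical.
p_area = one_sided_p_value(1807.3, 1800, 600, n=1000)

print(p_price, p_area)
```

A sample mean that is only slightly above the claimed value can still give a large p-value, so exceeding the claimed value is not by itself evidence for the claim.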

Question 2:
1. How would you characterize the relationship between the area of the store and the monthly sales in
terms of (i) form, (ii) direction, (iii) strength?

Figure 15 Correlation between Store Size and Monthly Sales

i. Form: Linear; monthly sales change roughly in proportion to store size.
ii. Direction: Positive. An increase in store size is associated with an increase in sales and vice versa;
the slope is positive.
iii. Strength: The correlation between store size and monthly sales is 0.81. On a scale of -1 to 1, 0.81
corresponds to a strong positive correlation.
2. Estimate a linear relationship between the area of the store and monthly sales and provide a
managerial interpretation.

Monthly Sales = 32249.334469 + 311.089 × Store Size (sq. ft.)

Each additional square foot of store area is associated with an increase of approximately 311.089 in
monthly sales.
The intercept is 32249.33, which would mean base sales of 32249.33 for a store with zero area. This does
not make practical sense; hence, we should not make such inferences by extrapolating beyond the sample
range.
3. Based on the evidence presented in the dataset, what is your assessment of the claim that store sales
are associated with store size?
The claim is supported.
The correlation between store size and monthly sales, as seen in figure 15, is 0.81, and the fitted
linear model has R^2 = 0.65 with a p-value < .0001, so the association is statistically significant.

4. Is the retail industry wisdom of Rs. 500 per month per sq. ft. applicable to CCD?
It is not applicable to CCD, because the estimated slope is about 311 per month per sq. ft., well below
the industry figure of Rs. 500.

5. You would like to use the analysis as a forecasting tool for the sales performance of new stores. What
is the 95% prediction interval of sales performance for a store with an area of 200 sq. ft.? Similarly,
what is this interval for a store of 500 sq. ft.?
200 sq. ft : [85349.51,103585.22]
500 sq. ft : [178675.74, 196911.92]
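As a quick consistency check, the point predictions from the fitted line in Q2 should sit at the centre of each reported interval. The sketch below uses only the coefficients and interval limits quoted in this report; the function name is illustrative.

```python
def predicted_sales(store_size_sqft):
    """Fitted line from Q2: Monthly Sales = 32249.334469 + 311.089 * Store Size."""
    return 32249.334469 + 311.089 * store_size_sqft

# Reported 95% prediction intervals from Q5, keyed by store size in sq. ft.
intervals = {200: (85349.51, 103585.22), 500: (178675.74, 196911.92)}

for size, (low, high) in intervals.items():
    point = predicted_sales(size)
    midpoint = (low + high) / 2
    print(size, round(point, 2), round(midpoint, 2))
```

The two numbers printed for each store size agree to within rounding, which confirms that the intervals are centred on the fitted values.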

6. Save new columns corresponding to predicted sales and residuals. Then produce correlation and
covariance matrices along with the scatter plots for the following four variables: (i) monthly sales, (ii)
store area, (iii) predicted monthly sales, and (iv) residuals. Interpret your output and provide
appropriate explanations.

Figure 16 Covariance and Correlations Matrices, and Scatter Plots for four variables: Store Size, Monthly Sales, Predicted Monthly Sales
and Residuals

Question 3
1. Examine scatterplots of the relationships between the three variables in the regression model. Do you
notice any unusual features in the data? Do the relevant plots appear straight enough for multiple
regression?

Figure 17 Scatterplot Matrix of Apple Return, Whole Market Return and IBM Return

The relationships do not appear strongly linear in the scatterplot matrix in figure 17. Apple return
is more correlated with whole market return than with IBM return, while whole market return is more
strongly correlated with IBM return.

2. Fit the indicated multiple regression model and show the “Summary of Model fit”, “ANOVA Table” and
the “Parameter Estimates” table. Interpret the regression output from all these tables.

Figure 18 Regression Model for Apple Return

Figure 19 Regression Model Summary

• R^2 is only 0.2165, which means that only about 22% of the variation in Apple return is explained by the
regression model. Most of the variation remains unexplained.
• RMSE, the standard deviation of the residuals around the regression line, is 0.13255.
• The p-value for whole market return is <0.0001, while that for IBM return is 0.048. Both are significant
at the 5% level, but the evidence is much stronger for whole market return than for IBM return.
• When everything else is held constant, a unit change in whole market return is associated with a 1.316
unit change in Apple return.
• When everything else is held constant, a unit change in IBM return is associated with a 0.227 unit
change in Apple return.
• Total variation = variation explained by the regression + unexplained (error) variation: SST = SSR + SSE,
i.e. 6.66 ≈ 1.44 + 5.21. SSR is low compared to SSE, which means that the regression line explains very
little of the variation and a large random error component (shock, ε) remains.
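The link between the ANOVA sums of squares and R^2 can be verified directly. The values below are the rounded sums of squares quoted above, so the result matches the reported 0.2165 only to about three decimal places.

```python
# Sums of squares as quoted from the ANOVA table (rounded).
ssr, sse, sst = 1.44, 5.21, 6.66

r_squared = ssr / sst            # share of variation explained by the model
unexplained_share = sse / sst    # share left in the residuals
print(round(r_squared, 4), round(unexplained_share, 4))
```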

3. Does the estimated model appear to meet the conditions for the use of the Multiple Regression
Model? You must include the “Residuals vs. Predicted Values” plot to check for the constant variance
(homoscedasticity) assumption, and the normal quantile plot of residuals to check for the normality of
errors. Also, plot the residuals against time to check the assumption of no autocorrelation in the error.
Comment on these plots/statistics. (Hint: Time is increasing as you go down the rows.)
The residuals scatter around zero with no particular pattern, as seen in figure 20, which means they
have near-constant variance and show no heteroskedasticity.

Figure 20 Residuals by Predicted Values Plot

The residuals are approximately normally distributed, as is evident from the normal quantile plot in
figure 21, where the values remain within the acceptable range.

Figure 21 Normal Quantile Plot of Residuals

The Durbin-Watson test shows a very low autocorrelation of 0.0288 in the residuals over time, as seen in
figure 22.

Figure 22 Residuals against time: Durbin-Watson for Autocorrelation
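For reference, the Durbin-Watson statistic can be computed from the time-ordered residuals as sketched below. The residuals are hypothetical values used only to show the calculation; the last line uses the standard large-sample relation DW ≈ 2(1 - r) with the autocorrelation r = 0.0288 reported above.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals in time order (not the assignment data).
residuals = [0.05, -0.12, 0.08, 0.01, -0.07, 0.10, -0.03, -0.02]
dw = durbin_watson(residuals)

# Large-sample relation DW ~ 2 * (1 - r) with the reported r = 0.0288.
approx_dw_from_report = 2 * (1 - 0.0288)
print(round(dw, 3), round(approx_dw_from_report, 4))
```

A value close to 2 indicates little first-order autocorrelation; values near 0 or 4 indicate strong positive or negative autocorrelation respectively.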

4. Build a 95% prediction interval for Apple Return when the Market Return is 5% and
IBM return is -2%. Provide an appropriate managerial interpretation of this interval.

Figure 23 Prediction Equation

Using the fitted prediction equation in figure 23:


When market return is 5% and IBM return is -2%, the predicted Apple return is 0.06611, with a margin of
error of approximately Z × RMSE. So, the 95% prediction interval for Apple return is -0.19395 to 0.326177.
This means Apple return could lie between a loss of about 19% and a gain of about 33%, 95% of the time.
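The interval can be approximated from the predicted value and the RMSE alone. This is a rough large-sample sketch: it uses the normal critical value and ignores the uncertainty in the estimated coefficients, so it comes out slightly narrower than the interval reported by the software.

```python
from statistics import NormalDist

predicted_return = 0.06611   # fitted value reported above
rmse = 0.13255               # RMSE from the regression summary

# Approximate margin of error: z * RMSE (assumption: large sample,
# coefficient uncertainty ignored).
z = NormalDist().inv_cdf(0.975)
lower = predicted_return - z * rmse
upper = predicted_return + z * rmse
print(round(lower, 4), round(upper, 4))
```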

5. Create a simple regression model of Apple Returns on the Market Returns. Compare the slope of this
model with that obtained for Market Returns in Q2 above. Are they the same or different? Why?
Show how you can derive one from the other

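As seen in figure 24, the slope of the line is very similar to the slope for whole market return in Q2.
This reemphasizes that the change in Apple return is explained better by whole market return than by IBM
return. The two slopes are related as follows: simple slope = multiple-regression slope for market return
+ (IBM coefficient × slope of IBM return regressed on market return), so they differ only to the extent
that IBM return is correlated with market return.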
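The relation between the simple and multiple regression slopes can be checked numerically: the simple slope equals the multiple-regression slope for market return plus the IBM coefficient times the slope of IBM return regressed on market return. The returns below are hypothetical values, not the assignment data, and the two-predictor least-squares slopes are computed from covariances in closed form.

```python
def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

# Hypothetical monthly returns (not the assignment data).
market = [0.01, -0.02, 0.03, 0.00, 0.02, -0.01]
ibm = [0.02, -0.01, 0.01, 0.01, 0.03, -0.02]
apple = [0.03, -0.05, 0.06, -0.01, 0.04, 0.00]

s_mm, s_ii, s_mi = cov(market, market), cov(ibm, ibm), cov(market, ibm)
s_ma, s_ia = cov(market, apple), cov(ibm, apple)

# Multiple regression of Apple on market and IBM (two-predictor closed form).
det = s_mm * s_ii - s_mi ** 2
b_market = (s_ii * s_ma - s_mi * s_ia) / det
b_ibm = (s_mm * s_ia - s_mi * s_ma) / det

# Simple regressions: Apple on market, and IBM on market.
simple_slope = s_ma / s_mm
ibm_on_market = s_mi / s_mm

# Simple slope = market coefficient + IBM coefficient * (IBM-on-market slope).
derived_slope = b_market + b_ibm * ibm_on_market
print(round(simple_slope, 6), round(derived_slope, 6))
```

The two printed slopes coincide, which shows that the simple slope absorbs the part of the IBM effect that moves together with the market.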

Figure 24 Regression model with Apple Returns on Whole Market Returns
