Assignment 3
Assignment 3
The purpose of this assignment is to give you practice working with the underlying concepts
of quantitative methods, and to give you feedback on your understanding of these concepts.
A group of two, three or four students (but no more than four students) may work together and
submit one set of assignment answers for the group. All members of the group, however,
MUST be enrolled in the same workshop.
For assignments submitted as a group, all valid group members will receive the same mark for
the assignment. Students that attempt to submit an assignment with a group that is not in their
own workshop, or in a group with more than four members, will not receive any credit for the
assignment. Students will form their own groups. The group will allocate one member to submit
the answers on behalf of the group. Individuals may work alone if they wish and submit their
own assignment answers, but I would urge students to work in groups.
All assignments must be submitted with Turnitin. The information that you need to do this can
be found from the following link
Turnitin student guide: https://www.lms.unimelb.edu.au/user_guides/turnitin_stu_guide.pdf
1
Students MUST copy and paste the template provided below into the top of the first page of
their assignment answers, and complete the template, before submitting answers.
It is essential that you include the name of your tutor and your allocated workshop day and
time at the top of your assignment answers in order for your assignment to be graded in a
timely manner.
Assignment answers must be typed, in 12 point font, with 1.5 line spacing. For each
question, the written part of your answers must be within the specified word limits provided
below. Words in excess of those limits will be ignored during marking. You are not required
to write as much as those word limits. Shorter answers, if well-written, concise and clear,
may receive as many or more marks than longer answers. Answer the questions directly. Do
not present unnecessary graphs or numerical measures, or discuss irrelevant matters.
Marks will be deducted if you present inappropriate or unnecessary material.
You MUST submit your assignment answers in a Portable Document Format (pdf) file. You
should also look closely at the pdf file you upload before confirming your submission, to ensure
that all your answers are included as you expect.
Good luck.
2
ASSIGNMENT QUESTIONS
Background
The senior management in a large university have asked an analyst to develop a model to help
them to understand the key factors which explain how successful students who enrol in the
degree program will be. The measure of success that the analyst decides to model is the grade
point average (GPA) that students achieve in the first three semesters of this degree program.
At this university this GPA variable can take values between 0 and 4.
As a first step the analyst selects a random sample of n = 200 students who have completed the
first three semesters of the degree program and notes the GPA value for each student. The
values of the GPA variable are stored in cells A2:A201 of the file Assignment 3 Q1.xls. For
each student he also notes the values of six variables that could possibly be used in a multiple
regression model of the GPA variable The definitions of the possible X or independent variables
and the cells their values are stored in are as follows.
HSM Average high school grades in mathematics. (in B2:B201)
HSS Average high school grades in science. (in C2:C201)
HSE Average high school grades in English. (in D2:D201)
SATM SAT score for mathematics. (in E2:E201)
SATV SAT score for verbal skills. (in F2:F201)
GENDER An indicator variable with a value of 1 for males and 0 for females (in G2:G201)
N.B. When you use the Regression tool in Data Analysis in both Questions 1 and 2 you
need to click the 4 choices in the Residuals section as shown below.
3
Question 1 (continued)
Answer the following questions
a: Perform a preliminary analysis for your regression model where you obtain the following:
The descriptive statistics for the GPA variable and for the six possible independent
variables. If we only know the values of the descriptive statistics what will our estimate of
the typical value of the GPA variable be equal to. Please note whether or not the value
that you obtain has any limitations or problems.
The set of correlation coefficients for all possible combinations of the 7 variables GPA,
HSM, HSS, HSE, SATM, SATV and GENDER. If we want to test H0: = 0 the formula for
the z statistic is r n . Consider the correlation coefficients for the GPA variable and the
HSM, HSS, HSE, SATM and SATV variables. If the critical value for z is 1.96 which of
these variables do not have a linear relationship with the GPA variable.
The three different scatter diagrams in which GPA is always the dependent variable and
the three variables HSM, HSS and HSE are the independent variables. Briefly explain
what the three scatter diagrams and the correlation coefficients for the GPA variable and
these three variables are telling us about the possible relationships between the GPA
variable and these three different possible independent variables HSM, HSS and HSE.
State why we do not look at the scatter diagram for the GPA variable and the GENDER
variable. (8 marks)
b: Estimate the multiple regression model in which GPA is the dependent variable and there
are four independent variables HSM, HSS, HSE and GENDER. Use your Excel output to
write down your estimated model.
Briefly discuss what the estimated coefficients of each of the four independent variables
are telling us about how these variables affect the value of the GPA variable. (In this part
of the question you do not need to test whether or not the values of these coefficients are
significantly different from 0.) (5 marks)
c: When we estimate a regression model with Excel in addition to the output showing the
coefficient estimates and the various evaluation statistics you will also obtain the values
for the Predicted GPA or Ŷ values, the Residuals and the Standard Residuals for all 200
students. (The Standard Residuals are the z scores for the Residuals.)
We use these values to check whether the error terms for this model satisfy certain key
assumptions about the error terms. Briefly explain why we look at these assumptions
concerning the error terms before we use the Excel output to assess the quality of our
estimated model. When you do this make sure you briefly explain why in this study we
only look at 2 of the 5 assumptions about the error term.
Using an appropriate histogram and the descriptive statistics for the Standard Residuals
briefly discuss whether the error terms are or are not normally distributed.
Using the correlation coefficient and an appropriate scatter diagram for the standard
residuals and the Predicted GPA or Ŷ values briefly discuss what heteroscedasticity is
and whether it is an issue in this model. (10 marks)
4
Question 1 (continued)
d: Using the Excel output for this model briefly explain what the F statistic and the R-
squared value are telling us about this estimated model. (4 marks)
e: Using the Excel output for this model and a level of significance of = 0.10 briefly explain
which of the independent variables has a significant impact on the GPA variable.
Briefly discuss whether there is any evidence that there is a problem with Multicollinearity
in this model. (5 marks)
f: The analyst decides to obtain a parsimonious model which only includes independent
variables for which the P-values of all the coefficients are less than 0.10. To obtain this
model the analyst starts with the model which contains the four independent variables and
removes the variable with the largest P-value. The analyst then estimates the model with
the remaining three independent variables. This process continues until we have a model
for which the P-values of all the coefficients are less than 0.10.
Write down the estimated parsimonious model.
Using the same procedure used with the four variable model check whether the residuals
from this parsimonious model satisfy the key assumptions about the error terms.
Briefly compare the R-squared and the Adjusted R-squared for the original four variable
model and for the parsimonious model.
Briefly compare the values of the coefficients in the original four variable model with the
values in the parsimonious model. (6 marks)
g: Briefly explain in 100 words or less what this model is telling us about the relationship
between students high school results and their GPA values at this university. (6 marks)
h: In high school a straight A student is someone who has a score of 10 in every subject so
this type of student will have HSM = 10, HSS = 10 and HSE = 10. The straight C student is
someone who has a score of 4 in every subject so this type of student will have HSM = 4,
HSS = 4 and HSE = 4. Using the original four variable model and the parsimonious model
obtain separate forecasts for males and females of the expected or average GPA value for
a straight A student and a straight C student.
Using these forecasts what advice should the university give to straight C students who
ask whether or not they are likely to do well in this degree program. (6 marks)
Total marks 50 = 8 + 5 + 10 + 4 + 5 + 6 + 6 + 6
5
Question 2 (50 marks)
Background
In Victoria there are significant fines imposed on real estate agents who tell potential buyers
that the expected selling price of a house is significantly lower than what the house will
actually sell for. The manager of a large real estate firm asks an analyst to develop a model of
the actual selling prices for typical houses the firm deals with. The analyst randomly selects a
sample of n = 44 houses that the firm has recently sold. For each of these houses the analyst
notes the values of the dependent variable which is the selling price (PRICE) and five possible
independent variables that might be used in a multiple regression model of PRICE.
These values are stored in the file Assignment 3 Q2.xls in the following way
PRICE The selling price of the home (in A2:A45)
SQMTRS The area of the home in square metres (in B2:B45)
BDRMS The number of bedrooms in the home (in C2:C45)
BTHRMS The number of bathrooms in the home (in D2:D45)
GARAGE The number of garages in the home (in E2:E45)
SSQMTRS The squared value of the area of the home (in F2:F45)
In this question we use a level of significance of = 0.10 and we also assume that our main
goal is to obtain accurate forecasts of the PRICE variable for houses with different features.
Answer the following questions
a: Obtain the correlation coefficients for every possible combination of the six variables listed
above along with the descriptive statistics for all six variables. (4 marks)
b: As the manager thinks the key variable which affects the PRICE is the area or the
SQMTRS variable, the analyst obtains the scatter diagram showing PRICE as the
dependent variable and SQMTRS as the independent variable. Briefly comment on what
this diagram and the relevant correlation coefficient are telling us about the relationships
between PRICE and this SQMTRS independent variable. (5 marks)
c: Estimate the simple regression model in which the PRICE variable is a linear function of
the SQMTRS variable.
PRICEi = 0 + 1SQMTRSi + i
Write down your estimated model and briefly comment on how we interpret the estimated
values of the intercept and the coefficient of the SQMTRS variable.
Using the residuals from this model briefly discuss whether there is evidence that the
errors are not consistent with the 2 relevant assumptions about the error terms.
Use the Excel output to discuss the utility of this model and whether or not the SQMTRS
variable has a significant impact on the PRICE variable.
Use this model to forecast the value of PRICE when SQMTRS is equal to its average
value obtained in part a of this question. (8 marks)
6
Question 2 (continued)
d: From her own professional knowledge along with the scatter diagram in part b of the
question the analyst now decides that the PRICE may not be a simple linear function of
the area of a home (SQMTRS). She decides to use the squared value of the area of a
home (SSQMTRS) as an additional independent variable in a multiple regression model of
PRICE. Estimate the following multiple regression model in which the PRICE variable is
now a quadratic function of the area of the house.
PRICEi = 0 + 1SQMTRSi + 2SSQMTRSi + i
Write down your estimated model. Using your model obtain the impact on PRICE when
SQMTRS changes by 10 when SQMTRS is equal to its average value. Does the impact of
a 10 unit change in SQMTRS on PRICE in this model differ from its impact in the linear
model in part c.
Briefly explain why you think that Multicollinearity is or is not affecting the quality of the
estimated coefficient values in this model.
Using the residuals from this model briefly discuss whether there is evidence that the
errors are not consistent with the 2 relevant assumptions about the error terms.
Use the Excel output to discuss the utility of this model and whether or not the different
independent variables have a significant impact on the SALES variable.
Use this model to forecast the value of PRICE when SQMTRS is equal to its average value.
(9 marks)
e. The analyst points out to the manager that PRICE is often affected by the overall quality
of the house. One way to capture the impact of quality is to use the number of bathrooms
or the BTHRMS variable as usually it is higher quality houses which have been built more
recently that have more than one bathroom. The analyst generates a new independent
variable AREABATH by multiplying the area SQMTRS by the number of bathrooms
BTHRMS. Estimate the following multiple regression model in which the model we use for
the PRICE variable is now.
PRICEi = 0 + 1SQMTRSi + 2AREABATH i + i
Write down your estimated model. Using your model obtain the impact on PRICE when
SQMTRS changes by 10 when SQMTRS is equal to its average value and both when the
house has one bathroom and when the house has two bathrooms. Does the impact of a
10 unit change in SQMTRS on PRICE in this model differ from its impact in the linear
model in part c.
Briefly explain why you think that Multicollinearity is or is not affecting the quality of the
estimated coefficient values in this model.
Using the residuals from this model briefly discuss whether there is evidence that the
errors are not consistent with the 2 relevant assumptions about the error terms.
7
Question 2 e. (continued)
Use the Excel output to discuss the utility of this model and whether or not the different
independent variables have a significant impact on the SALES variable.
Use this model to forecast the value of PRICE when SQMTRS is equal to its average
value. (8 marks)
f. The analyst now decides to obtain a multiple regression model in which PRICE is now
affected by four independent variables. The model they first estimate is
PRICEi = 0 + 1SQMTRSi + 2BBDRMS + 3BTHRMSi + 5GARAGEi + i
The analyst then uses this model to obtain the corresponding parsimonious model.
Total marks 50 = 4 + 5 + 8 + 9 + 8 + 8 + 8
END OF ASSIGNMENT 3