Predictive Model Assignment 3 - MLR Model
Table of Contents
1. ABSTRACT
2. INTRODUCTION
2.1 BUSINESS CASE/SCENARIO
2.2 Technical Details of MLR Model (Background)
3. BODY
3.1 DATA UNDERSTANDING & DATA PREPARATION FOR MLR MODEL
3.2 MODEL DEVELOPMENT AND ASSUMPTIONS CHECKS
4. FINAL MODEL SELECTION AND EVALUATION
5. CONCLUSION
6. REFERENCES
7. APPENDIX/APPENDICES
1. ABSTRACT
This paper applies multiple linear regression (MLR), a statistical method that models the association between several independent variables and a dependent variable in order to predict outcomes (Hayes, 2022). When several explanatory (independent) variables may help explain or predict the response variable, they are included in the model and analysed together in a multiple linear regression. MLR is a popular regression tool among analysts, particularly economic analysts, for predicting outcomes of interest.
In this paper, I chose the housing market in Ottawa as a scenario to further my understanding of the predictive model and to apply it in practice. Real estate analysts often use regression models to estimate the value, or adjustment rates, of various property characteristics and ultimately to predict housing sale prices. The body of the paper discusses the technical details of data understanding and data preparation, together with the five essential assumptions that the data must satisfy. I partitioned the sample data into 70% training data and 30% testing data. Observations and an evaluation of the developed model are given accordingly. The paper concludes after successfully developing an MLR model on the sample data (Ottawa housing market).
2. INTRODUCTION:
2.1 BUSINESS CASE/SCENARIO
This paper uses the housing market in Ottawa, Canada as a scenario and examines how certain variables relate to the housing price. Variables such as the number of rooms, bedrooms, and bathrooms, the distance to downtown, car parking capacity, and land size are treated as independent variables that may have a linear relationship with, and therefore help predict, the price of a house.
In a real-world housing market, many factors can influence prices. Nevertheless, in a normal to stable environment, analysts commonly use regression models to predict the outcomes of interest.
2.2 Technical Details of MLR Model (Background)
A multiple linear regression (MLR) model is a statistical analysis tool used to predict outcomes by establishing the relationship between the explanatory variables and the response variable (Hayes, 2022). When more than one explanatory variable may help explain or predict the response variable, all of these explanatory factors are entered into the model and a multiple linear regression analysis is run.
Regression models are widely used in the analytics world for generating predictions in practical contexts, particularly when data are in short supply. A regression model can, for instance, anticipate how many bicycles will be rented on a specific day in the future or how much revenue can be generated seasonally. By establishing a relationship between the system's variables, regression produces a predictable outcome.
Regression models have been widely used since the 1950s, even though the technique was invented decades earlier (Ramcharan, 2006). In those days, economists performed regression calculations on electromechanical desk "calculators", and it could take up to 24 hours to obtain the output of a single regression. Regression techniques continue to be widely used for research and forecasting. In recent years, modified techniques have been developed for a variety of regression problems, including robust regression; regression involving correlated responses such as time series and growth curves; regression in which the predictor (independent) or response variables are curves, images, graphs, or other complex data objects; regression methods that accommodate different kinds of missing data; and more (Malakooti, 2013).
3. BODY:
3.1 DATA UNDERSTANDING & DATA PREPARATION FOR MLR MODEL
An MLR model relies on five essential assumptions about the data:
1. Linearity: Each independent variable has a linear relationship with the dependent variable.
2. No multicollinearity: The independent variables are not highly correlated with one another.
3. Independence: The residuals (and consequently the observations) are independent, with no autocorrelation.
4. Homoscedasticity: The residuals have constant variance at every point in the linear model.
5. Multivariate normality: The residuals of the model are normally distributed.
These five assumptions determine the reliability of the data, which in turn determines the reliability of the outcome. If the data violate one or more of the assumptions, the regression output can be unreliable. Therefore, each assumption should be carefully addressed and met.
The following paragraphs describe the technical steps used to check that the data satisfy each assumption.
Assumption 1: Linearity
A scatter plot between each independent variable and the dependent variable can be created to check for a linear relationship. Ideally, the points should scatter roughly along a diagonal straight line, which indicates a linear relationship between the variables.
For instance,
Image 1. Sample Scatter Plot (Ideal)
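A minimal sketch of this check in Python, using synthetic toy data; the column names x and y are purely illustrative stand-ins for one predictor and the response:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data with a roughly linear relationship between x and y.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 3 * df["x"] + rng.normal(0, 2, 200)

# Visual linearity check: points falling near a straight line suggest linearity.
plt.scatter(df["x"], df["y"], alpha=0.6)
plt.xlabel("Independent variable (x)")
plt.ylabel("Dependent variable (y)")
plt.title("Linearity check")
plt.show()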
If this assumption is violated, there are three options. The first is to apply a square-root transformation to the independent variable, which can make the relationship more linear. The second option is to add another independent variable to the model. The final option is to eliminate the independent variable from the model, as a predictor with no linear relationship to the response may not be useful to include.
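A minimal sketch of the first remedy (the square-root transformation), again on illustrative toy data rather than the assignment's actual working file:

import numpy as np
import pandas as pd

# Toy data whose relationship with the response is curved in x but linear in sqrt(x).
rng = np.random.default_rng(7)
x = rng.uniform(0, 100, 300)
y = 5 * np.sqrt(x) + rng.normal(0, 2, 300)

df = pd.DataFrame({"x": x, "x_sqrt": np.sqrt(x), "y": y})
# The transformed predictor typically correlates more strongly with y here.
print(df.corr().loc["y", ["x", "x_sqrt"]])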
Assumption 2: Multicollinearity
The VIF (variance inflation factor) can be computed for each independent variable to determine the level of correlation among the independent variables. Ideally, the independent variables should not be highly correlated with one another, as high correlation can make the coefficient estimates unreliable.
A common rule of thumb is that the VIF should not be greater than five. Independent variables with high VIF values should therefore be eliminated from the model.
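A minimal sketch of the VIF check using statsmodels; the predictor names are illustrative and the toy data are not the Ottawa sample:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy predictor set (column names are illustrative).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "rooms": rng.integers(1, 6, 300),
    "bathroom": rng.integers(1, 4, 300),
    "car": rng.integers(0, 3, 300),
})
X_const = sm.add_constant(X.astype(float))  # add an intercept column

# One VIF per column; the 'const' row can be ignored.
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)  # rule of thumb: predictors with VIF > 5 are candidates for removal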
Assumption 3: Independence
The Durbin-Watson test, a formal statistical test, can be conducted to determine whether the residuals (and consequently the observations) exhibit autocorrelation. This is the easiest way to check this assumption. If the assumption is violated, it is advisable to ensure that the variables do not differ hugely in scale; another option is to add dummy variables to the model (for seasonal correlation).
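A minimal sketch of the Durbin-Watson check on the residuals of a fitted model, using toy data rather than the assignment's data. Values near 2 suggest no autocorrelation; values near 0 or 4 suggest positive or negative autocorrelation, respectively:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Toy data with independent errors.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1, 200)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(model.resid))  # expect a value close to 2 for this toy data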
Assumption 4: Homoscedasticity
A scatter plot of the standardized residuals versus the predicted values can be created to verify that the residuals have constant variance across the linear model.
The ideal scatter plot is shown below:
Image 2. Sample Scatter Plot (Homoscedasticity)
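A minimal sketch of this residual-versus-predicted plot, with toy data and a simple standardization of the residuals (not necessarily the exact standardization Excel uses):

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# Toy data and a simple OLS fit.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1, 200)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Standardized residuals vs. fitted values: the spread should stay roughly constant.
std_resid = model.resid / np.std(model.resid, ddof=1)
plt.scatter(model.fittedvalues, std_resid, alpha=0.6)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Standardized residuals")
plt.show()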
Assumption 5: Multivariate normality
Q-Q plots can be used to check whether the residuals are normally distributed. Ideally, the points should fall close to a diagonal straight line, as shown below.
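A minimal sketch of the Q-Q check with statsmodels, again on toy data:

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# Toy data and a simple OLS fit.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1, 200)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Q-Q plot of the residuals; points near the 45-degree line indicate normality.
sm.qqplot(model.resid, line="45", fit=True)
plt.show()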
3.2 MODEL DEVELOPMENT AND ASSUMPTIONS CHECKS
For this paper, I developed a model for the scenario of predicting prices in the Ottawa housing market. I used real data and cleaned the dataset by selecting the variables that are relevant for the MLR model. The table below shows the selected variables and the first five rows from a sample of 1,020 rows:
Table 1: Extract from Working Excel
Ottawa Housing Market
Rooms Distance Bedroom2 Bathroom Car Land-Size Price
2 2.5 2 1 1 202 $1,480,000
2 2.5 2 1 0 156 $1,035,000
3 2.5 3 2 0 134 $1,465,000
3 2.5 3 2 1 94 $850,000
4 2.5 3 1 2 120 $1,600,000
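A minimal sketch of this preparation step in Python; the file name ottawa_housing.csv is hypothetical, while the column names follow Table 1:

import pandas as pd

# Load the raw data and keep only the columns used in the MLR model.
cols = ["Rooms", "Distance", "Bedroom2", "Bathroom", "Car", "Land-Size", "Price"]
housing = pd.read_csv("ottawa_housing.csv", usecols=cols)

# Drop incomplete rows before modelling and compare the head with Table 1.
housing = housing.dropna(subset=cols)
print(housing.head())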
I then partitioned the sample data randomly into 70% training data and 30% testing data using Excel functions.
I also applied a square-root transformation to the housing price for easier computation in Excel.
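A minimal sketch of the same 70/30 partition and square-root transform using scikit-learn instead of Excel; the tiny stand-in DataFrame simply reuses the rows from Table 1:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the cleaned data set (values from Table 1).
housing = pd.DataFrame({
    "Rooms": [2, 2, 3, 3, 4],
    "Bathroom": [1, 1, 2, 2, 1],
    "Car": [1, 0, 0, 1, 2],
    "Price": [1480000, 1035000, 1465000, 850000, 1600000],
})
housing["sqrt_price"] = np.sqrt(housing["Price"])  # square-root transform of the price

# Random 70/30 split into training and testing partitions.
train, test = train_test_split(housing, test_size=0.30, random_state=42)
print(len(train), len(test))  # roughly 70% / 30% of the rows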
Assumption 2: Linearity
The scatter plots below show a good linear relationship between each independent variable and the dependent variable.
Graph 3. Linearity between Room & Price Graph 4. Linearity between Bedrooms & Price
Graph 5. Linearity between Bathrooms & Price Graph 6. Linearity between Car & Price
Assumption 3: Multicollinearity
The updated correlation matrix is shown below.
Table 3. Updated Correlation Matrix (Multicollinearity Test 2)

           Rooms        Bathroom     Car          Price
Rooms      1            0.55570944   0.38394867   0.50681001
Bathroom   0.55570944   1            0.24591079   0.37360588
Car        0.38394867   0.24591079   1            0.2396337
Price      0.50681001   0.37360588   0.2396337    1
At this stage, I removed three variables (Bedroom, Distance, and Land Size) because they violated the assumptions.
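A minimal sketch of how such a correlation matrix can be produced with pandas; the toy data here are not the real Ottawa sample, whose values appear in Table 3:

import numpy as np
import pandas as pd

# Toy data for the retained predictors and the price.
rng = np.random.default_rng(4)
rooms = rng.integers(1, 6, 300)
df = pd.DataFrame({
    "Rooms": rooms,
    "Bathroom": np.clip(rooms - rng.integers(0, 2, 300), 1, None),
    "Car": rng.integers(0, 3, 300),
})
df["Price"] = 300000 * df["Rooms"] + 150000 * df["Bathroom"] + rng.normal(0, 200000, 300)

# Pairwise correlations, analogous to Excel's correlation tool.
print(df[["Rooms", "Bathroom", "Car", "Price"]].corr().round(3))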
Assumption 4: Autocorrelation
The scatter plot below shows the residuals scattered randomly around zero with no clear pattern, suggesting that they are not autocorrelated.
Graph 7. Residuals Scatter Plot
Assumption 5: Homoscedasticity
The scatter plot below shows that the standardized residuals have roughly constant variance across the linear model.
Graph 8. Standard Residuals Scatter Plot
4. FINAL MODEL SELECTION AND EVALUATION
For my final model selection, I performed a variable importance test. The summary output of the regression is as follows:
Table 4. Regression Output

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.53734981
R Square            0.28874482
Adjusted R Square   0.28668022
Standard Error      0.12340495
Observations        692

ANOVA
             df     SS           MS           F            Significance F
Regression   2      4.25964228   2.12982114   139.854998   1.0538E-51
Residual     689    10.4926302   0.01522878
Total        691    14.7522724

             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
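A minimal sketch of fitting the final MLR model with statsmodels and printing a summary comparable to the Excel output. The choice of Rooms and Bathroom as the two retained predictors is an assumption made for illustration, and the toy data do not reproduce the values in Table 4:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy training partition on the square-root price scale.
rng = np.random.default_rng(5)
train = pd.DataFrame({
    "Rooms": rng.integers(1, 6, 692),
    "Bathroom": rng.integers(1, 4, 692),
})
train["sqrt_price"] = (
    600 + 120 * train["Rooms"] + 80 * train["Bathroom"] + rng.normal(0, 150, 692)
)

# Fit ordinary least squares and print the summary (R-squared, F statistic,
# and the coefficient table, analogous to Excel's regression output).
X = sm.add_constant(train[["Rooms", "Bathroom"]].astype(float))
model = sm.OLS(train["sqrt_price"], X).fit()
print(model.summary())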
For my final predictive model of the Ottawa housing market, I compared the predicted prices with the actual prices in the test data. The model shows moderate accuracy, and the scatter graph below shows the degree of closeness between the actual and predicted prices.
Scatter plot: predicted_price vs. actual price (test data), with fitted trend line f(x) = 0.3217x + 723220 and R² = 0.3147.
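A minimal sketch of this evaluation step: predict on the test partition, transform the square-root predictions back to the dollar scale, and compare them with the actual prices. The toy data and the choice of retained predictors are again assumptions for illustration:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy data on the square-root price scale, split into train and test partitions.
rng = np.random.default_rng(6)
data = pd.DataFrame({"Rooms": rng.integers(1, 6, 1000),
                     "Bathroom": rng.integers(1, 4, 1000)})
data["sqrt_price"] = 600 + 120 * data["Rooms"] + 80 * data["Bathroom"] + rng.normal(0, 150, 1000)
train, test = data.iloc[:700], data.iloc[700:]

# Fit on the training partition, predict on the test partition.
X_cols = ["Rooms", "Bathroom"]
model = sm.OLS(train["sqrt_price"], sm.add_constant(train[X_cols].astype(float))).fit()
pred_sqrt = model.predict(sm.add_constant(test[X_cols].astype(float)))

# Square the predictions to return to the dollar scale and compare with actuals.
actual_price = test["sqrt_price"] ** 2
predicted_price = pred_sqrt ** 2
print("R^2 (predicted vs. actual):", np.corrcoef(actual_price, predicted_price)[0, 1] ** 2)

plt.scatter(actual_price, predicted_price, alpha=0.5)
plt.xlabel("Actual price")
plt.ylabel("Predicted price")
plt.show()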
5. CONCLUSION
Multiple linear regression is a data analysis tool used to predict an outcome when the dependent variable is quantitative and there are multiple independent variables that have a linear relationship with it. In developing a multiple regression model, the five assumptions (a linear relationship, no multicollinearity, independence, homoscedasticity, and multivariate normality) act as a screening or filtering process to determine whether each independent variable is a significant predictor of the dependent variable and, therefore, whether the analysis should continue. If the summary output yields an acceptable R-square value and significant p-values, the model can be considered successful.
6. REFERENCES
Ramcharan, R. (2006, March). Regressions: Why are economists obsessed with them? International Monetary Fund. https://www.imf.org/external/pubs/ft/fandd/2006/03/basics.htm
7. APPENDIX/APPENDICES
Table 3. Correlation Matrix (Multicollinearity Test 2)
Table 4. Regression Summary Output (Regression Statistics and ANOVA)
Graph 3. Linearity between Rooms & Price
Graph 4. Linearity between Bedrooms & Price
Graph 5. Linearity between Bathrooms & Price
Graph 6. Linearity between Car & Price
Graph 7. Residuals Scatter Plot
Graph 8. Standard Residuals Scatter Plot
Predicted price vs. actual price scatter plot (f(x) = 0.3217x + 723220, R² = 0.3147)