Predictive Model Assignment 3 - MLR Model
Table of Contents
1. ABSTRACT
2. INTRODUCTION
2.1 BUSINESS CASE/SCENARIO
2.2 Technical Details of MLR Model (Background)
3. BODY
3.1 DATA UNDERSTANDING & DATA PREPARATION FOR MLR MODEL
3.2 MODEL DEVELOPMENT AND ASSUMPTIONS CHECKS
4. FINAL MODEL SELECTION AND EVALUATION
5. CONCLUSION
6. REFERENCES
7. APPENDIX/APPENDICES
1. ABSTRACT
This paper applies multiple linear regression (MLR), a statistical method that models the association between several independent variables and a dependent variable in order to predict outcomes (Hayes, 2022). When several explanatory (independent) variables may help explain or predict the response variable, they are included in the model and analysed together in a multiple linear regression. MLR is a popular regression tool among analysts, particularly economic analysts, for predicting outcomes of interest.
In this paper, I chose the housing market in Ottawa as a scenario to further my understanding of the predictive model and to apply it in practice. Real estate analysts often use regression models to estimate the value, or adjustment rates, of various property characteristics and ultimately to predict housing sale prices. The body of the paper discusses the technical details of data understanding and data preparation, together with the five essential assumptions that the data must satisfy. I partitioned the sample data into 70% training data and 30% testing data. Observations and an evaluation of the developed model are given accordingly. The paper concludes after successfully developing an MLR model on the sample data (Ottawa housing market).
2. INTRODUCTION:
2.1 BUSINESS CASE/SCENARIO
This paper uses the housing market in Ottawa, Canada as a scenario and examines how certain variables relate to the housing price. Variables such as the number of rooms, bedrooms, and bathrooms, the distance to downtown, car parking capacity, and land size are treated as independent variables that may have a linear relationship with, and therefore help predict, the price of a house.
In a real-world housing market, many factors can influence prices. Nevertheless, in a normal to stable environment, analysts commonly use regression models to predict the outcomes of interest.
2.2 Technical Details of MLR Model (Background)
A multiple linear regression (MLR) model is a statistical analysis tool used to predict outcomes by establishing the relationship between the explanatory variables and the response variable (Hayes, 2022). When more than one explanatory variable may help explain or predict the response variable, all of these explanatory factors are entered into the model and a multiple linear regression analysis is run.
Regression models are widely used in the analytics world for generating predictions in practical contexts, particularly when data are in short supply. A regression model can, for instance, anticipate how many bicycles will be rented on a specific day in the future or how much revenue can be generated seasonally. By establishing a relationship between the system's variables, regression produces a predictable outcome.
Regression models have been widely used since the 1950s, even though the technique was invented decades earlier (Ramcharan, 2006). In those days, economists performed regression calculations on electromechanical desk "calculators", and it could take up to 24 hours to obtain the output of a single regression. Regression techniques continue to be widely used for research and forecasting. In recent years, modified techniques have been developed for a variety of regression problems, including robust regression; regression involving correlated responses such as time series and growth curves; regression in which the predictor (independent) or response variables are curves, images, graphs, or other complex data objects; regression methods that accommodate different kinds of missing data; and more (Malakooti, 2013).
3. BODY:
3.1 DATA UNDERSTANDING & DATA PREPARATION FOR MLR MODEL
An MLR model relies on five essential assumptions about the data:
1. Linearity: Each independent variable has a linear relationship with the dependent variable.
2. No multicollinearity: The independent variables are not highly correlated with one another.
3. Independence: The residuals (and consequently the observations) are independent, with no autocorrelation.
4. Homoscedasticity: The residuals have constant variance at every point in the linear model.
5. Multivariate normality: The residuals of the model are normally distributed.
These five assumptions determine the reliability of the data, which in turn determines the reliability of the outcome. If the data violate one or more of the assumptions, the regression output can be unreliable. Therefore, each assumption should be carefully addressed and met.
The following paragraphs describe the technical steps used to check that the data satisfy each assumption.
Assumption 1: Linearity
A scatter plot between each independent variable and the dependent variable can be created to check for a linear relationship. Ideally, the points should scatter roughly along a diagonal straight line, which indicates a linear relationship between the variables.
For instance,
Image 1. Sample Scatter Plot (Ideal)
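A minimal sketch of this check in Python, using synthetic toy data; the column names x and y are purely illustrative stand-ins for one predictor and the response:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data with a roughly linear relationship between x and y.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 3 * df["x"] + rng.normal(0, 2, 200)

# Visual linearity check: points falling near a straight line suggest linearity.
plt.scatter(df["x"], df["y"], alpha=0.6)
plt.xlabel("Independent variable (x)")
plt.ylabel("Dependent variable (y)")
plt.title("Linearity check")
plt.show()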
If this assumption is violated, there are three options. The first is to apply a square-root transformation to the independent variable, which can make the relationship more linear. The second option is to add another independent variable to the model. The final option is to eliminate the independent variable from the model, as a predictor with no linear relationship to the response may not be useful to include.
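A minimal sketch of the first remedy (the square-root transformation), again on illustrative toy data rather than the assignment's actual working file:

import numpy as np
import pandas as pd

# Toy data whose relationship with the response is curved in x but linear in sqrt(x).
rng = np.random.default_rng(7)
x = rng.uniform(0, 100, 300)
y = 5 * np.sqrt(x) + rng.normal(0, 2, 300)

df = pd.DataFrame({"x": x, "x_sqrt": np.sqrt(x), "y": y})
# The transformed predictor typically correlates more strongly with y here.
print(df.corr().loc["y", ["x", "x_sqrt"]])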
Assumption 2: Multicollinearity
The VIF (variance inflation factor) can be computed for each independent variable to determine the level of correlation among the independent variables. Ideally, the independent variables should not be highly correlated with one another, as high correlation can make the coefficient estimates unreliable.
A common rule of thumb is that the VIF should not be greater than five. Independent variables with high VIF values should therefore be eliminated from the model.
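A minimal sketch of the VIF check using statsmodels; the predictor names are illustrative and the toy data are not the Ottawa sample:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy predictor set (column names are illustrative).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "rooms": rng.integers(1, 6, 300),
    "bathroom": rng.integers(1, 4, 300),
    "car": rng.integers(0, 3, 300),
})
X_const = sm.add_constant(X.astype(float))  # add an intercept column

# One VIF per column; the 'const' row can be ignored.
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)  # rule of thumb: predictors with VIF > 5 are candidates for removal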
Assumption 3: Independence
The Durbin-Watson test, a formal statistical test, can be conducted to determine whether the residuals (and consequently the observations) exhibit autocorrelation. This is the easiest way to check this assumption. If the assumption is violated, it is advisable to ensure that the variables do not differ hugely in scale; another option is to add dummy variables to the model (for seasonal correlation).
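A minimal sketch of the Durbin-Watson check on the residuals of a fitted model, using toy data rather than the assignment's data. Values near 2 suggest no autocorrelation; values near 0 or 4 suggest positive or negative autocorrelation, respectively:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Toy data with independent errors.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1, 200)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(model.resid))  # expect a value close to 2 for this toy data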
Assumption 4: Homoscedasticity
A scatter plot of the standardized residuals versus the predicted values can be created to verify that the residuals have constant variance across the linear model.
The ideal scatter plot is shown below:
Image 2. Sample Scatter Plot (Homoscedasticity)
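A minimal sketch of this residual-versus-predicted plot, with toy data and a simple standardization of the residuals (not necessarily the exact standardization Excel uses):

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# Toy data and a simple OLS fit.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1, 200)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Standardized residuals vs. fitted values: the spread should stay roughly constant.
std_resid = model.resid / np.std(model.resid, ddof=1)
plt.scatter(model.fittedvalues, std_resid, alpha=0.6)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Standardized residuals")
plt.show()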
Assumption 5: Multivariate normality
Q-Q plots can be used to check whether the residuals are normally distributed. Ideally, the points should fall close to a diagonal straight line, as shown below.
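A minimal sketch of the Q-Q check with statsmodels, again on toy data:

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# Toy data and a simple OLS fit.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1, 200)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Q-Q plot of the residuals; points near the 45-degree line indicate normality.
sm.qqplot(model.resid, line="45", fit=True)
plt.show()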
3.2 MODEL DEVELOPMENT AND ASSUMPTIONS CHECKS
For this paper, I developed a model for the scenario of predicting prices in the Ottawa housing market. I used real data and cleaned the dataset by selecting the variables that are relevant for the MLR model. The table below shows the selected variables and the first five rows from a sample of 1,020 rows:
Table 1: Extract from Working Excel
Ottawa Housing Market
Rooms Distance Bedroom2 Bathroom Car Land-Size Price
2 2.5 2 1 1 202 $1,480,000
2 2.5 2 1 0 156 $1,035,000
3 2.5 3 2 0 134 $1,465,000
3 2.5 3 2 1 94 $850,000
4 2.5 3 1 2 120 $1,600,000
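A minimal sketch of this preparation step in Python; the file name ottawa_housing.csv is hypothetical, while the column names follow Table 1:

import pandas as pd

# Load the raw data and keep only the columns used in the MLR model.
cols = ["Rooms", "Distance", "Bedroom2", "Bathroom", "Car", "Land-Size", "Price"]
housing = pd.read_csv("ottawa_housing.csv", usecols=cols)

# Drop incomplete rows before modelling and compare the head with Table 1.
housing = housing.dropna(subset=cols)
print(housing.head())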
I then partitioned the sample data randomly into 70% training data and 30% testing data using Excel functions.
I also applied a square-root transformation to the housing price for easier computation in Excel.
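A minimal sketch of the same 70/30 partition and square-root transform using scikit-learn instead of Excel; the tiny stand-in DataFrame simply reuses the rows from Table 1:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the cleaned data set (values from Table 1).
housing = pd.DataFrame({
    "Rooms": [2, 2, 3, 3, 4],
    "Bathroom": [1, 1, 2, 2, 1],
    "Car": [1, 0, 0, 1, 2],
    "Price": [1480000, 1035000, 1465000, 850000, 1600000],
})
housing["sqrt_price"] = np.sqrt(housing["Price"])  # square-root transform of the price

# Random 70/30 split into training and testing partitions.
train, test = train_test_split(housing, test_size=0.30, random_state=42)
print(len(train), len(test))  # roughly 70% / 30% of the rows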
Assumption 2: Linearity
The scatter plots below show a good linear relationship between each independent variable and the dependent variable.
Graph 3. Linearity between Room & Price Graph 4. Linearity between Bedrooms & Price
Graph 5. Linearity between Bathrooms & Price Graph 6. Linearity between Car & Price
Assumption 3: Multicollinearity
The updated correlation matrix is shown below.
Table 3. Updated Correlation Matrix (Multicollinearity Test 2)

           Rooms        Bathroom     Car          Price
Rooms      1            0.55570944   0.38394867   0.50681001
Bathroom   0.55570944   1            0.24591079   0.37360588
Car        0.38394867   0.24591079   1            0.2396337
Price      0.50681001   0.37360588   0.2396337    1
At this stage, I removed three variables (Bedroom, Distance, and Land Size) because they violated the assumptions.
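A minimal sketch of how such a correlation matrix can be produced with pandas; the toy data here are not the real Ottawa sample, whose values appear in Table 3:

import numpy as np
import pandas as pd

# Toy data for the retained predictors and the price.
rng = np.random.default_rng(4)
rooms = rng.integers(1, 6, 300)
df = pd.DataFrame({
    "Rooms": rooms,
    "Bathroom": np.clip(rooms - rng.integers(0, 2, 300), 1, None),
    "Car": rng.integers(0, 3, 300),
})
df["Price"] = 300000 * df["Rooms"] + 150000 * df["Bathroom"] + rng.normal(0, 200000, 300)

# Pairwise correlations, analogous to Excel's correlation tool.
print(df[["Rooms", "Bathroom", "Car", "Price"]].corr().round(3))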
Assumption 4: Autocorrelation
The scatter plot below shows the residuals scattered randomly around zero with no clear pattern, suggesting that they are not autocorrelated.
Graph 7. Residuals Scatter Plot
Assumption 5: Homoscedasticity
The scatter plot below shows that the standardized residuals have roughly constant variance across the linear model.
Graph 8. Standard Residuals Scatter Plot
4. FINAL MODEL SELECTION AND EVALUATION
For my final model selection, I performed a variable importance test. The summary output of the regression is as follows:
Table 4. Regression Output

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.53734981
R Square            0.28874482
Adjusted R Square   0.28668022
Standard Error      0.12340495
Observations        692

ANOVA
             df     SS           MS           F            Significance F
Regression   2      4.25964228   2.12982114   139.854998   1.0538E-51
Residual     689    10.4926302   0.01522878
Total        691    14.7522724

             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
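A minimal sketch of fitting the final MLR model with statsmodels and printing a summary comparable to the Excel output. The choice of Rooms and Bathroom as the two retained predictors is an assumption made for illustration, and the toy data do not reproduce the values in Table 4:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy training partition on the square-root price scale.
rng = np.random.default_rng(5)
train = pd.DataFrame({
    "Rooms": rng.integers(1, 6, 692),
    "Bathroom": rng.integers(1, 4, 692),
})
train["sqrt_price"] = (
    600 + 120 * train["Rooms"] + 80 * train["Bathroom"] + rng.normal(0, 150, 692)
)

# Fit ordinary least squares and print the summary (R-squared, F statistic,
# and the coefficient table, analogous to Excel's regression output).
X = sm.add_constant(train[["Rooms", "Bathroom"]].astype(float))
model = sm.OLS(train["sqrt_price"], X).fit()
print(model.summary())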
For my final predictive model of the Ottawa housing market, I compared the predicted prices with the actual prices in the test data. The model shows moderate accuracy, and the scatter graph below shows the degree of closeness between the actual and predicted prices.
Scatter plot: predicted_price vs. actual price (test data), with fitted trend line f(x) = 0.3217x + 723220 and R² = 0.3147.
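A minimal sketch of this evaluation step: predict on the test partition, transform the square-root predictions back to the dollar scale, and compare them with the actual prices. The toy data and the choice of retained predictors are again assumptions for illustration:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy data on the square-root price scale, split into train and test partitions.
rng = np.random.default_rng(6)
data = pd.DataFrame({"Rooms": rng.integers(1, 6, 1000),
                     "Bathroom": rng.integers(1, 4, 1000)})
data["sqrt_price"] = 600 + 120 * data["Rooms"] + 80 * data["Bathroom"] + rng.normal(0, 150, 1000)
train, test = data.iloc[:700], data.iloc[700:]

# Fit on the training partition, predict on the test partition.
X_cols = ["Rooms", "Bathroom"]
model = sm.OLS(train["sqrt_price"], sm.add_constant(train[X_cols].astype(float))).fit()
pred_sqrt = model.predict(sm.add_constant(test[X_cols].astype(float)))

# Square the predictions to return to the dollar scale and compare with actuals.
actual_price = test["sqrt_price"] ** 2
predicted_price = pred_sqrt ** 2
print("R^2 (predicted vs. actual):", np.corrcoef(actual_price, predicted_price)[0, 1] ** 2)

plt.scatter(actual_price, predicted_price, alpha=0.5)
plt.xlabel("Actual price")
plt.ylabel("Predicted price")
plt.show()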
5. CONCLUSION
Multiple linear regression is a data analysis tool used to predict an outcome when the dependent variable is quantitative and there are multiple independent variables that have a linear relationship with it. In developing a multiple regression model, the five assumptions (a linear relationship, no multicollinearity, independence, homoscedasticity, and multivariate normality) act as a screening or filtering process to determine whether each independent variable is a significant predictor of the dependent variable and, therefore, whether the analysis should continue. If the summary output yields an acceptable R-square value and significant p-values, the model can be considered successful.
6. REFERENCES
Ramcharan, R. (2006, March). Regressions: Why are economists obsessed with them? International Monetary Fund. https://www.imf.org/external/pubs/ft/fandd/2006/03/basics.htm
7. APPENDIX/APPENDICES
Table 3. Correlation Matrix (Multicollinearity Test 2)
Table 4. Regression Summary Output (Regression Statistics and ANOVA)
Graph 3. Linearity between Rooms & Price
Graph 4. Linearity between Bedrooms & Price
Graph 5. Linearity between Bathrooms & Price
Graph 6. Linearity between Car & Price
Graph 7. Residuals Scatter Plot
Graph 8. Standard Residuals Scatter Plot
Predicted price vs. actual price scatter plot (f(x) = 0.3217x + 723220, R² = 0.3147)