
Zimo Zhu, Runsi Jia, Haoyu Lin

MATH 3636 Final Paper
April 14th, 2020

Prediction model on House Sales in King County

Abstract

Housing prices are influenced by many factors, and predicting them is complicated. Consequently, it is important for buyers and investors to identify the factors that affect the price most before a purchase or an investment. This report focuses on house prices from May 2014 to May 2015 in King County, U.S.A. Specifically, multiple linear regression modelling was applied to make predictions on a new data set. The data, obtained from Kaggle.com, contained 22 variables and 21613 observations. A final multiple linear regression model was selected after a sequence of procedures: exploratory analysis, fitting multiple linear regression models, influential points analysis, final model selection, collinearity consideration, and prediction on the test data set. The accuracy of the final prediction model was 75.35%. Moreover, the extent of each variable's influence on the price is indicated by its correlation with the house price. All of the correlations were positive, with sqft_living the strongest and longitude the weakest. The variables were then split into two groups, categorical variables and explanatory variables, and two outcomes were discovered:
1. Linear relationships exist between price and the explanatory variables.
2. Both view and grade have a significant influence on the price of the house.

Introduction
King County is in an ideal location for living: coastal yet not far from Seattle, providing both convenience and peace. It has attracted many investors who expect profits. Owing to the U.S. economy and the limitation of resources, housing prices have gradually become a popular topic. People are eager to anticipate housing price changes in advance, even though doing so is very complicated. However, data collected on past years' housing prices can be a reliable source for prediction and analysis. This report discusses findings from building a prediction model of King County house prices and applying it to a new dataset.

Many factors may affect the price of a house in King County, U.S.A. In this project, several important components were considered, including but not limited to the number of bathrooms, the number of bedrooms, the total square footage of the house, the number of floors, the year built, the year renovated, the grade, the waterfront, and the view. The data for this report were drawn from Kaggle.com. The main goal is to help potential buyers and investors better understand which components matter most to house prices and how price predictions are made using modeling approaches.

Housing prices can also be affected by government policy, global pandemics, and the economic situation, but those factors are beyond control: they cannot be evaluated precisely and would require more complicated work. Therefore, the assumptions exclude these three factors.

Data Characteristic

The entire dataset is imported from Kaggle.com (Harlfoxem, 2016; see the Bibliography for details). The data include only houses in King County (which contains Seattle) sold between May 2014 and May 2015. The assumptions are that house values are not affected by small short-term updates, and that past data are an accurate estimate of future values.

The obtained data contain 21613 observations, and each observation represents a house sold within the corresponding period. There are 22 pieces of information per observation in the spreadsheet: the ID of the sale, the date the house was sold, the year it was sold, the price, the number of bedrooms, the number of bathrooms, the interior living area (sqft_living), the lot area (sqft_lot), the number of floors, whether the house is on a waterfront, the view rating, the condition, the grade, the area above the basement, the basement area, the year the house was built, the year it was renovated, the zip code, the latitude, the longitude, the interior living area as of 2015, and the lot area as of 2015. From this information, the variables that significantly affect the price of the house were chosen. Furthermore, some variables were modified to better support the data modeling and predictions.

For starters, to build a multiple linear regression model that predicts the price of houses in King County and to evaluate it out of sample, the entire dataset was separated into two parts: a training dataset and a test dataset. The training dataset is used to estimate the parameters of the regression model, and the test dataset is used to validate the model and ensure it makes precise out-of-sample predictions. Variables were then selected from the 22 pieces of information. First, some irrelevant information was left out: the ID of the sale, the date the house was sold, the year it was sold, the year it was built, the year it was renovated, and the zip code, because this information is either not numeric or has no clear influence on the price. Since the correlation between two variables measures the degree to which changes in one are accompanied by corresponding or parallel changes in the other, it can help select the variables whose changes will have the greatest impact on the price. Thus, the correlations between price and the other variables were computed in R, and each was checked for positive correlation (a sketch of this step is shown after Figure 1).
Figure 1. The correlation between price and other variables.
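Below is a minimal R sketch of this step. The file name, the 80/20 split ratio, and the random seed are assumptions not stated in the report; the variable names follow the Kaggle dataset.

    # Read the Kaggle data and split it into training and test sets
    kc <- read.csv("kc_house_data.csv")
    set.seed(123)                                  # assumed seed, for reproducibility
    train_idx <- sample(nrow(kc), 0.8 * nrow(kc))  # assumed 80/20 split
    train <- kc[train_idx, ]
    test  <- kc[-train_idx, ]

    # Correlation of price with every numeric variable (cf. Figure 1)
    num_vars <- train[sapply(train, is.numeric)]
    cors <- cor(num_vars)[, "price"]
    sort(round(cors, 3), decreasing = TRUE)

    # Keep the variables whose correlation with price exceeds 0.3
    names(cors)[cors > 0.3 & names(cors) != "price"]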

As Figure 1 shows, all 15 variables have a positive correlation with the price. However, not all of the correlations are strong: the stronger the correlation, the more important the variable and the greater its effect on the price. Therefore, to further narrow the key variables, only variables with a correlation greater than 0.3 were kept. That brought the count down to 9 variables: the number of bedrooms, the number of bathrooms, the interior living area, the view, the grade, the area above the basement, the basement area, the latitude, and the interior living area as of 2015. These 9 variables were divided into two groups, categorical variables and explanatory variables: the view and the grade were set as categorical variables, and the rest as explanatory variables.

Even though these explanatory variables have strong positive correlations with price, the forms of the relationships remain uncertain. To explore the relationships between price and the explanatory variables, scatter plots of price against each explanatory variable were created in R (a sketch follows Figure 2).
Figure 2. The scatter plots of price by explanatory variables.
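A sketch of the scatter plots in Figure 2, assuming the variable names from the Kaggle dataset; the 3x3 panel layout is a guess at the original figure's arrangement.

    # Scatter plots of price against each explanatory variable (cf. Figure 2)
    expl_vars <- c("bedrooms", "bathrooms", "sqft_living", "sqft_above",
                   "sqft_basement", "lat", "sqft_living15")
    par(mfrow = c(3, 3))
    for (v in expl_vars) {
      plot(train[[v]], train$price, xlab = v, ylab = "price")
    }
    par(mfrow = c(1, 1))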

Figure 2 shows that there are linear relationships between price and the explanatory variables. That is a valuable conclusion: it confirms that no transformation of the explanatory variables is needed, and the data modeling can continue with a multiple linear regression model.

To identify whether view and grade are crucial to the analysis of price, box plots of price by each level of these categorical variables were made (a sketch follows Figure 3).
Figure 3. The box plots of price on different categorical variables.
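A sketch of the box plots in Figure 3; R's formula interface groups price by each level of the categorical variable.

    # Box plots of price by level of view and grade (cf. Figure 3)
    par(mfrow = c(1, 2))
    boxplot(price ~ view,  data = train, main = "Price by view")
    boxplot(price ~ grade, data = train, main = "Price by grade")
    par(mfrow = c(1, 1))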

Figure 3 shows that both view and grade have a significant influence on the price of the house, because the price varies noticeably across the levels of each categorical variable. These two variables were then modified for use in model prediction by converting them into dummy variables. From the summaries of the two categorical variables, view and grade have five and thirteen levels, respectively. More importantly, the box plots of grade show no observations in levels 1 and 2, with the majority of the training data concentrated between levels 3 and 13, so levels 1 and 2 were dropped, cutting grade down to 11 levels. Finally, these two categorical variables were converted to four and ten dummy variables, respectively, for use in the data modeling, as sketched below.
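A sketch of the conversion, relying on the fact that R's lm() automatically expands a factor with k levels into k - 1 dummy variables:

    # Convert view and grade to factors; lm() will create the dummy variables
    train$view  <- factor(train$view)   # 5 observed levels -> 4 dummies
    train$grade <- factor(train$grade)  # levels 1 and 2 never occur,
                                        # so 11 levels -> 10 dummies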

What's more, the analysis of outliers plays a crucial role in the setup of the data modeling. Outliers may exist in the dataset, and steps must be taken to determine whether they have a large impact and whether they should be removed. Hence, a box plot and a normal Q-Q plot of the entire dataset were sketched in R to quantify the number of outliers. As Figure 4 below shows, many observations deviate markedly from the rest of the dataset. Running a short script (sketched after Figure 4) identified more than 900 data points as outliers in the training dataset.
Figure 4. The box plot and normal qq plot of the entire dataset.
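A sketch of how the outlier count could be obtained. The standard 1.5 x IQR boxplot rule on price is an assumption, as the exact rule the authors used is not shown.

    # Box plot, normal Q-Q plot, and outlier count for price (cf. Figure 4)
    boxplot(train$price, main = "Price")
    qqnorm(train$price); qqline(train$price)
    out_vals <- boxplot.stats(train$price)$out  # points beyond the whiskers
    length(out_vals)                            # over 900 in the authors' run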

Model Selection and Interpretation

In the previous section, through analysis and modification of the dataset, the correlations and linear relationships were computed, which helped establish the explanatory and categorical variables. This section introduces the multiple linear regression model used to make predictions on a new dataset. First, a synopsis of how the model was selected and built is provided, including the interpretation of the variables and their effects within the model. Next, influential points in the training data, comprising high leverage points and removable outliers, were considered. After the dataset became more accurate, the modeling process continued with a forward stepwise algorithm, selecting the model with the best adjusted R-squared among several arrangements. Collinearity was also calculated, but it does not have a large impact on the overall prediction result. Finally, this section briefly describes the accuracy of the final model and its predictions on the test dataset.

Exploratory Analysis

To fit a linear regression model to the prices, the first step was to make sure the data satisfy the assumptions of linear regression models. To check the distribution visually, an exploratory analysis was introduced to see how the prices were distributed (a sketch follows Figure 5). As Figure 5 shows, after plotting the prices, the King County house prices were found to be independent and identically distributed. By the central limit theorem, with well over 40 i.i.d. observations, the sampling distribution of the estimates is approximately normal, so the raw dataset can be applied directly to fit a multiple linear regression model.
Figure 5. Exploratory Analysis plot
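A minimal sketch of the exploratory plot; the histogram is an assumption, since the exact plot type in Figure 5 is not specified.

    # Distribution of house prices (cf. Figure 5)
    hist(train$price, breaks = 50, main = "King County house prices", xlab = "price")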

Fitting multiple linear regression model

The method of least squares was used to fit a multiple linear regression model, with stepwise variable selection procedures to identify the best predictors of King County house prices. Different model arrangements were fitted, and the model with the largest adjusted R-squared value was selected. If plain R-squared were used as the criterion, the model with the most variables would always be selected; the adjusted R-squared, which is the coefficient of determination adjusted for degrees of freedom, was therefore used to circumvent this problem.

The stepwise algorithm was applied as follows. The main idea was to choose the model with the highest adjusted R-squared. Candidate regression models were built by adding one variable at a time to the model from the previous step. The adjusted R-squared value for each model can be found in the model summary; to extract only that value, the code summary(fit)$adj.r.squared was used. If the new model's adjusted R-squared value was higher than the previous model's, the addition was kept; if it was lower, the addition was discarded and the previous model retained. The full summary table also shows the significance level of each variable through the stars printed beside it, which made it easy to identify and drop variables without a large impact on the model for the purpose of price prediction. This process was repeated until the model with the highest adjusted R-squared value was reached. After several rearrangements, the variables without "three stars" in the summary table were dropped; three stars mark the most significant variables. A sketch of this loop follows.
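A sketch of the forward stepwise loop under these rules, assuming the train data frame from the earlier sketches with view and grade already converted to factors; the candidate list is illustrative, not the authors' exact sequence.

    # Forward stepwise selection by adjusted R-squared
    candidates <- c("bedrooms", "bathrooms", "sqft_living", "view", "grade",
                    "sqft_above", "sqft_basement", "lat", "sqft_living15")
    selected <- character(0)
    best_adj <- -Inf
    repeat {
      improved <- FALSE
      for (v in setdiff(candidates, selected)) {
        f   <- reformulate(c(selected, v), response = "price")
        adj <- summary(lm(f, data = train))$adj.r.squared
        if (adj > best_adj) {          # keep an addition only if it improves
          best_adj <- adj; best_var <- v; improved <- TRUE
        }
      }
      if (!improved) break             # no variable improves the fit: stop
      selected <- c(selected, best_var)
    }
    selected   # variables chosen
    best_adj   # adjusted R-squared of the chosen model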
Finally, the best model was reached by removing the least significant variable (sqft_lot15). The model ended up with 11 variables, and the statistical details of this model are shown in Appendix A. From that table, the best model found by the forward stepwise regression algorithm has an adjusted R-squared value of 0.7249, and the relationships between these variables and the price appear to be quite strong.

Influential points

Once a preliminary linear regression model is fitted, the next step is to check whether there are any influential points: points that have a non-negligible effect on the fitted model. A data point is influential if it unduly affects the predicted responses, hypothesis test results, or estimated slope coefficients. Influential points include both high leverage points and outliers. A high leverage point is an observation with unusual values of the explanatory variables; an outlier is a data point whose response value does not follow the general trend of the data.

To find high leverage points among the observations, the hat matrix was used; details on how to build the hat matrix can be found in (Gan & Valdez, 2018). The method flags observations whose leverage is greater than three times the average leverage. All leverage points were plotted in Figure 6 below; there were 17 high leverage points in the training dataset. Those points are unusual and not representative observations, and since they affected the model's accuracy, the identified high leverage points were removed before fitting the linear regression model. A new model was then fitted on the training data without these high leverage points (a sketch follows Figure 6), and the R-squared became 0.7275.

Figure 6. High leverage points plot
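A sketch of the leverage computation, where fit0 is an assumed name for the preliminary model from the stepwise step:

    # Flag observations whose leverage exceeds 3x the average (cf. Figure 6)
    lev <- hatvalues(fit0)
    plot(lev, ylab = "leverage")
    abline(h = 3 * mean(lev), col = "red")
    high_lev <- which(lev > 3 * mean(lev))
    length(high_lev)                 # 17 in the authors' run

    # Refit on the training data without the high leverage points
    train2 <- train[-high_lev, ]
    fit1 <- update(fit0, data = train2)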

Now, the next step is to consider the outliers that need to be removed. As previously mentioned, there are more than 900 outliers in the data set. Altering outlying observations is not a standard operating procedure, but it is essential to understand their impact on the predictive model. To better understand the implications of the outliers, the fit of a simple linear regression model on the training dataset was compared with and without them. First, the outliers were extracted from the training data set, and a new dataset without outliers was formed. Then, scatter plots of price against a single variable, bedrooms, were sketched. For each version of the training data, the best fit line was added to the plot, which helps show the trend of the dataset and the importance of the outliers (a sketch follows Figure 7).

Figure 7. The plots of price by bedrooms with/without outliers.
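A sketch of the comparison in Figure 7, reusing the 1.5 x IQR boxplot rule (an assumption) to flag the price outliers; the single predictor bedrooms follows the text.

    # Simple fits of price on bedrooms, with and without the outliers (cf. Figure 7)
    out_idx <- which(train2$price %in% boxplot.stats(train2$price)$out)
    no_out  <- train2[-out_idx, ]
    plot(train2$bedrooms, train2$price, xlab = "bedrooms", ylab = "price")
    abline(lm(price ~ bedrooms, data = train2), col = "blue")  # with outliers
    abline(lm(price ~ bedrooms, data = no_out), col = "red")   # without outliers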

Figure 7 shows that after removing the outliers, the slope of the best fit line became much greater. Thus, if all outliers were removed from the training data, the predictions would contain crucial errors and become unreliable; for instance, large price values would be exaggerated by the larger slope. Declaring an observation an outlier based on just one feature could also lead to unrealistic inferences. Since removing influential outliers affects the slope, to ensure the accuracy of the predictive model it is important to identify the influential outliers, those with a large influence on the data trend, before removing any of the 900 outliers.

Here, Cook's distance, a measure that quantifies the influence of a point on the fitted values, was used to determine which outlier points are influential. To identify influential points, each observation's Cook's distance was compared to an F-distribution: observations whose Cook's distance exceeds the 95th percentile of the F-distribution were flagged as influential, meaning they have a substantial influence on the model. The mean Cook's distance was calculated as 0.0001345. The figure below plots Cook's distance on the training data with the high leverage points already excluded (a sketch follows Figure 8).

Figure 8. Cook’s distance plot
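A sketch following the description above; fit1 is the model refit without high leverage points, and the F-distribution cutoff reproduces the text's rule (other common cutoffs, such as 4/n, exist and may be what was actually used).

    # Cook's distance against the 95th percentile of the F-distribution (cf. Figure 8)
    cd <- cooks.distance(fit1)
    mean(cd)                            # 0.0001345 in the authors' run
    p  <- length(coef(fit1))            # number of estimated coefficients
    n  <- nrow(train2)
    cutoff <- qf(0.95, p, n - p)
    influential <- which(cd > cutoff)
    plot(cd, ylab = "Cook's distance"); abline(h = cutoff, col = "red")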

From the calculation, there were 609 influential points in the training data without high leverage points; a portion of them is listed in Appendix B. Among those 609 influential points, 26 observations were both influential points and outliers. They were found with the function inner_join, which identifies the observations present in both sets. Base R does not provide this function, so the tidyverse package had to be installed before applying it (a sketch follows). The outliers that did not have a large impact were then removed, so that the slope remains essentially the same while most outliers are gone.
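A sketch of this step, reusing out_idx and influential from the sketches above; the dataset's id column is assumed to serve as the join key.

    # Find observations that are both outliers and influential points
    library(dplyr)                       # inner_join ships with the tidyverse
    outlier_rows     <- train2[out_idx, ]
    influential_rows <- train2[influential, ]
    both <- inner_join(outlier_rows, influential_rows, by = "id")
    nrow(both)                           # 26 in the authors' run

    # Drop the non-influential outliers; keep the influential ones
    drop_ids <- setdiff(outlier_rows$id, both$id)
    train3   <- train2[!(train2$id %in% drop_ids), ]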

Final fitted linear regression model

Now that the data are cleaner and more accurate, a new final linear regression model can be fitted on the new dataset. Since the dataset retains only the influential outliers, the new model should be more accurate in terms of data evaluation. From the summary table of the new fitted model, the relationships between the variables appear moderately strong, as shown by the adjusted R-squared value and the p-values. Based on its p-value, the variable sqft_lot was not significant for predicting the price and was therefore dropped. A few other variables that were left out in previous steps were evaluated again, and after several attempts a model giving the maximum adjusted R-squared value was reached. More details on the final fitted model are provided in Appendix C, its summary table. The finalized model contains the following variables: bedrooms, bathrooms, sqft_living, waterfront, condition, yr_built, yr_renovated, lat, age, and reno1. The adjusted R-squared value of the final model was 0.6726, and the relationships between these variables appear quite strong. All variables in this model were statistically significant, and the F-statistic was high and significant. As shown in Appendix C, the residual analysis indicates that the newest model fits better than the previous model on the entire training data, which was shown in Appendix A.

For the accuracy test on the finalized model, a tally-table approach was applied: the accuracy equals one minus the difference between the tallied actual values and the tallied predicted values, divided by the tallied actual values. This gave 0.7953697, a fairly high number indicating a solid, usable model (a sketch follows).
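A sketch of the accuracy measure, under the assumption that "tallying" means summing the values; fit_final is an assumed name for the finalized model.

    # accuracy = 1 - (sum(actual) - sum(predicted)) / sum(actual)
    accuracy <- function(actual, predicted) {
      1 - (sum(actual) - sum(predicted)) / sum(actual)
    }
    accuracy(train3$price, fitted(fit_final))  # 0.7953697 in the authors' run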

Collinearity

Collinearity is a factor that affects the estimation of coefficients. It occurs when one explanatory variable can be approximated by a linear combination of the other explanatory variables; thus, whenever there is more than one explanatory variable, collinearity needs to be considered. The resulting loss of precision in the estimated coefficients weakens the statistical power of the regression model, and the p-values may not be trustworthy because of high correlations between some variables. The variance inflation factor (VIF) can be used to detect collinearity: if a VIF exceeds 10, severe collinearity exists in the data. Since the King County house price data set contains 7 explanatory variables, it can be hard to obtain reliable regression coefficients. After executing the VIF code (sketched below), 4 variables (yr_built, yr_renovated, age, and reno) were found to have VIFs exceeding 10 in the final fitted model. In the presence of multicollinearity, the solution of the regression model becomes unstable; more collinearity values can be found in Appendix D. However, the overall purpose of the project was accurate prediction on the test dataset, and multicollinearity does not affect how well the model fits: a model that satisfies the residual assumptions and has a solid R-squared value can still produce good predictions. Thus, collinearity is not an issue for this project (Frost, 2019).
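A sketch of the VIF check; vif() here comes from the car package, which must be installed separately.

    # Variance inflation factors; values above 10 signal severe collinearity
    library(car)
    vif(fit_final)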

Prediction on test dataset and its accuracy

The final objective was to build a model that can make predictions on the new data set. To make the variables in the test data set match the training data set, the same splitting of the categorical variables had to be applied to the test data set. After the view and grade factors were converted to categorical variables, the next step was to restrict the test data set to the variables in the final model, because the two data sets must match for the predict function to run. The main idea was to use the predict function in R to make predictions on the test dataset from the final model above. The accuracy test described in the previous section was then applied again to measure how accurate the prediction model is; it came out to 0.7534543. Therefore, the prediction model can predict price with an accuracy of 75.35% on the test data set (a sketch follows).
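A sketch of the prediction step, reusing the accuracy function from the previous section; it assumes any derived variables in the final model (such as age and reno1) have been added to the test set the same way as the training set.

    # Match the test set's categorical variables to the training set
    test$view  <- factor(test$view,  levels = levels(train3$view))
    test$grade <- factor(test$grade, levels = levels(train3$grade))

    # Predict and measure out-of-sample accuracy
    pred <- predict(fit_final, newdata = test)
    accuracy(test$price, pred)          # 0.7534543 in the authors' run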

Summary

The housing market is one of the most unpredictable things in the world. Mortgage bubble crises have occurred in many countries, and housing price changes are also highly affected by the economy and government policy. Nevertheless, a model that is at least somewhat accurate at predicting a new dataset, the test data in this case, was developed. The most accurate model found contains thirteen influential variables, including two categorical variables. The dataset includes only one year of data from King County, so it hopefully represents areas similar to the King County area reasonably well; the results would certainly be more reliable if a larger area or more samples had been taken.

Other selection methods that yield a more accurate model might exist. It is also possible that the model does not predict well on very different data, because the dataset applied in this project only contains data from May 2014 to May 2015 in King County, U.S.A. The results found on this dataset might not be applicable in different areas, and the new preferences among investors in recent years were also not considered. Hence, future studies might need to consider a larger region or use updated data; with accurate analysis, a more viable model for future predictions could be calculated.

What's more, whether the data are suitable was determined by doing exploratory analysis and plotting graphs visually. However, there could be other methods to evaluate the dataset, such as transforming the data in each variable to be normally distributed. Future model selection projects may consider this process.

It would also be interesting to consider a more specific study. For example, one could look at the yr_built variable and set up a model to test whether a relatively old house sells for less than a relatively new house. This would let people understand the individual factors more deeply, which can help investors know what to consider in detail. Another example is comparing prices in Seattle to those in the other cities of King County: because Seattle is the biggest and best-known city in the King County area, there might be differences in house prices, and this question could be addressed with a Wilcoxon rank sum test, which is commonly used to check whether two sets of prices differ significantly. Many more problem-solving directions can be explored in this dataset. As housing prices change in the future, the data will also need to be updated to fit the new circumstances; however, that task is left for future researchers.
Appendices
Appendix A.

Appendix B.

Appendix C.
Appendix D.
Bibliography

Harlfoxem. (2016, August 25). House Sales in King County, USA [Data set]. Retrieved from Kaggle.com
MartinRose, R. (2020, April 21). Linear Regression for Predictive Modeling in R. Retrieved from https://www.dataquest.io/blog/statistical-learning-for-predictive-modeling-r/

Frost, J. (2019, March 15). Multicollinearity in Regression Analysis: Problems, Detection, and Solutions. Retrieved from https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/

Gan, G., & Valdez, E. (2018). Actuarial Statistics with R: Theory and Case Studies. ACTEX.
