Certificate: House Price "
CERTIFICATE
DECLARATION
“I hereby declare that this submission is my own work and that, to the best of my
knowledge and belief, it contains no material previously published or written by
another person nor material which has been accepted for the award of any other
degree or diploma of the university or other institute of higher learning, except where
due acknowledgment has been made in the text.”
Date:
ACKNOWLEDGEMENT
We are immensely thankful and express our deep sense of gratitude and indebtedness to
our deputy head, Mr. Ram Dubey, and the head of the department, Mr. Sambhav Agarwal.
We are extremely thankful to our Project Guide, Miss Sana Afreen (Assistant
Professor, Dept. of Computer Science and Engineering), for suggesting a problem
of vital interest; without her benign guidance and concrete advice
this project would not have seen the light of day. Their continuous monitoring and
time management were an inspiring force for us to complete the project.
We would also like to thank all our colleagues and friends who
supported and helped us in the completion of this work.
Lastly, we would like to thank our parents. No words can express our heartfelt
gratitude to them.
INDEX
1. INTRODUCTION 07
BACKGROUND
PURPOSE
SCOPE
OBJECTIVE
2. PROBLEM IDENTIFICATION 12
4. PROJECT DESCRIPTION 14
5. SNAPSHOT 20
6. FUTURE WORK 26
7. CONCLUSION 27
8. REFERENCES 28
9. BIODATA 29
10. TRAINING/INTERNSHIP 31
CERTIFICATE
1. INTRODUCTION
In this project, we develop a model trained and tested on data collected from houses
in Boston’s suburbs, and we evaluate its performance and predictive power.
Once we get a good fit, we will use this model to predict the monetary value of a
house located in the Boston area.
A model like this would be very valuable for a real estate agent, who could make use
of the information it provides on a daily basis.
The real estate sector is an important industry with many stakeholders, ranging from
regulatory bodies to private companies and investors. Among these stakeholders,
there is a high demand for a better understanding of the industry’s operational
mechanisms and driving factors. Today there is a large amount of data available on
relevant statistics as well as on additional contextual factors, and it is natural to try to
make use of these in order to improve our understanding of the industry. Notably, this
has been done in Zillow’s Zestimate and Kaggle’s competitions on housing prices. In
some cases, non-traditional variables have proved to be useful predictors of real estate
trends. For example, it has been observed that Seattle apartments close to specialty food
stores such as Whole Foods experienced a higher-than-average increase in value.
1.1 BACKGROUND
Housing prices are an important reflection of the economy, and housing price ranges
are of great interest to both buyers and sellers. Ask a home buyer to describe their
dream house, and they probably won’t begin with the height of the basement ceiling or
the proximity to an east-west railroad. But this playground competition’s dataset
shows that much more influences price negotiations than the number of bedrooms or a
white picket fence.
Different levels of accuracy have been achieved using different
methodologies, techniques and datasets. A study of independent real estate
market forecasting of house prices using data mining techniques was done by Bahia
[11]. Here the main idea was to construct a neural network model using two types
of neural network. The first one is the Feed Forward Neural Network (FFNN) and the
second one is the Cascade Forward Neural Network (CFNN). It was observed that CFNN
gives a better result than FFNN using the MSE performance metric.
2022 2nd International Conference on Intelligent Technologies (CONIT), Karnataka, India,
June 24-26, 2022. 978-1-6654-8407-7/22/$31.00 ©2022 IEEE
1.2 PURPOSE
With the advancement of science and technology, our daily life has become much
easier. In today’s world we use information and communication technology
extensively. Every day a new technology emerges in our digital age that
improves people’s standard of living. Sometimes these new technologies have
negative effects, but most of the time their effects are positive. Artificial
Intelligence, more widely known as AI, is one such technology that has improved the
standard of living of people worldwide. AI is widely used nowadays in fields such as
healthcare, real estate, stock market prediction, weather prediction, automobiles
and many others. AI has many subfields, such as Natural Language
Processing, Machine Vision, Robotics and Expert Systems, but in this study Machine
Learning (ML) is used. ML is a branch of AI which performs tasks using past or
recorded data and various algorithms. These tasks include classification,
association, clustering and regression. ML can be used to build predictive models
that make predictions about the future, or descriptive models that extract
knowledge from the given data. The main difference
between ML and conventional programming is that, in
conventional programming, programs are written manually and the computer generates
the output from the input data based on the programmed logic. In ML, however,
the inputs and the outputs are fed into the algorithm, which creates the program. ML
approaches are mainly divided into three categories: Reinforcement
Learning, Unsupervised Learning and Supervised Learning. In supervised learning,
the computer is given inputs and their desired outputs by a supervisor, and the
goal is to learn a general rule that maps a given input to its
desired output. Here, various ML algorithms are trained on the Boston
house dataset to create several models, and the trained models are then
evaluated.
1.3 SCOPE
This model can be considered a baseline for predicting house prices. Further
evaluation can be done by increasing the amount of data: more records can be collected
and more attributes can be added to obtain a much better evaluation of the
model. The data in the Boston house dataset is from 1978, which is almost 50
years old, and since then house prices have changed considerably due to
inflation. Thus, new data can be collected and further evaluation can be made
on the newly collected data. In this paper four models are implemented on the
Boston house dataset: Simple Linear Regression, Polynomial Regression, Lasso
Regression and Ridge Regression. More advanced models such as Support
Vector Machine, Decision Tree, Random Forest and Multiple Linear Regression can
be implemented and the results compared. Other ensemble learning
techniques such as AdaBoost and XGBoost can be used and the results
compared with the previous models. Feature selection techniques such as Linear
Discriminant Analysis, Principal Component Analysis and Independent Component
Analysis can be used before implementing the models, and a study can be
made of the performance of the models before and after applying these feature
selection methods.
1.4 OBJECTIVE
Accurately predicting the value of a plot or house is an important task for many house
owners, house buyers, plot owners, plot buyers and other stakeholders. Real estate
agencies and individuals buy and sell houses all the time: people buy houses to live in
or as an investment, whereas real estate agencies buy them to run a business. The
problem arises in evaluating the cost of the property. Over-valuation and
under-valuation have always been issues in housing markets due to the lack of proper
detection measures, and valuation is a very difficult task. We know that features like
size, area and location affect the price of a property, but there are many other
features that also affect it, such as inflation rates in the market and the age of the
property. In order to overcome these problems, a thorough analysis is done using Machine
Learning (ML), which is a branch of Artificial Intelligence (AI).
2. PROBLEM IDENTIFICATION
Everyone wishes to buy and live in a house which suits their lifestyle and
which provides amenities according to their needs. Many factors have to
be taken into consideration when predicting house prices, such as area, location and
view. It is very difficult to predict house prices because they are constantly changing
and are quite often exaggerated, so people who want to buy houses, and
real estate agencies that want to invest in properties, find it difficult to buy
or sell houses. For this reason, in this paper the author creates an
automated Machine Learning model using Simple Linear Regression, Polynomial
Regression, Ridge Regression and Lasso Regression on the Boston house
dataset to predict future house prices accurately. To measure the accuracy
of these models, metrics such as R-Squared, Root Mean Square Error
(RMSE) and Cross-Validation are used.
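The metrics named above can be sketched in plain Python as follows. This is a minimal illustration only: the function names and the sample prices are assumptions made for the example, not values from the project, and the fold generator uses simple contiguous folds without shuffling.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """R-Squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, nearly equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Hypothetical prices (in $1000s) and predictions, for illustration only.
actual = [24.0, 21.6, 34.7, 33.4]
predicted = [25.0, 20.6, 33.7, 34.4]
print("RMSE:", rmse(actual, predicted))
print("R-Squared:", r_squared(actual, predicted))
print("Fold sizes:", [len(test) for _, test in k_fold_indices(10, 3)])
```

In cross-validation, a model would be fitted on each `train_idx` subset and scored on the matching `test_idx` subset, and the scores would then be averaged.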
3. PREVIEW OF PREVIOUS WORK
A lot of past work has been done on predicting house prices. Different levels of
accuracy have been achieved using different methodologies,
techniques and datasets. A study of independent real estate market forecasting of
house prices using data mining techniques was done by Bahia. Here the main idea was
to construct a neural network model using two types of neural network. The first
one is the Feed Forward Neural Network (FFNN) and the second one is the Cascade
Forward Neural Network (CFNN). It was observed that CFNN gives a better result than
FFNN using the MSE performance metric. Mu et al. analysed a dataset containing Boston
suburb house values using several ML methods: Support Vector
Machine (SVM), Least Square Support Vector Machine (LSSVM) and Partial Least
Square (PLS). SVM and LSSVM gave superior performance compared
to PLS. Beracha et al. showed that high-amenity areas experience greater price
volatility by investigating the correlation between house price volatility, returns and
local amenities. Law [14] finds that the link between house price and
street-based local area is stronger than the link between house price and region-based
local area. To study London house prices, Benin et al. built a Geographically Weighted
Regression (GWR) model considering Euclidean distance, travel time metrics and road
network distance. To reduce prediction errors, Marco et al. used a mixed Geographically
Weighted Regression (GWR) model that emphasizes the importance and complexity
of spatial heterogeneity in Australia. Using state-level data in the USA, Sean et al.
examined the correlation among common shocks, real per capita disposable
income, house prices, net borrowing cost, macroeconomic and spatial factors,
local disturbances and state-level population growth. Using
administrative data from the Netherlands, Joep et al. found that wealthier,
higher-income buyers are associated with higher purchase prices, while wealthier,
higher-income sellers are associated with lower selling prices.
4. PROJECT DESCRIPTION
• Feature Selection
Model Selection
Model selection is one of the most important tasks in ML for making accurate
predictions. Appropriate models must be selected to get good accuracy. There are
various models available under regression analysis, but in this paper four
regression models are used on the Boston house dataset: Simple Linear Regression,
Polynomial Regression, Lasso Regression and Ridge Regression.
After implementing these models, we measure their accuracy by splitting the
dataset into two parts, a training dataset and a test dataset. We use 80% of the
dataset as training data and 20% as test data. The models are fitted
using the training dataset and evaluated on the test dataset.
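The 80/20 split described above can be sketched as follows. This is a minimal illustration in plain Python; the helper name, the fixed seed and the use of row indices as stand-in data are assumptions made for the example.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and split it into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Stand-in for the 506 rows of the dataset (row indices only).
rows = list(range(506))
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))  # 404 102
```

Shuffling before splitting avoids a test set made up only of the last rows of the file, which could differ systematically from the rest.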
Techniques Used
The techniques implemented on the Boston house dataset in this paper are
Simple Linear Regression, Polynomial Regression, Lasso Regression and Ridge
Regression.
Simple Linear Regression
In this type of regression model a linear relationship is established between the target
variable, which is the dependent variable (Y), and a single independent variable (X).
The linear relationship between the dependent and independent variable is established by
fitting a regression line between them. The equation of the line is given by:
Y = a + bX (1)
where “a” and “b” are the model parameters, called regression coefficients.
When we take the value of X as 0, we get the value of “a”, which is the Y intercept
of the line, and “b” is the slope, which signifies the change of Y with the change of X.
If the value of “b” is large, a small change in X produces a large
change in Y, and vice versa. To compute the values of “a” and “b” we use the
Ordinary Least Squares method. The values predicted by the Linear Regression model
may not always be accurate. There may be some difference, so we add an error
term to the original equation (1) to account for it:
Y = a + bX + Ɛ (2)
Some assumptions are made in the case of simple linear regression,
and they are as follows:
1. The number of observations must be greater than the number of parameters present.
2. The validity of the regression data is over a restricted period.
3. The error term has an expected value of 0 and is assumed to be
normally distributed.
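As a sketch of how the Ordinary Least Squares estimates of “a” and “b” in equation (1) can be computed, the closed-form solution for a single predictor is shown below. The function name and the sample data are hypothetical and are used only for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Return (a, b) minimising the sum of squared residuals of Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx          # slope: change in Y per unit change in X
    a = mean_y - b * mean_x  # intercept: value of Y when X = 0
    return a, b

# Hypothetical data: average number of rooms (X) and price in $1000s (Y).
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [11.0, 16.5, 22.5, 30.0, 35.0]
a, b = fit_simple_linear_regression(rooms, price)

# The residuals play the role of the error term in equation (2).
residuals = [yi - (a + b * xi) for xi, yi in zip(rooms, price)]
print(f"a = {a:.2f}, b = {b:.2f}")
print(f"mean residual = {sum(residuals) / len(residuals):.6f}")
```

With an intercept in the model, the fitted residuals average to zero, which mirrors the zero-mean assumption on the error term.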
Polynomial Regression
Y = a + b1X + b2X^2 + b3X^3 + ... + bnX^n (3)
The advantages of polynomial regression are as follows:
1. Polynomial Regression can give a good approximation of the relationship between the
dependent and independent variable when that relationship is not linear.
2. The higher the degree of the polynomial, the more closely it fits the training
dataset.
3. A wide range of curves can be fitted with polynomial regression by varying the
degree of the model.
The disadvantage of polynomial regression is that it is very sensitive to the presence
of outliers in the dataset, as outliers increase the variance of the model. As a result,
when the model encounters unseen data points it may underperform.
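The trade-off between a closer fit and performance on unseen data can be illustrated with the sketch below. It assumes NumPy is available and uses synthetic data (a quadratic trend plus noise), not the Boston dataset; the degrees tried and the alternating train/test split are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: a quadratic trend plus noise (not the Boston data).
x = np.linspace(-3.0, 3.0, 60)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=2.0, size=x.size)

# Alternate points between a training set and a held-out test set.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def rmse(y_true, y_pred):
    """Root Mean Square Error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

train_errors = {}
test_errors = {}
for degree in (1, 2, 6):
    coeffs = np.polyfit(x_train, y_train, deg=degree)  # least-squares fit
    train_errors[degree] = rmse(y_train, np.polyval(coeffs, x_train))
    test_errors[degree] = rmse(y_test, np.polyval(coeffs, x_test))
    print(degree, round(train_errors[degree], 3), round(test_errors[degree], 3))
```

The training error can only fall as the degree rises, because each higher-degree model contains the lower-degree ones. The test error carries no such guarantee, which is why the degree is usually chosen on held-out data.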
Ridge Regression
Ridge Regression reduces to ordinary Linear Regression when the value of λ is 0. A
general Polynomial Regression or Linear Regression will fail if there is high
collinearity between the independent
variables. For this reason, Ridge Regression is used. It can also be applied when
there are more parameters than samples. The Ordinary Least Squares method
determines the values of the parameters in equation (4) that minimize the
sum of squared residuals. In contrast, Ridge Regression determines the
parameter values that minimize the sum of squared residuals
plus an additional penalty term λ*b^2. Ridge Regression thus performs L2 regularization.
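The effect of the λ*b^2 penalty under collinearity can be sketched with the closed-form ridge solution b = (X^T X + λI)^(-1) X^T y. The example assumes NumPy, uses synthetic data with two perfectly collinear columns, and omits the intercept for brevity; the helper name `ridge_fit` is hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) b = X^T y for the coefficient vector b."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 2.0 * x1                      # perfectly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=50)

# X has rank 1, so X^T X is singular and least squares has no unique solution.
print("rank of X:", np.linalg.matrix_rank(X))

coefs = {}
for lam in (0.1, 1.0, 10.0):
    coefs[lam] = ridge_fit(X, y, lam)  # solvable because lam > 0
    print(lam, coefs[lam], float(np.sum(coefs[lam] ** 2)))
```

Adding λ to the diagonal makes the system solvable even though the two columns carry the same information, and larger values of λ shrink the coefficients further.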
Lasso Regression
Lasso, or Least Absolute Shrinkage and Selection Operator, is very much like
Ridge regression. In ML, Lasso regression is used to select a significant subset of
variables. Lasso regression can improve prediction accuracy and gives a model that is
easier to interpret than other models. Similar to ridge, lasso also
adds a small amount of bias to its result, which thereby decreases the variance of
the model. Lasso Regression minimizes the following: Residual Sum of Squares
+ λ * |b|, where |b| is the sum of the absolute values of the coefficients. Here λ
denotes the amount of shrinkage. The main difference between ridge and lasso is that ridge
shrinks the slope asymptotically close to zero, whereas lasso can shrink the slope all
the way down to zero, which results in the elimination of useless parameters
from the equation that do not have any significant role in predicting the value of
the target variable. When the predictors have huge coefficients,
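The difference between the two penalties can be sketched for a single coefficient, where both minimisers have a closed form. The example assumes centred data with no intercept, and the function names and sample values are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning exactly 0 inside it."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def ridge_coefficient(x, y, lam):
    """Minimise sum((y - b*x)^2) + lam * b^2 for a single coefficient b."""
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    s_xx = sum(xi * xi for xi in x)
    return s_xy / (s_xx + lam)

def lasso_coefficient(x, y, lam):
    """Minimise sum((y - b*x)^2) + lam * |b| for a single coefficient b."""
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    s_xx = sum(xi * xi for xi in x)
    return soft_threshold(s_xy, lam / 2.0) / s_xx

# Hypothetical centred predictor that is only weakly related to the target.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-0.3, 0.1, 0.0, -0.1, 0.3]

for lam in (0.0, 1.0, 4.0):
    print(lam, ridge_coefficient(x, y, lam), lasso_coefficient(x, y, lam))
```

For the largest λ the lasso coefficient is exactly zero, so the predictor drops out of the model, while the ridge coefficient only becomes smaller.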
Data Description
The dataset used in this project comes from the UCI Machine Learning
Repository and concerns housing values in the suburbs of Boston. This data was
collected in 1978 and contains 506 entries, each giving information about 14 attributes
of homes from various suburbs located in Boston: 13 features and one “target”
attribute. The attribute description of this dataset is given below:
CRIM: the per capita crime rate by town
ZN: the proportion of residential land zoned for lots over 25,000 sq. ft
INDUS: the proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: the average number of rooms per dwelling
AGE: the proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centers
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: the pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT: % lower status of the population
MEDV: the median value of owner-occupied homes in $1000s
The “target” variable in this dataset is MEDV, which will be
predicted by the ML models. The rest of the variables are used for training the
models. Statistics of the features are described in the table below -
The dataset doesn’t contain any null values, nor does it contain any duplicate rows.
The dataset contains 13 numerical attributes and only one categorical attribute, which
is “CHAS”. For the attribute “ZN” it is observed that the 25th and 50th percentiles
are 0, implying that the data is highly skewed. This is because the attribute “ZN”
is a conditional variable.
For the attribute “CHAS” it is observed that the 25th, 50th and 75th percentiles are 0,
meaning that it is also highly skewed. This is because “CHAS” is a categorical
variable that takes only the values 0 or 1. Another important observation is
that the maximum value of “MEDV” is 50.00, so it seems from the
description of the data that “MEDV” is censored at 50.00 (corresponding to a median
price of $50,000). Based on this observation, the censored “MEDV” values at
50.00 will not be helpful, so we remove them. It is also observed that the
attributes “CRIM”, “ZN”, “RM” and “B” have outliers, so we remove them.
From the histograms it is observed that the attributes “CRIM”, “ZN” and “B”
have highly skewed distributions. The attribute “MEDV” has a normal
distribution, whereas the other attributes have either normal or bimodal distributions,
except “CHAS”, as it is a discrete variable. After that, a heat map is
plotted to see the correlation between the attributes.
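The percentile check on “CHAS” and the removal of the censored “MEDV” values can be sketched with the standard library alone. The rows below are synthetic stand-ins generated for the example, not the real data, and the cut-off handling is an assumption about how the censored values are dropped.

```python
import statistics

# Synthetic stand-in rows (not the real data): CHAS is a 0/1 dummy that is
# mostly 0, and MEDV is capped at 50.0 for the most expensive tracts.
rows = [
    {"CHAS": 1 if i % 14 == 0 else 0, "MEDV": min(50.0, 10.0 + 0.9 * i)}
    for i in range(50)
]

# 25th, 50th and 75th percentiles of the dummy variable.
chas_quartiles = statistics.quantiles(
    [row["CHAS"] for row in rows], n=4, method="inclusive"
)
print("CHAS quartiles:", chas_quartiles)

# Drop rows whose target sits at the censoring cap of 50.0.
kept = [row for row in rows if row["MEDV"] < 50.0]
print("rows before:", len(rows), "rows after:", len(kept))
```

Because the dummy variable is 0 for most rows, all three quartiles are 0, which is the pattern described above for “CHAS”.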
It is observed that the attributes “TAX” and “RAD” are highly correlated. Because
they are highly correlated, they behave similarly and will also have a
similar impact on the prediction. So rather than keeping redundant
attributes it is better to remove them, as this saves space and computation
time for complex algorithms. From the heat map it is also observed that the attributes
“LSTAT”, “INDUS”, “RM”, “TAX”, “NOX” and “PTRATIO” have an absolute
correlation score above 0.5 with “MEDV”, which is a good indication for
using them as predictors, so we keep only these attributes and discard the other
attributes. Then the skewness of the data is reduced using a log transformation. These
are the steps taken to refine the data. Refining the data is important to get an
accurate and reliable evaluation of the models. If data pre-processing is not done,
we will not get good results.
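The correlation-based selection and the log transformation can be sketched as follows. The column values are hypothetical stand-ins rather than the real dataset, the 0.5 threshold is applied to the absolute Pearson correlation, and the helper names are assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def skewness(values):
    """Skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Synthetic stand-in columns (hypothetical values, not the real dataset).
data = {
    "LSTAT": [4.0, 5.0, 7.0, 9.0, 12.0, 15.0, 19.0, 24.0, 30.0, 37.0],
    "RM": [7.8, 7.2, 6.9, 6.6, 6.3, 6.1, 5.9, 5.6, 5.4, 5.0],
    "CHAS": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    "MEDV": [45.0, 38.0, 33.0, 29.0, 25.0, 22.0, 19.0, 16.0, 13.0, 10.0],
}

# Keep only predictors whose absolute correlation with the target exceeds 0.5.
target = data["MEDV"]
selected = [
    name for name, column in data.items()
    if name != "MEDV" and abs(pearson(column, target)) > 0.5
]
print("selected predictors:", selected)

# A log transformation (valid for strictly positive values) reduces right skew.
skew_before = skewness(data["LSTAT"])
skew_after = skewness([math.log(v) for v in data["LSTAT"]])
print("LSTAT skewness before and after:", skew_before, skew_after)
```

Taking the absolute value matters because an attribute such as “LSTAT” can be strongly negatively correlated with “MEDV” and still be a useful predictor.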
5. SNAPSHOT
6. FUTURE WORK
This model can be considered a baseline for predicting house prices. Further
evaluation can be done by increasing the amount of data: more records can be collected
and more attributes can be added to obtain a much better evaluation of the
model. The data in the Boston house dataset is from 1978, which is almost 50
years old, and since then house prices have changed considerably due to
inflation. Thus, new data can be collected and further evaluation can be made
on the newly collected data. In this paper four models are implemented on the
Boston house dataset: Simple Linear Regression, Polynomial Regression, Lasso
Regression and Ridge Regression. More advanced models such as Support
Vector Machine, Decision Tree, Random Forest and Multiple Linear Regression can
be implemented and the results compared. Other ensemble learning
techniques such as AdaBoost and XGBoost can be used and the results
compared with the previous models. Feature selection techniques such as Linear
Discriminant Analysis, Principal Component Analysis and Independent Component
Analysis can be used before implementing the models, and a study can be made of
the performance of the models before and after applying these feature selection
methods. It can then be observed how each
of the feature selection methods affects the performance of the model. Neural
network and deep learning methods can also be applied and their performance
studied. To increase the performance of the models and reduce their time complexity,
we can use optimization techniques such as Particle Swarm Optimization,
Genetic Algorithm and Ant Colony Optimization. By implementing these
optimization techniques, it can be observed how each of them
affects the models. There is considerable scope for further work in this
field, which will be very helpful to people who want to buy a plot or house and also to
real estate agencies investing in houses.
7. CONCLUSION
8. REFERENCES
Stephen Law, ‘Defining Street-based Local Area and measuring its effect on house
price using a hedonic price approach: The case study of Metropolitan London’, Cities,
vol. 60, Part A, pp. 166–179, Feb. 2017.
Tong, W., Hussain, A., Bo, W. X., & Maharjan, S., ‘Artificial Intelligence for
Vehicle-to-Everything: a Survey’, IEEE Access, 2019, doi:10.1109/access.2019.2891073.
Sumit Das, Aritra Dey, Akash Paul and Nabamita Roy, ‘Applications of Artificial
Intelligence in Machine Learning: Review and Prospects’, International Journal
of Computer Applications, 2015, doi:10.5120/20182-2402.
John A. Bullinaria, ‘IAI: The Roots, Goals and Sub-fields of AI’, 2005.
https://www.cs.bham.ac.uk/~jxb/IAI/w2.pdf
https://www.researchgate.net/deref/https%3A%2F%2Fwww.researchgate.net%2Fprofile%2FMansi-Bosamia
https://github.com/rromanss23/Machine_Leaning_Engineer_Udacity_NanoDegree/blob/master/projects/boston_housing/boston_housing.ipynb
https://www.ritchieng.com/machine-learning-project-boston-home-prices/
9. BIODATA
MANVENDRA SINGH
Manvendra.191200@gmail.com
9260935168
EDUCATIONAL QUALIFICATIONS
TECHNICAL SKILLS
PERSONAL DETAILS
GENDER : MALE
YATHARTH MISHRA
Manvendra.191200@gmail.com
9260935168
EDUCATIONAL QUALIFICATIONS
TECHNICAL SKILLS
PERSONAL DETAILS
GENDER : MALE