House Price Prediction

Abstract:- This paper demonstrates the use of machine learning algorithms for predicting real estate/house prices on two real datasets downloaded from Kaggle: the Boston dataset created by Harrison, D. and Rubinfeld, D.L., and the Melbourne dataset created by Anthony Pino. To this day, the literature on machine learning prediction of house prices in India is extremely limited. This paper reviews the use of existing machine learning algorithms on two very different datasets and attempts to implement the resulting prediction engine for real-life use. The findings indicate that the choice of algorithm can drastically change accuracy and that a poor dataset can negatively affect the predictions; they also provide sufficient evidence of which algorithm is best suited to this task.

I. INTRODUCTION
Machine Learning (ML) is a vital aspect of present-day business and research. It progressively improves the performance of computer systems by using algorithms and neural network models. Machine learning algorithms automatically build a mathematical model from sample data, also referred to as "training data", and use it to make decisions without being specifically programmed to make those decisions.

People and real estate agencies buy or sell houses: people buy to live in or as an investment, and agencies buy to run a business. Either way, we believe everyone should get exactly what they pay for. Over-valuation and under-valuation in housing markets have always been an issue, and proper detection measures are lacking. Broad measures, such as house/real-estate price-to-rent ratios, give a first pass, but deciding the issue requires in-depth analysis and judgment. This is where machine learning comes in: by training an ML model on hundreds of thousands of data points, a solution can be developed that is powerful enough to predict prices accurately and can cater to everyone's needs.

The primary aim of this paper is to use these machine learning techniques and curate them into ML models that can then serve users. The main objective of a buyer is to find their dream house, one that has all the amenities they need. They also look for these houses/real estates with a price in mind, and there is no guarantee that they will get the property at a fair price rather than an overpriced one. Similarly, a seller looks for a figure to put on the estate as a price tag, and this cannot be a wild guess; a lot of research is needed to arrive at the valuation of a house. Additionally, there is the possibility of underpricing the property. Predicting the price for these users may help them buy or sell estates at the price they deserve, no more and no less.
III. LITERATURE REVIEW

Real estate has become more than a necessity in the 21st century; it represents something much more nowadays, not only for people looking to buy real estate but also for the companies that sell it. Real estate property is not only a basic need but today also represents a person's wealth and prestige. Investment in real estate generally seems profitable because property values do not decline rapidly. Changes in real estate prices affect household investors, bankers, policymakers, and many others, and investment in the real estate sector is an attractive choice. Predicting real estate value is therefore an important economic index.

One study suggests that every organization in today's real estate business operates to achieve a competitive edge over its competitors, and that there is a need to simplify the process for an ordinary person while still providing the best results. The authors proposed using machine learning and artificial intelligence techniques to develop an algorithm that can predict housing prices from certain input features. The business application of this algorithm is that classified websites can use it directly to predict the prices of newly listed properties by taking some input variables and producing a correct and justified price, i.e. avoiding price inputs from customers and thus keeping errors out of the system. The work used the Google Colab/Jupyter IDE. Jupyter is an open-source web application that lets us create and share documents containing live code, visualizations, equations, and narrative text, and it offers tools for data cleaning, data transformation, numerical simulation, statistical modeling, data visualization, and machine learning.

[10] designed a system that helps people find a price close to the precise value of a property. Users can enter their requirements and get the prices of the desired houses; they can also get a sample plan of a house for reference. In [5], the housing values of a Boston suburb are analyzed and forecast with SVM, LSSVM, and PLS methods and the corresponding characteristics. After removing the missing samples from the original data set, 400 samples are treated as training data and 52 samples as test data.

As per [1]'s findings, the best accuracy was provided by the Random Forest Regressor, followed by the Decision Tree Regressor. A similar result was produced by Ridge and Linear Regression, with a very slight reduction for Lasso. Across all groups of feature selections there is no extreme difference, regardless of whether the groups are strong or weak. This is a good sign that buying prices alone can be used to predict selling prices without considering other features, which helps avoid model over-fitting. A reduction in accuracy is, however, apparent for the very weak features group. The same pattern of results is visible in the Root Mean Square Error (RMSE) for all feature selections.

[2] observed that their data set took more than one day to prepare. Instead of performing the computations sequentially, multiple processors could be used to parallelize the computations involved, which could reduce both the preparation time and the prediction time. To add more functionality to the model, users could be given the option to select a district or locality for which heat maps are generated, instead of entering an index.

Another work used a data set of 100 houses with several parameters, using 50 percent of the data to train the model and 50 percent to test it; the results were quite accurate and were validated with different parameters as well. Not using PSO makes it easier to train models on complex problems, and hence regression is used. Another study experimented with the most fundamental machine learning algorithms, such as the decision tree classifier, decision tree regression, and multiple linear regression, implemented with the Scikit-Learn machine learning tool. That work helps users predict both the availability of houses in a city and their prices.

Another study used machine learning algorithms to predict house prices and described a step-by-step procedure for analyzing the dataset. The feature sets were given as input to four algorithms, and a CSV file was generated containing the predicted house prices. The authors expressed the need to use a mix of these models, since a linear model gives a high bias whereas a high-complexity model gives a high variance. The outcome of this study can be used in the annual revision of the guideline value of land, which may add more revenue to the State Government when such transactions are made. Finally, one study concludes from experiments with various machine learning algorithms that random forest and gradient boosted trees perform better, with higher accuracy percentages and fewer error values; when the predictions are compared with the labels, these algorithms predict well.

IV. PROPOSED WORK

The purpose of this system is to determine the price of a house from the various features given as input by the user. These features are passed to the ML model, which produces a prediction based on how they affect the label. This is done by first searching for an appropriate dataset that suits the needs of both the developer and the user. Once the dataset is finalized, it goes through data cleaning, where all data that is not needed is eliminated and the raw data is turned into a .csv file. The data then goes through preprocessing, where missing values are handled and, if needed, label encoding is applied. Next, it goes through data transformation, where it is converted into a NumPy array so that it can finally be used to train the model. During training, various machine learning algorithms are used, their error rates are extracted, and consequently the algorithm and model that yield the most accurate predictions are finalized (a sketch of this comparison step is given below).
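
As an illustration of this model-selection step, the following minimal sketch (Python with scikit-learn) trains a few commonly used regressors and keeps the one with the lowest test error. The file name "housing_cleaned.csv" and the "price" column are placeholders; the paper does not specify the exact schema.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Placeholder file and column names; the actual dataset schema is not given here.
data = pd.read_csv("housing_cleaned.csv")
X = data.drop(columns=["price"]).to_numpy()   # features as a NumPy array
y = data["price"].to_numpy()                  # label (house price)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Train each candidate, record its test RMSE, and keep the most accurate model.
best_name, best_model, best_rmse = None, None, float("inf")
for name, model in candidates.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: RMSE = {rmse:.2f}")
    if rmse < best_rmse:
        best_name, best_model, best_rmse = name, model, rmse

print("Selected model:", best_name)
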
Users and companies will be able to log in and fill in a form describing the various attributes of the property whose price they want to predict. After a thorough selection of attributes, the form is submitted. The data entered by the user is then sent to the model, and within seconds the user can view the predicted price of the property they entered (a sketch of such a prediction call is given below).
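
The sketch below shows, under stated assumptions, how a submitted form could be turned into a prediction. The form fields ("rooms", "bathroom", "landsize", "distance") are hypothetical and only illustrate the idea; they are not taken from the paper.

import pandas as pd

# Hypothetical form fields; the paper does not list the exact attributes collected.
user_form = {
    "rooms": 3,
    "bathroom": 2,
    "landsize": 250.0,
    "distance": 7.5,
}

def predict_price(model, form):
    """Convert the submitted form into a one-row frame and return the model's price estimate."""
    features = pd.DataFrame([form])  # columns must match what the model was trained on
    return float(model.predict(features)[0])

# Example usage (assuming a model trained on these same columns):
# print(predict_price(best_model, user_form))
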
4.1 Block Diagram of the System
The block diagram above shows the traditional machine learning approach. It consists of two sections: training and testing. The training section has the following components: the label, the input, the feature extractor, and the machine learning algorithm. The testing section has the following components: the input, the feature extractor, the regression model, and the output label.
Input: The input consists of data collected from various sources.
Feature Extractor: Only the important features which affect the prediction results are kept; other unnecessary attributes, such as ID or name, are discarded.
Features: After feature extraction, only those inputs which contribute most to the model's prediction are retained.
Machine Learning Algorithm: The ML Algorithm is the method by which an AI
system performs its task, and is most commonly used to predict output values from
given input values. Regression is one of the main processes of machine learning.
The Regression Model: The regression model consists of a set of machine-learning
methods that allow us to predict a label variable (y) based on the values of one or
more attribute/feature variables (x). Briefly, the goal of a regression model is to build
a mathematical equation that defines y as a function of the x variables.
Label: The label is the output obtained from the model after training. The data obtained from the dataset is first given as training input and the relevant training features are extracted. These training features are preprocessed to obtain a normalized dataset, and each data row is labeled. The training dataset is fed to the machine learning algorithm, whose result is fed to the regression model, thus producing a trained model or trained regressor. This trained regressor can take new data, i.e. the features extracted from the test input, and predict its output label (an end-to-end sketch of this flow is given below).
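
To tie the diagram's components together, here is a minimal end-to-end sketch assuming a tabular dataset with placeholder columns "id", "name", and "price"; the diagram itself names only the components, not the schema.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder file and column names.
data = pd.read_csv("housing_cleaned.csv")

# Feature extractor: discard attributes that do not affect the prediction, such as ID or name.
features = data.drop(columns=["id", "name", "price"])
labels = data["price"]

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Preprocess the training features to obtain a normalized dataset.
scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)

# The machine learning algorithm produces a trained regressor from features and labels.
regressor = RandomForestRegressor(random_state=42)
regressor.fit(X_train_norm, y_train)

# The trained regressor takes the extracted test features and predicts their output label.
predicted_prices = regressor.predict(X_test_norm)
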
Handling Missing Values
The Boston dataset had only five missing values, whereas the Melbourne dataset had thousands. Dropping the null values was not an option, since doing so negatively affected the accuracy. Instead, these null values were handled by replacing them with the median value of the column. The replacement was implemented with the simple imputer inside the pipeline itself, so that any missing value encountered in the future is handled as soon as the data passes through the pipeline. Additionally, the Melbourne dataset had missing label/price values, which had to be dropped for better results (a sketch of this imputation step is given below).
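
A minimal sketch of this imputation step, assuming the Melbourne data is loaded from a CSV with a "Price" label column (placeholder names), could look as follows.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder file and column names.
data = pd.read_csv("melbourne.csv")

# Rows with a missing label (price) are dropped rather than imputed.
data = data.dropna(subset=["Price"])
numeric_features = data.drop(columns=["Price"]).select_dtypes(include="number")

# Median imputation sits inside the pipeline, so any future missing value is filled in
# as soon as the data passes through it.
preprocessing = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

X = preprocessing.fit_transform(numeric_features)
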

5.2 Data Preprocessing

Train and Test Split


We split the dataset into two sets, i.e. a training set and a testing set. The training set consists of 80% of the dataset and the testing set has the remaining 20%. Some columns had only two distinct values, and we wanted to make sure that the split divided these values in equal proportions. Therefore, we used a stratified shuffle split for the train/test split, which gave better results (a sketch of this split is given below).
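
A minimal sketch of the stratified split, assuming the Boston data is in a CSV and using the binary CHAS column as a stand-in for whichever two-valued column the split was stratified on, could look as follows.

import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Placeholder file name; CHAS is a two-valued column in the Boston dataset and is
# only an assumed choice of stratification column.
data = pd.read_csv("boston.csv")

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["CHAS"]):
    train_set = data.iloc[train_index]
    test_set = data.iloc[test_index]

# Both sets now contain the two distinct values in (approximately) equal proportion.
print(len(train_set), "training rows /", len(test_set), "test rows")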

Figure 5.2.a Stratified Shuffle Split in Boston dataset

Figure 5.2.b Stratified Shuffle Split in Melbourne dataset
