Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
6 views

Regression Algorithm

The document outlines the process of creating a linear regression model to predict housing prices using various features such as average area income, house age, number of rooms, and population. It includes data exploration, correlation analysis, model training, and evaluation metrics. The model's performance is assessed through predictions, scatter plots, and residual analysis.

Uploaded by

unknown2006103
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views

Regression Algorithm

The document outlines the process of creating a linear regression model to predict housing prices using various features such as average area income, house age, number of rooms, and population. It includes data exploration, correlation analysis, model training, and evaluation metrics. The model's performance is assessed through predictions, scatter plots, and residual analysis.

Uploaded by

unknown2006103
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 9

Linear Regression Models

Creating a model to predict the Housing prices


based on existing features
In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]: %matplotlib inline

In [3]: hs=pd.read_csv('USA_Housing.csv')
hs

Out[3]: Avg. Area


Avg. Area Avg. Area Avg. Area
Area
House Number of Number of Price Address
Income Population
Age Rooms Bedrooms

208 Michael Ferry Apt.


0 79545.458574 5.682861 7.009188 4.09 23086.800503 1.059034e+06 674\nLaurabury, NE
3701...

188 Johnson Views Suite


1 79248.642455 6.002900 6.730821 3.09 40173.072174 1.505891e+06
079\nLake Kathleen, CA...

9127 Elizabeth
2 61287.067179 5.865890 8.512727 5.13 36882.159400 1.058988e+06 Stravenue\nDanieltown,
WI 06482...

USS Barnett\nFPO AP
3 63345.240046 7.188236 5.586729 3.26 34310.242831 1.260617e+06
44820

USNS Raymond\nFPO AE
4 59982.197226 5.040555 7.839388 4.23 26354.109472 6.309435e+05
09386

... ... ... ... ... ... ... ...

USNS Williams\nFPO AP
4995 60567.944140 7.830362 6.137356 3.46 22837.361035 1.060194e+06
30153-7653

PSC 9258, Box


4996 78491.275435 6.999135 6.576763 4.02 25616.115489 1.482618e+06 8489\nAPO AA 42991-
3352

4215 Tracy Garden Suite


4997 63390.686886 7.250591 4.805081 2.13 33266.145490 1.030730e+06 076\nJoshualand, VA
01...

USS Wallace\nFPO AE
4998 68001.331235 5.534388 7.130144 5.44 42625.620156 1.198657e+06
73316

37778 George Ridges


4999 65510.581804 5.992305 6.792336 4.07 46501.283803 1.298950e+06 Apt. 509\nEast Holly, NV
2...

5000 rows × 7 columns

In [4]: hs.info() #gives total number of columns, total number of entries and type of data ty
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB

In [5]: hs.describe() #to get the statistical information about the dataframe

Out[5]: Avg. Area Avg. Area House Avg. Area Number Avg. Area Number of Area
Price
Income Age of Rooms Bedrooms Population

count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5.000000e+03

mean 68583.108984 5.977222 6.987792 3.981330 36163.516039 1.232073e+06

std 10657.991214 0.991456 1.005833 1.234137 9925.650114 3.531176e+05

min 17796.631190 2.644304 3.236194 2.000000 172.610686 1.593866e+04

25% 61480.562388 5.322283 6.299250 3.140000 29403.928702 9.975771e+05

50% 68804.286404 5.970429 7.002902 4.050000 36199.406689 1.232669e+06

75% 75783.338666 6.650808 7.665871 4.490000 42861.290769 1.471210e+06

max 107701.748378 9.519088 10.759588 6.500000 69621.713378 2.469066e+06

In [6]: #in above address column is not available as it has string values.

In [7]: hs.columns #gives number of columns in the given dataframe.

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
Out[7]:
'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
dtype='object')

In [8]: sns.pairplot(hs)

<seaborn.axisgrid.PairGrid at 0x21970b9e790>
Out[8]:
In [9]: #in above histogram looks like normally distributed except number of bedrooms which are 4
#around it.

In [10]: #to check distribution of the price


sns.histplot(hs['Price'])

<AxesSubplot:xlabel='Price', ylabel='Count'>
Out[10]:
In [11]: #to check the corelation between two variables use heatmap
sns.heatmap(hs.corr())
#you can see the daignol of corelation and it shows relation is perfectly ploted with so

<AxesSubplot:>
Out[11]:

In [12]: sns.heatmap(hs.corr(),annot=True)

<AxesSubplot:>
Out[12]:
In [13]: #to check the corelation between two variables
hs.corr()

Out[13]: Avg. Area Avg. Area Avg. Area Number Avg. Area Number Area
Price
Income House Age of Rooms of Bedrooms Population

Avg. Area Income 1.000000 -0.002007 -0.011032 0.019788 -0.016234 0.639734

Avg. Area House


-0.002007 1.000000 -0.009428 0.006149 -0.018743 0.452543
Age

Avg. Area Number


-0.011032 -0.009428 1.000000 0.462695 0.002040 0.335664
of Rooms

Avg. Area Number


0.019788 0.006149 0.462695 1.000000 -0.022168 0.171071
of Bedrooms

Area Population -0.016234 -0.018743 0.002040 -0.022168 1.000000 0.408556

Price 0.639734 0.452543 0.335664 0.171071 0.408556 1.000000

In [14]: # for training a linear regression model split the data into X axis that features to tra
# in this case target variable is Price column wr we are predciting the price
#here we wont deal with Address column as it has text information but in NLP we will.
hs.columns # to grab the columns

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
Out[14]:
'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
dtype='object')

In [15]: #take feature variables


X=hs[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]

In [16]: #take target variable that we trying to predict


y=hs['Price']
Splitting the data into training set and testing set(to
test the model that we have trained)
In [17]: #now do Train Test split in the data
#we need to import scikitlearn for splitting the data
from sklearn.model_selection import train_test_split

In [18]: x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=10


#tuple unpacking to grab training set and testing set
#test size is percentage of test data you want to allocate for testing our model that is
#random state means data is allocated randomly.

In [19]: from sklearn.linear_model import LinearRegression


#to import linear regression functions

In [20]: #we need to create instance for the linear regression model
lm=LinearRegression()

In [21]: lm.fit(x_train,y_train) # at first i will fit my data into model for training the data
#use shift+tab to get the broiler code

LinearRegression()
Out[21]:

evaluate our model while checking coeffcients


In [22]: #grab the intercept while calling lm
print(lm.intercept_)

-2640159.79685267

In [23]: #grab the coeffcient for each features each of this coef relates to columns
lm.coef_

array([2.15282755e+01, 1.64883282e+05, 1.22368678e+05, 2.23380186e+03,


Out[23]:
1.51504200e+01])

In [24]: X.columns

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
Out[24]:
'Avg. Area Number of Bedrooms', 'Area Population'],
dtype='object')

In [25]: x_train.columns

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
Out[25]:
'Avg. Area Number of Bedrooms', 'Area Population'],
dtype='object')

In [26]: #create a dataframe for better understanding on the coefficients


cdf=pd.DataFrame(lm.coef_,X.columns,columns=['Coeff'])

In [27]: cdf
Out[27]: Coeff

Avg. Area Income 21.528276

Avg. Area House Age 164883.282027

Avg. Area Number of Rooms 122368.678027

Avg. Area Number of Bedrooms 2233.801864

Area Population 15.150420

In [28]: #coefficient shows that when you hold all other features fixed and if there is one unit
#is assoicated with the increase of 21.528$ in the house price.and rest all the same.

Predicting the test set


In [37]: predictions=lm.predict(x_test)

In [38]: predictions #predicted prices for the house

array([1260960.70567627, 827588.7556033 , 1742421.24254342, ...,


Out[38]:
372191.40626916, 1365217.15140897, 1914519.5417888 ])

In [39]: y_test #actual vlaues of the house

1718 1.251689e+06
Out[39]:
2511 8.730483e+05
345 1.696978e+06
2521 1.063964e+06
54 9.487883e+05
...
1776 1.489520e+06
4269 7.777336e+05
1661 1.515271e+05
2410 1.343824e+06
2302 1.906025e+06
Name: Price, Length: 2000, dtype: float64

In [40]: #to analyse above these values draw a scatter plot using both variables
plt.scatter(y_test,predictions)

<matplotlib.collections.PathCollection at 0x2197496b880>
Out[40]:
In [ ]: #it is pretty good bcs the predicted value and actual value fits into the straight line

In [41]: #create a histogram of distribution for residuals.residuals are the diffrence between ac
#predicted values (predictions).
sns.distplot((y_test-predictions))

C:\Users\Sai Shri\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarni


ng: `distplot` is a deprecated function and will be removed in a future version. Please
adapt your code to use either `displot` (a figure-level function with similar flexibilit
y) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='Price', ylabel='Density'>
Out[41]:

In [42]: #the curve shows residuals are normally distributed it means the model selected is a per
#if it is not normaly distributed and showing weired behaviour then look back the data a
#is good choice for the data or not for the dataset

Regression Evaluation Matrics

Mean Absolute Error

Mean sqaured Error

Root Mean sqaured Error


In [43]: from sklearn import metrics

In [44]: metrics.mean_absolute_error(y_test,predictions)

82288.22251914942
Out[44]:

In [46]: metrics.mean_squared_error(y_test,predictions)

10460958907.208948
Out[46]:
In [47]: np.sqrt(metrics.mean_squared_error(y_test,predictions))

102278.82922290883
Out[47]:

You might also like