Regression Algorithm
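The transcript picks up at In [3]; the setup cells with the imports are not shown. Based on the aliases used later (pd, np, sns, plt, metrics), they presumably looked roughly like this sketch:

        # Assumed setup cells (not shown in the transcript)
        import numpy as np
        import pandas as pd
        import seaborn as sns
        import matplotlib.pyplot as plt
        from sklearn import metrics     # used at the end for MAE/MSE/RMSE
        %matplotlib inline              # render plots inside the notebook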
In [3]: hs=pd.read_csv('USA_Housing.csv')
        hs
Out[3]: [DataFrame of 5000 rows x 7 columns: Avg. Area Income, Avg. Area House Age,
         Avg. Area Number of Rooms, Avg. Area Number of Bedrooms, Area Population,
         Price, Address]
In [4]: hs.info() # gives the total number of entries, the number of columns and the data type of each column
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
In [5]: hs.describe() #to get the statistical information about the dataframe
Out[5]: [summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for the six numeric
         columns: Avg. Area Income, Avg. Area House Age, Avg. Area Number of Rooms,
         Avg. Area Number of Bedrooms, Area Population and Price]
In [6]: # in the describe() output above the Address column is not included because it holds string values.

In [7]: hs.columns
Out[7]: Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
              dtype='object')
In [8]: sns.pairplot(hs)
Out[8]: <seaborn.axisgrid.PairGrid at 0x21970b9e790>
In [9]: # in the pairplot above each histogram looks normally distributed, except Avg. Area Number
        # of Bedrooms, whose values cluster around 4.
In [10]: sns.histplot(hs['Price'])
Out[10]: <AxesSubplot:xlabel='Price', ylabel='Count'>
In [11]: # to check the correlation between the variables, use a heatmap
         sns.heatmap(hs.corr())
         # the diagonal of the correlation matrix shows each variable is perfectly correlated with itself
Out[11]: <AxesSubplot:>
In [12]: sns.heatmap(hs.corr(),annot=True)
Out[12]: <AxesSubplot:>
In [13]: # to check the correlation between the variables as a table
         hs.corr()
Out[13]: [6x6 correlation matrix of Avg. Area Income, Avg. Area House Age, Avg. Area Number of Rooms,
          Avg. Area Number of Bedrooms, Area Population and Price]
In [14]: # to train a linear regression model, split the data into an X array with the features to train on
         # and a y array with the target variable; here the target is the Price column, which we are predicting.
         # we won't deal with the Address column since it holds text information (that is NLP territory).
         hs.columns # grab the column names
Out[14]: Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
               dtype='object')
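Cells In [15]-[19] are missing from the transcript. Judging from the variables used below (X, x_train, x_test, y_train, y_test, and a y_test of length 2000 out of 5000 rows), they presumably built the feature matrix and target and performed a 60/40 train/test split, roughly like this sketch (the random_state is assumed; the original seed is not shown):

        from sklearn.model_selection import train_test_split
        from sklearn.linear_model import LinearRegression

        X = hs[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population']]   # numeric features only, Address dropped
        y = hs['Price']                                               # target variable

        # test_size=0.4 inferred from y_test later showing Length: 2000 of the 5000 rows
        x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)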
In [20]: # create an instance of the LinearRegression model
         lm=LinearRegression()
In [21]: lm.fit(x_train,y_train) # first fit the model on the training data
         # use Shift+Tab in the notebook to pull up the boilerplate/signature for the call
Out[21]: LinearRegression()

In [22]: lm.intercept_ # intercept of the fitted model
Out[22]: -2640159.79685267
In [23]: # grab the coefficient for each feature; each coefficient relates to one column of X
         lm.coef_
In [24]: X.columns
Out[24]: Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population'],
               dtype='object')
In [25]: x_train.columns
Out[25]: Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population'],
               dtype='object')
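The cell that builds cdf (In [26]) is also missing; given the 'Coeff' column shown in the output below, it was presumably something like this sketch:

        # Assumed reconstruction of the missing In [26] cell
        cdf = pd.DataFrame(lm.coef_, X.columns, columns=['Coeff'])   # one row per feature, its coefficient as the value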
In [27]: cdf
Out[27]: [DataFrame with the five feature names as the index and a single 'Coeff' column holding each coefficient]
In [28]: # interpretation of the coefficients: holding all the other features fixed, a one-unit increase in
         # Avg. Area Income is associated with an increase of about $21.528 in the house price; the remaining
         # coefficients are read the same way.
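A quick way to sanity-check that reading of the coefficients is to bump one feature by a single unit and compare the two predictions; for a linear model the difference equals that feature's coefficient. A small sketch (any test row will do):

        row = x_test.iloc[[0]].copy()                  # one arbitrary test row, kept as a DataFrame
        bumped = row.copy()
        bumped.iloc[0, 0] += 1                         # +1 unit of Avg. Area Income, everything else fixed
        print(lm.predict(bumped) - lm.predict(row))    # ~ lm.coef_[0], i.e. roughly 21.5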
In [39]: y_test
Out[39]: 1718    1.251689e+06
         2511    8.730483e+05
         345     1.696978e+06
         2521    1.063964e+06
         54      9.487883e+05
                      ...
         1776    1.489520e+06
         4269    7.777336e+05
         1661    1.515271e+05
         2410    1.343824e+06
         2302    1.906025e+06
         Name: Price, Length: 2000, dtype: float64
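The predictions variable used from In [40] onwards comes from another cell that is not in the transcript; presumably it was simply the fitted model applied to the test features, roughly:

        # Assumed missing cell: generate predictions for the held-out test set
        predictions = lm.predict(x_test)   # array of predicted prices, same length as y_test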
In [40]: # to analyse the predictions against the actual values, draw a scatter plot of the two
         plt.scatter(y_test,predictions)
Out[40]: <matplotlib.collections.PathCollection at 0x2197496b880>
In [ ]: # this looks pretty good because the predicted values and the actual values fall close to a straight line
In [41]: # create a histogram of the distribution of the residuals; residuals are the difference between the
         # actual values (y_test) and the predicted values (predictions).
         sns.distplot((y_test-predictions))
In [42]: # the curve shows the residuals are normally distributed, which suggests the model chosen is a good fit.
         # if the residuals are not normally distributed and show weird behaviour, go back to the data and check
         # whether linear regression is a good choice for the dataset.
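Note that sns.distplot is deprecated in recent seaborn releases; the same residual histogram with the current API would be roughly this sketch:

        sns.histplot(y_test - predictions, kde=True)   # histogram of the residuals with a KDE curve overlaid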
In [44]: metrics.mean_absolute_error(y_test,predictions)
Out[44]: 82288.22251914942
In [46]: metrics.mean_squared_error(y_test,predictions)
Out[46]: 10460958907.208948
In [47]: np.sqrt(metrics.mean_squared_error(y_test,predictions))
Out[47]: 102278.82922290883
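For reference, the three metrics above all summarise the same residuals: MAE is the mean absolute residual, MSE is the mean squared residual, and RMSE is the square root of MSE, which brings the error back into price units. Computing them by hand with numpy should reproduce the values returned by the metrics calls:

        # Hand-rolled versions of the three metrics, for comparison with sklearn's output
        residuals = y_test - predictions
        mae  = np.mean(np.abs(residuals))    # mean absolute error, ~82288
        mse  = np.mean(residuals ** 2)       # mean squared error, ~1.046e10
        rmse = np.sqrt(mse)                  # root mean squared error, ~102279
        print(mae, mse, rmse)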