0% found this document useful (0 votes)

11 views

Regression Algorithm

The document outlines the process of creating a linear regression model to predict housing prices using various features such as average area income, house age, number of rooms, and population. It includes data exploration, correlation analysis, model training, and evaluation metrics. The model's performance is assessed through predictions, scatter plots, and residual analysis.

Uploaded by

unknown2006103

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

11 views

Regression Algorithm

Uploaded by

unknown2006103

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

Linear Regression Models

Creating a model to predict the Housing prices

based on existing features
In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]: %matplotlib inline

In [3]: hs=pd.read_csv('USA_Housing.csv')
hs

Out[3]: Avg. Area

Avg. Area Avg. Area Avg. Area
Area
House Number of Number of Price Address
Income Population
Age Rooms Bedrooms

208 Michael Ferry Apt.

0 79545.458574 5.682861 7.009188 4.09 23086.800503 1.059034e+06 674\nLaurabury, NE
3701...

188 Johnson Views Suite

1 79248.642455 6.002900 6.730821 3.09 40173.072174 1.505891e+06
079\nLake Kathleen, CA...

9127 Elizabeth
2 61287.067179 5.865890 8.512727 5.13 36882.159400 1.058988e+06 Stravenue\nDanieltown,
WI 06482...

USS Barnett\nFPO AP
3 63345.240046 7.188236 5.586729 3.26 34310.242831 1.260617e+06
44820

USNS Raymond\nFPO AE
4 59982.197226 5.040555 7.839388 4.23 26354.109472 6.309435e+05
09386

... ... ... ... ... ... ... ...

USNS Williams\nFPO AP
4995 60567.944140 7.830362 6.137356 3.46 22837.361035 1.060194e+06
30153-7653

PSC 9258, Box

4996 78491.275435 6.999135 6.576763 4.02 25616.115489 1.482618e+06 8489\nAPO AA 42991-
3352

4215 Tracy Garden Suite

4997 63390.686886 7.250591 4.805081 2.13 33266.145490 1.030730e+06 076\nJoshualand, VA
01...

USS Wallace\nFPO AE
4998 68001.331235 5.534388 7.130144 5.44 42625.620156 1.198657e+06
73316

37778 George Ridges

4999 65510.581804 5.992305 6.792336 4.07 46501.283803 1.298950e+06 Apt. 509\nEast Holly, NV
2...

5000 rows × 7 columns

In [4]: hs.info() #gives total number of columns, total number of entries and type of data ty
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB

In [5]: hs.describe() #to get the statistical information about the dataframe

Out[5]: Avg. Area Avg. Area House Avg. Area Number Avg. Area Number of Area
Price
Income Age of Rooms Bedrooms Population

count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5.000000e+03

mean 68583.108984 5.977222 6.987792 3.981330 36163.516039 1.232073e+06

std 10657.991214 0.991456 1.005833 1.234137 9925.650114 3.531176e+05

min 17796.631190 2.644304 3.236194 2.000000 172.610686 1.593866e+04

25% 61480.562388 5.322283 6.299250 3.140000 29403.928702 9.975771e+05

50% 68804.286404 5.970429 7.002902 4.050000 36199.406689 1.232669e+06

75% 75783.338666 6.650808 7.665871 4.490000 42861.290769 1.471210e+06

max 107701.748378 9.519088 10.759588 6.500000 69621.713378 2.469066e+06

In [6]: #in above address column is not available as it has string values.

In [7]: hs.columns #gives number of columns in the given dataframe.

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
Out[7]:
'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
dtype='object')

In [8]: sns.pairplot(hs)

<seaborn.axisgrid.PairGrid at 0x21970b9e790>
Out[8]:
In [9]: #in above histogram looks like normally distributed except number of bedrooms which are 4
#around it.

In [10]: #to check distribution of the price

sns.histplot(hs['Price'])

<AxesSubplot:xlabel='Price', ylabel='Count'>
Out[10]:
In [11]: #to check the corelation between two variables use heatmap
sns.heatmap(hs.corr())
#you can see the daignol of corelation and it shows relation is perfectly ploted with so

<AxesSubplot:>
Out[11]:

In [12]: sns.heatmap(hs.corr(),annot=True)

<AxesSubplot:>
Out[12]:
In [13]: #to check the corelation between two variables
hs.corr()

Out[13]: Avg. Area Avg. Area Avg. Area Number Avg. Area Number Area
Price
Income House Age of Rooms of Bedrooms Population

Avg. Area Income 1.000000 -0.002007 -0.011032 0.019788 -0.016234 0.639734

Avg. Area House

-0.002007 1.000000 -0.009428 0.006149 -0.018743 0.452543
Age

Avg. Area Number

-0.011032 -0.009428 1.000000 0.462695 0.002040 0.335664
of Rooms

Avg. Area Number

0.019788 0.006149 0.462695 1.000000 -0.022168 0.171071
of Bedrooms

Area Population -0.016234 -0.018743 0.002040 -0.022168 1.000000 0.408556

Price 0.639734 0.452543 0.335664 0.171071 0.408556 1.000000

In [14]: # for training a linear regression model split the data into X axis that features to tra
# in this case target variable is Price column wr we are predciting the price
#here we wont deal with Address column as it has text information but in NLP we will.
hs.columns # to grab the columns

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
Out[14]:
'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
dtype='object')

In [15]: #take feature variables

X=hs[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]

In [16]: #take target variable that we trying to predict

y=hs['Price']
Splitting the data into training set and testing set(to
test the model that we have trained)
In [17]: #now do Train Test split in the data
#we need to import scikitlearn for splitting the data
from sklearn.model_selection import train_test_split

In [18]: x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=10

#tuple unpacking to grab training set and testing set
#test size is percentage of test data you want to allocate for testing our model that is
#random state means data is allocated randomly.

In [19]: from sklearn.linear_model import LinearRegression

#to import linear regression functions

In [20]: #we need to create instance for the linear regression model
lm=LinearRegression()

In [21]: lm.fit(x_train,y_train) # at first i will fit my data into model for training the data
#use shift+tab to get the broiler code

LinearRegression()
Out[21]:

evaluate our model while checking coeffcients

In [22]: #grab the intercept while calling lm
print(lm.intercept_)

-2640159.79685267

In [23]: #grab the coeffcient for each features each of this coef relates to columns
lm.coef_

array([2.15282755e+01, 1.64883282e+05, 1.22368678e+05, 2.23380186e+03,

Out[23]:
1.51504200e+01])

In [24]: X.columns

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
Out[24]:
'Avg. Area Number of Bedrooms', 'Area Population'],
dtype='object')

In [25]: x_train.columns

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
Out[25]:
'Avg. Area Number of Bedrooms', 'Area Population'],
dtype='object')

In [26]: #create a dataframe for better understanding on the coefficients

cdf=pd.DataFrame(lm.coef_,X.columns,columns=['Coeff'])

In [27]: cdf
Out[27]: Coeff

Avg. Area Income 21.528276

Avg. Area House Age 164883.282027

Avg. Area Number of Rooms 122368.678027

Avg. Area Number of Bedrooms 2233.801864

Area Population 15.150420

In [28]: #coefficient shows that when you hold all other features fixed and if there is one unit
#is assoicated with the increase of 21.528$ in the house price.and rest all the same.

Predicting the test set

In [37]: predictions=lm.predict(x_test)

In [38]: predictions #predicted prices for the house

array([1260960.70567627, 827588.7556033 , 1742421.24254342, ...,

Out[38]:
372191.40626916, 1365217.15140897, 1914519.5417888 ])

In [39]: y_test #actual vlaues of the house

1718 1.251689e+06
Out[39]:
2511 8.730483e+05
345 1.696978e+06
2521 1.063964e+06
54 9.487883e+05
...
1776 1.489520e+06
4269 7.777336e+05
1661 1.515271e+05
2410 1.343824e+06
2302 1.906025e+06
Name: Price, Length: 2000, dtype: float64

In [40]: #to analyse above these values draw a scatter plot using both variables
plt.scatter(y_test,predictions)

<matplotlib.collections.PathCollection at 0x2197496b880>
Out[40]:
In [ ]: #it is pretty good bcs the predicted value and actual value fits into the straight line

In [41]: #create a histogram of distribution for residuals.residuals are the diffrence between ac
#predicted values (predictions).
sns.distplot((y_test-predictions))

C:\Users\Sai Shri\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarni

ng: `distplot` is a deprecated function and will be removed in a future version. Please
adapt your code to use either `displot` (a figure-level function with similar flexibilit
y) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='Price', ylabel='Density'>
Out[41]:

In [42]: #the curve shows residuals are normally distributed it means the model selected is a per
#if it is not normaly distributed and showing weired behaviour then look back the data a
#is good choice for the data or not for the dataset

Regression Evaluation Matrics

Mean Absolute Error

Mean sqaured Error

Root Mean sqaured Error

In [43]: from sklearn import metrics

In [44]: metrics.mean_absolute_error(y_test,predictions)

82288.22251914942
Out[44]:

In [46]: metrics.mean_squared_error(y_test,predictions)

10460958907.208948
Out[46]:
In [47]: np.sqrt(metrics.mean_squared_error(y_test,predictions))

102278.82922290883
Out[47]:

IATF Manual
50% (2)
IATF Manual
62 pages
Prac - 8 (1) - Jupyter Notebook
No ratings yet
Prac - 8 (1) - Jupyter Notebook
6 pages
Project Linear Regression
No ratings yet
Project Linear Regression
7 pages
ML LinearRegression
No ratings yet
ML LinearRegression
10 pages
01.multiple Linear Regression - Ipynb - Colaboratory
No ratings yet
01.multiple Linear Regression - Ipynb - Colaboratory
10 pages
Linear Regression Using Python
No ratings yet
Linear Regression Using Python
18 pages
Mlext
No ratings yet
Mlext
1 page
DL_1
No ratings yet
DL_1
11 pages
Linear Regression Analysis - Polynomial Regression
No ratings yet
Linear Regression Analysis - Polynomial Regression
25 pages
Copy of Project 4 _ House Price Prediction.ipynb - Colab
No ratings yet
Copy of Project 4 _ House Price Prediction.ipynb - Colab
5 pages
Kaggle Machine Learning
No ratings yet
Kaggle Machine Learning
6 pages
20BECE30146 ML Pratical2
No ratings yet
20BECE30146 ML Pratical2
3 pages
T2_summary_VHA
No ratings yet
T2_summary_VHA
14 pages
Document From Jahnavi
No ratings yet
Document From Jahnavi
20 pages
ML Regression
No ratings yet
ML Regression
9 pages
Expt 7
No ratings yet
Expt 7
3 pages
vertopal.com_2_linear_regression_multivariate
No ratings yet
vertopal.com_2_linear_regression_multivariate
2 pages
Report
No ratings yet
Report
40 pages
Introduction To Machine Learning (ML) With Sklearn
No ratings yet
Introduction To Machine Learning (ML) With Sklearn
10 pages
ML File
No ratings yet
ML File
37 pages
Emllab
No ratings yet
Emllab
6 pages
PythonFile[1]
No ratings yet
PythonFile[1]
5 pages
exp_3_ml
No ratings yet
exp_3_ml
3 pages
f3683849-7ca6-4854-8f96-af11b6e837ec
No ratings yet
f3683849-7ca6-4854-8f96-af11b6e837ec
20 pages
Chirag HOusing Price Pred
No ratings yet
Chirag HOusing Price Pred
12 pages
Linear Reg
No ratings yet
Linear Reg
25 pages
USA Real Estate Price Prediction Using Decision Tree Regressor, and AdaBoost Regressor
No ratings yet
USA Real Estate Price Prediction Using Decision Tree Regressor, and AdaBoost Regressor
14 pages
unit 3 5
No ratings yet
unit 3 5
4 pages
Machine Learning Laboratory
No ratings yet
Machine Learning Laboratory
23 pages
Faisal Nadeem (SAP# 30601)
No ratings yet
Faisal Nadeem (SAP# 30601)
7 pages
The Boston Housing Dataset
100% (1)
The Boston Housing Dataset
4 pages
Exercise4 Solution
No ratings yet
Exercise4 Solution
20 pages
IoT Task4 21BEC0384
No ratings yet
IoT Task4 21BEC0384
9 pages
Experiment Number: 3: Aim:-Study of The Linear Regression in The Machine Learning Using The Boston Housing Dataset. 1)
No ratings yet
Experiment Number: 3: Aim:-Study of The Linear Regression in The Machine Learning Using The Boston Housing Dataset. 1)
14 pages
Aayushi ML File
No ratings yet
Aayushi ML File
37 pages
2 - Linear - Regression - Multivariate - Ipynb - Colaboratory
No ratings yet
2 - Linear - Regression - Multivariate - Ipynb - Colaboratory
4 pages
Injecttive Blockchain
No ratings yet
Injecttive Blockchain
14 pages
Data Science Record_05
No ratings yet
Data Science Record_05
20 pages
MachineLearning
No ratings yet
MachineLearning
10 pages
Price Prediction
No ratings yet
Price Prediction
4 pages
1 - Linear - Regression - Ipynb - Colaboratory
No ratings yet
1 - Linear - Regression - Ipynb - Colaboratory
7 pages
1_Lab Manual (ML)
No ratings yet
1_Lab Manual (ML)
42 pages
Faseeh Chap 2 Report
No ratings yet
Faseeh Chap 2 Report
30 pages
Setup: Chapter 2 - End-To-End Machine Learning Project
No ratings yet
Setup: Chapter 2 - End-To-End Machine Learning Project
31 pages
Ex No.: Date: Problem Statement
No ratings yet
Ex No.: Date: Problem Statement
3 pages
California Housing Price Prediction .
No ratings yet
California Housing Price Prediction .
1 page
a
No ratings yet
a
2 pages
Deepak Data Analysis 1
No ratings yet
Deepak Data Analysis 1
31 pages
House Price Prediction Models
No ratings yet
House Price Prediction Models
16 pages
1722414346054
No ratings yet
1722414346054
18 pages
Train
No ratings yet
Train
17 pages
Phase 5
No ratings yet
Phase 5
5 pages
ml record
No ratings yet
ml record
21 pages
New Opendocument Text
No ratings yet
New Opendocument Text
7 pages
Normialization Dataset
No ratings yet
Normialization Dataset
7 pages
AIML
No ratings yet
AIML
5 pages
House Price Prediction Using Machine Learning in Python
No ratings yet
House Price Prediction Using Machine Learning in Python
13 pages
Sesi 4-2B Linear Regression With Python - Jupyter Notebook
No ratings yet
Sesi 4-2B Linear Regression With Python - Jupyter Notebook
12 pages
ml2020 Pythonlab02
No ratings yet
ml2020 Pythonlab02
3 pages
P04 The Regression Pipeline - Preprocessing Ans
No ratings yet
P04 The Regression Pipeline - Preprocessing Ans
19 pages
Medium Sudoku Puzzle Book (Printable Version)
From Everand
Medium Sudoku Puzzle Book (Printable Version)
Sheba Blake
No ratings yet
Rddrone-Bms772: Rev Date
No ratings yet
Rddrone-Bms772: Rev Date
9 pages
MN2196_Commentary_October_2023
No ratings yet
MN2196_Commentary_October_2023
12 pages
user manual english
No ratings yet
user manual english
49 pages
Mrinmay Dutta Bengaluru - Bangalore 9.10 Yrs
No ratings yet
Mrinmay Dutta Bengaluru - Bangalore 9.10 Yrs
2 pages
2501.02497v1
No ratings yet
2501.02497v1
22 pages
CFC 2310 MSLR Question Paper
No ratings yet
CFC 2310 MSLR Question Paper
7 pages
Download ebooks file Field Expedient SDR Volume 1 Introduction to Software Defined Radio 2nd Edition Paul Clark all chapters
100% (2)
Download ebooks file Field Expedient SDR Volume 1 Introduction to Software Defined Radio 2nd Edition Paul Clark all chapters
81 pages
Commerce
No ratings yet
Commerce
79 pages
sm35 Specification Sheet English
No ratings yet
sm35 Specification Sheet English
1 page
1the Making of Python
No ratings yet
1the Making of Python
5 pages
2003 Nissan Altima 2.5 Serivce Manual GW
100% (2)
2003 Nissan Altima 2.5 Serivce Manual GW
62 pages
Tutorial No.: 2 Activity No.: 1 Topic: Computer-Aided Audit Tools - ACL 9 Table Name: Ar
No ratings yet
Tutorial No.: 2 Activity No.: 1 Topic: Computer-Aided Audit Tools - ACL 9 Table Name: Ar
3 pages
商业研究提案
100% (2)
商业研究提案
8 pages
Vessel Design (External Loads)
No ratings yet
Vessel Design (External Loads)
45 pages
Electrical Machines Laboratory 20190720 - LR
No ratings yet
Electrical Machines Laboratory 20190720 - LR
48 pages
Multimedia Technology Cat
No ratings yet
Multimedia Technology Cat
17 pages
Mmcflash Vehicle Reprogramming
No ratings yet
Mmcflash Vehicle Reprogramming
11 pages
Thermocouple Repair Procedure Overlay
No ratings yet
Thermocouple Repair Procedure Overlay
10 pages
63 L2701 - E - F Vicamatic3 Rev1 EN 15 - 12 - 2020
No ratings yet
63 L2701 - E - F Vicamatic3 Rev1 EN 15 - 12 - 2020
116 pages
Data Processing SSS3 Scheme of Work - syllabusNG
No ratings yet
Data Processing SSS3 Scheme of Work - syllabusNG
12 pages
MICHLET 9.30 User's Manual
No ratings yet
MICHLET 9.30 User's Manual
36 pages
Bmat201l Complex-Variables-And-Linear-Algebra TH 1.0 65 Bmat201l
No ratings yet
Bmat201l Complex-Variables-And-Linear-Algebra TH 1.0 65 Bmat201l
3 pages
Brochura ResearchinDataScienceandAIappliedtoPA
No ratings yet
Brochura ResearchinDataScienceandAIappliedtoPA
31 pages
Excavation and Lateral Support Tekla
No ratings yet
Excavation and Lateral Support Tekla
9 pages
Notes On Brick Masonary
No ratings yet
Notes On Brick Masonary
11 pages
Bridge Equipment - Autopliot: Description & Principle of Operation
No ratings yet
Bridge Equipment - Autopliot: Description & Principle of Operation
4 pages
BF Log MM Ci 2 Ehp3 en
No ratings yet
BF Log MM Ci 2 Ehp3 en
3 pages
Pressure Tech PDF
No ratings yet
Pressure Tech PDF
12 pages
!doctype HTML Head /head Body: // Define Variables and Set To Empty Values
No ratings yet
!doctype HTML Head /head Body: // Define Variables and Set To Empty Values
7 pages

Regression Algorithm

Uploaded by

Regression Algorithm

Uploaded by

Linear Regression Models

Creating a model to predict the Housing prices

In [2]: %matplotlib inline

Out[3]: Avg. Area

208 Michael Ferry Apt.

188 Johnson Views Suite

... ... ... ... ... ... ... ...

PSC 9258, Box

4215 Tracy Garden Suite

37778 George Ridges

5000 rows × 7 columns

count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5.000000e+03

mean 68583.108984 5.977222 6.987792 3.981330 36163.516039 1.232073e+06

std 10657.991214 0.991456 1.005833 1.234137 9925.650114 3.531176e+05

min 17796.631190 2.644304 3.236194 2.000000 172.610686 1.593866e+04

25% 61480.562388 5.322283 6.299250 3.140000 29403.928702 9.975771e+05

50% 68804.286404 5.970429 7.002902 4.050000 36199.406689 1.232669e+06

75% 75783.338666 6.650808 7.665871 4.490000 42861.290769 1.471210e+06

max 107701.748378 9.519088 10.759588 6.500000 69621.713378 2.469066e+06

In [7]: hs.columns #gives number of columns in the given dataframe.

In [10]: #to check distribution of the price

Avg. Area Income 1.000000 -0.002007 -0.011032 0.019788 -0.016234 0.639734

Avg. Area House

Avg. Area Number

Avg. Area Number

Area Population -0.016234 -0.018743 0.002040 -0.022168 1.000000 0.408556

Price 0.639734 0.452543 0.335664 0.171071 0.408556 1.000000

In [15]: #take feature variables

In [16]: #take target variable that we trying to predict

In [18]: x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=10

In [19]: from sklearn.linear_model import LinearRegression

evaluate our model while checking coeffcients

array([2.15282755e+01, 1.64883282e+05, 1.22368678e+05, 2.23380186e+03,

In [26]: #create a dataframe for better understanding on the coefficients

Avg. Area Income 21.528276

Avg. Area House Age 164883.282027

Avg. Area Number of Rooms 122368.678027

Avg. Area Number of Bedrooms 2233.801864

Area Population 15.150420

Predicting the test set

In [38]: predictions #predicted prices for the house

array([1260960.70567627, 827588.7556033 , 1742421.24254342, ...,

In [39]: y_test #actual vlaues of the house

C:\Users\Sai Shri\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarni

Regression Evaluation Matrics

Mean Absolute Error

Mean sqaured Error

Root Mean sqaured Error

You might also like