Linear Regression
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Variables
A variable is simply a piece of data. In a dataframe, the columns are called variables. Variables are of two kinds:
Dependent Variable
Independent Variable
Dependent Variable:
The value of the dependent variable depends on the values of the independent variables. It is the quantity the model predicts (the target).
Independent Variables:
The independent variables are the inputs (features) whose values are used to predict the dependent variable.
Regression
Regression is a statistical approach for analyzing the relationship between a dependent variable and one or more independent variables.
Regression is used to determine the function that best characterizes the relationship between the dependent and independent variables.
Linear Regression
Linear regression is a machine learning algorithm that relies on a linear relationship between a dependent variable and one or more independent variables.
Two variables are in a linear relationship when their values can be represented by a straight line.
With more than one independent variable, it is called Multiple Linear Regression.
y = β0 + β1 x + ϵ
Using the above equation, the value of y (dependent variable) can be calculated from x (independent variable).
Slope β1 = Δy/Δx.
The error component (term), ϵ, represents the distance of the actual value from the value predicted by the regression line.
When the relationship between the independent variable (x) and the dependent variable (y) is linear, the Linear Regression model is applied to analyze the data.
Example
data = {'voltage': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'current': [0.00, 1.00, 2.00, 2.99, 4.00, 5.00, 6.00, 6.99, 8.00, 9.00, 10.00]}
In [5]: df = pd.DataFrame(data)
In [6]: df
Out[6]: voltage current
0 0 0.00
1 1 1.00
2 2 2.00
3 3 2.99
4 4 4.00
5 5 5.00
6 6 6.00
7 7 6.99
8 8 8.00
9 9 9.00
10 10 10.00
Example
In [8]: homeprice = {'area': [2600, 3000, 3200, 3600, 4000],
                     'price': [550000, 565000, 610000, 680000, 725000]}
In [9]: df = pd.DataFrame(homeprice)
In [10]: df
Out[10]: area price
0 2600 550000
1 3000 565000
2 3200 610000
3 3600 680000
4 4000 725000
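The scatter plot referenced below did not survive extraction; a minimal sketch to reproduce it, using the column names from the dataframe above:
plt.scatter(df['area'], df['price'])
plt.xlabel('area')
plt.ylabel('price')
plt.show()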
From the above plot, the data points can be connected more or less by a straight line. Hence we can apply the Simple Linear Regression machine learning model to this data.
The line may pass through all of the points or through none of the points.
The method aims at minimizing the sum of squares of the error terms, i.e., it determines the values of β0 and β1 at which the total squared error is minimum:
β1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²
β0 = ȳ − β1 x̄
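As a quick illustration, a minimal sketch computing β1 and β0 by hand with NumPy on the voltage/current data above (numpy is an assumed extra import):
import numpy as np

x = np.array(data['voltage'], dtype=float)
y = np.array(data['current'], dtype=float)

# Least-squares estimates of slope and intercept
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()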
Measures of Variation
The sum of squares due to regression (SSR) is the sum of squared differences between the predicted values and the mean of the target variable.
The sum of squared errors (SSE) is the sum of squared differences between the target variable and its predicted values.
Total Variation
The total sum of squares (SST) is the sum of squared differences between the target variable and its mean. It is the sum of the explained and unexplained variation:
SST = SSR + SSE
Coefficient of Determination R2
R2 gives the percentage of the variation in the target variable that is explained by the model.
R2 = SSR/SST
R2 = 1 − (SSE/SST)
Note: 0 ≤ R2 ≤ 1
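A minimal sketch of these quantities in NumPy, continuing the example above (y_pred holds the fitted line's predictions; the names are illustrative):
y_pred = beta0 + beta1 * x               # predictions from the fitted line

ssr = np.sum((y_pred - y.mean()) ** 2)   # explained variation
sse = np.sum((y - y_pred) ** 2)          # unexplained variation
sst = np.sum((y - y.mean()) ** 2)        # total variation (= ssr + sse)

r_square = 1 - sse / sst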
Case Study
Salary prediction using years of experience
In [13]: salary_df.shape
Out[13]: (30, 2)
In [14]: salary_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YearsExperience 30 non-null float64
1 Salary 30 non-null int64
dtypes: float64(1), int64(1)
memory usage: 612.0 bytes
In [15]: salary_df.head()
Out[15]:   YearsExperience  Salary
        0              1.1   39343
        1              1.3   46205
        2              1.5   37731
        3              2.0   43525
        4              2.2   39891
In [16]: salary_df.describe()
Data Visualization
[Scatter plot of YearsExperience vs. Salary; the figure itself did not survive extraction.]
X = salary_df.loc[:, ['YearsExperience']].values
y = salary_df.loc[:, 'Salary'].values
The data will be divided into 2 parts: one for training the model and one for testing it.
Generally, 70% of the rows are used for training and the remaining 30% for testing the model.
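The cells performing the split and the fit are missing from the extract; a minimal sketch, assuming scikit-learn's train_test_split with test_size=0.2 (consistent with the 24-row training shape below; random_state is an illustrative choice):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# 80/20 split of the 30 rows -> 24 train, 6 test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

reg_model = LinearRegression()
reg_model.fit(X_train, y_train)
X_train.shape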
Out[22]: (24, 1)
LinearRegression()
In [26]: reg_model.coef_
Out[26]: array([[9312.57512673]])
In [27]: reg_model.intercept_
Out[27]: array([26780.09915063])
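The prediction cell is missing; presumably something like:
y_predicted = reg_model.predict(X_test)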
In [29]: y_predicted
In [30]: y_test
Model Evaluation
In [33]: r_square
Out[33]: 0.988169515729126
R2 ≈ 0.9882, i.e., the model explains about 98.82% of the variance in the test data.
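The cell computing r_square is missing; a minimal sketch using scikit-learn's r2_score:
from sklearn.metrics import r2_score

r_square = r2_score(y_test, y_predicted)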
Data Visualization
[Scatter plots of the training and test data with the fitted regression line; the figures did not survive extraction.]
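A minimal sketch to reproduce the plots (names follow the cells above):
# Training data with the fitted line
plt.scatter(X_train, y_train)
plt.plot(X_train, reg_model.predict(X_train), color='red')
plt.xlabel('YearsExperience')
plt.ylabel('Salary')
plt.show()

# Test data against the same fitted line
plt.scatter(X_test, y_test)
plt.plot(X_train, reg_model.predict(X_train), color='red')
plt.xlabel('YearsExperience')
plt.ylabel('Salary')
plt.show()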
Multiple Linear Regression
When the target depends on several independent variables, the line generalizes to:
y = m1x1 + m2x2 + m3x3 + ... + b
Case Study
Home price data prediction using area, no. of bedrooms, and age
In [38]: homeprice_df.shape
Out[38]: (6, 4)
In [39]: homeprice_df.head()
In [40]: homeprice_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area 6 non-null int64
1 bedrooms 5 non-null float64
2 age 6 non-null int64
3 price 6 non-null int64
dtypes: float64(1), int64(3)
memory usage: 324.0 bytes
In [41]: homeprice_df.duplicated().sum()
Out[41]: 0
In [42]: homeprice_df['bedrooms'].skew()
Out[42]: 0.0
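The cell between In [42] and In [44] is missing from the extract; per the info output above, it must have filled the single missing bedrooms value. A hedged sketch, assuming median imputation (with skew 0, mean and median coincide):
homeprice_df['bedrooms'] = homeprice_df['bedrooms'].fillna(homeprice_df['bedrooms'].median())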
In [44]: homeprice_df.isnull().sum()
Out[44]: area 0
bedrooms 0
age 0
price 0
dtype: int64
Data Visualization
In [45]: sns.pairplot(data=homeprice_df)
plt.show()
Model Training
y = homeprice_df['price'].values
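The cells defining the feature matrix and the model are missing; presumably, using the three features named above:
from sklearn.linear_model import LinearRegression

X = homeprice_df[['area', 'bedrooms', 'age']].values
linear_model = LinearRegression()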
In [49]: linear_model.fit(X, y)
Out[49]: LinearRegression()
In [50]: linear_model.coef_
In [51]: linear_model.intercept_
Out[51]: 485561.8928233979
y_predict = linear_model.predict(X)
Model Evaluation
Out[54]: 0.9760698937818199
In [55]: linear_model.score(X, y)
Out[55]: 0.9760698937818199
Polynomial Linear Regression
It is not always possible to expect the data points to lie on a straight line. They may form a curved line.
In Multiple Linear Regression, each term has power 1, but in Polynomial Linear Regression each term is raised to a power: the first term has power 1, the second term has power 2, and the nth term has power n.
y = m1x1 + m2x2² + m3x3³ + ... + b
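A minimal sketch of how polynomial terms are typically generated with scikit-learn: PolynomialFeatures expands x into [1, x, x², ...], and a LinearRegression is then fitted on the expanded features (the data here is illustrative):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Illustrative curved data
X = np.arange(10).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + X.ravel() + 2

poly = PolynomialFeatures(degree=2)   # adds x and x^2 columns (plus bias)
X_poly = poly.fit_transform(X)

poly_model = LinearRegression()
poly_model.fit(X_poly, y)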
Case Study
[The case study content did not survive extraction.]
Bias and Variance
A machine learning model should be able to yield correct results on both training and test data; only then is the model accurate.
When a model is too simple for the data, the training data points show large deviations from the regression line. This reflects the model's inability to accurately represent the relationship between the data points, and is called bias.
Bias represents the deviation of the values predicted by the model from the actual data.
On the other hand, a model may fit the training data very closely yet still fail on unseen data.
When the model works well on training data, it has low bias.
When the model fails on test (new) data, it has high variance.
Bias and variance tend to move in opposite directions. What is needed is low bias and low variance.
If the model shows high bias and low variance, it is called underfitting.
If the model shows low bias and high variance, it is called overfitting.
To minimize the deviations, calculate the sum of squares of deviations and add that to
the equation of the line.
y = β0 + β1 x + ϵ
ϵ = y − β0 − β1 x
ϵ = y − (β0 + β1 x)
ϵ = yactual − ypredict
Squared Error:
When the deviations are taken alone, the positive and negative deviations may cancel out. But when the deviations are squared, the negative deviations become positive and add to the overall deviation.
Regularization
Regularization is a technique to minimize the variance or overfitting to achieve better
accuracy.
When regularization is applied to the Linear Regression model, we get variations such as:
Ridge Regression
Lasso Regression
ElasticNET Regression
Ridge Regression
To control the impact on bias and variance, a regularization parameter named λ is used.
When the λ value is increased, it reduces the variance. But if the value of λ is increased too much, it will increase the bias as well. Hence it is important to tune λ to the correct value to keep both the bias and the variance low.
Ridge Regression = β0 + β1 x + ∑ϵi² + λβ1²
When λ is 0, the penalty term vanishes and Ridge Regression reduces to Simple Linear Regression; at that setting the two models behave identically on any (train or test) dataset.
Ridge regularization is used when there are many features in the dataset and all of them have small coefficients.
Lasso Regression
LASSO stands for Least Absolute Shrinkage and Selection Operator.
Lasso Regression is the same as Ridge Regression except that the penalty term is calculated as λ × |β1| (lambda times the modulus of the slope).
Lasso Regression equation = β0 + β1 x + ∑ϵi² + λ|β1|
Lasso Regression retains some bias but has very low variance; hence its predictions can be more accurate than Ridge Regression's.
Ridge Regression considers all the columns in the dataset to predict the output, but Lasso Regression selects only the columns that influence the output the most, since it can shrink coefficients exactly to zero. Hence Lasso Regression is commonly used for feature selection.
ElasticNET Regression
ElasticNET Regression is used when a dataset has a large number of columns and a large volume of data. It combines the Ridge and Lasso penalty terms.
ElasticNET Regression equation = β0 + β1 x + ∑ϵi² + λ1 β1² + λ2 |β1|
Case Study
In [56]: houseprice_df = pd.read_csv("./datasets/Melbourne_housing_FULL.csv")
In [57]: houseprice_df.shape
Out[57]: (34857, 21)
In [58]: houseprice_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Suburb 34857 non-null object
1 Address 34857 non-null object
2 Rooms 34857 non-null int64
3 Type 34857 non-null object
4 Price 27247 non-null float64
5 Method 34857 non-null object
6 SellerG 34857 non-null object
7 Date 34857 non-null object
8 Distance 34856 non-null float64
9 Postcode 34856 non-null float64
10 Bedroom2 26640 non-null float64
11 Bathroom 26631 non-null float64
12 Car 26129 non-null float64
13 Landsize 23047 non-null float64
14 BuildingArea 13742 non-null float64
15 YearBuilt 15551 non-null float64
16 CouncilArea 34854 non-null object
17 Lattitude 26881 non-null float64
18 Longtitude 26881 non-null float64
19 Regionname 34854 non-null object
20 Propertycount 34854 non-null float64
dtypes: float64(12), int64(1), object(8)
memory usage: 5.6+ MB
In [59]: houseprice_df.head()
Out[59]:   Suburb      Address             Rooms  Type  Price      Method  SellerG  Date       Distance ...
        0  Abbotsford  68 Studley St       2      h     NaN        SS      Jellis   3/09/2016
        1  Abbotsford  85 Turner St        2      h     1480000.0  S       Biggin   3/12/2016
        2  Abbotsford  25 Bloomburg St     2      h     1035000.0  S       Biggin   4/02/2016
        3  Abbotsford  18/659 Victoria St  3      u     NaN        VB      Rounds   4/02/2016
        4  Abbotsford  5 Charles St        3      h     1465000.0  SP      Biggin   4/03/2017
5 rows × 21 columns
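The cell that trims the dataframe from 21 to 15 columns is missing; a minimal sketch, assuming the columns absent from the In [62] info output below were dropped:
houseprice_df.drop(['Address', 'Date', 'Postcode', 'YearBuilt',
                    'Lattitude', 'Longtitude'], axis=1, inplace=True)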
In [61]: houseprice_df.shape
Out[61]: (34857, 15)
In [62]: houseprice_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Suburb 34857 non-null object
1 Rooms 34857 non-null int64
2 Type 34857 non-null object
3 Method 34857 non-null object
4 SellerG 34857 non-null object
5 Distance 34856 non-null float64
6 Bedroom2 26640 non-null float64
7 Bathroom 26631 non-null float64
8 Car 26129 non-null float64
9 Landsize 23047 non-null float64
10 BuildingArea 13742 non-null float64
11 CouncilArea 34854 non-null object
12 Regionname 34854 non-null object
13 Propertycount 34854 non-null float64
14 Price 27247 non-null float64
dtypes: float64(8), int64(1), object(6)
memory usage: 4.0+ MB
# Fill missing count-like columns with 0 and BuildingArea with its mean
houseprice_df['Bedroom2'] = houseprice_df['Bedroom2'].fillna(0)
houseprice_df['Bathroom'] = houseprice_df['Bathroom'].fillna(0)
houseprice_df['Car'] = houseprice_df['Car'].fillna(0)
houseprice_df['Propertycount'] = houseprice_df['Propertycount'].fillna(0)
houseprice_df['BuildingArea'] = houseprice_df['BuildingArea'].fillna(houseprice_df['BuildingArea'].mean())
In [65]: houseprice_df.dropna(inplace=True)
In [66]: houseprice_df.isnull().sum()
Out[66]: Suburb 0
Rooms 0
Type 0
Method 0
SellerG 0
Distance 0
Bedroom2 0
Bathroom 0
Car 0
Landsize 0
BuildingArea 0
CouncilArea 0
Regionname 0
Propertycount 0
Price 0
dtype: int64
In [67]: houseprice_df.duplicated().sum()
Out[67]: 50
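The cell removing these 50 duplicates is missing; presumably:
houseprice_df.drop_duplicates(inplace=True)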
In [69]: houseprice_df.head()
Out[69]: [First rows of the cleaned dataframe; only the numeric columns (Rooms, Distance, Bedroom2, Bathroom, Car, Landsize, BuildingArea, Propertycount) are visible in the extract.]
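Linear models require numeric inputs, so the object columns (Suburb, Type, Method, SellerG, CouncilArea, Regionname) must have been encoded somewhere before fitting; a hedged sketch, assuming one-hot encoding with pd.get_dummies:
houseprice_df = pd.get_dummies(houseprice_df, drop_first=True)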
In [70]: y = houseprice_df['Price']
X = houseprice_df.drop('Price', axis=1)
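The split, fit, and scoring cells are missing; a minimal sketch consistent with the outputs below (test_size and random_state are illustrative choices):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

linear_model.score(X_train, y_train)   # training score -> Out[76]
linear_model.score(X_test, y_test)     # test score -> Out[77]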
Out[75]: LinearRegression()
Out[76]: 0.6827792395792723
Out[77]: 0.13853683161540487
The model scores about 0.68 on the training data but only about 0.14 on the test data: low bias but high variance, i.e., overfitting. Regularization should help.
Ridge Regression
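The Ridge cells are missing; a minimal sketch matching the Ridge(max_iter=100) repr below (alpha left at its default):
from sklearn.linear_model import Ridge

ridge_model = Ridge(max_iter=100)
ridge_model.fit(X_train, y_train)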
Out[79]: Ridge(max_iter=100)
Out[80]: 0.6796668251040214
Out[81]: 0.6701765758295284
With Ridge regularization, the training and test scores are now close (≈0.68 vs ≈0.67): the variance has been reduced.
Lasso Regression
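Likewise for Lasso, a minimal sketch matching the repr below:
from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=50, max_iter=100)
lasso_model.fit(X_train, y_train)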
Lasso(alpha=50, max_iter=100)
Out[84]: 0.6767356948457683
Out[85]: 0.6637669697137103
ElasticNET Regression
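And for ElasticNet, a minimal sketch matching the repr below:
from sklearn.linear_model import ElasticNet

elastic_model = ElasticNet(alpha=100)
elastic_model.fit(X_train, y_train)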
Out[87]: ElasticNet(alpha=100)
Out[88]: 0.06482342181996403
Out[89]: 0.07381459514861177
Here both scores collapse: alpha=100 penalizes the coefficients so heavily that the model underfits (high bias). The regularization strength must be tuned, just like λ above.