
Linear Regression


In [1]: import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]: plt.rcParams['figure.figsize'] = [19, 8]

In [3]: import warnings


warnings.filterwarnings('ignore')

Variables
A variable is simply data; in a DataFrame, the columns are the variables.

Variables are of two types:

Dependent Variable
Independent Variable

Dependent Variable:

The variable whose value is to be forecasted or predicted is the dependent variable.

The value of the dependent variable is dependent on the values of the independent
variables.

Dependent variables are also called response variables or target variables.

Dependent variables are generally represented by the letter y.

Independent Variables

Independent variables are used to explain the dependent variable.

Independent variables do not depend on any other variable.

Independent variables are also called feature variables or regressors.

Independent variables are represented by the letter x.

Regression
Regression is a statistical approach to analyze the relationship between a dependent
variable and one or more independent variables.

Regression is used to determine the most suitable function that characterizes the
relationship between dependent and independent variables.

Linear Regression
Linear regression is a machine learning algorithm that models the linear relationship
between a dependent variable and one or more independent variables.

Two variables are in a linear relationship when their values can be represented by a
straight line.

Simple Linear Regression


A simple linear regression model has one independent variable that has a linear
relationship with the dependent variable.

With more than one independent variable, the model is called Multiple Linear Regression.

The Linear Equation


In mathematics, a straight line is represented by the equation: y = mx + c, where m is
slope and c is a constant.

Using the above equation, the value of y (dependent variable) can be calculated using x
(independent variable).

In statistics, the equation is written as: y = β0 + β1 x + ϵ.

Here β0 is the intercept, β1 is the slope and ϵ is the error component.

Slope: β1 = Δy/Δx, the change in y per unit change in x.

Intercept: β0 = the value of y when x is 0.

The error component (term), ϵ, represents the distance of the actual value from the value
predicted by the regression line.

When the relationship between the independent variable (x) and the dependent variable
(y) is linear, a Linear Regression model can be applied to analyze the data.

Example

In [4]: data = {'voltage': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                'current': [0, 1, 2, 2.99, 4, 5, 6, 6.99, 8, 9, 10]}

In [5]: df = pd.DataFrame(data)

In [6]: df
Out[6]:
    voltage  current
0         0     0.00
1         1     1.00
2         2     2.00
3         3     2.99
4         4     4.00
5         5     5.00
6         6     6.00
7         7     6.99
8         8     8.00
9         9     9.00
10       10    10.00

In [7]: plt.scatter(data=df, x="voltage", y="current", s=200)


plt.title("Voltage Vs Current")
plt.xlabel("Voltage")
plt.ylabel("Current")
plt.show()

Example

In [8]: homeprice = {'area': [2600, 3000, 3200, 3600, 4000],
                     'price': [550000, 565000, 610000, 680000, 725000]}

In [9]: df = pd.DataFrame(homeprice)

In [10]: df
Out[10]:
   area   price
0  2600  550000
1  3000  565000
2  3200  610000
3  3600  680000
4  4000  725000

In [11]: plt.scatter(data=df, x='area', y='price', s=200)


plt.title("House Area Vs Price")
plt.xlabel("Area")
plt.ylabel("Price")
plt.show()

From the above plot, the data points can more or less be connected by a straight line, so a
Simple Linear Regression model can be applied to this data.

Ordinary Least Square Method


The regression line that best explains the trend in the data is the best fit line.

The line may pass through all of the points or none of them.

The method minimizes the sum of squares of the error terms, i.e., it determines the values
of β0 and β1 at which the sum of squared errors is smallest.

β0 = ȳ − β1x̄

β1 = Cov(x, y) / Var(x)

β1 = ∑((x − x̄)(y − ȳ)) / ∑(x − x̄)²
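
These formulas can be checked directly with NumPy. Below is a minimal sketch using the voltage/current values from In [4] (copied here so the snippet is self-contained); for this nearly linear data the slope should come out close to 1 and the intercept close to 0.

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)   # voltage
y = np.array([0, 1, 2, 2.99, 4, 5, 6, 6.99, 8, 9, 10])          # current

# beta_1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta_0 = y_bar - beta_1 * x_bar
beta_0 = y.mean() - beta_1 * x.mean()

print(beta_1, beta_0)   # slope ≈ 1, intercept ≈ 0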

Measures of Variation

Sum of Squares Total (SST)


The total sum of squares is the sum of squared differences between the target variable
and its mean value.

It is also known as the Total Sum of Squares (TSS).

SST = ∑(y − ȳ)²

Sum of Squares Regression (SSR)

The sum of squares regression is the sum of squared differences between the predicted
value and the mean of the target variable.

It is also known as the Regression Sum of Squares (RSS).

SSR = ∑(y_predict − ȳ)²

Sum of Squares of Error (SSE)

The sum of squares of error is the sum of squared differences between the target variable
and its predicted value.

It is also known as the Error Sum of Squares.

SSE = ∑(y − y_predict)²

Total Variation

SST = SSR + SSE

Coefficient of Determination (R²)

R² gives the proportion of variation in the target variable that is explained by the
regression model.

R² = SSR / SST

R² = 1 − (SSE / SST)

Note: 0 ≤ R² ≤ 1
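
The three measures and R² can be computed by hand. The sketch below uses small made-up numbers (not from the notebook's datasets) purely to illustrate the formulas; note that SST = SSR + SSE holds exactly only when the predictions come from a least-squares fit with an intercept.

y_actual  = np.array([3.0, 5.0, 7.0, 9.0])   # made-up target values
y_predict = np.array([3.2, 4.8, 7.1, 8.9])   # made-up predictions

sst = np.sum((y_actual - y_actual.mean()) ** 2)    # total sum of squares
ssr = np.sum((y_predict - y_actual.mean()) ** 2)   # regression sum of squares
sse = np.sum((y_actual - y_predict) ** 2)          # error sum of squares

r_square = 1 - sse / sst
print(sst, ssr, sse, r_square)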

Case Study

Predict the Salary of an employee depending on the Years of Experience

Read the Data

In [12]: salary_df = pd.read_csv("./datasets/Salary_Data.csv")

Exploratory Data Analysis

In [13]: salary_df.shape
Out[13]: (30, 2)

In [14]: salary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YearsExperience 30 non-null float64
1 Salary 30 non-null int64
dtypes: float64(1), int64(1)
memory usage: 612.0 bytes

In [15]: salary_df.head()

Out[15]:
   YearsExperience  Salary
0              1.1   39343
1              1.3   46205
2              1.5   37731
3              2.0   43525
4              2.2   39891

In [16]: salary_df.describe()

Out[16]:
       YearsExperience         Salary
count        30.000000      30.000000
mean          5.313333   76003.000000
std           2.837888   27414.429785
min           1.100000   37731.000000
25%           3.200000   56720.750000
50%           4.700000   65237.000000
75%           7.700000  100544.750000
max          10.500000  122391.000000

Data Visualization

In [17]: plt.scatter(data=salary_df, x='YearsExperience', y='Salary', s=200)


plt.title("Salary based on the Years of experience")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()
The data shows a linear relationship between YearsExperience and Salary, so Simple
Linear Regression can be used.

Independent Variable: YearsExperience

Dependent Variable: Salary

In [18]: X = salary_df.loc[:, 'YearsExperience'].values

y = salary_df.loc[:, 'Salary'].values

Split the Data into Training and Test Data

The data is divided into two parts: one for training the model and one for testing it.

Generally 70–80% of the rows are used for training and the remaining rows for testing
the model; here an 80/20 split is used (test_size=0.2).

In [19]: from sklearn.model_selection import train_test_split

In [20]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_

In [21]: X_train.shape, X_test.shape, y_train.shape, y_test.shape

Out[21]: ((24,), (6,), (24,), (6,))

In [22]: X_train.reshape(-1, 1).shape

Out[22]: (24, 1)

Train the model

In [23]: from sklearn.linear_model import LinearRegression

In [24]: reg_model = LinearRegression()

In [25]: reg_model.fit(X_train.reshape(-1, 1), y_train.reshape(-1, 1))


Out[25]: LinearRegression()

Find the coefficient m i.e., slope

In [26]: reg_model.coef_

Out[26]: array([[9312.57512673]])

Find the intercept i.e., c

In [27]: reg_model.intercept_

Out[27]: array([26780.09915063])
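
The fitted line can be used by hand as y = mx + c. Below is a quick sanity check, assuming reg_model from In [25] is still in scope (5.0 years of experience is an arbitrary example value):

years = 5.0
manual_salary = reg_model.coef_[0][0] * years + reg_model.intercept_[0]
print(manual_salary)   # ≈ 73343, same as reg_model.predict([[5.0]])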

Predict the Salary for the test data

In [28]: y_predicted = reg_model.predict(X_test.reshape(-1, 1))

In [29]: y_predicted

Out[29]: array([[ 40748.96184072],
                [122699.62295594],
                [ 64961.65717022],
                [ 63099.14214487],
                [115249.56285456],
                [107799.50275317]])

In [30]: y_test

Out[30]: array([ 37731, 122391, 57081, 63218, 116969, 109431], dtype=int64)

Model Evaluation

In [31]: from sklearn.metrics import r2_score

The R Squared Value/R2 Score

In [32]: r_square = r2_score(y_test.reshape(-1, 1), y_predicted)

In [33]: r_square

Out[33]: 0.988169515729126

In [34]: print(f"Accuracy = {r_square:.2%}")

Accuracy = 98.82%
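
Strictly speaking, R² measures goodness of fit rather than classification-style accuracy. If desired, other common regression metrics can be computed from the same test data; a small sketch using sklearn.metrics (these lines are an addition, not part of the original notebook):

from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, y_predicted)            # average absolute error, in salary units
rmse = np.sqrt(mean_squared_error(y_test, y_predicted))   # root mean squared error
print(mae, rmse)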

Data Visualization

In [35]: plt.scatter(x=X_test, y=y_test, color='red', s=300)


plt.scatter(x=X_test, y=y_predicted, color='green', s=300)
plt.title("Salary Test Data Vs Predicted")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.legend(["Test Value", "Predicted Value"], loc="lower right")

plt.show()

In [36]: sns.lmplot(data=salary_df, x='YearsExperience', y='Salary')

plt.show()

Multiple Linear Regression


A Simple Linear Regression model uses only one independent variable to predict the
dependent variable's value.
A Multiple Linear Regression model uses multiple independent variables to predict the
dependent variable's value.

The equation of a Multiple Linear Regression model is of the form:

y = m1x1 + m2x2 + m3x3 + ... + b

Here y is the dependent (target) variable, x1, x2, x3 are the independent variables,
m1, m2, m3 are the coefficients of the independent variables, and b is the intercept.

The relationship between y and x1, x2, x3 should be linear.

Case Study

Predict the price of a home from its area, number of bedrooms, and age

Read the data

In [37]: homeprice_df = pd.read_csv('./datasets/homeprices.csv')

Exploratory Data Analysis

In [38]: homeprice_df.shape

Out[38]: (6, 4)

In [39]: homeprice_df.head()

Out[39]:
   area  bedrooms  age   price
0  2600       3.0   20  550000
1  3000       4.0   15  565000
2  3200       NaN   18  610000
3  3600       3.0   30  595000
4  4000       5.0    8  760000

In [40]: homeprice_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area 6 non-null int64
1 bedrooms 5 non-null float64
2 age 6 non-null int64
3 price 6 non-null int64
dtypes: float64(1), int64(3)
memory usage: 324.0 bytes

In [41]: homeprice_df.duplicated().sum()
Out[41]: 0

In [42]: homeprice_df['bedrooms'].skew()

Out[42]: 0.0

In [43]: homeprice_df['bedrooms'].fillna(homeprice_df['bedrooms'].mean(), inplace=True)

In [44]: homeprice_df.isnull().sum()

Out[44]: area 0
bedrooms 0
age 0
price 0
dtype: int64

Data Visualization

In [45]: sns.pairplot(data=homeprice_df)

plt.show()
Model Training

In [46]: X = homeprice_df.loc[:, 'area':'age'].values

y = homeprice_df['price'].values

In [47]: from sklearn.linear_model import LinearRegression

In [48]: linear_model = LinearRegression()

In [49]: linear_model.fit(X, y)

Out[49]: LinearRegression()

In [50]: linear_model.coef_

Out[50]: array([ 142.895644 , -48591.66405516, -8529.30115951])

In [51]: linear_model.intercept_

Out[51]: 485561.8928233979

In [52]: y_predict = linear_model.predict(X)
         y_predict

Out[52]: array([540729.55186458, 591942.6512065 , 594933.87652772, 598332.18426826,
                745951.73926671, 803109.99686623])
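
The fitted equation y = m1·area + m2·bedrooms + m3·age + b can be verified by hand. For the first row of the data (area=2600, bedrooms=3, age=20), plugging in the coefficients from Out[50] and the intercept from Out[51] should reproduce the first value of y_predict:

area, bedrooms, age = 2600, 3.0, 20
manual_price = (linear_model.coef_[0] * area
                + linear_model.coef_[1] * bedrooms
                + linear_model.coef_[2] * age
                + linear_model.intercept_)
print(manual_price)   # ≈ 540729.55, matching y_predict[0]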

Model Evaluation

In [53]: from sklearn.metrics import r2_score

In [54]: r2_score(y, y_predict)

Out[54]: 0.9760698937818199

In [55]: linear_model.score(X, y)

Out[55]: 0.9760698937818199

Polynomial Linear Regression


Simple and Multiple Linear Regression models are used only when the data points are in
a straight line.

It is not always possible for the data points to lie on a straight line; they may instead
form a curve.

Polynomial Linear Regression is used on non-linear data.


Polynomial Linear Regression is a special case of Multiple Linear Regression.

In Multiple Linear Regression, each term is without any power value but in Polynomial
Linear Regression, each term is raised to a power.

The first term has power 1, the second term has power 2, and the nth term has the power
n.

The equation of a Polynomial Linear Regression model is of the form:

y = m1x + m2x² + m3x³ + ... + b

Here y is the dependent (target) variable, the powers of x (x, x², x³, ...) act as the
independent variables, m1, m2, m3 are their coefficients, and b is the intercept.
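
scikit-learn has no separate polynomial regression estimator; the usual approach is to expand the feature with PolynomialFeatures and then fit an ordinary LinearRegression on the expanded matrix. A minimal sketch with made-up data (not from the notebook's datasets):

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Made-up curved data: y = 2 + 3x + 0.5x^2
x = np.arange(0, 10, 0.5).reshape(-1, 1)
y = 2 + 3 * x.ravel() + 0.5 * x.ravel() ** 2

poly = PolynomialFeatures(degree=2, include_bias=False)   # adds the x and x^2 columns
x_poly = poly.fit_transform(x)

poly_model = LinearRegression()
poly_model.fit(x_poly, y)
print(poly_model.coef_, poly_model.intercept_)   # ≈ [3.0, 0.5] and ≈ 2.0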

Case Study

Bias and Variance


While building a machine learning model, the data is divided into 2 parts:

training data, to train the model


test data, to test the model

A machine learning model should yield correct results on both the training and the test
data; only then is the model accurate.

When a model cannot capture the relationship in the data, the training data points show
large deviations from the regression line. This inability of the model to accurately
represent the relationship is called bias.

Bias represents the deviation of the model's predictions from the actual data.

On the other hand, a model may fit the training data very closely yet fit the test data
poorly.

The difference in fit between the training and test datasets is called variance.

When the model works well on the training data, it has low bias.

When the model fails on the test (new) data, it has high variance.

Bias and variance typically trade off against each other; what is needed is low bias
and low variance.

If the model shows high bias and low variance, it is called underfitting.

If the model shows low bias and high variance, it is called overfitting.

Sum of Squared Residual


The Simple Linear Regression model fits a straight line to the data.
The actual data points usually deviate somewhat from this straight line, and these
deviations must be minimized.

To minimize the deviations, they are squared and summed, and the line that minimizes this
sum is chosen.

The equation of the linear regression line is given by:

y = β0 + β1 x + ϵ

ϵ = y − β0 − β1 x

ϵ = y − (β0 + β1 x)

ϵ = y_actual − y_predict

The error term exists for every observation in the data.

Squared Error:

ϵi² = (y_actual − y_predict)²

Sum of Squared Error:

Sum of Squared Error = ∑ϵi²

Why take the square of the deviations?

When the raw deviations are summed, the positive and negative deviations may cancel out.
When the deviations are squared, negative deviations become positive and add to the
overall deviation.
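
A tiny numeric illustration of this point, with made-up residuals:

residuals = np.array([2.0, -2.0, 1.5, -1.5])
print(residuals.sum())          # 0.0  -- signed deviations cancel out
print((residuals ** 2).sum())   # 12.5 -- squared deviations accumulate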

Regularization
Regularization is a technique for reducing variance (overfitting) in order to achieve
better accuracy on new data.

When regularization is applied to a Linear Regression model, the common variants are:

Ridge Regression
Lasso Regression
ElasticNET Regression

The above 3 models are developed to minimize the variance.

Ridge Regression
To control the trade-off between bias and variance, a regularization parameter named λ is
used.
Increasing λ reduces the variance, but increasing it too much also increases the bias.
Hence it is important to tune λ to a value that keeps both the bias and the variance low.

The regularization parameter λ is also known as the penalty parameter.

Ridge Regression minimizes: ∑ϵi² + λβ1²

The term λβ1² (λ times the square of the slope) is called the L2 penalty term.

The λ value can be anything from 0 to positive infinity.

The addition of the penalty term is called regularization.

When λ is 0, the penalty term vanishes and Ridge Regression is identical to Simple Linear
Regression.

Ridge regularization is used when there are many features in the dataset and all of them
have small coefficients.

Lasso Regression
LASSO stands for Least Absolute Shrinkage and Selection Operator.

Lasso Regression is the same as Ridge Regression except that the penalty term is
λ × |slope| (λ times the absolute value of the slope).

Lasso Regression minimizes: ∑ϵi² + λ|β1|

Lasso Regression has an L1 penalty term; L1 refers to λ|β1|.

Lasso Regression introduces some bias but has very low variance, so its predictions can be
more accurate than those of Ridge Regression.

Ridge Regression keeps all the columns in the dataset when predicting the output, whereas
Lasso Regression keeps only the columns with the greatest influence on the output and
shrinks the other coefficients to zero. Hence Lasso Regression is often used for feature
selection.

ElasticNET Regression
ElasticNET Regression is used when a dataset has a large number of columns and a large
volume of data.

ElasticNET Regression is a combination of Ridge and Lasso Regressions.

ElasticNET Regression minimizes: ∑ϵi² + λ1β1² + λ2|β1|

Case Study
In [56]: houseprice_df = pd.read_csv("./datasets/Melbourne_housing_FULL.csv")

In [57]: houseprice_df.shape

Out[57]: (34857, 21)

In [58]: houseprice_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Suburb 34857 non-null object
1 Address 34857 non-null object
2 Rooms 34857 non-null int64
3 Type 34857 non-null object
4 Price 27247 non-null float64
5 Method 34857 non-null object
6 SellerG 34857 non-null object
7 Date 34857 non-null object
8 Distance 34856 non-null float64
9 Postcode 34856 non-null float64
10 Bedroom2 26640 non-null float64
11 Bathroom 26631 non-null float64
12 Car 26129 non-null float64
13 Landsize 23047 non-null float64
14 BuildingArea 13742 non-null float64
15 YearBuilt 15551 non-null float64
16 CouncilArea 34854 non-null object
17 Lattitude 26881 non-null float64
18 Longtitude 26881 non-null float64
19 Regionname 34854 non-null object
20 Propertycount 34854 non-null float64
dtypes: float64(12), int64(1), object(8)
memory usage: 5.6+ MB

In [59]: houseprice_df.head()
Out[59]:
       Suburb             Address  Rooms Type      Price Method SellerG       Date  ...
0  Abbotsford       68 Studley St      2    h        NaN     SS  Jellis  3/09/2016  ...
1  Abbotsford        85 Turner St      2    h  1480000.0      S  Biggin  3/12/2016  ...
2  Abbotsford     25 Bloomburg St      2    h  1035000.0      S  Biggin  4/02/2016  ...
3  Abbotsford  18/659 Victoria St      3    u        NaN     VB  Rounds  4/02/2016  ...
4  Abbotsford        5 Charles St      3    h  1465000.0     SP  Biggin  4/03/2017  ...

5 rows × 21 columns

In [60]: houseprice_df = houseprice_df.loc[:, ['Suburb', 'Rooms', 'Type', 'Method', 'SellerG',
                                               'Distance', 'Bedroom2', 'Bathroom', 'Car',
                                               'Landsize', 'BuildingArea', 'CouncilArea',
                                               'Regionname', 'Propertycount', 'Price']]

In [61]: houseprice_df.shape

Out[61]: (34857, 15)

In [62]: houseprice_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Suburb 34857 non-null object
1 Rooms 34857 non-null int64
2 Type 34857 non-null object
3 Method 34857 non-null object
4 SellerG 34857 non-null object
5 Distance 34856 non-null float64
6 Bedroom2 26640 non-null float64
7 Bathroom 26631 non-null float64
8 Car 26129 non-null float64
9 Landsize 23047 non-null float64
10 BuildingArea 13742 non-null float64
11 CouncilArea 34854 non-null object
12 Regionname 34854 non-null object
13 Propertycount 34854 non-null float64
14 Price 27247 non-null float64
dtypes: float64(8), int64(1), object(6)
memory usage: 4.0+ MB

In [63]: houseprice_df['Distance'].fillna(0, inplace=True)

houseprice_df['Bedroom2'].fillna(0, inplace=True)
houseprice_df['Bathroom'].fillna(0, inplace=True)

houseprice_df['Car'].fillna(0, inplace=True)

houseprice_df['Propertycount'].fillna(0, inplace=True)

In [64]: houseprice_df['Landsize'].fillna(houseprice_df['Landsize'].mean(), inplace=True)
         houseprice_df['BuildingArea'].fillna(houseprice_df['BuildingArea'].mean(), inplace=True)

In [65]: houseprice_df.dropna(inplace=True)

In [66]: houseprice_df.isnull().sum()

Out[66]: Suburb 0
Rooms 0
Type 0
Method 0
SellerG 0
Distance 0
Bedroom2 0
Bathroom 0
Car 0
Landsize 0
BuildingArea 0
CouncilArea 0
Regionname 0
Propertycount 0
Price 0
dtype: int64

In [67]: houseprice_df.duplicated().sum()

Out[67]: 50

In [68]: houseprice_df = pd.get_dummies(data=houseprice_df, drop_first=True, dtype=np.int

In [69]: houseprice_df.head()

Out[69]:
   Rooms  Distance  Bedroom2  Bathroom  Car  Landsize  BuildingArea  Propertycount  ...
1      2       2.5       2.0       1.0  1.0     202.0      160.2564         4019.0  ...
2      2       2.5       2.0       1.0  0.0     156.0       79.0000         4019.0  ...
4      3       2.5       3.0       2.0  0.0     134.0      150.0000         4019.0  ...
5      3       2.5       3.0       2.0  1.0      94.0      160.2564         4019.0  ...
6      4       2.5       3.0       1.0  2.0     120.0      142.0000         4019.0  ...

5 rows × 745 columns
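
get_dummies replaces each text column (Suburb, Type, Method, SellerG, CouncilArea, Regionname) with one 0/1 column per category, which is why the frame grows to 745 columns; drop_first=True drops one category per column to avoid redundancy. A toy illustration unrelated to the housing data (the dtype used here is an assumption, since the original call above is truncated):

toy = pd.DataFrame({'Type': ['h', 'u', 't', 'h']})
print(pd.get_dummies(toy, drop_first=True, dtype=np.int64))
# Keeps Type_t and Type_u; a row with both 0 represents the dropped 'h' category.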

 

In [70]: y = houseprice_df['Price']
X = houseprice_df.drop('Price', axis=1)

In [71]: from sklearn.model_selection import train_test_split

In [72]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_

Simple Linear Regression

In [73]: from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

In [74]: linear_model = LinearRegression()

In [75]: linear_model.fit(X_train, y_train)

Out[75]: LinearRegression()

In [76]: linear_model.score(X_train, y_train)

Out[76]: 0.6827792395792723

In [77]: linear_model.score(X_test, y_test)

Out[77]: 0.13853683161540487

Ridge Regression

In [78]: ridge_model = Ridge(alpha=1.0, max_iter=100)

In [79]: ridge_model.fit(X_train, y_train)

Out[79]: Ridge(max_iter=100)

In [80]: ridge_model.score(X_train, y_train)

Out[80]: 0.6796668251040214

In [81]: ridge_model.score(X_test, y_test)

Out[81]: 0.6701765758295284

Lasso Regression

In [82]: lasso_model = Lasso(alpha=50, max_iter=100)

In [83]: lasso_model.fit(X_train, y_train)


Out[83]: Lasso(alpha=50, max_iter=100)

In [84]: lasso_model.score(X_train, y_train)

Out[84]: 0.6767356948457683

In [85]: lasso_model.score(X_test, y_test)

Out[85]: 0.6637669697137103
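
Because the L1 penalty can shrink coefficients exactly to zero, the fitted model itself reveals which columns Lasso effectively dropped. A quick check, assuming lasso_model from In [83] is still in scope:

n_zero = np.sum(lasso_model.coef_ == 0)
print(n_zero, "of", lasso_model.coef_.size, "coefficients are exactly zero")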

ElasticNET Regression

In [86]: elastic_model = ElasticNet(alpha=100, l1_ratio=0.5)

In [87]: elastic_model.fit(X_train, y_train)

Out[87]: ElasticNet(alpha=100)

In [88]: elastic_model.score(X_train, y_train)

Out[88]: 0.06482342181996403

In [89]: elastic_model.score(X_test, y_test)

Out[89]: 0.07381459514861177

