
Experiment 1: Linear regression

Objectives:
1- Comprehend the concept of linear regression using the least square
method.
2- Understand the Python code used for linear regression.
3- Run real-world applications in Python.

Introduction:
Least Square Method
The least square method is a procedure for finding the regression line, or best-fitted
line, for a data set described by an equation. The line is chosen so that the sum of
the squares of the residuals (the distances of the data points from the line or curve)
is as small as possible, which gives a quantitative description of the trend in the
outcomes. This kind of curve fitting arises in regression analysis, and the least
square method is the standard technique for fitting the equation that defines the curve.
Least Square Method Definition
The least-squares method is a statistical method used to find the line of best fit, of
the form y = mx + b, for a given set of data. The fitted line is called the regression
line. The main objective of this method is to make the sum of the squares of the errors
as small as possible, which is why it is called the least-squares method. It is widely
used in data fitting, where the best-fit result is the one that minimizes the sum of
squared errors, each error being the difference between an observed value and the
corresponding fitted value. The sum of squared errors also measures the variation in the
observed data. For example, with 4 data points the method produces the fit shown in the
following graph.
Figure 1
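As a small illustration of how the squared errors are computed, the following sketch (with made-up points, not the data of Figure 1) fits a line and reports the sum of squared errors in Python:

# Sum of squared errors for a small made-up data set
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 2.9, 4.2, 5.8])

m, b = np.polyfit(x, y, 1)        # least-squares fit of y = m*x + b
residuals = y - (m * x + b)       # observed minus fitted values
sse = np.sum(residuals ** 2)      # the quantity the method minimizes
print('m =', m, 'b =', b, 'SSE =', sse)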

The two basic categories of least-square problems are ordinary or linear least
squares and nonlinear least squares.
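To make the distinction concrete, the sketch below (using made-up data) fits a straight line with ordinary least squares via numpy.polyfit and an exponential curve with nonlinear least squares via scipy.optimize.curve_fit:

# Ordinary (linear) vs. nonlinear least squares on made-up data
import numpy as np
from scipy.optimize import curve_fit

x = np.linspace(0, 4, 20)
y_lin = 2.0 * x + 1.0 + np.random.normal(0, 0.2, x.size)
y_exp = 3.0 * np.exp(0.5 * x) + np.random.normal(0, 0.2, x.size)

# Ordinary least squares: fit y = m*x + b
m, b = np.polyfit(x, y_lin, 1)

# Nonlinear least squares: fit y = a*exp(k*x)
def model(t, a, k):
    return a * np.exp(k * t)

(a, k), _ = curve_fit(model, x, y_exp, p0=(1.0, 0.1))
print('linear fit:    m =', m, ', b =', b)
print('nonlinear fit: a =', a, ', k =', k)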
Limitations for Least Square Method
Even though the least-squares method is considered the best method to find the
line of best fit, it has a few limitations. They are:

• This method models only the relationship between the two variables;
all other causes and effects are not taken into consideration.
• This method is unreliable when data is not evenly distributed.
• This method is very sensitive to outliers, which can skew the results
of the least-squares analysis, as the sketch below illustrates.
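The sketch below (with made-up data) shows the effect of a single outlier on the fitted slope:

# Effect of one outlier on the least-squares slope (made-up data)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])      # roughly y = 2x

m_clean, b_clean = np.polyfit(x, y, 1)

x_out = np.append(x, 6.0)
y_out = np.append(y, 30.0)                    # one point far from the trend
m_out, b_out = np.polyfit(x_out, y_out, 1)

print('slope without outlier:', m_clean)
print('slope with outlier:   ', m_out)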
Least Square Method Graph
In the graph below, the straight line shows the potential relationship between the
independent variable and the dependent variable. The ultimate goal of this method is
to reduce the difference between the observed responses and the responses predicted
by the regression line; the smaller the residuals, the better the model fits. The fit
is obtained by minimizing the residual of each data point from the line. Residuals can
be measured vertically or perpendicularly: vertical residuals are mostly used in
polynomial and hyperplane problems, while perpendicular residuals are used in the
general case, as seen in the image below.
Figure 2
Least Square Method Formula
The least-squares method finds the curve that best fits a set of observations with the
minimum sum of squared residuals or errors. Assume the given data points are
(x1, y1), (x2, y2), (x3, y3), …, (xn, yn), where the x's are values of the independent
variable and the y's are values of the dependent variable. The method is used to fit
a straight line of the form y = mx + b, where y and x are the variables, m is the slope,
and b is the y-intercept. The slope m and the intercept b are given by:

m = (n∑xy - ∑x∑y) / (n∑x² - (∑x)²)

b = (∑y - m∑x) / n

Here, n is the number of data points.

Following are the steps to calculate the least square using the above formulas.

• Step 1: Draw a table with 4 columns where the first two columns are for the x
and y points.
• Step 2: In the next two columns, find xy and x².
• Step 3: Find ∑x, ∑y, ∑xy, and ∑x².
• Step 4: Find the value of the slope m using the above formula.
• Step 5: Calculate the value of b using the above formula.
• Step 6: Substitute the values of m and b in the equation y = mx + b.

Let us look at an example to understand this better.


Example: Let's say we have data as shown below.

x 1 2 3 4 5

y 2 5 3 8 7

Solution: We will follow the steps to find the linear line.

x       y       xy      x²
1       2       2       1
2       5       10      4
3       3       9       9
4       8       32      16
5       7       35      25
∑x = 15   ∑y = 25   ∑xy = 88   ∑x² = 55

Find the value of m by using the formula,

m = (n∑xy - ∑x∑y) / (n∑x² - (∑x)²)

m = [(5×88) - (15×25)] / [(5×55) - (15)²]

m = (440 - 375) / (275 - 225)

m = 65/50 = 1.3

Find the value of b by using the formula,

b = (∑y - m∑x)/n

b = (25 - 1.3×15)/5

b = (25 - 19.5)/5

b = 5.5/5 = 1.1

So, the required equation of the least-squares line is y = mx + b = 1.3x + 1.1.
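A quick way to check this result is to reproduce the computation with NumPy; the sketch below uses the same five data points and should recover m = 1.3 and b = 1.1:

# Verify the worked example with NumPy
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 5, 3, 8, 7], dtype=float)

n = len(x)
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x * x) - np.sum(x) ** 2)
b = (np.sum(y) - m * np.sum(x)) / n
print('m =', m)   # 1.3
print('b =', b)   # 1.1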


Important Notes
• The least-squares method is used to predict the behavior of the
dependent variable with respect to the independent variable.
• The sum of the squares of the errors measures the variation of the
observed data about the fitted line.
• The main aim of the least-squares method is to minimize the sum of the
squared errors.

Implementing the least-squares method using Python:


# Linear Regression implementation using NumPy
import matplotlib.pyplot as plt
import numpy as np

# Sample data
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 2, 4, 4, 6, 6])
plt.scatter(x, y)

n = len(x)

# Sums needed by the least-squares formulas
xy = np.sum(x * y)        # ∑xy
sumx = np.sum(x)          # ∑x
sumy = np.sum(y)          # ∑y
sq_sumx = np.sum(x * x)   # ∑x²

# Here b is the slope and a is the y-intercept of the line y = a + b*x
b = (n * xy - sumx * sumy) / (n * sq_sumx - sumx ** 2)
print('b = ', b)
a = (sumy - b * sumx) / n
print('a = ', a)

print('The linear regression equation is \ny =', a, '+', b, 'x')

# Plot the fitted line over the scatter plot
slope = b
intercept = a

def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))
plt.plot(x, mymodel)
plt.show()
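The scipy import that originally accompanied this code hints at an alternative: scipy.stats.linregress computes the same slope and intercept directly. A minimal sketch, reusing the x and y arrays above:

# Alternative: slope and intercept via scipy.stats.linregress
from scipy import stats

result = stats.linregress(x, y)
print('slope =', result.slope, ', intercept =', result.intercept)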

Implementing regression on the real-world data set:

The graph below shows the relation between Salary and Years of Experience.

Equation: y = mx + c

This is the simple linear regression equation, where c is the constant (y-intercept) and
m is the slope that describes the relationship between x (the independent variable) and
y (the dependent variable). The coefficient m can be positive or negative and gives the
change in the dependent variable for every 1-unit change in the independent variable.

β0 (y-intercept) and β1 (slope) are the coefficients whose estimated values determine
how closely the predicted values match the actual values.
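For reference, β1 and β0 can be computed directly from the data with the usual closed-form expressions (equivalent to the m and b formulas above); the sketch below uses made-up experience/salary arrays, not the dataset used later:

# Closed-form coefficients for simple linear regression (made-up data)
import numpy as np

x = np.array([1.1, 2.0, 2.9, 4.2, 5.0])            # years of experience (made up)
y = np.array([39000, 46000, 56000, 63000, 67000])  # salary (made up)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
beta0 = y.mean() - beta1 * x.mean()                                            # y-intercept
print('beta1 (slope) =', beta1)
print('beta0 (intercept) =', beta0)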
Implement Simple Linear Regression in Python

In this example, we will use salary data recording the experience of employees. The
dataset has two columns, YearsExperience and Salary.

Step 1: Import the required Python packages

We need Pandas for data manipulation, NumPy for mathematical calculations, and
Matplotlib and Seaborn for visualizations. Scikit-learn (sklearn) is used for the
machine learning operations.

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# (random_state is simply an argument to train_test_split; no extra import is needed)

Step 2: Load the dataset

Download the dataset, upload it to your notebook, and read it into a pandas DataFrame.

# Get dataset
df_sal = pd.read_csv('/content/Salary_Data.csv')
df_sal.head()
Step 3: Data analysis

Now that we have our data ready, let's analyze and understand its trend in detail.

To do that we can first describe the data below -

# Describe data
df_sal.describe()

Here, we can see that Salary ranges from 37731 to 122391, with a median of 65237.

We can also see how the data is distributed visually using the Seaborn distplot.

# Data distribution
plt.title('Salary Distribution Plot')
sns.distplot(df_sal['Salary'])
plt.show()

A distplot or distribution plot shows the variation in the data distribution.

It represents the data by combining a line with a histogram.
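Note that distplot is deprecated in recent versions of Seaborn; if it is unavailable, an equivalent plot can be produced with histplot (reusing df_sal, plt, and sns from the steps above):

# Equivalent plot with the newer Seaborn API
plt.title('Salary Distribution Plot')
sns.histplot(df_sal['Salary'], kde=True)
plt.show()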

Then we check the relationship between Salary and Experience -

# Relationship between Salary and Experience


plt.scatter(df_sal['YearsExperience'], df_sal['Salary'], color = 'lightcoral')
plt.title('Salary vs Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.box(False)
plt.show()
It is now clearly visible that our data varies linearly: an individual receives a higher
Salary as they gain more Experience.
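One optional way to quantify this linear relationship (not part of the original walkthrough) is to compute the correlation coefficient between the two columns; a value close to 1 indicates a strong positive linear relationship.

# Correlation between YearsExperience and Salary (optional check)
print(df_sal['YearsExperience'].corr(df_sal['Salary']))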

Step 4: Split the dataset into dependent/independent variables

Experience (X) is the independent variable

Salary (y) is dependent on experience

# Splitting variables
X = df_sal.iloc[:, :1] # independent
y = df_sal.iloc[:, 1:] # dependent

Step 5: Split data into Train/Test sets

Further, split your data into training (80%) and test (20%) sets

using train_test_split
# Splitting dataset into test/train
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.2, random_state = 0)

Step 6: Train the regression model

Pass X_train and y_train to the regressor by calling regressor.fit to train the model
on our training data.

# Regressor model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

Step 7: Predict the result

Here comes the interesting part: we are now ready to predict any value of y (Salary)
from X (Experience) with the trained model using regressor.predict.

# Prediction result
y_pred_test = regressor.predict(X_test)    # predicted values for X_test
y_pred_train = regressor.predict(X_train)  # predicted values for X_train

Step 8: Plot the training and test results

It's time to check our predicted results by plotting graphs.

• Plot training set data vs predictions

First, we plot the training data (X_train, y_train) together with the regression line
formed by X_train and the predicted values of y_train (regressor.predict(X_train)).

# Prediction on training set
plt.scatter(X_train, y_train, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(['X_train/y_pred_train', 'X_train/y_train'], title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()

• Plot test set data vs predictions

Secondly, we plot the test data (X_test, y_test) together with the same regression line
formed by X_train and the predicted values of y_train (regressor.predict(X_train)).

# Prediction on test set
plt.scatter(X_test, y_test, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(['X_train/y_pred_train', 'X_test/y_test'], title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()

We can see that, in both plots, the regression line covers the train and test data.

You can also plot the line using the predicted values of y_test
(regressor.predict(X_test)), but the regression line would remain the same, as it is
generated from the single linear regression equation fitted to the same training data.
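A minimal sketch of that alternative, assuming the variables defined in the steps above (the drawn line coincides with the one plotted from the training predictions):

# Plot the regression line using the test-set predictions instead
plt.scatter(X_test, y_test, color = 'lightcoral')
plt.plot(X_test, y_pred_test, color = 'firebrick')  # the same fitted line, evaluated at X_test
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()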
If you remember, at the beginning of this experiment we discussed the linear equation
y = mx + c; we can also get c (the y-intercept) and m (the slope/coefficient) from the
regressor model.

# Regressor coefficients and intercept


print(f'Coefficient: {regressor.coef_}')
print(f'Intercept: {regressor.intercept_}')

The fully implemented code:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Get dataset
df_sal = pd.read_csv('D:/datasets/Salary_Data.csv')
df_sal.head()

# Describe data
df_sal.describe()

# Data distribution
plt.title('Salary Distribution Plot')
sns.distplot(df_sal['Salary'])
plt.show()

# Relationship between Salary and Experience
plt.scatter(df_sal['YearsExperience'], df_sal['Salary'], color = 'lightcoral')
plt.title('Salary vs Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.box(False)
plt.show()

# Splitting variables
X = df_sal.iloc[:, :1]  # independent
y = df_sal.iloc[:, 1:]  # dependent

# Splitting dataset into test/train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Regressor model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Prediction result
y_pred_test = regressor.predict(X_test)    # predicted values for X_test
y_pred_train = regressor.predict(X_train)  # predicted values for X_train

# Prediction on training set
plt.scatter(X_train, y_train, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(['X_train/y_pred_train', 'X_train/y_train'], title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()

# Prediction on test set
plt.scatter(X_test, y_test, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(['X_train/y_pred_train', 'X_test/y_test'], title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()

# Regressor coefficients and intercept
print(f'Coefficient: {regressor.coef_}')
print(f'Intercept: {regressor.intercept_}')
TASK:
Implement linear regression on one of the datasets below (a starter sketch follows the list):
1. Cancer linear regression
2. CDC data: nutrition, physical activity, obesity
3. Fish market dataset for regression
4. Medical insurance costs
5. New York Stock Exchange dataset
6. OLS regression challenge
7. Real estate price prediction
8. Red wine quality
9. Vehicle dataset from CarDekho
10. WHO statistics on life expectancy
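As a starting point, here is a minimal, hedged sketch of how the task might be approached; the file name and column names ('your_dataset.csv', feature_col, target_col) are placeholders that you must replace with the dataset and columns you choose:

# Starter sketch for the task -- replace the placeholder file and column names
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv('your_dataset.csv')   # placeholder: path to the chosen dataset
X = df[['feature_col']]                # placeholder: one independent variable
y = df['target_col']                   # placeholder: dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

print('Coefficient:', regressor.coef_)
print('Intercept:', regressor.intercept_)
print('R^2 on test set:', regressor.score(X_test, y_test))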
