Experiment 1
Objectives:
1- Comprehend the concept of linear regression using the least square
method.
2- Understand the code in Python used for linear regression.
3- Run real-world applications in Python.
Introduction:
Least Square Method
The least square method is the process of finding the regression line, or best-fitted line, for a data set described by an equation. The method works by minimizing the sum of the squares of the residuals, the offsets of the data points from the curve or line, so that the trend of the outcomes is captured quantitatively. Curve fitting of this kind appears throughout regression analysis, and fitting an equation by minimizing the squared residuals is what defines the least square method.
Least Square Method Definition
The least-squares method is a statistical method used to find the line of best fit, of the form y = mx + b, for the given data. The curve of the equation is called the regression line. Our main objective in this method is to make the sum of the squares of the errors as small as possible; this is the reason it is called the least-squares method. The method is widely used in data fitting, where the best-fit result minimizes the sum of squared errors, each error being the difference between an observed value and the corresponding fitted value. The sum of squared errors also measures the variation in the observed data. For example, given 4 data points, the method produces the graph below.
Figure 1: A least-squares line fitted to four data points.
The two basic categories of least-square problems are ordinary or linear least
squares and nonlinear least squares.
Limitations for Least Square Method
Even though the least-squares method is considered the best method for finding the line of best fit, it has a few limitations:
• The method models only the relationship between the two variables; all other causes and effects are not taken into consideration.
• The method is unreliable when the data are not evenly distributed.
• The method is very sensitive to outliers, which can skew the results of the least-squares analysis, as illustrated just below this list.
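To see the outlier sensitivity concretely, here is a minimal sketch with hypothetical data; NumPy's np.polyfit is used here in place of the manual formulas introduced later in this experiment:
import numpy as np

# Points lying exactly on y = 2x, and a copy with one outlier appended
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x
x_out = np.append(x, 6.0)
y_out = np.append(y, 40.0)  # far above the trend (the line would predict 12)

m_clean, b_clean = np.polyfit(x, y, 1)            # slope 2.0, intercept 0.0
m_skewed, b_skewed = np.polyfit(x_out, y_out, 1)  # both pulled toward the outlier
print(m_clean, b_clean)
print(m_skewed, b_skewed)
A single bad point changes both the slope and the intercept of the fitted line, which is why outlier inspection should precede least-squares fitting.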
Least Square Method Graph
Look at the graph below: the straight line shows the potential relationship between the independent variable and the dependent variable. The ultimate goal of this method is to make the differences between the observed responses and the responses predicted by the regression line as small as possible; smaller residuals mean the model fits better. The fit is found by minimizing the residual of each point from the line. Residuals can be measured vertically or perpendicularly: vertical residuals are mostly used in polynomial and hyperplane problems, while perpendicular residuals are used in general curve fitting, as seen in the image below.
Figure 2: Vertical residuals between the data points and the regression line.
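To make vertical residuals concrete, the short sketch below uses hypothetical points and a hypothetical candidate line y = 1.5x + 1, and computes each residual and the sum of squares that the least square method minimizes:
import numpy as np

# Hypothetical observations and a candidate line y = 1.5x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.5, 5.0, 7.5])
y_hat = 1.5 * x + 1            # responses predicted by the line

residuals = y - y_hat          # vertical offsets (observed - predicted)
print(residuals)               # [-0.5  0.5 -0.5  0.5]
print(np.sum(residuals**2))    # 1.0, the quantity least squares minimizes
The least-squares line is the choice of slope and intercept for which this sum of squared residuals is smallest.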
Least Square Method Formula
The least-square method finds the line that best fits a set of observations with a minimum sum of squared residuals or errors. Let us assume that the given data points are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn), in which all x's are independent variables, while all y's are dependent ones. The method finds a line of the form y = mx + b, where y and x are variables, m is the slope, and b is the y-intercept. The formulas for the slope m and the intercept b are:
m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
b = (∑y − m∑x) / n
Following are the steps to calculate the least square using the above formulas.
• Step 1: Draw a table with 4 columns, where the first two columns are for the x and y points.
• Step 2: In the next two columns, find xy and x².
• Step 3: Find ∑x, ∑y, ∑xy, and ∑x².
• Step 4: Find the value of the slope m using the above formula.
• Step 5: Calculate the value of b using the above formula.
• Step 6: Substitute the values of m and b into the equation y = mx + b.
Example: consider the following data points (n = 5).
x: 1 2 3 4 5
y: 2 5 3 8 7
Construct the table:
x      y      xy     x²
1      2      2      1
2      5      10     4
3      3      9      9
4      8      32     16
5      7      35     25
∑x = 15   ∑y = 25   ∑xy = 88   ∑x² = 55
m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
m = (5×88 − 15×25) / (5×55 − 15²)
m = (440 − 375) / (275 − 225)
m = 65/50 = 1.3
b = (∑y − m∑x)/n
b = (25 − 1.3×15)/5
b = (25 − 19.5)/5
b = 5.5/5 = 1.1
So the least-squares line is y = 1.3x + 1.1.
The same calculation in Python:
import numpy as np

# Data points from the worked example above
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 5, 3, 8, 7])

n = len(x)
sum_xy = np.sum(x * y)   # ∑xy
sum_x = np.sum(x)        # ∑x
sum_y = np.sum(y)        # ∑y
sq_sumx = np.sum(x * x)  # ∑x²

# Slope m and intercept b from the least-squares formulas
m = (n * sum_xy - sum_x * sum_y) / (n * sq_sumx - sum_x**2)
print('m = ', m)  # 1.3
b = (sum_y - m * sum_x) / n
print('b = ', b)  # 1.1
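As a sanity check (not part of the original listing), NumPy's built-in least-squares fit should reproduce the same values:
# Cross-check the manual formulas with np.polyfit (degree-1 fit)
m_np, b_np = np.polyfit(x, y, 1)
print('m =', m_np, ' b =', b_np)  # expected: m = 1.3, b = 1.1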
The graph below shows the relationship between Salary and Years of Experience.
Equation: y = mx + c
This is the simple linear regression equation, where c is the constant (the y-intercept) and m is the slope: the degree of change in the dependent variable for every one-unit change in the independent variable. In statistical notation these coefficients are written as β0 (y-intercept) and β1 (slope), and their values are estimated so that the predicted values match the actual values as closely as possible.
Implement Simple Linear Regression in Python
In this example, we will use salary data relating years of experience to salary. Pandas and NumPy are used for data handling, Matplotlib and Seaborn for visualizations, and Sklearn libraries for building and training the regression model.
Step 1: Import libraries
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Step 2: Get the dataset
Download the dataset (Salary_Data.csv), upload it to your notebook, and read it into a Pandas DataFrame:
# Get dataset
df_sal = pd.read_csv('/content/Salary_Data.csv')
df_sal.head()
Step 3: Data analysis
Now that we have our data ready, let's analyze and understand its trend in detail.
# Describe data
df_sal.describe()
Here, we can see that Salary ranges from 37731 to 122391, with a median of 65237.
We can also visualize how the data is distributed using Seaborn (distplot is deprecated in recent Seaborn versions, so histplot with a KDE overlay is used here):
# Data distribution
plt.title('Salary Distribution Plot')
sns.histplot(df_sal['Salary'], kde=True)
plt.show()
# Splitting variables
X = df_sal.iloc[:, :1] # independent
y = df_sal.iloc[:, 1:] # dependent
Further, split the data into training (80%) and test (20%) sets using train_test_split:
# Splitting dataset into test/train
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.2, random_state = 0)
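As a quick check (not in the original code), print the shapes to confirm the 80/20 split:
# Verify the split sizes: 80% train / 20% test
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)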
Pass the X_train and y_train data into the regressor model with regressor.fit to train the model on the training data:
# Regressor model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Here comes the interesting part: once the model is trained, we can predict values using regressor.predict.
# Prediction result
y_pred_test = regressor.predict(X_test)    # predicted values of y_test
y_pred_train = regressor.predict(X_train)  # predicted values of y_train
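The walkthrough stops at raw predictions, but it is common to quantify the fit; a minimal sketch using scikit-learn's metrics module:
from sklearn.metrics import r2_score, mean_absolute_error

# R² close to 1 and a small mean absolute error indicate a good fit
print('R2 :', r2_score(y_test, y_pred_test))
print('MAE:', mean_absolute_error(y_test, y_pred_test))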
# Prediction on test set
plt.scatter(X_test, y_test, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(['X_test/y_test', 'X_train/y_pred_train'], title
= 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()
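The corresponding train-set plot mirrors the test-set code above; a minimal sketch:
# Prediction on training set
plt.scatter(X_train, y_train, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(['X_train/y_train', 'X_train/y_pred_train'], title = 'Sal/Exp', loc='best', facecolor='white')
plt.box(False)
plt.show()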
We can see that, in both plots, the regression line fits the train and test data well. You can also plot the results against the predicted values of y_test; both plots show the same line because it comes from the single linear regression equation fitted to the same training data.
If you remember from the beginning of this experiment, we discussed the linear equation y = mx + c; the trained model exposes its fitted slope and intercept through regressor.coef_ and regressor.intercept_. The complete program, from loading the data to printing these coefficients, is given below.
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Get dataset
df_sal = pd.read_csv('D:/datasets/Salary_Data.csv')
df_sal.head()
# Describe data
df_sal.describe()
# Data distribution
plt.title('Salary Distribution Plot')
sns.histplot(df_sal['Salary'], kde=True)
plt.show()
# Relationship between Salary and Experience
plt.scatter(df_sal.iloc[:, 0], df_sal['Salary'], color = 'lightcoral')
plt.title('Salary vs Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.box(False)
plt.show()
# Splitting variables
X = df_sal.iloc[:, :1]  # independent
y = df_sal.iloc[:, 1:]  # dependent
# Splitting dataset into test/train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Regressor model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Prediction result
y_pred_test = regressor.predict(X_test)    # predicted values of y_test
y_pred_train = regressor.predict(X_train)  # predicted values of y_train
# Prediction on training set
plt.scatter(X_train, y_train, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.box(False)
plt.show()
# Prediction on test set
plt.scatter(X_test, y_test, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.box(False)
plt.show()
# Regressor coefficients and intercept
print(f'Coefficient: {regressor.coef_}')
print(f'Intercept: {regressor.intercept_}')
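Finally, the trained model can predict the salary for a new, hypothetical value of experience; the column name YearsExperience below is assumed to match the CSV header:
# Predict salary for a hypothetical 5 years of experience
new_exp = pd.DataFrame({'YearsExperience': [5.0]})  # assumed column name
print(regressor.predict(new_exp))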
TASK:
Implement linear regression on one of the datasets below:
1. Cancer linear regression
2. CDC data: nutrition, physical activity, obesity
3. Fish market dataset for regression
4. Medical insurance costs
5. New York Stock Exchange dataset
6. OLS regression challenge
7. Real estate price prediction
8. Red wine quality
9. Vehicle dataset from CarDekho
10. WHO statistics on life expectancy