Regression Analysis in Machine Learning

Rhiddha Acharjee

ABSTRACT

Regression analysis is a crucial tool in machine learning that helps identify relationships between
variables. By applying linear regression techniques, including Ordinary Least Squares (OLS), one can
model and predict outcomes based on input data. This report explores key regression concepts and
their practical applications in solving machine learning problems.

INTRODUCTION
In machine learning, regression analysis serves as a fundamental technique for predictive modeling.
It involves identifying the relationship between a dependent variable and one or more independent
variables. Among the most widely used methods is linear regression, which assumes a linear
relationship between variables. The process of fitting a regression line involves calculating the slope
and intercept to minimize the difference between actual and predicted values, often using the
Ordinary Least Squares (OLS) algorithm. Understanding the nature of positive and negative slopes
in regression is essential, as they reflect increasing or decreasing trends in the data. Regression
analysis also extends to multiple variables, allowing for more complex modeling scenarios. In this
report, we will discuss the application of linear and multiple regression techniques to analyze
datasets, compute the regression equation, and estimate correlation coefficients. These techniques
enable us to gain insights into the relationships between variables, thus aiding in decision-making
and predictions, making regression an indispensable tool in machine learning.

PROCEDURE & DISCUSSION

Regression analysis
Regression analysis helps in predicting a continuous variable. Many real-world scenarios call for future predictions, such as weather conditions, sales forecasts, and marketing trends, and for such cases we need techniques that can make these predictions accurately.

Types of Regression
There are various types of regression used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all regression methods analyze the effect of independent variables on a dependent variable. Some important types of regression are listed below:

• Linear Regression
• Logistic Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
• Ridge Regression
• Lasso Regression

Linear Regression in Machine Learning


Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. Since linear regression shows a linear relationship, it describes how the value of the dependent variable changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship between
the variables, as shown in Figure-1:

Figure-1
Mathematically, we can represent a linear regression as:

𝒚 = 𝒂𝟎 + 𝒂𝟏 𝒙 + 𝝐

Where,
y= Dependent Variable (Target Variable)

x= Independent Variable (predictor Variable)

𝒂𝟎 = intercept of the line (Gives an additional degree of freedom)

𝒂𝟏 = Linear regression coefficient (scale factor to each input value).

𝝐= random error

The observed values of the x and y variables form the training dataset for the linear regression model.
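As a quick illustration of this model form, the following Python sketch generates synthetic training data from y = a0 + a1·x + ε; the parameter values and sample size are arbitrary choices made only for illustration, not values taken from this report.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters chosen only for illustration
a0, a1 = 2.0, 0.5             # intercept and regression coefficient
x = rng.uniform(0, 10, 50)    # independent (predictor) variable
eps = rng.normal(0, 1, 50)    # random error term

y = a0 + a1 * x + eps         # dependent (target) variable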

Types of Linear Regression


Linear regression can be further divided into two types of algorithm:

▪ Simple Linear Regression:

If a single independent variable is used to predict the value of a numerical dependent variable, then
such a Linear Regression algorithm is called Simple Linear Regression.

▪ Multiple Linear regression:

If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

Linear Regression Line


A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:

▪ Positive Linear Relationship:

If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is termed a positive linear relationship. The positive linear equation is plotted in Figure-2(A).

The line equation will be: 𝒚 = 𝒂𝟎 + 𝒂𝟏 𝒙

▪ Negative Linear Relationship:


If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a negative linear relationship. The negative linear equation is plotted in Figure-2(B).

The line equation will be: 𝒚 = −𝒂𝟏 𝒙 + 𝒂𝟎

Figure-2(A)                Figure-2(B)

Finding the best fit line:


When working with linear regression, our main goal is to find the best-fit line, which means that the error between the predicted and actual values should be minimized. The best-fit line will have the least error.

Different values of the weights or line coefficients (𝑎0, 𝑎1) give different regression lines, so we need to calculate the best values of 𝑎0 and 𝑎1 to find the best-fit line. To calculate these, we use a cost function.

For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted and actual values. For the above linear equation, the MSE can be calculated as:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (a_1 x_i + a_0)\right)^2$$

Where,

N = total number of observations; 𝒚𝒊 = actual value; (𝒂𝟏 𝒙𝒊 + 𝒂𝟎) = predicted value.


The distance between an actual value and the corresponding predicted value is called the residual. If the observed points are far from the regression line, the residuals will be high, and so will the cost function. If the scatter points are close to the regression line, the residuals will be small, and hence so will the cost function.

Gradient Descent:
Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.

A regression model uses gradient descent to update the coefficients of the line so as to reduce the cost function.

This is done by starting from randomly chosen coefficient values and then iteratively updating them until the cost function reaches its minimum.
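The following Python sketch shows one minimal way gradient descent could be applied to simple linear regression; the learning rate, iteration count, and starting values are illustrative assumptions rather than values prescribed in this report.

import numpy as np

def gradient_descent(x, y, lr=0.01, n_iter=1000):
    # Fit y ≈ a0 + a1*x by iteratively reducing the MSE cost function
    a0, a1 = 0.0, 0.0                       # arbitrary starting coefficients
    n = len(x)
    for _ in range(n_iter):
        y_pred = a0 + a1 * x
        # Gradients of MSE = (1/n) * sum((y - y_pred)^2) with respect to a0 and a1
        grad_a0 = (-2.0 / n) * np.sum(y - y_pred)
        grad_a1 = (-2.0 / n) * np.sum((y - y_pred) * x)
        a0 -= lr * grad_a0                  # step against the gradient
        a1 -= lr * grad_a1
    return a0, a1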

Model Performance:
The goodness of fit determines how well the line of regression fits the set of observations. The process of finding the best model out of various candidate models is called optimization. It can be assessed by the method below:

R-squared method:

R-squared is a statistical method that determines the goodness of fit.

It measures the strength of the relationship between the dependent and independent variables on a
scale of 0-100%.

A high value of R-squared indicates a small difference between the predicted and actual values and hence represents a good model.

It can be calculated from the below formula:

$$R\text{-squared} = \frac{\text{explained variation}}{\text{total variation}}$$
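As a rough sketch of this computation (assuming NumPy, and a least-squares fit with an intercept so that the explained and total variation relate in the usual way), R-squared could be computed as follows:

import numpy as np

def r_squared(y_actual, y_pred):
    # R-squared = explained variation / total variation (0 to 1, i.e. 0-100%)
    y_mean = np.mean(y_actual)
    explained = np.sum((y_pred - y_mean) ** 2)   # variation captured by the regression line
    total = np.sum((y_actual - y_mean) ** 2)     # total variation in the observed data
    return explained / total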

Assumption of linear regression


▪ Linear relationship: The dependent and independent variables should have a linear
relationship.
▪ No or low multicollinearity: Independent variables shouldn’t be highly correlated, as it
makes it difficult to determine their individual impact on the target.
▪ Normal distribution of errors: Error terms should be normally distributed, ensuring valid
confidence intervals.
▪ No autocorrelation: Errors should not be correlated, as this reduces model accuracy.

Ordinary Least Square method for Linear regression


OLS regression is a statistical method utilized for parameter estimation in linear regression models.
The goal of ordinary least squares (OLS) is to find the optimal line that minimizes the total squared
differences between the actual and estimated values of the dependent variable.

The key components of OLS linear regression are:

• It is utilized to demonstrate the linear relationship between a response variable (y) and one or
more predictor variables (x).

• The linear equation is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$, where $\beta_0$ is the intercept, $\beta_1$ to $\beta_p$ are the coefficients for $x_1$ to $x_p$, and $\varepsilon$ is the error term.

• OLS chooses β0, β1, …, βp to minimize the sum of squared differences between the observed y
values and the predicted y values from the regression line.

• If the OLS estimators meet certain conditions like linearity, lack of multi-collinearity,
homoscedasticity, absence of autocorrelation, and normality of errors, they will be unbiased,
consistent, and have the lowest variance among linear unbiased estimators.
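To make the estimation step concrete, here is a minimal NumPy sketch (not the report's own implementation) that computes the OLS coefficients β0, β1, …, βp by solving the least-squares problem on a design matrix with an intercept column:

import numpy as np

def ols_fit(X, y):
    # Estimate [b0, b1, ..., bp] by minimizing the sum of squared errors
    X_design = np.column_stack([np.ones(len(X)), np.asarray(X)])  # prepend a column of 1s for the intercept
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)           # least-squares solution of X_design @ beta ≈ y
    return beta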

Understanding the mathematics behind OLS algorithm


To explain the OLS algorithm, the simplest possible example is taken. Consider the following 3 data points, given in Table-1:

x1      y1
2       6.3
4       11.6
7       15.7

Table-1
A comparison between the line fitted using the OLS algorithm and a poorly fitting line is shown in Figure-3.

Figure-3

Formulas used:

$$\hat{Y} = w_0 + w_1 X_1$$

$$\mathrm{Error}_i = \hat{Y}_i - Y_i$$

$$\text{Sum of errors (S.E)}:\quad L = \sum_{i=1}^{N} \mathrm{Error}_i$$

$$\text{Sum of absolute errors (SAE)}:\quad L = \sum_{i=1}^{N} \left|\mathrm{Error}_i\right|$$

$$\text{Sum of squares of errors (SSE)}:\quad L = \sum_{i=1}^{N} \left(\hat{Y}_i - Y_i\right)^2$$

$$\text{Mean of squares of errors (MSE)}:\quad L = \frac{1}{N}\sum_{i=1}^{N} \left(\hat{Y}_i - Y_i\right)^2$$

$$\text{Root mean of squares of errors (RMSE)} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(\hat{Y}_i - Y_i\right)^2}$$

The plot in Figure-3 shows these 3 data points as pink squares. The purple line is the best-fit line through these 3 data points; a poor-fitting line (the cyan line) is also shown for comparison.

The net objective is to find the equation of the best-fitting straight line through these 3 data points (given in Table-1 above).

The equation $\hat{Y} = w_0 + w_1 X_1$ above is the equation of this best-fit line (the purple line in the plot), where $w_1$ is the slope of the line and $w_0$ is its intercept.

In machine learning, this best fit is called the Linear Regression (LR) model, and 𝒘𝟎 and 𝒘𝟏 are also
called model weights or model coefficients.
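As a small check of the procedure, the following Python sketch fits w0 and w1 to the three data points of Table-1 using the closed-form OLS formulas and evaluates the SSE, MSE, and RMSE defined above; it is an illustrative reconstruction, not output reproduced from the report.

import numpy as np

# Data points from Table-1
x = np.array([2.0, 4.0, 7.0])
y = np.array([6.3, 11.6, 15.7])

# Closed-form OLS estimates of the slope (w1) and intercept (w0)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

y_hat = w0 + w1 * x                 # predictions of the best-fit line
errors = y_hat - y                  # residuals

sse = np.sum(errors ** 2)           # sum of squares of errors
mse = sse / len(x)                  # mean of squares of errors
rmse = np.sqrt(mse)                 # root mean squared error

print(f"w0 = {w0:.3f}, w1 = {w1:.3f}, SSE = {sse:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")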

Practical implications with some examples


Example: 1
The weights and blood sugar levels of 7 randomly selected females in the age group 55-65 are shown in Table-2.

Weight    Blood sugar level
75        110
86        125
93        160
54        104
85        114
103       203
95        196

Table-2

It is assumed that weight and blood sugar level are jointly normally distributed.
Results:
Using the linear regression model, it is found that:

Figure-4

The equation of the regression line relating blood sugar level to weight is y = 2.11x − 33.99. The correlation coefficient is 0.81.
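A short sketch of how these results could be reproduced, assuming SciPy is available (scipy.stats.linregress returns the slope, intercept, and correlation coefficient of a simple linear fit); the printed values should approximately match the reported regression equation and correlation coefficient.

from scipy import stats

# Table-2 data: weight (x) and blood sugar level (y)
weight = [75, 86, 93, 54, 85, 103, 95]
sugar = [110, 125, 160, 104, 114, 203, 196]

result = stats.linregress(weight, sugar)
print(f"slope = {result.slope:.2f}, intercept = {result.intercept:.2f}, r = {result.rvalue:.2f}")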

Example: 2

Eight samples of food are taken. The Y, X1, and X2 columns correspond to total calories, calories from fat, and calories from protein, respectively.

Total calories (Y)    Calories from fat (X1)    Calories from protein (X2)
140                   60                        22
155                   62                        25
159                   67                        24
179                   70                        20
192                   71                        15
200                   72                        14
212                   75                        14
215                   78                        11

Table-3
Results:
Using the multiple linear regression method, we obtain the mean squared error (MSE), intercept, and coefficients. The results are given below:

Figure-5(A)

Figure-5(B)
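The following sketch, assuming scikit-learn is available, shows one way the intercept, coefficients, and MSE reported in Figure-5 could be obtained from the Table-3 data; it is an illustrative reconstruction rather than the exact code used for the report.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Table-3 data: calories from fat (X1) and protein (X2) versus total calories (Y)
X = np.array([[60, 22], [62, 25], [67, 24], [70, 20],
              [71, 15], [72, 14], [75, 14], [78, 11]])
y = np.array([140, 155, 159, 179, 192, 200, 212, 215])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
print("MSE:", mean_squared_error(y, y_pred))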

Conclusion
In conclusion, this report provided an in-depth exploration of the regression method in machine
learning, with a focus on the linear regression model. We discussed the Ordinary Least Squares
(OLS) algorithm, detailing its significance in minimizing error to fit a linear model. Through
illustrative examples, we demonstrated the application of linear regression in various datasets,
highlighting its predictive accuracy and relevance. Supporting results validated the model’s
effectiveness in both simple and multiple regression scenarios. Overall, the study reinforces the
utility of regression methods as foundational tools for predictive analytics in machine learning.
Thank you!
