Regression Analysis in Machine Learning
Regression analysis is a crucial tool in machine learning that helps identify relationships between
variables. By applying linear regression techniques, including Ordinary Least Squares, one can
model and predict outcomes based on input data. This report explores key regression concepts and
their practical applications in solving machine learning problems.
Introduction
In machine learning, regression analysis serves as a fundamental technique for predictive modeling.
It involves identifying the relationship between a dependent variable and one or more independent
variables. Among the most widely used methods is linear regression, which assumes a linear
relationship between variables. The process of fitting a regression line involves calculating the slope
and intercept to minimize the difference between actual and predicted values, often using the
Ordinary Least Squares (OLS) algorithm. Understanding the nature of positive and negative slopes
in regression is essential, as they reflect increasing or decreasing trends in the data. Regression
analysis also extends to multiple variables, allowing for more complex modeling scenarios. In this
report, we will discuss the application of linear and multiple regression techniques to analyze
datasets, compute the regression equation, and estimate correlation coefficients. These techniques
enable us to gain insights into the relationships between variables, thus aiding in decision-making
and predictions, making regression an indispensable tool in machine learning.
Regression analysis
Regression analysis helps in the prediction of a continuous variable. There are various real-world scenarios where we need future predictions, such as weather conditions, sales figures, and marketing trends; for such cases we need a technique that can make predictions accurately.
Types of Regression
There are various types of regression that are used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all the regression methods analyze the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, which are given below:
Linear Regression:
The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between
the variables. Consider Figure-1:
FIGURE-1
Mathematically, we can represent a linear regression as:

y = a_0 + a_1 x + \epsilon

Where,
y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
a_0 = intercept of the line
a_1 = linear regression coefficient (slope of the line)
\epsilon = random error

The values of the x and y variables come from the training dataset used to fit the Linear Regression model.
If a single independent variable is used to predict the value of a numerical dependent variable, then
such a Linear Regression algorithm is called Simple Linear Regression.
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
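As a brief sketch of this distinction, assuming scikit-learn as the tool (a fuller multiple regression example appears later in the report):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array([3.1, 4.9, 7.2, 8.8])  # toy target values, for illustration only

# Simple Linear Regression: a single independent variable
X_simple = np.array([[1], [2], [3], [4]])           # one feature column
simple_model = LinearRegression().fit(X_simple, y)

# Multiple Linear Regression: more than one independent variable
X_multiple = np.array([[1, 5], [2, 3], [3, 8], [4, 1]])  # two feature columns
multiple_model = LinearRegression().fit(X_multiple, y)

print(simple_model.coef_)    # one coefficient
print(multiple_model.coef_)  # one coefficient per independent variable
```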
If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is termed a Positive linear relationship. The positive linear equation is plotted in Figure-2 (A).
Different values of the weights or coefficients of the line (a_0, a_1) give different regression lines, so we need to calculate the best values of a_0 and a_1 to find the best-fit line. To calculate these, we use a cost function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:

MSE = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (a_1 x_i + a_0) \right)^2

Where,
N = total number of observations
y_i = actual value of the i-th observation
a_1 x_i + a_0 = predicted value of the i-th observation
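As an illustration of this cost function, a minimal sketch in Python is given below; the data points and candidate coefficients are made-up values for demonstration, not data from this report.

```python
import numpy as np

# Toy training data (hypothetical values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

def mse(a0, a1, x, y):
    """Mean Squared Error between actual y and predictions a1*x + a0."""
    predictions = a1 * x + a0
    return np.mean((y - predictions) ** 2)

# Cost of a candidate line y = 2x + 1
print(mse(a0=1.0, a1=2.0, x=x, y=y))
```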
Gradient Descent:
Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
A regression model uses gradient descent to update the coefficients of the line by reducing the cost
function.
This is done by starting from randomly selected coefficient values and then iteratively updating them until the cost function reaches its minimum.
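Building on the MSE above, the following is a minimal gradient descent sketch for simple linear regression; the learning rate, iteration count, and toy data are illustrative assumptions rather than values from this report.

```python
import numpy as np

# Toy training data (illustrative)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

a0, a1 = 0.0, 0.0      # start from arbitrary coefficient values
learning_rate = 0.01   # step size (assumed)
n = len(x)

for _ in range(5000):  # iteratively update the coefficients
    predictions = a1 * x + a0
    error = predictions - y
    # Gradients of the MSE cost with respect to a0 and a1
    grad_a0 = (2.0 / n) * np.sum(error)
    grad_a1 = (2.0 / n) * np.sum(error * x)
    a0 -= learning_rate * grad_a0
    a1 -= learning_rate * grad_a1

print(f"intercept a0 = {a0:.3f}, slope a1 = {a1:.3f}")
```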
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The process
of finding the best model out of various models is called optimization. It can be achieved by the method below:
R-squared method:
It measures the strength of the relationship between the dependent and independent variables on a
scale of 0-100%.
A high value of R-squared indicates a smaller difference between the predicted values and the actual values and hence represents a good model.

R^2 = \frac{\text{explained variation}}{\text{total variation}}
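A short sketch of computing R-squared from actual and predicted values is shown below; the arrays are placeholder values, and R-squared is computed equivalently as one minus the unexplained variation divided by the total variation.

```python
import numpy as np

y_actual = np.array([3.1, 4.9, 7.2, 8.8])     # observed values (illustrative)
y_predicted = np.array([3.0, 5.0, 7.0, 9.0])  # values predicted by a fitted line

ss_residual = np.sum((y_actual - y_predicted) ** 2)     # unexplained variation
ss_total = np.sum((y_actual - np.mean(y_actual)) ** 2)  # total variation
r_squared = 1 - ss_residual / ss_total

print(f"R-squared = {r_squared:.3f}")
```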
Ordinary Least Squares (OLS):
• It is utilized to demonstrate the linear relationship between a response variable (y) and one or more predictor variables (x).
• OLS chooses β0, β1, …, βp to minimize the sum of squared differences between the observed y
values and the predicted y values from the regression line.
• If the OLS estimators meet certain conditions like linearity, lack of multi-collinearity,
homoscedasticity, absence of autocorrelation, and normality of errors, they will be unbiased,
consistent, and have the lowest variance among linear unbiased estimators.
Consider the following three data points:

x1      y1
2       6.3
4       11.6
7       15.7
TABLE-1
A comparison between the line fitted using the OLS algorithm and a poorly fitting line is shown in Figure-3:
FIGURE-3
Formulas used:

\hat{Y} = w_0 + w_1 X_1

Error_i = \hat{Y}_i - Y_i

Sum of squares of errors (SSE): L = \sum_{i=1}^{N} (\hat{Y}_i - Y_i)^2

Mean of squares of errors (MSE): L = \frac{1}{N} \sum_{i=1}^{N} (\hat{Y}_i - Y_i)^2

Root mean of squares of errors (RMSE): \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{Y}_i - Y_i)^2}
The plot in Figure-3 shows these 3 data points as pink squares. The purple line is the "best-fit line" through these 3 data points, and a "poor-fitting" line (the cyan line) is shown for comparison. The objective is to find the equation of the best-fitting straight line through the 3 data points given in Table-1.
The first formula above, \hat{Y} = w_0 + w_1 X_1, is the equation of the best-fit line (the purple line in the plot), where w_1 = slope of the line and w_0 = intercept of the line.
In machine learning, this best fit is called the Linear Regression (LR) model, and 𝒘𝟎 and 𝒘𝟏 are also
called model weights or model coefficients.
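As a sketch of how the best-fit weights and the error measures above could be computed for the three data points in Table-1 (the report's own fitted values appear in Figure-3 and are not restated here), one possible implementation using NumPy is:

```python
import numpy as np

# Data points from Table-1
x1 = np.array([2.0, 4.0, 7.0])
y = np.array([6.3, 11.6, 15.7])

# Ordinary least squares fit of Y_hat = w0 + w1 * X1
w1, w0 = np.polyfit(x1, y, deg=1)  # polyfit returns the slope first, then the intercept
y_hat = w0 + w1 * x1

sse = np.sum((y_hat - y) ** 2)  # sum of squares of errors
mse = sse / len(y)              # mean of squares of errors
rmse = np.sqrt(mse)             # root mean of squares of errors

print(f"w0 = {w0:.3f}, w1 = {w1:.3f}, SSE = {sse:.3f}, RMSE = {rmse:.3f}")
```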
Example: 1
Seven samples are taken, recording the weight (x) and the blood sugar level (y) of each subject:

Weight (x)    Blood sugar level (y)
75            110
86            125
93            160
54            104
85            114
103           203
95            196
TABLE-2
It is assumed that weight and blood sugar level are jointly normally distributed.
Results:
Using the Linear Regression model, it is found that:
FIGURE-4
Equation of the regression line relating blood sugar level to weight: y = 2.11x − 33.99. The correlation coefficient = 0.81.
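The reported line and correlation coefficient can be checked against the Table-2 data with a short sketch; scipy.stats.linregress is an assumed tool choice, since the report does not state which software was used.

```python
from scipy import stats

# Weight (x) and blood sugar level (y) from Table-2
weight = [75, 86, 93, 54, 85, 103, 95]
blood_sugar = [110, 125, 160, 104, 114, 203, 196]

result = stats.linregress(weight, blood_sugar)
print(f"slope = {result.slope:.2f}")           # approx. 2.11
print(f"intercept = {result.intercept:.2f}")   # approx. -33.99
print(f"correlation r = {result.rvalue:.2f}")  # approx. 0.81
```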
Example: 2
Eight samples of food are taken. The Y, X1, and X2 columns correspond to total calories, calories from fat, and calories from protein, respectively.
Total Calories (Y)    Calories from fat (X1)    Calories from protein (X2)
140                   60                        22
155                   62                        25
159                   67                        24
179                   70                        20
192                   71                        15
200                   72                        14
212                   75                        14
215                   78                        11
TABLE-3
Results:
Using the Multiple Linear Regression method, we obtain the Mean Squared Error (MSE), the intercept, and the coefficients. The results are given below:
FIGURE-5 (A)
FIGURE-5 (B)
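A sketch of the multiple linear regression fit on the Table-3 data is shown below; scikit-learn is an assumed tool choice, and the intercept, coefficients, and MSE it prints correspond to the quantities reported in Figure-5 rather than being restated from the report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Table-3: calories from fat (X1) and calories from protein (X2) vs total calories (Y)
X = np.array([[60, 22], [62, 25], [67, 24], [70, 20],
              [71, 15], [72, 14], [75, 14], [78, 11]])
y = np.array([140, 155, 159, 179, 192, 200, 212, 215])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
print("MSE:", mean_squared_error(y, y_pred))
```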
Conclusion
In conclusion, this report provided an in-depth exploration of the regression method in machine
learning, with a focus on the linear regression model. We discussed the Ordinary Least Squares
(OLS) algorithm, detailing its significance in minimizing error to fit a linear model. Through
illustrative examples, we demonstrated the application of linear regression in various datasets,
highlighting its predictive accuracy and relevance. Supporting results validated the model’s
effectiveness in both simple and multiple regression scenarios. Overall, the study reinforces the
utility of regression methods as foundational tools for predictive analytics in machine learning.