Week 7 – Regression
PROF. DR. GÖKHAN SILAHTAROĞLU
Lecturer: NADA MISK
REGRESSION
Regression analysis is a predictive modeling technique that investigates the relationship between a dependent
(target) variable and one or more independent variables.
This technique is used for time series modeling and for finding causal relationships between variables.
For example, the relationship between a driver's impulsive driving and the number of traffic accidents is
best studied through regression.
REGRESSION IN DATA MINING APPROACH
Regression is a data mining technique used to predict numeric (continuous) values in a given dataset.
For example, regression can be used to estimate the cost of a product or service with respect to other variables.
For example, given an increase in population, we can assume that food production will increase at the same rate -
this requires a strong, linear relationship between the two variables.
Advanced techniques such as multiple regression estimate the relationship between multiple variables – for
example, is there a relationship between income, education, and where the person chooses to live? Adding more
variables significantly increases the complexity of the estimation.
REGRESSION
Simple linear regression equation
y = a + bx + e
Multiple linear regression equation
y = a + b1x1 + b2x2 + ... + bixi + e
Nonlinear regression
❖ Quadratic: y = a + bx + cx² + e
❖ Logarithmic: y = a + b ln x
❖ Exponential: y = ab^x
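The linear and quadratic forms above can be fitted by ordinary least squares. A minimal sketch using NumPy's polyfit on made-up data (the x and y values are invented for illustration):

```python
import numpy as np

# Hypothetical data: y is a noisy, roughly linear response to x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Simple linear regression y = a + b*x; polyfit returns coefficients
# from the highest degree down, i.e. [b, a].
b, a = np.polyfit(x, y, deg=1)

# Quadratic regression y = a + b*x + c*x^2; polyfit returns [c, b, a].
c_q, b_q, a_q = np.polyfit(x, y, deg=2)

print(round(b, 2), round(a, 2))  # slope ≈ 2, intercept near 0
```

The same idea extends to the logarithmic and exponential forms by transforming the variables (fit y against ln x, or ln y against x) before the least-squares step.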
REGRESSION ASSUMPTIONS
Number of data and variables: Ideally, the ratio of the number of records to the number of independent
variables is 20:1. This ratio can be as low as 5:1; that is, at least 5 records are required for each independent variable.
Normality: Data should conform to a normal distribution.
Missing data: Regression analysis is sensitive to missing data.
Outliers: Data points more than 3 standard deviations from the mean should be excluded from the data set.
Linearity: It is important to have a linear relationship between dependent and independent variables.
Multicollinearity: The situation where two or more independent variables are very highly correlated.
Since two highly correlated variables carry largely overlapping information, it is appropriate to discard
one of them.
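A common first check for multicollinearity is the pairwise correlation matrix. A small sketch on made-up data, where x2 is deliberately constructed as an almost exact linear copy of x1:

```python
import numpy as np

# Hypothetical predictors: x2 is nearly 2*x1, so the pair is collinear.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2 * x1 + np.array([0.01, -0.02, 0.00, 0.03, -0.01])
x3 = np.array([5.0, 1.0, 4.0, 2.0, 3.0])

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)  # pairwise Pearson correlations

# Flag variable pairs whose |r| exceeds a chosen threshold (0.9 here).
high = [(i, j) for i in range(3) for j in range(i + 1, 3)
        if abs(corr[i, j]) > 0.9]
print(high)  # only the (x1, x2) pair is flagged
```

Each flagged pair is a candidate for dropping one of its members before fitting the regression.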
REGRESSION ANALYSIS EXAMPLES
Let's say you want to forecast a company's sales growth based on current economic conditions. You
have up-to-date company data showing that sales growth is about two and a half times economic
growth.
Based on this current and historical information, we can predict the company's future sales.
There are many benefits to using regression analysis:
1. It shows the significant relationships between the dependent variable and the independent variable.
2. It shows the strength of the effect of multiple independent variables on a dependent variable.
Regression analysis also allows us to compare the effects of variables measured at different scales, such
as the effect of price changes and the number of promotional activities.
LINEAR REGRESSION
Theoretical notation:
To perform a regression analysis, you must define a dependent variable that you assume is influenced by one or
more independent variables.
You then need to create a comprehensive dataset to work with.
We wonder how the ticket price of an event
affects satisfaction levels.
To begin investigating whether there is a
relationship between these two variables, we
will start by plotting these data points on a
graph.
However, how can we understand how the ticket price affects event satisfaction?
With a regression formula: Y = 100 + 7X + error term, where Y is satisfaction and X is ticket price.
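To see where such a formula comes from, we can simulate data from the slide's model (the coefficients 100 and 7 are from the slide; the prices and the noise level are invented) and recover the coefficients by least squares:

```python
import numpy as np

# Simulate the slide's model: satisfaction = 100 + 7 * price + error.
rng = np.random.default_rng(0)
price = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
satisfaction = 100 + 7 * price + rng.normal(0, 5, size=price.size)

# Fit a straight line; the estimates should land near 7 and 100.
slope, intercept = np.polyfit(price, satisfaction, deg=1)
print(round(slope, 1), round(intercept, 1))
```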
MODEL VALIDATION AND
EVALUATION
1. MSE, RMSE, MAE, MAPE
MEAN SQUARED ERROR
The mean squared error (MSE) tells how close a regression curve is to a set of points. It is computed by
averaging the squared differences between the actual and predicted values over the data set.
MSE measures the performance of a machine learning model's predictor and is always non-negative.
It is sensitive to outliers.
Estimators with an MSE value close to zero perform better.
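The definition above reduces to one line of NumPy; the actual and predicted values here are made up for illustration:

```python
import numpy as np

# Toy actual and predicted values (invented for the example).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

# MSE: mean of squared differences between actual and predicted values.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.4375
```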
ROOT MEAN SQUARE ERROR
It is a metric that measures the magnitude of a machine learning model's error, often used to find the
distance between the estimator's predicted values and the true values. It is the square root of the MSE.
The RMSE is the standard deviation of the prediction errors (residuals). Residuals measure how far the
data points are from the regression line; the RMSE measures how spread out these residuals are.
The RMSE value can range from 0 to ∞. It is negatively oriented: predictors with lower values perform
better. An RMSE of zero means the model makes no errors.
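Since the RMSE is just the square root of the MSE, the computation is a one-step extension (same invented toy values as before):

```python
import numpy as np

# Toy actual and predicted values (invented for the example).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

# RMSE: square root of the mean squared error, in the units of y.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(round(rmse, 4))
```

A practical advantage over the MSE: the RMSE is in the same units as the target variable, so it is easier to interpret.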
MEAN ABSOLUTE ERROR
The mean absolute error (MAE) is a measure of the difference between two continuous variables. It is
computed by averaging the absolute differences between the actual and predicted values over the dataset.
The MAE is the average vertical distance between each actual value and the line that best fits the data.
Since the MAE value is easily interpretable, it is frequently used in regression and time series problems.
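The MAE follows the same pattern, with absolute values in place of squares (same invented toy values):

```python
import numpy as np

# Toy actual and predicted values (invented for the example).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

# MAE: mean of absolute differences between actual and predicted values.
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 0.625
```

Note that unlike the MSE, each error contributes linearly, which is why the MAE is less sensitive to outliers.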
MEAN ABSOLUTE PERCENTAGE ERROR
The mean absolute percentage error (MAPE) is used to measure the accuracy of predictions in regression
and time series models.
If there are zeros among the actual values, MAPE cannot be calculated because it would require division
by zero.
When MAPE is used to compare the accuracy of estimators, it is biased in that it systematically favors
methods whose forecasts are too low.
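A sketch of the MAPE computation, including the division-by-zero guard the definition requires (the actual and predicted values are made up):

```python
import numpy as np

# Toy actual and predicted values (invented for the example).
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# MAPE is undefined if any actual value is zero (division by zero).
assert np.all(y_true != 0), "MAPE undefined: actual values contain zero"

# MAPE: mean of absolute percentage errors, expressed in percent.
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(round(mape, 2))
```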
LOGISTIC REGRESSION
LOGIT EQUATION
p = e^(a + bx) / (1 + e^(a + bx)) = 1 / (1 + e^-(a + bx))
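The logit equation maps any value of a + bx into a probability between 0 and 1. A minimal sketch (the coefficient values passed in are arbitrary illustrations):

```python
import math

# Logistic function from the logit equation: p = 1 / (1 + e^-(a + b*x)).
def logistic_p(a, b, x):
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# When a + b*x = 0, the predicted probability is exactly 0.5.
p = logistic_p(a=0.0, b=1.0, x=0.0)
print(p)  # 0.5
```

As a + bx grows large and positive, p approaches 1; as it grows large and negative, p approaches 0, which is what makes the form suitable for modeling probabilities.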