Week 7 – Regression
PROF. DR. GÖKHAN SILAHTAROĞLU
Lecturer: NADA MISK
REGRESSION
Regression analysis is a predictive modeling technique that investigates the relationship between a dependent
(target) variable and one or more independent variables.
This technique is used for time series modeling and for finding causal relationships between variables.
For example, the relationship between a driver's impulsive driving and the number of traffic accidents is
best studied through regression.
REGRESSION IN DATA MINING APPROACH
Regression is a data mining technique used to predict numeric (continuous) values in a given dataset.
For example, regression can be used to estimate the cost of a product or service with respect to other variables.
For example, given an increase in population, we can assume that food production will increase at the same rate -
this requires a strong, linear relationship between the two variables.
Advanced techniques such as multiple regression estimate the relationship between multiple variables – for
example, is there a relationship between income, education, and where the person chooses to live? Adding more
variables significantly increases the complexity of the estimation.
REGRESSION
Simple linear regression equation
y = a + bx + e
Multiple linear regression equation
y = a + b1x1 + b2x2 + ... + bixi + e
Nonlinear regression
❖ Quadratic: y = a + bx + cx² + e
❖ Logarithmic: y = a + b ln x
❖ Exponential: y = ab^x
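The linear and quadratic forms above can be fitted by ordinary least squares. A minimal sketch using NumPy's polyfit on made-up data (the x and y values are invented for illustration):

```python
import numpy as np

# Hypothetical data: y is a noisy, roughly linear response to x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Simple linear regression y = a + b*x; polyfit returns coefficients
# from the highest degree down, i.e. [b, a].
b, a = np.polyfit(x, y, deg=1)

# Quadratic regression y = a + b*x + c*x^2; polyfit returns [c, b, a].
c_q, b_q, a_q = np.polyfit(x, y, deg=2)

print(round(b, 2), round(a, 2))  # slope ≈ 2, intercept near 0
```

The same idea extends to the logarithmic and exponential forms by transforming the variables (fit y against ln x, or ln y against x) before the least-squares step.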
REGRESSION ASSUMPTIONS
Number of data and variables: Ideally, the ratio of the number of records to the number of independent
variables is 20:1. This ratio can be as low as 5:1; that is, at least 5 records are required for each independent variable.
Normality: Data should conform to a normal distribution.
Missing data: Regression analysis is sensitive to missing data.
Outliers: Data points more than 3 standard deviations from the mean should be excluded from the data set.
Linearity: It is important to have a linear relationship between dependent and independent variables.
Multicollinearity: The situation where two or more independent variables are very highly correlated.
Since two highly correlated variables carry largely overlapping information, it is appropriate to discard
one of them.
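A common first check for multicollinearity is the pairwise correlation matrix. A small sketch on made-up data, where x2 is deliberately constructed as an almost exact linear copy of x1:

```python
import numpy as np

# Hypothetical predictors: x2 is nearly 2*x1, so the pair is collinear.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2 * x1 + np.array([0.01, -0.02, 0.00, 0.03, -0.01])
x3 = np.array([5.0, 1.0, 4.0, 2.0, 3.0])

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)  # pairwise Pearson correlations

# Flag variable pairs whose |r| exceeds a chosen threshold (0.9 here).
high = [(i, j) for i in range(3) for j in range(i + 1, 3)
        if abs(corr[i, j]) > 0.9]
print(high)  # only the (x1, x2) pair is flagged
```

Each flagged pair is a candidate for dropping one of its members before fitting the regression.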
REGRESSION ANALYSIS EXAMPLES
Let's say you want to forecast a company's sales growth based on current economic conditions. You
have up-to-date company data showing that sales growth is about two and a half times economic
growth.
Based on this current and historical information, we can predict the company's future sales.
There are many benefits to using regression analysis:
1. It shows the significant relationships between the dependent variable and the independent variable.
2. It shows the strength of the effect of multiple independent variables on a dependent variable.
Regression analysis also allows us to compare the effects of variables measured at different scales, such
as the effect of price changes and the number of promotional activities.
LINEAR REGRESSION
Theoretical notation:
To perform a regression analysis, you must define a dependent variable that you assume is influenced by one or
more independent variables.
You then need to create a comprehensive dataset to work with.
We wonder how the ticket price of an event
affects satisfaction levels.
To begin investigating whether there is a
relationship between these two variables, we
will start by plotting these data points on a
graph.
However, how can we understand how the ticket price affects event satisfaction?
With a regression formula: Y = 100 + 7X + error term, where Y is satisfaction and X is ticket price.
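To see where such a formula comes from, we can simulate data from the slide's model (the coefficients 100 and 7 are from the slide; the prices and the noise level are invented) and recover the coefficients by least squares:

```python
import numpy as np

# Simulate the slide's model: satisfaction = 100 + 7 * price + error.
rng = np.random.default_rng(0)
price = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
satisfaction = 100 + 7 * price + rng.normal(0, 5, size=price.size)

# Fit a straight line; the estimates should land near 7 and 100.
slope, intercept = np.polyfit(price, satisfaction, deg=1)
print(round(slope, 1), round(intercept, 1))
```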
MODEL VALIDATION AND
EVALUATION
1. MSE, RMSE, MAE, MAPE
MEAN SQUARED ERROR
The mean squared error (MSE) tells how close a regression curve is to a set of points. It is computed by
averaging the squared differences between the actual and predicted values over the data set.
MSE measures the performance of a machine learning model's predictor and is always non-negative.
It is sensitive to outliers.
Estimators with an MSE value close to zero perform better.
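The definition above reduces to one line of NumPy; the actual and predicted values here are made up for illustration:

```python
import numpy as np

# Toy actual and predicted values (invented for the example).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

# MSE: mean of squared differences between actual and predicted values.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.4375
```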
ROOT MEAN SQUARE ERROR
It is a metric that measures the magnitude of a machine learning model's error, often used to find the
distance between the estimator's predicted values and the true values. It is the square root of the MSE.
The RMSE is the standard deviation of the prediction errors (residuals). Residuals measure how far the
data points are from the regression line; the RMSE measures how spread out these residuals are.
The RMSE value can range from 0 to ∞. It is negatively oriented: predictors with lower values perform
better. An RMSE of zero means the model makes no errors.
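Since the RMSE is just the square root of the MSE, the computation is a one-step extension (same invented toy values as before):

```python
import numpy as np

# Toy actual and predicted values (invented for the example).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

# RMSE: square root of the mean squared error, in the units of y.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(round(rmse, 4))
```

A practical advantage over the MSE: the RMSE is in the same units as the target variable, so it is easier to interpret.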
MEAN ABSOLUTE ERROR
The mean absolute error (MAE) is a measure of the difference between two continuous variables. It is
computed by averaging the absolute differences between the actual and predicted values over the dataset.
The MAE is the average vertical distance between each actual value and the line that best fits the data.
Since the MAE value is easily interpretable, it is frequently used in regression and time series problems.
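The MAE follows the same pattern, with absolute values in place of squares (same invented toy values):

```python
import numpy as np

# Toy actual and predicted values (invented for the example).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

# MAE: mean of absolute differences between actual and predicted values.
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 0.625
```

Note that unlike the MSE, each error contributes linearly, which is why the MAE is less sensitive to outliers.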
MEAN ABSOLUTE PERCENTAGE ERROR
The mean absolute percentage error (MAPE) is used to measure the accuracy of predictions in regression
and time series models.
If there are zeros among the actual values, MAPE cannot be calculated because it would require division
by zero.
When MAPE is used to compare the accuracy of estimators, it is biased in that it systematically favors
methods whose forecasts are too low.
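A sketch of the MAPE computation, including the division-by-zero guard the definition requires (the actual and predicted values are made up):

```python
import numpy as np

# Toy actual and predicted values (invented for the example).
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# MAPE is undefined if any actual value is zero (division by zero).
assert np.all(y_true != 0), "MAPE undefined: actual values contain zero"

# MAPE: mean of absolute percentage errors, expressed in percent.
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(round(mape, 2))
```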
LOGISTIC REGRESSION
LOGIT EQUATION
p = e^(a + bx) / (1 + e^(a + bx)) = 1 / (1 + e^-(a + bx))
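The logit equation maps any value of a + bx into a probability between 0 and 1. A minimal sketch (the coefficient values passed in are arbitrary illustrations):

```python
import math

# Logistic function from the logit equation: p = 1 / (1 + e^-(a + b*x)).
def logistic_p(a, b, x):
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# When a + b*x = 0, the predicted probability is exactly 0.5.
p = logistic_p(a=0.0, b=1.0, x=0.0)
print(p)  # 0.5
```

As a + bx grows large and positive, p approaches 1; as it grows large and negative, p approaches 0, which is what makes the form suitable for modeling probabilities.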