Simple Linear and Logistic Regression

The document provides an overview of regression analysis, detailing concepts such as slope, intercept, and types of regression models including simple linear, multiple linear, and logistic regression. It explains how these models are used to predict relationships between variables, the importance of residuals and the least squares property, and the significance of the coefficient of determination. Additionally, it discusses non-linear regression and polynomial regression as methods to address non-linear data relationships.


REGRESSION

RECAP OF BASIC CONCEPTS


SLOPE
The slope of a linear regression line tells us how much the y-variable changes when the x-variable changes by one unit.

The slope indicates the steepness of a line and the intercept indicates the location where it intersects an axis. Together, the slope and the intercept define the linear relationship between two variables and can be used to estimate an average rate of change.

The greater the magnitude of the slope, the steeper the line
and the greater the rate of change.
The slope is positive 5: when x increases by 1, y increases by 5. The y-intercept is 2.

The slope is negative 0.4: when x increases by 1, y decreases by 0.4. The y-intercept is 7.2.

The slope is 0: when x increases by 1, y neither increases nor decreases. The y-intercept is -4.
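
A quick numerical check of those three lines in Python (a minimal sketch; the slopes and intercepts are the ones from the examples above):

# The three example lines: y = 5x + 2, y = -0.4x + 7.2, and y = -4
lines = [(5.0, 2.0), (-0.4, 7.2), (0.0, -4.0)]   # (slope m, intercept b)
for m, b in lines:
    y_at_0 = m * 0 + b   # the y-intercept
    y_at_1 = m * 1 + b   # y after x increases by 1
    print(f"slope {m}: y(0) = {y_at_0}, y(1) = {y_at_1}, change = {y_at_1 - y_at_0}")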
INTERCEPT
As noted above, the slope indicates the steepness of a line and the intercept indicates the location where it intersects an axis.

The intercept (often labeled the constant) is the point where the function crosses the y-axis.
Y-INTERCEPT

A positive y-intercept means the line crosses the y-axis above the origin, while a negative y-intercept means that the line crosses below the origin. Simply by changing the values of m and b, we can define any straight line. That's how powerful and versatile the slope-intercept formula is.
SLOPE AND INTERCEPT EXAMPLE
For example, a company determines that job
performance for employees in a production department
can be predicted using the regression model y = 130 +
4.3x, where x is the hours of in-house training they
receive (from 0 to 20) and y is their score on a job skills
test.
The value of the y-intercept (130) indicates the average
job skill score for an employee with no training.
The value of the slope (4.3) indicates that for each hour
of training, the job skill score increases, on average, by
4.3 points.
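
A minimal Python sketch of this model (the coefficients 130 and 4.3 come from the example; the function name is just illustrative):

# Regression model from the example: y = 130 + 4.3x
def predict_job_skill_score(training_hours):
    intercept = 130.0   # average score with no training
    slope = 4.3         # average gain per hour of training
    return intercept + slope * training_hours

for hours in (0, 10, 20):
    print(hours, "hours of training ->", predict_job_skill_score(hours))
# 0 -> 130.0, 10 -> 173.0, 20 -> 216.0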
MARGINAL CHANGE / SLOPE
Marginal change – in working with two variables related by a regression equation, the marginal change in a variable is the amount that the variable changes when the other variable changes by exactly one unit.

⚫ The slope, 𝑏1, in the regression equation is the marginal change in 𝑦 when 𝑥 changes by one unit.
TYPES OF REGRESSION MODELS
Simple linear regression
Multiple regression
Logistic regression
Polynomial regression
SIMPLE LINEAR REGRESSION
We find the equation of the straight line that best fits the paired sample data. That equation algebraically describes the relationship between two variables.

The best-fitting straight line is called a regression line, and its equation is called the regression equation.
SIMPLE LINEAR REGRESSION

Regression equation – given a collection of paired sample data, the regression equation that algebraically describes the relationship between the two variables 𝑥 and 𝑦 is ŷ = 𝑏0 + 𝑏1𝑥
⚫ The regression equation attempts to describe a relationship between two variables
⚫ Inherently, the equation algebraically describes how the values of one variable are associated with the values of the other variable

Regression line – the graph of the regression equation
⚫ Also known as the “line of best fit” or the “least-squares line”
⚫ The regression line is the line that fits the sample points best
NOTATION

Notice the ŷ (y-hat) in the sample regression equation!
This implies that we are predicting something!
We are predicting values ŷ based upon true, observed values of 𝑥.
SIMPLE LINEAR REGRESSION
REQUIREMENTS FOR SIMPLE LINEAR REGRESSION
1. The sample of paired data is a simple random sample of quantitative data.

2. The pairs of data (𝑥, 𝑦) have a bivariate normal distribution, meaning the following:
⚫ Visual examination of the scatter plot(s) confirms that the sample points follow an approximately straight-line pattern
⚫ Because results can be strongly affected by the presence of outliers, any outliers should be removed if they are known to be errors (Note: use caution when removing data points)
SIMPLE LINEAR REGRESSION

Outlier – in a scatter plot, an outlier is a point lying far away from the other data points.

Influential point – a point that strongly affects the graph of the regression line.
SIMPLE LINEAR REGRESSION

The additional point is an influential point because the graph of the regression line changed considerably when the point was added.

The additional point is also an outlier because it is far from the other points.
RESIDUAL AND THE LEAST SQUARES PROPERTY

Residual – for a pair of sample 𝑥 and 𝑦 values, the residual is the difference between the observed sample value of 𝑦 and the value ŷ that is predicted by the regression equation.

⚫ Residual = Observed − Predicted = 𝑦 − ŷ
⚫ A residual represents a type of inherent prediction error
⚫ The regression equation does not, typically, pass through all of the observed data values

The least squares property – a straight line satisfies this property if the sum of the squares of the residuals is the smallest sum possible.
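
A small Python sketch of the least squares method, using the standard closed-form estimates b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1·x̄ (the data below are made up for illustration):

# Fit y-hat = b0 + b1*x by least squares and inspect the residuals
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(r ** 2 for r in residuals)   # the quantity the least squares line minimizes
print(b0, b1, sse)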
RESIDUALS
EXAMPLE – HOME PRICE PREDICTION
Find the sum of squared errors for all the candidate lines. The blue line gives the minimum error.
SIMPLE LINEAR REGRESSION
EXAMPLE – EXAM GRADES
LEAST SQUARES METHOD
HOW GOOD IS THE PREDICTION?
That depends on how well the regression line fits the data.
The measure that tells us how well the regression line fits the data is the coefficient of determination.
COEFFICIENT OF DETERMINATION
R SQUARE
CORRELATION COEFFICIENT
PEARSON’S CORRELATION COEFFICIENT
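
As a hedged sketch of the standard definitions these slides refer to (assuming the usual formulas: Pearson's r as the standardized covariance, and, for simple linear regression, R² = r²), with made-up data:

import math

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Pearson's correlation coefficient r
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
r = sxy / math.sqrt(sxx * syy)

# Coefficient of determination: for simple linear regression, R^2 = r^2
r_squared = r ** 2
print(r, r_squared)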
MULTIPLE LINEAR REGRESSION
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.

The goal of multiple linear regression is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable.
MULTIPLE LINEAR REGRESSION
What Multiple Linear Regression Can Tell You
Simple linear regression is a function that allows an
analyst or statistician to make predictions about one
variable based on the information that is known about
another variable.
Linear regression can only be used when one has two
continuous variables—an independent variable and a
dependent variable.
The independent variable is the parameter that is used to
calculate the dependent variable or outcome.
A multiple regression model extends to several
explanatory variables.
MULTIPLE LINEAR REGRESSION
The multiple regression model is based on the following assumptions:

There is a linear relationship between the dependent variable and the independent variables.

The independent variables are not too highly correlated with each other.
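
A minimal sketch of fitting a multiple linear regression with NumPy's least squares solver (the two explanatory variables and all data values are made up for illustration):

import numpy as np

# Toy data: two explanatory variables x1, x2 and one response y
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([6.1, 6.9, 12.2, 12.8, 17.1])

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve min ||Xb - y||^2 for b = (b0, b1, b2)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)        # fitted coefficients
print(X @ b)    # fitted values y-hat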
MULTIPLE LINEAR REGRESSION
EXAMPLE – Dataset and regression output from Excel
LOGISTIC REGRESSION
AIMS
When and Why do we Use Logistic Regression?
Binary
Multinomial
Theory Behind Logistic Regression
Assessing the Model
Assessing predictors
Things that can go Wrong
Interpreting Logistic Regression
WHEN AND WHY
To predict an outcome variable that is categorical from one or more categorical or continuous predictor variables.

Used because having a categorical outcome variable violates the assumption of linearity in normal regression.
WHEN AND WHY
No assumptions about the distributions of the predictor
variables.
Predictors do not have to be normally distributed
Logistic regression does not make any assumptions of
normality, linearity, and homogeneity of variance for the
independent variables.
Because it does not impose these requirements, it is
preferred to discriminant analysis when the data does not
satisfy these assumptions.
LOGISTIC REGRESSION
Logistic regression is used to analyze relationships between a dichotomous dependent variable and continuous or dichotomous independent variables.

A dichotomous variable is one that takes on one of only two possible values when observed or measured. For example, a dichotomous variable may be used to indicate whether a piece of legislation passed. The dichotomous variable (pass/fail) is a representation of the actual, and observable, vote on the legislation.

Logistic regression combines the independent variables to estimate the probability that a particular event will occur, i.e. that a subject will be a member of one of the groups defined by the dichotomous dependent variable.
LOGISTIC REGRESSION
Logistic regression is a type of binary classification machine learning algorithm used to predict the probability of something happening, in our case whether or not an event will occur.

The name “logistic regression” is derived from the concept of the logistic function that it uses. The logistic function is also known as the sigmoid function, and its value lies between zero and one. The following is an example of a logistic function we can use to find the probability of a vehicle breaking down, depending on how many years it has been since it was last serviced.
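
A minimal Python sketch of such a logistic function (the coefficients below are invented purely for illustration):

import math

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model: P(breakdown) vs. years since the last service
b0, b1 = -4.0, 1.2   # made-up coefficients, for illustration only
for years in range(7):
    p = sigmoid(b0 + b1 * years)
    print(years, "years since service -> P(breakdown) =", round(p, 3))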
LOGISTIC REGRESSION
Logistic regression aims to solve classification problems.
It does this by predicting categorical outcomes, unlike
linear regression that predicts a continuous outcome.
In the simplest case there are two outcomes, which is called binomial; an example is predicting whether a tumor is malignant or benign. Other cases have more than two outcomes to classify, in which case it is called multinomial. A common example of multinomial logistic regression would be predicting the class of an iris flower among 3 different species.
OBJECTIVES OF LOGISTIC REGRESSION

Identify the independent variables that impact the dependent variable.

Establish a classification system, based on the logistic model, for determining group membership.
TYPES OF LOGISTIC REGRESSION
BINARY LOGISTIC REGRESSION
⚫ It is used when the dependent variable is dichotomous.

MULTINOMIAL LOGISTIC REGRESSION
⚫ It is used when the dependent or outcome variable has more than two categories.
LOGISTIC REGRESSION – BINARY CLASSIFICATION
With linear regression
With logistic regression
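
A minimal binary classification sketch along these lines, assuming scikit-learn is available (the tumor sizes and labels are made up for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up tumor sizes (cm) with labels: 0 = benign, 1 = malignant
X = np.array([[1.2], [1.8], [2.5], [3.1], [3.8], [4.5], [5.2], [6.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Predicted class and class probabilities for a new tumor size
print(model.predict([[3.5]]))         # predicted label
print(model.predict_proba([[3.5]]))   # [P(benign), P(malignant)]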
NON-LINEAR REGRESSION
The simple linear regression algorithm only works when the relationship in the data is linear.

If we have non-linear data, linear regression is not capable of drawing a best-fit line and fails in such conditions.

Hence, we introduce polynomial regression to overcome this problem; it helps identify the curvilinear relationship between the independent and dependent variables.
NON-LINEAR REGRESSION
How does polynomial regression overcome the problem of non-linear data?

Polynomial regression is a form of linear regression in which, because of the non-linear relationship between the dependent and independent variables, we add polynomial terms to the linear model, converting it into polynomial regression.

Suppose we have X as independent data and Y as dependent data. Before feeding the data to a model, in the preprocessing stage we convert the input variable into polynomial terms of some degree, as sketched below.
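
A minimal sketch of that preprocessing step with NumPy (the data and degree are made up for illustration): the input x is expanded into polynomial terms, and an ordinary least squares fit is done on the expanded features.

import numpy as np

# Toy non-linear data: y roughly follows a quadratic in x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 33.0, 51.2])

degree = 2
# Convert the input variable into polynomial terms: columns 1, x, x^2
X_poly = np.column_stack([x ** d for d in range(degree + 1)])

# Linear least squares on the polynomial features
coeffs, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(coeffs)            # b0, b1, b2
print(X_poly @ coeffs)   # fitted curve values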
NON-LINEAR REGRESSION
BASIS FUNCTION REGRESSION
Polynomial basis functions

Gaussian basis functions
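
A hedged sketch of Gaussian basis function regression (the centers, width, and toy data are arbitrary illustrative choices): each input is mapped through a set of Gaussian bumps, and a linear model is then fit on those features.

import numpy as np

def gaussian_features(x, centers, width):
    # One Gaussian bump per center: exp(-(x - c)^2 / (2 * width^2))
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

# Noisy non-linear toy data
x = np.linspace(0, 10, 30)
y = np.sin(x) + 0.1 * np.random.randn(30)

centers = np.linspace(0, 10, 8)   # arbitrary choice of basis centers
Phi = np.column_stack([np.ones_like(x),
                       gaussian_features(x, centers, width=1.0)])

# Ordinary least squares on the basis-expanded features
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(Phi @ w)   # fitted values at the training inputs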
