Data Analytics Using R

Logistic regression

Compiled and Presented by:

Dr. Chetna Arora
Logistic regression
• Logistic regression is a method used to model the relationship between a result
that has two possible outcomes (like "yes" or "no") and one or more factors that
might influence this result. These factors are the independent variables (also called
predictors or features): the things you measure or observe to see how they affect
the dependent variable (the binary outcome).
For example, let’s say you are trying to predict whether a student passes or fails an
exam (the result, which has two outcomes: pass/fail).
• The factors that might influence this result could be:
Study hours (the more hours a student studies, the more likely they are to pass).
Attendance (students who attend more classes might be more likely to pass).
Previous grades (students with higher grades in the past might be more likely to
pass).
These factors (study hours, attendance, previous grades) are the independent
variables, and the result (pass or fail) is the dependent variable. Logistic regression
helps determine how these factors combine to influence the outcome, and it
estimates the probability of a student passing or failing based on these factors.
• When we simply refer to "logistic regression," we are often talking about binary
logistic regression.
• In logistic regression, the independent variables (IVs) can indeed have more than two options or
values.

Binary dependent variable (DV): In logistic regression, the DV must have only two possible
outcomes (e.g., "yes" or "no", "success" or "failure").

Independent variables (IVs): The IVs can be of various types:

Binary IV: An IV with two categories (e.g., male/female, yes/no).


Categorical IV with more than two levels: An IV can have multiple categories (e.g., marital status:
single, married, divorced). In this case, you'd create dummy variables (binary variables for each
category).
Continuous IV: An IV with a wide range of numeric values (e.g., age, income, hours of study).
Ordinal IV: An IV with ordered categories (e.g., education level: high school < bachelor’s <
master’s).
So, while the DV is always binary in basic logistic regression, the IVs can have any number of
options (categories or numeric values). For example:

IV1 (binary): Gender (male/female).


IV2 (ordinal): Education level (high school, bachelor’s, master’s).
IV3 (continuous): Age (in years).
IV4 (categorical): Region (North, South, East, West).
Logistic regression models can handle this variety of IVs, and the coefficients estimated for each
IV show how they impact the probability of the binary outcome (DV).
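In R, a categorical IV only needs to be stored as a factor: glm() builds the dummy (indicator) variables for you. A minimal sketch with made-up variable names and values, just to show the mechanics:

# Hypothetical data: one binary DV, one categorical IV, one continuous IV
df <- data.frame(
  passed = c(1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0),
  region = factor(rep(c("North", "South", "East", "West"), 3)),
  age    = c(21, 34, 27, 45, 30, 23, 40, 29, 33, 26, 38, 31)
)

# glm() expands 'region' into dummy variables automatically
fit <- glm(passed ~ region + age, data = df, family = binomial)

# To see the dummy coding explicitly:
model.matrix(~ region, data = df)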
• Binary Logistic Regression: This is used when you have only
two options for your dependent variable (DV). For example,
predicting whether a student will pass (yes/no).
• Multinomial Logistic Regression: When your dependent
variable has more than two categories that are not ordered.
For instance, if you’re predicting the type of fruit someone
likes: apple, banana, or orange. Here, there’s no ranking; they
are just different categories.
• Ordinal Logistic Regression: This is used when your
dependent variable has more than two categories that are
ordered. For example, if you’re predicting levels of education:
high school, bachelor’s, master’s. Here, the categories have a
clear order (from lower to higher education).
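Each variant is fitted with a different R function. A brief sketch with made-up data and variable names, purely to show which functions are involved (nnet and MASS ship with R):

# Small synthetic example (data and names are made up for illustration)
set.seed(1)
n <- 60
dat <- data.frame(
  x     = rnorm(n),
  fruit = factor(sample(c("apple", "banana", "orange"), n, replace = TRUE)),
  edu   = factor(sample(c("high school", "bachelor's", "master's"), n, replace = TRUE),
                 levels = c("high school", "bachelor's", "master's"), ordered = TRUE)
)

# Multinomial logistic regression (unordered DV) -- nnet package
library(nnet)
multi_fit <- multinom(fruit ~ x, data = dat)

# Ordinal logistic regression (ordered DV) -- MASS package
library(MASS)
ord_fit <- polr(edu ~ x, data = dat)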
Why can’t we use Linear Regression?
• Linear Regression
• Output Range: The dependent variable (Y) can take any value from −∞ to +∞. This is
suitable for continuous outcomes, like predicting a person’s height or salary.
Negative infinity: Y can be as low as we can imagine, with no lower limit. For example,
a net salary could in principle be very low (taking debt into account). Positive infinity:
Similarly, Y can be arbitrarily high, with no upper limit (like an account balance). In
real life, height is always positive and would never approach negative infinity, but the
linear regression framework applies to any continuous variable, even if it has natural
limits in practice.
• Nature of Prediction: The model fits a line to minimize the difference between the
predicted values and actual values (least squares), making it effective for numeric
predictions.
• Logistic Regression
• Output: The dependent variable (Y) is binary, meaning it can only be 0 or 1. This
represents categories like "yes/no," "success/failure," or any two distinct groups.
• Nature of Prediction: Logistic regression predicts the probability of Y being 1 (success).
It uses the logistic function to map any real-valued number into a range between 0 and 1,
ensuring that predictions are valid probabilities.
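The practical difference shows up in the fitted values: a linear model can return "probabilities" below 0 or above 1, while the logistic model cannot. A quick sketch with made-up data:

# Made-up binary data for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)

lm_fit  <- lm(y ~ x)                          # straight line, unbounded
glm_fit <- glm(y ~ x, family = binomial)      # sigmoid, bounded in (0, 1)

range(predict(lm_fit))                        # can fall outside [0, 1]
range(predict(glm_fit, type = "response"))    # always stays within (0, 1)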
• Odds:
• In statistics, odds represent the ratio of the probability
of an event happening to the probability of it not
happening. It's often used to describe binary outcomes
(like success/failure, yes/no).
• Odds = P(event) / (1 − P(event))
• Where:
• P(event) is the probability of the event happening (a
value between 0 and 1).
• 1−P(event) is the probability of the event not happening.
• Example:
• If the probability of an event (say, winning a
game) is 0.8 (80%), then the odds of winning
are:
• Odds of winning = 0.8 / (1 − 0.8) = 0.8 / 0.2 = 4
• This means the odds are 4 to 1 in favor of
winning. For every 4 chances of winning,
there's 1 chance of not winning.
• Log Odds:
• The log odds (also called the "logit") is simply
the natural logarithm of the odds. It's used in
logistic regression to convert probabilities,
which are bounded between 0 and 1, into a
range that can be positive or negative, making
it easier to model with a linear equation.
• Log Odds = ln(P(event) / (1 − P(event)))
• Example:
• If the odds of winning a game are 4 (as in the
previous example), then the log odds are:
• Log Odds = ln(4) ≈ 1.386
• This means a log odds of 1.386 corresponds to
a 4-to-1 chance of winning the game. When we
say the odds are 4-to-1, it means the event is 4
times more likely to happen than not happen.
• Relationship Between Probability, Odds, and
Log Odds:
• If you have probability, you can calculate odds:
• Odds = P / (1 − P)
• If you have odds, you can calculate log odds:
• Log Odds=ln⁡(Odds)
• To go from log odds back to probability:
• P = e^(Log Odds) / (1 + e^(Log Odds))
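These conversions are easy to verify in R; a small sketch using the 0.8 example from the earlier slide (base R's plogis()/qlogis() perform the same conversions):

# Probability -> odds -> log odds, and back again
p <- 0.8
odds <- p / (1 - p)        # 0.8 / 0.2 = 4
log_odds <- log(odds)      # ln(4) ≈ 1.386, same as qlogis(p)

# Log odds back to probability
p_back <- exp(log_odds) / (1 + exp(log_odds))   # ≈ 0.8, same as plogis(log_odds)
p_back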
• Logistic Regression Models Log Odds:
• The logistic regression equation predicts the log odds of the
dependent variable (usually binary, like 0 or 1) rather than
directly predicting the probability itself. So, understanding log
odds helps you grasp the essence of the logistic regression
model.
• The logistic regression equation:
• Log Odds = ln(P / (1 − P)) = β0 + β1X1 + β2X2 + ⋯ + βnXn
• ​Here, the left-hand side is the log odds of the event happening,
and the right-hand side is a linear equation involving the
predictors (X1,X2,…,Xn) and their corresponding coefficients
(β1,β2,…,βn).
• Logistic regression transforms the log odds back
into a probability using the logistic (sigmoid)
function. This is important because we want to
predict probabilities (values between 0 and 1)
but the linear regression model works best with
unbounded values (which log odds provide).
• Once you have the log odds, you can convert
them to probabilities with this equation:
• P(y=1 | X) = 1 / (1 + e^−(β0 + β1X1 + ⋯ + βnXn))
• Where:
• P(y=1∣X) is the probability that the output y is 1 given
the input X.
• e is the base of the natural logarithm.
• β0 is the intercept (constant term).
• β1, β2, …, βn are the coefficients for the predictor
variables X1, X2, …, Xn respectively.
• X1,X2,…,Xn are the input features (independent
variables).
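Putting the two equations together in R: the intercept and slopes below are made-up values purely for illustration, not estimates from any real model.

# Assumed coefficients (for illustration only)
b0 <- -3.5; b1 <- 0.17; b2 <- -0.20
x1 <- 30;   x2 <- 8                      # one observation's predictor values

log_odds <- b0 + b1 * x1 + b2 * x2       # right-hand side of the logit equation
prob     <- 1 / (1 + exp(-log_odds))     # sigmoid; equivalent to plogis(log_odds)
prob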
Steps to Perform Binary Logistic Regression
in R
• Install and Load Necessary Packages: If you haven’t already, you
may want to install the dplyr package for data manipulation.

install.packages("dplyr")   # Install dplyr package
library(dplyr)              # Load dplyr package
Prepare Your Data: Ensure your data is in a data frame format,
with the dependent variable as a factor.

# Sample data frame
data <- data.frame(outcome    = c(0, 1, 0, 1, 1, 0, 1, 0, 1, 0),
                   predictor1 = c(22, 25, 30, 35, 40, 22, 30, 35, 28, 32),
                   predictor2 = c(5, 6, 8, 9, 10, 7, 8, 9, 5, 6))

# Convert outcome to factor
data$outcome <- as.factor(data$outcome)

Fit the Logistic Regression Model: Use the glm()
function to fit the model.

# Fit the logistic regression model
model <- glm(outcome ~ predictor1 + predictor2,
             data = data, family = binomial)
• model
Call:  glm(formula = outcome ~ predictor1 + predictor2, family = binomial, data = data)

Coefficients:
(Intercept)   predictor1   predictor2
    -3.4926       0.1664      -0.2041

Degrees of Freedom: 9 Total (i.e. Null);  7 Residual
Null Deviance:     13.86
Residual Deviance: 12.8      AIC: 18.8
• Call: This indicates the model call that was executed.
• glm: This is the function used to fit a Generalized Linear Model.
• formula = outcome ~ predictor1 + predictor2: This specifies the
relationship between the dependent variable (outcome) and the
independent variables (predictor1 and predictor2). It means that the
model predicts the outcome based on predictor1 and predictor2.
• family = binomial: This indicates that the model is a logistic
regression model, which is appropriate for binary outcomes (e.g.,
success/failure, yes/no).
• data = data: This specifies the dataset used to fit the model.
• Coefficients: (Intercept) = -3.4926, predictor1 = 0.1664, predictor2 = -0.2041
• This section lists the estimated coefficients for the model:
– (Intercept): The intercept (β₀) is -3.4926. This is the log odds of
the outcome when all predictors are 0. In the context of logistic
regression, it represents the baseline log odds.
– predictor1: The coefficient for predictor1 is 0.1664. This indicates
that for every one-unit increase in predictor1, the log odds of the
outcome increase by 0.1664, holding all other variables constant.
– predictor2: The coefficient for predictor2 is -0.2041. This indicates
that for every one-unit increase in predictor2, the log odds of the
outcome decrease by 0.2041, holding all other variables constant.
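Because these coefficients are on the log-odds scale, it is common to exponentiate them to get odds ratios, which are often easier to read; a short sketch using the model fitted above:

# Odds ratios: exponentiate the log-odds coefficients
exp(coef(model))
# e.g. exp(0.1664) ≈ 1.18: each one-unit increase in predictor1 multiplies
# the odds of outcome = 1 by about 1.18, holding predictor2 constant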
• Degrees of Freedom: 9 Total (i.e. Null); 7 Residual
• Degrees of Freedom: This refers to the number of
independent values that can vary in the analysis.
– Total (i.e. Null): This is the total degrees of freedom for
the model, which is 9 in this case. It represents the total
number of observations minus 1 (n - 1), where n is the
total number of data points.
– Residual: This is the degrees of freedom remaining after
fitting the model, which is 7 in this case. It is the total (null)
degrees of freedom minus the number of predictors estimated
(9 − 2 = 7); equivalently, the number of observations minus the
number of parameters estimated, including the intercept (10 − 3 = 7).
• Null Deviance: 13.86
• Null Deviance: This measures how well the
response variable is predicted by a model with
only the intercept (no predictors). It reflects
how much variation exists in the outcome
when only the mean outcome is used as a
predictor. A higher null deviance indicates
more variation.
• Residual Deviance: 12.8
• Residual Deviance: This measures how well the response variable is
predicted by the model that includes the predictors (predictor1 and
predictor2). A lower residual deviance compared to the null deviance
indicates that the predictors improve the model fit.
• AIC: 18.8
• AIC (Akaike Information Criterion): This is a measure of the relative
quality of the statistical model for a given set of data. It penalizes the
complexity of the model (i.e., the number of parameters) to avoid
overfitting. Lower AIC values indicate a better-fitting model. The AIC can
be used to compare different models: the one with the lowest AIC is
generally preferred.
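To check whether the predictors actually improve on the intercept-only model, the drop from the null deviance to the residual deviance can be tested with a chi-square (likelihood-ratio) test; a sketch, using the model object fitted earlier:

# Likelihood-ratio test: full model vs intercept-only model
lr_stat <- model$null.deviance - model$deviance    # 13.86 - 12.8 ≈ 1.06
lr_df   <- model$df.null - model$df.residual       # 9 - 7 = 2
pchisq(lr_stat, df = lr_df, lower.tail = FALSE)    # p-value of the improvement

# Analysis of deviance for the terms added one at a time
anova(model, test = "Chisq")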
View model summary
• summary(model)
Make Predictions: You can use the model to predict probabilities
of success.


# Predict probabilities
data$predicted_probabilities <- predict(model, type = "response")

Classify Outcomes: Convert probabilities to binary outcomes
based on a cutoff (commonly 0.5).

data$predicted_outcomes <- ifelse(data$predicted_probabilities > 0.5, 1, 0)
Evaluate Model Performance: Check the
accuracy of your predictions.
# Evaluate accuracy (confusion matrix of actual vs predicted)
table(data$outcome, data$predicted_outcomes)
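Overall accuracy can be computed from the same comparison; note that outcome was converted to a factor earlier, so it is converted back to numeric here:

# Proportion of observations classified correctly
accuracy <- mean(data$predicted_outcomes ==
                 as.numeric(as.character(data$outcome)))
accuracy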
• You can visualize the predicted probabilities to see how well the
model fits:

# Plot predicted probabilities
library(ggplot2)
ggplot(data, aes(x = predictor1, y = predicted_probabilities)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(title = "Predicted Probabilities from Logistic Regression",
       x = "Predictor 1", y = "Predicted Probability")
Summary
