Logistic regression is both a classification and a regression technique, depending on the scenario in which it is used. Logistic regression (also called logit regression) is a type of regression analysis used for predicting the outcome of a categorical dependent variable, much as OLS regression predicts a continuous one. In logistic regression, the dependent variable (Y) is binary (0/1) and the independent variables (X) are continuous in nature. The probabilities describing the possible outcomes of a single trial (the probability that Y = 1) are modelled as a logistic function of the predictor variables. In the logistic regression model, there is no R² to gauge the fit of the overall model; instead, a chi-square test is used to gauge how well the model fits the data. The goal of logistic regression is to predict the likelihood that Y is equal to 1 (the probability that Y = 1 rather than 0) given certain values of X. That is, if X and Y have a strong positive relationship, the probability that a person will have a score of Y = 1 will increase as the value of X increases. So, we are predicting probabilities rather than the scores of the dependent variable.
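To make this concrete, the following is a minimal sketch of logistic regression in Python using scikit-learn; the dataset and variable names are invented purely for illustration.

```python
# Minimal sketch: logistic regression on a single continuous predictor.
# The data below are made up purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])  # continuous X
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])                  # binary Y (0/1)

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(Y = 0), P(Y = 1)] for each observation;
# we are predicting probabilities, not scores of the dependent variable.
print(model.predict_proba([[4.5]]))
```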
For example, we might try to predict whether a small project will succeed or fail on the basis of the number of years of experience of the project manager handling it. We presume that project managers who have been managing projects for many years will be more likely to succeed. This means that as X (the number of years of experience of the project manager) increases, the probability that Y will be equal to 1 (success of the new project) will tend to increase. Take a hypothetical example in which 60 already executed projects were studied and the years of experience of the project managers ranges from 0 to 40 years. We could represent this tendency for the probability that Y = 1 to increase with a graph. To illustrate this, it is convenient to segregate years of experience into categories (i.e. 0–8, 9–16, 17–24, 25–32, 33–40). If we compute the mean score on Y (averaging the 0s and 1s) for each category of years of experience, we will get something like the values plotted in Figure 8.18.
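This binning step is easy to sketch in code. The snippet below simulates a sample of projects (the 60 projects in the example are hypothetical, so the data here are invented) and computes the mean of Y for each experience band.

```python
# Sketch: mean success rate (average of 0s and 1s) per experience band.
# The project data here are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(42)
years = rng.uniform(0, 40, size=60)               # years of PM experience
# Simulate a success probability that rises with experience (illustrative).
success = (rng.random(60) < years / 40).astype(int)

bins = [0, 8, 16, 24, 32, 40]                     # 0-8, 9-16, ..., 33-40
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (years >= lo) & (years < hi)
    if mask.any():
        print(f"{lo}-{hi} years: mean Y = {success[mask].mean():.2f}")
```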
When the graph is drawn for the above values of X and Y, it appears like the graph in Figure 8.18. As X increases, the probability that Y = 1 increases. In other words, when the project manager has more years of experience, a larger percentage of projects succeed. A perfect relationship is represented by a smooth S-shaped curve rather than by the straight line of OLS regression. So, to model this relationship, we need some algebra that accounts for the bends in the curve.

FIG. 8.18 Logistic regression
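To see why the straight line of OLS is a poor fit for a binary outcome, the sketch below (again on invented data) fits both an OLS line and a logistic curve: the line can produce "probabilities" below 0 or above 1, while the logistic curve cannot.

```python
# Sketch: OLS straight line vs logistic S-curve on a binary outcome.
# Data are invented; this only illustrates the shape difference.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.linspace(0, 40, 60).reshape(-1, 1)         # years of experience
noise = np.random.default_rng(0).normal(0, 8, 60)
y = (X.ravel() + noise > 20).astype(int)          # binary success flag

ols = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

grid = np.array([[-5.0], [0.0], [20.0], [40.0], [45.0]])
print(ols.predict(grid))                 # can fall below 0 or exceed 1
print(logistic.predict_proba(grid)[:, 1])  # always stays between 0 and 1
```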
An explanation of logistic regression begins with an explanation of the logistic function, which always takes values between zero and one. The logistic formulae are stated in terms of the probability that Y = 1, which is referred to as P. The probability that Y is 0 is 1 − P:

ln(P / (1 − P)) = a + bX
The ‘ln’ symbol refers to the natural logarithm, and a + bX is the regression line equation. The probability P can also be computed from the regression equation:

P = exp(a + bX) / (1 + exp(a + bX))

So, if we know the regression equation, we could, theoretically, calculate the expected probability that Y = 1 for a given value of X.
Here, ‘exp’ is the exponential function, which is sometimes also written as e, i.e. exp(x) = e^x.
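These two formulae are straightforward to express in code. The sketch below defines the logit (log-odds) and the logistic function and shows that each undoes the other.

```python
# Sketch of the two formulae above: the logit (log-odds) and the
# logistic function that inverts it.
import math

def logit(p: float) -> float:
    """ln(P / (1 - P)), which the model sets equal to a + bX."""
    return math.log(p / (1 - p))

def logistic(x: float) -> float:
    """P = exp(x) / (1 + exp(x)), with x standing in for a + bX."""
    return math.exp(x) / (1 + math.exp(x))

print(logistic(logit(0.8)))  # 0.8 -- round-trip shows they are inverses
```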
Let us say we have a model that can predict whether a
person is male or female on the basis of their height. Given a height of 150 cm, we need to predict whether the person is male or female.
We know the coefficients to be a = −100 and b = 0.6. Using the above equation, we can calculate the probability that the person is male given a height of 150 cm, or more formally, P(male | height = 150):
P = exp(a + bX) / (1 + exp(a + bX))
P = exp(−100 + 0.6 × 150) / (1 + exp(−100 + 0.6 × 150))
P ≈ 0.000045
That is, the probability that the person is male is nearly zero.
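This arithmetic is easy to verify in code by plugging the stated coefficients into the formula:

```python
# Check of the worked example: P(male | height = 150) with a = -100, b = 0.6.
import math

a, b, height = -100, 0.6, 150
p_male = math.exp(a + b * height) / (1 + math.exp(a + b * height))
print(p_male)  # ~4.54e-05, i.e. a probability of nearly zero
```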
Assumptions in logistic regression
The following assumptions must hold when building a logistic
regression model:
- There exists a linear relationship between the logit function and the independent variables (see the sketch after this list).
- The dependent variable Y must be categorical (1/0) and take a binary value, e.g. if pass then Y = 1; else Y = 0.
- The data meets the ‘iid’ criterion, i.e. the error terms, ε, are independent from one another and identically distributed.
- The error term follows a binomial distribution [n, p], where n = the number of records in the data and p = the probability of success (pass, responder).
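The first assumption can be checked informally by binning X, computing the empirical log-odds of Y = 1 in each bin, and looking for a roughly linear trend. The sketch below does this on simulated data; the binning scheme and the data themselves are illustrative assumptions, not part of the text.

```python
# Sketch: empirical check of the linearity-of-the-logit assumption.
# Bin X, compute the log-odds of Y = 1 per bin, and inspect the trend.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 40, 500)
p_true = 1 / (1 + np.exp(-(-4 + 0.2 * x)))       # true model: linear in the logit
y = (rng.random(500) < p_true).astype(int)

edges = np.linspace(0, 40, 6)
for lo, hi in zip(edges[:-1], edges[1:]):
    m = (x >= lo) & (x < hi)
    p = y[m].mean()
    p = min(max(p, 1e-3), 1 - 1e-3)              # guard against log(0)
    print(f"x in [{lo:.0f}, {hi:.0f}): empirical logit = {np.log(p / (1 - p)):+.2f}")
# Roughly equal jumps from bin to bin suggest linearity in the logit.
```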
8.3.8 Maximum Likelihood Estimation
The coefficients in a logistic regression are estimated using a process called Maximum Likelihood Estimation (MLE). First, let us understand what a likelihood function is before moving on to MLE.