
Regression 3: Logistic Regression

Marco Baroni

Practical Statistics in R
Outline

Logistic regression

Logistic regression in R
Outline

Logistic regression
Introduction
The model
Looking at and comparing fitted models

Logistic regression in R
Modeling discrete response variables

- In a very large number of problems in cognitive science and related fields:
  - the response variable is categorical, often binary (yes/no; acceptable/not acceptable; phenomenon takes place/does not take place)
  - the potentially explanatory factors (independent variables) are categorical, numerical or both
Examples: binomial responses

- Is linguistic construction X rated as “acceptable” in the following condition(s)?
- Does sentence S, which has features Y, W and Z, display phenomenon X? (linguistic corpus data!)
- Is it common for subjects to decide to purchase the good X given these conditions?
- Did subjects make more errors in this condition?
- How many people answer YES to question X in the survey?
- Do old women like X more than young men?
- Did the subject feel pain in this condition?
- How often was reaction X triggered by these conditions?
- Do children with characteristics X, Y and Z tend to have autism?
Examples: multinomial responses
- Discrete response variable with natural ordering of the levels:
  - Ratings on a 6-point scale
    (depending on the number of points on the scale, you might also get away with a standard linear regression)
  - Subjects answer YES, MAYBE, NO
  - Subject reaction is coded as FRIENDLY, NEUTRAL, ANGRY
  - The cochlear data: the experiment is set up so that possible errors are de facto on a 7-point scale
- Discrete response variable without natural ordering:
  - A subject decides to buy one of 4 different products
  - We have brain scans of subjects seeing 5 different objects, and we want to predict the seen object from features of the scan
  - We model the chances of developing 4 different (and mutually exclusive) psychological syndromes in terms of a number of behavioural indicators
Binomial and multinomial logistic regression models

- Problems with binary (yes/no, success/failure, happens/does not happen) dependent variables are handled by (binomial) logistic regression
- Problems with more than two discrete outcomes are handled by
  - ordinal logistic regression, if the outcomes have a natural ordering
  - multinomial logistic regression otherwise
- The output of ordinal and especially multinomial logistic regression tends to be hard to interpret; whenever possible, try to reduce the problem to a binary choice
  - E.g., if the output is yes/maybe/no, treat “maybe” as “yes” and/or as “no”
- Here, I focus entirely on the binomial case
Don’t be afraid of logistic regression!

- Logistic regression seems less popular than linear regression
- This might be due in part to historical reasons:
  - the formal theory of generalized linear models is relatively recent: it was developed in the early nineteen-seventies
  - the iterative maximum likelihood methods used for fitting logistic regression models require more computational power than solving the least squares equations
- Results of logistic regression are not as straightforward to understand and interpret as linear regression results
- Finally, there might also be a bit of prejudice against discrete data as less “scientifically credible” than hard-science-like continuous measurements
Don’t be afraid of logistic regression!

- Still, if it is natural to cast your problem in terms of a discrete variable, you should go ahead and use logistic regression
- Logistic regression might be trickier to work with than linear regression, but it’s still much better than pretending that the variable is continuous or artificially re-casting the problem in terms of a continuous response
The Machine Learning angle

- Classification of a set of observations into 2 or more discrete categories is a central task in Machine Learning
- The classic supervised learning setting:
  - Data points are represented by a set of features, i.e., discrete or continuous explanatory variables
  - The “training” data also have a label indicating the class of the data point, i.e., a discrete binomial or multinomial dependent variable
  - A model (e.g., in the form of weights assigned to the explanatory variables) is fitted on the training data
  - The trained model is then used to predict the class of unseen data points (where we know the values of the features, but we do not have the label)
The Machine Learning angle

- Same setting as logistic regression, except that the emphasis is placed on predicting the class of unseen data, rather than on the significance of the effect of the features/independent variables (which are often too many – hundreds or thousands – to be analyzed individually) in discriminating the classes
- Indeed, logistic regression is also a standard technique in Machine Learning, where it is sometimes known as Maximum Entropy
Outline

Logistic regression
Introduction
The model
Looking at and comparing fitted models

Logistic regression in R
Classic multiple regression

- The by now familiar model:

  y = β0 + β1 × x1 + β2 × x2 + ... + βn × xn + ε

- Why will this not work if the response variable is binary (0/1)?
- Why will it not work if we try to model proportions instead of responses (e.g., the proportion of YES-responses in condition C)?
Modeling log odds ratios
- Following up on the “proportion of YES-responses” idea, let’s say that we want to model the probability of one of the two responses (which can be seen as the population proportion of the relevant response for a certain choice of the values of the independent variables)
- Probability ranges from 0 to 1, but we can look at the logarithm of the odds ratio instead:

  logit(p) = log(p / (1 − p))

- This is the logarithm of the ratio of the probability of a 1-response to the probability of a 0-response
- It is arbitrary what counts as a 1-response and what counts as a 0-response, although this might hinge on the ease of interpretation of the model (e.g., treating YES as the 1-response will probably lead to more intuitive results than treating NO as the 1-response)
- Log odds ratios are not the most intuitive measure (at least for me), but they range continuously from −∞ to +∞
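
As a quick illustration in R (a minimal sketch; qlogis() and plogis() are base R’s logit and inverse logit):

  # logit and inverse logit in base R
  p <- c(0.1, 0.5, 0.9)
  log(p / (1 - p))    # log odds ratios: -2.197  0.000  2.197
  qlogis(p)           # same thing, via the built-in logit function
  plogis(qlogis(p))   # back-transforming recovers the original probabilities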
From probabilities to log odds ratios

[Plot: logit(p) plotted against p, for p between 0 and 1; the curve runs from about −5 to +5, changing steeply near p = 0 and p = 1 and gently around p = .5]
The logistic regression model

- Predicting log odds ratios:

  logit(p) = β0 + β1 × x1 + β2 × x2 + ... + βn × xn

- Back to probabilities:

  p = e^logit(p) / (1 + e^logit(p))

- Thus:

  p = e^(β0 + β1 × x1 + β2 × x2 + ... + βn × xn) / (1 + e^(β0 + β1 × x1 + β2 × x2 + ... + βn × xn))
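
A small sketch of this back-transformation in R (the β values and the x1 range below are invented purely for illustration):

  # from a linear predictor to a probability
  beta0 <- -1; beta1 <- 0.5
  x1 <- seq(-4, 4, by = 1)
  lp <- beta0 + beta1 * x1        # logit(p), the linear combination
  p  <- exp(lp) / (1 + exp(lp))   # equivalently: plogis(lp)
  round(cbind(x1, logit = lp, p), 3)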
From log odds ratios to probabilities

[Plot: p plotted against logit(p) for values from −10 to +10; the familiar S-shaped (sigmoid) curve rising from 0 to 1]
Probabilities and responses

[Plot: the same sigmoid curve relating logit(p) to p, with observed binary responses shown as points along the p = 0 and p = 1 lines]
A subtle point: no error term

- NB:

  logit(p) = β0 + β1 × x1 + β2 × x2 + ... + βn × xn

- The outcome here is not the observation, but (a function of) p, the expected value of the observation (i.e., the probability of a 1-response) given the current values of the independent variables
- This probability has the classic “coin tossing” Bernoulli distribution, and thus the variance is not a free parameter to be estimated from the data, but a model-determined quantity given by p(1 − p)
- Notice that the errors, computed as observation − p, are not independently normally distributed: their magnitude must be near 0 or near 1 for high and low ps, and near .5 for ps in the middle
The generalized linear model

- Logistic regression is an instance of a “generalized linear model”
- Somewhat brutally, in a generalized linear model
  - a weighted linear combination of the explanatory variables models a function of the expected value of the dependent variable (the “link” function)
  - the actual data points are modeled in terms of a distribution function that has the expected value as a parameter
- This is a general framework that uses the same fitting techniques to estimate models for different kinds of data
Linear regression as a generalized linear model

- Linear prediction of a function of the mean:

  g(E(y)) = Xβ

- The “link” function is the identity:

  g(E(y)) = E(y)

- Given the mean, observations are normally distributed, with variance estimated from the data
- This corresponds to the error term with mean 0 in the linear regression model
Logistic regression as a generalized linear model

- Linear prediction of a function of the mean:

  g(E(y)) = Xβ

- The “link” function is the logit:

  g(E(y)) = log(E(y) / (1 − E(y)))

- Given E(y), i.e., p, observations have a Bernoulli distribution with variance p(1 − p)
Estimation of logistic regression models

- Minimizing the sum of squared errors is not a good way to fit a logistic regression model
- The least squares method is based on the assumption that errors are normally distributed and independent of the expected (fitted) values
- As we just discussed, in logistic regression the errors depend on the expected (p) values (large variance near .5, variance approaching 0 as p approaches 1 or 0), and for each p they can take only two values (1 − p if the response was 1, −p otherwise)
Estimation of logistic regression models

- The β terms are estimated instead by maximum likelihood, i.e., by searching for the set of βs that makes the observed responses maximally likely (i.e., a set of βs that will in general assign a high p to 1-responses and a low p to 0-responses)
- There is no closed-form solution to this problem, and the optimal β vector is found with iterative “trial and error” techniques
- Least-squares fitting finds the maximum likelihood estimate for linear regression; conversely, maximum likelihood fitting for logistic regression is carried out by a form of weighted least squares fitting
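
To make the last point concrete, here is a rough sketch of the iteratively re-weighted least squares idea on simulated data; the data, starting values and number of iterations are invented for illustration, and this shows the principle rather than the exact routine glm() runs:

  # iteratively re-weighted least squares for logistic regression (toy sketch)
  set.seed(1)
  x <- cbind(1, rnorm(100))                    # design matrix with intercept column
  y <- rbinom(100, 1, plogis(x %*% c(-0.5, 1)))
  beta <- c(0, 0)                              # starting values
  for (i in 1:25) {
    p <- plogis(x %*% beta)                    # current fitted probabilities
    w <- as.vector(p * (1 - p))                # Bernoulli variances used as weights
    z <- x %*% beta + (y - p) / w              # "working" response
    beta <- solve(t(x) %*% (w * x), t(x) %*% (w * z))   # weighted least squares step
  }
  cbind(irls = as.vector(beta),
        glm  = coef(glm(y ~ x[, 2], family = "binomial")))   # essentially identical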
Outline

Logistic regression
Introduction
The model
Looking at and comparing fitted models

Logistic regression in R
Interpreting the βs

- Again, as a rough-and-ready criterion, if a β is more than 2 standard errors away from 0, we can say that the corresponding explanatory variable has an effect that is significantly different from 0 (at α = 0.05)
- However, p is not a linear function of Xβ, and the same β will correspond to a more drastic impact on p towards the center of the p range than near the extremes (recall the S shape of the p curve)
- As a rule of thumb (the “divide by 4” rule), β/4 is an upper bound on the difference in p brought about by a unit difference in the corresponding explanatory variable (a quick numerical check follows below)
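
A quick numerical check of the rule in R (the β value below is made up for illustration):

  # the logistic curve is steepest at p = .5, where its slope is beta/4
  beta <- 1.2
  beta / 4                                  # upper bound on the change in p: 0.3
  plogis(0.5 * beta) - plogis(-0.5 * beta)  # actual change around the centre: ~0.29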
Goodness of fit

- Again, measures such as R² based on residual errors are not very informative
- One intuitive measure of fit is the error rate, given by the proportion of data points in which the model assigns p > .5 to 0-responses or p < .5 to 1-responses
- This can be compared to a baseline in which the model always predicts 1 if the majority of data points are 1, or 0 if the majority of data points are 0 (the baseline error rate is given by the proportion of minority responses over the total)
- Some information is lost (a .9 and a .6 prediction are treated equally)
- Other measures of fit have been proposed in the literature, but there is no widely agreed upon standard
Binned goodness of fit

- Goodness of fit can be inspected visually by grouping the ps into equally wide bins (0–0.1, 0.1–0.2, ...) and plotting the average p predicted by the model for the points in each bin vs. the observed proportion of 1-responses for the data points in the bin (a sketch of this idea follows below)
- We can also compute an R² or other goodness of fit measure on these binned data
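
As an illustration only, a minimal binned check on simulated data (the variables x, y and the model m are invented for this sketch; with real data you would use your own fitted model and observed responses):

  set.seed(1)
  x <- rnorm(500)
  y <- rbinom(500, 1, plogis(-0.5 + x))
  m <- glm(y ~ x, family = "binomial")
  bins <- cut(fitted(m), breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
  plot(tapply(fitted(m), bins, mean),   # mean predicted p per bin
       tapply(y, bins, mean),           # observed proportion of 1-responses per bin
       xlab = "mean predicted p", ylab = "observed proportion of 1s")
  abline(0, 1)                          # points near the diagonal indicate good fit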
Deviance

- Deviance is an important measure of the fit of a model, also used to compare models
- Simplifying somewhat, the deviance of a model is −2 times the log likelihood of the data under the model
  - plus a constant that would be the same for all models for the same data, and so can be ignored since we always look at differences in deviance
- The larger the deviance, the worse the fit
- As we add parameters, deviance decreases
Deviance

- The difference in deviance between a simpler and a more complex model approximately follows a χ² distribution, with the difference in number of parameters as degrees of freedom
- This leads to the handy rule of thumb that the improvement is significant (at α = .05) if the deviance difference is larger than the parameter difference (play around with pchisq() in R to see that this is the case; a small example follows below)
- A model can also be compared against the “null” model that always predicts the same p (given by the proportion of 1-responses in the data) and has only one parameter (the fixed predicted value)
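
For example (the deviance drop of 6.57 on 2 df anticipates the anova() comparison shown later in these slides):

  # p-value for a drop in deviance of D with k extra parameters
  pchisq(6.57, df = 2, lower.tail = FALSE)   # ~0.04
  qchisq(0.95, df = 1:5)                     # critical values at alpha = .05 for 1-5 df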
Outline

Logistic regression

Logistic regression in R
Preparing the data and fitting the model
Practice
Back to Graffeo et al.’s discount study
Fields in the discount.txt file

subj          Unique subject code
sex           M or F
age           NB: contains some NAs
presentation  absdiff (amount of discount), result (price after discount), percent (percentage discount)
product       pillow, (camping) table, helmet, (bed) net
choice        Y (buys), N (does not buy) → the discrete response variable
Preparing the data

- Read the file into an R data frame, look at the summaries, etc.
- Note in the summary of age that R “understands” NAs (i.e., it is not treating age as a categorical variable)
- We can filter out the rows containing NAs as follows:

  > e<-na.omit(d)

- Compare the summaries of d and e
- na.omit can also be passed as an option to the modeling functions, but I feel uneasy about that
- Attach the NA-free data frame (a full sketch of these steps follows below)
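
Putting these steps together, a minimal sketch (assuming discount.txt is tab-delimited; otherwise use read.table() with the appropriate options):

  d <- read.delim("discount.txt")   # assumption: tab-delimited file with a header row
  summary(d)                        # note the NAs in age
  e <- na.omit(d)                   # drop rows containing NAs
  summary(e)                        # compare with summary(d)
  attach(e)                         # make choice, age, etc. directly accessible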
Logistic regression in R

> sex_age_pres_prod.glm<-glm(choice~sex+age+
presentation+product,family="binomial")

> summary(sex_age_pres_prod.glm)
Selected lines from the summary() output

- Estimated β coefficients, standard errors and z scores (β / std. error):

  Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
  sexM                 -0.332060   0.140008  -2.372  0.01771 *
  age                  -0.012872   0.006003  -2.144  0.03201 *
  presentationpercent   1.230082   0.162560   7.567 3.82e-14 *
  presentationresult    1.516053   0.172746   8.776  < 2e-16 *

- Note the automated creation of binary dummy variables: discounts presented as percentages and as resulting values are significantly more likely to lead to a purchase than discounts expressed as an absolute difference (the default level)
  - Use relevel() to set another level of a categorical variable as the default (a short example follows below)
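
For example, a sketch of switching the reference level to the percentage presentation (assuming presentation was read in as a factor; "percent" is one of the levels listed in the data description):

  e$presentation <- relevel(e$presentation, ref = "percent")
  # refit the model (e.g., with data = e): the presentation coefficients are now
  # contrasts against the "percent" level rather than "absdiff"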
Deviance

- For the “null” model and for the current model:

  Null deviance:     1453.6 on 1175 degrees of freedom
  Residual deviance: 1284.3 on 1168 degrees of freedom

- The difference in deviance (169.3) is much higher than the difference in parameters (7), suggesting that the current model is significantly better than the null model
Comparing models

- Let us add a sex by presentation interaction term:

  > interaction.glm<-glm(choice~sex+age+presentation+
    product+sex:presentation,family="binomial")

- Are the extra parameters justified?

  > anova(sex_age_pres_prod.glm,interaction.glm,
    test="Chisq")
  ...
    Resid. Df Resid. Dev Df Deviance P(>|Chi|)
  1      1168    1284.25
  2      1166    1277.68  2     6.57     0.04

- Apparently, yes (although summary(interaction.glm) suggests just a marginal interaction between sex and the percentage dummy variable)
Error rate
- The model makes an error when it assigns p > .5 to an observation where choice is N, or p < .5 to an observation where choice is Y:

  > sum((fitted(sex_age_pres_prod.glm)>.5 & choice=="N") |
        (fitted(sex_age_pres_prod.glm)<.5 & choice=="Y")) /
    length(choice)
  [1] 0.2721088

- Compare to the error rate of a baseline model that always guesses the majority choice:

  > table(choice)
  choice
    N   Y
  363 813
  > sum(choice=="N")/length(choice)
  [1] 0.3086735

- The improvement in error rate is nothing to write home about...


Binned fit
- The languageR function for plotting binned expected and observed proportions of 1-responses, as well as bootstrap validation, require a logistic model fitted with lrm(), the logistic regression fitting function from the Design package:

  > sex_age_pres_prod.glm<-
    lrm(choice~sex+age+presentation+product,
    x=TRUE,y=TRUE)

- The languageR version of the binned plot function (plot.logistic.fit.fnc) dies on our model, since it never predicts p < 0.1, so I hacked my own version, which you can find in the r-data-1 directory:

  > source("hacked.plot.logistic.fit.fnc.R")
  > hacked.plot.logistic.fit.fnc(sex_age_pres_prod.glm,e)

- (Incidentally: in cases like this where something goes wrong, you can peek inside a function simply by typing its name)
Bootstrap estimation

- Validation using the logistic model estimated by lrm() and 1,000 iterations:

  > validate(sex_age_pres_prod.glm,B=1000)

- When fed a logistic model, validate() returns various measures of fit we have not discussed: see, e.g., Baayen’s book
- Independently of the interpretation of the measures, the size of the optimism indices gives a general idea of the amount of overfitting (not dramatic in this case)
Mixed model logistic regression

- You can use the lmer() function with the family="binomial" option
- E.g., introducing subjects as random effects:

  > sex_age_pres_prod.lmer<-
    lmer(choice~sex+age+presentation+
    product+(1|subj),family="binomial")

- You can replicate most of the analyses illustrated above with this model
A warning
- Confusingly, the fitted() function applied to a glm object returns probabilities, whereas applied to an lmer object it returns log odds ratios (values on the logit scale, as the back-transformation below assumes)
- Thus, to measure the error rate you’ll have to do something like:

  > probs<-exp(fitted(sex_age_pres_prod.lmer)) /
    (1 +exp(fitted(sex_age_pres_prod.lmer)))
  > sum((probs>.5 & choice=="N") |
    (probs<.5 & choice=="Y")) /
    length(choice)

- NB: Apparently, hacked.plot.logistic.fit.fnc dies when applied to an lmer object, on some versions of R (or lme4, or whatever)
- Surprisingly, the fit of the model with a random subject effect is worse than that of the model with fixed effects only
Outline

Logistic regression

Logistic regression in R
Preparing the data and fitting the model
Practice
Practice time

- Go back to Navarrete et al.’s picture naming data (cwcc.txt)
- Recall that the response can be a time (naming latency) in milliseconds, but also an error
- Are the errors randomly distributed, or can they be predicted from the same factors that determine latencies?
- We found a negative effect of repetition and a positive effect of position-within-category on naming latencies – do these factors also lead to fewer and more errors, respectively?
Practice time

- Construct a binary variable from the responses (error vs. any response)
- Use sapply(), and make sure that R understands this is a categorical variable with as.factor()
- Add the resulting variable to your data frame, e.g., if you called the data frame d and the binary response variable temp, do:

  d$errorresp<-temp

  This will make your life easier later on
- Analyze this new dependent variable using logistic regression (both with and without random effects); a possible starting point is sketched below
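
A possible starting point, sketched under assumptions: the column names response, repetition and ordpos are guesses based on the description of the data, and the coding of error trials as the string "error" is also an assumption, so adapt both to the actual cwcc.txt fields:

  d <- read.delim("cwcc.txt")                  # assuming a tab-delimited file
  # binary variable: error vs. any (timed) response
  temp <- as.factor(sapply(as.character(d$response),
                           function(r) if (r == "error") "error" else "response",
                           USE.NAMES = FALSE))
  d$errorresp <- temp
  summary(d$errorresp)
  # then, e.g.:
  # err.glm <- glm(errorresp ~ repetition + ordpos, data = d, family = "binomial")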
